├── .gitignore
├── LICENSE
├── README.md
├── drl_implementation
├── __init__.py
├── agent
│ ├── __init__.py
│ ├── agent_base.py
│ ├── continuous_action
│ │ ├── __init__.py
│ │ ├── ddpg.py
│ │ ├── ddpg_goal_conditioned.py
│ │ ├── distributional_ddpg.py
│ │ ├── sac.py
│ │ ├── sac_goal_conditioned.py
│ │ ├── sac_parameterised_action_goal_conditioned.py
│ │ ├── sac_pointnet.py
│ │ └── td3.py
│ ├── distributed_agent_base.py
│ └── utils
│ │ ├── __init__.py
│ │ ├── env_wrapper.py
│ │ ├── exploration_strategy.py
│ │ ├── networks_conv.py
│ │ ├── networks_mlp.py
│ │ ├── networks_pointnet.py
│ │ ├── normalizer.py
│ │ ├── plot.py
│ │ ├── pointnet_2_utils.py
│ │ ├── pointnet_utils.py
│ │ ├── replay_buffer.py
│ │ └── segment_tree.py
└── examples
│ ├── KukaPushPHER.py
│ ├── PendulumDDPG.py
│ └── __init__.py
├── requirements.txt
├── setup.py
├── src
├── README.md
├── figs.png
├── pendulum.gif
└── push.gif
└── tests
└── test.py
/.gitignore:
--------------------------------------------------------------------------------
1 | # Default ignored files
2 | .idea
3 | __pycache__/
4 | drl_implementation/agent/__pycache__/
5 | drl_implementation/agent/utils/__pycache__/
6 | exp_multi_goal/__pycache__/
7 | build/
8 | drl_implementation.egg-info/
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | MIT License
2 |
3 | Copyright (c) 2019 XT_Yang
4 |
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 |
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 |
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | ## DRL_Implementation
2 | ##### Current status: minimal updates
3 |
4 | ### Introduction
5 | - This repository is a PyTorch-based implementation of modern DRL algorithms, designed to be reusable across as many
6 | Gym-like training environments as possible
7 | - The package is mainly for my personal use; however, feel free to use it as you like.
8 | - It is recommended to use the [released version](https://github.com/IanYangChina/DRL_Implementation/tree/v2.0)
9 | - Understand more with the [Wiki!](https://github.com/IanYangChina/DRL_Implementation/wiki)
10 | - Tested environments: Gym, Pybullet-gym, Pybullet-multigoal-gym
11 | - *My priority is on continuous action algorithms as I'm working on robotics*
12 |
13 | #### Installation
14 | ```
15 | git clone https://github.com/IanYangChina/DRL_Implementation.git
16 | cd DRL_Implementation
17 | python -m pip install -r requirements.txt
18 | python -m pip install .
19 | ```
20 | [Click here for example codes](https://github.com/IanYangChina/DRL_Implementation/tree/master/drl_implementation/examples).
21 | To run them you will need to install Gym, Pybullet, or pybullet-multigoal-gym (see the environment installation links below).
22 | For more use cases, have a look at the [drl_imp_test repo](https://github.com/IanYangChina/drl_imp_test)\
23 | From the project root, run `python drl_implementation/examples/$SCRIPT_NAME.py`
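
A minimal usage sketch with hypothetical hyperparameter values (see `examples/PendulumDDPG.py` for a complete, tested configuration; the agents assume the older Gym API with `env.seed()` and 4-tuple `env.step()` returns):

```
import gym
from drl_implementation import DDPG  # agent classes can also be looked up via drl_implementation.agents['ddpg']

env = gym.make('Pendulum-v0')
algo_params = {
    # hypothetical values -- these keys are the ones read by agent_base.Agent and ddpg.DDPG
    'prioritised': False, 'memory_capacity': int(1e6),
    'actor_learning_rate': 1e-3, 'critic_learning_rate': 1e-3, 'Q_weight_decay': 0.0,
    'batch_size': 128, 'optimization_steps': 1, 'discount_factor': 0.99, 'tau': 0.005,
    'discard_time_limit': True, 'observation_normalization': False, 'update_interval': 1,
    'warmup_step': 1000, 'training_episodes': 100, 'testing_gap': 10, 'testing_episodes': 10,
    'saving_gap': 50,
}
agent = DDPG(algo_params, env, path='./pendulum_ddpg_run', seed=0)
agent.run(test=False)
```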
24 |
25 | ##### State-based
26 | - [X] Distributional DDPG, Continuous
27 | - [X] DDPG - Deterministic, Continuous
28 | - [X] TD3 - Deterministic, Continuous
29 | - [X] SAC (Adaptive Temperature) - Stochastic, Continuous
30 |
31 | ##### Replay buffers
32 | - [X] Hindsight
33 | - [X] Prioritised
34 |
35 | ##### Tested Environments
36 | - [X] [Pybullet Gym (Continuous)](https://github.com/bulletphysics/bullet3)
37 | - [X] [OpenAI Gym Mujoco Robotics Multigoal Environment (Continuous)](https://openai.com/blog/ingredients-for-robotics-research/)
38 | - [X] [Pybullet Multigoal Gym](https://github.com/IanYangChina/pybullet_multigoal_gym) (OpenAI Robotics
39 | Multigoal Pybullet Migration) (Continuous)
40 |
41 | ##### Some result figures
42 | (Result figures and demo GIFs can be found in the `src` folder: `figs.png`, `pendulum.gif`, `push.gif`.)
43 |
44 |
45 |
46 | #### Reference Papers: Algorithm
47 | * [DQN](https://www.nature.com/articles/nature14236?wm=book_wap_0005)
48 | * [DoubleDQN](https://www.aaai.org/ocs/index.php/AAAI/AAAI16/paper/viewPaper/12389)
49 | * [DDPG](https://arxiv.org/abs/1509.02971)
50 | * [TD3](https://arxiv.org/pdf/1802.09477.pdf)
51 | * [SAC (Adaptive Temperature)](https://arxiv.org/pdf/1812.05905.pdf)
52 | * [PER](https://arxiv.org/abs/1511.05952)
53 | * [HER](http://papers.nips.cc/paper/7090-hindsight-experience-replay)
54 |
55 | #### Reference Papers: Implementation Matters
56 | * [Time limit](https://arxiv.org/abs/1712.00378)
57 | * [SOTA PPO Hyperparameters (many applicable to other algorithms)](https://arxiv.org/abs/2006.05990)
58 | * [SAC Temperature Auto-tuning](https://arxiv.org/abs/1812.05905)
59 |
--------------------------------------------------------------------------------
/drl_implementation/__init__.py:
--------------------------------------------------------------------------------
1 | from .agent.continuous_action.ddpg import DDPG
2 | from .agent.continuous_action.ddpg_goal_conditioned import GoalConditionedDDPG
3 | from .agent.continuous_action.sac import SAC
4 | from .agent.continuous_action.sac_goal_conditioned import GoalConditionedSAC
5 | from .agent.continuous_action.td3 import TD3
6 | from .agent.continuous_action.distributional_ddpg import DistributionalDDPG
7 | from .agent.continuous_action.sac_parameterised_action_goal_conditioned import GPASAC
8 |
9 | agents = {
10 | 'ddpg': DDPG,
11 | 'ddpg_her': GoalConditionedDDPG,
12 | 'sac': SAC,
13 | 'sac_her': GoalConditionedSAC,
14 | 'td3': TD3,
15 | 'distri_ddpg': DistributionalDDPG,
16 | }
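# Usage sketch (hypothetical names): look up an agent class by key and construct it with an
# algo_params dict, an environment and a save path, e.g.
#     agent = agents['sac'](algo_params, env, path='./runs/sac', seed=0)
# Note: GPASAC is imported above but not registered in this dict.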
17 |
--------------------------------------------------------------------------------
/drl_implementation/agent/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/IanYangChina/DRL_Implementation/38812c9647e4bec8359908be444dff19b90257d5/drl_implementation/agent/__init__.py
--------------------------------------------------------------------------------
/drl_implementation/agent/agent_base.py:
--------------------------------------------------------------------------------
1 | import os
2 | import logging as info_logging
3 | import torch as T
4 | import numpy as np
5 | import json
6 | import subprocess as sp
7 | from torch.utils.tensorboard import SummaryWriter
8 | from .utils.plot import smoothed_plot
9 | from .utils.replay_buffer import make_buffer
10 | from .utils.normalizer import Normalizer
11 |
12 |
13 | def mkdir(paths):
14 | for path in paths:
15 | os.makedirs(path, exist_ok=True)
16 |
17 |
18 | def get_gpu_memory():
19 | command = "nvidia-smi --query-gpu=memory.free --format=csv"
20 | memory_free_info = sp.check_output(command.split()).decode('ascii').split('\n')[:-1][1:]
21 | memory_free_values = [int(x.split()[0]) for x in memory_free_info]
22 | return memory_free_values
23 |
24 |
25 | def reset_logging(logging_to_reset):
26 | loggers = [logging_to_reset.getLogger(name) for name in logging_to_reset.root.manager.loggerDict]
27 | loggers.append(logging_to_reset.getLogger())
28 | for logger in loggers:
29 | handlers = logger.handlers[:]
30 | for handler in handlers:
31 | logger.removeHandler(handler)
32 | handler.close()
33 | logger.setLevel(logging_to_reset.NOTSET)
34 | logger.propagate = True
35 |
36 |
37 | class Agent(object):
38 | def __init__(self,
39 | algo_params, logging=None, create_logger=False,
40 | transition_tuple=None,
41 | non_flat_obs=False, action_type='continuous',
42 | goal_conditioned=False, store_goal_ind=False, training_mode='episode_based',
43 | path=None, log_dir_suffix=None, seed=-1):
44 | """
45 | Parameters
46 | ----------
47 | algo_params : dict
48 | a dictionary of parameters
49 | transition_tuple : collections.namedtuple
50 | a python namedtuple for storing, managing and sampling experiences, see .utils.replay_buffer
51 | non_flat_obs : bool
52 | whether the observations are 1D or nD
53 | action_type : str
54 | one of 'discrete', 'continuous' or 'hybrid'
55 | goal_conditioned : bool
56 | whether the agent uses a goal-conditioned policy
57 | training_mode : str
58 | either 'episode_based' or 'step_based'
59 | path : str
60 | a directory to save files
61 | seed : int
62 | a random seed
63 | """
64 | # torch device
65 | self.device = T.device("cuda" if T.cuda.is_available() else "cpu")
66 | if 'cuda_device_id' in algo_params.keys():
67 | self.device = T.device("cuda:%i" % algo_params['cuda_device_id'])
68 | # path & seeding
69 | T.manual_seed(seed)
70 | T.cuda.manual_seed_all(seed) # this has no effect if cuda is not available
71 |
72 | # create a random number generator and seed it
73 | self.rng = np.random.default_rng(seed=seed)
74 | assert path is not None, 'please specify a project path to save files'
75 | self.path = path
76 | # path to save neural network check point
77 | self.ckpt_path = os.path.join(path, 'ckpts')
78 | # path to save statistics
79 | self.data_path = os.path.join(path, 'data')
80 | # create directories if not exist
81 | mkdir([self.path, self.ckpt_path, self.data_path])
82 | if log_dir_suffix is not None:
83 | comment = '-'+log_dir_suffix
84 | else:
85 | comment = ''
86 | self.create_logger = create_logger
87 | if self.create_logger:
88 | self.logger = SummaryWriter(log_dir=self.data_path, comment=comment)
89 | self.logging = logging
90 | if self.logging is None:
91 | reset_logging(info_logging)
92 | log_file_name = os.path.join(self.data_path, 'optimisation.log')
93 | if os.path.isfile(log_file_name):
94 | filemode = "a"
95 | else:
96 | filemode = "w"
97 | info_logging.basicConfig(level=info_logging.NOTSET, filemode=filemode,
98 | filename=log_file_name,
99 | format="%(asctime)s %(levelname)s %(message)s")
100 | self.logging = info_logging
101 |
102 | # non-goal-conditioned args
103 | self.non_flat_obs = non_flat_obs
104 | self.action_type = action_type
105 | if self.non_flat_obs:
106 | self.state_dim = 0
107 | self.state_shape = algo_params['state_shape']
108 | else:
109 | self.state_dim = algo_params['state_dim']
110 | if self.action_type == 'hybrid':
111 | self.discrete_action_dim = algo_params['discrete_action_dim']
112 | self.continuous_action_dim = algo_params['continuous_action_dim']
113 | self.continuous_action_max = algo_params['continuous_action_max']
114 | self.continuous_action_scaling = algo_params['continuous_action_scaling']
115 | else:
116 | self.action_dim = algo_params['action_dim']
117 | if self.action_type == 'continuous':
118 | self.action_max = algo_params['action_max']
119 | self.action_scaling = algo_params['action_scaling']
120 |
121 | # goal-conditioned args & buffers
122 | self.goal_conditioned = goal_conditioned
123 | # prioritised replay
124 | self.prioritised = algo_params['prioritised']
125 |
126 | if self.goal_conditioned:
127 | if self.non_flat_obs:
128 | self.goal_dim = 0
129 | self.goal_shape = algo_params['goal_shape']
130 | else:
131 | self.goal_dim = algo_params['goal_dim']
132 | self.hindsight = algo_params['hindsight']
133 | try:
134 | goal_distance_threshold = self.env.env.distance_threshold
135 | except AttributeError:
136 | goal_distance_threshold = self.env.distance_threshold
137 |
138 | goal_conditioned_reward_func = None
139 | try:
140 | if self.env.env.goal_conditioned_reward_function is not None:
141 | goal_conditioned_reward_func = self.env.env.goal_conditioned_reward_function
142 | except AttributeError:
143 | if self.env.goal_conditioned_reward_function is not None:
144 | goal_conditioned_reward_func = self.env.goal_conditioned_reward_function
145 |
146 | try:
147 | her_sample_strategy = algo_params['her_sampling_strategy']
148 | except KeyError:
149 | her_sample_strategy = 'future'
150 |
151 | try:
152 | num_sampled_goal = algo_params['num_sampled_goal']
153 | except KeyError:
154 | num_sampled_goal = 4
155 |
156 | self.buffer = make_buffer(mem_capacity=algo_params['memory_capacity'],
157 | transition_tuple=transition_tuple, prioritised=self.prioritised,
158 | seed=seed, rng=self.rng,
159 | goal_conditioned=True, keep_episode=self.hindsight,
160 | store_goal_ind=store_goal_ind,
161 | sampling_strategy=her_sample_strategy,
162 | num_sampled_goal=num_sampled_goal,
163 | terminal_on_achieved=algo_params['terminate_on_achieve'],
164 | goal_distance_threshold=goal_distance_threshold,
165 | goal_conditioned_reward_func=goal_conditioned_reward_func)
166 | else:
167 | self.goal_dim = 0
168 | self.buffer = make_buffer(mem_capacity=algo_params['memory_capacity'],
169 | transition_tuple=transition_tuple, prioritised=self.prioritised,
170 | seed=seed, rng=self.rng,
171 | goal_conditioned=False)
172 |
173 | # common args
174 | if not self.non_flat_obs:
175 | self.observation_normalization = algo_params['observation_normalization']
176 | # if not using obs normalization, the normalizer is just a scale multiplier, returns inputs*scale
177 | self.normalizer = Normalizer(self.state_dim+self.goal_dim,
178 | algo_params['init_input_means'], algo_params['init_input_vars'],
179 | activated=self.observation_normalization)
180 | try:
181 | self.update_interval = algo_params['update_interval']
182 | except KeyError:
183 | self.update_interval = 1
184 | self.actor_learning_rate = algo_params['actor_learning_rate']
185 | self.critic_learning_rate = algo_params['critic_learning_rate']
186 | self.batch_size = algo_params['batch_size']
187 | self.optimizer_steps = algo_params['optimization_steps']
188 | self.gamma = algo_params['discount_factor']
189 | self.discard_time_limit = algo_params['discard_time_limit']
190 | self.tau = algo_params['tau']
191 | self.optim_step_count = 0
192 |
193 | assert training_mode in ['episode_based', 'step_based']
194 | self.training_mode = training_mode
195 | self.env_step_count = self.total_env_step_count = 0  # the step-based agents (DDPG, SAC, ...) increment env_step_count
196 | self.total_env_episode_count = 0
197 |
198 | # network dict is filled in each specific agent
199 | self.network_dict = {}
200 | self.network_keys_to_save = None
201 |
202 | # algorithm-specific statistics are defined in each agent sub-class
203 | self.statistic_dict = {
204 | # use lowercase characters
205 | 'actor_loss': [],
206 | 'critic_loss': [],
207 | }
208 | self.get_gpu_memory = get_gpu_memory
209 |
210 | def run(self, render=False, test=False, load_network_ep=None, sleep=0):
211 | raise NotImplementedError
212 |
213 | def _interact(self, render=False, test=False, sleep=0):
214 | raise NotImplementedError
215 |
216 | def _select_action(self, obs, test=False):
217 | raise NotImplementedError
218 |
219 | def _learn(self, steps=None):
220 | raise NotImplementedError
221 |
222 | def _remember(self, *args, new_episode=False):
223 | if self.goal_conditioned:
224 | self.buffer.new_episode = new_episode
225 | self.buffer.store_experience(*args)
226 | else:
227 | self.buffer.store_experience(*args)
228 |
229 | def _soft_update(self, source, target, tau=None):
230 | if tau is None:
231 | tau = self.tau
232 |
233 | for target_param, param in zip(target.parameters(), source.parameters()):
234 | target_param.data.copy_(
235 | target_param.data * (1.0 - tau) + param.data * tau
236 | )
237 |
238 | def _save_network(self, keys=None, ep=None, step=None):
239 | if ep is None:
240 | ep = ''
241 | else:
242 | ep = '_ep'+str(ep)
243 | if step is None:
244 | step = ''
245 | else:
246 | step = '_step'+str(step)
247 | if keys is None:
248 | keys = self.network_keys_to_save
249 | assert keys is not None
250 | for key in keys:
251 | T.save(self.network_dict[key].state_dict(), self.ckpt_path+'/ckpt_'+key+ep+step+'.pt')
252 |
253 | def _load_network(self, keys=None, ep=None, step=None):
254 | if (not self.non_flat_obs) and self.observation_normalization:
255 | self.normalizer.history_mean = np.load(os.path.join(self.data_path, 'input_means.npy'))
256 | self.normalizer.history_var = np.load(os.path.join(self.data_path, 'input_vars.npy'))
257 | if ep is None:
258 | ep = ''
259 | else:
260 | ep = '_ep'+str(ep)
261 | if step is None:
262 | step = ''
263 | else:
264 | step = '_step'+str(step)
265 | if keys is None:
266 | keys = self.network_keys_to_save
267 | assert keys is not None
268 | for key in keys:
269 | self.network_dict[key].load_state_dict(T.load(self.ckpt_path+'/ckpt_'+key+ep+step+'.pt', map_location=self.device))
270 |
271 | def _save_statistics(self, keys=None):
272 | if (not self.non_flat_obs) and self.observation_normalization:
273 | np.save(os.path.join(self.data_path, 'input_means'), self.normalizer.history_mean)
274 | np.save(os.path.join(self.data_path, 'input_vars'), self.normalizer.history_var)
275 | if keys is None:
276 | keys = self.statistic_dict.keys()
277 | for key in keys:
278 | if len(self.statistic_dict[key]) == 0:
279 | continue
280 | # convert everything to a list before saving via json
281 | if T.is_tensor(self.statistic_dict[key][0]):
282 | self.statistic_dict[key] = T.as_tensor(self.statistic_dict[key], device=self.device).cpu().numpy().tolist()
283 | else:
284 | self.statistic_dict[key] = np.array(self.statistic_dict[key]).tolist()
285 | json.dump(self.statistic_dict[key], open(os.path.join(self.data_path, key+'.json'), 'w'))
286 |
287 | def _plot_statistics(self, keys=None, x_labels=None, y_labels=None, window=5, save_to_file=True):
288 | if save_to_file:
289 | self._save_statistics(keys=keys)
290 | if y_labels is None:
291 | y_labels = {}
292 | for key in list(self.statistic_dict.keys()):
293 | if key not in y_labels.keys():
294 | if 'loss' in key:
295 | label = 'Loss'
296 | elif 'return' in key:
297 | label = 'Return'
298 | elif 'success' in key:
299 | label = 'Success'
300 | else:
301 | label = key
302 | y_labels.update({key: label})
303 |
304 | if x_labels is None:
305 | x_labels = {}
306 | for key in list(self.statistic_dict.keys()):
307 | if key not in x_labels.keys():
308 | if ('loss' in key) or ('alpha' in key) or ('entropy' in key) or ('step' in key):
309 | label = 'Optimization step'
310 | elif 'cycle' in key:
311 | label = 'Cycle'
312 | elif 'epoch' in key:
313 | label = 'Epoch'
314 | else:
315 | label = 'Episode'
316 | x_labels.update({key: label})
317 |
318 | if keys is None:
319 | for key in list(self.statistic_dict.keys()):
320 | if len(self.statistic_dict[key]) == 0:
321 | continue
322 | smoothed_plot(os.path.join(self.path, key+'.png'), self.statistic_dict[key],
323 | x_label=x_labels[key], y_label=y_labels[key], window=window)
324 | else:
325 | for key in keys:
326 | smoothed_plot(os.path.join(self.path, key+'.png'), self.statistic_dict[key],
327 | x_label=x_labels[key], y_label=y_labels[key], window=window)
328 |
329 |
--------------------------------------------------------------------------------
/drl_implementation/agent/continuous_action/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/IanYangChina/DRL_Implementation/38812c9647e4bec8359908be444dff19b90257d5/drl_implementation/agent/continuous_action/__init__.py
--------------------------------------------------------------------------------
/drl_implementation/agent/continuous_action/ddpg.py:
--------------------------------------------------------------------------------
1 | import time
2 | import numpy as np
3 | import torch as T
4 | import torch.nn.functional as F
5 | from torch.optim.adam import Adam
6 | from ..utils.networks_mlp import Actor, Critic
7 | from ..agent_base import Agent
8 | from ..utils.exploration_strategy import OUNoise, GaussianNoise
9 |
10 |
11 | class DDPG(Agent):
12 | def __init__(self, algo_params, env, transition_tuple=None, path=None, seed=-1):
13 | # environment
14 | self.env = env
15 | self.env.seed(seed)
16 | obs = self.env.reset()
17 | algo_params.update({'state_dim': obs.shape[0],
18 | 'action_dim': self.env.action_space.shape[0],
19 | 'action_max': self.env.action_space.high,
20 | 'action_scaling': self.env.action_space.high[0],
21 | 'init_input_means': None,
22 | 'init_input_vars': None
23 | })
24 | # training args
25 | self.training_episodes = algo_params['training_episodes']
26 | self.testing_gap = algo_params['testing_gap']
27 | self.testing_episodes = algo_params['testing_episodes']
28 | self.saving_gap = algo_params['saving_gap']
29 |
30 | super(DDPG, self).__init__(algo_params,
31 | transition_tuple=transition_tuple,
32 | goal_conditioned=False,
33 | path=path,
34 | seed=seed)
35 | # torch
36 | self.network_dict.update({
37 | 'actor': Actor(self.state_dim, self.action_dim).to(self.device),
38 | 'actor_target': Actor(self.state_dim, self.action_dim).to(self.device),
39 | 'critic': Critic(self.state_dim + self.action_dim, 1).to(self.device),
40 | 'critic_target': Critic(self.state_dim + self.action_dim, 1).to(self.device)
41 | })
42 | self.network_keys_to_save = ['actor_target', 'critic_target']
43 | self.actor_optimizer = Adam(self.network_dict['actor'].parameters(), lr=self.actor_learning_rate)
44 | self._soft_update(self.network_dict['actor'], self.network_dict['actor_target'], tau=1)
45 | self.critic_optimizer = Adam(self.network_dict['critic'].parameters(), lr=self.critic_learning_rate, weight_decay=algo_params['Q_weight_decay'])
46 | self._soft_update(self.network_dict['critic'], self.network_dict['critic_target'], tau=1)
47 | # behavioural policy args (exploration)
48 | self.exploration_strategy = GaussianNoise(self.action_dim, self.action_max, sigma=0.1)
49 | # training args
50 | self.warmup_step = algo_params['warmup_step']
51 | # statistic dict
52 | self.statistic_dict.update({
53 | 'episode_return': [],
54 | 'episode_test_return': []
55 | })
56 |
57 | def run(self, test=False, render=False, load_network_ep=None, sleep=0):
58 | if test:
59 | num_episode = self.testing_episodes
60 | if load_network_ep is not None:
61 | print("Loading network parameters...")
62 | self._load_network(ep=load_network_ep)
63 | print("Start testing...")
64 | else:
65 | num_episode = self.training_episodes
66 | print("Start training...")
67 |
68 | for ep in range(num_episode):
69 | ep_return = self._interact(render, test, sleep=sleep)
70 | self.statistic_dict['episode_return'].append(ep_return)
71 | print("Episode %i" % ep, "return %0.1f" % ep_return)
72 |
73 | if (ep % self.testing_gap == 0) and (ep != 0) and (not test):
74 | ep_test_return = []
75 | for test_ep in range(self.testing_episodes):
76 | ep_test_return.append(self._interact(render, test=True))
77 | self.statistic_dict['episode_test_return'].append(sum(ep_test_return)/self.testing_episodes)
78 | print("Episode %i" % ep, "test return %0.1f" % (sum(ep_test_return)/self.testing_episodes))
79 |
80 | if (ep % self.saving_gap == 0) and (ep != 0) and (not test):
81 | self._save_network(ep=ep)
82 |
83 | if not test:
84 | print("Finished training")
85 | print("Saving statistics...")
86 | self._plot_statistics(save_to_file=True)
87 | else:
88 | print("Finished testing")
89 |
90 | def _interact(self, render=False, test=False, sleep=0):
91 | done = False
92 | obs = self.env.reset()
93 | ep_return = 0
94 | # self.exploration_strategy.reset()
95 | # start a new episode
96 | while not done:
97 | if render:
98 | self.env.render()
99 | if self.env_step_count < self.warmup_step:
100 | action = self.env.action_space.sample()
101 | else:
102 | action = self._select_action(obs, test=test)
103 | new_obs, reward, done, info = self.env.step(action)
104 | time.sleep(sleep)
105 | ep_return += reward
106 | if not test:
107 | self._remember(obs, action, new_obs, reward, 1 - int(done))
108 | if self.observation_normalization:
109 | self.normalizer.store_history(new_obs)
110 | self.normalizer.update_mean()
111 | if (self.env_step_count % self.update_interval == 0) and (self.env_step_count > self.warmup_step):
112 | self._learn()
113 | self.env_step_count += 1
114 | obs = new_obs
115 | return ep_return
116 |
117 | def _select_action(self, obs, test=False):
118 | obs = self.normalizer(obs)
119 | with T.no_grad():
120 | inputs = T.as_tensor(obs, dtype=T.float, device=self.device)
121 | action = self.network_dict['actor_target'](inputs).cpu().detach().numpy()
122 | if test:
123 | # evaluate
124 | return np.clip(action, -self.action_max, self.action_max)
125 | else:
126 | # explore
127 | return self.exploration_strategy(action)
128 |
129 | def _learn(self, steps=None):
130 | if len(self.buffer) < self.batch_size:
131 | return
132 | if steps is None:
133 | steps = self.optimizer_steps
134 |
135 | for i in range(steps):
136 | if self.prioritised:
137 | batch, weights, inds = self.buffer.sample(self.batch_size)
138 | weights = T.as_tensor(weights, device=self.device).view(self.batch_size, 1)
139 | else:
140 | batch = self.buffer.sample(self.batch_size)
141 | weights = T.ones(size=(self.batch_size, 1), device=self.device)
142 | inds = None
143 |
144 | actor_inputs = np.array(self.normalizer(batch.state))
145 | actor_inputs = T.as_tensor(actor_inputs, dtype=T.float32, device=self.device)
146 | actions = T.as_tensor(np.array(batch.action), dtype=T.float32, device=self.device)
147 | critic_inputs = T.cat((actor_inputs, actions), dim=1)
148 | actor_inputs_ = np.array(self.normalizer(batch.next_state))
149 | actor_inputs_ = T.as_tensor(actor_inputs_, dtype=T.float32, device=self.device)
150 | rewards = T.as_tensor(np.array(batch.reward), dtype=T.float32, device=self.device).unsqueeze(1)
151 | done = T.as_tensor(np.array(batch.done), dtype=T.float32, device=self.device).unsqueeze(1)
152 |
153 | if self.discard_time_limit:
154 | done = done * 0 + 1
155 |
156 | with T.no_grad():
157 | actions_ = self.network_dict['actor_target'](actor_inputs_)
158 | critic_inputs_ = T.cat((actor_inputs_, actions_), dim=1)
159 | value_ = self.network_dict['critic_target'](critic_inputs_)
160 | value_target = rewards + done * self.gamma * value_
161 |
162 | self.critic_optimizer.zero_grad()
163 | value_estimate = self.network_dict['critic'](critic_inputs)
164 | critic_loss = F.mse_loss(value_estimate, value_target, reduction='none')
165 | (critic_loss * weights).mean().backward()
166 | self.critic_optimizer.step()
167 |
168 | if self.prioritised:
169 | assert inds is not None
170 | self.buffer.update_priority(inds, np.abs(critic_loss.cpu().detach().numpy()))
171 |
172 | self.actor_optimizer.zero_grad()
173 | new_actions = self.network_dict['actor'](actor_inputs)
174 | critic_eval_inputs = T.cat((actor_inputs, new_actions), dim=1).to(self.device)
175 | actor_loss = -self.network_dict['critic'](critic_eval_inputs).mean()
176 | actor_loss.backward()
177 | self.actor_optimizer.step()
178 |
179 | self._soft_update(self.network_dict['actor'], self.network_dict['actor_target'])
180 | self._soft_update(self.network_dict['critic'], self.network_dict['critic_target'])
181 |
182 | self.statistic_dict['critic_loss'].append(critic_loss.detach().mean())
183 | self.statistic_dict['actor_loss'].append(actor_loss.detach().mean())
184 |
--------------------------------------------------------------------------------
/drl_implementation/agent/continuous_action/ddpg_goal_conditioned.py:
--------------------------------------------------------------------------------
1 | import time
2 | import numpy as np
3 | import torch as T
4 | import torch.nn.functional as F
5 | from torch.optim.adam import Adam
6 | from ..utils.networks_mlp import Actor, Critic
7 | from ..agent_base import Agent
8 | from ..utils.exploration_strategy import EGreedyGaussian
9 |
10 |
11 | class GoalConditionedDDPG(Agent):
12 | def __init__(self, algo_params, env, transition_tuple=None, path=None, seed=-1):
13 | # environment
14 | self.env = env
15 | self.env.seed(seed)
16 | obs = self.env.reset()
17 | algo_params.update({'state_dim': obs['observation'].shape[0],
18 | 'goal_dim': obs['desired_goal'].shape[0],
19 | 'action_dim': self.env.action_space.shape[0],
20 | 'action_max': self.env.action_space.high,
21 | 'action_scaling': self.env.action_space.high[0],
22 | 'init_input_means': None,
23 | 'init_input_vars': None
24 | })
25 | self.curriculum = False
26 | if 'curriculum' in algo_params.keys():
27 | self.curriculum = algo_params['curriculum']
28 | # training args
29 | self.training_epochs = algo_params['training_epochs']
30 | self.training_cycles = algo_params['training_cycles']
31 | self.training_episodes = algo_params['training_episodes']
32 | self.testing_gap = algo_params['testing_gap']
33 | self.testing_episodes = algo_params['testing_episodes']
34 | self.saving_gap = algo_params['saving_gap']
35 |
36 | super(GoalConditionedDDPG, self).__init__(algo_params,
37 | transition_tuple=transition_tuple,
38 | goal_conditioned=True,
39 | path=path,
40 | seed=seed)
41 | # torch
42 | self.network_dict.update({
43 | 'actor': Actor(self.state_dim + self.goal_dim, self.action_dim, action_scaling=self.action_scaling).to(
44 | self.device),
45 | 'actor_target': Actor(self.state_dim + self.goal_dim, self.action_dim,
46 | action_scaling=self.action_scaling).to(self.device),
47 | 'critic_1': Critic(self.state_dim + self.goal_dim + self.action_dim, 1).to(self.device),
48 | 'critic_1_target': Critic(self.state_dim + self.goal_dim + self.action_dim, 1).to(self.device),
49 | 'critic_2': Critic(self.state_dim + self.goal_dim + self.action_dim, 1).to(self.device),
50 | 'critic_2_target': Critic(self.state_dim + self.goal_dim + self.action_dim, 1).to(self.device),
51 | })
52 | self.network_keys_to_save = ['actor_target', 'critic_1_target', 'critic_2_target']
53 | self.actor_optimizer = Adam(self.network_dict['actor'].parameters(), lr=self.actor_learning_rate)
54 | self._soft_update(self.network_dict['actor'], self.network_dict['actor_target'], tau=1)
55 | self.critic_1_optimizer = Adam(self.network_dict['critic_1'].parameters(), lr=self.critic_learning_rate,
56 | weight_decay=algo_params['Q_weight_decay'])
57 | self._soft_update(self.network_dict['critic_1'], self.network_dict['critic_1_target'], tau=1)
58 | self.critic_2_optimizer = Adam(self.network_dict['critic_2'].parameters(), lr=self.critic_learning_rate,
59 | weight_decay=algo_params['Q_weight_decay'])
60 | self._soft_update(self.network_dict['critic_2'], self.network_dict['critic_2_target'], tau=1)
61 | # behavioural policy args (exploration)
62 | # different from the original DDPG paper, the HER paper uses another exploration strategy
63 | # paper: https://papers.nips.cc/paper/2017/hash/453fadbd8a1a3af50a9df4df899537b5-Abstract.html
64 | self.exploration_strategy = EGreedyGaussian(action_dim=self.action_dim,
65 | action_max=self.action_max,
66 | chance=algo_params['random_action_chance'],
67 | sigma=algo_params['noise_deviation'], rng=self.rng)
68 | self.noise_deviation = algo_params['noise_deviation']
69 | # training args
70 | self.clip_value = algo_params['clip_value']
71 | # statistic dict
72 | self.statistic_dict.update({
73 | 'cycle_return': [],
74 | 'cycle_success_rate': [],
75 | 'epoch_test_return': [],
76 | 'epoch_test_success_rate': []
77 | })
78 |
79 | def run(self, test=False, render=False, load_network_ep=None, sleep=0):
80 | # training setup uses a hierarchy of Epoch, Cycle and Episode
81 | # following the HER paper: https://papers.nips.cc/paper/2017/hash/453fadbd8a1a3af50a9df4df899537b5-Abstract.html
82 | if test:
83 | if load_network_ep is not None:
84 | print("Loading network parameters...")
85 | self._load_network(ep=load_network_ep)
86 | print("Start testing...")
87 | else:
88 | print("Start training...")
89 |
90 | for epo in range(self.training_epochs):
91 | if self.curriculum:
92 | self.env.activate_curriculum_update()
93 | for cyc in range(self.training_cycles):
94 | cycle_return = 0
95 | cycle_success = 0
96 | for ep in range(self.training_episodes):
97 | ep_return = self._interact(render, test, sleep=sleep)
98 | cycle_return += ep_return
99 | if ep_return > -self.env._max_episode_steps:
100 | cycle_success += 1
101 |
102 | self.statistic_dict['cycle_return'].append(cycle_return / self.training_episodes)
103 | self.statistic_dict['cycle_success_rate'].append(cycle_success / self.training_episodes)
104 | print("Epoch %i" % epo, "Cycle %i" % cyc,
105 | "avg. return %0.1f" % (cycle_return / self.training_episodes),
106 | "success rate %0.1f" % (cycle_success / self.training_episodes))
107 |
108 | if (epo % self.testing_gap == 0) and (epo != 0) and (not test):
109 | if self.curriculum:
110 | self.env.deactivate_curriculum_update()
111 | # testing during training
112 | test_return = 0
113 | test_success = 0
114 | for test_ep in range(self.testing_episodes):
115 | ep_test_return = self._interact(render, test=True)
116 | test_return += ep_test_return
117 | if ep_test_return > -self.env._max_episode_steps:
118 | test_success += 1
119 | self.statistic_dict['epoch_test_return'].append(test_return / self.testing_episodes)
120 | self.statistic_dict['epoch_test_success_rate'].append(test_success / self.testing_episodes)
121 | print("Epoch %i" % epo, "test avg. return %0.1f" % (test_return / self.testing_episodes))
122 |
123 | if (epo % self.saving_gap == 0) and (epo != 0) and (not test):
124 | self._save_network(ep=epo)
125 |
126 | if not test:
127 | print("Finished training")
128 | print("Saving statistics...")
129 | self._plot_statistics(
130 | x_labels={
131 | 'critic_loss': 'Optimization epoch (per ' + str(self.optimizer_steps) + ' steps)',
132 | 'actor_loss': 'Optimization epoch (per ' + str(self.optimizer_steps) + ' steps)'
133 | },
134 | save_to_file=True)
135 | else:
136 | print("Finished testing")
137 |
138 | def _interact(self, render=False, test=False, sleep=0):
139 | done = False
140 | obs = self.env.reset()
141 | if self.curriculum:
142 | self.env._max_episode_steps = self.env.env.curriculum_goal_step
143 | ep_return = 0
144 | new_episode = True
145 | # start a new episode
146 | while not done:
147 | if render:
148 | self.env.render()
149 | action = self._select_action(obs, test=test)
150 | new_obs, reward, done, info = self.env.step(action)
151 | time.sleep(sleep)
152 | ep_return += reward
153 | if not test:
154 | self._remember(obs['observation'], obs['desired_goal'], action,
155 | new_obs['observation'], new_obs['achieved_goal'], reward, 1 - int(done),
156 | new_episode=new_episode)
157 | if self.observation_normalization:
158 | self.normalizer.store_history(np.concatenate((new_obs['observation'],
159 | new_obs['achieved_goal']), axis=0))
160 | obs = new_obs
161 | new_episode = False
162 | if not test:
163 | self.normalizer.update_mean()
164 | self._learn()
165 | return ep_return
166 |
167 | def _select_action(self, obs, test=False):
168 | inputs = np.concatenate((obs['observation'], obs['desired_goal']), axis=0)
169 | inputs = self.normalizer(inputs)
170 | with T.no_grad():
171 | inputs = T.as_tensor(inputs, dtype=T.float, device=self.device)
172 | action = self.network_dict['actor_target'](inputs).cpu().detach().numpy()
173 | if test:
174 | # evaluate
175 | return np.clip(action, -self.action_max, self.action_max)
176 | else:
177 | # explore
178 | return self.exploration_strategy(action)
179 |
180 | def _learn(self, steps=None):
181 | if self.hindsight:
182 | self.buffer.modify_episodes()
183 | self.buffer.store_episodes()
184 | if len(self.buffer) < self.batch_size:
185 | return
186 | if steps is None:
187 | steps = self.optimizer_steps
188 |
189 | critic_losses = T.zeros(1, device=self.device)
190 | actor_losses = T.zeros(1, device=self.device)
191 | for i in range(steps):
192 | if self.prioritised:
193 | batch, weights, inds = self.buffer.sample(self.batch_size)
194 | weights = T.as_tensor(weights).view(self.batch_size, 1).to(self.device)
195 | else:
196 | batch = self.buffer.sample(self.batch_size)
197 | weights = T.ones(size=(self.batch_size, 1)).to(self.device)
198 | inds = None
199 |
200 | actor_inputs = np.concatenate((batch.state, batch.desired_goal), axis=1)
201 | actor_inputs = self.normalizer(actor_inputs)
202 | actor_inputs = T.as_tensor(actor_inputs, dtype=T.float32, device=self.device)
203 | actions = T.as_tensor(batch.action, dtype=T.float32, device=self.device)
204 | critic_inputs = T.cat((actor_inputs, actions), dim=1)
205 | actor_inputs_ = np.concatenate((batch.next_state, batch.desired_goal), axis=1)
206 | actor_inputs_ = self.normalizer(actor_inputs_)
207 | actor_inputs_ = T.as_tensor(actor_inputs_, dtype=T.float32, device=self.device)
208 | rewards = T.as_tensor(batch.reward, dtype=T.float32, device=self.device).unsqueeze(1)
209 | done = T.as_tensor(batch.done, dtype=T.float32, device=self.device).unsqueeze(1)
210 |
211 | if self.discard_time_limit:
212 | done = done * 0 + 1
213 |
214 | with T.no_grad():
215 | actions_ = self.network_dict['actor_target'](actor_inputs_)
216 | critic_inputs_ = T.cat((actor_inputs_, actions_), dim=1)
217 | value_1_ = self.network_dict['critic_1_target'](critic_inputs_)
218 | value_2_ = self.network_dict['critic_2_target'](critic_inputs_)
219 | value_ = T.min(value_1_, value_2_)
220 | value_target = rewards + done * self.gamma * value_
221 | value_target = T.clamp(value_target, -self.clip_value, 0.0)
222 |
223 | self.critic_1_optimizer.zero_grad()
224 | value_estimate_1 = self.network_dict['critic_1'](critic_inputs)
225 | critic_loss_1 = F.mse_loss(value_estimate_1, value_target.detach(), reduction='none')
226 | (critic_loss_1 * weights).mean().backward()
227 | self.critic_1_optimizer.step()
228 |
229 | if self.prioritised:
230 | assert inds is not None
231 | self.buffer.update_priority(inds, np.abs(critic_loss_1.cpu().detach().numpy()))
232 |
233 | self.critic_2_optimizer.zero_grad()
234 | value_estimate_2 = self.network_dict['critic_2'](critic_inputs)
235 | critic_loss_2 = F.mse_loss(value_estimate_2, value_target.detach(), reduction='none')
236 | (critic_loss_2 * weights).mean().backward()
237 | self.critic_2_optimizer.step()
238 |
239 | self.actor_optimizer.zero_grad()
240 | new_actions = self.network_dict['actor'](actor_inputs)
241 | critic_eval_inputs = T.cat((actor_inputs, new_actions), dim=1).to(self.device)
242 | new_values_1 = self.network_dict['critic_1'](critic_eval_inputs)
243 | new_values_2 = self.network_dict['critic_2'](critic_eval_inputs)
244 | actor_loss = -T.min(new_values_1, new_values_2).mean()
245 | actor_loss.backward()
246 | self.actor_optimizer.step()
247 |
248 | critic_losses += critic_loss_1.detach().mean()
249 | actor_losses += actor_loss.detach().mean()
250 |
251 | self._soft_update(self.network_dict['actor'], self.network_dict['actor_target'])
252 | self._soft_update(self.network_dict['critic_1'], self.network_dict['critic_1_target'])
253 | self._soft_update(self.network_dict['critic_2'], self.network_dict['critic_2_target'])
254 |
255 | self.statistic_dict['critic_loss'].append(critic_losses / steps)
256 | self.statistic_dict['actor_loss'].append(actor_losses / steps)
257 |
--------------------------------------------------------------------------------
/drl_implementation/agent/continuous_action/distributional_ddpg.py:
--------------------------------------------------------------------------------
1 | import time
2 | import numpy as np
3 | import torch as T
4 | import torch.nn.functional as F
5 | from torch.optim.adam import Adam
6 | from ..utils.networks_mlp import Actor, Critic
7 | from ..agent_base import Agent
8 | from ..utils.exploration_strategy import OUNoise, GaussianNoise
9 |
10 |
11 | class DistributionalDDPG(Agent):
12 | def __init__(self, algo_params, env, transition_tuple=None, path=None, seed=-1):
13 | # environment
14 | self.env = env
15 | self.env.seed(seed)
16 | obs = self.env.reset()
17 | algo_params.update({'state_dim': obs.shape[0],
18 | 'action_dim': self.env.action_space.shape[0],
19 | 'action_max': self.env.action_space.high,
20 | 'action_scaling': self.env.action_space.high[0],
21 | 'init_input_means': None,
22 | 'init_input_vars': None
23 | })
24 | # training args
25 | self.training_episodes = algo_params['training_episodes']
26 | self.testing_gap = algo_params['testing_gap']
27 | self.testing_episodes = algo_params['testing_episodes']
28 | self.saving_gap = algo_params['saving_gap']
29 |
30 | super(DistributionalDDPG, self).__init__(algo_params,
31 | transition_tuple=transition_tuple,
32 | goal_conditioned=False,
33 | path=path,
34 | seed=seed)
35 | # torch
36 | # categorical distribution atoms
37 | self.num_atoms = algo_params['num_atoms']
38 | self.value_max = algo_params['value_max']
39 | self.value_min = algo_params['value_min']
40 | self.delta_z = (self.value_max - self.value_min) / (self.num_atoms - 1)
41 | self.support = T.linspace(self.value_min, self.value_max, steps=self.num_atoms).to(self.device)
42 | # network
43 | self.network_dict.update({
44 | 'actor': Actor(self.state_dim, self.action_dim).to(self.device),
45 | 'actor_target': Actor(self.state_dim, self.action_dim).to(self.device),
46 | 'critic': Critic(self.state_dim + self.action_dim, self.num_atoms, softmax=True).to(self.device),
47 | 'critic_target': Critic(self.state_dim + self.action_dim, self.num_atoms, softmax=True).to(self.device)
48 | })
49 | self.network_keys_to_save = ['actor_target', 'critic_target']
50 | self.actor_optimizer = Adam(self.network_dict['actor'].parameters(), lr=self.actor_learning_rate)
51 | self._soft_update(self.network_dict['actor'], self.network_dict['actor_target'], tau=1)
52 | self.critic_optimizer = Adam(self.network_dict['critic'].parameters(), lr=self.critic_learning_rate,
53 | weight_decay=algo_params['Q_weight_decay'])
54 | self._soft_update(self.network_dict['critic'], self.network_dict['critic_target'], tau=1)
55 | # behavioural policy args (exploration)
56 | self.exploration_strategy = GaussianNoise(self.action_dim, scale=0.3, sigma=1.0)
57 | # training args
58 | self.warmup_step = algo_params['warmup_step']
59 | # statistic dict
60 | self.statistic_dict.update({
61 | 'episode_return': [],
62 | 'episode_test_return': []
63 | })
64 |
65 | def run(self, test=False, render=False, load_network_ep=None, sleep=0):
66 | if test:
67 | num_episode = self.testing_episodes
68 | if load_network_ep is not None:
69 | print("Loading network parameters...")
70 | self._load_network(ep=load_network_ep)
71 | print("Start testing...")
72 | else:
73 | num_episode = self.training_episodes
74 | print("Start training...")
75 |
76 | for ep in range(num_episode):
77 | ep_return = self._interact(render, test, sleep=sleep)
78 | self.statistic_dict['episode_return'].append(ep_return)
79 | print("Episode %i" % ep, "return %0.1f" % ep_return)
80 |
81 | if (ep % self.testing_gap == 0) and (ep != 0) and (not test):
82 | ep_test_return = []
83 | for test_ep in range(self.testing_episodes):
84 | ep_test_return.append(self._interact(render, test=True))
85 | self.statistic_dict['episode_test_return'].append(sum(ep_test_return) / self.testing_episodes)
86 | print("Episode %i" % ep, "test return %0.1f" % (sum(ep_test_return) / self.testing_episodes))
87 |
88 | if (ep % self.saving_gap == 0) and (ep != 0) and (not test):
89 | self._save_network(ep=ep)
90 |
91 | if not test:
92 | print("Finished training")
93 | print("Saving statistics...")
94 | self._plot_statistics(save_to_file=True)
95 | else:
96 | print("Finished testing")
97 |
98 | def _interact(self, render=False, test=False, sleep=0):
99 | done = False
100 | obs = self.env.reset()
101 | ep_return = 0
102 | while not done:
103 | if render:
104 | self.env.render()
105 | if self.env_step_count < self.warmup_step:
106 | action = self.env.action_space.sample()
107 | else:
108 | action = self._select_action(obs, test=test)
109 | new_obs, reward, done, info = self.env.step(action)
110 | time.sleep(sleep)
111 | ep_return += reward
112 | if not test:
113 | self._remember(obs, action, new_obs, reward, 1 - int(done))
114 | if self.observation_normalization:
115 | self.normalizer.store_history(new_obs)
116 | self.normalizer.update_mean()
117 | if (self.env_step_count % self.update_interval == 0) and (self.env_step_count > self.warmup_step):
118 | self._learn()
119 | self.env_step_count += 1
120 | obs = new_obs
121 | return ep_return
122 |
123 | def _select_action(self, obs, test=False):
124 | obs = self.normalizer(obs)
125 | with T.no_grad():
126 | inputs = T.as_tensor(obs, dtype=T.float, device=self.device)
127 | action = self.network_dict['actor_target'](inputs).cpu().detach().numpy()
128 | if test:
129 | # evaluate
130 | return np.clip(action, -self.action_max, self.action_max)
131 | else:
132 | # explore
133 | return self.exploration_strategy(action)
134 |
135 | def _learn(self, steps=None):
136 | if len(self.buffer) < self.batch_size:
137 | return
138 | if steps is None:
139 | steps = self.optimizer_steps
140 |
141 | for i in range(steps):
142 | if self.prioritised:
143 | batch, weights, inds = self.buffer.sample(self.batch_size)
144 | weights = T.as_tensor(weights, device=self.device).view(self.batch_size, 1)
145 | else:
146 | batch = self.buffer.sample(self.batch_size)
147 | weights = T.ones(size=(self.batch_size, 1), device=self.device)
148 | inds = None
149 |
150 | actor_inputs = self.normalizer(batch.state)
151 | actor_inputs = T.as_tensor(actor_inputs, dtype=T.float32, device=self.device)
152 | actions = T.as_tensor(batch.action, dtype=T.float32, device=self.device)
153 | critic_inputs = T.cat((actor_inputs, actions), dim=1)
154 | actor_inputs_ = self.normalizer(batch.next_state)
155 | actor_inputs_ = T.as_tensor(actor_inputs_, dtype=T.float32, device=self.device)
156 | rewards = T.as_tensor(batch.reward, dtype=T.float32, device=self.device)
157 | done = T.as_tensor(batch.done, dtype=T.float32, device=self.device)
158 |
159 | if self.discard_time_limit:
160 | done = done * 0 + 1
161 |
162 | with T.no_grad():
163 | actions_ = self.network_dict['actor_target'](actor_inputs_)
164 | critic_inputs_ = T.cat((actor_inputs_, actions_), dim=1)
165 | value_dist_ = self.network_dict['critic_target'](critic_inputs_)
166 | value_dist_target = self.project_value_distribution(value_dist_, rewards, done)
167 | value_dist_target = T.as_tensor(value_dist_target, device=self.device)
168 |
169 | self.critic_optimizer.zero_grad()
170 | value_dist_estimate = self.network_dict['critic'](critic_inputs)
171 | critic_loss = F.binary_cross_entropy(value_dist_estimate, value_dist_target, reduction='none').sum(dim=1)
172 | (critic_loss * weights).mean().backward()
173 | self.critic_optimizer.step()
174 |
175 | if self.prioritised:
176 | assert inds is not None
177 | self.buffer.update_priority(inds, np.abs(critic_loss.cpu().detach().numpy()))
178 |
179 | self.actor_optimizer.zero_grad()
180 | new_actions = self.network_dict['actor'](actor_inputs)
181 | critic_eval_inputs = T.cat((actor_inputs, new_actions), dim=1)
182 | # take the expectation of the value distribution as the policy loss
183 | actor_loss = -(self.network_dict['critic'](critic_eval_inputs) * self.support)
184 | actor_loss = actor_loss.sum(dim=1)
185 | actor_loss.mean().backward()
186 | self.actor_optimizer.step()
187 |
188 | self._soft_update(self.network_dict['actor'], self.network_dict['actor_target'])
189 | self._soft_update(self.network_dict['critic'], self.network_dict['critic_target'])
190 |
191 | self.statistic_dict['critic_loss'].append(critic_loss.detach().mean())
192 | self.statistic_dict['actor_loss'].append(actor_loss.detach().mean())
193 |
194 | def project_value_distribution(self, value_dist, rewards, done):
195 | # refer to https://github.com/schatty/d4pg-pytorch/blob/7dc23096a45bc4036fbb02493e0b052d57cfe4c6/models/d4pg/l2_projection.py#L7
196 | # comments added
197 | copy_value_dist = value_dist.data.cpu().numpy()
198 | copy_rewards = rewards.data.cpu().numpy()
199 | copy_done = (1 - done).data.cpu().numpy().astype(bool)  # np.bool is deprecated; use the builtin bool
200 | batch_size = self.batch_size
201 | n_atoms = self.num_atoms
202 | projected_dist = np.zeros((batch_size, n_atoms), dtype=np.float32)
203 |
204 | # calculate the next state value for each atom in the support set
205 | for atom in range(n_atoms):
206 | atom_ = copy_rewards + (self.value_min + atom * self.delta_z) * self.gamma
207 | tz_j = np.clip(atom_, a_max=self.value_max, a_min=self.value_min)
208 | # compute where the next value is on the indexes axis of the support set
209 | b_j = (tz_j - self.value_min) / self.delta_z
210 | # compute floor and ceiling indexes of the next value on the support set
211 | l = np.floor(b_j).astype(np.int64)
212 | u = np.ceil(b_j).astype(np.int64)
213 | # since l and u are floor and ceiling indexes of the next value on the support set
214 | # their difference is always 0 at the boundary and 1 otherwise
215 | # thus, the predicted probability of the next value is distributed proportional to
216 | # the difference between the projected value index (b_j) and its floor or ceiling
217 | # boundary case, floor == ceiling
218 | eq_mask = (u == l) # this line gives an array of boolean masks
219 | projected_dist[eq_mask, l[eq_mask]] += copy_value_dist[eq_mask, atom]
220 | # otherwise, ceiling - floor == 1, i.e., (u - b_j) + (b_j - l) == 1
221 | ne_mask = (u != l)
222 | projected_dist[ne_mask, l[ne_mask]] += copy_value_dist[ne_mask, atom] * (u - b_j)[ne_mask]
223 | projected_dist[ne_mask, u[ne_mask]] += copy_value_dist[ne_mask, atom] * (b_j - l)[ne_mask]
224 |
225 | # check if a terminal state exists
226 | if copy_done.any():
227 | projected_dist[copy_done] = 0.0
228 | # value at a terminal state should equal to the immediate reward only
229 | tz_j = np.clip(copy_rewards[copy_done], a_max=self.value_max, a_min=self.value_min)
230 | b_j = (tz_j - self.value_min) / self.delta_z
231 | l = np.floor(b_j).astype(np.int64)
232 | u = np.ceil(b_j).astype(np.int64)
233 | eq_mask = (u == l)
234 | eq_dones = copy_done.copy()
235 | eq_dones[copy_done] = eq_mask
236 | # the value probability is only set to 1.0
237 | # when it is a terminal state and its floor and ceiling indexes are the same
238 | if eq_dones.any():
239 | projected_dist[eq_dones, l[eq_mask]] = 1.0
240 | ne_mask = (u != l)
241 | ne_dones = copy_done.copy()
242 | ne_dones[copy_done] = ne_mask
243 | # the value probability is only distributed while summed to 1.0
244 | # when it is a terminal state and its floor and ceiling indexes differ by 1 index
245 | if ne_dones.any():
246 | projected_dist[ne_dones, l[ne_mask]] = (u - b_j)[ne_mask]
247 | projected_dist[ne_dones, u[ne_mask]] = (b_j - l)[ne_mask]
248 |
249 | return projected_dist
250 |
--------------------------------------------------------------------------------
/drl_implementation/agent/continuous_action/sac.py:
--------------------------------------------------------------------------------
1 | import time
2 | import numpy as np
3 | import torch as T
4 | import torch.nn.functional as F
5 | from torch.optim.adam import Adam
6 | from ..utils.networks_mlp import StochasticActor, Critic
7 | from ..agent_base import Agent
8 |
9 |
10 | class SAC(Agent):
11 | def __init__(self, algo_params, env, transition_tuple=None, path=None, seed=-1):
12 | # environment
13 | self.env = env
14 | self.env.seed(seed)
15 | obs = self.env.reset()
16 | algo_params.update({'state_dim': obs.shape[0],
17 | 'action_dim': self.env.action_space.shape[0],
18 | 'action_max': self.env.action_space.high,
19 | 'action_scaling': self.env.action_space.high[0],
20 | 'init_input_means': None,
21 | 'init_input_vars': None
22 | })
23 | # training args
24 | self.training_episodes = algo_params['training_episodes']
25 | self.testing_gap = algo_params['testing_gap']
26 | self.testing_episodes = algo_params['testing_episodes']
27 | self.saving_gap = algo_params['saving_gap']
28 |
29 | super(SAC, self).__init__(algo_params,
30 | transition_tuple=transition_tuple,
31 | goal_conditioned=False,
32 | path=path,
33 | seed=seed)
34 | # torch
35 | self.network_dict.update({
36 | 'actor': StochasticActor(self.state_dim, self.action_dim, log_std_min=-6, log_std_max=1, action_scaling=self.action_scaling).to(self.device),
37 | 'critic_1': Critic(self.state_dim + self.action_dim, 1).to(self.device),
38 | 'critic_1_target': Critic(self.state_dim + self.action_dim, 1).to(self.device),
39 | 'critic_2': Critic(self.state_dim + self.action_dim, 1).to(self.device),
40 | 'critic_2_target': Critic(self.state_dim + self.action_dim, 1).to(self.device),
41 | 'alpha': algo_params['alpha'],
42 | 'log_alpha': T.tensor(np.log(algo_params['alpha']), requires_grad=True, device=self.device),
43 | })
44 | self.network_keys_to_save = ['actor', 'critic_1_target']
45 | self.actor_optimizer = Adam(self.network_dict['actor'].parameters(), lr=self.actor_learning_rate)
46 | self.critic_1_optimizer = Adam(self.network_dict['critic_1'].parameters(), lr=self.critic_learning_rate)
47 | self.critic_2_optimizer = Adam(self.network_dict['critic_2'].parameters(), lr=self.critic_learning_rate)
48 | self._soft_update(self.network_dict['critic_1'], self.network_dict['critic_1_target'], tau=1)
49 | self._soft_update(self.network_dict['critic_2'], self.network_dict['critic_2_target'], tau=1)
50 | self.target_entropy = -self.action_dim
51 | self.alpha_optimizer = Adam([self.network_dict['log_alpha']], lr=self.actor_learning_rate)
52 | # training args
53 | self.warmup_step = algo_params['warmup_step']
54 | self.actor_update_interval = algo_params['actor_update_interval']
55 | self.critic_target_update_interval = algo_params['critic_target_update_interval']
56 | # statistic dict
57 | self.statistic_dict.update({
58 | 'episode_return': [],
59 | 'episode_test_return': [],
60 | 'alpha': [],
61 | 'policy_entropy': [],
62 | })
63 |
64 | def run(self, test=False, render=False, load_network_ep=None, sleep=0):
65 | if test:
66 | num_episode = self.testing_episodes
67 | if load_network_ep is not None:
68 | print("Loading network parameters...")
69 | self._load_network(ep=load_network_ep)
70 | print("Start testing...")
71 | else:
72 | num_episode = self.training_episodes
73 | print("Start training...")
74 |
75 | for ep in range(num_episode):
76 | ep_return = self._interact(render, test, sleep=sleep)
77 | self.statistic_dict['episode_return'].append(ep_return)
78 | print("Episode %i" % ep, "return %0.1f" % ep_return)
79 |
80 | if (ep % self.testing_gap == 0) and (ep != 0) and (not test):
81 | ep_test_return = []
82 | for test_ep in range(self.testing_episodes):
83 | ep_test_return.append(self._interact(render, test=True))
84 | self.statistic_dict['episode_test_return'].append(sum(ep_test_return)/self.testing_episodes)
85 | print("Episode %i" % ep, "test return %0.1f" % (sum(ep_test_return)/self.testing_episodes))
86 |
87 | if (ep % self.saving_gap == 0) and (ep != 0) and (not test):
88 | self._save_network(ep=ep)
89 |
90 | if not test:
91 | print("Finished training")
92 | print("Saving statistics...")
93 | self._plot_statistics(save_to_file=True)
94 | else:
95 | print("Finished testing")
96 |
97 | def _interact(self, render=False, test=False, sleep=0):
98 | done = False
99 | obs = self.env.reset()
100 | ep_return = 0
101 | # start a new episode
102 | while not done:
103 | if render:
104 | self.env.render()
105 | if self.env_step_count < self.warmup_step:
106 | action = self.env.action_space.sample()
107 | else:
108 | action = self._select_action(obs, test=test)
109 | new_obs, reward, done, info = self.env.step(action)
110 | time.sleep(sleep)
111 | ep_return += reward
112 | if not test:
113 | self._remember(obs, action, new_obs, reward, 1 - int(done))
114 | if self.observation_normalization:
115 | self.normalizer.store_history(new_obs)
116 | self.normalizer.update_mean()
117 | if (self.env_step_count % self.update_interval == 0) and (self.env_step_count > self.warmup_step):
118 | self._learn()
119 | self.env_step_count += 1
120 | obs = new_obs
121 | return ep_return
122 |
123 | def _select_action(self, obs, test=False):
124 | inputs = self.normalizer(obs)
125 | inputs = T.as_tensor(inputs, dtype=T.float, device=self.device)
126 | return self.network_dict['actor'].get_action(inputs, mean_pi=test).detach().cpu().numpy()
127 |
128 | def _learn(self, steps=None):
129 | if len(self.buffer) < self.batch_size:
130 | return
131 | if steps is None:
132 | steps = self.optimizer_steps
133 |
134 | for i in range(steps):
135 | if self.prioritised:
136 | batch, weights, inds = self.buffer.sample(self.batch_size)
137 | weights = T.tensor(weights).view(self.batch_size, 1).to(self.device)
138 | else:
139 | batch = self.buffer.sample(self.batch_size)
140 | weights = T.ones(size=(self.batch_size, 1)).to(self.device)
141 | inds = None
142 |
143 | actor_inputs = self.normalizer(batch.state)
144 | actor_inputs = T.as_tensor(actor_inputs, dtype=T.float32, device=self.device)
145 | actions = T.as_tensor(batch.action, dtype=T.float32, device=self.device)
146 | critic_inputs = T.cat((actor_inputs, actions), dim=1)
147 | actor_inputs_ = self.normalizer(batch.next_state)
148 | actor_inputs_ = T.as_tensor(actor_inputs_, dtype=T.float32, device=self.device)
149 | rewards = T.as_tensor(batch.reward, dtype=T.float32, device=self.device).unsqueeze(1)
150 | done = T.as_tensor(batch.done, dtype=T.float32, device=self.device).unsqueeze(1)
151 |
152 | if self.discard_time_limit:
153 |                 done = done * 0 + 1  # treat time-limit terminations as non-terminal (keep bootstrapping)
154 |
155 | with T.no_grad():
156 | actions_, log_probs_ = self.network_dict['actor'].get_action(actor_inputs_, probs=True)
157 | critic_inputs_ = T.cat((actor_inputs_, actions_), dim=1)
158 | value_1_ = self.network_dict['critic_1_target'](critic_inputs_)
159 | value_2_ = self.network_dict['critic_2_target'](critic_inputs_)
160 | value_ = T.min(value_1_, value_2_) - (self.network_dict['alpha'] * log_probs_)
161 | value_target = rewards + done * self.gamma * value_
162 |
163 | self.critic_1_optimizer.zero_grad()
164 | value_estimate_1 = self.network_dict['critic_1'](critic_inputs)
165 | critic_loss_1 = F.mse_loss(value_estimate_1, value_target.detach(), reduction='none')
166 | (critic_loss_1 * weights).mean().backward()
167 | self.critic_1_optimizer.step()
168 |
169 | if self.prioritised:
170 | assert inds is not None
171 | self.buffer.update_priority(inds, np.abs(critic_loss_1.cpu().detach().numpy()))
172 |
173 | self.critic_2_optimizer.zero_grad()
174 | value_estimate_2 = self.network_dict['critic_2'](critic_inputs)
175 | critic_loss_2 = F.mse_loss(value_estimate_2, value_target.detach(), reduction='none')
176 | (critic_loss_2 * weights).mean().backward()
177 | self.critic_2_optimizer.step()
178 |
179 | self.statistic_dict['critic_loss'].append(critic_loss_1.detach().mean())
180 |
181 | if self.optim_step_count % self.critic_target_update_interval == 0:
182 | self._soft_update(self.network_dict['critic_1'], self.network_dict['critic_1_target'])
183 | self._soft_update(self.network_dict['critic_2'], self.network_dict['critic_2_target'])
184 |
185 | if self.optim_step_count % self.actor_update_interval == 0:
186 | self.actor_optimizer.zero_grad()
187 | new_actions, new_log_probs = self.network_dict['actor'].get_action(actor_inputs, probs=True)
188 | critic_eval_inputs = T.cat((actor_inputs, new_actions), dim=1)
189 | new_values = T.min(self.network_dict['critic_1'](critic_eval_inputs),
190 | self.network_dict['critic_2'](critic_eval_inputs))
191 | actor_loss = (self.network_dict['alpha']*new_log_probs - new_values).mean()
192 | actor_loss.backward()
193 | self.actor_optimizer.step()
194 |
195 | self.alpha_optimizer.zero_grad()
196 | alpha_loss = (self.network_dict['log_alpha'] * (-new_log_probs - self.target_entropy).detach()).mean()
197 | alpha_loss.backward()
198 | self.alpha_optimizer.step()
199 | self.network_dict['alpha'] = self.network_dict['log_alpha'].exp()
200 |
201 | self.statistic_dict['actor_loss'].append(actor_loss.detach().mean())
202 | self.statistic_dict['alpha'].append(self.network_dict['alpha'].detach())
203 | self.statistic_dict['policy_entropy'].append(-new_log_probs.detach().mean())
204 |
205 | self.optim_step_count += 1
206 |
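A note on the temperature (alpha) update in `_learn` above: SAC parameterises the temperature through `log_alpha` and trains it so the policy entropy stays near the heuristic target `-action_dim`. Below is a minimal, self-contained sketch of that update rule using random placeholder tensors instead of this repository's actor and replay buffer; the learning rate and batch size are illustrative assumptions.

```python
# Sketch of SAC's automatic entropy-temperature update (placeholder data,
# not the repository's classes).
import torch as T
from torch.optim.adam import Adam

action_dim = 2
target_entropy = -float(action_dim)          # heuristic target entropy: -|A|
log_alpha = T.zeros(1, requires_grad=True)   # optimise log(alpha) so alpha stays positive
alpha_optimizer = Adam([log_alpha], lr=3e-4)

# stand-in for log-probabilities of actions sampled from the current policy
log_probs = T.randn(64, 1)

# J(alpha) = E[ log_alpha * (-log_pi(a|s) - target_entropy) ]
alpha_loss = (log_alpha * (-log_probs - target_entropy).detach()).mean()
alpha_optimizer.zero_grad()
alpha_loss.backward()
alpha_optimizer.step()

alpha = log_alpha.exp()   # weights the entropy bonus in the actor loss and critic targets
```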
--------------------------------------------------------------------------------
/drl_implementation/agent/continuous_action/sac_goal_conditioned.py:
--------------------------------------------------------------------------------
1 | import time
2 | import numpy as np
3 | import torch as T
4 | import torch.nn.functional as F
5 | from torch.optim.adam import Adam
6 | from ..utils.networks_mlp import StochasticActor, Critic
7 | from ..agent_base import Agent
8 |
9 |
10 | class GoalConditionedSAC(Agent):
11 | def __init__(self, algo_params, env, transition_tuple=None, path=None, seed=-1):
12 | # environment
13 | self.env = env
14 | self.env.seed(seed)
15 | obs = self.env.reset()
16 | algo_params.update({'state_dim': obs['observation'].shape[0],
17 | 'goal_dim': obs['desired_goal'].shape[0],
18 | 'action_dim': self.env.action_space.shape[0],
19 | 'action_max': self.env.action_space.high,
20 | 'action_scaling': self.env.action_space.high[0],
21 | 'init_input_means': None,
22 | 'init_input_vars': None
23 | })
24 | # training args
25 | self.training_epochs = algo_params['training_epochs']
26 | self.training_cycles = algo_params['training_cycles']
27 | self.training_episodes = algo_params['training_episodes']
28 | self.testing_gap = algo_params['testing_gap']
29 | self.testing_episodes = algo_params['testing_episodes']
30 | self.saving_gap = algo_params['saving_gap']
31 |
32 | super(GoalConditionedSAC, self).__init__(algo_params,
33 | transition_tuple=transition_tuple,
34 | goal_conditioned=True,
35 | path=path,
36 | seed=seed)
37 | # torch
38 | self.network_dict.update({
39 | 'actor': StochasticActor(self.state_dim + self.goal_dim, self.action_dim, log_std_min=-6, log_std_max=1,
40 | action_scaling=self.action_scaling).to(self.device),
41 | 'critic_1': Critic(self.state_dim + self.goal_dim + self.action_dim, 1).to(self.device),
42 | 'critic_1_target': Critic(self.state_dim + self.goal_dim + self.action_dim, 1).to(self.device),
43 | 'critic_2': Critic(self.state_dim + self.goal_dim + self.action_dim, 1).to(self.device),
44 | 'critic_2_target': Critic(self.state_dim + self.goal_dim + self.action_dim, 1).to(self.device),
45 | 'alpha': algo_params['alpha'],
46 | 'log_alpha': T.tensor(np.log(algo_params['alpha']), requires_grad=True, device=self.device),
47 | })
48 | self.network_keys_to_save = ['actor', 'critic_1_target']
49 | self.actor_optimizer = Adam(self.network_dict['actor'].parameters(), lr=self.actor_learning_rate)
50 | self.critic_1_optimizer = Adam(self.network_dict['critic_1'].parameters(), lr=self.critic_learning_rate)
51 | self.critic_2_optimizer = Adam(self.network_dict['critic_2'].parameters(), lr=self.critic_learning_rate)
52 | self._soft_update(self.network_dict['critic_1'], self.network_dict['critic_1_target'], tau=1)
53 | self._soft_update(self.network_dict['critic_2'], self.network_dict['critic_2_target'], tau=1)
54 | self.target_entropy = -self.action_dim
55 | self.alpha_optimizer = Adam([self.network_dict['log_alpha']], lr=self.actor_learning_rate)
56 | # training args
57 | self.clip_value = algo_params['clip_value']
58 | self.actor_update_interval = algo_params['actor_update_interval']
59 | self.critic_target_update_interval = algo_params['critic_target_update_interval']
60 | # statistic dict
61 | self.statistic_dict.update({
62 | 'cycle_return': [],
63 | 'cycle_success_rate': [],
64 | 'epoch_test_return': [],
65 | 'epoch_test_success_rate': [],
66 | 'alpha': [],
67 | 'policy_entropy': [],
68 | })
69 |
70 | def run(self, test=False, render=False, load_network_ep=None, sleep=0):
71 | # training setup uses a hierarchy of Epoch, Cycle and Episode
72 | # following the HER paper: https://papers.nips.cc/paper/2017/hash/453fadbd8a1a3af50a9df4df899537b5-Abstract.html
73 | if test:
74 | if load_network_ep is not None:
75 | print("Loading network parameters...")
76 | self._load_network(ep=load_network_ep)
77 | print("Start testing...")
78 | else:
79 | print("Start training...")
80 |
81 | for epo in range(self.training_epochs):
82 | for cyc in range(self.training_cycles):
83 | cycle_return = 0
84 | cycle_success = 0
85 | for ep in range(self.training_episodes):
86 | ep_return = self._interact(render, test, sleep=sleep)
87 | cycle_return += ep_return
88 | if ep_return > -50:
89 | cycle_success += 1
90 |
91 | self.statistic_dict['cycle_return'].append(cycle_return / self.training_episodes)
92 | self.statistic_dict['cycle_success_rate'].append(cycle_success / self.training_episodes)
93 | print("Epoch %i" % epo, "Cycle %i" % cyc,
94 | "avg. return %0.1f" % (cycle_return / self.training_episodes),
95 | "success rate %0.1f" % (cycle_success / self.training_episodes))
96 |
97 | if (epo % self.testing_gap == 0) and (epo != 0) and (not test):
98 | test_return = 0
99 | test_success = 0
100 | for test_ep in range(self.testing_episodes):
101 | ep_test_return = self._interact(render, test=True)
102 | test_return += ep_test_return
103 | if ep_test_return > -50:
104 | test_success += 1
105 | self.statistic_dict['epoch_test_return'].append(test_return / self.testing_episodes)
106 | self.statistic_dict['epoch_test_success_rate'].append(test_success / self.testing_episodes)
107 | print("Epoch %i" % epo, "test avg. return %0.1f" % (test_return / self.testing_episodes))
108 |
109 | if (epo % self.saving_gap == 0) and (epo != 0) and (not test):
110 | self._save_network(ep=epo)
111 |
112 | if not test:
113 | print("Finished training")
114 | print("Saving statistics...")
115 | self._plot_statistics(
116 | x_labels={
117 | 'critic_loss': 'Optimization epoch (per ' + str(self.optimizer_steps) + ' steps)',
118 | 'actor_loss': 'Optimization epoch (per ' + str(self.optimizer_steps) + ' steps)',
119 | 'alpha': 'Optimization epoch (per ' + str(self.optimizer_steps) + ' steps)',
120 | 'policy_entropy': 'Optimization epoch (per ' + str(self.optimizer_steps) + ' steps)'
121 | },
122 | save_to_file=True)
123 | else:
124 | print("Finished testing")
125 |
126 | def _interact(self, render=False, test=False, sleep=0):
127 | done = False
128 | obs = self.env.reset()
129 | ep_return = 0
130 | new_episode = True
131 | # start a new episode
132 | while not done:
133 | if render:
134 | self.env.render()
135 | action = self._select_action(obs, test=test)
136 | new_obs, reward, done, info = self.env.step(action)
137 | time.sleep(sleep)
138 | ep_return += reward
139 | if not test:
140 | self._remember(obs['observation'], obs['desired_goal'], action,
141 | new_obs['observation'], new_obs['achieved_goal'], reward, 1 - int(done),
142 | new_episode=new_episode)
143 | if self.observation_normalization:
144 | self.normalizer.store_history(np.concatenate((new_obs['observation'],
145 | new_obs['achieved_goal']), axis=0))
146 | obs = new_obs
147 | new_episode = False
148 |
149 | if not test:
150 | self.normalizer.update_mean()
151 | self._learn()
152 | return ep_return
153 |
154 | def _select_action(self, obs, test=False):
155 | inputs = np.concatenate((obs['observation'], obs['desired_goal']), axis=0)
156 | inputs = self.normalizer(inputs)
157 | inputs = T.as_tensor(inputs, dtype=T.float).to(self.device)
158 | return self.network_dict['actor'].get_action(inputs, mean_pi=test).detach().cpu().numpy()
159 |
160 | def _learn(self, steps=None):
161 | if self.hindsight:
162 | self.buffer.modify_episodes()
163 | self.buffer.store_episodes()
164 | if len(self.buffer) < self.batch_size:
165 | return
166 | if steps is None:
167 | steps = self.optimizer_steps
168 |
169 | critic_losses = T.zeros(1, device=self.device)
170 | actor_losses = T.zeros(1, device=self.device)
171 | alphas = T.zeros(1, device=self.device)
172 | policy_entropies = T.zeros(1, device=self.device)
173 | for i in range(steps):
174 | if self.prioritised:
175 | batch, weights, inds = self.buffer.sample(self.batch_size)
176 | weights = T.as_tensor(weights, device=self.device).view(self.batch_size, 1)
177 | else:
178 | batch = self.buffer.sample(self.batch_size)
179 | weights = T.ones(size=(self.batch_size, 1), device=self.device)
180 | inds = None
181 |
182 | actor_inputs = np.concatenate((batch.state, batch.desired_goal), axis=1)
183 | actor_inputs = self.normalizer(actor_inputs)
184 | actor_inputs = T.as_tensor(actor_inputs, dtype=T.float32, device=self.device)
185 | actions_np = np.array(batch.action)
186 | actions = T.as_tensor(actions_np, dtype=T.float32, device=self.device)
187 | critic_inputs = T.cat((actor_inputs, actions), dim=1)
188 | actor_inputs_ = np.concatenate((batch.next_state, batch.desired_goal), axis=1)
189 | actor_inputs_ = self.normalizer(actor_inputs_)
190 | actor_inputs_ = T.as_tensor(actor_inputs_, dtype=T.float32, device=self.device)
191 | rewards_np = np.array(batch.reward)
192 | rewards = T.as_tensor(rewards_np, dtype=T.float32, device=self.device).unsqueeze(1)
193 | done_np = np.array(batch.done)
194 | done = T.as_tensor(done_np, dtype=T.float32, device=self.device).unsqueeze(1)
195 |
196 | if self.discard_time_limit:
197 | done = done * 0 + 1
198 |
199 | with T.no_grad():
200 | actions_, log_probs_ = self.network_dict['actor'].get_action(actor_inputs_, probs=True)
201 | critic_inputs_ = T.cat((actor_inputs_, actions_), dim=1)
202 | value_1_ = self.network_dict['critic_1_target'](critic_inputs_)
203 | value_2_ = self.network_dict['critic_2_target'](critic_inputs_)
204 | value_ = T.min(value_1_, value_2_) - (self.network_dict['alpha'] * log_probs_)
205 | value_target = rewards + done * self.gamma * value_
206 | value_target = T.clamp(value_target, -self.clip_value, 0.0)
207 |
208 | self.critic_1_optimizer.zero_grad()
209 | value_estimate_1 = self.network_dict['critic_1'](critic_inputs)
210 | critic_loss_1 = F.mse_loss(value_estimate_1, value_target.detach(), reduction='none')
211 | (critic_loss_1 * weights).mean().backward()
212 | self.critic_1_optimizer.step()
213 |
214 | if self.prioritised:
215 | assert inds is not None
216 | self.buffer.update_priority(inds, np.abs(critic_loss_1.cpu().detach().numpy()))
217 |
218 | self.critic_2_optimizer.zero_grad()
219 | value_estimate_2 = self.network_dict['critic_2'](critic_inputs)
220 | critic_loss_2 = F.mse_loss(value_estimate_2, value_target.detach(), reduction='none')
221 | (critic_loss_2 * weights).mean().backward()
222 | self.critic_2_optimizer.step()
223 |
224 | critic_losses += critic_loss_1.detach().mean()
225 |
226 | if self.optim_step_count % self.critic_target_update_interval == 0:
227 | self._soft_update(self.network_dict['critic_1'], self.network_dict['critic_1_target'])
228 | self._soft_update(self.network_dict['critic_2'], self.network_dict['critic_2_target'])
229 |
230 | if self.optim_step_count % self.actor_update_interval == 0:
231 | self.actor_optimizer.zero_grad()
232 | new_actions, new_log_probs, entropy = self.network_dict['actor'].get_action(actor_inputs, probs=True,
233 | entropy=True)
234 | critic_eval_inputs = T.cat((actor_inputs, new_actions), dim=1).to(self.device)
235 | new_values = T.min(self.network_dict['critic_1'](critic_eval_inputs),
236 | self.network_dict['critic_2'](critic_eval_inputs))
237 | actor_loss = (self.network_dict['alpha'] * new_log_probs - new_values).mean()
238 | actor_loss.backward()
239 | self.actor_optimizer.step()
240 |
241 | self.alpha_optimizer.zero_grad()
242 | alpha_loss = (self.network_dict['log_alpha'] * (-new_log_probs - self.target_entropy).detach()).mean()
243 | alpha_loss.backward()
244 | self.alpha_optimizer.step()
245 | self.network_dict['alpha'] = self.network_dict['log_alpha'].exp()
246 |
247 | actor_losses += actor_loss.detach()
248 | alphas += self.network_dict['alpha'].detach()
249 | policy_entropies += entropy.detach().mean()
250 |
251 | self.optim_step_count += 1
252 |
253 | self.statistic_dict['critic_loss'].append(critic_losses / steps)
254 | self.statistic_dict['actor_loss'].append(actor_losses / steps)
255 | self.statistic_dict['alpha'].append(alphas / steps)
256 | self.statistic_dict['policy_entropy'].append(policy_entropies / steps)
257 |
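A note on the `clip_value` clamp in `_learn` above: the goal-conditioned tasks targeted here emit sparse, non-positive rewards, so the true return can never exceed zero, and clamping the bootstrapped target into `[-clip_value, 0]` keeps the critic targets inside that feasible range. The sketch below reproduces just that step with placeholder tensors; the concrete numbers (batch size, gamma, clip_value) are illustrative assumptions.

```python
# Sketch of the clipped TD target used in GoalConditionedSAC._learn
# (placeholder tensors stand in for the target critics and policy).
import torch as T

gamma = 0.98
clip_value = 50.0                       # e.g. roughly the episode horizon (assumption)

rewards = -T.ones(64, 1)                # sparse rewards in {-1, 0}
not_done = T.ones(64, 1)                # 1 - done flags
next_value = T.randn(64, 1) * 10 - 30   # stand-in for min(Q1', Q2') - alpha * log_pi'

value_target = rewards + not_done * gamma * next_value
value_target = T.clamp(value_target, -clip_value, 0.0)   # keep targets in the feasible range
```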
--------------------------------------------------------------------------------
/drl_implementation/agent/continuous_action/sac_parameterised_action_goal_conditioned.py:
--------------------------------------------------------------------------------
1 | import time
2 | import numpy as np
3 | import torch as T
4 | import torch.nn.functional as F
5 | from torch.optim.adam import Adam
6 | from ..utils.networks_mlp import StochasticActor
7 | from ..utils.networks_pointnet import CriticPointNet, CriticPointNet2
8 | from ..agent_base import Agent
9 | from collections import namedtuple
10 |
11 |
12 | class GPASAC(Agent):
13 | def __init__(self, algo_params, env, transition_tuple=None, path=None, seed=-1):
14 | # environment
15 | self.env = env
16 | self.env.seed(seed)
17 | obs = self.env.reset()
18 | algo_params.update({'state_shape': obs['observation'].shape,
19 | 'goal_shape': obs['desired_goal'].shape,
20 | 'discrete_action_dim': self.env.discrete_action_space.n,
21 | 'continuous_action_dim': self.env.continuous_action_space.shape[0],
22 | 'continuous_action_max': self.env.continuous_action_space.high,
23 | 'continuous_action_scaling': self.env.continuous_action_space.high[0],
24 | })
25 | # training args
26 | self.cur_ep = 0
27 | self.warmup_step = algo_params['warmup_step']
28 | self.training_episodes = algo_params['training_episodes']
29 | self.testing_gap = algo_params['testing_gap']
30 | self.testing_episodes = algo_params['testing_episodes']
31 | self.saving_gap = algo_params['saving_gap']
32 |
33 | self.use_demonstrations = algo_params['use_demonstrations']
34 | self.demonstrate_percentage = algo_params['demonstrate_percentage']
35 | assert 0 < self.demonstrate_percentage < 1, "Demonstrate percentage should be between 0 and 1"
36 | self.n_demonstrate_episodes = int(self.demonstrate_percentage * self.training_episodes)
37 | self.planned_skills = algo_params['planned_skills']
38 |         assert not (self.use_demonstrations and self.planned_skills), "Cannot use demonstrations and planned skills at the same time"
39 | self.skill_plan = algo_params['skill_plan']
40 | self.use_planned_skills = False
41 |
42 | if transition_tuple is None:
43 | transition_tuple = namedtuple("transition",
44 | ('state', 'desired_goal', 'action',
45 | 'next_state', 'achieved_goal', 'reward', 'done', 'next_skill'))
46 | super(GPASAC, self).__init__(algo_params,
47 | non_flat_obs=True,
48 | action_type='hybrid',
49 | transition_tuple=transition_tuple,
50 | goal_conditioned=True,
51 | path=path,
52 | seed=seed,
53 | create_logger=True)
54 | # torch
55 | self.network_dict.update({
56 | 'discrete_actor': StochasticActor(2048, self.discrete_action_dim, continuous=False,
57 | fc1_size=1024,
58 | log_std_min=-6, log_std_max=1).to(self.device),
59 | 'continuous_actor': StochasticActor(2048 + self.discrete_action_dim, self.continuous_action_dim,
60 | fc1_size=1024,
61 | log_std_min=-6, log_std_max=1,
62 | action_scaling=self.continuous_action_scaling).to(self.device),
63 | 'critic_1': CriticPointNet(output_dim=1, action_dim=self.discrete_action_dim+self.continuous_action_dim).to(self.device),
64 | 'critic_1_target': CriticPointNet(output_dim=1, action_dim=self.discrete_action_dim+self.continuous_action_dim).to(self.device),
65 | 'critic_2': CriticPointNet(output_dim=1, action_dim=self.discrete_action_dim+self.continuous_action_dim).to(self.device),
66 | 'critic_2_target': CriticPointNet(output_dim=1, action_dim=self.discrete_action_dim+self.continuous_action_dim).to(self.device),
67 | 'alpha_discrete': algo_params['alpha'],
68 | 'log_alpha_discrete': T.tensor(np.log(algo_params['alpha']), requires_grad=True, device=self.device),
69 | 'alpha_continuous': algo_params['alpha'],
70 | 'log_alpha_continuous': T.tensor(np.log(algo_params['alpha']), requires_grad=True, device=self.device),
71 | })
72 | self.network_dict['critic_1_target'].eval()
73 | self.network_dict['critic_2_target'].eval()
74 | self.network_keys_to_save = ['discrete_actor', 'continuous_actor', 'critic_1', 'critic_1_target']
75 | self.discrete_actor_optimizer = Adam(self.network_dict['discrete_actor'].parameters(),
76 | lr=self.actor_learning_rate)
77 | self.continuous_actor_optimizer = Adam(self.network_dict['continuous_actor'].parameters(),
78 | lr=self.actor_learning_rate)
79 | self.critic_1_optimizer = Adam(self.network_dict['critic_1'].parameters(), lr=self.critic_learning_rate)
80 | self.critic_2_optimizer = Adam(self.network_dict['critic_2'].parameters(), lr=self.critic_learning_rate)
81 | self._soft_update(self.network_dict['critic_1'], self.network_dict['critic_1_target'], tau=1)
82 | self._soft_update(self.network_dict['critic_2'], self.network_dict['critic_2_target'], tau=1)
83 | self.target_discrete_entropy = -self.discrete_action_dim
84 | self.target_continuous_entropy = -self.continuous_action_dim
85 | self.alpha_discrete_optimizer = Adam([self.network_dict['log_alpha_discrete']], lr=self.actor_learning_rate)
86 | self.alpha_continuous_optimizer = Adam([self.network_dict['log_alpha_continuous']], lr=self.actor_learning_rate)
87 | # training args
88 | # self.clip_value = algo_params['clip_value']
89 | self.actor_update_interval = algo_params['actor_update_interval']
90 | self.critic_target_update_interval = algo_params['critic_target_update_interval']
91 |
92 | def run(self, test=False, render=False, load_network_ep=None, sleep=0):
93 | if test:
94 | if load_network_ep is not None:
95 | print("Loading network parameters...")
96 | self._load_network(ep=load_network_ep)
97 | print("Start testing...")
98 | else:
99 | print("Start training...")
100 |
101 | for ep in range(self.training_episodes):
102 | if self.use_demonstrations and (ep < self.n_demonstrate_episodes):
103 | self.use_planned_skills = True
104 | elif self.planned_skills:
105 | self.use_planned_skills = True
106 | else:
107 | self.use_planned_skills = False
108 | self.cur_ep = ep
109 | loss_info = self._interact(render, test, sleep=sleep)
110 | self.logger.add_scalar(tag='Task/return', scalar_value=loss_info['emd_loss'], global_step=self.cur_ep)
111 | self.logger.add_scalar(tag='Task/heightmap_loss', scalar_value=loss_info['height_map_loss'], global_step=ep)
112 | print("Episode %i" % ep, "return %0.1f" % loss_info['emd_loss'])
113 | if not test and self.hindsight:
114 | self.buffer.hindsight()
115 |
116 | if (ep % self.testing_gap == 0) and (ep != 0) and (not test):
117 | if self.planned_skills:
118 | self.use_planned_skills = True
119 | else:
120 | self.use_planned_skills = False
121 | test_return = 0
122 | test_heightmap_loss = 0
123 | for test_ep in range(self.testing_episodes):
124 | loss_info = self._interact(render, test=True)
125 | test_return += loss_info['emd_loss']
126 | test_heightmap_loss += loss_info['height_map_loss']
127 | self.logger.add_scalar(tag='Task/test_return',
128 | scalar_value=(test_return / self.testing_episodes), global_step=self.cur_ep)
129 | self.logger.add_scalar(tag='Task/test_heightmap_loss',
130 | scalar_value=(test_heightmap_loss / self.testing_episodes), global_step=self.cur_ep)
131 |
132 | print("Episode %i" % ep, "test avg. return %0.1f" % (test_return / self.testing_episodes))
133 |
134 | if (ep % self.saving_gap == 0) and (ep != 0) and (not test):
135 | self._save_network(ep=ep)
136 |
137 | if not test:
138 | print("Finished training")
139 | print("Saving statistics...")
140 | else:
141 | print("Finished testing")
142 |
143 | def _interact(self, render=False, test=False, sleep=0):
144 | done = False
145 | obs = self.env.reset()
146 | ep_return = 0
147 | new_episode = True
148 | # start a new episode
149 | while not done:
150 | if render:
151 | self.env.render()
152 | if self.total_env_step_count < self.warmup_step:
153 | if self.use_planned_skills:
154 | discrete_action = self.skill_plan[self.env.step_count]
155 | else:
156 | discrete_action = self.env.discrete_action_space.sample()
157 | continuous_action = self.env.continuous_action_space.sample()
158 | action = np.concatenate([[discrete_action], continuous_action], axis=0)
159 | else:
160 | action = self._select_action(obs, test=test)
161 | new_obs, reward, done, info = self.env.step(action)
162 | time.sleep(sleep)
163 | ep_return += reward
164 |
165 | next_skill = 0
166 | if self.planned_skills:
167 | try:
168 | next_skill = self.skill_plan[self.env.step_count]
169 |                 except (KeyError, IndexError):
170 | pass
171 |
172 | if not test:
173 | self._remember(obs['observation'], obs['desired_goal'], action,
174 | new_obs['observation'], new_obs['achieved_goal'], reward, 1 - int(done), next_skill,
175 | new_episode=new_episode)
176 | self.total_env_step_count += 1
177 | self._learn(steps=1)
178 |
179 | obs = new_obs
180 | new_episode = False
181 |
182 | return info
183 |
184 | def _select_action(self, obs, test=False):
185 | obs_points = T.as_tensor([obs['observation']], dtype=T.float).to(self.device)
186 | goal_points = T.as_tensor([obs['desired_goal']], dtype=T.float).to(self.device)
187 | obs_point_features = self.network_dict['critic_1_target'].get_features(obs_points.transpose(2, 1))
188 | goal_point_features = self.network_dict['critic_1_target'].get_features(goal_points.transpose(2, 1))
189 | inputs = T.cat((obs_point_features, goal_point_features), dim=1)
190 | if self.use_planned_skills:
191 | discrete_action = T.as_tensor([self.skill_plan[self.env.step_count]], dtype=T.long).to(self.device)
192 | else:
193 | discrete_action, _, _ = self.network_dict['discrete_actor'].get_action(inputs, greedy=test)
194 |         discrete_action = discrete_action.type(T.long).flatten()
195 | discrete_action_onehot = F.one_hot(discrete_action, self.discrete_action_dim).float()
196 | inputs = T.cat((inputs, discrete_action_onehot), dim=1)
197 | continuous_action = self.network_dict['continuous_actor'].get_action(inputs, mean_pi=test).detach().cpu().numpy()
198 | return np.concatenate([discrete_action.detach().cpu().numpy(), continuous_action[0]], axis=0)
199 |
200 | def _learn(self, steps=None):
201 | if len(self.buffer) < self.batch_size:
202 | return
203 | if steps is None:
204 | steps = self.optimizer_steps
205 |
206 | avg_critic_1_loss = T.zeros(1, device=self.device)
207 | avg_critic_2_loss = T.zeros(1, device=self.device)
208 | avg_discrete_actor_loss = T.zeros(1, device=self.device)
209 | avg_discrete_alpha = T.zeros(1, device=self.device)
210 | avg_discrete_policy_entropy = T.zeros(1, device=self.device)
211 | avg_continuous_actor_loss = T.zeros(1, device=self.device)
212 | avg_continuous_alpha = T.zeros(1, device=self.device)
213 | avg_continuous_policy_entropy = T.zeros(1, device=self.device)
214 | for i in range(steps):
215 | if self.prioritised:
216 | batch, weights, inds = self.buffer.sample(self.batch_size)
217 | weights = T.as_tensor(weights, device=self.device).view(self.batch_size, 1)
218 | else:
219 | batch = self.buffer.sample(self.batch_size)
220 | weights = T.ones(size=(self.batch_size, 1), device=self.device)
221 | inds = None
222 |
223 | obs = T.as_tensor(batch.state, dtype=T.float32, device=self.device).transpose(2, 1)
224 | obs_features = self.network_dict['critic_1_target'].get_features(obs, detach=True)
225 | goal = T.as_tensor(batch.desired_goal, dtype=T.float32, device=self.device).transpose(2, 1)
226 | goal_features = self.network_dict['critic_1_target'].get_features(goal, detach=True)
227 | obs_ = T.as_tensor(batch.next_state, dtype=T.float32, device=self.device).transpose(2, 1)
228 | obs_features_ = self.network_dict['critic_1_target'].get_features(obs_, detach=True)
229 | actor_inputs_ = T.cat((obs_features_, goal_features), dim=1)
230 | actions = T.as_tensor(batch.action, dtype=T.float32, device=self.device)
231 | discrete_actions = actions[:, 0].type(T.long)
232 | discrete_actions_onehot = F.one_hot(discrete_actions, self.discrete_action_dim).float()
233 | actions = T.cat((discrete_actions_onehot, actions[:, 1:]), dim=1)
234 | rewards = T.as_tensor(np.array(batch.reward), dtype=T.float32, device=self.device).unsqueeze(1)
235 | done = T.as_tensor(np.array(batch.done), dtype=T.float32, device=self.device).unsqueeze(1)
236 |
237 | if self.discard_time_limit:
238 | done = done * 0 + 1
239 |
240 | with T.no_grad():
241 | if not self.planned_skills:
242 | discrete_actions_, discrete_actions_log_probs_, _ = self.network_dict['discrete_actor'].get_action(
243 | actor_inputs_)
244 | discrete_actions_onehot_ = F.one_hot(discrete_actions_.flatten(), self.discrete_action_dim).float()
245 | else:
246 | discrete_actions_planned_ = T.as_tensor(batch.next_skill, dtype=T.long, device=self.device)
247 | discrete_actions_planned_onehot_ = F.one_hot(discrete_actions_planned_, self.discrete_action_dim).float()
248 | discrete_actions_onehot_ = discrete_actions_planned_onehot_
249 | discrete_actions_log_probs_ = T.ones(size=(self.batch_size, 1), device=self.device, dtype=T.float32)
250 |
251 | actor_inputs_ = T.cat((actor_inputs_, discrete_actions_onehot_), dim=1)
252 | continuous_actions_, continuous_actions_log_probs_ = self.network_dict[
253 | 'continuous_actor'].get_action(actor_inputs_, probs=True)
254 | actions_ = T.cat((discrete_actions_onehot_, continuous_actions_), dim=1)
255 |
256 | value_1_ = self.network_dict['critic_1_target'](obs_, actions_, goal)
257 | value_2_ = self.network_dict['critic_2_target'](obs_, actions_, goal)
258 | value_ = T.min(value_1_, value_2_) - \
259 | (self.network_dict['alpha_discrete'] * discrete_actions_log_probs_) - \
260 | (self.network_dict['alpha_continuous'] * continuous_actions_log_probs_)
261 | value_target = rewards + done * self.gamma * value_
262 | # value_target = T.clamp(value_target, -self.clip_value, 0.0)
263 |
264 | self.critic_1_optimizer.zero_grad()
265 | value_estimate_1 = self.network_dict['critic_1'](obs, actions, goal)
266 | critic_loss_1 = F.mse_loss(value_estimate_1, value_target.detach(), reduction='none')
267 | (critic_loss_1 * weights).mean().backward()
268 | self.critic_1_optimizer.step()
269 |
270 | if self.prioritised:
271 | assert inds is not None
272 | self.buffer.update_priority(inds, np.abs(critic_loss_1.cpu().detach().numpy()))
273 |
274 | self.critic_2_optimizer.zero_grad()
275 | value_estimate_2 = self.network_dict['critic_2'](obs, actions, goal)
276 | critic_loss_2 = F.mse_loss(value_estimate_2, value_target.detach(), reduction='none')
277 | (critic_loss_2 * weights).mean().backward()
278 | self.critic_2_optimizer.step()
279 |
280 | avg_critic_1_loss += critic_loss_1.detach().mean()
281 | avg_critic_2_loss += critic_loss_2.detach().mean()
282 |
283 | if self.optim_step_count % self.critic_target_update_interval == 0:
284 | self._soft_update(self.network_dict['critic_1'], self.network_dict['critic_1_target'])
285 | self._soft_update(self.network_dict['critic_2'], self.network_dict['critic_2_target'])
286 |
287 | if self.optim_step_count % self.actor_update_interval == 0:
288 | self.discrete_actor_optimizer.zero_grad()
289 | self.continuous_actor_optimizer.zero_grad()
290 | actor_inputs = T.cat((obs_features, goal_features), dim=1)
291 | if not self.planned_skills:
292 | new_discrete_actions, new_discrete_action_log_probs, new_discrete_action_entropy = \
293 | self.network_dict['discrete_actor'].get_action(actor_inputs)
294 | new_discrete_actions_onehot = F.one_hot(new_discrete_actions.flatten(), self.discrete_action_dim).float()
295 | else:
296 | new_discrete_actions_onehot = discrete_actions_onehot
297 |
298 | new_continuous_actions, new_continuous_action_log_probs, new_continuous_action_entropy = \
299 | self.network_dict['continuous_actor'].get_action(
300 | T.cat((actor_inputs, new_discrete_actions_onehot), dim=1), probs=True, entropy=True)
301 | new_actions = T.cat((new_discrete_actions_onehot, new_continuous_actions), dim=1)
302 |
303 | new_values = T.min(self.network_dict['critic_1'](obs, new_actions, goal),
304 | self.network_dict['critic_2'](obs, new_actions, goal))
305 |
306 | if not self.planned_skills:
307 | discrete_actor_loss = (
308 | self.network_dict['alpha_discrete'] * new_discrete_action_log_probs - new_values).mean()
309 | discrete_actor_loss.backward(retain_graph=True)
310 | self.discrete_actor_optimizer.step()
311 |
312 | self.alpha_discrete_optimizer.zero_grad()
313 | discrete_alpha_loss = (self.network_dict['log_alpha_discrete'] * (
314 | -new_discrete_action_log_probs - self.target_discrete_entropy).detach()).mean()
315 | discrete_alpha_loss.backward()
316 | self.alpha_discrete_optimizer.step()
317 | self.network_dict['alpha_discrete'] = self.network_dict['log_alpha_discrete'].exp()
318 |
319 | avg_discrete_actor_loss += discrete_actor_loss.detach()
320 | avg_discrete_alpha += self.network_dict['alpha_discrete'].detach()
321 | avg_discrete_policy_entropy += new_discrete_action_entropy.detach().mean()
322 |
323 | continuous_actor_loss = (
324 | self.network_dict['alpha_continuous'] * new_continuous_action_log_probs - new_values).mean()
325 | continuous_actor_loss.backward()
326 | self.continuous_actor_optimizer.step()
327 |
328 | self.alpha_continuous_optimizer.zero_grad()
329 | continuous_alpha_loss = (self.network_dict['log_alpha_continuous'] * (
330 | -new_continuous_action_log_probs - self.target_continuous_entropy).detach()).mean()
331 | continuous_alpha_loss.backward()
332 | self.alpha_continuous_optimizer.step()
333 | self.network_dict['alpha_continuous'] = self.network_dict['log_alpha_continuous'].exp()
334 |
335 | avg_continuous_actor_loss += continuous_actor_loss.detach()
336 | avg_continuous_alpha += self.network_dict['alpha_continuous'].detach()
337 | avg_continuous_policy_entropy += new_continuous_action_entropy.detach().mean()
338 |
339 | self.optim_step_count += 1
340 |
341 | self.logger.add_scalar(tag='Critic/critic_1_loss', scalar_value=avg_critic_1_loss / steps, global_step=self.cur_ep)
342 | self.logger.add_scalar(tag='Critic/critic_2_loss', scalar_value=avg_critic_2_loss / steps, global_step=self.cur_ep)
343 | if not self.planned_skills:
344 | self.logger.add_scalar(tag='Actor/discrete_actor_loss', scalar_value=avg_discrete_actor_loss / steps, global_step=self.cur_ep)
345 | self.logger.add_scalar(tag='Actor/discrete_alpha', scalar_value=avg_discrete_alpha / steps, global_step=self.cur_ep)
346 | self.logger.add_scalar(tag='Actor/discrete_policy_entropy', scalar_value=avg_discrete_policy_entropy / steps,
347 | global_step=self.cur_ep)
348 | self.logger.add_scalar(tag='Actor/continuous_actor_loss', scalar_value=avg_continuous_actor_loss / steps, global_step=self.cur_ep)
349 | self.logger.add_scalar(tag='Actor/continuous_alpha', scalar_value=avg_continuous_alpha / steps, global_step=self.cur_ep)
350 | self.logger.add_scalar(tag='Actor/continuous_policy_entropy', scalar_value=avg_continuous_policy_entropy / steps,
351 | global_step=self.cur_ep)
352 |
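A note on the hybrid action handling in `_learn` above: GPASAC stores the discrete skill index in the first action column and converts it to a one-hot vector before concatenating it with the continuous parameters for the critics. The standalone sketch below shows only that conversion; the action layout and dimensions are made up for illustration.

```python
# Sketch of turning stored hybrid actions (skill index + continuous params)
# into the one-hot + continuous vector the critics consume.
import torch as T
import torch.nn.functional as F

discrete_action_dim = 4
batch_actions = T.tensor([[2.0, 0.1, -0.3],
                          [0.0, 0.7,  0.2]])   # [skill_index, param_1, param_2]

skill_idx = batch_actions[:, 0].long()                                # (B,)
skill_onehot = F.one_hot(skill_idx, discrete_action_dim).float()      # (B, 4)
critic_actions = T.cat((skill_onehot, batch_actions[:, 1:]), dim=1)   # (B, 4 + 2)
print(critic_actions.shape)   # torch.Size([2, 6])
```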
--------------------------------------------------------------------------------
/drl_implementation/agent/continuous_action/sac_pointnet.py:
--------------------------------------------------------------------------------
1 | import time
2 | import numpy as np
3 | import torch as T
4 | import torch.nn.functional as F
5 | from torch.optim.adam import Adam
6 | from ..utils.networks_mlp import StochasticActor
7 | from ..utils.networks_pointnet import CriticPointNet
8 | from ..agent_base import Agent
9 | from ..utils.exploration_strategy import GaussianNoise
10 | from collections import namedtuple
11 |
12 |
13 | class PointnetSAC(Agent):
14 | def __init__(self, algo_params, env, logging=None, transition_tuple=None, path=None, seed=-1):
15 | # environment
16 | self.env = env
17 | self.env.seed(seed)
18 | obs = self.env.reset()
19 | algo_params.update({'state_shape': obs['observation'].shape,
20 | 'goal_shape': obs['desired_goal'].shape,
21 | 'action_dim': self.env.action_space.shape[0],
22 | 'action_max': self.env.action_space.high,
23 | 'action_scaling': self.env.action_space.high[0],
24 | })
25 |         self.onestep = (self.env.horizon == 1)
26 | # training args
27 | self.cur_ep = 0
28 | self.warmup_step = algo_params['warmup_step']
29 | self.training_episodes = algo_params['training_episodes']
30 | self.testing_gap = algo_params['testing_gap']
31 | self.testing_episodes = algo_params['testing_episodes']
32 | self.saving_gap = algo_params['saving_gap']
33 | if transition_tuple is None:
34 | transition_tuple = namedtuple('transition',
35 | ['state', 'desired_goal', 'action', 'achieved_goal', 'reward'])
36 | super(PointnetSAC, self).__init__(algo_params, non_flat_obs=True,
37 | action_type='continuous',
38 | transition_tuple=transition_tuple,
39 | goal_conditioned=True,
40 | path=path,
41 | seed=seed,
42 | logging=logging,
43 | create_logger=True)
44 | # torch
45 | self.network_dict.update({
46 | 'actor': StochasticActor(2048, self.action_dim,
47 | fc1_size=1024, log_std_min=-6, log_std_max=1,
48 | action_scaling=self.action_scaling).to(self.device),
49 | 'critic_1': CriticPointNet(output_dim=1, action_dim=self.action_dim).to(self.device),
50 | 'critic_2': CriticPointNet(output_dim=1, action_dim=self.action_dim).to(self.device),
51 | 'critic_target': CriticPointNet(output_dim=1, action_dim=self.action_dim).to(self.device),
52 | 'alpha': algo_params['alpha'],
53 | 'log_alpha': T.tensor(np.log(algo_params['alpha']), requires_grad=True, device=self.device),
54 | })
55 | self.network_dict['critic_target'].eval()
56 | self._soft_update(self.network_dict['critic_1'], self.network_dict['critic_target'], tau=1)
57 | if not self.onestep:
58 | self.network_dict.update(
59 | {'critic_target_2': CriticPointNet(output_dim=1, action_dim=self.action_dim).to(self.device)})
60 | self.network_dict['critic_target_2'].eval()
61 | self._soft_update(self.network_dict['critic_2'], self.network_dict['critic_target_2'], tau=1)
62 |
63 | self.network_keys_to_save = ['actor', 'critic_1']
64 | self.actor_optimizer = Adam(self.network_dict['actor'].parameters(), lr=self.actor_learning_rate)
65 | self.critic_1_optimizer = Adam(self.network_dict['critic_1'].parameters(), lr=self.critic_learning_rate)
66 | self.critic_2_optimizer = Adam(self.network_dict['critic_2'].parameters(), lr=self.critic_learning_rate)
67 | self.target_entropy = -self.action_dim
68 | self.alpha_optimizer = Adam([self.network_dict['log_alpha']], lr=self.actor_learning_rate)
69 | # training args
70 | self.actor_update_interval = algo_params['actor_update_interval']
71 | self.use_demonstrations = algo_params['use_demonstrations']
72 | self.demonstrate_percentage = algo_params['demonstrate_percentage']
73 | assert 0 < self.demonstrate_percentage < 1, "Demonstrate percentage should be between 0 and 1"
74 | self.n_demonstrate_episodes = int(self.demonstrate_percentage * self.training_episodes)
75 | self.demonstration_action = np.asarray(algo_params['demonstration_action'], dtype=np.float32)
76 | self.gaussian_noise = GaussianNoise(action_dim=self.action_dim, action_max=self.action_max,
77 | sigma=0.1, rng=self.rng)
78 |
79 | def run(self, test=False, render=False, load_network_ep=None, sleep=0, get_action=False):
80 | if test:
81 | num_episode = self.testing_episodes
82 | if load_network_ep is not None:
83 | print("Loading network parameters...")
84 | self._load_network(ep=load_network_ep)
85 | print("Start testing...")
86 | if get_action:
87 | obs = self.env.reset()
88 | action = self._select_action(obs, test=True)
89 | return action
90 | else:
91 | num_episode = self.training_episodes
92 | print("Start training...")
93 | self.logging.info("Start training...")
94 |
95 | for ep in range(num_episode):
96 | self.cur_ep = ep
97 | loss_info = self._interact(render, test, sleep=sleep)
98 | print("Episode %i" % ep)
99 | self.logging.info("Episode %i" % ep)
100 | print("emd loss %0.1f" % loss_info['emd_loss'])
101 | self.logging.info("emd loss %0.1f" % loss_info['emd_loss'])
102 | self.logger.add_scalar(tag='Task/emd_loss', scalar_value=loss_info['emd_loss'], global_step=ep)
103 | try:
104 | print("heightmap loss %0.1f" % loss_info['height_map_loss'])
105 | self.logger.add_scalar(tag='Task/heightmap_loss', scalar_value=loss_info['height_map_loss'], global_step=ep)
106 | self.logging.info("heightmap loss %0.1f" % loss_info['height_map_loss'])
107 |             except KeyError:
108 | pass
109 | GPU_memory = self.get_gpu_memory()
110 | self.logger.add_scalar(tag='System/Free GPU memory', scalar_value=GPU_memory[0], global_step=ep)
111 | try:
112 | self.logger.add_scalar(tag='System/Used GPU memory', scalar_value=GPU_memory[1], global_step=ep)
113 |             except IndexError:
114 | pass
115 | if not test and self.hindsight:
116 | self.buffer.hindsight()
117 |
118 | if (ep % self.testing_gap == 0) and (ep != 0) and (not test):
119 | ep_test_emd_loss = []
120 | ep_test_heightmap_loss = []
121 | for test_ep in range(self.testing_episodes):
122 | loss_info = self._interact(render, test=True)
123 | self.cur_ep += 1
124 | ep_test_emd_loss.append(loss_info['emd_loss'])
125 | try:
126 | ep_test_heightmap_loss.append(loss_info['height_map_loss'])
127 |                     except KeyError:
128 | pass
129 | self.logger.add_scalar(tag='Task/test_emd_loss',
130 | scalar_value=(sum(ep_test_emd_loss) / self.testing_episodes), global_step=ep)
131 | print("Episode %i" % ep)
132 | print("test emd loss %0.1f" % (sum(ep_test_emd_loss) / self.testing_episodes))
133 | self.logging.info("Episode %i" % ep)
134 | self.logging.info("test emd loss %0.1f" % (sum(ep_test_emd_loss) / self.testing_episodes))
135 |
136 | if len(ep_test_heightmap_loss) > 0:
137 | self.logger.add_scalar(tag='Task/test_heightmap_loss',
138 | scalar_value=(sum(ep_test_heightmap_loss) / self.testing_episodes),
139 | global_step=ep)
140 | print("test heightmap loss %0.1f" % (sum(ep_test_heightmap_loss) / self.testing_episodes))
141 | self.logging.info("test heightmap loss %0.1f" % (sum(ep_test_heightmap_loss) / self.testing_episodes))
142 |
143 | if (ep % self.saving_gap == 0) and (ep != 0) and (not test):
144 | self._save_network(ep=ep)
145 |
146 | if not test:
147 | print("Finished training")
148 | self.logging.info("Finished training")
149 | else:
150 | print("Finished testing")
151 |
152 | def _interact(self, render=False, test=False, sleep=0):
153 | obs = self.env.reset()
154 | if render:
155 | self.env.render()
156 | if self.onestep:
157 | # An episode has only one step
158 | if self.use_demonstrations and (self.cur_ep < self.n_demonstrate_episodes):
159 | action = self.gaussian_noise(self.demonstration_action)
160 | else:
161 | action = self._select_action(obs, test=test)
162 | obs_, reward, _, info = self.env.step(action)
163 | time.sleep(sleep)
164 |
165 | if not test:
166 | self._remember(obs['observation'], obs['desired_goal'], action, obs_['achieved_goal'], reward,
167 | new_episode=True)
168 | if self.total_env_step_count % self.update_interval == 0:
169 | self._learn()
170 | self.total_env_step_count += 1
171 | else:
172 | n = 0
173 | done = False
174 | new_episode = True
175 | while not done:
176 | if self.use_demonstrations and (self.cur_ep < self.n_demonstrate_episodes):
177 | try:
178 | action, object_out_of_view, demon_info = self.env.get_cur_demonstration()
179 | except:
180 | action = self.gaussian_noise(self.demonstration_action[n])
181 | else:
182 | action = self._select_action(obs, test=test)
183 | obs_, reward, done, info = self.env.step(action)
184 | time.sleep(sleep)
185 |
186 | if not test:
187 | self._remember(obs['observation'], obs['desired_goal'], action, obs_['achieved_goal'], reward,
188 | new_episode=new_episode)
189 | if self.total_env_step_count % self.update_interval == 0:
190 | self._learn()
191 | self.total_env_step_count += 1
192 |
193 | new_episode = False
194 |
195 | return info
196 |
197 | def _select_action(self, obs, test=False):
198 | obs_points = T.as_tensor([obs['observation']], dtype=T.float).to(self.device)
199 | goal_points = T.as_tensor([obs['desired_goal']], dtype=T.float).to(self.device)
200 | obs_point_features = self.network_dict['critic_target'].get_features(obs_points.transpose(2, 1))
201 | goal_point_features = self.network_dict['critic_target'].get_features(goal_points.transpose(2, 1))
202 | inputs = T.cat((obs_point_features, goal_point_features), dim=1)
203 | action = self.network_dict['actor'].get_action(inputs, mean_pi=test).detach().cpu().numpy()
204 | return action[0]
205 |
206 | def _learn(self, steps=None):
207 | if len(self.buffer) < self.batch_size:
208 | return
209 | if steps is None:
210 | steps = self.optimizer_steps
211 |
212 | avg_critic_1_loss = T.zeros(1, device=self.device)
213 | avg_critic_2_loss = T.zeros(1, device=self.device)
214 | avg_actor_loss = T.zeros(1, device=self.device)
215 | avg_alpha = T.zeros(1, device=self.device)
216 | avg_policy_entropy = T.zeros(1, device=self.device)
217 | for i in range(steps):
218 | if self.prioritised:
219 | batch, weights, inds = self.buffer.sample(self.batch_size)
220 | weights = T.tensor(weights).view(self.batch_size, 1).to(self.device)
221 | else:
222 | batch = self.buffer.sample(self.batch_size)
223 | weights = T.ones(size=(self.batch_size, 1)).to(self.device)
224 | inds = None
225 |
226 | obs = T.as_tensor(batch.state, dtype=T.float32, device=self.device).transpose(2, 1)
227 | goal = T.as_tensor(batch.desired_goal, dtype=T.float32, device=self.device).transpose(2, 1)
228 | actions = T.as_tensor(batch.action, dtype=T.float32, device=self.device)
229 | rewards = T.as_tensor(batch.reward, dtype=T.float32, device=self.device).unsqueeze(1)
230 | if self.onestep:
231 | values_target = rewards
232 | else:
233 | with T.no_grad():
234 | obs_ = T.as_tensor(batch.next_state, dtype=T.float32, device=self.device).transpose(2, 1)
235 | obs_features_ = self.network_dict['critic_1'].get_features(obs_, detach=True)
236 | goal_features = self.network_dict['critic_1'].get_features(goal, detach=True)
237 | actor_inputs = T.cat((obs_features_, goal_features), dim=1)
238 | new_actions = self.network_dict['actor'].get_action(actor_inputs)
239 | values_1 = self.network_dict['critic_target'](obs_, new_actions, goal)
240 | values_2 = self.network_dict['critic_target_2'](obs_, new_actions, goal)
241 | values_target = rewards + self.gamma * T.min(values_1, values_2)
242 |
243 | self.critic_1_optimizer.zero_grad()
244 | value_estimate_1 = self.network_dict['critic_1'](obs, actions, goal)
245 | critic_loss_1 = F.mse_loss(value_estimate_1, values_target, reduction='none')
246 | (critic_loss_1 * weights).mean().backward()
247 | self.critic_1_optimizer.step()
248 |
249 | if self.prioritised:
250 | assert inds is not None
251 | self.buffer.update_priority(inds, np.abs(critic_loss_1.cpu().detach().numpy()))
252 |
253 | self.critic_2_optimizer.zero_grad()
254 | value_estimate_2 = self.network_dict['critic_2'](obs, actions, goal)
255 | critic_loss_2 = F.mse_loss(value_estimate_2, values_target, reduction='none')
256 | (critic_loss_2 * weights).mean().backward()
257 | self.critic_2_optimizer.step()
258 |
259 | avg_critic_1_loss += critic_loss_1.detach().mean()
260 | avg_critic_2_loss += critic_loss_2.detach().mean()
261 |
262 | if self.optim_step_count % self.actor_update_interval == 0:
263 | self.actor_optimizer.zero_grad()
264 | obs_features = self.network_dict['critic_1'].get_features(obs, detach=True)
265 | goal_features = self.network_dict['critic_1'].get_features(goal, detach=True)
266 | actor_inputs = T.cat((obs_features, goal_features), dim=1)
267 | new_actions, new_log_probs, new_entropy = self.network_dict['actor'].get_action(actor_inputs,
268 | probs=True,
269 | entropy=True)
270 | new_values = T.min(self.network_dict['critic_1'](obs, new_actions, goal),
271 | self.network_dict['critic_2'](obs, new_actions, goal))
272 | actor_loss = (self.network_dict['alpha'] * new_log_probs - new_values).mean()
273 | actor_loss.backward()
274 | self.actor_optimizer.step()
275 |
276 | self.alpha_optimizer.zero_grad()
277 | alpha_loss = (self.network_dict['log_alpha'] * (-new_log_probs - self.target_entropy).detach()).mean()
278 | alpha_loss.backward()
279 | self.alpha_optimizer.step()
280 | self.network_dict['alpha'] = self.network_dict['log_alpha'].exp()
281 |
282 | avg_actor_loss += actor_loss.detach().mean()
283 | avg_alpha += self.network_dict['alpha'].detach()
284 | avg_policy_entropy += new_entropy.detach().mean()
285 |
286 | self.optim_step_count += 1
287 |
288 | if self.onestep:
289 | self._soft_update(self.network_dict['critic_1'], self.network_dict['critic_target'], tau=1)
290 | else:
291 | self._soft_update(self.network_dict['critic_1'], self.network_dict['critic_target'], tau=self.tau)
292 | self._soft_update(self.network_dict['critic_2'], self.network_dict['critic_target_2'], tau=self.tau)
293 |
294 | self.logger.add_scalar(tag='Critic/critic_1_loss', scalar_value=avg_critic_1_loss / steps,
295 | global_step=self.cur_ep)
296 | self.logger.add_scalar(tag='Critic/critic_2_loss', scalar_value=avg_critic_2_loss / steps,
297 | global_step=self.cur_ep)
298 | self.logger.add_scalar(tag='Actor/actor_loss', scalar_value=avg_actor_loss / steps, global_step=self.cur_ep)
299 | self.logger.add_scalar(tag='Actor/alpha', scalar_value=avg_alpha / steps, global_step=self.cur_ep)
300 | self.logger.add_scalar(tag='Actor/policy_entropy', scalar_value=avg_policy_entropy / steps,
301 | global_step=self.cur_ep)
302 |
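A note on the `onestep` branch in `_learn` above: when the environment horizon is a single step, the critics regress directly onto the immediate reward and no bootstrapping is needed; for longer horizons the usual clipped-double-Q target is built from the two target critics. A compact sketch of the two target modes with placeholder tensors:

```python
# Sketch of PointnetSAC's two critic-target modes (placeholder tensors
# stand in for the PointNet target critics).
import torch as T

onestep = True
gamma = 0.99
rewards = T.randn(32, 1)
next_q1 = T.randn(32, 1)
next_q2 = T.randn(32, 1)

if onestep:
    # horizon == 1: the return is just the reward, nothing to bootstrap
    values_target = rewards
else:
    # clipped double-Q bootstrapped target
    values_target = rewards + gamma * T.min(next_q1, next_q2)
```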
--------------------------------------------------------------------------------
/drl_implementation/agent/continuous_action/td3.py:
--------------------------------------------------------------------------------
1 | import time
2 | import numpy as np
3 | import torch as T
4 | import torch.nn.functional as F
5 | from torch.optim.adam import Adam
6 | from ..utils.networks_mlp import Actor, Critic
7 | from ..agent_base import Agent
8 | from ..utils.exploration_strategy import GaussianNoise
9 |
10 |
11 | class TD3(Agent):
12 | def __init__(self, algo_params, env, transition_tuple=None, path=None, seed=-1):
13 | # environment
14 | self.env = env
15 | self.env.seed(seed)
16 | obs = self.env.reset()
17 | algo_params.update({'state_dim': obs.shape[0],
18 | 'action_dim': self.env.action_space.shape[0],
19 | 'action_max': self.env.action_space.high,
20 | 'action_scaling': self.env.action_space.high[0],
21 | 'init_input_means': None,
22 | 'init_input_vars': None
23 | })
24 | # training args
25 | self.training_episodes = algo_params['training_episodes']
26 | self.testing_gap = algo_params['testing_gap']
27 | self.testing_episodes = algo_params['testing_episodes']
28 | self.saving_gap = algo_params['saving_gap']
29 |
30 | super(TD3, self).__init__(algo_params,
31 | transition_tuple=transition_tuple,
32 | goal_conditioned=False,
33 | path=path,
34 | seed=seed)
35 | # torch
36 | self.network_dict.update({
37 | 'actor': Actor(self.state_dim, self.action_dim, action_scaling=self.action_scaling).to(self.device),
38 | 'actor_target': Actor(self.state_dim, self.action_dim, action_scaling=self.action_scaling).to(self.device),
39 | 'critic_1': Critic(self.state_dim + self.action_dim, 1).to(self.device),
40 | 'critic_1_target': Critic(self.state_dim + self.action_dim, 1).to(self.device),
41 | 'critic_2': Critic(self.state_dim + self.action_dim, 1).to(self.device),
42 | 'critic_2_target': Critic(self.state_dim + self.action_dim, 1).to(self.device)
43 | })
44 | self.network_keys_to_save = ['actor_target', 'critic_1_target']
45 | self.actor_optimizer = Adam(self.network_dict['actor'].parameters(), lr=self.actor_learning_rate)
46 | self._soft_update(self.network_dict['actor'], self.network_dict['actor_target'], tau=1)
47 | self.critic_1_optimizer = Adam(self.network_dict['critic_1'].parameters(), lr=self.critic_learning_rate)
48 | self._soft_update(self.network_dict['critic_1'], self.network_dict['critic_1_target'], tau=1)
49 | self.critic_2_optimizer = Adam(self.network_dict['critic_2'].parameters(), lr=self.critic_learning_rate)
50 | self._soft_update(self.network_dict['critic_2'], self.network_dict['critic_2_target'], tau=1)
51 | # behavioural policy args (exploration)
52 | self.exploration_strategy = GaussianNoise(self.action_dim, self.action_max, mu=0, sigma=0.1)
53 | # training args
54 | self.target_noise = algo_params['target_noise']
55 | self.noise_clip = algo_params['noise_clip']
56 | self.warmup_step = algo_params['warmup_step']
57 | self.actor_update_interval = algo_params['actor_update_interval']
58 | # statistic dict
59 | self.statistic_dict.update({
60 | 'episode_return': [],
61 | 'episode_test_return': []
62 | })
63 |
64 | def run(self, test=False, render=False, load_network_ep=None, sleep=0):
65 | if test:
66 | num_episode = self.testing_episodes
67 | if load_network_ep is not None:
68 | print("Loading network parameters...")
69 | self._load_network(ep=load_network_ep)
70 | print("Start testing...")
71 | else:
72 | num_episode = self.training_episodes
73 | print("Start training...")
74 |
75 | for ep in range(num_episode):
76 | ep_return = self._interact(render, test, sleep=sleep)
77 | self.statistic_dict['episode_return'].append(ep_return)
78 | print("Episode %i" % ep, "return %0.1f" % ep_return)
79 |
80 | if (ep % self.testing_gap == 0) and (ep != 0) and (not test):
81 | ep_test_return = []
82 | for test_ep in range(self.testing_episodes):
83 | ep_test_return.append(self._interact(render, test=True))
84 | self.statistic_dict['episode_test_return'].append(sum(ep_test_return)/self.testing_episodes)
85 | print("Episode %i" % ep, "test return %0.1f" % (sum(ep_test_return)/self.testing_episodes))
86 |
87 | if (ep % self.saving_gap == 0) and (ep != 0) and (not test):
88 | self._save_network(ep=ep)
89 |
90 | if not test:
91 | print("Finished training")
92 | print("Saving statistics...")
93 | self._plot_statistics(save_to_file=True)
94 | else:
95 | print("Finished testing")
96 |
97 | def _interact(self, render=False, test=False, sleep=0):
98 | done = False
99 | obs = self.env.reset()
100 | ep_return = 0
101 | # start a new episode
102 | while not done:
103 | if render:
104 | self.env.render()
105 | if self.env_step_count < self.warmup_step:
106 | action = self.env.action_space.sample()
107 | else:
108 | action = self._select_action(obs, test=test)
109 | new_obs, reward, done, info = self.env.step(action)
110 | time.sleep(sleep)
111 | ep_return += reward
112 | if not test:
113 | self._remember(obs, action, new_obs, reward, 1 - int(done))
114 | if self.observation_normalization:
115 | self.normalizer.store_history(new_obs)
116 | self.normalizer.update_mean()
117 | if (self.env_step_count % self.update_interval == 0) and (self.env_step_count > self.warmup_step):
118 | self._learn()
119 | obs = new_obs
120 | self.env_step_count += 1
121 | return ep_return
122 |
123 | def _select_action(self, obs, test=False):
124 | obs = self.normalizer(obs)
125 | with T.no_grad():
126 | inputs = T.as_tensor(obs, dtype=T.float, device=self.device)
127 | action = self.network_dict['actor_target'](inputs).detach().cpu().numpy()
128 | if test:
129 | # evaluate
130 | return np.clip(action, -self.action_max, self.action_max)
131 | else:
132 | # explore
133 | return self.exploration_strategy(action)
134 |
135 | def _learn(self, steps=None):
136 | if len(self.buffer) < self.batch_size:
137 | return
138 | if steps is None:
139 | steps = self.optimizer_steps
140 |
141 | for i in range(steps):
142 | if self.prioritised:
143 | batch, weights, inds = self.buffer.sample(self.batch_size)
144 | weights = T.as_tensor(weights, device=self.device).view(self.batch_size, 1)
145 | else:
146 | batch = self.buffer.sample(self.batch_size)
147 | weights = T.ones(size=(self.batch_size, 1), device=self.device)
148 | inds = None
149 |
150 | actor_inputs = self.normalizer(batch.state)
151 | actor_inputs = T.as_tensor(actor_inputs, dtype=T.float32, device=self.device)
152 | actions = T.as_tensor(batch.action, dtype=T.float32, device=self.device)
153 | critic_inputs = T.cat((actor_inputs, actions), dim=1)
154 | actor_inputs_ = self.normalizer(batch.next_state)
155 | actor_inputs_ = T.as_tensor(actor_inputs_, dtype=T.float32, device=self.device)
156 | rewards = T.as_tensor(batch.reward, dtype=T.float32, device=self.device).unsqueeze(1)
157 | done = T.as_tensor(batch.done, dtype=T.float32, device=self.device).unsqueeze(1)
158 |
159 | if self.discard_time_limit:
160 | done = done * 0 + 1
161 |
162 | with T.no_grad():
163 | actions_ = self.network_dict['actor_target'](actor_inputs_)
164 |                 # target policy smoothing: perturb target actions with clipped Gaussian noise
165 | noise = (T.randn_like(actions_, device=self.device) * self.target_noise)
166 | actions_ += noise.clamp(-self.noise_clip, self.noise_clip)
167 | actions_ = actions_.clamp(-self.action_max[0], self.action_max[0])
168 | critic_inputs_ = T.cat((actor_inputs_, actions_), dim=1)
169 | value_1_ = self.network_dict['critic_1_target'](critic_inputs_)
170 | value_2_ = self.network_dict['critic_2_target'](critic_inputs_)
171 | value_ = T.min(value_1_, value_2_)
172 | value_target = rewards + done * self.gamma * value_
173 |
174 | self.critic_1_optimizer.zero_grad()
175 | value_estimate_1 = self.network_dict['critic_1'](critic_inputs)
176 | critic_loss_1 = F.mse_loss(value_estimate_1, value_target.detach(), reduction='none')
177 | (critic_loss_1 * weights).mean().backward()
178 | self.critic_1_optimizer.step()
179 |
180 | if self.prioritised:
181 | assert inds is not None
182 | self.buffer.update_priority(inds, np.abs(critic_loss_1.cpu().detach().numpy()))
183 |
184 | self.critic_2_optimizer.zero_grad()
185 | value_estimate_2 = self.network_dict['critic_2'](critic_inputs)
186 | critic_loss_2 = F.mse_loss(value_estimate_2, value_target.detach(), reduction='none')
187 | (critic_loss_2 * weights).mean().backward()
188 | self.critic_2_optimizer.step()
189 |
190 | self.statistic_dict['critic_loss'].append(critic_loss_1.detach().mean())
191 |
192 | if self.optim_step_count % self.actor_update_interval == 0:
193 | self.actor_optimizer.zero_grad()
194 | new_actions = self.network_dict['actor'](actor_inputs)
195 | critic_eval_inputs = T.cat((actor_inputs, new_actions), dim=1)
196 | actor_loss = -self.network_dict['critic_1'](critic_eval_inputs).mean()
197 | actor_loss.backward()
198 | self.actor_optimizer.step()
199 |
200 | self._soft_update(self.network_dict['actor'], self.network_dict['actor_target'])
201 | self._soft_update(self.network_dict['critic_1'], self.network_dict['critic_1_target'])
202 | self._soft_update(self.network_dict['critic_2'], self.network_dict['critic_2_target'])
203 |
204 | self.statistic_dict['actor_loss'].append(actor_loss.detach().mean())
205 |
206 | self.optim_step_count += 1
207 |
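
A minimal sketch of the target-value computation used in _learn() above, assuming the torch-as-T import at the top of this file: target policy smoothing plus the clipped double-Q minimum, with `not_done` standing for the continuation mask stored in the buffer and illustrative default noise parameters.

def _td3_target(rewards, not_done, gamma, next_obs, actor_target,
                critic_1_target, critic_2_target, action_max,
                target_noise=0.2, noise_clip=0.5):
    with T.no_grad():
        # target policy smoothing: perturb the target actor's action with clipped Gaussian noise
        next_actions = actor_target(next_obs)
        noise = (T.randn_like(next_actions) * target_noise).clamp(-noise_clip, noise_clip)
        next_actions = (next_actions + noise).clamp(-action_max, action_max)
        critic_inputs = T.cat((next_obs, next_actions), dim=1)
        # clipped double-Q: bootstrap from the smaller of the two target critic estimates
        next_values = T.min(critic_1_target(critic_inputs), critic_2_target(critic_inputs))
        return rewards + not_done * gamma * next_values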
--------------------------------------------------------------------------------
/drl_implementation/agent/distributed_agent_base.py:
--------------------------------------------------------------------------------
1 | import os
2 | import time
3 | import torch as T
4 | import numpy as np
5 | import json
6 | import queue
7 | import importlib
8 | import multiprocessing as mp
9 | from collections import namedtuple
10 | from .utils.plot import smoothed_plot
11 | from .utils.replay_buffer import ReplayBuffer, PrioritisedReplayBuffer
12 | from .utils.normalizer import Normalizer
13 | # T.multiprocessing.set_start_method('spawn')
14 | t = namedtuple("transition", ('state', 'action', 'next_state', 'reward', 'done'))
15 |
16 |
17 | def mkdir(paths):
18 | for path in paths:
19 | os.makedirs(path, exist_ok=True)
20 |
21 |
22 | class Agent(object):
23 | def __init__(self, algo_params, image_obs=False, action_type='continuous', path=None, seed=-1):
24 | # torch device
25 | self.device = T.device("cuda" if T.cuda.is_available() else "cpu")
26 | if 'cuda_device_id' in algo_params.keys():
27 | self.device = T.device("cuda:%i" % algo_params['cuda_device_id'])
28 | # path & seeding
29 | T.manual_seed(seed)
30 | T.cuda.manual_seed_all(seed) # this has no effect if cuda is not available
31 |
32 | assert path is not None, 'please specify a project path to save files'
33 | self.path = path
34 | # path to save neural network check point
35 | self.ckpt_path = os.path.join(path, 'ckpts')
36 | # path to save statistics
37 | self.data_path = os.path.join(path, 'data')
38 | # create directories if not exist
39 | mkdir([self.path, self.ckpt_path, self.data_path])
40 |
41 | # non-goal-conditioned args
42 | self.image_obs = image_obs
43 | self.action_type = action_type
44 | if self.image_obs:
45 | self.state_dim = 0
46 | self.state_shape = algo_params['state_shape']
47 | else:
48 | self.state_dim = algo_params['state_dim']
49 | self.action_dim = algo_params['action_dim']
50 | if self.action_type == 'continuous':
51 | self.action_max = algo_params['action_max']
52 | self.action_scaling = algo_params['action_scaling']
53 |
54 | # common args
55 | if not self.image_obs:
56 | # todo: observation in distributed training should be synced as well
57 | self.observation_normalization = algo_params['observation_normalization']
58 |             # if observation normalization is disabled, the normalizer only rescales its inputs (returns inputs * scale_factor)
59 | self.normalizer = Normalizer(self.state_dim,
60 | algo_params['init_input_means'], algo_params['init_input_vars'],
61 | activated=self.observation_normalization)
62 |
63 | self.gamma = algo_params['discount_factor']
64 | self.tau = algo_params['tau']
65 |
66 | # network dict is filled in each specific agent
67 | self.network_dict = {}
68 | self.network_keys_to_save = None
69 |
70 | # algorithm-specific statistics are defined in each agent sub-class
71 | self.statistic_dict = {
72 | # use lowercase characters
73 | 'actor_loss': [],
74 | 'critic_loss': [],
75 | }
76 |
77 | def _soft_update(self, source, target, tau=None, from_params=False):
78 | if tau is None:
79 | tau = self.tau
80 |
81 | if not from_params:
82 | for target_param, param in zip(target.parameters(), source.parameters()):
83 | target_param.data.copy_(
84 | target_param.data * (1.0 - tau) + param.data * tau
85 | )
86 | else:
87 | for target_param, param in zip(target.parameters(), source):
88 | target_param.data.copy_(
89 | target_param.data * (1.0 - tau) + T.tensor(param).float().to(self.device) * tau
90 | )
91 |
92 | def _save_network(self, keys=None, ep=None):
93 | if ep is None:
94 | ep = ''
95 | else:
96 | ep = '_ep' + str(ep)
97 | if keys is None:
98 | keys = self.network_keys_to_save
99 | assert keys is not None
100 | for key in keys:
101 | T.save(self.network_dict[key].state_dict(), self.ckpt_path + '/ckpt_' + key + ep + '.pt')
102 |
103 | def _load_network(self, keys=None, ep=None):
104 | if not self.image_obs:
105 | self.normalizer.history_mean = np.load(os.path.join(self.data_path, 'input_means.npy'))
106 | self.normalizer.history_var = np.load(os.path.join(self.data_path, 'input_vars.npy'))
107 | if ep is None:
108 | ep = ''
109 | else:
110 | ep = '_ep' + str(ep)
111 | if keys is None:
112 | keys = self.network_keys_to_save
113 | assert keys is not None
114 | for key in keys:
115 | self.network_dict[key].load_state_dict(T.load(self.ckpt_path + '/ckpt_' + key + ep + '.pt'))
116 |
117 | def _save_statistics(self, keys=None):
118 | if not self.image_obs:
119 | np.save(os.path.join(self.data_path, 'input_means'), self.normalizer.history_mean)
120 | np.save(os.path.join(self.data_path, 'input_vars'), self.normalizer.history_var)
121 | if keys is None:
122 | keys = self.statistic_dict.keys()
123 | for key in keys:
124 | if len(self.statistic_dict[key]) == 0:
125 | continue
126 |             # convert everything to a list before saving via json
127 | if T.is_tensor(self.statistic_dict[key][0]):
128 | self.statistic_dict[key] = T.as_tensor(self.statistic_dict[key], device=self.device).cpu().numpy().tolist()
129 | else:
130 | self.statistic_dict[key] = np.array(self.statistic_dict[key]).tolist()
131 | json.dump(self.statistic_dict[key], open(os.path.join(self.data_path, key+'.json'), 'w'))
132 |
133 | def _plot_statistics(self, keys=None, x_labels=None, y_labels=None, window=5, save_to_file=True):
134 | if save_to_file:
135 | self._save_statistics(keys=keys)
136 | if y_labels is None:
137 | y_labels = {}
138 | for key in list(self.statistic_dict.keys()):
139 | if key not in y_labels.keys():
140 | if 'loss' in key:
141 | label = 'Loss'
142 | elif 'return' in key:
143 | label = 'Return'
144 | elif 'success' in key:
145 | label = 'Success'
146 | else:
147 | label = key
148 | y_labels.update({key: label})
149 |
150 | if x_labels is None:
151 | x_labels = {}
152 | for key in list(self.statistic_dict.keys()):
153 | if key not in x_labels.keys():
154 | if ('loss' in key) or ('alpha' in key) or ('entropy' in key) or ('step' in key):
155 | label = 'Optimization step'
156 | elif 'cycle' in key:
157 | label = 'Cycle'
158 | elif 'epoch' in key:
159 | label = 'Epoch'
160 | else:
161 | label = 'Episode'
162 | x_labels.update({key: label})
163 |
164 | if keys is None:
165 | for key in list(self.statistic_dict.keys()):
166 | smoothed_plot(os.path.join(self.path, key + '.png'), self.statistic_dict[key],
167 | x_label=x_labels[key], y_label=y_labels[key], window=window)
168 | else:
169 | for key in keys:
170 | smoothed_plot(os.path.join(self.path, key + '.png'), self.statistic_dict[key],
171 | x_label=x_labels[key], y_label=y_labels[key], window=window)
172 |
173 |
174 | class Worker(Agent):
175 | def __init__(self, algo_params, queues, path=None, seed=0, i=0):
176 | self.queues = queues
177 | self.worker_id = i
178 | self.worker_update_gap = algo_params['worker_update_gap'] # in episodes
179 | self.env_step_count = 0
180 | super(Worker, self).__init__(algo_params, path=path, seed=seed)
181 |
182 | def run(self, render=False, test=False, load_network_ep=None, sleep=0):
183 | raise NotImplementedError
184 |
185 | def _interact(self, render=False, test=False, sleep=0):
186 | raise NotImplementedError
187 |
188 | def _select_action(self, obs, test=False):
189 | raise NotImplementedError
190 |
191 | def _remember(self, batch):
192 | try:
193 | self.queues['replay_queue'].put_nowait(batch)
194 | except queue.Full:
195 | pass
196 |
197 | def _download_actor_networks(self, keys, tau=1.0):
198 | try:
199 | source = self.queues['network_queue'].get_nowait()
200 | except queue.Empty:
201 | return False
202 | print("Worker No. %i downloading network" % self.worker_id)
203 | for key in keys:
204 | self._soft_update(source[key], self.network_dict[key], tau=tau, from_params=True)
205 | return True
206 |
207 |
208 | class Learner(Agent):
209 | def __init__(self, algo_params, queues, path=None, seed=0):
210 | self.queues = queues
211 | self.num_workers = algo_params['num_workers']
212 | self.learner_steps = algo_params['learner_steps']
213 | self.learner_upload_gap = algo_params['learner_upload_gap'] # in optimization steps
214 | self.actor_learning_rate = algo_params['actor_learning_rate']
215 | self.critic_learning_rate = algo_params['critic_learning_rate']
216 | self.discard_time_limit = algo_params['discard_time_limit']
217 | self.batch_size = algo_params['batch_size']
218 | self.prioritised = algo_params['prioritised']
219 | self.optimizer_steps = algo_params['optimization_steps']
220 | self.optim_step_count = 0
221 | super(Learner, self).__init__(algo_params, path=path, seed=seed)
222 |
223 | def run(self):
224 | raise NotImplementedError
225 |
226 | def _learn(self, steps=None):
227 | raise NotImplementedError
228 |
229 | def _upload_learner_networks(self, keys):
230 | print("Learner uploading network")
231 | params = dict.fromkeys(keys)
232 | for key in keys:
233 | params[key] = [p.data.cpu().detach().numpy() for p in self.network_dict[key].parameters()]
234 | # delete an old net and upload a new one
235 | try:
236 | data = self.queues['network_queue'].get_nowait()
237 | del data
238 | except queue.Empty:
239 | pass
240 | try:
241 | self.queues['network_queue'].put(params)
242 | except queue.Full:
243 | pass
244 |
245 |
246 | class CentralProcessor(object):
247 | def __init__(self, algo_params, env_name, env_source, learner, worker, transition_tuple=None, path=None,
248 | worker_seeds=None, seed=0):
249 | self.algo_params = algo_params.copy()
250 | self.env_name = env_name
251 | assert env_source in ['gym', 'pybullet_envs', 'pybullet_multigoal_gym'], \
252 | "unsupported env source: {}, " \
253 | "only 3 env sources are supported: {}, " \
254 | "for new env sources please modify the original code".format(env_source,
255 | ['gym', 'pybullet_envs',
256 | 'pybullet_multigoal_gym'])
257 | self.env_source = importlib.import_module(env_source)
258 | self.learner = learner
259 | self.worker = worker
260 | self.batch_size = algo_params['batch_size']
261 | self.num_workers = algo_params['num_workers']
262 | self.learner_steps = algo_params['learner_steps']
263 | if worker_seeds is None:
264 | worker_seeds = np.random.randint(10, 1000, size=self.num_workers).tolist()
265 | else:
266 | assert len(worker_seeds) == self.num_workers, 'should assign seeds to every worker'
267 | self.worker_seeds = worker_seeds
268 | assert path is not None, 'please specify a project path to save files'
269 | self.path = path
270 | # create a random number generator and seed it
271 |         self.rng = np.random.default_rng(seed=seed)
272 |
273 | # multiprocessing queues
274 | self.queues = {
275 | 'replay_queue': mp.Queue(maxsize=algo_params['replay_queue_size']),
276 | 'batch_queue': mp.Queue(maxsize=algo_params['batch_queue_size']),
277 | 'network_queue': T.multiprocessing.Queue(maxsize=self.num_workers),
278 | 'learner_step_count': mp.Value('i', 0),
279 | 'global_episode_count': mp.Value('i', 0),
280 | }
281 |
282 | # setup replay buffer
283 | # prioritised replay
284 | self.prioritised = algo_params['prioritised']
285 | self.store_with_given_priority = algo_params['store_with_given_priority']
286 | # non-goal-conditioned replay buffer
287 | tr = transition_tuple
288 | if transition_tuple is None:
289 | tr = t
290 | if not self.prioritised:
291 | self.buffer = ReplayBuffer(algo_params['memory_capacity'], tr, seed=seed)
292 | else:
293 | self.queues.update({
294 | 'priority_queue': mp.Queue(maxsize=algo_params['priority_queue_size'])
295 | })
296 | self.buffer = PrioritisedReplayBuffer(algo_params['memory_capacity'], tr, rng=self.rng)
297 |
298 | def run(self):
299 | def worker_process(i, seed):
300 | env = self.env_source.make(self.env_name)
301 | path = os.path.join(self.path, "worker_%i" % i)
302 | worker = self.worker(self.algo_params, env, self.queues, path=path, seed=seed, i=i)
303 | worker.run()
304 | self.empty_queue('replay_queue')
305 |
306 | def learner_process():
307 | env = self.env_source.make(self.env_name)
308 | path = os.path.join(self.path, "learner")
309 | learner = self.learner(self.algo_params, env, self.queues, path=path, seed=0)
310 | learner.run()
311 | if self.prioritised:
312 | self.empty_queue('priority_queue')
313 | self.empty_queue('network_queue')
314 |
315 | def update_buffer():
316 | while self.queues['learner_step_count'].value < self.learner_steps:
317 | num_transitions_in_queue = self.queues['replay_queue'].qsize()
318 | for n in range(num_transitions_in_queue):
319 | data = self.queues['replay_queue'].get()
320 | if self.prioritised:
321 | if self.store_with_given_priority:
322 | self.buffer.store_experience_with_given_priority(data['priority'], *data['transition'])
323 | else:
324 | self.buffer.store_experience(*data)
325 | else:
326 | self.buffer.store_experience(*data)
327 | if self.batch_size > len(self.buffer):
328 | continue
329 |
330 | if self.prioritised:
331 | try:
332 | inds, priorities = self.queues['priority_queue'].get_nowait()
333 | self.buffer.update_priority(inds, priorities)
334 | except queue.Empty:
335 | pass
336 | try:
337 | batch, weights, inds = self.buffer.sample(batch_size=self.batch_size)
338 | state, action, next_state, reward, done = batch
339 | self.queues['batch_queue'].put_nowait((state, action, next_state, reward, done, weights, inds))
340 | except queue.Full:
341 | continue
342 | else:
343 | try:
344 | batch = self.buffer.sample(batch_size=self.batch_size)
345 | state, action, next_state, reward, done = batch
346 | self.queues['batch_queue'].put_nowait((state, action, next_state, reward, done))
347 | except queue.Full:
348 | time.sleep(0.1)
349 | continue
350 |
351 | self.empty_queue('batch_queue')
352 |
353 | processes = []
354 | p = T.multiprocessing.Process(target=update_buffer)
355 | processes.append(p)
356 | p = T.multiprocessing.Process(target=learner_process)
357 | processes.append(p)
358 | for i in range(self.num_workers):
359 | p = T.multiprocessing.Process(target=worker_process,
360 | args=(i, self.worker_seeds[i]))
361 | processes.append(p)
362 |
363 | for p in processes:
364 | p.start()
365 | for p in processes:
366 | p.join()
367 |
368 | def empty_queue(self, queue_name):
369 | while True:
370 | try:
371 | data = self.queues[queue_name].get_nowait()
372 | del data
373 | except queue.Empty:
374 | break
375 | self.queues[queue_name].close()
376 |
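
A minimal usage sketch of the central processor above, assuming a concrete Learner/Worker pair and an `algo_params` dict holding the keys read in these constructors; the environment id and save path are illustrative.

def launch_distributed_training(algo_params, learner_cls, worker_cls):
    # learner_cls/worker_cls are concrete subclasses of Learner and Worker that implement
    # run(), _interact(), _select_action() and _learn() for a specific algorithm
    processor = CentralProcessor(algo_params,
                                 env_name='Pendulum-v0',    # illustrative Gym env id
                                 env_source='gym',          # one of the three supported sources
                                 learner=learner_cls,
                                 worker=worker_cls,
                                 path='./exp_distributed',
                                 seed=0)
    processor.run()   # spawns the buffer, learner and worker processes and joins them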
--------------------------------------------------------------------------------
/drl_implementation/agent/utils/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/IanYangChina/DRL_Implementation/38812c9647e4bec8359908be444dff19b90257d5/drl_implementation/agent/utils/__init__.py
--------------------------------------------------------------------------------
/drl_implementation/agent/utils/env_wrapper.py:
--------------------------------------------------------------------------------
1 | import gym
2 | import numpy as np
3 | from skimage.transform import resize
4 | from collections import deque
5 |
6 |
7 | class FrameStack(gym.Wrapper):
8 | def __init__(self, env, k):
9 | gym.Wrapper.__init__(self, env)
10 | self._k = k
11 | self._frames = deque([], maxlen=k)
12 | shp = env.observation_space.shape
13 | self.observation_space = gym.spaces.Box(
14 | low=0,
15 | high=1,
16 | shape=((shp[0] * k,) + shp[1:]),
17 | dtype=env.observation_space.dtype)
18 | self._max_episode_steps = env._max_episode_steps
19 |
20 | def reset(self):
21 | obs = self.env.reset()
22 | for _ in range(self._k):
23 | self._frames.append(obs)
24 | return self._get_obs()
25 |
26 | def step(self, action):
27 | obs, reward, done, info = self.env.step(action)
28 | self._frames.append(obs)
29 | return self._get_obs(), reward, done, info
30 |
31 | def _get_obs(self):
32 | assert len(self._frames) == self._k
33 | return np.concatenate(list(self._frames), axis=0)
34 |
35 |
36 | class PixelPybulletGym(gym.Wrapper):
37 | def __init__(self, env, image_size, crop_size, channel_first=True):
38 | gym.Wrapper.__init__(self, env)
39 | self.image_size = image_size
40 | self.crop_size = crop_size
41 | self.channel_first = channel_first
42 | self.vertical_boundary = int((env.env._render_height - self.crop_size) / 2)
43 | self.horizontal_boundary = int((env.env._render_width - self.crop_size) / 2)
44 | self._max_episode_steps = env._max_episode_steps
45 |
46 | def reset(self):
47 | self.env.reset()
48 | return self._get_obs()
49 |
50 | def step(self, action):
51 | _, reward, done, info = self.env.step(action)
52 | return self._get_obs(), reward, done, info
53 |
54 | def _get_obs(self):
55 | # H, W, C
56 | obs = self.render(mode="rgb_array")
57 | obs = obs[self.vertical_boundary:-self.vertical_boundary, self.horizontal_boundary:-self.horizontal_boundary, :]
58 | obs = resize(obs, (self.image_size, self.image_size))
59 | if self.channel_first:
60 | obs = obs.transpose((-1, 0, 1))
61 | return obs
62 |
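
A minimal usage sketch of the two wrappers above, assuming a pybullet-gym style environment that exposes env._render_height/_render_width, _max_episode_steps and an rgb_array render mode; the crop and image sizes are illustrative.

def wrap_pixel_env(env, image_size=84, crop_size=240, k=3):
    # render-based observations: centre-crop, resize, channel-first, then stack k frames
    env = PixelPybulletGym(env, image_size=image_size, crop_size=crop_size, channel_first=True)
    env = FrameStack(env, k=k)
    return env   # reset()/step() now return arrays of shape (3 * k, image_size, image_size)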
--------------------------------------------------------------------------------
/drl_implementation/agent/utils/exploration_strategy.py:
--------------------------------------------------------------------------------
1 | import math as M
2 | import numpy as np
3 |
4 |
5 | class ExpDecayGreedy(object):
6 | # e-greedy exploration with exponential decay
7 | def __init__(self, start=1, end=0.05, decay=50000, decay_start=None, rng=None):
8 | self.start = start
9 | self.end = end
10 | self.decay = decay
11 | self.decay_start = decay_start
12 | if rng is None:
13 | self.rng = np.random.default_rng(seed=0)
14 | else:
15 | self.rng = rng
16 |
17 | def __call__(self, count):
18 | if self.decay_start is not None:
19 | count -= self.decay_start
20 | if count < 0:
21 | count = 0
22 | epsilon = self.end + (self.start - self.end) * M.exp(-1. * count / self.decay)
23 | prob = self.rng.uniform(0, 1)
24 | if prob < epsilon:
25 | return True
26 | else:
27 | return False
28 |
29 |
30 | class LinearDecayGreedy(object):
31 | # e-greedy exploration with linear decay
32 | def __init__(self, start=1.0, end=0.1, decay=1000000, decay_start=None, rng=None):
33 | self.start = start
34 | self.end = end
35 | self.decay = decay
36 | self.decay_start = decay_start
37 | if rng is None:
38 | self.rng = np.random.default_rng(seed=0)
39 | else:
40 | self.rng = rng
41 |
42 | def __call__(self, count):
43 | if self.decay_start is not None:
44 | count -= self.decay_start
45 | if count < 0:
46 | count = 0
47 | if count > self.decay:
48 | count = self.decay
49 | epsilon = self.start - count * (self.start - self.end) / self.decay
50 | prob = self.rng.uniform(0, 1)
51 | if prob < epsilon:
52 | return True
53 | else:
54 | return False
55 |
56 |
57 | class OUNoise(object):
58 | # https://github.com/rll/rllab/blob/master/rllab/exploration_strategies/ou_strategy.py
59 | def __init__(self, action_dim, action_max, mu=0, theta=0.2, sigma=1.0, rng=None):
60 | if rng is None:
61 | self.rng = np.random.default_rng(seed=0)
62 | else:
63 | self.rng = rng
64 | self.action_dim = action_dim
65 | self.action_max = action_max
66 | self.mu = mu
67 | self.theta = theta
68 | self.sigma = sigma
69 | self.state = np.ones(self.action_dim) * self.mu
70 | self.reset()
71 |
72 | def reset(self):
73 | self.state = np.ones(self.action_dim) * self.mu
74 |
75 | def __call__(self, action):
76 | x = self.state
77 | dx = self.theta * (self.mu - x) + self.sigma * self.rng.standard_normal(len(x))
78 | self.state = x + dx
79 | return np.clip(action + self.state, -self.action_max, self.action_max)
80 |
81 |
82 | class GaussianNoise(object):
83 | # the one used in the TD3 paper: http://proceedings.mlr.press/v80/fujimoto18a/fujimoto18a.pdf
84 | def __init__(self, action_dim, action_max, scale=1, mu=0, sigma=0.1, rng=None):
85 | if rng is None:
86 | self.rng = np.random.default_rng(seed=0)
87 | else:
88 | self.rng = rng
89 | self.scale = scale
90 | self.action_dim = action_dim
91 | self.action_max = action_max
92 | self.mu = mu
93 | self.sigma = sigma
94 |
95 | def __call__(self, action):
96 | noise = self.scale*self.rng.normal(loc=self.mu, scale=self.sigma, size=(self.action_dim,))
97 | return np.clip(action + noise, -self.action_max, self.action_max)
98 |
99 |
100 | class EGreedyGaussian(object):
101 | # the one used in the HER paper: https://arxiv.org/abs/1707.01495
102 | def __init__(self, action_dim, action_max, chance=0.2, scale=1, mu=0, sigma=0.1, rng=None):
103 | self.chance = chance
104 | self.scale = scale
105 | self.action_dim = action_dim
106 | self.action_max = action_max
107 | self.mu = mu
108 | self.sigma = sigma
109 | if rng is None:
110 | self.rng = np.random.default_rng(seed=0)
111 | else:
112 | self.rng = rng
113 |
114 | def __call__(self, action):
115 | chance = self.rng.uniform(0, 1)
116 | if chance < self.chance:
117 | return self.rng.uniform(-self.action_max, self.action_max, size=(self.action_dim,))
118 | else:
119 | noise = self.scale*self.rng.normal(loc=self.mu, scale=self.sigma, size=(self.action_dim,))
120 | return np.clip(action + noise, -self.action_max, self.action_max)
121 |
122 |
123 | class AutoAdjustingEGreedyGaussian(object):
124 | """
125 | https://ieeexplore.ieee.org/document/9366328
126 | This exploration class is a goal-success-rate-based auto-adjusting exploration strategy.
127 |     It modifies the original constant-chance strategy by shrinking each goal's exploration probability and noise standard deviation
128 |     as that goal's testing success rate increases.
129 | """
130 | def __init__(self, goal_num, action_dim, action_max, tau=0.05, chance=0.2, scale=1, mu=0, sigma=0.2, rng=None):
131 | if rng is None:
132 | self.rng = np.random.default_rng(seed=0)
133 | else:
134 | self.rng = rng
135 | self.scale = scale
136 | self.action_dim = action_dim
137 | self.action_max = action_max
138 | self.mu = mu
139 | self.base_sigma = sigma
140 | self.sigma = np.ones(goal_num) * sigma
141 |
142 | self.base_chance = chance
143 | self.goal_num = goal_num
144 | self.tau = tau
145 | self.success_rates = np.zeros(self.goal_num)
146 | self.chance = np.ones(self.goal_num) * chance
147 |
148 | def update_success_rates(self, new_tet_suc_rate):
149 | old_tet_suc_rate = self.success_rates.copy()
150 | self.success_rates = (1-self.tau)*old_tet_suc_rate + self.tau*new_tet_suc_rate
151 | self.chance = self.base_chance*(1-self.success_rates)
152 | self.sigma = self.base_sigma*(1-self.success_rates)
153 |
154 | def __call__(self, goal_ind, action):
155 | # return a random action or a noisy action
156 | prob = self.rng.uniform(0, 1)
157 | if prob < self.chance[goal_ind]:
158 | return self.rng.uniform(-self.action_max, self.action_max, size=(self.action_dim,))
159 | else:
160 | noise = self.scale*self.rng.normal(loc=self.mu, scale=self.sigma[goal_ind], size=(self.action_dim,))
161 | return action + noise
162 |
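
A minimal usage sketch of the auto-adjusting strategy above, assuming an illustrative task with 4 goals and a 2-dimensional action space.

def _example_auto_adjusting_exploration():
    explore = AutoAdjustingEGreedyGaussian(goal_num=4, action_dim=2, action_max=np.ones(2))
    explore.update_success_rates(np.array([0.9, 0.5, 0.1, 0.0]))   # fold in the latest per-goal test success rates
    # the random-action chance and noise sigma of a goal shrink as its success rate climbs
    return explore(goal_ind=0, action=np.zeros(2))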
--------------------------------------------------------------------------------
/drl_implementation/agent/utils/networks_conv.py:
--------------------------------------------------------------------------------
1 | import torch as T
2 | import torch.nn as nn
3 | import torch.nn.functional as F
4 | from torch.distributions import Normal
5 |
6 |
7 | class DQNetwork(nn.Module):
8 | def __init__(self, input_shape, action_dims, init_w=3e-3):
9 | super(DQNetwork, self).__init__()
10 | self.input_shape = input_shape
11 | # input_shape: tuple(c, h, w)
12 | self.features = nn.Sequential(
13 | nn.Conv2d(input_shape[0], 32, kernel_size=8, stride=4),
14 | nn.ReLU(),
15 | nn.Conv2d(32, 64, kernel_size=4, stride=2),
16 | nn.ReLU(),
17 | nn.Conv2d(64, 64, kernel_size=3, stride=1),
18 | nn.ReLU()
19 | )
20 |
21 | x = T.randn([32] + list(input_shape))
22 | self.conv_out_dim = self.features(x).view(x.size(0), -1).size(1)
23 | self.fc = nn.Linear(self.conv_out_dim, 512)
24 | self.v = nn.Linear(512, action_dims)
25 | T.nn.init.uniform_(self.v.weight.data, -init_w, init_w)
26 | T.nn.init.uniform_(self.v.bias.data, -init_w, init_w)
27 |
28 | def forward(self, obs):
29 | if obs.max() > 1.:
30 | obs = obs / 255.
31 |
32 | x = self.features(obs)
33 | x = x.view(x.size(0), -1)
34 | x = F.relu(self.fc(x))
35 | value = self.v(x)
36 | return value
37 |
38 | def get_action(self, obs):
39 | values = self.forward(obs)
40 | return T.argmax(values).item()
41 |
42 |
43 | class StochasticConvActor(nn.Module):
44 | def __init__(self, action_dim, encoder, hidden_dim=1024, log_std_min=-10, log_std_max=2, action_scaling=1,
45 | detach_obs_encoder=False,
46 | goal_conditioned=False, detach_goal_encoder=True):
47 | super(StochasticConvActor, self).__init__()
48 |
49 | self.action_scaling = action_scaling
50 | self.encoder = encoder
51 | self.detach_obs_encoder = detach_obs_encoder
52 | self.log_std_min = log_std_min
53 | self.log_std_max = log_std_max
54 | trunk_input_dim = self.encoder.feature_dim
55 | self.goal_conditioned = goal_conditioned
56 | self.detach_goal_encoder = detach_goal_encoder
57 | if self.goal_conditioned:
58 | trunk_input_dim *= 2
59 | self.trunk = nn.Sequential(
60 | nn.Linear(trunk_input_dim, hidden_dim), nn.ReLU(),
61 | nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
62 | nn.Linear(hidden_dim, 2 * action_dim)
63 | )
64 |
65 | self.apply(orthogonal_init)
66 |
67 | def forward(self, obs, goal=None):
68 | feature = self.encoder(obs, detach=self.detach_obs_encoder)
69 | if self.goal_conditioned:
70 | assert goal is not None, "need a goal image for goal-conditioned network"
71 | goal_feature = self.encoder(goal, detach=self.detach_goal_encoder)
72 | feature = T.cat((feature, goal_feature), dim=1)
73 |
74 | mu, log_std = self.trunk(feature).chunk(2, dim=-1)
75 | log_std = T.clamp(log_std, self.log_std_min, self.log_std_max)
76 | return mu, log_std
77 |
78 | def get_action(self, obs, goal=None, epsilon=1e-6, mean_pi=False, probs=False, entropy=False):
79 | mean, log_std = self(obs, goal)
80 | if mean_pi:
81 | return T.tanh(mean)
82 | std = log_std.exp()
83 | mu = Normal(mean, std)
84 | z = mu.rsample()
85 | action = T.tanh(z)
86 | if not probs:
87 | return action * self.action_scaling
88 | else:
89 | log_probs = (mu.log_prob(z) - T.log(1 - action.pow(2) + epsilon)).sum(1, keepdim=True)
90 | if not entropy:
91 | return action * self.action_scaling, log_probs
92 | else:
93 | entropy = mu.entropy()
94 | return action * self.action_scaling, log_probs, entropy
95 |
96 |
97 | class ConvCritic(nn.Module):
98 | # Modified from https://github.com/PhilipZRH/ferm
99 | def __init__(self, action_dim, encoder, hidden_dim=1024, detach_obs_encoder=False,
100 | goal_conditioned=False, detach_goal_encoder=True):
101 | super(ConvCritic, self).__init__()
102 |
103 | self.encoder = encoder
104 | self.detach_obs_encoder = detach_obs_encoder
105 | trunk_input_dim = self.encoder.feature_dim
106 | self.goal_conditioned = goal_conditioned
107 | self.detach_goal_encoder = detach_goal_encoder
108 | if self.goal_conditioned:
109 | trunk_input_dim *= 2
110 | self.trunk = nn.Sequential(
111 | nn.Linear(trunk_input_dim + action_dim, hidden_dim),
112 | nn.ReLU(),
113 | nn.Linear(hidden_dim, hidden_dim),
114 | nn.ReLU(),
115 | nn.Linear(hidden_dim, 1)
116 | )
117 |
118 | self.apply(orthogonal_init)
119 |
120 | def forward(self, obs, action, goal=None):
121 |         # detach_obs_encoder stops gradient propagation into the encoder
122 | feature = self.encoder(obs, detach=self.detach_obs_encoder)
123 | if self.goal_conditioned:
124 | assert goal is not None, "need a goal image for goal-conditioned network"
125 | goal_feature = self.encoder(goal, detach=self.detach_goal_encoder)
126 | feature = T.cat((feature, goal_feature), dim=1)
127 | trunk_input = T.cat([feature, action], dim=1)
128 | q = self.trunk(trunk_input)
129 | return q
130 |
131 |
132 | class CURL(nn.Module):
133 | # Modified from https://github.com/PhilipZRH/ferm
134 | def __init__(self, z_dim, encoder, encoder_target):
135 | super(CURL, self).__init__()
136 | self.encoder = encoder
137 | self.encoder_target = encoder_target
138 | assert z_dim == self.encoder.feature_dim == self.encoder_target.feature_dim
139 | self.W = nn.Parameter(T.rand(z_dim, z_dim))
140 |
141 | def encode(self, x, detach=False, use_target=False):
142 | # if exponential moving average (ema) target is enabled,
143 | # then compute key values using target encoder without gradient,
144 | # else compute key values with the main encoder
145 | # from CURL https://arxiv.org/abs/2004.04136
146 | if use_target:
147 | with T.no_grad():
148 | z_out = self.encoder_target(x)
149 | else:
150 | z_out = self.encoder(x)
151 |
152 | if detach:
153 | z_out = z_out.detach()
154 | return z_out
155 |
156 | def compute_score(self, z_a, z_pos):
157 | """
158 | from CURL https://arxiv.org/abs/2004.04136
159 | - compute (B,B) matrix z_a (W z_pos.T)
160 | - positives are all diagonal elements
161 | - negatives are all other elements
162 | - to compute loss use multi-class cross entropy with identity matrix for labels
163 | """
164 | Wz = T.matmul(self.W, z_pos.T) # (z_dim,B)
165 | score = T.matmul(z_a, Wz) # (B,B)
166 | score = score - T.max(score, 1)[0][:, None]
167 | return score
168 |
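# A minimal sketch, assuming the training loop described by compute_score()'s docstring:
# positives sit on the diagonal of the (B, B) score matrix, so the labels are arange(B).
def curl_contrastive_loss(curl, obs_anchor, obs_positive):
    z_a = curl.encode(obs_anchor)                        # queries; gradients flow into the main encoder
    z_pos = curl.encode(obs_positive, use_target=True)   # keys from the EMA target encoder, no gradient
    logits = curl.compute_score(z_a, z_pos)              # (B, B) bilinear similarity scores
    labels = T.arange(logits.shape[0], device=logits.device)
    return F.cross_entropy(logits, labels)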
169 |
170 | class PixelEncoder(nn.Module):
171 | def __init__(self, obs_shape, feature_dim=50, num_layers=4, num_filters=32):
172 | # the encoder architecture adopted by SAC-AE, DrQ and CURL
173 | super(PixelEncoder, self).__init__()
174 | assert len(obs_shape) == 3
175 | self.obs_shape = obs_shape[-2:]
176 | self.feature_dim = feature_dim
177 | self.num_layers = num_layers
178 |
179 | self.convs = nn.ModuleList(
180 | [nn.Conv2d(obs_shape[0], num_filters, 3, stride=2)]
181 | )
182 | for i in range(num_layers - 1):
183 | self.convs.append(nn.Conv2d(num_filters, num_filters, 3, stride=1))
184 |
185 | x = T.randn([32] + list(obs_shape))
186 | out_dim = self.forward_conv(x, flatten=False).shape[-1]
187 | self.trunk = nn.Sequential(
188 | nn.Linear(num_filters * out_dim * out_dim, self.feature_dim),
189 | nn.LayerNorm(self.feature_dim),
190 | nn.Tanh()
191 | )
192 |
193 | def forward_conv(self, obs, flatten=True):
194 | if obs.max() > 1.:
195 | obs = obs / 255.
196 |
197 | conv = T.relu(self.convs[0](obs))
198 | for i in range(1, self.num_layers):
199 | conv = T.relu(self.convs[i](conv))
200 | if flatten:
201 | conv = conv.reshape(conv.size(0), -1)
202 | return conv
203 |
204 | def forward(self, obs, detach=False):
205 | h = self.forward_conv(obs)
206 | if detach:
207 | h = h.detach()
208 | h = self.trunk(h)
209 | return h
210 |
211 | def copy_conv_weights_from(self, source):
212 | # only copy conv layers' weights
213 | for i in range(self.num_layers):
214 | self.convs[i].weight = source.convs[i].weight
215 | self.convs[i].bias = source.convs[i].bias
216 |
217 |
218 | class PixelDecoder(nn.Module):
219 | def __init__(self, obs_shape, feature_dim=50, num_layers=4, num_filters=32):
220 |         # the decoder counterpart of the encoder architecture adopted by SAC-AE, DrQ and CURL
221 | super(PixelDecoder, self).__init__()
222 | assert len(obs_shape) == 3
223 | self.obs_shape = obs_shape[-2:]
224 | self.feature_dim = feature_dim
225 | self.num_layers = num_layers
226 |
227 | # todo
228 |
229 |
230 | def orthogonal_init(m):
231 | if isinstance(m, nn.Linear):
232 | nn.init.orthogonal_(m.weight.data)
233 | if hasattr(m.bias, 'data'):
234 | m.bias.data.fill_(0.0)
235 | elif isinstance(m, nn.Conv2d) or isinstance(m, nn.ConvTranspose2d):
236 | gain = nn.init.calculate_gain('relu')
237 | nn.init.orthogonal_(m.weight.data, gain)
238 | if hasattr(m.bias, 'data'):
239 | m.bias.data.fill_(0.0)
240 |
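
A minimal usage sketch of the pixel-based modules above, assuming an observation of 3 stacked RGB frames and a 4-dimensional action space.

def _example_pixel_actor_critic():
    encoder = PixelEncoder(obs_shape=(9, 84, 84), feature_dim=50)   # 3 stacked RGB frames
    actor = StochasticConvActor(action_dim=4, encoder=encoder)
    critic = ConvCritic(action_dim=4, encoder=encoder)              # shares the same encoder instance
    obs = T.rand(16, 9, 84, 84)
    action = actor.get_action(obs)                                  # (16, 4) tanh-squashed sample
    return critic(obs, action)                                      # (16, 1) Q-values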
--------------------------------------------------------------------------------
/drl_implementation/agent/utils/networks_mlp.py:
--------------------------------------------------------------------------------
1 | import torch as T
2 | import torch.nn as nn
3 | import torch.nn.functional as F
4 | from torch.distributions import Normal, Categorical
5 |
6 |
7 | class Actor(nn.Module):
8 | def __init__(self, input_dim, output_dim, fc1_size=256, fc2_size=256, fc3_size=256, init_w=3e-3, action_scaling=1):
9 | super(Actor, self).__init__()
10 | self.fc1 = nn.Linear(input_dim, fc1_size)
11 | self.fc2 = nn.Linear(fc1_size, fc2_size)
12 | self.fc3 = nn.Linear(fc2_size, fc3_size)
13 | self.pi = nn.Linear(fc3_size, output_dim)
14 | self.apply(orthogonal_init)
15 | self.action_scaling = action_scaling
16 |
17 | def forward(self, inputs):
18 | x = F.relu(self.fc1(inputs))
19 | x = F.relu(self.fc2(x))
20 | x = F.relu(self.fc3(x))
21 | action = T.tanh(self.pi(x))
22 | return action * self.action_scaling
23 |
24 |
25 | class StochasticActor(nn.Module):
26 | def __init__(self, input_dim, output_dim, log_std_min, log_std_max, continuous=True, agent_state_dim=0,
27 | fc1_size=256, fc2_size=256, fc3_size=256, init_w=3e-3, action_scaling=1):
28 | super(StochasticActor, self).__init__()
29 | self.continuous = continuous
30 | self.action_dim = output_dim
31 | self.agent_state_dim = agent_state_dim
32 | self.fc1 = nn.Linear(input_dim+agent_state_dim, fc1_size)
33 | self.fc2 = nn.Linear(fc1_size, fc2_size)
34 | if self.continuous:
35 | self.fc3 = nn.Linear(fc2_size, fc3_size)
36 | self.mean = nn.Linear(fc3_size, output_dim)
37 | self.log_std = nn.Linear(fc3_size, output_dim)
38 | else:
39 | self.fc3 = nn.Linear(fc2_size, output_dim)
40 | self.apply(orthogonal_init)
41 | self.log_std_min = log_std_min
42 | self.log_std_max = log_std_max
43 | self.action_scaling = action_scaling
44 |
45 | def forward(self, inputs):
46 | x = F.relu(self.fc1(inputs))
47 | x = F.relu(self.fc2(x))
48 | x = F.relu(self.fc3(x))
49 | if self.continuous:
50 | mean = self.mean(x)
51 | log_std = self.log_std(x)
52 | log_std = T.clamp(log_std, self.log_std_min, self.log_std_max)
53 | return mean, log_std
54 | else:
55 | return x
56 |
57 | def get_action(self, inputs, std_scale=None, epsilon=1e-6, mean_pi=False, greedy=False, probs=False, entropy=False):
58 | if self.continuous:
59 | mean, log_std = self(inputs)
60 | if mean_pi:
61 | return T.tanh(mean)
62 | std = log_std.exp()
63 | if std_scale is not None:
64 | std *= std_scale
65 | mu = Normal(mean, std)
66 | z = mu.rsample()
67 | action = T.tanh(z)
68 | if not probs:
69 | return action * self.action_scaling
70 | else:
71 | if action.shape == (self.action_dim,):
72 | action = action.reshape((1, self.action_dim))
73 | log_probs = (mu.log_prob(z) - T.log(1 - action.pow(2) + epsilon)).sum(1, keepdim=True)
74 | if not entropy:
75 | return action * self.action_scaling, log_probs
76 | else:
77 | entropy = mu.entropy()
78 | return action * self.action_scaling, log_probs, entropy
79 | else:
80 | logits = self(inputs)
81 | if greedy:
82 | actions = T.argmax(logits, dim=1, keepdim=True)
83 | return actions, None, None
84 | action_probs = F.softmax(logits, dim=1)
85 | dist = Categorical(action_probs)
86 | actions = dist.sample().view(-1, 1)
87 | log_probs = T.log(action_probs + epsilon).gather(1, actions)
88 | entropy = dist.entropy()
89 | return actions, log_probs, entropy
90 |
91 | def get_log_probs(self, inputs, actions, std_scale=None):
92 | actions /= self.action_scaling
93 | mean, log_std = self(inputs)
94 | std = log_std.exp()
95 | if std_scale is not None:
96 | std *= std_scale
97 | mu = Normal(mean, std)
98 | log_probs = mu.log_prob(actions)
99 | entropy = mu.entropy()
100 | return log_probs, entropy
101 |
102 |
103 | class Critic(nn.Module):
104 | def __init__(self, input_dim, output_dim, fc1_size=256, fc2_size=256, fc3_size=256, init_w=3e-3, softmax=False):
105 | super(Critic, self).__init__()
106 | self.fc1 = nn.Linear(input_dim, fc1_size)
107 | self.fc2 = nn.Linear(fc1_size, fc2_size)
108 | self.fc3 = nn.Linear(fc2_size, fc3_size)
109 | self.v = nn.Linear(fc3_size, output_dim)
110 | self.apply(orthogonal_init)
111 | self.softmax = softmax
112 |
113 | def forward(self, inputs):
114 | x = F.relu(self.fc1(inputs))
115 | x = F.relu(self.fc2(x))
116 | x = F.relu(self.fc3(x))
117 | value = self.v(x)
118 | if not self.softmax:
119 | return value
120 | else:
121 | return F.softmax(value, dim=1)
122 |
123 | def get_action(self, inputs):
124 | values = self.forward(inputs)
125 | return T.argmax(values).item()
126 |
127 |
128 | def orthogonal_init(m):
129 | if isinstance(m, nn.Linear):
130 | nn.init.orthogonal_(m.weight.data)
131 | if hasattr(m.bias, 'data'):
132 | m.bias.data.fill_(0.0)
133 | elif isinstance(m, nn.Conv2d) or isinstance(m, nn.ConvTranspose2d):
134 | gain = nn.init.calculate_gain('relu')
135 | nn.init.orthogonal_(m.weight.data, gain)
136 | if hasattr(m.bias, 'data'):
137 | m.bias.data.fill_(0.0)
138 |
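
A minimal usage sketch of the deterministic actor and critic above, assuming illustrative state and action dimensions.

def _example_actor_critic():
    state_dim, action_dim = 3, 1
    actor = Actor(state_dim, action_dim, action_scaling=2.0)    # tanh output rescaled to [-2, 2]
    critic = Critic(state_dim + action_dim, 1)                  # Q(s, a) over the concatenated input
    s = T.randn(32, state_dim)
    a = actor(s)                                                # (32, 1)
    return critic(T.cat((s, a), dim=1))                         # (32, 1) Q-values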
--------------------------------------------------------------------------------
/drl_implementation/agent/utils/networks_pointnet.py:
--------------------------------------------------------------------------------
1 | import torch as T
2 | import torch.nn as nn
3 | import torch.nn.functional as F
4 | from .pointnet_2_utils import PointNetSetAbstraction
5 | from .pointnet_utils import PointNetEncoder, feature_transform_reguliarzer
6 |
7 |
8 | class CriticPointNet(nn.Module):
9 | def __init__(self, output_dim, action_dim, agent_state_dim=0, normal_channel=False, softmax=False, goal_conditioned=False):
10 | super(CriticPointNet, self).__init__()
11 | in_channel = 6 if normal_channel else 3
12 | self.action_dim = action_dim
13 | self.agent_state_dim = agent_state_dim
14 | self.feat = PointNetEncoder(global_feat=True, feature_transform=True, channel=in_channel)
15 | self.goal_conditioned = goal_conditioned
16 | if self.goal_conditioned:
17 | self.fc1 = nn.Linear(2048+action_dim+agent_state_dim, 512)
18 | else:
19 | self.fc1 = nn.Linear(1024+action_dim+agent_state_dim, 512)
20 | self.fc2 = nn.Linear(512, 256)
21 | self.fc3 = nn.Linear(256, output_dim)
22 | self.dropout = nn.Dropout(p=0.4)
23 | self.bn1 = nn.BatchNorm1d(512)
24 | self.bn2 = nn.BatchNorm1d(256)
25 | self.softmax = softmax
26 |
27 | def forward(self, obs_xyz, action, goal_xyz=None, agent_state=None):
28 | x, trans, trans_feat = self.feat(obs_xyz)
29 | if self.goal_conditioned and goal_xyz is not None:
30 | goal_x, goal_trans, goal_trans_feat = self.feat(goal_xyz)
31 | x = T.cat([x, goal_x.detach()], dim=1)
32 | if agent_state is not None:
33 | assert agent_state.shape[1] == self.agent_state_dim
34 | x = T.cat([x, agent_state], dim=1)
35 | x = T.cat([x, action], dim=1)
36 | x = F.relu(self.bn1(self.fc1(x)))
37 | x = F.relu(self.bn2(self.dropout(self.fc2(x))))
38 | value = self.fc3(x)
39 |
40 | if not self.softmax:
41 | return value
42 | else:
43 | return F.softmax(value, dim=1)
44 |
45 | def get_features(self, xyz, detach=False):
46 | x, trans, trans_feat = self.feat(xyz)
47 | if detach:
48 | x = x.detach()
49 | return x
50 |
51 |
52 | class CriticPointNet2(nn.Module):
53 | def __init__(self, output_dim, action_dim, normal_channel=False, softmax=False):
54 | super(CriticPointNet2, self).__init__()
55 | in_channel = 6 if normal_channel else 3
56 | self.normal_channel = normal_channel
57 | self.sa1 = PointNetSetAbstraction(npoint=512, radius=0.2, nsample=32, in_channel=in_channel, mlp=[64, 64, 128],
58 | group_all=False)
59 | self.sa2 = PointNetSetAbstraction(npoint=128, radius=0.4, nsample=64, in_channel=128 + 3, mlp=[128, 128, 256],
60 | group_all=False)
61 | self.sa3 = PointNetSetAbstraction(npoint=None, radius=None, nsample=None, in_channel=256 + 3,
62 | mlp=[256, 512, 1024], group_all=True)
63 |
64 | self.fc1 = nn.Linear(1024+action_dim, 512)
65 | self.bn1 = nn.BatchNorm1d(512)
66 | self.drop1 = nn.Dropout(0.4)
67 | self.fc2 = nn.Linear(512, 256)
68 | self.bn2 = nn.BatchNorm1d(256)
69 | self.drop2 = nn.Dropout(0.4)
70 | self.fc3 = nn.Linear(256, output_dim)
71 | self.softmax = softmax
72 |
73 | def forward(self, xyz):
74 | B, _, _ = xyz.shape
75 | if self.normal_channel:
76 | norm = xyz[:, 3:, :]
77 | xyz = xyz[:, :3, :]
78 | else:
79 | norm = None
80 | l1_xyz, l1_points = self.sa1(xyz, norm)
81 | l2_xyz, l2_points = self.sa2(l1_xyz, l1_points)
82 | l3_xyz, l3_points = self.sa3(l2_xyz, l2_points)
83 | x = l3_points.view(B, 1024)
84 |
85 | x = self.drop1(F.relu(self.bn1(self.fc1(x))))
86 | x = self.drop2(F.relu(self.bn2(self.fc2(x))))
87 | value = self.fc3(x)
88 |
89 | if not self.softmax:
90 | return value
91 | else:
92 | return F.softmax(value, dim=1)
93 |
94 | def get_features(self, xyz, detach=False):
95 | B, _, _ = xyz.shape
96 | if self.normal_channel:
97 | norm = xyz[:, 3:, :]
98 | xyz = xyz[:, :3, :]
99 | else:
100 | norm = None
101 | l1_xyz, l1_points = self.sa1(xyz, norm)
102 | l2_xyz, l2_points = self.sa2(l1_xyz, l1_points)
103 | l3_xyz, l3_points = self.sa3(l2_xyz, l2_points)
104 | x = l3_points.view(B, 1024)
105 | if detach:
106 | x = x.detach()
107 | return x
108 |
109 | def get_action(self, inputs):
110 | values = self.forward(inputs)
111 | return T.argmax(values).item()
112 |
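
A minimal usage sketch of the goal-conditioned PointNet critic above, assuming channels-first (B, 3, N) point clouds and a 4-dimensional action space.

def _example_pointnet_critic():
    critic = CriticPointNet(output_dim=1, action_dim=4, goal_conditioned=True)
    obs_xyz = T.randn(8, 3, 1024)      # observed point cloud
    goal_xyz = T.randn(8, 3, 1024)     # goal point cloud; its features are detached inside forward()
    action = T.randn(8, 4)
    return critic(obs_xyz, action, goal_xyz=goal_xyz)   # (8, 1) Q-values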
--------------------------------------------------------------------------------
/drl_implementation/agent/utils/normalizer.py:
--------------------------------------------------------------------------------
1 | import numpy as np
2 |
3 |
4 | class Normalizer(object):
5 | def __init__(self, input_dims, init_mean, init_var,
6 | scale_factor=1, epsilon=1e-3, clip_range=None, activated=False):
7 | self.activated = activated
8 | self.input_dims = input_dims
9 | self.sample_count = 0
10 | self.history = []
11 | self.history_mean = init_mean
12 | self.history_var = init_var
13 | if self.history_mean is None:
14 | self.history_mean = np.zeros(self.input_dims)
15 | if self.history_var is None:
16 | self.history_var = np.ones(self.input_dims)
17 | assert self.history_mean.shape == (self.input_dims,)
18 | assert self.history_var.shape == (self.input_dims,)
19 | self.epsilon = epsilon*np.ones(self.input_dims)
20 | if clip_range is None:
21 | clip_range = 1e3
22 | self.input_clip_range = (-clip_range*np.ones(self.input_dims), clip_range*np.ones(self.input_dims))
23 | self.scale_factor = scale_factor
24 |
25 | def store_history(self, *args):
26 | self.history.append(*args)
27 |
28 |     # update the running mean and standard deviation used for z-score normalization
29 | def update_mean(self):
30 | if len(self.history) == 0:
31 | return
32 | new_sample_num = len(self.history)
33 | new_history = np.array(self.history, dtype=float)
34 | new_mean = np.mean(new_history, axis=0)
35 |
36 | new_var = np.sum(np.square(new_history - new_mean), axis=0)
37 | new_var = (self.sample_count * np.square(self.history_var) + new_var)
38 | new_var /= (new_sample_num + self.sample_count)
39 | self.history_var = np.sqrt(new_var)
40 |
41 | new_mean = (self.sample_count * self.history_mean + new_sample_num * new_mean)
42 | new_mean /= (new_sample_num + self.sample_count)
43 | self.history_mean = new_mean
44 |
45 | self.sample_count += new_sample_num
46 | self.history.clear()
47 |
48 |     # pre-process inputs, currently using z-score normalization with clipping and scaling
49 | def __call__(self, inputs):
50 | if self.activated:
51 | inputs = (inputs - self.history_mean) / (self.history_var+self.epsilon)
52 | inputs = np.clip(inputs, self.input_clip_range[0], self.input_clip_range[1])
53 | return self.scale_factor*inputs
54 |
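
A minimal usage sketch, assuming a 3-dimensional observation: accumulate samples, refresh the running statistics, then normalize new inputs.

def _example_normalizer():
    norm = Normalizer(input_dims=3, init_mean=None, init_var=None, activated=True)
    for obs in (np.array([0.1, 0.2, 0.3]), np.array([0.3, 0.1, 0.5])):   # illustrative observations
        norm.store_history(obs)
    norm.update_mean()                          # fold the stored history into the running statistics
    return norm(np.array([0.2, 0.2, 0.4]))      # z-scored, clipped, then scaled by scale_factor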
--------------------------------------------------------------------------------
/drl_implementation/agent/utils/plot.py:
--------------------------------------------------------------------------------
1 | import json
2 | import numpy as np
3 | import matplotlib as mpl
4 |
5 | mpl.use('Agg')
6 | import matplotlib.pyplot as plt
7 | from matplotlib.lines import Line2D
8 | from copy import deepcopy as dcp
9 |
10 |
11 | def smoothed_plot(file, data, x_label="Timesteps", y_label="Success rate", window=5):
12 | N = len(data)
13 | running_avg = np.empty(N)
14 | for t in range(N):
15 | running_avg[t] = np.mean(data[max(0, t - window):(t + 1)])
16 | x = [i for i in range(N)]
17 | plt.ylabel(y_label)
18 | plt.xlabel(x_label)
19 | if x_label == "Epoch":
20 | x_tick_interval = len(data) // 10
21 | plt.xticks([n * x_tick_interval for n in range(11)])
22 | plt.plot(x, running_avg)
23 | plt.savefig(file, bbox_inches='tight', dpi=500)
24 | plt.close()
25 |
26 |
27 | def smoothed_plot_multi_line(file, data, colors=None, linestyles=None, linewidths=None, alphas=None,
28 | legend=None, legend_loc="upper right", window=5, title=None,
29 | x_label='Timesteps', x_axis_off=False, xticks=None, xticklabels=None,
30 | y_label="Success rate", ylim=(None, None), y_axis_off=False, yticks=None, yticklabels=None,
31 | grid=False,
32 | horizontal_lines=None, ho_linestyle='--', ho_linewidth=4, ho_xmin=0.05, ho_xmax=0.95):
33 | if y_axis_off:
34 | plt.ylabel(None)
35 | plt.yticks([])
36 | else:
37 | plt.ylabel(y_label)
38 | if yticks is not None:
39 | plt.yticks(yticks, yticklabels)
40 | if ylim[0] is not None:
41 | plt.ylim(ylim)
42 | if title is not None:
43 | plt.title(title)
44 |
45 | if x_axis_off:
46 | plt.xlabel(None)
47 | plt.xticks([])
48 | else:
49 | plt.xlabel(x_label)
50 | if x_label == "Epoch":
51 |             x_tick_interval = len(data[0]) // 10
52 | plt.xticks([n * x_tick_interval for n in range(11)])
53 | if xticks is not None:
54 | plt.xticks(xticks, xticklabels)
55 |
56 | for t in range(len(data)):
57 | N = len(data[t])
58 | x = [i for i in range(N)]
59 | if window != 0:
60 | running_avg = np.empty(N)
61 | for n in range(N):
62 | running_avg[n] = np.mean(data[t][max(0, n - window):(n + 1)])
63 | else:
64 | running_avg = data[t]
65 |
66 | if colors is None:
67 | c = None
68 | else:
69 | assert len(colors) >= len(data)
70 | c = colors[t]
71 |
72 | if linestyles is None:
73 | ls = '-'
74 | else:
75 | assert len(linestyles) == len(data)
76 | ls = linestyles[t]
77 |
78 | if linewidths is None:
79 | linewidth = 1
80 | else:
81 | assert len(linewidths) == len(data)
82 | linewidth = linewidths[t]
83 |
84 | if alphas is None:
85 | alpha = 1
86 | else:
87 | assert len(alphas) == len(data)
88 | alpha = alphas[t]
89 |
90 | plt.plot(x, running_avg, c=c, linestyle=ls, linewidth=linewidth, alpha=alpha)
91 |
92 | if horizontal_lines is not None:
93 | for n in range(len(horizontal_lines)):
94 | plt.axhline(y=horizontal_lines[n], color=colors[len(data) + n],
95 | xmin=ho_xmin, xmax=ho_xmax, linestyle=ho_linestyle, linewidth=ho_linewidth)
96 |
97 | if legend is not None:
98 | plt.legend(legend, loc=legend_loc)
99 |
100 | if grid:
101 | plt.grid(True, linewidth=0.2)
102 |
103 | plt.savefig(file, bbox_inches='tight', dpi=500)
104 | plt.close()
105 |
106 |
107 | def smoothed_plot_mean_deviation(file, data_dict_list, title=None,
108 | vertical_lines=None, horizontal_lines=None, linestyle='--', linewidth=4,
109 | x_label='Timesteps', x_axis_off=False, xticks=None,
110 | y_label="Success rate", window=5, ylim=(None, None), y_axis_off=False, yticks=None,
111 | legend=None, legend_only=False, legend_file=None, legend_loc="upper right",
112 | legend_title=None, legend_bbox_to_anchor=None, legend_ncol=4, legend_frame=False,
113 | handlelength=2):
114 | colors = ['tab:blue', 'tab:orange', 'tab:green', 'tab:red', 'tab:purple',
115 | 'tab:brown', 'tab:pink', 'tab:gray', 'tab:olive', 'tab:cyan','k']
116 | if not isinstance(data_dict_list, list):
117 | data_dict_list = [data_dict_list]
118 |
119 | if y_axis_off:
120 | plt.ylabel(None)
121 | plt.yticks([])
122 | else:
123 | plt.ylabel(y_label)
124 | if yticks is not None:
125 | plt.yticks(yticks)
126 | if ylim[0] is not None:
127 | plt.ylim(ylim)
128 | if title is not None:
129 | plt.title(title)
130 |
131 | if x_axis_off:
132 | plt.xlabel(None)
133 | plt.xticks([])
134 | else:
135 | plt.xlabel(x_label)
136 | if x_label == "Epoch":
137 | x_tick_interval = len(data_dict_list[0]["mean"]) // 10
138 | plt.xticks([n * x_tick_interval for n in range(11)])
139 | if xticks is not None:
140 | plt.xticks(xticks)
141 |
142 | handles = [Line2D([0], [0], color=colors[i], linewidth=linewidth) for i in range(len(data_dict_list))]
143 | if legend is not None:
144 | legend_plot = plt.legend(handles, legend, handlelength=handlelength,
145 | title=legend_title, loc=legend_loc, labelspacing=0.15,
146 | bbox_to_anchor=legend_bbox_to_anchor, ncol=legend_ncol, frameon=legend_frame)
147 | if legend_only:
148 | assert legend_file is not None, 'specify legend save path'
149 | fig = legend_plot.figure
150 | fig.canvas.draw()
151 | bbox = legend_plot.get_window_extent().transformed(fig.dpi_scale_trans.inverted())
152 | fig.savefig(legend_file, dpi=500, bbox_inches=bbox)
153 | plt.close()
154 | return
155 |
156 | N = len(data_dict_list[0]["mean"])
157 | x = [i for i in range(N)]
158 | for i in range(len(data_dict_list)):
159 | case_data = data_dict_list[i]
160 | for key in case_data:
161 | running_avg = np.empty(N)
162 | for n in range(N):
163 | running_avg[n] = np.mean(case_data[key][max(0, n - window):(n + 1)])
164 |
165 | case_data[key] = dcp(running_avg)
166 |
167 | plt.fill_between(x, case_data["upper"], case_data["lower"], alpha=0.3, color=colors[i], label='_nolegend_')
168 | plt.plot(x, case_data["mean"], color=colors[i])
169 |
170 | if horizontal_lines is not None:
171 | for n in range(len(horizontal_lines)):
172 | plt.axhline(y=horizontal_lines[n], color=colors[len(data_dict_list) + n], xmin=0.05, xmax=0.95,
173 | linestyle=linestyle, linewidth=linewidth)
174 | if vertical_lines is not None:
175 | assert horizontal_lines is None
176 | for n in range(len(vertical_lines)):
177 | plt.axvline(x=vertical_lines[n], color=colors[len(data_dict_list) + n], ymin=0.05, ymax=0.95,
178 | linestyle=linestyle, linewidth=linewidth)
179 |
180 | plt.savefig(file, bbox_inches='tight', dpi=500)
181 | plt.close()
182 |
183 |
184 | def get_mean_and_deviation(data, save_data=False, file_name=None):
185 | upper = np.max(data, axis=0)
186 | lower = np.min(data, axis=0)
187 | mean = np.mean(data, axis=0)
188 | statistics = {"mean": mean.tolist(),
189 | "upper": upper.tolist(),
190 | "lower": lower.tolist()}
191 | if save_data:
192 | assert file_name is not None
193 | json.dump(statistics, open(file_name, 'w'))
194 | return statistics
195 |
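
A minimal usage sketch, assuming 3 illustrative runs of 100 episodes: aggregate them with get_mean_and_deviation() and plot the mean with a min-max band.

def _example_mean_deviation_plot():
    runs = np.random.rand(3, 100)             # 3 runs of 100 episodes each
    stats = get_mean_and_deviation(runs)      # dict with 'mean', 'upper' and 'lower' lists
    smoothed_plot_mean_deviation('return.png', stats, x_label='Episode', y_label='Return')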
--------------------------------------------------------------------------------
/drl_implementation/agent/utils/pointnet_2_utils.py:
--------------------------------------------------------------------------------
1 | """
2 | https://github.com/yanx27/Pointnet_Pointnet2_pytorch/blob/master/models/pointnet2_utils.py
3 | @article{Pytorch_Pointnet_Pointnet2,
4 | Author = {Xu Yan},
5 | Title = {Pointnet/Pointnet++ Pytorch},
6 | Journal = {https://github.com/yanx27/Pointnet_Pointnet2_pytorch},
7 | Year = {2019}
8 | }
9 | """
10 | import torch
11 | import torch.nn as nn
12 | import torch.nn.functional as F
13 | from time import time
14 | import numpy as np
15 |
16 |
17 | def timeit(tag, t):
18 | print("{}: {}s".format(tag, time() - t))
19 | return time()
20 |
21 |
22 | def pc_normalize(pc):
23 | l = pc.shape[0]
24 | centroid = np.mean(pc, axis=0)
25 | pc = pc - centroid
26 | m = np.max(np.sqrt(np.sum(pc ** 2, axis=1)))
27 | pc = pc / m
28 | return pc
29 |
30 |
31 | def square_distance(src, dst):
32 | """
33 |     Calculate the squared Euclidean distance between each pair of points.
34 |
35 | src^T * dst = xn * xm + yn * ym + zn * zm;
36 | sum(src^2, dim=-1) = xn*xn + yn*yn + zn*zn;
37 | sum(dst^2, dim=-1) = xm*xm + ym*ym + zm*zm;
38 | dist = (xn - xm)^2 + (yn - ym)^2 + (zn - zm)^2
39 | = sum(src**2,dim=-1)+sum(dst**2,dim=-1)-2*src^T*dst
40 |
41 | Input:
42 | src: source points, [B, N, C]
43 | dst: target points, [B, M, C]
44 | Output:
45 | dist: per-point square distance, [B, N, M]
46 | """
47 | B, N, _ = src.shape
48 | _, M, _ = dst.shape
49 | dist = -2 * torch.matmul(src, dst.permute(0, 2, 1))
50 | dist += torch.sum(src ** 2, -1).view(B, N, 1)
51 | dist += torch.sum(dst ** 2, -1).view(B, 1, M)
52 | return dist
53 |
54 |
55 | def index_points(points, idx):
56 | """
57 |
58 | Input:
59 | points: input points data, [B, N, C]
60 | idx: sample index data, [B, S]
61 | Return:
62 | new_points:, indexed points data, [B, S, C]
63 | """
64 | device = points.device
65 | B = points.shape[0]
66 | view_shape = list(idx.shape)
67 | view_shape[1:] = [1] * (len(view_shape) - 1)
68 | repeat_shape = list(idx.shape)
69 | repeat_shape[0] = 1
70 | batch_indices = torch.arange(B, dtype=torch.long).to(device).view(view_shape).repeat(repeat_shape)
71 | new_points = points[batch_indices, idx, :]
72 | return new_points
73 |
74 |
75 | def farthest_point_sample(xyz, npoint):
76 | """
77 | Input:
78 | xyz: pointcloud data, [B, N, 3]
79 | npoint: number of samples
80 | Return:
81 | centroids: sampled pointcloud index, [B, npoint]
82 | """
83 | device = xyz.device
84 | B, N, C = xyz.shape
85 | centroids = torch.zeros(B, npoint, dtype=torch.long).to(device)
86 | distance = torch.ones(B, N).to(device) * 1e10
87 | farthest = torch.randint(0, N, (B,), dtype=torch.long).to(device)
88 | batch_indices = torch.arange(B, dtype=torch.long).to(device)
89 | for i in range(npoint):
90 | centroids[:, i] = farthest
91 | centroid = xyz[batch_indices, farthest, :].view(B, 1, 3)
92 | dist = torch.sum((xyz - centroid) ** 2, -1)
93 | mask = dist < distance
94 | distance[mask] = dist[mask]
95 | farthest = torch.max(distance, -1)[1]
96 | return centroids
97 |
98 |
99 | def query_ball_point(radius, nsample, xyz, new_xyz):
100 | """
101 | Input:
102 | radius: local region radius
103 | nsample: max sample number in local region
104 | xyz: all points, [B, N, 3]
105 | new_xyz: query points, [B, S, 3]
106 | Return:
107 | group_idx: grouped points index, [B, S, nsample]
108 | """
109 | device = xyz.device
110 | B, N, C = xyz.shape
111 | _, S, _ = new_xyz.shape
112 | group_idx = torch.arange(N, dtype=torch.long).to(device).view(1, 1, N).repeat([B, S, 1])
113 | sqrdists = square_distance(new_xyz, xyz)
114 | group_idx[sqrdists > radius ** 2] = N
115 | group_idx = group_idx.sort(dim=-1)[0][:, :, :nsample]
116 | group_first = group_idx[:, :, 0].view(B, S, 1).repeat([1, 1, nsample])
117 | mask = group_idx == N
118 | group_idx[mask] = group_first[mask]
119 | return group_idx
120 |
121 |
122 | def sample_and_group(npoint, radius, nsample, xyz, points, returnfps=False):
123 | """
124 | Input:
125 | npoint:
126 | radius:
127 | nsample:
128 | xyz: input points position data, [B, N, 3]
129 | points: input points data, [B, N, D]
130 | Return:
131 |         new_xyz: sampled points position data, [B, npoint, 3]
132 | new_points: sampled points data, [B, npoint, nsample, 3+D]
133 | """
134 | B, N, C = xyz.shape
135 | S = npoint
136 |     fps_idx = farthest_point_sample(xyz, npoint)  # [B, npoint]
137 | new_xyz = index_points(xyz, fps_idx)
138 | idx = query_ball_point(radius, nsample, xyz, new_xyz)
139 | grouped_xyz = index_points(xyz, idx) # [B, npoint, nsample, C]
140 | grouped_xyz_norm = grouped_xyz - new_xyz.view(B, S, 1, C)
141 |
142 | if points is not None:
143 | grouped_points = index_points(points, idx)
144 | new_points = torch.cat([grouped_xyz_norm, grouped_points], dim=-1) # [B, npoint, nsample, C+D]
145 | else:
146 | new_points = grouped_xyz_norm
147 | if returnfps:
148 | return new_xyz, new_points, grouped_xyz, fps_idx
149 | else:
150 | return new_xyz, new_points
151 |
152 |
153 | def sample_and_group_all(xyz, points):
154 | """
155 | Input:
156 | xyz: input points position data, [B, N, 3]
157 | points: input points data, [B, N, D]
158 | Return:
159 | new_xyz: sampled points position data, [B, 1, 3]
160 | new_points: sampled points data, [B, 1, N, 3+D]
161 | """
162 | device = xyz.device
163 | B, N, C = xyz.shape
164 | new_xyz = torch.zeros(B, 1, C).to(device)
165 | grouped_xyz = xyz.view(B, 1, N, C)
166 | if points is not None:
167 | new_points = torch.cat([grouped_xyz, points.view(B, 1, N, -1)], dim=-1)
168 | else:
169 | new_points = grouped_xyz
170 | return new_xyz, new_points
171 |
172 |
173 | class PointNetSetAbstraction(nn.Module):
174 | def __init__(self, npoint, radius, nsample, in_channel, mlp, group_all):
175 | super(PointNetSetAbstraction, self).__init__()
176 | self.npoint = npoint
177 | self.radius = radius
178 | self.nsample = nsample
179 | self.mlp_convs = nn.ModuleList()
180 | self.mlp_bns = nn.ModuleList()
181 | last_channel = in_channel
182 | for out_channel in mlp:
183 | self.mlp_convs.append(nn.Conv2d(last_channel, out_channel, 1))
184 | self.mlp_bns.append(nn.BatchNorm2d(out_channel))
185 | last_channel = out_channel
186 | self.group_all = group_all
187 |
188 | def forward(self, xyz, points):
189 | """
190 | Input:
191 | xyz: input points position data, [B, C, N]
192 | points: input points data, [B, D, N]
193 | Return:
194 | new_xyz: sampled points position data, [B, C, S]
195 | new_points_concat: sample points feature data, [B, D', S]
196 | """
197 | xyz = xyz.permute(0, 2, 1)
198 | if points is not None:
199 | points = points.permute(0, 2, 1)
200 |
201 | if self.group_all:
202 | new_xyz, new_points = sample_and_group_all(xyz, points)
203 | else:
204 | new_xyz, new_points = sample_and_group(self.npoint, self.radius, self.nsample, xyz, points)
205 | # new_xyz: sampled points position data, [B, npoint, C]
206 | # new_points: sampled points data, [B, npoint, nsample, C+D]
207 | new_points = new_points.permute(0, 3, 2, 1) # [B, C+D, nsample, npoint]
208 | for i, conv in enumerate(self.mlp_convs):
209 | bn = self.mlp_bns[i]
210 | new_points = F.relu(bn(conv(new_points)))
211 |
212 | new_points = torch.max(new_points, 2)[0]
213 | new_xyz = new_xyz.permute(0, 2, 1)
214 | return new_xyz, new_points
215 |
216 |
217 | class PointNetSetAbstractionMsg(nn.Module):
218 | def __init__(self, npoint, radius_list, nsample_list, in_channel, mlp_list):
219 | super(PointNetSetAbstractionMsg, self).__init__()
220 | self.npoint = npoint
221 | self.radius_list = radius_list
222 | self.nsample_list = nsample_list
223 | self.conv_blocks = nn.ModuleList()
224 | self.bn_blocks = nn.ModuleList()
225 | for i in range(len(mlp_list)):
226 | convs = nn.ModuleList()
227 | bns = nn.ModuleList()
228 | last_channel = in_channel + 3
229 | for out_channel in mlp_list[i]:
230 | convs.append(nn.Conv2d(last_channel, out_channel, 1))
231 | bns.append(nn.BatchNorm2d(out_channel))
232 | last_channel = out_channel
233 | self.conv_blocks.append(convs)
234 | self.bn_blocks.append(bns)
235 |
236 | def forward(self, xyz, points):
237 | """
238 | Input:
239 | xyz: input points position data, [B, C, N]
240 | points: input points data, [B, D, N]
241 | Return:
242 | new_xyz: sampled points position data, [B, C, S]
243 | new_points_concat: sample points feature data, [B, D', S]
244 | """
245 | xyz = xyz.permute(0, 2, 1)
246 | if points is not None:
247 | points = points.permute(0, 2, 1)
248 |
249 | B, N, C = xyz.shape
250 | S = self.npoint
251 | new_xyz = index_points(xyz, farthest_point_sample(xyz, S))
252 | new_points_list = []
253 | for i, radius in enumerate(self.radius_list):
254 | K = self.nsample_list[i]
255 | group_idx = query_ball_point(radius, K, xyz, new_xyz)
256 | grouped_xyz = index_points(xyz, group_idx)
257 | grouped_xyz -= new_xyz.view(B, S, 1, C)
258 | if points is not None:
259 | grouped_points = index_points(points, group_idx)
260 | grouped_points = torch.cat([grouped_points, grouped_xyz], dim=-1)
261 | else:
262 | grouped_points = grouped_xyz
263 |
264 | grouped_points = grouped_points.permute(0, 3, 2, 1) # [B, D, K, S]
265 | for j in range(len(self.conv_blocks[i])):
266 | conv = self.conv_blocks[i][j]
267 | bn = self.bn_blocks[i][j]
268 | grouped_points = F.relu(bn(conv(grouped_points)))
269 | new_points = torch.max(grouped_points, 2)[0] # [B, D', S]
270 | new_points_list.append(new_points)
271 |
272 | new_xyz = new_xyz.permute(0, 2, 1)
273 | new_points_concat = torch.cat(new_points_list, dim=1)
274 | return new_xyz, new_points_concat
275 |
276 |
277 | class PointNetFeaturePropagation(nn.Module):
278 | def __init__(self, in_channel, mlp):
279 | super(PointNetFeaturePropagation, self).__init__()
280 | self.mlp_convs = nn.ModuleList()
281 | self.mlp_bns = nn.ModuleList()
282 | last_channel = in_channel
283 | for out_channel in mlp:
284 | self.mlp_convs.append(nn.Conv1d(last_channel, out_channel, 1))
285 | self.mlp_bns.append(nn.BatchNorm1d(out_channel))
286 | last_channel = out_channel
287 |
288 | def forward(self, xyz1, xyz2, points1, points2):
289 | """
290 | Input:
291 | xyz1: input points position data, [B, C, N]
292 | xyz2: sampled input points position data, [B, C, S]
293 | points1: input points data, [B, D, N]
294 | points2: input points data, [B, D, S]
295 | Return:
296 | new_points: upsampled points data, [B, D', N]
297 | """
298 | xyz1 = xyz1.permute(0, 2, 1)
299 | xyz2 = xyz2.permute(0, 2, 1)
300 |
301 | points2 = points2.permute(0, 2, 1)
302 | B, N, C = xyz1.shape
303 | _, S, _ = xyz2.shape
304 |
305 | if S == 1:
306 | interpolated_points = points2.repeat(1, N, 1)
307 | else:
308 | dists = square_distance(xyz1, xyz2)
309 | dists, idx = dists.sort(dim=-1)
310 | dists, idx = dists[:, :, :3], idx[:, :, :3] # [B, N, 3]
311 |
312 | dist_recip = 1.0 / (dists + 1e-8)
313 | norm = torch.sum(dist_recip, dim=2, keepdim=True)
314 | weight = dist_recip / norm
315 | interpolated_points = torch.sum(index_points(points2, idx) * weight.view(B, N, 3, 1), dim=2)
316 |
317 | if points1 is not None:
318 | points1 = points1.permute(0, 2, 1)
319 | new_points = torch.cat([points1, interpolated_points], dim=-1)
320 | else:
321 | new_points = interpolated_points
322 |
323 | new_points = new_points.permute(0, 2, 1)
324 | for i, conv in enumerate(self.mlp_convs):
325 | bn = self.mlp_bns[i]
326 | new_points = F.relu(bn(conv(new_points)))
327 | return new_points
328 |
--------------------------------------------------------------------------------
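The set-abstraction layer above expects channel-first tensors. Below is a minimal shape-check sketch (not part of the repository; the import path, batch size and channel counts are illustrative assumptions):

```python
# Minimal shape-check sketch for PointNetSetAbstraction (illustrative, not repo code).
# The import path assumes the package layout shown in this repository.
import torch
from drl_implementation.agent.utils.pointnet_2_utils import PointNetSetAbstraction

B, N = 2, 1024                       # batch size, number of input points
xyz = torch.rand(B, 3, N)            # point coordinates, [B, C, N] with C = 3
points = torch.rand(B, 6, N)         # extra per-point features, [B, D, N] with D = 6

# in_channel = D + 3 because the grouped features concatenate the relative xyz offsets
sa = PointNetSetAbstraction(npoint=128, radius=0.2, nsample=32,
                            in_channel=6 + 3, mlp=[64, 64, 128], group_all=False)
sa.eval()                            # use BatchNorm running stats for a pure shape check
with torch.no_grad():
    new_xyz, new_points = sa(xyz, points)
print(new_xyz.shape)                 # torch.Size([2, 3, 128])   -> [B, C, S]
print(new_points.shape)              # torch.Size([2, 128, 128]) -> [B, D', S], D' = mlp[-1]
```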
/drl_implementation/agent/utils/pointnet_utils.py:
--------------------------------------------------------------------------------
1 | """
2 | https://github.com/yanx27/Pointnet_Pointnet2_pytorch/blob/master/models/pointnet2_utils.py
3 | @article{Pytorch_Pointnet_Pointnet2,
4 | Author = {Xu Yan},
5 | Title = {Pointnet/Pointnet++ Pytorch},
6 | Journal = {https://github.com/yanx27/Pointnet_Pointnet2_pytorch},
7 | Year = {2019}
8 | }
9 | """
10 | import torch
11 | import torch.nn as nn
12 | import torch.nn.parallel
13 | import torch.utils.data
14 | from torch.autograd import Variable
15 | import numpy as np
16 | import torch.nn.functional as F
17 |
18 |
19 | class STN3d(nn.Module):
20 | def __init__(self, channel):
21 | super(STN3d, self).__init__()
22 | self.conv1 = torch.nn.Conv1d(channel, 64, 1)
23 | self.conv2 = torch.nn.Conv1d(64, 128, 1)
24 | self.conv3 = torch.nn.Conv1d(128, 1024, 1)
25 | self.fc1 = nn.Linear(1024, 512)
26 | self.fc2 = nn.Linear(512, 256)
27 | self.fc3 = nn.Linear(256, 9)
28 | self.relu = nn.ReLU()
29 |
30 | self.bn1 = nn.BatchNorm1d(64)
31 | self.bn2 = nn.BatchNorm1d(128)
32 | self.bn3 = nn.BatchNorm1d(1024)
33 | self.bn4 = nn.BatchNorm1d(512)
34 | self.bn5 = nn.BatchNorm1d(256)
35 |
36 | def forward(self, x):
37 | batchsize = x.size()[0]
38 | x = F.relu(self.bn1(self.conv1(x)))
39 | x = F.relu(self.bn2(self.conv2(x)))
40 | x = F.relu(self.bn3(self.conv3(x)))
41 | x = torch.max(x, 2, keepdim=True)[0]
42 | x = x.view(-1, 1024)
43 |
44 | x = F.relu(self.bn4(self.fc1(x)))
45 | x = F.relu(self.bn5(self.fc2(x)))
46 | x = self.fc3(x)
47 |
48 | iden = Variable(torch.from_numpy(np.array([1, 0, 0, 0, 1, 0, 0, 0, 1]).astype(np.float32))).view(1, 9).repeat(
49 | batchsize, 1)
50 | if x.is_cuda:
51 | iden = iden.cuda()
52 | x = x + iden
53 | x = x.view(-1, 3, 3)
54 | return x
55 |
56 |
57 | class STNkd(nn.Module):
58 | def __init__(self, k=64):
59 | super(STNkd, self).__init__()
60 | self.conv1 = torch.nn.Conv1d(k, 64, 1)
61 | self.conv2 = torch.nn.Conv1d(64, 128, 1)
62 | self.conv3 = torch.nn.Conv1d(128, 1024, 1)
63 | self.fc1 = nn.Linear(1024, 512)
64 | self.fc2 = nn.Linear(512, 256)
65 | self.fc3 = nn.Linear(256, k * k)
66 | self.relu = nn.ReLU()
67 |
68 | self.bn1 = nn.BatchNorm1d(64)
69 | self.bn2 = nn.BatchNorm1d(128)
70 | self.bn3 = nn.BatchNorm1d(1024)
71 | self.bn4 = nn.BatchNorm1d(512)
72 | self.bn5 = nn.BatchNorm1d(256)
73 |
74 | self.k = k
75 |
76 | def forward(self, x):
77 | batchsize = x.size()[0]
78 | x = F.relu(self.bn1(self.conv1(x)))
79 | x = F.relu(self.bn2(self.conv2(x)))
80 | x = F.relu(self.bn3(self.conv3(x)))
81 | x = torch.max(x, 2, keepdim=True)[0]
82 | x = x.view(-1, 1024)
83 |
84 | x = F.relu(self.bn4(self.fc1(x)))
85 | x = F.relu(self.bn5(self.fc2(x)))
86 | x = self.fc3(x)
87 |
88 | iden = Variable(torch.from_numpy(np.eye(self.k).flatten().astype(np.float32))).view(1, self.k * self.k).repeat(
89 | batchsize, 1)
90 | if x.is_cuda:
91 | iden = iden.cuda()
92 | x = x + iden
93 | x = x.view(-1, self.k, self.k)
94 | return x
95 |
96 |
97 | class PointNetEncoder(nn.Module):
98 | def __init__(self, global_feat=True, feature_transform=False, channel=3):
99 | super(PointNetEncoder, self).__init__()
100 | self.stn = STN3d(channel)
101 | self.conv1 = torch.nn.Conv1d(channel, 64, 1)
102 | self.conv2 = torch.nn.Conv1d(64, 128, 1)
103 | self.conv3 = torch.nn.Conv1d(128, 1024, 1)
104 | self.bn1 = nn.BatchNorm1d(64)
105 | self.bn2 = nn.BatchNorm1d(128)
106 | self.bn3 = nn.BatchNorm1d(1024)
107 | self.global_feat = global_feat
108 | self.feature_transform = feature_transform
109 | if self.feature_transform:
110 | self.fstn = STNkd(k=64)
111 |
112 | def forward(self, x):
113 | B, D, N = x.size()
114 | trans = self.stn(x)
115 | x = x.transpose(2, 1)
116 | if D > 3:
117 | feature = x[:, :, 3:]
118 | x = x[:, :, :3]
119 | x = torch.bmm(x, trans)
120 | if D > 3:
121 | x = torch.cat([x, feature], dim=2)
122 | x = x.transpose(2, 1)
123 | x = F.relu(self.bn1(self.conv1(x)))
124 |
125 | if self.feature_transform:
126 | trans_feat = self.fstn(x)
127 | x = x.transpose(2, 1)
128 | x = torch.bmm(x, trans_feat)
129 | x = x.transpose(2, 1)
130 | else:
131 | trans_feat = None
132 |
133 | pointfeat = x
134 | x = F.relu(self.bn2(self.conv2(x)))
135 | x = self.bn3(self.conv3(x))
136 | x = torch.max(x, 2, keepdim=True)[0]
137 | x = x.view(-1, 1024)
138 | if self.global_feat:
139 | return x, trans, trans_feat
140 | else:
141 | x = x.view(-1, 1024, 1).repeat(1, 1, N)
142 | return torch.cat([x, pointfeat], 1), trans, trans_feat
143 |
144 |
145 | def feature_transform_reguliarzer(trans):
146 | d = trans.size()[1]
147 | I = torch.eye(d)[None, :, :]
148 | if trans.is_cuda:
149 | I = I.cuda()
150 | loss = torch.mean(torch.norm(torch.bmm(trans, trans.transpose(2, 1)) - I, dim=(1, 2)))
151 | return loss
--------------------------------------------------------------------------------
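PointNetEncoder above returns a 1024-dimensional global feature together with the learned input and feature transforms. A short usage sketch (not part of the repository; sizes and import path are illustrative assumptions):

```python
# Illustrative usage sketch for PointNetEncoder (not repo code).
import torch
from drl_implementation.agent.utils.pointnet_utils import (
    PointNetEncoder, feature_transform_reguliarzer)

B, N = 4, 512                                # batch size, points per cloud
cloud = torch.rand(B, 3, N)                  # xyz-only input, so channel = 3

encoder = PointNetEncoder(global_feat=True, feature_transform=True, channel=3)
global_feat, trans, trans_feat = encoder(cloud)
print(global_feat.shape)                     # torch.Size([4, 1024])
print(trans.shape, trans_feat.shape)         # torch.Size([4, 3, 3]) torch.Size([4, 64, 64])

# The usual PointNet auxiliary loss keeps the 64x64 feature transform close to orthogonal.
aux_loss = feature_transform_reguliarzer(trans_feat)
print(aux_loss.item())
```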
/drl_implementation/agent/utils/segment_tree.py:
--------------------------------------------------------------------------------
1 | """
2 | The segment tree implementation from OpenAI baseline GitHub repo:
3 | https://github.com/openai/baselines/blob/ea25b9e8b234e6ee1bca43083f8f3cf974143998/baselines/common/segment_tree.py
4 | This is used in the prioritized replay buffer.
5 | """
6 | import operator
7 |
8 |
9 | class SegmentTree(object):
10 | def __init__(self, capacity, operation, neutral_element):
11 | """Build a Segment Tree data structure.
12 | https://en.wikipedia.org/wiki/Segment_tree
13 | Can be used as regular array, but with two
14 | important differences:
15 | a) setting item's value is slightly slower.
16 | It is O(lg capacity) instead of O(1).
17 | b) user has access to an efficient ( O(log segment size) )
18 | `reduce` operation which reduces `operation` over
19 | a contiguous subsequence of items in the array.
20 | Parameters
21 | ----------
22 | capacity: int
23 | Total size of the array - must be a power of two.
24 | operation: lambda obj, obj -> obj
25 | an operation for combining elements (e.g. sum, max);
26 | it must be associative over the set of possible
27 | values for array elements
28 | neutral_element: obj
29 | neutral element for the operation above, e.g. float('-inf')
30 | for max and 0 for sum.
31 | """
32 | assert capacity > 0 and capacity & (capacity - 1) == 0, "capacity must be positive and a power of 2."
33 | self._capacity = capacity
34 | self._value = [neutral_element for _ in range(2 * capacity)]
35 | self._operation = operation
36 |
37 | def _reduce_helper(self, start, end, node, node_start, node_end):
38 | if start == node_start and end == node_end:
39 | return self._value[node]
40 | mid = (node_start + node_end) // 2
41 | if end <= mid:
42 | return self._reduce_helper(start, end, 2 * node, node_start, mid)
43 | else:
44 | if mid + 1 <= start:
45 | return self._reduce_helper(start, end, 2 * node + 1, mid + 1, node_end)
46 | else:
47 | return self._operation(
48 | self._reduce_helper(start, mid, 2 * node, node_start, mid),
49 | self._reduce_helper(mid + 1, end, 2 * node + 1, mid + 1, node_end)
50 | )
51 |
52 | def reduce(self, start=0, end=None):
53 | """Returns result of applying `self.operation`
54 | to a contiguous subsequence of the array.
55 | self.operation(arr[start], operation(arr[start+1], operation(... arr[end])))
56 | Parameters
57 | ----------
58 | start: int
59 | beginning of the subsequence
60 | end: int
61 | end of the subsequence
62 | Returns
63 | -------
64 | reduced: obj
65 | result of reducing self.operation over the specified range of array elements.
66 | """
67 | if end is None:
68 | end = self._capacity
69 | if end < 0:
70 | end += self._capacity
71 | end -= 1
72 | return self._reduce_helper(start, end, 1, 0, self._capacity - 1)
73 |
74 | def __setitem__(self, idx, val):
75 | # index of the leaf
76 | idx += self._capacity
77 | self._value[idx] = val
78 | idx //= 2
79 | while idx >= 1:
80 | self._value[idx] = self._operation(
81 | self._value[2 * idx],
82 | self._value[2 * idx + 1]
83 | )
84 | idx //= 2
85 |
86 | def __getitem__(self, idx):
87 | assert 0 <= idx < self._capacity
88 | return self._value[self._capacity + idx]
89 |
90 |
91 | class SumSegmentTree(SegmentTree):
92 | def __init__(self, capacity):
93 | super(SumSegmentTree, self).__init__(
94 | capacity=capacity,
95 | operation=operator.add,
96 | neutral_element=0.0
97 | )
98 |
99 | def sum(self, start=0, end=None):
100 | """Returns arr[start] + ... + arr[end]"""
101 | return super(SumSegmentTree, self).reduce(start, end)
102 |
103 | def find_prefixsum_idx(self, prefixsum):
104 | """Find the highest index `i` in the array such that
105 | arr[0] + arr[1] + ... + arr[i - 1] <= prefixsum.
106 | If array values are probabilities, this function
107 | allows sampling indexes according to the discrete
108 | probability distribution efficiently.
109 | Parameters
110 | ----------
111 | prefixsum: float
112 | upperbound on the sum of array prefix
113 | Returns
114 | -------
115 | idx: int
116 | highest index satisfying the prefixsum constraint
117 | """
118 | assert 0 <= prefixsum <= self.sum() + 1e-5
119 | idx = 1
120 | while idx < self._capacity: # while non-leaf
121 | if self._value[2 * idx] > prefixsum:
122 | idx = 2 * idx
123 | else:
124 | prefixsum -= self._value[2 * idx]
125 | idx = 2 * idx + 1
126 | return idx - self._capacity
127 |
128 |
129 | class MinSegmentTree(SegmentTree):
130 | def __init__(self, capacity):
131 | super(MinSegmentTree, self).__init__(
132 | capacity=capacity,
133 | operation=min,
134 | neutral_element=float('inf')
135 | )
136 |
137 | def min(self, start=0, end=None):
138 | """Returns min(arr[start], ..., arr[end])"""
139 |
140 | return super(MinSegmentTree, self).reduce(start, end)
141 |
--------------------------------------------------------------------------------
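The two concrete trees above are the building blocks of proportional prioritised sampling: priorities live in a SumSegmentTree, and `find_prefixsum_idx` maps a uniform draw over the total priority mass back to a transition index. A small sketch (not part of the repository; the priority values are illustrative):

```python
# Illustrative sketch of proportional prioritised sampling with the trees defined above.
# Assumes SumSegmentTree and MinSegmentTree (see the file above) are in scope.
import random

capacity = 8                               # must be a power of two
sum_tree = SumSegmentTree(capacity)
min_tree = MinSegmentTree(capacity)

priorities = [0.5, 1.0, 0.1, 2.0, 0.2, 0.7, 1.5, 0.05]
for i, p in enumerate(priorities):
    sum_tree[i] = p
    min_tree[i] = p

# Sample an index proportionally to its priority: draw a uniform prefix-sum mass
# and walk down the tree with find_prefixsum_idx.
mass = random.random() * sum_tree.sum()    # total priority mass is 6.05 here
idx = sum_tree.find_prefixsum_idx(mass)
print(idx)                                 # higher-priority indices are drawn more often

# The minimum priority is typically used to compute the maximum importance-sampling weight.
print(min_tree.min())                      # 0.05
```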
/drl_implementation/examples/KukaPushPHER.py:
--------------------------------------------------------------------------------
1 | # this example runs a goal-conditioned DDPG agent with prioritised + hindsight experience replay
2 | # on the Push task from the pybullet-multigoal-gym package
3 |
4 | import os
5 | import gym
6 | import pybullet_multigoal_gym as pmg
7 | from drl_implementation import GoalConditionedSAC, GoalConditionedDDPG
8 | # you can replace the agent instantiation by one of the two above, with the proper params
9 |
10 | ddpg_params = {
11 | 'hindsight': True,
12 | 'her_sampling_strategy': 'future',
13 | 'prioritised': True,
14 | 'memory_capacity': int(1e6),
15 | 'actor_learning_rate': 0.001,
16 | 'critic_learning_rate': 0.001,
17 | 'Q_weight_decay': 0.0,
18 | 'update_interval': 1,
19 | 'batch_size': 128,
20 | 'optimization_steps': 40,
21 | 'tau': 0.05,
22 | 'discount_factor': 0.98,
23 | 'clip_value': 50,
24 | 'discard_time_limit': True,
25 | 'terminate_on_achieve': False,
26 | 'observation_normalization': True,
27 |
28 | 'random_action_chance': 0.2,
29 | 'noise_deviation': 0.05,
30 |
31 | 'training_epochs': 101,
32 | 'training_cycles': 50,
33 | 'training_episodes': 16,
34 | 'testing_gap': 1,
35 | 'testing_episodes': 30,
36 | 'saving_gap': 25,
37 | }
38 | # sac_params = {
39 | # 'hindsight': True,
40 | # 'her_sampling_strategy': 'future',
41 | # 'prioritised': True,
42 | # 'memory_capacity': int(1e6),
43 | # 'actor_learning_rate': 0.001,
44 | # 'critic_learning_rate': 0.001,
45 | # 'update_interval': 1,
46 | # 'batch_size': 128,
47 | # 'optimization_steps': 40,
48 | # 'tau': 0.005,
49 | # 'clip_value': 50,
50 | # 'discount_factor': 0.98,
51 | # 'discard_time_limit': True,
52 | # 'terminate_on_achieve': False,
53 | # 'observation_normalization': True,
54 | #
55 | # 'alpha': 0.5,
56 | # 'actor_update_interval': 1,
57 | # 'critic_target_update_interval': 1,
58 | #
59 | # 'training_epochs': 101,
60 | # 'training_cycles': 50,
61 | # 'training_episodes': 16,
62 | # 'testing_gap': 1,
63 | # 'testing_episodes': 30,
64 | # 'saving_gap': 25,
65 | # }
66 | seeds = [11, 22, 33, 44]
67 | seed_returns = []
68 | seed_success_rates = []
69 | path = os.path.dirname(os.path.realpath(__file__))
70 | path = os.path.join(path, 'PushPHER')
71 |
72 | for seed in seeds:
73 |
74 | env = pmg.make_env(task='push',
75 | gripper='parallel_jaw',
76 | render=False,
77 | binary_reward=True,
78 | max_episode_steps=50,
79 | image_observation=False,
80 | depth_image=False,
81 | goal_image=False)
82 | # use the render env for visualization
83 |
84 | seed_path = path + '/seed'+str(seed)
85 |
86 | agent = GoalConditionedDDPG(algo_params=ddpg_params, env=env, path=seed_path, seed=seed)
87 | agent.run(test=False)
88 | # the sleep argument pauses the rendering briefly at every step, useful for slowing down the visualization
89 | # agent.run(test=True, load_network_ep=50, sleep=0.05)
90 | seed_returns.append(agent.statistic_dict['epoch_test_return'])
91 | seed_success_rates.append(agent.statistic_dict['epoch_test_success_rate'])
92 |
--------------------------------------------------------------------------------
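The script above collects per-seed test statistics but leaves their aggregation to the user. A minimal aggregation sketch (not part of the script, and not using the repo's own plot utility; it assumes every seed produced equally long success-rate arrays):

```python
# Illustrative aggregation sketch: mean +/- one standard deviation over seeds per epoch.
import os
import numpy as np
import matplotlib.pyplot as plt

rates = np.array(seed_success_rates)            # shape: [num_seeds, num_epochs]
mean, std = rates.mean(axis=0), rates.std(axis=0)

plt.plot(mean, label='mean over seeds')
plt.fill_between(np.arange(len(mean)), mean - std, mean + std, alpha=0.3)
plt.xlabel('epoch')
plt.ylabel('test success rate')
plt.legend()
plt.savefig(os.path.join(path, 'push_success_rate.png'))    # 'path' as defined in the script
```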
/drl_implementation/examples/PendulumDDPG.py:
--------------------------------------------------------------------------------
1 | # this example runs a DDPG agent on an inverted pendulum swing-up task from pybullet-gym
2 |
3 | import os
4 | import pybullet_envs
5 | from drl_implementation import DDPG, SAC, TD3
6 | # you can replace the agent instantiation by one of the three above, with the proper params
7 |
8 | # td3_params = {
9 | # 'prioritised': True,
10 | # 'memory_capacity': int(1e6),
11 | # 'actor_learning_rate': 0.0003,
12 | # 'critic_learning_rate': 0.0003,
13 | # 'batch_size': 256,
14 | # 'optimization_steps': 1,
15 | # 'tau': 0.005,
16 | # 'discount_factor': 0.99,
17 | # 'discard_time_limit': True,
18 | # 'warmup_step': 2500,
19 | # 'target_noise': 0.2,
20 | # 'noise_clip': 0.5,
21 | # 'update_interval': 1,
22 | # 'actor_update_interval': 2,
23 | # 'observation_normalization': False,
24 | #
25 | # 'training_episodes': 101,
26 | # 'testing_gap': 10,
27 | # 'testing_episodes': 10,
28 | # 'saving_gap': 50,
29 | # }
30 | # sac_params = {
31 | # 'prioritised': True,
32 | # 'memory_capacity': int(1e6),
33 | # 'actor_learning_rate': 0.0003,
34 | # 'critic_learning_rate': 0.0003,
35 | # 'update_interval': 1,
36 | # 'batch_size': 256,
37 | # 'optimization_steps': 1,
38 | # 'tau': 0.005,
39 | # 'discount_factor': 0.99,
40 | # 'discard_time_limit': True,
41 | # 'observation_normalization': False,
42 | #
43 | # 'alpha': 0.5,
44 | # 'actor_update_interval': 1,
45 | # 'critic_target_update_interval': 1,
46 | # 'warmup_step': 1000,
47 | #
48 | # 'training_episodes': 101,
49 | # 'testing_gap': 10,
50 | # 'testing_episodes': 10,
51 | # 'saving_gap': 50,
52 | # }
53 | ddpg_params = {
54 | 'prioritised': True,
55 | 'memory_capacity': int(1e6),
56 | 'actor_learning_rate': 0.001,
57 | 'critic_learning_rate': 0.001,
58 | 'Q_weight_decay': 0.0,
59 | 'update_interval': 1,
60 | 'batch_size': 100,
61 | 'optimization_steps': 1,
62 | 'tau': 0.005,
63 | 'discount_factor': 0.99,
64 | 'discard_time_limit': True,
65 | 'warmup_step': 2500,
66 | 'observation_normalization': False,
67 |
68 | 'training_episodes': 101,
69 | 'testing_gap': 10,
70 | 'testing_episodes': 10,
71 | 'saving_gap': 50,
72 | }
73 | seeds = [11, 22, 33, 44, 55, 66]
74 | seed_returns = []
75 | path = os.path.dirname(os.path.realpath(__file__))
76 | for seed in seeds:
77 |
78 | env = pybullet_envs.make("InvertedPendulumSwingupBulletEnv-v0")
79 | # call render() before training to visualize (pybullet-gym-specific)
80 | # env.render()
81 | seed_path = path + '/seed'+str(seed)
82 |
83 | agent = DDPG(algo_params=ddpg_params, env=env, path=seed_path, seed=seed)
84 | agent.run(test=False)
85 | # the sleep argument pauses the rendering briefly at every env step, useful for slowing down the visualization
86 | # agent.run(test=True, load_network_ep=50, sleep=0.05)
87 | seed_returns.append(agent.statistic_dict['episode_return'])
88 | del env, agent
89 |
--------------------------------------------------------------------------------
/drl_implementation/examples/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/IanYangChina/DRL_Implementation/38812c9647e4bec8359908be444dff19b90257d5/drl_implementation/examples/__init__.py
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
1 | matplotlib >= 3.3.2
2 | numpy >= 1.18
3 | torch >= 1.3.0
4 | # json is part of the Python standard library and does not need to be installed via pip
5 |
--------------------------------------------------------------------------------
/setup.py:
--------------------------------------------------------------------------------
1 | from setuptools import setup, find_packages
2 |
3 |
4 | packages = find_packages()
5 | # Ensure that we don't pollute the global namespace.
6 | for p in packages:
7 | assert p == 'drl_implementation' or p.startswith('drl_implementation.')
8 |
9 | setup(name='drl-implementation',
10 | version='1.0.0',
11 | description='A collection of deep reinforcement learning algorithms for fast implementation',
12 | url='https://github.com/IanYangChina/DRL_Implementation',
13 | author='XintongYang',
14 | author_email='YangX66@cardiff.ac.uk',
15 | packages=packages,
16 | package_dir={'drl_implementation': 'drl_implementation'},
17 | package_data={'drl_implementation': [
18 | 'examples/*.md',
19 | ]},
20 | classifiers=[
21 | "Programming Language :: Python :: 3",
22 | "Operating System :: OS Independent",
23 | ])
24 |
--------------------------------------------------------------------------------
/src/README.md:
--------------------------------------------------------------------------------
1 | #### Some tips
2 |
3 | To run the FetchReach-v1 environment, you will need to install gym, MuJoCo and mujoco-py.
4 | Here are the links:
5 |
6 | [OpenAI Gym](https://github.com/openai/gym)
7 | [Get a free Mujoco trial license, here](https://www.roboti.us/license.html)
8 | [Install Mujoco and Mujoco.py by OpenAI, here](https://github.com/openai/mujoco-py)
9 |
10 | Unfortunately, MuJoCo seems to be hard to install on Windows systems, but you might find some help on [this
11 | page](https://github.com/openai/mujoco-py/issues/253). However, I still suggest running this code on Linux.
12 |
13 | The [pybullet-multigoal-gym](https://github.com/IanYangChina/pybullet_multigoal_gym) environment is a migration of the OpenAI Gym multi-goal environments, developed by the author
14 | of this repo. It is free to use, as it is based on PyBullet. You will need it to run the PyBullet experiments.
15 |
16 | #### Some notes I made when I was implementing HER:
17 | * The original paper uses multiple CPU workers to collect data, whereas this implementation uses a single one. Multi-worker
18 | data collection might be added in the future.
19 | * The actor and critic networks have 3 hidden layers, each with 256 units and ReLU activation; the critic output has no activation,
20 | while the actor output uses tanh followed by rescaling.
21 | * Observation and goal are concatenated and fed into both networks.
22 | * The original paper scales observations, goals and actions into [-5, 5] (we don't need rescaling with the Gym environments),
23 | and normalizes them to zero mean and unit standard deviation. The means and standard deviations are computed from encountered data.
24 | * The training process has 200 epochs, each with 50 cycles; each cycle runs 16 episodes and then performs 40 optimization
25 | steps. The total number of episodes is therefore 200\*50\*16=160000, each with 50 time steps (i.e. after every 16 episodes,
26 | 40 optimization steps are performed).
27 | * Each optimization step uses a mini-batch of size 128, uniformly sampled from a replay buffer with 10^6 capacity;
28 | the target networks are updated softly with tau=0.05.
29 | * Adam is used with a learning rate of 0.001, the discount factor is 0.98, and the target value is clipped to
30 | [-1/(1-0.98), 0], that is [-50, 0]. I think this is based on the 50 time steps they set for each episode, over which
31 | an agent's return is at worst -50 (a short sketch follows these notes).
32 | * For exploration, they select a random action from a uniform distribution with 20% chance; with the remaining 80% chance,
33 | they add Gaussian noise to the increment along each axis, with standard deviation equal to 5% of the max bound (see the sketch below).
34 |
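A minimal sketch of the behavioural policy described in the note above (illustrative only; function and parameter names are assumptions and do not reproduce the repo's exploration_strategy.py API):

```python
# Illustrative sketch: with 20% chance take a uniformly random action, otherwise add
# Gaussian noise (std = 5% of the action bound) to the greedy action and clip back into range.
import numpy as np

def noisy_action(greedy_action, action_max, rng,
                 random_chance=0.2, noise_fraction=0.05):
    if rng.random() < random_chance:
        return rng.uniform(-action_max, action_max, size=greedy_action.shape)
    noise = rng.normal(0.0, noise_fraction * action_max, size=greedy_action.shape)
    return np.clip(greedy_action + noise, -action_max, action_max)

rng = np.random.default_rng(0)
print(noisy_action(np.zeros(4), action_max=1.0, rng=rng))
```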
35 | * The SAC agent doesn't need a behavioural policy.
36 | * The goal-conditioned **SAC** agent doesn't need value clipping.
37 | * Prioritised replay is supported.
38 |
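A short sketch of the target-value clipping mentioned above (illustrative values):

```python
# With per-step rewards in {-1, 0}, the discounted return is bounded by
# [-1 / (1 - gamma), 0] = [-50, 0] for gamma = 0.98, so the bootstrapped critic
# target can be clamped to that range.
import torch

gamma = 0.98
clip_low = -1.0 / (1.0 - gamma)                  # -50.0

rewards = torch.tensor([-1.0, -1.0, 0.0])        # illustrative batch of rewards
next_q = torch.tensor([-48.0, -55.0, -3.0])      # illustrative target-critic values
target_q = torch.clamp(rewards + gamma * next_q, min=clip_low, max=0.0)
print(target_q)                                  # tensor([-48.0400, -50.0000, -2.9400])
```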
39 | #### Results on the task 'Push'
40 |
41 |
--------------------------------------------------------------------------------
/src/figs.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/IanYangChina/DRL_Implementation/38812c9647e4bec8359908be444dff19b90257d5/src/figs.png
--------------------------------------------------------------------------------
/src/pendulum.gif:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/IanYangChina/DRL_Implementation/38812c9647e4bec8359908be444dff19b90257d5/src/pendulum.gif
--------------------------------------------------------------------------------
/src/push.gif:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/IanYangChina/DRL_Implementation/38812c9647e4bec8359908be444dff19b90257d5/src/push.gif
--------------------------------------------------------------------------------
/tests/test.py:
--------------------------------------------------------------------------------
1 | from collections import namedtuple
2 |
3 | t = namedtuple('t', ['a', 'b'])
4 |
5 | print('b' in t._fields)
--------------------------------------------------------------------------------