├── LICENSE.md
├── README.md
├── __init__.py
├── agents
│   ├── __init__.py
│   ├── a2c.py
│   ├── common
│   │   ├── __init__.py
│   │   ├── buffers.py
│   │   ├── networks.py
│   │   └── utils.py
│   ├── ddpg.py
│   ├── dqn.py
│   ├── ppo.py
│   ├── sac.py
│   ├── td3.py
│   ├── trpo.py
│   └── vpg.py
├── results
│   └── graphs
│       ├── ant.png
│       ├── halfcheetah.png
│       └── humanoid.png
├── run_cartpole.py
├── run_mujoco.py
└── run_pendulum.py
/LICENSE.md:
--------------------------------------------------------------------------------
1 | MIT License
2 |
3 | Copyright (c) 2021 Dongmin Lee
4 |
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 |
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 |
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Deep Reinforcement Learning (DRL) Algorithms with PyTorch
2 |
3 | This repository contains PyTorch implementations of deep reinforcement learning algorithms. **The repository will soon be updated to include the PyBullet environments!**
4 |
5 | ## Algorithms Implemented
6 |
7 | 1. Deep Q-Network (DQN) ([V. Mnih et al. 2015](https://storage.googleapis.com/deepmind-media/dqn/DQNNaturePaper.pdf))
8 | 2. Double DQN (DDQN) ([H. Van Hasselt et al. 2015](https://arxiv.org/abs/1509.06461))
9 | 3. Advantage Actor Critic (A2C)
10 | 4. Vanilla Policy Gradient (VPG)
11 | 5. Natural Policy Gradient (NPG) ([S. Kakade et al. 2002](http://papers.nips.cc/paper/2073-a-natural-policy-gradient.pdf))
12 | 6. Trust Region Policy Optimization (TRPO) ([J. Schulman et al. 2015](https://arxiv.org/abs/1502.05477))
13 | 7. Proximal Policy Optimization (PPO) ([J. Schulman et al. 2017](https://arxiv.org/abs/1707.06347))
14 | 8. Deep Deterministic Policy Gradient (DDPG) ([T. Lillicrap et al. 2015](https://arxiv.org/abs/1509.02971))
15 | 9. Twin Delayed DDPG (TD3) ([S. Fujimoto et al. 2018](https://arxiv.org/abs/1802.09477))
16 | 10. Soft Actor-Critic (SAC) ([T. Haarnoja et al. 2018](https://arxiv.org/abs/1801.01290))
17 | 11. SAC with automatic entropy adjustment (SAC-AEA) ([T. Haarnoja et al. 2018](https://arxiv.org/abs/1812.05905))
18 |
19 | ## Environments Implemented
20 |
21 | 1. Classic control environments (CartPole-v1, Pendulum-v0, etc.) (as described [here](https://gym.openai.com/envs/#classic_control))
22 | 2. MuJoCo environments (Hopper-v2, HalfCheetah-v2, Ant-v2, Humanoid-v2, etc.) (as described [here](https://gym.openai.com/envs/#mujoco))
23 | 3. **PyBullet environments (HopperBulletEnv-v0, HalfCheetahBulletEnv-v0, AntBulletEnv-v0, HumanoidDeepMimicWalkBulletEnv-v1, etc.)** (as described [here](https://github.com/bulletphysics/bullet3/tree/master/examples/pybullet/gym/pybullet_envs))
24 |
25 | ## Results (MuJoCo, PyBullet)
26 |
27 | ### MuJoCo environments
28 |
29 | #### Hopper-v2
30 |
31 | - Observation space: 11
32 | - Action space: 3
33 |
34 | #### HalfCheetah-v2
35 |
36 | - Observation space: 17
37 | - Action space: 6
38 |
39 | #### Ant-v2
40 |
41 | - Observation space: 111
42 | - Action space: 8
43 |
44 | #### Humanoid-v2
45 |
46 | - Observation space: 376
47 | - Action space: 17
48 |
49 | ### PyBullet environments
50 |
51 | #### HopperBulletEnv-v0
52 |
53 | - Observation space: 15
54 | - Action space: 3
55 |
56 | #### HalfCheetahBulletEnv-v0
57 |
58 | - Observation space: 26
59 | - Action space: 6
60 |
61 | #### AntBulletEnv-v0
62 |
63 | - Observation space: 28
64 | - Action space: 8
65 |
66 | #### HumanoidDeepMimicWalkBulletEnv-v1
67 |
68 | - Observation space: 197
69 | - Action space: 36
70 |
71 | ## Requirements
72 |
73 | - [PyTorch](https://pytorch.org)
74 | - [TensorBoard](https://pytorch.org/docs/stable/tensorboard.html)
75 | - [gym](https://github.com/openai/gym)
76 | - [mujoco-py](https://github.com/openai/mujoco-py)
77 | - [PyBullet](https://pybullet.org/wordpress/)
78 |
79 | ## Usage
80 |
81 | The repository's high-level structure is:
82 |
83 | ├── agents
84 |     │   └── common
85 |     ├── results
86 |     │   ├── data
87 |     │   └── graphs
88 |     └── save_model
89 |
90 | ### 1) To train the agents on the environments
91 |
92 | To train all the different agents on PyBullet environments, follow these steps:
93 |
94 | ```commandline
95 | git clone https://github.com/dongminlee94/deep_rl.git
96 | cd deep_rl
97 | python run_bullet.py
98 | ```
99 |
100 | For other environments, change the last line to `run_cartpole.py`, `run_pendulum.py`, or `run_mujoco.py`.
101 |
102 | If you want to change the configuration of the agents, pass options on the command line, for example:
103 | ```commandline
104 | python run_bullet.py \
105 | --env=HumanoidDeepMimicWalkBulletEnv-v1 \
106 | --algo=sac-aea \
107 | --phase=train \
108 | --render=False \
109 | --load=None \
110 | --seed=0 \
111 | --iterations=200 \
112 | --steps_per_iter=5000 \
113 | --max_step=1000 \
114 | --tensorboard=True \
115 | --gpu_index=0
116 | ```
117 |
118 | ### 2) To watch the learned agents on the above environments
119 |
120 | To watch all the learned agents on PyBullet environments, follow these steps:
121 |
122 | ```commandline
123 | python run_bullet.py \
124 | --env=HumanoidDeepMimicWalkBulletEnv-v1 \
125 | --algo=sac-aea \
126 | --phase=test \
127 | --render=True \
128 | --load=envname_algoname_... \
129 | --seed=0 \
130 | --iterations=200 \
131 | --steps_per_iter=5000 \
132 | --max_step=1000 \
133 | --tensorboard=False \
134 | --gpu_index=0
135 | ```
136 |
137 | Copy the name of the saved model under `save_model/envname_algoname_...` and paste it into the `--load=envname_algoname_...` argument so that the saved model is loaded.
138 |
--------------------------------------------------------------------------------
/__init__.py:
--------------------------------------------------------------------------------
1 | # coding=utf-8
--------------------------------------------------------------------------------
/agents/__init__.py:
--------------------------------------------------------------------------------
1 | # coding=utf-8
--------------------------------------------------------------------------------
/agents/a2c.py:
--------------------------------------------------------------------------------
1 | import numpy as np
2 | import torch
3 | import torch.optim as optim
4 | import torch.nn.functional as F
5 |
6 | from agents.common.networks import *
7 |
8 |
9 | class Agent(object):
10 | """An implementation of the Advantage Actor-Critic (A2C) agent."""
11 |
12 | def __init__(self,
13 | env,
14 | args,
15 | device,
16 | obs_dim,
17 | act_num,
18 | steps=0,
19 | gamma=0.99,
20 | policy_lr=1e-4,
21 | vf_lr=1e-3,
22 | eval_mode=False,
23 | policy_losses=list(),
24 | vf_losses=list(),
25 | logger=dict(),
26 | ):
27 |
28 | self.env = env
29 | self.args = args
30 | self.device = device
31 | self.obs_dim = obs_dim
32 | self.act_num = act_num
33 | self.steps = steps
34 | self.gamma = gamma
35 | self.policy_lr = policy_lr
36 | self.vf_lr = vf_lr
37 | self.eval_mode = eval_mode
38 | self.policy_losses = policy_losses
39 | self.vf_losses = vf_losses
40 | self.logger = logger
41 |
42 | # Policy network
43 | self.policy = CategoricalPolicy(self.obs_dim, self.act_num, activation=torch.tanh).to(self.device)
44 | # Value network
45 | self.vf = MLP(self.obs_dim, 1, activation=torch.tanh).to(self.device)
46 |
47 | # Create optimizers
48 | self.policy_optimizer = optim.Adam(self.policy.parameters(), lr=self.policy_lr)
49 | self.vf_optimizer = optim.Adam(self.vf.parameters(), lr=self.vf_lr)
50 |
51 | def select_action(self, obs):
52 | """Select an action from the set of available actions."""
53 | action, _, log_pi = self.policy(obs)
54 |
55 | # Prediction V(s)
56 | v = self.vf(obs)
57 |
58 | # Add logπ(a|s), V(s) to transition list
59 | self.transition.extend([log_pi, v])
60 | return action.detach().cpu().numpy()
61 |
62 | def train_model(self):
63 | log_pi, v, reward, next_obs, done = self.transition
64 |
65 | # Prediction V(s')
66 | next_v = self.vf(torch.Tensor(next_obs).to(self.device))
67 |
68 | # Target for Q regression
69 | q = reward + self.gamma*(1-done)*next_v
70 |         q = q.to(self.device)
71 |
72 | # Advantage = Q - V
73 | advant = q - v
74 |
75 | if 0: # Check shape of prediction and target
76 | print("q", q.shape)
77 | print("v", v.shape)
78 | print("log_pi", log_pi.shape)
79 |
80 | # A2C losses
81 | policy_loss = -log_pi*advant.detach()
82 | vf_loss = F.mse_loss(v, q.detach())
83 |
84 | # Update value network parameter
85 | self.vf_optimizer.zero_grad()
86 | vf_loss.backward()
87 | self.vf_optimizer.step()
88 |
89 | # Update policy network parameter
90 | self.policy_optimizer.zero_grad()
91 | policy_loss.backward()
92 | self.policy_optimizer.step()
93 |
94 | # Save losses
95 | self.policy_losses.append(policy_loss.item())
96 | self.vf_losses.append(vf_loss.item())
97 |
98 | def run(self, max_step):
99 | step_number = 0
100 | total_reward = 0.
101 |
102 | obs = self.env.reset()
103 | done = False
104 |
105 | # Keep interacting until agent reaches a terminal state.
106 | while not (done or step_number == max_step):
107 | if self.args.render:
108 | self.env.render()
109 |
110 | if self.eval_mode:
111 | _, pi, _ = self.policy(torch.Tensor(obs).to(self.device))
112 | action = pi.argmax().detach().cpu().numpy()
113 | next_obs, reward, done, _ = self.env.step(action)
114 | else:
115 | self.steps += 1
116 |
117 | # Create a transition list
118 | self.transition = []
119 |
120 | # Collect experience (s, a, r, s') using some policy
121 | action = self.select_action(torch.Tensor(obs).to(self.device))
122 | next_obs, reward, done, _ = self.env.step(action)
123 |
124 | # Add (r, s') to transition list
125 | self.transition.extend([reward, next_obs, done])
126 |
127 | self.train_model()
128 |
129 | total_reward += reward
130 | step_number += 1
131 | obs = next_obs
132 |
133 | # Save total average losses
134 | self.logger['LossPi'] = round(np.mean(self.policy_losses), 5)
135 | self.logger['LossV'] = round(np.mean(self.vf_losses), 5)
136 | return step_number, total_reward
137 |
--------------------------------------------------------------------------------
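A minimal numeric sketch of the one-step TD target and advantage computed in `train_model` above; the values are hypothetical and chosen only for illustration.

```python
gamma = 0.99
reward, next_v, v, done = 1.0, 2.0, 1.5, False

# One-step TD target used as the Q estimate: r + gamma * (1 - done) * V(s')
q = reward + gamma * (1 - done) * next_v   # 1.0 + 0.99 * 2.0 = 2.98

# Advantage estimate: A(s, a) = Q(s, a) - V(s)
advantage = q - v                          # 2.98 - 1.5 = 1.48
```
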
/agents/common/__init__.py:
--------------------------------------------------------------------------------
1 | # coding=utf-8
--------------------------------------------------------------------------------
/agents/common/buffers.py:
--------------------------------------------------------------------------------
1 | import numpy as np
2 | import torch
3 |
4 |
5 | class ReplayBuffer(object):
6 | """
7 | A simple FIFO experience replay buffer for agents.
8 | """
9 |
10 | def __init__(self, obs_dim, act_dim, size, device):
11 | self.obs1_buf = np.zeros([size, obs_dim], dtype=np.float32)
12 | self.obs2_buf = np.zeros([size, obs_dim], dtype=np.float32)
13 | self.acts_buf = np.zeros([size, act_dim], dtype=np.float32)
14 | self.rews_buf = np.zeros(size, dtype=np.float32)
15 | self.done_buf = np.zeros(size, dtype=np.float32)
16 | self.ptr, self.size, self.max_size = 0, 0, size
17 | self.device = device
18 |
19 | def add(self, obs, act, rew, next_obs, done):
20 | self.obs1_buf[self.ptr] = obs
21 | self.obs2_buf[self.ptr] = next_obs
22 | self.acts_buf[self.ptr] = act
23 | self.rews_buf[self.ptr] = rew
24 | self.done_buf[self.ptr] = done
25 | self.ptr = (self.ptr+1) % self.max_size
26 | self.size = min(self.size+1, self.max_size)
27 |
28 | def sample(self, batch_size=64):
29 | idxs = np.random.randint(0, self.size, size=batch_size)
30 | return dict(obs1=torch.Tensor(self.obs1_buf[idxs]).to(self.device),
31 | obs2=torch.Tensor(self.obs2_buf[idxs]).to(self.device),
32 | acts=torch.Tensor(self.acts_buf[idxs]).to(self.device),
33 | rews=torch.Tensor(self.rews_buf[idxs]).to(self.device),
34 | done=torch.Tensor(self.done_buf[idxs]).to(self.device))
35 |
36 |
37 | class Buffer(object):
38 | """
39 |     A buffer for storing trajectories experienced by an agent interacting
40 | with the environment.
41 | """
42 |
43 | def __init__(self, obs_dim, act_dim, size, device, gamma=0.99, lam=0.97):
44 | self.obs_buf = np.zeros([size, obs_dim], dtype=np.float32)
45 | self.act_buf = np.zeros([size, act_dim], dtype=np.float32)
46 | self.rew_buf = np.zeros(size, dtype=np.float32)
47 | self.don_buf = np.zeros(size, dtype=np.float32)
48 | self.ret_buf = np.zeros(size, dtype=np.float32)
49 | self.adv_buf = np.zeros(size, dtype=np.float32)
50 | self.v_buf = np.zeros(size, dtype=np.float32)
51 | self.gamma, self.lam = gamma, lam
52 | self.ptr, self.max_size = 0, size
53 | self.device = device
54 |
55 | def add(self, obs, act, rew, don, v):
56 | assert self.ptr < self.max_size # Buffer has to have room so you can store
57 | self.obs_buf[self.ptr] = obs
58 | self.act_buf[self.ptr] = act
59 | self.rew_buf[self.ptr] = rew
60 | self.don_buf[self.ptr] = don
61 | self.v_buf[self.ptr] = v
62 | self.ptr += 1
63 |
64 | def finish_path(self):
65 | previous_v = 0
66 | running_ret = 0
67 | running_adv = 0
68 | for t in reversed(range(len(self.rew_buf))):
69 |             # The next two lines compute rewards-to-go, to be used as targets for the value function
70 | running_ret = self.rew_buf[t] + self.gamma*(1-self.don_buf[t])*running_ret
71 | self.ret_buf[t] = running_ret
72 |
73 | # The next four lines implement GAE-Lambda advantage calculation
74 | running_del = self.rew_buf[t] + self.gamma*(1-self.don_buf[t])*previous_v - self.v_buf[t]
75 | running_adv = running_del + self.gamma*self.lam*(1-self.don_buf[t])*running_adv
76 | previous_v = self.v_buf[t]
77 | self.adv_buf[t] = running_adv
78 |         # The next line implements the advantage normalization trick
79 | self.adv_buf = (self.adv_buf - self.adv_buf.mean()) / self.adv_buf.std()
80 |
81 | def get(self):
82 | assert self.ptr == self.max_size # Buffer has to be full before you can get
83 | self.ptr = 0
84 | return dict(obs=torch.Tensor(self.obs_buf).to(self.device),
85 | act=torch.Tensor(self.act_buf).to(self.device),
86 | ret=torch.Tensor(self.ret_buf).to(self.device),
87 | adv=torch.Tensor(self.adv_buf).to(self.device),
88 | v=torch.Tensor(self.v_buf).to(self.device))
89 |
--------------------------------------------------------------------------------
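A minimal usage sketch for the `ReplayBuffer` defined above; the dimensions and transition values are hypothetical and not part of the repository.

```python
import torch

from agents.common.buffers import ReplayBuffer

device = torch.device('cpu')
buffer = ReplayBuffer(obs_dim=4, act_dim=1, size=1000, device=device)

# Store one fake transition (s, a, r, s', done).
buffer.add(obs=[0.1, 0.2, 0.3, 0.4], act=[0.5], rew=1.0,
           next_obs=[0.2, 0.3, 0.4, 0.5], done=False)

# Sample a training batch of tensors already moved to the chosen device.
batch = buffer.sample(batch_size=1)
print(batch['obs1'].shape, batch['acts'].shape, batch['rews'].shape)
```
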
/agents/common/networks.py:
--------------------------------------------------------------------------------
1 | import numpy as np
2 | import torch
3 | import torch.nn as nn
4 | import torch.nn.functional as F
5 | from torch.distributions import Categorical, Normal
6 |
7 |
8 | def identity(x):
9 | """Return input without any change."""
10 | return x
11 |
12 |
13 | """
14 | DQN, DDQN, A2C critic, VPG critic, TRPO critic, PPO critic, DDPG actor, TD3 actor
15 | """
16 | class MLP(nn.Module):
17 | def __init__(self,
18 | input_size,
19 | output_size,
20 | output_limit=1.0,
21 | hidden_sizes=(64,64),
22 | activation=F.relu,
23 | output_activation=identity,
24 | use_output_layer=True,
25 | use_actor=False,
26 | ):
27 | super(MLP, self).__init__()
28 |
29 | self.input_size = input_size
30 | self.output_size = output_size
31 | self.output_limit = output_limit
32 | self.hidden_sizes = hidden_sizes
33 | self.activation = activation
34 | self.output_activation = output_activation
35 | self.use_output_layer = use_output_layer
36 | self.use_actor = use_actor
37 |
38 | # Set hidden layers
39 | self.hidden_layers = nn.ModuleList()
40 | in_size = self.input_size
41 | for next_size in self.hidden_sizes:
42 | fc = nn.Linear(in_size, next_size)
43 | in_size = next_size
44 | self.hidden_layers.append(fc)
45 |
46 | # Set output layers
47 | if self.use_output_layer:
48 | self.output_layer = nn.Linear(in_size, self.output_size)
49 | else:
50 | self.output_layer = identity
51 |
52 | def forward(self, x):
53 | for hidden_layer in self.hidden_layers:
54 | x = self.activation(hidden_layer(x))
55 | x = self.output_activation(self.output_layer(x))
56 |         # If the network is used as an actor network, make sure the output is in the correct range
57 | x = x * self.output_limit if self.use_actor else x
58 | return x
59 |
60 |
61 | """
62 | A2C actor
63 | """
64 | class CategoricalPolicy(MLP):
65 | def forward(self, x):
66 | x = super(CategoricalPolicy, self).forward(x)
67 | pi = F.softmax(x, dim=-1)
68 |
69 | dist = Categorical(pi)
70 | action = dist.sample()
71 | log_pi = dist.log_prob(action)
72 | return action, pi, log_pi
73 |
74 |
75 | """
76 | DDPG critic, TD3 critic, SAC qf
77 | """
78 | class FlattenMLP(MLP):
79 | def forward(self, x, a):
80 | q = torch.cat([x,a], dim=-1)
81 | return super(FlattenMLP, self).forward(q)
82 |
83 |
84 | """
85 | VPG actor, TRPO actor, PPO actor
86 | """
87 | class GaussianPolicy(MLP):
88 | def __init__(self,
89 | input_size,
90 | output_size,
91 | output_limit=1.0,
92 | hidden_sizes=(64,64),
93 | activation=torch.tanh,
94 | ):
95 | super(GaussianPolicy, self).__init__(
96 | input_size=input_size,
97 | output_size=output_size,
98 | hidden_sizes=hidden_sizes,
99 | activation=activation,
100 | )
101 |
102 | self.output_limit = output_limit
103 | self.log_std = np.ones(output_size, dtype=np.float32)
104 | self.log_std = torch.nn.Parameter(torch.Tensor(self.log_std))
105 |
106 | def forward(self, x, pi=None, use_pi=True):
107 | mu = super(GaussianPolicy, self).forward(x)
108 | std = torch.exp(self.log_std)
109 |
110 | dist = Normal(mu, std)
111 | if use_pi:
112 | pi = dist.sample()
113 | log_pi = dist.log_prob(pi).sum(dim=-1)
114 |
115 | # Make sure outputs are in correct range
116 | mu = mu * self.output_limit
117 | pi = pi * self.output_limit
118 | return mu, std, pi, log_pi
119 |
120 |
121 | """
122 | SAC actor
123 | """
124 | LOG_STD_MAX = 2
125 | LOG_STD_MIN = -20
126 |
127 | class ReparamGaussianPolicy(MLP):
128 | def __init__(self,
129 | input_size,
130 | output_size,
131 | output_limit=1.0,
132 | hidden_sizes=(64,64),
133 | activation=F.relu,
134 | ):
135 | super(ReparamGaussianPolicy, self).__init__(
136 | input_size=input_size,
137 | output_size=output_size,
138 | hidden_sizes=hidden_sizes,
139 | activation=activation,
140 | use_output_layer=False,
141 | )
142 |
143 | in_size = hidden_sizes[-1]
144 | self.output_limit = output_limit
145 |
146 | # Set output layers
147 | self.mu_layer = nn.Linear(in_size, output_size)
148 | self.log_std_layer = nn.Linear(in_size, output_size)
149 |
150 | def clip_but_pass_gradient(self, x, l=-1., u=1.):
151 | clip_up = (x > u).float()
152 | clip_low = (x < l).float()
153 | clip_value = (u - x)*clip_up + (l - x)*clip_low
154 | return x + clip_value.detach()
155 |
156 | def apply_squashing_func(self, mu, pi, log_pi):
157 | mu = torch.tanh(mu)
158 | pi = torch.tanh(pi)
159 |         # To avoid machine precision errors, strictly clip 1 - pi**2 to the [0,1] range.
160 | log_pi -= torch.sum(torch.log(self.clip_but_pass_gradient(1 - pi.pow(2), l=0., u=1.) + 1e-6), dim=-1)
161 | return mu, pi, log_pi
162 |
163 | def forward(self, x):
164 | x = super(ReparamGaussianPolicy, self).forward(x)
165 |
166 | mu = self.mu_layer(x)
167 | log_std = torch.tanh(self.log_std_layer(x))
168 | log_std = LOG_STD_MIN + 0.5 * (LOG_STD_MAX - LOG_STD_MIN) * (log_std + 1)
169 | std = torch.exp(log_std)
170 |
171 | # https://pytorch.org/docs/stable/distributions.html#normal
172 | dist = Normal(mu, std)
173 | pi = dist.rsample() # Reparameterization trick (mean + std * N(0,1))
174 | log_pi = dist.log_prob(pi).sum(dim=-1)
175 | mu, pi, log_pi = self.apply_squashing_func(mu, pi, log_pi)
176 |
177 | # Make sure outputs are in correct range
178 | mu = mu * self.output_limit
179 | pi = pi * self.output_limit
180 | return mu, pi, log_pi
--------------------------------------------------------------------------------
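A minimal usage sketch for the SAC actor (`ReparamGaussianPolicy`) defined above; the observation and action dimensions are hypothetical.

```python
import torch

from agents.common.networks import ReparamGaussianPolicy

policy = ReparamGaussianPolicy(input_size=8, output_size=2, output_limit=1.0)

obs = torch.randn(8)          # a single fake observation
mu, pi, log_pi = policy(obs)  # squashed mean action, sampled action, log-probability
print(mu.shape, pi.shape, log_pi.shape)  # torch.Size([2]) torch.Size([2]) torch.Size([])
```
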
/agents/common/utils.py:
--------------------------------------------------------------------------------
1 | import torch
2 |
3 | def hard_target_update(main, target):
4 | target.load_state_dict(main.state_dict())
5 |
6 | def soft_target_update(main, target, tau=0.005):
7 | for main_param, target_param in zip(main.parameters(), target.parameters()):
8 | target_param.data.copy_(tau*main_param.data + (1.0-tau)*target_param.data)
9 |
--------------------------------------------------------------------------------
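A minimal sketch of how the two helpers above are typically combined: a hard update to initialize the target network, then Polyak averaging after each training step. The networks here are hypothetical placeholders.

```python
import torch.nn as nn

from agents.common.utils import hard_target_update, soft_target_update

main = nn.Linear(4, 2)
target = nn.Linear(4, 2)

hard_target_update(main, target)              # target <- main (exact copy)
soft_target_update(main, target, tau=0.005)   # target <- 0.005*main + 0.995*target
```
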
/agents/ddpg.py:
--------------------------------------------------------------------------------
1 | import numpy as np
2 | import torch
3 | import torch.nn as nn
4 | import torch.optim as optim
5 | import torch.nn.functional as F
6 |
7 | from agents.common.utils import *
8 | from agents.common.buffers import *
9 | from agents.common.networks import *
10 |
11 |
12 | class Agent(object):
13 | """An implementation of the Deep Deterministic Policy Gradient (DDPG) agent."""
14 |
15 | def __init__(self,
16 | env,
17 | args,
18 | device,
19 | obs_dim,
20 | act_dim,
21 | act_limit,
22 | steps=0,
23 | expl_before=2000,
24 | train_after=1000,
25 | gamma=0.99,
26 | act_noise=0.1,
27 | hidden_sizes=(128,128),
28 | buffer_size=int(1e4),
29 | batch_size=64,
30 | policy_lr=3e-4,
31 | qf_lr=3e-4,
32 | gradient_clip_policy=0.5,
33 | gradient_clip_qf=1.0,
34 | eval_mode=False,
35 | policy_losses=list(),
36 | qf_losses=list(),
37 | logger=dict(),
38 | ):
39 |
40 | self.env = env
41 | self.args = args
42 | self.device = device
43 | self.obs_dim = obs_dim
44 | self.act_dim = act_dim
45 | self.act_limit = act_limit
46 | self.steps = steps
47 | self.expl_before = expl_before
48 | self.train_after = train_after
49 | self.gamma = gamma
50 | self.act_noise = act_noise
51 | self.hidden_sizes = hidden_sizes
52 | self.buffer_size = buffer_size
53 | self.batch_size = batch_size
54 | self.policy_lr = policy_lr
55 | self.qf_lr = qf_lr
56 | self.gradient_clip_policy = gradient_clip_policy
57 | self.gradient_clip_qf = gradient_clip_qf
58 | self.eval_mode = eval_mode
59 | self.policy_losses = policy_losses
60 | self.qf_losses = qf_losses
61 | self.logger = logger
62 |
63 | # Main network
64 | self.policy = MLP(self.obs_dim, self.act_dim, self.act_limit,
65 | hidden_sizes=self.hidden_sizes,
66 | output_activation=torch.tanh,
67 | use_actor=True).to(self.device)
68 | self.qf = FlattenMLP(self.obs_dim+self.act_dim, 1, hidden_sizes=self.hidden_sizes).to(self.device)
69 | # Target network
70 | self.policy_target = MLP(self.obs_dim, self.act_dim, self.act_limit,
71 | hidden_sizes=self.hidden_sizes,
72 | output_activation=torch.tanh,
73 | use_actor=True).to(self.device)
74 | self.qf_target = FlattenMLP(self.obs_dim+self.act_dim, 1, hidden_sizes=self.hidden_sizes).to(self.device)
75 |
76 | # Initialize target parameters to match main parameters
77 | hard_target_update(self.policy, self.policy_target)
78 | hard_target_update(self.qf, self.qf_target)
79 |
80 | # Create optimizers
81 | self.policy_optimizer = optim.Adam(self.policy.parameters(), lr=self.policy_lr)
82 | self.qf_optimizer = optim.Adam(self.qf.parameters(), lr=self.qf_lr)
83 |
84 | # Experience buffer
85 | self.replay_buffer = ReplayBuffer(self.obs_dim, self.act_dim, self.buffer_size, self.device)
86 |
87 | def select_action(self, obs):
88 | action = self.policy(obs).detach().cpu().numpy()
89 | action += self.act_noise * np.random.randn(self.act_dim)
90 | return np.clip(action, -self.act_limit, self.act_limit)
91 |
92 | def train_model(self):
93 | batch = self.replay_buffer.sample(self.batch_size)
94 | obs1 = batch['obs1']
95 | obs2 = batch['obs2']
96 | acts = batch['acts']
97 | rews = batch['rews']
98 | done = batch['done']
99 |
100 | if 0: # Check shape of experiences
101 | print("obs1", obs1.shape)
102 | print("obs2", obs2.shape)
103 | print("acts", acts.shape)
104 | print("rews", rews.shape)
105 | print("done", done.shape)
106 |
107 | # Prediction Q(s,𝜇(s)), Q(s,a), Q‾(s',𝜇‾(s'))
108 | q_pi = self.qf(obs1, self.policy(obs1))
109 | q = self.qf(obs1, acts).squeeze(1)
110 | q_pi_target = self.qf_target(obs2, self.policy_target(obs2)).squeeze(1)
111 |
112 | # Target for Q regression
113 | q_backup = rews + self.gamma*(1-done)*q_pi_target
114 |         q_backup = q_backup.to(self.device)
115 |
116 | if 0: # Check shape of prediction and target
117 | print("q", q.shape)
118 | print("q_backup", q_backup.shape)
119 |
120 | # DDPG losses
121 | policy_loss = -q_pi.mean()
122 | qf_loss = F.mse_loss(q, q_backup.detach())
123 |
124 | # Update policy network parameter
125 | self.policy_optimizer.zero_grad()
126 | policy_loss.backward()
127 | nn.utils.clip_grad_norm_(self.policy.parameters(), self.gradient_clip_policy)
128 | self.policy_optimizer.step()
129 |
130 | # Update Q-function network parameter
131 | self.qf_optimizer.zero_grad()
132 | qf_loss.backward()
133 | nn.utils.clip_grad_norm_(self.qf.parameters(), self.gradient_clip_qf)
134 | self.qf_optimizer.step()
135 |
136 | # Polyak averaging for target parameter
137 | soft_target_update(self.policy, self.policy_target)
138 | soft_target_update(self.qf, self.qf_target)
139 |
140 | # Save losses
141 | self.policy_losses.append(policy_loss.item())
142 | self.qf_losses.append(qf_loss.item())
143 |
144 | def run(self, max_step):
145 | step_number = 0
146 | total_reward = 0.
147 |
148 | obs = self.env.reset()
149 | done = False
150 |
151 | # Keep interacting until agent reaches a terminal state.
152 | while not (done or step_number == max_step):
153 | if self.args.render:
154 | self.env.render()
155 |
156 | if self.eval_mode:
157 | action = self.policy(torch.Tensor(obs).to(self.device))
158 | action = action.detach().cpu().numpy()
159 | next_obs, reward, done, _ = self.env.step(action)
160 | else:
161 | self.steps += 1
162 |
163 |                 # Until expl_before steps have elapsed, randomly sample actions
164 | # from a uniform distribution for better exploration.
165 | # Afterwards, use the learned policy.
166 | if self.steps > self.expl_before:
167 | action = self.select_action(torch.Tensor(obs).to(self.device))
168 | else:
169 | action = self.env.action_space.sample()
170 |
171 | # Collect experience (s, a, r, s') using some policy
172 | next_obs, reward, done, _ = self.env.step(action)
173 |
174 | # Add experience to replay buffer
175 | self.replay_buffer.add(obs, action, reward, next_obs, done)
176 |
177 |                 # Start training when the number of experiences is greater than train_after
178 | if self.steps > self.train_after:
179 | self.train_model()
180 |
181 | total_reward += reward
182 | step_number += 1
183 | obs = next_obs
184 |
185 | # Save logs
186 | self.logger['LossPi'] = round(np.mean(self.policy_losses), 5)
187 | self.logger['LossQ'] = round(np.mean(self.qf_losses), 5)
188 | return step_number, total_reward
189 |
--------------------------------------------------------------------------------
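A minimal numeric sketch of the exploration rule in `select_action` above: Gaussian noise is added to the deterministic policy output and the result is clipped to the action bounds. The values are hypothetical.

```python
import numpy as np

act_limit, act_noise, act_dim = 1.0, 0.1, 2

mu = np.array([0.95, -0.4])                          # deterministic policy output
action = mu + act_noise * np.random.randn(act_dim)   # add exploration noise
action = np.clip(action, -act_limit, act_limit)      # keep the action within bounds
```
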
/agents/dqn.py:
--------------------------------------------------------------------------------
1 | import numpy as np
2 | import torch
3 | import torch.optim as optim
4 | import torch.nn.functional as F
5 |
6 | from agents.common.utils import *
7 | from agents.common.buffers import *
8 | from agents.common.networks import *
9 |
10 |
11 | class Agent(object):
12 |     """An implementation of the Deep Q-Network (DQN) and Double DQN agents."""
13 |
14 | def __init__(self,
15 | env,
16 | args,
17 | device,
18 | obs_dim,
19 | act_num,
20 | steps=0,
21 | gamma=0.99,
22 | epsilon=1.0,
23 | epsilon_decay=0.995,
24 | buffer_size=int(1e4),
25 | batch_size=64,
26 | target_update_step=100,
27 | eval_mode=False,
28 | q_losses=list(),
29 | logger=dict(),
30 | ):
31 |
32 | self.env = env
33 | self.args = args
34 | self.device = device
35 | self.obs_dim = obs_dim
36 | self.act_num = act_num
37 | self.steps = steps
38 | self.gamma = gamma
39 | self.epsilon = epsilon
40 | self.epsilon_decay = epsilon_decay
41 | self.buffer_size = buffer_size
42 | self.batch_size = batch_size
43 | self.target_update_step = target_update_step
44 | self.eval_mode = eval_mode
45 | self.q_losses = q_losses
46 | self.logger = logger
47 |
48 | # Main network
49 | self.qf = MLP(self.obs_dim, self.act_num).to(self.device)
50 | # Target network
51 | self.qf_target = MLP(self.obs_dim, self.act_num).to(self.device)
52 |
53 | # Initialize target parameters to match main parameters
54 | hard_target_update(self.qf, self.qf_target)
55 |
56 | # Create an optimizer
57 | self.qf_optimizer = optim.Adam(self.qf.parameters(), lr=1e-3)
58 |
59 | # Experience buffer
60 | self.replay_buffer = ReplayBuffer(self.obs_dim, 1, self.buffer_size, self.device)
61 |
62 | def select_action(self, obs):
63 | """Select an action from the set of available actions."""
64 | # Decaying epsilon
65 | self.epsilon *= self.epsilon_decay
66 | self.epsilon = max(self.epsilon, 0.01)
67 |
68 | if np.random.rand() <= self.epsilon:
69 | # Choose a random action with probability epsilon
70 | return np.random.randint(self.act_num)
71 | else:
72 | # Choose the action with highest Q-value at the current state
73 | action = self.qf(obs).argmax()
74 | return action.detach().cpu().numpy()
75 |
76 | def train_model(self):
77 | batch = self.replay_buffer.sample(self.batch_size)
78 | obs1 = batch['obs1']
79 | obs2 = batch['obs2']
80 | acts = batch['acts']
81 | rews = batch['rews']
82 | done = batch['done']
83 |
84 | if 0: # Check shape of experiences
85 | print("obs1", obs1.shape)
86 | print("obs2", obs2.shape)
87 | print("acts", acts.shape)
88 | print("rews", rews.shape)
89 | print("done", done.shape)
90 |
91 | # Prediction Q(s)
92 | q = self.qf(obs1).gather(1, acts.long()).squeeze(1)
93 |
94 | # Target for Q regression
95 | if self.args.algo == 'dqn': # DQN
96 | q_target = self.qf_target(obs2)
97 | elif self.args.algo == 'ddqn': # Double DQN
98 | q2 = self.qf(obs2)
99 | q_target = self.qf_target(obs2)
100 | q_target = q_target.gather(1, q2.max(1)[1].unsqueeze(1))
101 | q_backup = rews + self.gamma*(1-done)*q_target.max(1)[0]
102 |         q_backup = q_backup.to(self.device)
103 |
104 | if 0: # Check shape of prediction and target
105 | print("q", q.shape)
106 | print("q_backup", q_backup.shape)
107 |
108 |         # Update prediction network parameter
109 | qf_loss = F.mse_loss(q, q_backup.detach())
110 | self.qf_optimizer.zero_grad()
111 | qf_loss.backward()
112 | self.qf_optimizer.step()
113 |
114 |         # Synchronize target parameters 𝜃‾ with 𝜃 every C steps
115 | if self.steps % self.target_update_step == 0:
116 | hard_target_update(self.qf, self.qf_target)
117 |
118 | # Save loss
119 | self.q_losses.append(qf_loss.item())
120 |
121 | def run(self, max_step):
122 | step_number = 0
123 | total_reward = 0.
124 |
125 | obs = self.env.reset()
126 | done = False
127 |
128 | # Keep interacting until agent reaches a terminal state.
129 | while not (done or step_number == max_step):
130 | if self.args.render:
131 | self.env.render()
132 |
133 | if self.eval_mode:
134 | q_value = self.qf(torch.Tensor(obs).to(self.device)).argmax()
135 | action = q_value.detach().cpu().numpy()
136 | next_obs, reward, done, _ = self.env.step(action)
137 | else:
138 | self.steps += 1
139 |
140 | # Collect experience (s, a, r, s') using some policy
141 | action = self.select_action(torch.Tensor(obs).to(self.device))
142 | next_obs, reward, done, _ = self.env.step(action)
143 |
144 | # Add experience to replay buffer
145 | self.replay_buffer.add(obs, action, reward, next_obs, done)
146 |
147 |                 # Start training when the number of experiences is greater than batch_size
148 | if self.steps > self.batch_size:
149 | self.train_model()
150 |
151 | total_reward += reward
152 | step_number += 1
153 | obs = next_obs
154 |
155 | # Save logs
156 | self.logger['LossQ'] = round(np.mean(self.q_losses), 5)
157 | return step_number, total_reward
158 |
--------------------------------------------------------------------------------
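A minimal numeric sketch of how the DQN and Double DQN targets in `train_model` above differ: DQN both selects and evaluates the next action with the target network, while Double DQN selects it with the online network. The tensors are hypothetical.

```python
import torch

gamma = 0.99
rews = torch.tensor([1.0])
done = torch.tensor([0.0])
q_target = torch.tensor([[0.5, 1.5]])   # Q_target(s', .)
q2 = torch.tensor([[2.0, 0.2]])         # Q(s', .) from the online network

# DQN: bootstrap from the best action under the target network.
dqn_backup = rews + gamma * (1 - done) * q_target.max(1)[0]   # 1.0 + 0.99 * 1.5

# Double DQN: select the action with the online network, evaluate it with the target network.
sel = q2.max(1)[1].unsqueeze(1)
ddqn_backup = rews + gamma * (1 - done) * q_target.gather(1, sel).squeeze(1)  # 1.0 + 0.99 * 0.5
```
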
/agents/ppo.py:
--------------------------------------------------------------------------------
1 | import numpy as np
2 | import torch
3 | import torch.nn as nn
4 | import torch.optim as optim
5 | import torch.nn.functional as F
6 |
7 | from agents.common.utils import *
8 | from agents.common.buffers import *
9 | from agents.common.networks import *
10 |
11 |
12 | class Agent(object):
13 | """
14 | An implementation of the Proximal Policy Optimization (PPO) (by clipping) agent,
15 | with early stopping based on approximate KL.
16 | """
17 |
18 | def __init__(self,
19 | env,
20 | args,
21 | device,
22 | obs_dim,
23 | act_dim,
24 | act_limit,
25 | steps=0,
26 | gamma=0.99,
27 | lam=0.97,
28 | hidden_sizes=(64,64),
29 | sample_size=2048,
30 | train_policy_iters=80,
31 | train_vf_iters=80,
32 | clip_param=0.2,
33 | target_kl=0.01,
34 | policy_lr=3e-4,
35 | vf_lr=1e-3,
36 | eval_mode=False,
37 | policy_losses=list(),
38 | vf_losses=list(),
39 | kls=list(),
40 | logger=dict(),
41 | ):
42 |
43 | self.env = env
44 | self.args = args
45 | self.device = device
46 | self.obs_dim = obs_dim
47 | self.act_dim = act_dim
48 | self.act_limit = act_limit
49 | self.steps = steps
50 | self.gamma = gamma
51 | self.lam = lam
52 | self.hidden_sizes = hidden_sizes
53 | self.sample_size = sample_size
54 | self.train_policy_iters = train_policy_iters
55 | self.train_vf_iters = train_vf_iters
56 | self.clip_param = clip_param
57 | self.target_kl = target_kl
58 | self.policy_lr = policy_lr
59 | self.vf_lr = vf_lr
60 | self.eval_mode = eval_mode
61 | self.policy_losses = policy_losses
62 | self.vf_losses = vf_losses
63 | self.kls = kls
64 | self.logger = logger
65 |
66 | # Main network
67 | self.policy = GaussianPolicy(self.obs_dim, self.act_dim, self.act_limit).to(self.device)
68 | self.vf = MLP(self.obs_dim, 1, activation=torch.tanh).to(self.device)
69 |
70 | # Create optimizers
71 | self.policy_optimizer = optim.Adam(self.policy.parameters(), lr=self.policy_lr)
72 | self.vf_optimizer = optim.Adam(self.vf.parameters(), lr=self.vf_lr)
73 |
74 | # Experience buffer
75 | self.buffer = Buffer(self.obs_dim, self.act_dim, self.sample_size, self.device, self.gamma, self.lam)
76 |
77 | def compute_vf_loss(self, obs, ret, v_old):
78 | # Prediction V(s)
79 | v = self.vf(obs).squeeze(1)
80 |
81 | # Value loss
82 | clip_v = v_old + torch.clamp(v-v_old, -self.clip_param, self.clip_param)
83 | vf_loss = torch.max(F.mse_loss(v, ret), F.mse_loss(clip_v, ret)).mean()
84 | return vf_loss
85 |
86 | def compute_policy_loss(self, obs, act, adv, log_pi_old):
87 | # Prediction logπ(s)
88 | _, _, _, log_pi = self.policy(obs, act, use_pi=False)
89 |
90 | # Policy loss
91 | ratio = torch.exp(log_pi - log_pi_old)
92 | clip_adv = torch.clamp(ratio, 1.-self.clip_param, 1.+self.clip_param)*adv
93 | policy_loss = -torch.min(ratio*adv, clip_adv).mean()
94 |
95 | # A sample estimate for KL-divergence, easy to compute
96 | approx_kl = (log_pi_old - log_pi).mean()
97 | return policy_loss, approx_kl
98 |
99 | def train_model(self):
100 | batch = self.buffer.get()
101 | obs = batch['obs']
102 | act = batch['act']
103 | ret = batch['ret']
104 | adv = batch['adv']
105 |
106 | # Prediction logπ_old(s), V_old(s)
107 | _, _, _, log_pi_old = self.policy(obs, act, use_pi=False)
108 | log_pi_old = log_pi_old.detach()
109 | v_old = self.vf(obs).squeeze(1)
110 | v_old = v_old.detach()
111 |
112 | # Train policy with multiple steps of gradient descent
113 | for i in range(self.train_policy_iters):
114 | policy_loss, kl = self.compute_policy_loss(obs, act, adv, log_pi_old)
115 |
116 | # Early stopping at step i due to reaching max kl
117 | if kl > 1.5 * self.target_kl:
118 | break
119 |
120 | # Update policy network parameter
121 | self.policy_optimizer.zero_grad()
122 | policy_loss.backward()
123 | self.policy_optimizer.step()
124 |
125 | # Train value with multiple steps of gradient descent
126 | for i in range(self.train_vf_iters):
127 | vf_loss = self.compute_vf_loss(obs, ret, v_old)
128 |
129 | # Update value network parameter
130 | self.vf_optimizer.zero_grad()
131 | vf_loss.backward()
132 | self.vf_optimizer.step()
133 |
134 | # Save losses
135 | self.policy_losses.append(policy_loss.item())
136 | self.vf_losses.append(vf_loss.item())
137 | self.kls.append(kl.item())
138 |
139 | def run(self, max_step):
140 | step_number = 0
141 | total_reward = 0.
142 |
143 | obs = self.env.reset()
144 | done = False
145 |
146 | # Keep interacting until agent reaches a terminal state.
147 | while not (done or step_number == max_step):
148 | if self.args.render:
149 | self.env.render()
150 |
151 | if self.eval_mode:
152 | action, _, _, _ = self.policy(torch.Tensor(obs).to(self.device))
153 | action = action.detach().cpu().numpy()
154 | next_obs, reward, done, _ = self.env.step(action)
155 | else:
156 | self.steps += 1
157 |
158 | # Collect experience (s, a, r, s') using some policy
159 | _, _, action, _ = self.policy(torch.Tensor(obs).to(self.device))
160 | action = action.detach().cpu().numpy()
161 | next_obs, reward, done, _ = self.env.step(action)
162 |
163 | # Add experience to buffer
164 | v = self.vf(torch.Tensor(obs).to(self.device))
165 | self.buffer.add(obs, action, reward, done, v)
166 |
167 |                 # Start training when the number of experiences equals the sample size
168 | if self.steps == self.sample_size:
169 | self.buffer.finish_path()
170 | self.train_model()
171 | self.steps = 0
172 |
173 | total_reward += reward
174 | step_number += 1
175 | obs = next_obs
176 |
177 | # Save logs
178 | self.logger['LossPi'] = round(np.mean(self.policy_losses), 5)
179 | self.logger['LossV'] = round(np.mean(self.vf_losses), 5)
180 | self.logger['KL'] = round(np.mean(self.kls), 5)
181 | return step_number, total_reward
182 |
--------------------------------------------------------------------------------
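A minimal numeric sketch of the clipped surrogate objective in `compute_policy_loss` above; the ratio and advantage values are hypothetical.

```python
import torch

clip_param = 0.2
ratio = torch.tensor([1.5])   # exp(log_pi - log_pi_old)
adv = torch.tensor([2.0])

clip_adv = torch.clamp(ratio, 1 - clip_param, 1 + clip_param) * adv   # 1.2 * 2.0 = 2.4
policy_loss = -torch.min(ratio * adv, clip_adv).mean()                # -min(3.0, 2.4) = -2.4
```
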
/agents/sac.py:
--------------------------------------------------------------------------------
1 | import numpy as np
2 | import torch
3 | import torch.nn as nn
4 | import torch.optim as optim
5 | import torch.nn.functional as F
6 |
7 | from agents.common.utils import *
8 | from agents.common.buffers import *
9 | from agents.common.networks import *
10 |
11 |
12 | class Agent(object):
13 | """
14 |     An implementation of agents for Soft Actor-Critic (SAC) and SAC with automatic entropy adjustment (SAC-AEA).
15 | """
16 |
17 | def __init__(self,
18 | env,
19 | args,
20 | device,
21 | obs_dim,
22 | act_dim,
23 | act_limit,
24 | steps=0,
25 | expl_before=2000,
26 | train_after=1000,
27 | gamma=0.99,
28 | alpha=0.2,
29 | automatic_entropy_tuning=False,
30 | hidden_sizes=(128,128),
31 | buffer_size=int(1e4),
32 | batch_size=64,
33 | policy_lr=3e-4,
34 | qf_lr=3e-4,
35 | eval_mode=False,
36 | policy_losses=list(),
37 | qf1_losses=list(),
38 | qf2_losses=list(),
39 | alpha_losses=list(),
40 | logger=dict(),
41 | ):
42 |
43 | self.env = env
44 | self.args = args
45 | self.device = device
46 | self.obs_dim = obs_dim
47 | self.act_dim = act_dim
48 | self.act_limit = act_limit
49 | self.steps = steps
50 | self.expl_before = expl_before
51 | self.train_after = train_after
52 | self.gamma = gamma
53 | self.alpha = alpha
54 | self.automatic_entropy_tuning = automatic_entropy_tuning
55 | self.hidden_sizes = hidden_sizes
56 | self.buffer_size = buffer_size
57 | self.batch_size = batch_size
58 | self.policy_lr = policy_lr
59 | self.qf_lr = qf_lr
60 | self.eval_mode = eval_mode
61 | self.policy_losses = policy_losses
62 | self.qf1_losses = qf1_losses
63 | self.qf2_losses = qf2_losses
64 | self.alpha_losses = alpha_losses
65 | self.logger = logger
66 |
67 | # Main network
68 | self.policy = ReparamGaussianPolicy(self.obs_dim, self.act_dim, self.act_limit,
69 | hidden_sizes=self.hidden_sizes).to(self.device)
70 | self.qf1 = FlattenMLP(self.obs_dim+self.act_dim, 1, hidden_sizes=self.hidden_sizes).to(self.device)
71 | self.qf2 = FlattenMLP(self.obs_dim+self.act_dim, 1, hidden_sizes=self.hidden_sizes).to(self.device)
72 | # Target network
73 | self.qf1_target = FlattenMLP(self.obs_dim+self.act_dim, 1, hidden_sizes=self.hidden_sizes).to(self.device)
74 | self.qf2_target = FlattenMLP(self.obs_dim+self.act_dim, 1, hidden_sizes=self.hidden_sizes).to(self.device)
75 |
76 | # Initialize target parameters to match main parameters
77 | hard_target_update(self.qf1, self.qf1_target)
78 | hard_target_update(self.qf2, self.qf2_target)
79 |
80 | # Concat the Q-network parameters to use one optim
81 | self.qf_parameters = list(self.qf1.parameters()) + list(self.qf2.parameters())
82 | # Create optimizers
83 | self.policy_optimizer = optim.Adam(self.policy.parameters(), lr=self.policy_lr)
84 | self.qf_optimizer = optim.Adam(self.qf_parameters, lr=self.qf_lr)
85 |
86 | # Experience buffer
87 | self.replay_buffer = ReplayBuffer(self.obs_dim, self.act_dim, self.buffer_size, self.device)
88 |
89 | # If automatic entropy tuning is True,
90 | # initialize a target entropy, a log alpha and an alpha optimizer
91 | if self.automatic_entropy_tuning:
92 | self.target_entropy = -np.prod((act_dim,)).item()
93 | self.log_alpha = torch.zeros(1, requires_grad=True, device=self.device)
94 | self.alpha_optimizer = optim.Adam([self.log_alpha], lr=self.policy_lr)
95 |
96 | def train_model(self):
97 | batch = self.replay_buffer.sample(self.batch_size)
98 | obs1 = batch['obs1']
99 | obs2 = batch['obs2']
100 | acts = batch['acts']
101 | rews = batch['rews']
102 | done = batch['done']
103 |
104 | if 0: # Check shape of experiences
105 | print("obs1", obs1.shape)
106 | print("obs2", obs2.shape)
107 | print("acts", acts.shape)
108 | print("rews", rews.shape)
109 | print("done", done.shape)
110 |
111 | # Prediction π(a|s), logπ(a|s), π(a'|s'), logπ(a'|s'), Q1(s,a), Q2(s,a)
112 | _, pi, log_pi = self.policy(obs1)
113 | _, next_pi, next_log_pi = self.policy(obs2)
114 | q1 = self.qf1(obs1, acts).squeeze(1)
115 | q2 = self.qf2(obs1, acts).squeeze(1)
116 |
117 | # Min Double-Q: min(Q1(s,π(a|s)), Q2(s,π(a|s))), min(Q1‾(s',π(a'|s')), Q2‾(s',π(a'|s')))
118 | min_q_pi = torch.min(self.qf1(obs1, pi), self.qf2(obs1, pi)).squeeze(1).to(self.device)
119 | min_q_next_pi = torch.min(self.qf1_target(obs2, next_pi),
120 | self.qf2_target(obs2, next_pi)).squeeze(1).to(self.device)
121 |
122 | # Targets for Q regression
123 | v_backup = min_q_next_pi - self.alpha*next_log_pi
124 | q_backup = rews + self.gamma*(1-done)*v_backup
125 |         q_backup = q_backup.to(self.device)
126 |
127 | if 0: # Check shape of prediction and target
128 | print("log_pi", log_pi.shape)
129 | print("next_log_pi", next_log_pi.shape)
130 | print("q1", q1.shape)
131 | print("q2", q2.shape)
132 | print("min_q_pi", min_q_pi.shape)
133 | print("min_q_next_pi", min_q_next_pi.shape)
134 | print("q_backup", q_backup.shape)
135 |
136 | # SAC losses
137 | policy_loss = (self.alpha*log_pi - min_q_pi).mean()
138 | qf1_loss = F.mse_loss(q1, q_backup.detach())
139 | qf2_loss = F.mse_loss(q2, q_backup.detach())
140 | qf_loss = qf1_loss + qf2_loss
141 |
142 | # Update policy network parameter
143 | self.policy_optimizer.zero_grad()
144 | policy_loss.backward()
145 | self.policy_optimizer.step()
146 |
147 | # Update two Q-network parameter
148 | self.qf_optimizer.zero_grad()
149 | qf_loss.backward()
150 | self.qf_optimizer.step()
151 |
152 | # If automatic entropy tuning is True, update alpha
153 | if self.automatic_entropy_tuning:
154 | alpha_loss = -(self.log_alpha * (log_pi + self.target_entropy).detach()).mean()
155 | self.alpha_optimizer.zero_grad()
156 | alpha_loss.backward()
157 | self.alpha_optimizer.step()
158 |
159 | self.alpha = self.log_alpha.exp()
160 |
161 | # Save alpha loss
162 | self.alpha_losses.append(alpha_loss.item())
163 |
164 | # Polyak averaging for target parameter
165 | soft_target_update(self.qf1, self.qf1_target)
166 | soft_target_update(self.qf2, self.qf2_target)
167 |
168 | # Save losses
169 | self.policy_losses.append(policy_loss.item())
170 | self.qf1_losses.append(qf1_loss.item())
171 | self.qf2_losses.append(qf2_loss.item())
172 |
173 | def run(self, max_step):
174 | step_number = 0
175 | total_reward = 0.
176 |
177 | obs = self.env.reset()
178 | done = False
179 |
180 | # Keep interacting until agent reaches a terminal state.
181 | while not (done or step_number == max_step):
182 | if self.args.render:
183 | self.env.render()
184 |
185 | if self.eval_mode:
186 | action, _, _ = self.policy(torch.Tensor(obs).to(self.device))
187 | action = action.detach().cpu().numpy()
188 | next_obs, reward, done, _ = self.env.step(action)
189 | else:
190 | self.steps += 1
191 |
192 |                 # Until expl_before steps have elapsed, randomly sample actions
193 | # from a uniform distribution for better exploration.
194 | # Afterwards, use the learned policy.
195 | if self.steps > self.expl_before:
196 | _, action, _ = self.policy(torch.Tensor(obs).to(self.device))
197 | action = action.detach().cpu().numpy()
198 | else:
199 | action = self.env.action_space.sample()
200 |
201 | # Collect experience (s, a, r, s') using some policy
202 | next_obs, reward, done, _ = self.env.step(action)
203 |
204 | # Add experience to replay buffer
205 | self.replay_buffer.add(obs, action, reward, next_obs, done)
206 |
207 |                 # Start training when the number of experiences is greater than train_after
208 | if self.steps > self.train_after:
209 | self.train_model()
210 |
211 | total_reward += reward
212 | step_number += 1
213 | obs = next_obs
214 |
215 | # Save logs
216 | self.logger['LossPi'] = round(np.mean(self.policy_losses), 5)
217 | self.logger['LossQ1'] = round(np.mean(self.qf1_losses), 5)
218 | self.logger['LossQ2'] = round(np.mean(self.qf2_losses), 5)
219 | if self.automatic_entropy_tuning:
220 | self.logger['LossAlpha'] = round(np.mean(self.alpha_losses), 5)
221 | return step_number, total_reward
222 |
--------------------------------------------------------------------------------
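A minimal numeric sketch of the entropy-regularized backup in `train_model` above; the values are hypothetical.

```python
gamma, alpha = 0.99, 0.2
rews, done = 1.0, 0.0
min_q_next_pi = 2.0    # min(Q1_target, Q2_target) at (s', a' ~ pi)
next_log_pi = -1.0     # log pi(a'|s')

v_backup = min_q_next_pi - alpha * next_log_pi    # 2.0 - 0.2 * (-1.0) = 2.2
q_backup = rews + gamma * (1 - done) * v_backup   # 1.0 + 0.99 * 2.2 = 3.178
```
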
/agents/td3.py:
--------------------------------------------------------------------------------
1 | import numpy as np
2 | import torch
3 | import torch.nn as nn
4 | import torch.optim as optim
5 | import torch.nn.functional as F
6 |
7 | from agents.common.utils import *
8 | from agents.common.buffers import *
9 | from agents.common.networks import *
10 |
11 |
12 | class Agent(object):
13 | """An implementation of the Twin Delayed DDPG (TD3) agent."""
14 |
15 | def __init__(self,
16 | env,
17 | args,
18 | device,
19 | obs_dim,
20 | act_dim,
21 | act_limit,
22 | steps=0,
23 | expl_before=2000,
24 | train_after=1000,
25 | gamma=0.99,
26 | act_noise=0.1,
27 | target_noise=0.2,
28 | noise_clip=0.5,
29 | policy_delay=2,
30 | hidden_sizes=(128,128),
31 | buffer_size=int(1e4),
32 | batch_size=64,
33 | policy_lr=3e-4,
34 | qf_lr=3e-4,
35 | eval_mode=False,
36 | policy_losses=list(),
37 | qf_losses=list(),
38 | logger=dict(),
39 | ):
40 |
41 | self.env = env
42 | self.args = args
43 | self.device = device
44 | self.obs_dim = obs_dim
45 | self.act_dim = act_dim
46 | self.act_limit = act_limit
47 | self.steps = steps
48 | self.expl_before = expl_before
49 | self.train_after = train_after
50 | self.gamma = gamma
51 | self.act_noise = act_noise
52 | self.target_noise = target_noise
53 | self.noise_clip = noise_clip
54 | self.policy_delay = policy_delay
55 | self.hidden_sizes = hidden_sizes
56 | self.buffer_size = buffer_size
57 | self.batch_size = batch_size
58 | self.policy_lr = policy_lr
59 | self.qf_lr = qf_lr
60 | self.eval_mode = eval_mode
61 | self.policy_losses = policy_losses
62 | self.qf_losses = qf_losses
63 | self.logger = logger
64 |
65 | # Main network
66 | self.policy = MLP(self.obs_dim, self.act_dim, self.act_limit,
67 | hidden_sizes=self.hidden_sizes,
68 | output_activation=torch.tanh,
69 | use_actor=True).to(self.device)
70 | self.qf1 = FlattenMLP(self.obs_dim+self.act_dim, 1, hidden_sizes=self.hidden_sizes).to(self.device)
71 | self.qf2 = FlattenMLP(self.obs_dim+self.act_dim, 1, hidden_sizes=self.hidden_sizes).to(self.device)
72 | # Target network
73 | self.policy_target = MLP(self.obs_dim, self.act_dim, self.act_limit,
74 | hidden_sizes=self.hidden_sizes,
75 | output_activation=torch.tanh,
76 | use_actor=True).to(self.device)
77 | self.qf1_target = FlattenMLP(self.obs_dim+self.act_dim, 1, hidden_sizes=self.hidden_sizes).to(self.device)
78 | self.qf2_target = FlattenMLP(self.obs_dim+self.act_dim, 1, hidden_sizes=self.hidden_sizes).to(self.device)
79 |
80 | # Initialize target parameters to match main parameters
81 | hard_target_update(self.policy, self.policy_target)
82 | hard_target_update(self.qf1, self.qf1_target)
83 | hard_target_update(self.qf2, self.qf2_target)
84 |
85 | # Concat the Q-network parameters to use one optim
86 | self.qf_parameters = list(self.qf1.parameters()) + list(self.qf2.parameters())
87 | # Create optimizers
88 | self.policy_optimizer = optim.Adam(self.policy.parameters(), lr=self.policy_lr)
89 | self.qf_optimizer = optim.Adam(self.qf_parameters, lr=self.qf_lr)
90 |
91 | # Experience buffer
92 | self.replay_buffer = ReplayBuffer(self.obs_dim, self.act_dim, self.buffer_size, self.device)
93 |
94 | def select_action(self, obs):
95 | action = self.policy(obs).detach().cpu().numpy()
96 | action += self.act_noise * np.random.randn(self.act_dim)
97 | return np.clip(action, -self.act_limit, self.act_limit)
98 |
99 | def train_model(self):
100 | batch = self.replay_buffer.sample(self.batch_size)
101 | obs1 = batch['obs1']
102 | obs2 = batch['obs2']
103 | acts = batch['acts']
104 | rews = batch['rews']
105 | done = batch['done']
106 |
107 | if 0: # Check shape of experiences
108 | print("obs1", obs1.shape)
109 | print("obs2", obs2.shape)
110 | print("acts", acts.shape)
111 | print("rews", rews.shape)
112 | print("done", done.shape)
113 |
114 | # Prediction Q1(s,𝜇(s)), Q1(s,a), Q2(s,a)
115 | q1_pi = self.qf1(obs1, self.policy(obs1))
116 | q1 = self.qf1(obs1, acts).squeeze(1)
117 | q2 = self.qf2(obs1, acts).squeeze(1)
118 |
119 | # Target policy smoothing, by adding clipped noise to target actions
120 | pi_target = self.policy_target(obs2)
121 | epsilon = torch.normal(mean=0, std=self.target_noise, size=pi_target.size()).to(self.device)
122 | epsilon = torch.clamp(epsilon, -self.noise_clip, self.noise_clip).to(self.device)
123 | pi_target = torch.clamp(pi_target+epsilon, -self.act_limit, self.act_limit).to(self.device)
124 |
125 | # Min Double-Q: min(Q1‾(s',𝜇(s')), Q2‾(s',𝜇(s')))
126 | min_q_pi_target = torch.min(self.qf1_target(obs2, pi_target),
127 | self.qf2_target(obs2, pi_target)).squeeze(1).to(self.device)
128 |
129 | # Target for Q regression
130 | q_backup = rews + self.gamma*(1-done)*min_q_pi_target
131 |         q_backup = q_backup.to(self.device)
132 |
133 | if 0: # Check shape of prediction and target
134 | print("pi_target", pi_target.shape)
135 | print("epsilon", epsilon.shape)
136 | print("q1", q1.shape)
137 | print("q2", q2.shape)
138 | print("min_q_pi_target", min_q_pi_target.shape)
139 | print("q_backup", q_backup.shape)
140 |
141 | # TD3 losses
142 | policy_loss = -q1_pi.mean()
143 | qf1_loss = F.mse_loss(q1, q_backup.detach())
144 | qf2_loss = F.mse_loss(q2, q_backup.detach())
145 | qf_loss = qf1_loss + qf2_loss
146 |
147 | # Delayed policy update
148 | if self.steps % self.policy_delay == 0:
149 | # Update policy network parameter
150 | self.policy_optimizer.zero_grad()
151 | policy_loss.backward()
152 | self.policy_optimizer.step()
153 |
154 | # Polyak averaging for target parameter
155 | soft_target_update(self.policy, self.policy_target)
156 | soft_target_update(self.qf1, self.qf1_target)
157 | soft_target_update(self.qf2, self.qf2_target)
158 |
159 | # Update two Q-network parameter
160 | self.qf_optimizer.zero_grad()
161 | qf_loss.backward()
162 | self.qf_optimizer.step()
163 |
164 | # Save losses
165 | self.policy_losses.append(policy_loss.item())
166 | self.qf_losses.append(qf_loss.item())
167 |
168 | def run(self, max_step):
169 | step_number = 0
170 | total_reward = 0.
171 |
172 | obs = self.env.reset()
173 | done = False
174 |
175 | # Keep interacting until agent reaches a terminal state.
176 | while not (done or step_number == max_step):
177 | if self.args.render:
178 | self.env.render()
179 |
180 | if self.eval_mode:
181 | action = self.policy(torch.Tensor(obs).to(self.device))
182 | action = action.detach().cpu().numpy()
183 | next_obs, reward, done, _ = self.env.step(action)
184 | else:
185 | self.steps += 1
186 |
187 |                 # Until expl_before steps have elapsed, randomly sample actions
188 | # from a uniform distribution for better exploration.
189 | # Afterwards, use the learned policy.
190 | if self.steps > self.expl_before:
191 | action = self.select_action(torch.Tensor(obs).to(self.device))
192 | else:
193 | action = self.env.action_space.sample()
194 |
195 | # Collect experience (s, a, r, s') using some policy
196 | next_obs, reward, done, _ = self.env.step(action)
197 |
198 | # Add experience to replay buffer
199 | self.replay_buffer.add(obs, action, reward, next_obs, done)
200 |
201 |                 # Start training when the number of experiences is greater than train_after
202 | if self.steps > self.train_after:
203 | self.train_model()
204 |
205 | total_reward += reward
206 | step_number += 1
207 | obs = next_obs
208 |
209 | # Save logs
210 | self.logger['LossPi'] = round(np.mean(self.policy_losses), 5)
211 | self.logger['LossQ'] = round(np.mean(self.qf_losses), 5)
212 | return step_number, total_reward
213 |
--------------------------------------------------------------------------------
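A minimal sketch of the target policy smoothing step in `train_model` above: clipped Gaussian noise is added to the target policy's action before it is evaluated by the target critics. The tensors are hypothetical.

```python
import torch

act_limit, target_noise, noise_clip = 1.0, 0.2, 0.5

pi_target = torch.tensor([[0.9, -0.3]])                               # mu_target(s')
epsilon = torch.normal(mean=0.0, std=target_noise, size=pi_target.size())
epsilon = torch.clamp(epsilon, -noise_clip, noise_clip)               # clip the noise
pi_target = torch.clamp(pi_target + epsilon, -act_limit, act_limit)   # keep actions in range
```
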
/agents/trpo.py:
--------------------------------------------------------------------------------
1 | import numpy as np
2 | import torch
3 | import torch.nn as nn
4 | import torch.optim as optim
5 | import torch.nn.functional as F
6 |
7 | from agents.common.utils import *
8 | from agents.common.buffers import *
9 | from agents.common.networks import *
10 |
11 |
12 | class Agent(object):
13 | """
14 | An implementation of the Trust Region Policy Optimization (TRPO) agent
15 | with support for Natural Policy Gradient (NPG).
16 | """
17 |
18 | def __init__(self,
19 | env,
20 | args,
21 | device,
22 | obs_dim,
23 | act_dim,
24 | act_limit,
25 | steps=0,
26 | gamma=0.99,
27 | lam=0.97,
28 | hidden_sizes=(64,64),
29 | sample_size=2048,
30 | vf_lr=1e-3,
31 | train_vf_iters=80,
32 | delta=0.01,
33 | backtrack_iter=10,
34 | backtrack_coeff=1.0,
35 | backtrack_alpha=0.5,
36 | eval_mode=False,
37 | policy_losses=list(),
38 | vf_losses=list(),
39 | kls=list(),
40 | backtrack_iters=list(),
41 | logger=dict(),
42 | ):
43 |
44 | self.env = env
45 | self.args = args
46 | self.device = device
47 | self.obs_dim = obs_dim
48 | self.act_dim = act_dim
49 | self.act_limit = act_limit
50 | self.steps = steps
51 | self.gamma = gamma
52 | self.lam = lam
53 | self.hidden_sizes = hidden_sizes
54 | self.sample_size = sample_size
55 | self.vf_lr = vf_lr
56 | self.train_vf_iters = train_vf_iters
57 | self.delta = delta
58 | self.backtrack_iter = backtrack_iter
59 | self.backtrack_coeff = backtrack_coeff
60 | self.backtrack_alpha = backtrack_alpha
61 | self.eval_mode = eval_mode
62 | self.policy_losses = policy_losses
63 | self.vf_losses = vf_losses
64 | self.kls = kls
65 | self.backtrack_iters = backtrack_iters
66 | self.logger = logger
67 |
68 | # Main network
69 | self.policy = GaussianPolicy(self.obs_dim, self.act_dim, self.act_limit).to(self.device)
70 | self.old_policy = GaussianPolicy(self.obs_dim, self.act_dim, self.act_limit).to(self.device)
71 | self.vf = MLP(self.obs_dim, 1, activation=torch.tanh).to(self.device)
72 |
73 | # Create optimizers
74 | self.vf_optimizer = optim.Adam(self.vf.parameters(), lr=self.vf_lr)
75 |
76 | # Experience buffer
77 | self.buffer = Buffer(self.obs_dim, self.act_dim, self.sample_size, self.device, self.gamma, self.lam)
78 |
79 | def cg(self, obs, b, cg_iters=10, EPS=1e-8, residual_tol=1e-10):
80 | # Conjugate gradient algorithm
81 | # (https://en.wikipedia.org/wiki/Conjugate_gradient_method)
82 | x = torch.zeros(b.size()).to(self.device)
83 | r = b.clone()
84 | p = r.clone()
85 | rdotr = torch.dot(r,r).to(self.device)
86 |
87 | for _ in range(cg_iters):
88 | Ap = self.hessian_vector_product(obs, p)
89 | alpha = rdotr / (torch.dot(p, Ap).to(self.device) + EPS)
90 |
91 | x += alpha * p
92 | r -= alpha * Ap
93 |
94 | new_rdotr = torch.dot(r, r)
95 | p = r + (new_rdotr / rdotr) * p
96 | rdotr = new_rdotr
97 |
98 | if rdotr < residual_tol:
99 | break
100 | return x
101 |
102 | def hessian_vector_product(self, obs, p, damping_coeff=0.1):
103 |         p = p.detach()
104 | kl = self.gaussian_kl(old_policy=self.policy, new_policy=self.policy, obs=obs)
105 | kl_grad = torch.autograd.grad(kl, self.policy.parameters(), create_graph=True)
106 | kl_grad = self.flat_grad(kl_grad)
107 |
108 | kl_grad_p = (kl_grad * p).sum()
109 | kl_hessian = torch.autograd.grad(kl_grad_p, self.policy.parameters())
110 | kl_hessian = self.flat_grad(kl_hessian, hessian=True)
111 | return kl_hessian + p * damping_coeff
112 |
113 | def gaussian_kl(self, old_policy, new_policy, obs):
114 | mu_old, std_old, _, _ = old_policy(obs)
115 | mu_old, std_old = mu_old.detach(), std_old.detach()
116 | mu, std, _, _ = new_policy(obs)
117 |
118 | # kl divergence between old policy and new policy : D( pi_old || pi_new )
119 | # (https://stats.stackexchange.com/questions/7440/kl-divergence-between-two-univariate-gaussians)
120 | kl = torch.log(std/std_old) + (std_old.pow(2)+(mu_old-mu).pow(2))/(2.0*std.pow(2)) - 0.5
121 | return kl.sum(-1, keepdim=True).mean()
122 |
123 | def flat_grad(self, grads, hessian=False):
124 | grad_flatten = []
125 |         if not hessian:
126 | for grad in grads:
127 | grad_flatten.append(grad.view(-1))
128 | grad_flatten = torch.cat(grad_flatten)
129 | return grad_flatten
130 |         else:
131 | for grad in grads:
132 | grad_flatten.append(grad.contiguous().view(-1))
133 | grad_flatten = torch.cat(grad_flatten).data
134 | return grad_flatten
135 |
136 | def flat_params(self, model):
137 | params = []
138 | for param in model.parameters():
139 | params.append(param.data.view(-1))
140 | params_flatten = torch.cat(params)
141 | return params_flatten
142 |
143 | def update_model(self, model, new_params):
144 | index = 0
145 | for params in model.parameters():
146 | params_length = len(params.view(-1))
147 | new_param = new_params[index: index + params_length]
148 | new_param = new_param.view(params.size())
149 | params.data.copy_(new_param)
150 | index += params_length
151 |
152 | def train_model(self):
153 | batch = self.buffer.get()
154 | obs = batch['obs']
155 | act = batch['act']
156 | ret = batch['ret']
157 | adv = batch['adv']
158 |
159 | # Update value network parameter
160 | for _ in range(self.train_vf_iters):
161 | # Prediction V(s)
162 | v = self.vf(obs).squeeze(1)
163 |
164 | # Value loss
165 | vf_loss = F.mse_loss(v, ret)
166 |
167 | self.vf_optimizer.zero_grad()
168 | vf_loss.backward()
169 | self.vf_optimizer.step()
170 |
171 |         # Prediction log π_old(a|s), log π(a|s)
172 | _, _, _, log_pi_old = self.policy(obs, act, use_pi=False)
173 | log_pi_old = log_pi_old.detach()
174 | _, _, _, log_pi = self.policy(obs, act, use_pi=False)
175 |
176 | # Policy loss
177 | ratio_old = torch.exp(log_pi - log_pi_old)
178 | policy_loss_old = (ratio_old*adv).mean()
179 |
180 | # Symbols needed for Conjugate gradient solver
181 | gradient = torch.autograd.grad(policy_loss_old, self.policy.parameters())
182 | gradient = self.flat_grad(gradient)
183 |
184 | # Core calculations for NPG or TRPO
185 | search_dir = self.cg(obs, gradient.data)
186 | gHg = (self.hessian_vector_product(obs, search_dir) * search_dir).sum(0)
187 | step_size = torch.sqrt(2 * self.delta / gHg)
188 | old_params = self.flat_params(self.policy)
189 | self.update_model(self.old_policy, old_params)
190 |
191 | if self.args.algo == 'npg':
192 | params = old_params + step_size * search_dir
193 | self.update_model(self.policy, params)
194 |
195 | kl = self.gaussian_kl(new_policy=self.policy, old_policy=self.old_policy, obs=obs)
196 | elif self.args.algo == 'trpo':
197 | expected_improve = (gradient * step_size * search_dir).sum(0, keepdim=True)
198 |
199 | for i in range(self.backtrack_iter):
200 | # Backtracking line search
201 | # (https://web.stanford.edu/~boyd/cvxbook/bv_cvxbook.pdf) 464p.
202 | params = old_params + self.backtrack_coeff * step_size * search_dir
203 | self.update_model(self.policy, params)
204 |
205 | _, _, _, log_pi = self.policy(obs, act, use_pi=False)
206 | ratio = torch.exp(log_pi - log_pi_old)
207 | policy_loss = (ratio*adv).mean()
208 |
209 | loss_improve = policy_loss - policy_loss_old
210 | expected_improve *= self.backtrack_coeff
211 | improve_condition = loss_improve / expected_improve
212 |
213 | kl = self.gaussian_kl(new_policy=self.policy, old_policy=self.old_policy, obs=obs)
214 |
215 | if kl < self.delta and improve_condition > self.backtrack_alpha:
216 | print('Accepting new params at step %d of line search.'%i)
217 | self.backtrack_iters.append(i)
218 | break
219 |
220 | if i == self.backtrack_iter-1:
221 | print('Line search failed! Keeping old params.')
222 | self.backtrack_iters.append(i)
223 |
224 | params = self.flat_params(self.old_policy)
225 | self.update_model(self.policy, params)
226 |
227 | self.backtrack_coeff *= 0.5
228 |
229 | # Save losses
230 | self.policy_losses.append(policy_loss_old.item())
231 | self.vf_losses.append(vf_loss.item())
232 | self.kls.append(kl.item())
233 |
234 | def run(self, max_step):
235 | step_number = 0
236 | total_reward = 0.
237 |
238 | obs = self.env.reset()
239 | done = False
240 |
241 |         # Keep interacting until the agent reaches a terminal state or the step limit.
242 | while not (done or step_number == max_step):
243 | if self.args.render:
244 | self.env.render()
245 |
246 | if self.eval_mode:
247 | action, _, _, _ = self.policy(torch.Tensor(obs).to(self.device))
248 | action = action.detach().cpu().numpy()
249 | next_obs, reward, done, _ = self.env.step(action)
250 | else:
251 | self.steps += 1
252 |
253 | # Collect experience (s, a, r, s') using some policy
254 | _, _, action, _ = self.policy(torch.Tensor(obs).to(self.device))
255 | action = action.detach().cpu().numpy()
256 | next_obs, reward, done, _ = self.env.step(action)
257 |
258 | # Add experience to buffer
259 | v = self.vf(torch.Tensor(obs).to(self.device))
260 | self.buffer.add(obs, action, reward, done, v)
261 |
262 |                 # Start training once the number of collected experiences equals the sample size
263 | if self.steps == self.sample_size:
264 | self.buffer.finish_path()
265 | self.train_model()
266 | self.steps = 0
267 |
268 | total_reward += reward
269 | step_number += 1
270 | obs = next_obs
271 |
272 | # Save logs
273 | self.logger['LossPi'] = round(np.mean(self.policy_losses), 5)
274 | self.logger['LossV'] = round(np.mean(self.vf_losses), 5)
275 | self.logger['KL'] = round(np.mean(self.kls), 5)
276 | if self.args.algo == 'trpo':
277 | self.logger['BacktrackIters'] = np.mean(self.backtrack_iters)
278 | return step_number, total_reward
279 |
--------------------------------------------------------------------------------
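A note on the TRPO/NPG core above: `cg` never forms the KL-Hessian (Fisher) matrix H explicitly; it only consumes the Hessian-vector products supplied by `hessian_vector_product`, and the resulting search direction s is then scaled by sqrt(2*delta / s^T H s) so that the quadratic estimate of the KL constraint equals delta. The following is a minimal standalone sketch in plain NumPy, on a hypothetical 3x3 positive-definite matrix rather than the repo's API, showing that the same conjugate-gradient recursion recovers the solution of H x = g without ever materializing H:

import numpy as np

# Hypothetical symmetric positive-definite "Hessian" H and "policy gradient" g.
H = np.array([[4.0, 1.0, 0.0],
              [1.0, 3.0, 0.5],
              [0.0, 0.5, 2.0]])
g = np.array([1.0, 2.0, 3.0])

def cg(hvp, b, cg_iters=10, eps=1e-8, residual_tol=1e-10):
    """Solve H x = b given only the Hessian-vector product map p -> H p."""
    x = np.zeros_like(b)
    r = b.copy()                      # residual b - H x (x starts at zero)
    p = r.copy()                      # current search direction
    rdotr = r @ r
    for _ in range(cg_iters):
        Ap = hvp(p)
        alpha = rdotr / (p @ Ap + eps)
        x += alpha * p
        r -= alpha * Ap
        new_rdotr = r @ r
        p = r + (new_rdotr / rdotr) * p
        rdotr = new_rdotr
        if rdotr < residual_tol:
            break
    return x

s = cg(lambda p: H @ p, g)                          # search direction, as in trpo.py
print(np.allclose(s, np.linalg.solve(H, g)))        # True: CG matches the direct solve

delta = 0.01
step_size = np.sqrt(2 * delta / (s @ (H @ s)))      # same scaling applied before the line search
print(step_size)

Exact-arithmetic CG converges in at most n iterations for an n-dimensional SPD system, which is why a small fixed `cg_iters` is usually enough to produce a useful direction for the damped Fisher system above.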
/agents/vpg.py:
--------------------------------------------------------------------------------
1 | import numpy as np
2 | import torch
3 | import torch.nn as nn
4 | import torch.optim as optim
5 | import torch.nn.functional as F
6 |
7 | from agents.common.utils import *
8 | from agents.common.buffers import *
9 | from agents.common.networks import *
10 |
11 |
12 | class Agent(object):
13 | """
14 | An implementation of the Vanilla Policy Gradient (VPG) agent
15 | with GAE-Lambda for advantage estimation.
16 | """
17 |
18 | def __init__(self,
19 | env,
20 | args,
21 | device,
22 | obs_dim,
23 | act_dim,
24 | act_limit,
25 | steps=0,
26 | gamma=0.99,
27 | lam=0.97,
28 | hidden_sizes=(64,64),
29 | sample_size=2048,
30 | policy_lr=1e-3,
31 | vf_lr=1e-3,
32 | train_vf_iters=80,
33 | eval_mode=False,
34 | policy_losses=list(),
35 | vf_losses=list(),
36 | kls=list(),
37 | logger=dict(),
38 | ):
39 |
40 | self.env = env
41 | self.args = args
42 | self.device = device
43 | self.obs_dim = obs_dim
44 | self.act_dim = act_dim
45 | self.act_limit = act_limit
46 | self.steps = steps
47 | self.gamma = gamma
48 | self.lam = lam
49 | self.hidden_sizes = hidden_sizes
50 | self.sample_size = sample_size
51 | self.policy_lr = policy_lr
52 | self.vf_lr = vf_lr
53 | self.train_vf_iters = train_vf_iters
54 | self.eval_mode = eval_mode
55 | self.policy_losses = policy_losses
56 | self.vf_losses = vf_losses
57 | self.kls = kls
58 | self.logger = logger
59 |
60 | # Main network
61 | self.policy = GaussianPolicy(self.obs_dim, self.act_dim, self.act_limit).to(self.device)
62 | self.vf = MLP(self.obs_dim, 1, activation=torch.tanh).to(self.device)
63 |
64 | # Create optimizers
65 | self.policy_optimizer = optim.Adam(self.policy.parameters(), lr=self.policy_lr)
66 | self.vf_optimizer = optim.Adam(self.vf.parameters(), lr=self.vf_lr)
67 |
68 |         # Experience buffer (computes returns and GAE-lambda advantages; see the sketch after this file)
69 | self.buffer = Buffer(self.obs_dim, self.act_dim, self.sample_size, self.device, self.gamma, self.lam)
70 |
71 | def train_model(self):
72 | batch = self.buffer.get()
73 | obs = batch['obs']
74 | act = batch['act'].detach()
75 | ret = batch['ret']
76 | adv = batch['adv']
77 |
78 |         if False:  # Check the shapes of experiences
79 | print("obs", obs.shape)
80 | print("act", act.shape)
81 | print("ret", ret.shape)
82 | print("adv", adv.shape)
83 |
84 | # Update value network parameter
85 | for _ in range(self.train_vf_iters):
86 | # Prediction V(s)
87 | v = self.vf(obs).squeeze(1)
88 |
89 | # Value loss
90 | vf_loss = F.mse_loss(v, ret)
91 |
92 | self.vf_optimizer.zero_grad()
93 | vf_loss.backward()
94 | self.vf_optimizer.step()
95 |
96 |         # Prediction log π_old(a|s) and log π(a|s)
97 | _, _, _, log_pi_old = self.policy(obs, act, use_pi=False)
98 | log_pi_old = log_pi_old.detach()
99 | _, _, _, log_pi = self.policy(obs, act, use_pi=False)
100 |
101 | # Policy loss
102 | policy_loss = -(log_pi*adv).mean()
103 |
104 | # Update policy network parameter
105 | self.policy_optimizer.zero_grad()
106 | policy_loss.backward()
107 | self.policy_optimizer.step()
108 |
109 | # A sample estimate for KL-divergence, easy to compute
110 | approx_kl = (log_pi_old - log_pi).mean()
111 |
112 | # Save losses
113 | self.policy_losses.append(policy_loss.item())
114 | self.vf_losses.append(vf_loss.item())
115 | self.kls.append(approx_kl.item())
116 |
117 | def run(self, max_step):
118 | step_number = 0
119 | total_reward = 0.
120 |
121 | obs = self.env.reset()
122 | done = False
123 |
124 |         # Keep interacting until the agent reaches a terminal state or the step limit.
125 | while not (done or step_number == max_step):
126 | if self.args.render:
127 | self.env.render()
128 |
129 | if self.eval_mode:
130 | action, _, _, _ = self.policy(torch.Tensor(obs).to(self.device))
131 | action = action.detach().cpu().numpy()
132 | next_obs, reward, done, _ = self.env.step(action)
133 | else:
134 | self.steps += 1
135 |
136 | # Collect experience (s, a, r, s') using some policy
137 | _, _, action, log_pi = self.policy(torch.Tensor(obs).to(self.device))
138 | action = action.detach().cpu().numpy()
139 | next_obs, reward, done, _ = self.env.step(action)
140 |
141 | # Add experience to buffer
142 | v = self.vf(torch.Tensor(obs).to(self.device))
143 | self.buffer.add(obs, action, reward, done, v)
144 |
145 |                 # Start training once the number of collected experiences equals the sample size
146 | if self.steps == self.sample_size:
147 | self.buffer.finish_path()
148 | self.train_model()
149 | self.steps = 0
150 |
151 | total_reward += reward
152 | step_number += 1
153 | obs = next_obs
154 |
155 | # Save logs
156 | self.logger['LossPi'] = round(np.mean(self.policy_losses), 5)
157 | self.logger['LossV'] = round(np.mean(self.vf_losses), 5)
158 | self.logger['KL'] = round(np.mean(self.kls), 5)
159 | return step_number, total_reward
160 |
--------------------------------------------------------------------------------
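The VPG agent above (like the TRPO/NPG agent before it) hands rewards and value predictions to `Buffer`, which lives in agents/common/buffers.py and is not part of this excerpt. As a reference for what `gamma=0.99` and `lam=0.97` parameterize, here is a minimal NumPy sketch of the standard GAE(lambda) recursion, delta_t = r_t + gamma*V(s_{t+1}) - V(s_t) and A_t = delta_t + gamma*lam*A_{t+1}; the function and variable names are illustrative, not the repo's Buffer API:

import numpy as np

def gae_advantages(rewards, values, last_value, gamma=0.99, lam=0.97):
    """GAE(lambda) advantages and lambda-returns for a single trajectory segment.

    rewards:    r_0 ... r_{T-1}
    values:     V(s_0) ... V(s_{T-1})
    last_value: bootstrap value V(s_T) (0.0 if the segment ended in a terminal state).
    """
    T = len(rewards)
    values = np.append(values, last_value)
    advantages = np.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]   # TD residual delta_t
        gae = delta + gamma * lam * gae                          # A_t = delta_t + gamma*lam*A_{t+1}
        advantages[t] = gae
    returns = advantages + values[:-1]   # lambda-returns: targets for the value regression
    return advantages, returns

adv, ret = gae_advantages(rewards=[1.0, 1.0, 1.0], values=[0.5, 0.4, 0.3], last_value=0.0)
print(adv, ret)

The lambda-returns `advantages + values[:-1]` are the usual regression targets for the value network; whether the repo's Buffer also normalizes the advantages is not visible in this excerpt.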
/results/graphs/ant.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/dongminlee94/deep_rl/d5bcabc541a5f16e166be33876d23352e149f97e/results/graphs/ant.png
--------------------------------------------------------------------------------
/results/graphs/halfcheetah.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/dongminlee94/deep_rl/d5bcabc541a5f16e166be33876d23352e149f97e/results/graphs/halfcheetah.png
--------------------------------------------------------------------------------
/results/graphs/humanoid.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/dongminlee94/deep_rl/d5bcabc541a5f16e166be33876d23352e149f97e/results/graphs/humanoid.png
--------------------------------------------------------------------------------
/run_cartpole.py:
--------------------------------------------------------------------------------
1 | import os
2 | import gym
3 | import time
4 | import argparse
5 | import datetime
6 | import numpy as np
7 | import torch
8 | from torch.utils.tensorboard import SummaryWriter
9 |
10 | # Configurations
11 | parser = argparse.ArgumentParser(description='RL algorithms with PyTorch in CartPole environment')
12 | parser.add_argument('--env', type=str, default='CartPole-v1',
13 | help='cartpole environment')
14 | parser.add_argument('--algo', type=str, default='dqn',
15 | help='select an algorithm among dqn, ddqn, a2c')
16 | parser.add_argument('--phase', type=str, default='train',
17 | help='choose between training phase and testing phase')
18 | parser.add_argument('--render', action='store_true', default=False,
19 |                     help='render the environment while running episodes')
20 | parser.add_argument('--load', type=str, default=None,
21 |                     help='name of a saved model in ./save_model to load')
22 | parser.add_argument('--seed', type=int, default=0,
23 | help='seed for random number generators')
24 | parser.add_argument('--iterations', type=int, default=500,
25 | help='iterations to run and train agent')
26 | parser.add_argument('--eval_per_train', type=int, default=50,
27 |                     help='number of training iterations between evaluations')
28 | parser.add_argument('--max_step', type=int, default=500,
29 | help='max episode step')
30 | parser.add_argument('--threshold_return', type=int, default=500,
31 |                     help='average evaluation return required to consider the environment solved')
32 | parser.add_argument('--tensorboard', action='store_true', default=True)
33 | parser.add_argument('--gpu_index', type=int, default=0)
34 | args = parser.parse_args()
35 | device = torch.device('cuda', index=args.gpu_index) if torch.cuda.is_available() else torch.device('cpu')
36 |
37 | if args.algo == 'dqn':
38 | from agents.dqn import Agent
39 | elif args.algo == 'ddqn': # DDQN reuses the DQN agent; only the bootstrap target changes (sketched after this file)
40 | from agents.dqn import Agent
41 | elif args.algo == 'a2c':
42 | from agents.a2c import Agent
43 |
44 |
45 | def main():
46 | """Main."""
47 | # Initialize environment
48 | env = gym.make(args.env)
49 | obs_dim = env.observation_space.shape[0]
50 | act_num = env.action_space.n
51 |
52 | print('---------------------------------------')
53 | print('Environment:', args.env)
54 | print('Algorithm:', args.algo)
55 | print('State dimension:', obs_dim)
56 | print('Action number:', act_num)
57 | print('---------------------------------------')
58 |
59 | # Set a random seed
60 | env.seed(args.seed)
61 | np.random.seed(args.seed)
62 | torch.manual_seed(args.seed)
63 |
64 | # Create an agent
65 | agent = Agent(env, args, device, obs_dim, act_num)
66 |
67 | # If we have a saved model, load it
68 | if args.load is not None:
69 |         pretrained_model_path = os.path.join('./save_model', str(args.load))
70 | pretrained_model = torch.load(pretrained_model_path, map_location=device)
71 | if args.algo == 'dqn' or args.algo == 'ddqn':
72 | agent.qf.load_state_dict(pretrained_model)
73 | else:
74 | agent.policy.load_state_dict(pretrained_model)
75 |
76 |     # Create a TensorBoard SummaryWriter
77 | if args.tensorboard and args.load is None:
78 | dir_name = 'runs/' + args.env + '/' \
79 | + args.algo \
80 | + '_s_' + str(args.seed) \
81 | + '_t_' + datetime.datetime.now().strftime("%Y-%m-%d-%H-%M-%S")
82 | writer = SummaryWriter(log_dir=dir_name)
83 |
84 | start_time = time.time()
85 |
86 | train_num_steps = 0
87 | train_sum_returns = 0.
88 | train_num_episodes = 0
89 |
90 | # Main loop
91 | for i in range(args.iterations):
92 | # Perform the training phase, during which the agent learns
93 | if args.phase == 'train':
94 | agent.eval_mode = False
95 |
96 | # Run one episode
97 | train_step_length, train_episode_return = agent.run(args.max_step)
98 |
99 | train_num_steps += train_step_length
100 | train_sum_returns += train_episode_return
101 | train_num_episodes += 1
102 |
103 | train_average_return = train_sum_returns / train_num_episodes if train_num_episodes > 0 else 0.0
104 |
105 | # Log experiment result for training episodes
106 | if args.tensorboard and args.load is None:
107 | writer.add_scalar('Train/AverageReturns', train_average_return, i)
108 | writer.add_scalar('Train/EpisodeReturns', train_episode_return, i)
109 |
110 | # Perform the evaluation phase -- no learning
111 | if (i + 1) % args.eval_per_train == 0:
112 | eval_sum_returns = 0.
113 | eval_num_episodes = 0
114 | agent.eval_mode = True
115 |
116 | for _ in range(100):
117 | # Run one episode
118 | eval_step_length, eval_episode_return = agent.run(args.max_step)
119 |
120 | eval_sum_returns += eval_episode_return
121 | eval_num_episodes += 1
122 |
123 | eval_average_return = eval_sum_returns / eval_num_episodes if eval_num_episodes > 0 else 0.0
124 |
125 | # Log experiment result for evaluation episodes
126 | if args.tensorboard and args.load is None:
127 | writer.add_scalar('Eval/AverageReturns', eval_average_return, i)
128 | writer.add_scalar('Eval/EpisodeReturns', eval_episode_return, i)
129 |
130 | if args.phase == 'train':
131 | print('---------------------------------------')
132 | print('Iterations:', i + 1)
133 | print('Steps:', train_num_steps)
134 | print('Episodes:', train_num_episodes)
135 | print('EpisodeReturn:', round(train_episode_return, 2))
136 | print('AverageReturn:', round(train_average_return, 2))
137 | print('EvalEpisodes:', eval_num_episodes)
138 | print('EvalEpisodeReturn:', round(eval_episode_return, 2))
139 | print('EvalAverageReturn:', round(eval_average_return, 2))
140 | print('OtherLogs:', agent.logger)
141 | print('Time:', int(time.time() - start_time))
142 | print('---------------------------------------')
143 |
144 | # Save the trained model
145 | if eval_average_return >= args.threshold_return:
146 | if not os.path.exists('./save_model'):
147 | os.mkdir('./save_model')
148 |
149 | ckpt_path = os.path.join('./save_model/' + args.env + '_' + args.algo \
150 | + '_s_' + str(args.seed) \
151 | + '_i_' + str(i + 1) \
152 | + '_tr_' + str(round(train_episode_return, 2)) \
153 | + '_er_' + str(round(eval_episode_return, 2)) + '.pt')
154 |
155 | if args.algo == 'dqn' or args.algo == 'ddqn':
156 | torch.save(agent.qf.state_dict(), ckpt_path)
157 | else:
158 | torch.save(agent.policy.state_dict(), ckpt_path)
159 | elif args.phase == 'test':
160 | print('---------------------------------------')
161 | print('EvalEpisodes:', eval_num_episodes)
162 | print('EvalEpisodeReturn:', round(eval_episode_return, 2))
163 | print('EvalAverageReturn:', round(eval_average_return, 2))
164 | print('Time:', int(time.time() - start_time))
165 | print('---------------------------------------')
166 |
167 | if __name__ == "__main__":
168 | main()
169 |
--------------------------------------------------------------------------------
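As noted at the `ddqn` branch above, DQN and Double DQN share the same agent class; only the bootstrap target differs. Since agents/dqn.py is not included in this excerpt, the snippet below is a generic PyTorch sketch of the two targets rather than the repo's exact code: vanilla DQN lets the target network both select and evaluate the next action, while Double DQN selects it with the online network and evaluates it with the target network (Van Hasselt et al. 2015), which reduces the overestimation bias of the max operator.

import torch

def q_target(reward, next_obs, done, qf, qf_target, gamma=0.99, double=False):
    """Bootstrap target for (Double) DQN; qf and qf_target map a batch of obs to per-action Q-values."""
    with torch.no_grad():
        if double:
            # Double DQN: the online network picks the next action...
            next_action = qf(next_obs).argmax(dim=1, keepdim=True)
            # ...and the target network evaluates it.
            next_q = qf_target(next_obs).gather(1, next_action).squeeze(1)
        else:
            # Vanilla DQN: the target network both picks and evaluates the action.
            next_q = qf_target(next_obs).max(dim=1)[0]
        return reward + gamma * (1.0 - done) * next_q

# Tiny usage example with random linear "Q-networks" (4-dim observations, 2 actions).
qf, qf_target = torch.nn.Linear(4, 2), torch.nn.Linear(4, 2)
obs, reward, done = torch.randn(8, 4), torch.ones(8), torch.zeros(8)
print(q_target(reward, obs, done, qf, qf_target, double=True).shape)  # torch.Size([8])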
/run_mujoco.py:
--------------------------------------------------------------------------------
1 | import os
2 | import gym
3 | import time
4 | import argparse
5 | import datetime
6 | import numpy as np
7 | import torch
8 | from torch.utils.tensorboard import SummaryWriter
9 |
10 | # Configurations
11 | parser = argparse.ArgumentParser(description='RL algorithms with PyTorch in MuJoCo environments')
12 | parser.add_argument('--env', type=str, default='Humanoid-v2',
13 | help='choose an environment between Hopper-v2, HalfCheetah-v2, Ant-v2 and Humanoid-v2')
14 | parser.add_argument('--algo', type=str, default='atac',
15 | help='select an algorithm among vpg, npg, trpo, ppo, ddpg, td3, sac, asac, tac, atac')
16 | parser.add_argument('--phase', type=str, default='train',
17 | help='choose between training phase and testing phase')
18 | parser.add_argument('--render', action='store_true', default=False,
19 |                     help='render the environment while running episodes')
20 | parser.add_argument('--load', type=str, default=None,
21 |                     help='name of a saved model in ./save_model to load')
22 | parser.add_argument('--seed', type=int, default=0,
23 | help='seed for random number generators')
24 | parser.add_argument('--iterations', type=int, default=200,
25 | help='iterations to run and train agent')
26 | parser.add_argument('--steps_per_iter', type=int, default=5000,
27 |                     help='steps of agent-environment interaction per iteration')
28 | parser.add_argument('--max_step', type=int, default=1000,
29 | help='max episode step')
30 | parser.add_argument('--tensorboard', action='store_true', default=True)
31 | parser.add_argument('--gpu_index', type=int, default=0)
32 | args = parser.parse_args()
33 | device = torch.device('cuda', index=args.gpu_index) if torch.cuda.is_available() else torch.device('cpu')
34 |
35 | if args.algo == 'vpg':
36 | from agents.vpg import Agent
37 | elif args.algo == 'npg':
38 | from agents.trpo import Agent
39 | elif args.algo == 'trpo':
40 | from agents.trpo import Agent
41 | elif args.algo == 'ppo':
42 | from agents.ppo import Agent
43 | elif args.algo == 'ddpg':
44 | from agents.ddpg import Agent
45 | elif args.algo == 'td3':
46 | from agents.td3 import Agent
47 | elif args.algo == 'sac':
48 | from agents.sac import Agent
49 | elif args.algo == 'asac': # Automating entropy adjustment on SAC (the temperature update is sketched after this file)
50 | from agents.sac import Agent
51 | elif args.algo == 'tac':
52 | from agents.sac import Agent
53 | elif args.algo == 'atac': # Automating entropy adjustment on TAC
54 | from agents.sac import Agent
55 |
56 |
57 | def main():
58 | """Main."""
59 | # Initialize environment
60 | env = gym.make(args.env)
61 | obs_dim = env.observation_space.shape[0]
62 | act_dim = env.action_space.shape[0]
63 | act_limit = env.action_space.high[0]
64 |
65 | print('---------------------------------------')
66 | print('Environment:', args.env)
67 | print('Algorithm:', args.algo)
68 | print('State dimension:', obs_dim)
69 | print('Action dimension:', act_dim)
70 | print('Action limit:', act_limit)
71 | print('---------------------------------------')
72 |
73 | # Set a random seed
74 | env.seed(args.seed)
75 | np.random.seed(args.seed)
76 | torch.manual_seed(args.seed)
77 |
78 | # Create an agent
79 | if args.algo == 'ddpg' or args.algo == 'td3':
80 | agent = Agent(env, args, device, obs_dim, act_dim, act_limit,
81 | expl_before=10000,
82 | act_noise=0.1,
83 | hidden_sizes=(256,256),
84 | buffer_size=int(1e6),
85 | batch_size=256,
86 | policy_lr=3e-4,
87 | qf_lr=3e-4)
88 | elif args.algo == 'sac':
89 | agent = Agent(env, args, device, obs_dim, act_dim, act_limit,
90 | expl_before=10000,
91 |                       alpha=0.2,              # In HalfCheetah-v2 and Ant-v2, SAC with an entropy
92 |                       hidden_sizes=(256,256), # coefficient of 0.2 shows the best performance,
93 |                       buffer_size=int(1e6),   # while in Humanoid-v2, 0.05 works best.
94 | batch_size=256,
95 | policy_lr=3e-4,
96 | qf_lr=3e-4)
97 | elif args.algo == 'asac':
98 | agent = Agent(env, args, device, obs_dim, act_dim, act_limit,
99 | expl_before=10000,
100 | automatic_entropy_tuning=True,
101 | hidden_sizes=(256,256),
102 | buffer_size=int(1e6),
103 | batch_size=256,
104 | policy_lr=3e-4,
105 | qf_lr=3e-4)
106 | elif args.algo == 'tac':
107 | agent = Agent(env, args, device, obs_dim, act_dim, act_limit,
108 | expl_before=10000,
109 | alpha=0.2,
110 | log_type='log-q',
111 | entropic_index=1.2,
112 | hidden_sizes=(256,256),
113 | buffer_size=int(1e6),
114 | batch_size=256,
115 | policy_lr=3e-4,
116 | qf_lr=3e-4)
117 | elif args.algo == 'atac':
118 | agent = Agent(env, args, device, obs_dim, act_dim, act_limit,
119 | expl_before=10000,
120 | log_type='log-q',
121 | entropic_index=1.2,
122 | automatic_entropy_tuning=True,
123 | hidden_sizes=(256,256),
124 | buffer_size=int(1e6),
125 | batch_size=256,
126 | policy_lr=3e-4,
127 | qf_lr=3e-4)
128 | else: # vpg, npg, trpo, ppo
129 | agent = Agent(env, args, device, obs_dim, act_dim, act_limit, sample_size=4096)
130 |
131 | # If we have a saved model, load it
132 | if args.load is not None:
133 |         pretrained_model_path = os.path.join('./save_model', str(args.load))
134 | pretrained_model = torch.load(pretrained_model_path, map_location=device)
135 | agent.policy.load_state_dict(pretrained_model)
136 |
137 |     # Create a TensorBoard SummaryWriter
138 | if args.tensorboard and args.load is None:
139 | dir_name = 'runs/' + args.env + '/' \
140 | + args.algo \
141 | + '_s_' + str(args.seed) \
142 | + '_t_' + datetime.datetime.now().strftime("%Y-%m-%d-%H-%M-%S")
143 | writer = SummaryWriter(log_dir=dir_name)
144 |
145 | start_time = time.time()
146 |
147 | total_num_steps = 0
148 | train_sum_returns = 0.
149 | train_num_episodes = 0
150 |
151 | # Main loop
152 | for i in range(args.iterations):
153 | # Perform the training phase, during which the agent learns
154 | if args.phase == 'train':
155 | train_step_count = 0
156 |
157 | while train_step_count <= args.steps_per_iter:
158 | agent.eval_mode = False
159 |
160 | # Run one episode
161 | train_step_length, train_episode_return = agent.run(args.max_step)
162 |
163 | total_num_steps += train_step_length
164 | train_step_count += train_step_length
165 | train_sum_returns += train_episode_return
166 | train_num_episodes += 1
167 |
168 | train_average_return = train_sum_returns / train_num_episodes if train_num_episodes > 0 else 0.0
169 |
170 | # Log experiment result for training steps
171 | if args.tensorboard and args.load is None:
172 | writer.add_scalar('Train/AverageReturns', train_average_return, total_num_steps)
173 | writer.add_scalar('Train/EpisodeReturns', train_episode_return, total_num_steps)
174 | if args.algo == 'asac' or args.algo == 'atac':
175 | writer.add_scalar('Train/Alpha', agent.alpha, total_num_steps)
176 |
177 | # Perform the evaluation phase -- no learning
178 | eval_sum_returns = 0.
179 | eval_num_episodes = 0
180 | agent.eval_mode = True
181 |
182 | for _ in range(10):
183 | # Run one episode
184 | eval_step_length, eval_episode_return = agent.run(args.max_step)
185 |
186 | eval_sum_returns += eval_episode_return
187 | eval_num_episodes += 1
188 |
189 | eval_average_return = eval_sum_returns / eval_num_episodes if eval_num_episodes > 0 else 0.0
190 |
191 | # Log experiment result for evaluation steps
192 | if args.tensorboard and args.load is None:
193 | writer.add_scalar('Eval/AverageReturns', eval_average_return, total_num_steps)
194 | writer.add_scalar('Eval/EpisodeReturns', eval_episode_return, total_num_steps)
195 |
196 | if args.phase == 'train':
197 | print('---------------------------------------')
198 | print('Iterations:', i + 1)
199 | print('Steps:', total_num_steps)
200 | print('Episodes:', train_num_episodes)
201 | print('EpisodeReturn:', round(train_episode_return, 2))
202 | print('AverageReturn:', round(train_average_return, 2))
203 | print('EvalEpisodes:', eval_num_episodes)
204 | print('EvalEpisodeReturn:', round(eval_episode_return, 2))
205 | print('EvalAverageReturn:', round(eval_average_return, 2))
206 | print('OtherLogs:', agent.logger)
207 | print('Time:', int(time.time() - start_time))
208 | print('---------------------------------------')
209 |
210 | # Save the trained model
211 | if (i + 1) >= 180 and (i + 1) % 20 == 0:
212 | if not os.path.exists('./save_model'):
213 | os.mkdir('./save_model')
214 |
215 | ckpt_path = os.path.join('./save_model/' + args.env + '_' + args.algo \
216 | + '_s_' + str(args.seed) \
217 | + '_i_' + str(i + 1) \
218 | + '_tr_' + str(round(train_episode_return, 2)) \
219 | + '_er_' + str(round(eval_episode_return, 2)) + '.pt')
220 |
221 | torch.save(agent.policy.state_dict(), ckpt_path)
222 | elif args.phase == 'test':
223 | print('---------------------------------------')
224 | print('EvalEpisodes:', eval_num_episodes)
225 | print('EvalEpisodeReturn:', round(eval_episode_return, 2))
226 | print('EvalAverageReturn:', round(eval_average_return, 2))
227 | print('Time:', int(time.time() - start_time))
228 | print('---------------------------------------')
229 |
230 | if __name__ == "__main__":
231 | main()
232 |
--------------------------------------------------------------------------------
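run_mujoco.py passes `automatic_entropy_tuning=True` for the `asac` and `atac` branches, replacing the hand-picked entropy coefficient (the 0.2 vs. 0.05 trade-off mentioned in the `sac` branch's comment) with a learned temperature. agents/sac.py is not shown in this excerpt, so the following is only a sketch of the standard update from Haarnoja et al. 2018, with illustrative names: optimize log(alpha) so that alpha grows when the policy's entropy falls below a target (commonly minus the action dimension) and shrinks otherwise.

import torch
import torch.optim as optim

# Hypothetical setup: act_dim and the log_pi batch would come from the environment and the policy.
act_dim = 17                                        # e.g. Humanoid-v2 has a 17-dimensional action space
target_entropy = -float(act_dim)                    # common heuristic: target entropy = -|A|
log_alpha = torch.zeros(1, requires_grad=True)      # optimize log(alpha) so alpha stays positive
alpha_optimizer = optim.Adam([log_alpha], lr=3e-4)

def update_alpha(log_pi):
    """One temperature update given log-probabilities of freshly sampled actions."""
    alpha_loss = -(log_alpha * (log_pi + target_entropy).detach()).mean()
    alpha_optimizer.zero_grad()
    alpha_loss.backward()
    alpha_optimizer.step()
    return log_alpha.exp().item()                   # alpha to use in the actor/critic losses

# Dummy batch of log-probabilities just to show the call.
print(update_alpha(torch.randn(256) - 10.0))

The returned `alpha` is what a SAC-style agent would plug into its actor and critic losses, and is presumably what the `Train/Alpha` TensorBoard scalar above tracks.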
/run_pendulum.py:
--------------------------------------------------------------------------------
1 | import os
2 | import gym
3 | import time
4 | import argparse
5 | import datetime
6 | import numpy as np
7 | import torch
8 | from torch.utils.tensorboard import SummaryWriter
9 |
10 | # Configurations
11 | parser = argparse.ArgumentParser(description='RL algorithms with PyTorch in Pendulum environment')
12 | parser.add_argument('--env', type=str, default='Pendulum-v0',
13 | help='pendulum environment')
14 | parser.add_argument('--algo', type=str, default='atac',
15 | help='select an algorithm among vpg, npg, trpo, ppo, ddpg, td3, sac, asac, tac, atac')
16 | parser.add_argument('--phase', type=str, default='train',
17 | help='choose between training phase and testing phase')
18 | parser.add_argument('--render', action='store_true', default=False,
19 |                     help='render the environment while running episodes')
20 | parser.add_argument('--load', type=str, default=None,
21 |                     help='name of a saved model in ./save_model to load')
22 | parser.add_argument('--seed', type=int, default=0,
23 | help='seed for random number generators')
24 | parser.add_argument('--iterations', type=int, default=1000,
25 | help='iterations to run and train agent')
26 | parser.add_argument('--eval_per_train', type=int, default=100,
27 |                     help='number of training iterations between evaluations')
28 | parser.add_argument('--max_step', type=int, default=200,
29 | help='max episode step')
30 | parser.add_argument('--threshold_return', type=int, default=-230,
31 |                     help='average evaluation return required to consider the environment solved')
32 | parser.add_argument('--tensorboard', action='store_true', default=True)
33 | parser.add_argument('--gpu_index', type=int, default=0)
34 | args = parser.parse_args()
35 | device = torch.device('cuda', index=args.gpu_index) if torch.cuda.is_available() else torch.device('cpu')
36 |
37 | if args.algo == 'vpg':
38 | from agents.vpg import Agent
39 | elif args.algo == 'npg':
40 | from agents.trpo import Agent
41 | elif args.algo == 'trpo':
42 | from agents.trpo import Agent
43 | elif args.algo == 'ppo':
44 | from agents.ppo import Agent
45 | elif args.algo == 'ddpg':
46 | from agents.ddpg import Agent
47 | elif args.algo == 'td3':
48 | from agents.td3 import Agent
49 | elif args.algo == 'sac':
50 | from agents.sac import Agent
51 | elif args.algo == 'asac': # Automating entropy adjustment on SAC
52 | from agents.sac import Agent
53 | elif args.algo == 'tac':
54 | from agents.sac import Agent
55 | elif args.algo == 'atac': # Automating entropy adjustment on TAC
56 | from agents.sac import Agent
57 |
58 |
59 | def main():
60 | """Main."""
61 | # Initialize environment
62 | env = gym.make(args.env)
63 | obs_dim = env.observation_space.shape[0]
64 | act_dim = env.action_space.shape[0]
65 | act_limit = env.action_space.high[0]
66 |
67 | print('---------------------------------------')
68 | print('Environment:', args.env)
69 | print('Algorithm:', args.algo)
70 | print('State dimension:', obs_dim)
71 | print('Action dimension:', act_dim)
72 | print('Action limit:', act_limit)
73 | print('---------------------------------------')
74 |
75 | # Set a random seed
76 | env.seed(args.seed)
77 | np.random.seed(args.seed)
78 | torch.manual_seed(args.seed)
79 |
80 | # Create an agent
81 | if args.algo == 'ddpg' or args.algo == 'td3':
82 | agent = Agent(env, args, device, obs_dim, act_dim, act_limit)
83 | elif args.algo == 'sac':
84 | agent = Agent(env, args, device, obs_dim, act_dim, act_limit,
85 | alpha=0.5)
86 | elif args.algo == 'asac':
87 | agent = Agent(env, args, device, obs_dim, act_dim, act_limit,
88 | automatic_entropy_tuning=True)
89 | elif args.algo == 'tac':
90 | agent = Agent(env, args, device, obs_dim, act_dim, act_limit,
91 | alpha=0.5,
92 | log_type='log-q',
93 | entropic_index=1.2)
94 | elif args.algo == 'atac':
95 | agent = Agent(env, args, device, obs_dim, act_dim, act_limit,
96 | log_type='log-q',
97 | entropic_index=1.2,
98 | automatic_entropy_tuning=True)
99 | else: # vpg, npg, trpo, ppo
100 | agent = Agent(env, args, device, obs_dim, act_dim, act_limit)
101 |
102 | # If we have a saved model, load it
103 | if args.load is not None:
104 |         pretrained_model_path = os.path.join('./save_model', str(args.load))
105 | pretrained_model = torch.load(pretrained_model_path, map_location=device)
106 | agent.policy.load_state_dict(pretrained_model)
107 |
108 |     # Create a TensorBoard SummaryWriter
109 | if args.tensorboard and args.load is None:
110 | dir_name = 'runs/' + args.env + '/' \
111 | + args.algo \
112 | + '_s_' + str(args.seed) \
113 | + '_t_' + datetime.datetime.now().strftime("%Y-%m-%d-%H-%M-%S")
114 | writer = SummaryWriter(log_dir=dir_name)
115 |
116 | start_time = time.time()
117 |
118 | train_num_steps = 0
119 | train_sum_returns = 0.
120 | train_num_episodes = 0
121 |
122 | # Main loop
123 | for i in range(args.iterations):
124 | # Perform the training phase, during which the agent learns
125 | if args.phase == 'train':
126 | agent.eval_mode = False
127 |
128 | # Run one episode
129 | train_step_length, train_episode_return = agent.run(args.max_step)
130 |
131 | train_num_steps += train_step_length
132 | train_sum_returns += train_episode_return
133 | train_num_episodes += 1
134 |
135 | train_average_return = train_sum_returns / train_num_episodes if train_num_episodes > 0 else 0.0
136 |
137 | # Log experiment result for training episodes
138 | if args.tensorboard and args.load is None:
139 | writer.add_scalar('Train/AverageReturns', train_average_return, i)
140 | writer.add_scalar('Train/EpisodeReturns', train_episode_return, i)
141 | if args.algo == 'asac' or args.algo == 'atac':
142 | writer.add_scalar('Train/Alpha', agent.alpha, i)
143 |
144 | # Perform the evaluation phase -- no learning
145 | if (i + 1) % args.eval_per_train == 0:
146 | eval_sum_returns = 0.
147 | eval_num_episodes = 0
148 | agent.eval_mode = True
149 |
150 | for _ in range(100):
151 | # Run one episode
152 | eval_step_length, eval_episode_return = agent.run(args.max_step)
153 |
154 | eval_sum_returns += eval_episode_return
155 | eval_num_episodes += 1
156 |
157 | eval_average_return = eval_sum_returns / eval_num_episodes if eval_num_episodes > 0 else 0.0
158 |
159 | # Log experiment result for evaluation episodes
160 | if args.tensorboard and args.load is None:
161 | writer.add_scalar('Eval/AverageReturns', eval_average_return, i)
162 | writer.add_scalar('Eval/EpisodeReturns', eval_episode_return, i)
163 |
164 | if args.phase == 'train':
165 | print('---------------------------------------')
166 | print('Iterations:', i + 1)
167 | print('Steps:', train_num_steps)
168 | print('Episodes:', train_num_episodes)
169 | print('EpisodeReturn:', round(train_episode_return, 2))
170 | print('AverageReturn:', round(train_average_return, 2))
171 | print('EvalEpisodes:', eval_num_episodes)
172 | print('EvalEpisodeReturn:', round(eval_episode_return, 2))
173 | print('EvalAverageReturn:', round(eval_average_return, 2))
174 | print('OtherLogs:', agent.logger)
175 | print('Time:', int(time.time() - start_time))
176 | print('---------------------------------------')
177 |
178 |                 # Save the trained model (a reload sketch follows this file)
179 | if eval_average_return >= args.threshold_return:
180 | if not os.path.exists('./save_model'):
181 | os.mkdir('./save_model')
182 |
183 | ckpt_path = os.path.join('./save_model/' + args.env + '_' + args.algo \
184 | + '_s_' + str(args.seed) \
185 | + '_i_' + str(i + 1) \
186 | + '_tr_' + str(round(train_episode_return, 2)) \
187 | + '_er_' + str(round(eval_episode_return, 2)) + '.pt')
188 |
189 | torch.save(agent.policy.state_dict(), ckpt_path)
190 | elif args.phase == 'test':
191 | print('---------------------------------------')
192 | print('EvalEpisodes:', eval_num_episodes)
193 | print('EvalEpisodeReturn:', round(eval_episode_return, 2))
194 | print('EvalAverageReturn:', round(eval_average_return, 2))
195 | print('Time:', int(time.time() - start_time))
196 | print('---------------------------------------')
197 |
198 | if __name__ == "__main__":
199 | main()
200 |
--------------------------------------------------------------------------------
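A closing usage note: all three run scripts save only `agent.policy.state_dict()` under ./save_model/, using the `<env>_<algo>_s_<seed>_i_<iteration>_tr_<train return>_er_<eval return>.pt` pattern built above, and reload a checkpoint via `--load <file name> --phase test`. To replay a checkpoint outside the runners, something like the sketch below should work for the on-policy agents (vpg/npg/trpo/ppo), whose `GaussianPolicy(obs_dim, act_dim, act_limit)` constructor and eval-time call convention appear in trpo.py and vpg.py; the checkpoint file name is a placeholder, and whether the SAC-family policies use the same class is not visible in this excerpt.

import gym
import torch
from agents.common.networks import GaussianPolicy

# Placeholder name: substitute an actual file written to ./save_model/ by run_pendulum.py.
ckpt = './save_model/Pendulum-v0_trpo_s_0_i_1000_tr_-200.0_er_-180.0.pt'

env = gym.make('Pendulum-v0')
obs_dim = env.observation_space.shape[0]
act_dim = env.action_space.shape[0]
act_limit = env.action_space.high[0]

# Same constructor the on-policy agents use for their Gaussian policies.
policy = GaussianPolicy(obs_dim, act_dim, act_limit)
policy.load_state_dict(torch.load(ckpt, map_location='cpu'))

obs, done, total_reward = env.reset(), False, 0.
while not done:
    # As in the agents' eval branch, the first output is taken as the action.
    action, _, _, _ = policy(torch.Tensor(obs))
    obs, reward, done, _ = env.step(action.detach().cpu().numpy())
    total_reward += reward
print('return:', total_reward)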