├── .github └── workflows │ └── codeql-analysis.yml ├── README.md ├── SECURITY.md ├── code ├── .DS_Store ├── RND │ ├── agents.py │ ├── bash.sh │ ├── envs.py │ ├── model.py │ ├── requirements.txt │ ├── train.py │ └── utils.py ├── Rainbow │ ├── .DS_Store │ ├── README.md │ ├── agent.py │ ├── bash.sh │ ├── env.py │ ├── main.py │ ├── memory.py │ ├── model.py │ ├── requirements.txt │ └── test.py ├── SAC-discrete │ ├── .DS_Store │ ├── bash.sh │ ├── config │ │ ├── .DS_Store │ │ └── default.yaml │ ├── env.py │ ├── memory.py │ ├── train.py │ └── utils.py ├── a2c.py ├── acer.py ├── ars.py ├── ars_tune.py ├── cem.py ├── cem_tune.py ├── dqn.py ├── ppo.py ├── trpo.py └── vpg.py └── docs ├── ppo_experiments.png └── rainbow.png /.github/workflows/codeql-analysis.yml: -------------------------------------------------------------------------------- 1 | # For most projects, this workflow file will not need changing; you simply need 2 | # to commit it to your repository. 3 | # 4 | # You may wish to alter this file to override the set of languages analyzed, 5 | # or to provide custom queries or build logic. 6 | # 7 | # ******** NOTE ******** 8 | # We have attempted to detect the languages in your repository. Please check 9 | # the `language` matrix defined below to confirm you have the correct set of 10 | # supported CodeQL languages. 11 | # 12 | name: "CodeQL" 13 | 14 | on: 15 | push: 16 | branches: [ master ] 17 | pull_request: 18 | # The branches below must be a subset of the branches above 19 | branches: [ master ] 20 | schedule: 21 | - cron: '20 8 * * 1' 22 | 23 | jobs: 24 | analyze: 25 | name: Analyze 26 | runs-on: ubuntu-latest 27 | 28 | strategy: 29 | fail-fast: false 30 | matrix: 31 | language: [ 'python' ] 32 | # CodeQL supports [ 'cpp', 'csharp', 'go', 'java', 'javascript', 'python' ] 33 | # Learn more: 34 | # https://docs.github.com/en/free-pro-team@latest/github/finding-security-vulnerabilities-and-errors-in-your-code/configuring-code-scanning#changing-the-languages-that-are-analyzed 35 | 36 | steps: 37 | - name: Checkout repository 38 | uses: actions/checkout@v2 39 | 40 | # Initializes the CodeQL tools for scanning. 41 | - name: Initialize CodeQL 42 | uses: github/codeql-action/init@v1 43 | with: 44 | languages: ${{ matrix.language }} 45 | # If you wish to specify custom queries, you can do so here or in a config file. 46 | # By default, queries listed here will override any specified in a config file. 47 | # Prefix the list here with "+" to use these queries and those in the config file. 48 | # queries: ./path/to/local/query, your-org/your-repo/queries@main 49 | 50 | # Autobuild attempts to build any compiled languages (C/C++, C#, or Java). 51 | # If this step fails, then you should remove it and run the build manually (see below) 52 | - name: Autobuild 53 | uses: github/codeql-action/autobuild@v1 54 | 55 | # ℹ️ Command-line programs to run using the OS shell. 
56 | # 📚 https://git.io/JvXDl 57 | 58 | # ✏️ If the Autobuild fails above, remove it and uncomment the following three lines 59 | # and modify them (or add more) to build your code if your project 60 | # uses a compiled language 61 | 62 | #- run: | 63 | # make bootstrap 64 | # make release 65 | 66 | - name: Perform CodeQL Analysis 67 | uses: github/codeql-action/analyze@v1 68 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Reinforcement-Implementation 2 | 3 | This project aims to reproduce the results of several model-free RL algorithms in continuous action domains (MuJoCo environments). 4 | 5 | This project 6 | * uses the PyTorch package 7 | * implements different algorithms independently in separate, minimal files 8 | * is written in the simplest style 9 | * tries to follow the original papers and reproduce their results 10 | 11 | My first stage of work is to reproduce this figure from the PPO paper. 12 | 13 | ![](docs/ppo_experiments.png) 14 | 15 | - [x] A2C 16 | - [x] ACER (A2C + Trust Region): It seems that this implementation has some problems ... (bug reports welcome) 17 | - [X] CEM 18 | - [x] TRPO (TRPO single path) 19 | - [x] PPO (PPO clip) 20 | - [x] Vanilla PG 21 | 22 | In the next stage, I want to implement 23 | 24 | - [ ] DDPG 25 | - [X] Random Search (see [Simple random search provides a competitive approach to reinforcement learning](https://arxiv.org/pdf/1803.07055.pdf)) 26 | - [ ] SAC (soft actor-critic) with continuous action space 27 | - [X] SAC (soft actor-critic) with discrete action space 28 | - [X] DQN 29 | 30 | The stage after that covers discrete action space problems and raw video input (Atari) problems: 31 | 32 | - [X] Rainbow: DQN and relevant techniques (target network / double Q-learning / prioritized experience replay / dueling network structure / distributional RL) 33 | - [X] PPO with random network distillation (RND) 34 | 35 | Rainbow on Atari with only 3M: It works but may need further tuning. 36 | 37 | ![](docs/rainbow.png) 38 | 39 | And then model-based algorithms (not planned): 40 | 41 | - [ ] PILCO 42 | - [ ] PE-TS 43 | 44 | TODOs: 45 | - [ ] change the way the reward is counted; the current way may underestimate the reward (evaluate a deterministic policy rather than a stochastic/exploratory one) 46 | 47 | ## PPO Implementation 48 | 49 | The PPO implementation is of high quality - it matches the performance of OpenAI Baselines. 50 | 51 | ## Update 52 | 53 | Recently, I added Rainbow and DQN. The Rainbow implementation is of high quality on Atari games - enough for you to modify and write your own research paper. The DQN implementation is a minimal working version and reaches good performance on MountainCar (which is a simple task, but many implementations on GitHub do not achieve good performance or need additional reward/environment engineering). This is enough for you to quickly test your research ideas. 54 | -------------------------------------------------------------------------------- /SECURITY.md: -------------------------------------------------------------------------------- 1 | # Security Policy 2 | 3 | ## Supported Versions 4 | 5 | Use this section to tell people about which versions of your project are 6 | currently being supported with security updates.
7 | 8 | | Version | Supported | 9 | | ------- | ------------------ | 10 | | 5.1.x | :white_check_mark: | 11 | | 5.0.x | :x: | 12 | | 4.0.x | :white_check_mark: | 13 | | < 4.0 | :x: | 14 | 15 | ## Reporting a Vulnerability 16 | 17 | Use this section to tell people how to report a vulnerability. 18 | 19 | Tell them where to go, how often they can expect to get an update on a 20 | reported vulnerability, what to expect if the vulnerability is accepted or 21 | declined, etc. 22 | -------------------------------------------------------------------------------- /code/.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/zhangchuheng123/Reinforcement-Implementation/c04e0df10ec29ef775ea31395a8ad4b917302d24/code/.DS_Store -------------------------------------------------------------------------------- /code/RND/agents.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import torch.nn.functional as F 3 | import torch.nn as nn 4 | import torch 5 | import torch.optim as optim 6 | from torch.distributions.categorical import Categorical 7 | from model import CnnActorCriticNetwork, RNDModel 8 | from utils import global_grad_norm_ 9 | 10 | 11 | class RNDAgent(object): 12 | def __init__( 13 | self, 14 | input_size, 15 | output_size, 16 | num_env, 17 | num_step, 18 | gamma, 19 | lam=0.95, 20 | learning_rate=1e-4, 21 | ent_coef=0.01, 22 | clip_grad_norm=0.5, 23 | epoch=3, 24 | batch_size=128, 25 | ppo_eps=0.1, 26 | update_proportion=0.25, 27 | use_gae=True, 28 | use_cuda=False, 29 | use_noisy_net=False): 30 | self.model = CnnActorCriticNetwork(input_size, output_size, use_noisy_net) 31 | self.num_env = num_env 32 | self.output_size = output_size 33 | self.input_size = input_size 34 | self.num_step = num_step 35 | self.gamma = gamma 36 | self.lam = lam 37 | self.epoch = epoch 38 | self.batch_size = batch_size 39 | self.use_gae = use_gae 40 | self.ent_coef = ent_coef 41 | self.ppo_eps = ppo_eps 42 | self.clip_grad_norm = clip_grad_norm 43 | self.update_proportion = update_proportion 44 | self.device = torch.device('cuda' if use_cuda else 'cpu') 45 | 46 | self.rnd = RNDModel(input_size, output_size) 47 | self.optimizer = optim.Adam(list(self.model.parameters()) + list(self.rnd.predictor.parameters()), 48 | lr=learning_rate) 49 | self.rnd = self.rnd.to(self.device) 50 | 51 | self.model = self.model.to(self.device) 52 | 53 | def get_action(self, state): 54 | state = torch.Tensor(state).to(self.device) 55 | state = state.float() 56 | policy, value_ext, value_int = self.model(state) 57 | action_prob = F.softmax(policy, dim=-1).data.cpu().numpy() 58 | 59 | action = self.random_choice_prob_index(action_prob) 60 | 61 | return action, value_ext.data.cpu().numpy().squeeze(), value_int.data.cpu().numpy().squeeze(), policy.detach() 62 | 63 | @staticmethod 64 | def random_choice_prob_index(p, axis=1): 65 | r = np.expand_dims(np.random.rand(p.shape[1 - axis]), axis=axis) 66 | return (p.cumsum(axis=axis) > r).argmax(axis=axis) 67 | 68 | def compute_intrinsic_reward(self, next_obs): 69 | next_obs = torch.FloatTensor(next_obs).to(self.device) 70 | 71 | target_next_feature = self.rnd.target(next_obs) 72 | predict_next_feature = self.rnd.predictor(next_obs) 73 | intrinsic_reward = (target_next_feature - predict_next_feature).pow(2).sum(1) / 2 74 | 75 | return intrinsic_reward.data.cpu().numpy() 76 | 77 | def train_model(self, s_batch, target_ext_batch, target_int_batch, y_batch, adv_batch, 
next_obs_batch, old_policy): 78 | s_batch = torch.FloatTensor(s_batch).to(self.device) 79 | target_ext_batch = torch.FloatTensor(target_ext_batch).to(self.device) 80 | target_int_batch = torch.FloatTensor(target_int_batch).to(self.device) 81 | y_batch = torch.LongTensor(y_batch).to(self.device) 82 | adv_batch = torch.FloatTensor(adv_batch).to(self.device) 83 | next_obs_batch = torch.FloatTensor(next_obs_batch).to(self.device) 84 | 85 | sample_range = np.arange(len(s_batch)) 86 | forward_mse = nn.MSELoss(reduction='none') 87 | 88 | with torch.no_grad(): 89 | policy_old_list = torch.stack(old_policy).permute(1, 0, 2).contiguous().view(-1, self.output_size).to( 90 | self.device) 91 | 92 | m_old = Categorical(F.softmax(policy_old_list, dim=-1)) 93 | log_prob_old = m_old.log_prob(y_batch) 94 | # ------------------------------------------------------------ 95 | 96 | for i in range(self.epoch): 97 | np.random.shuffle(sample_range) 98 | for j in range(int(len(s_batch) / self.batch_size)): 99 | sample_idx = sample_range[self.batch_size * j:self.batch_size * (j + 1)] 100 | 101 | # -------------------------------------------------------------------------------- 102 | # for Curiosity-driven(Random Network Distillation) 103 | predict_next_state_feature, target_next_state_feature = self.rnd(next_obs_batch[sample_idx]) 104 | 105 | forward_loss = forward_mse(predict_next_state_feature, target_next_state_feature.detach()).mean(-1) 106 | # Proportion of exp used for predictor update 107 | mask = torch.rand(len(forward_loss)).to(self.device) 108 | mask = (mask < self.update_proportion).type(torch.FloatTensor).to(self.device) 109 | forward_loss = (forward_loss * mask).sum() / torch.max(mask.sum(), torch.Tensor([1]).to(self.device)) 110 | # --------------------------------------------------------------------------------- 111 | 112 | policy, value_ext, value_int = self.model(s_batch[sample_idx]) 113 | m = Categorical(F.softmax(policy, dim=-1)) 114 | log_prob = m.log_prob(y_batch[sample_idx]) 115 | 116 | ratio = torch.exp(log_prob - log_prob_old[sample_idx]) 117 | 118 | surr1 = ratio * adv_batch[sample_idx] 119 | surr2 = torch.clamp( 120 | ratio, 121 | 1.0 - self.ppo_eps, 122 | 1.0 + self.ppo_eps) * adv_batch[sample_idx] 123 | 124 | actor_loss = -torch.min(surr1, surr2).mean() 125 | critic_ext_loss = F.mse_loss(value_ext.sum(1), target_ext_batch[sample_idx]) 126 | critic_int_loss = F.mse_loss(value_int.sum(1), target_int_batch[sample_idx]) 127 | 128 | critic_loss = critic_ext_loss + critic_int_loss 129 | 130 | entropy = m.entropy().mean() 131 | 132 | self.optimizer.zero_grad() 133 | loss = actor_loss + 0.5 * critic_loss - self.ent_coef * entropy + forward_loss 134 | loss.backward() 135 | global_grad_norm_(list(self.model.parameters())+list(self.rnd.predictor.parameters())) 136 | self.optimizer.step() 137 | -------------------------------------------------------------------------------- /code/RND/bash.sh: -------------------------------------------------------------------------------- 1 | CUDA_VISIBLE_DEVICES=0 python train.py --use-gae --use-gpu --sticky-action --env-id SeaquestNoFrameskip-v4 & 2 | CUDA_VISIBLE_DEVICES=1 python train.py --use-gae --use-gpu --sticky-action --env-id MontezumaRevengeNoFrameskip-v4 & 3 | CUDA_VISIBLE_DEVICES=2 python train.py --use-gae --use-gpu --sticky-action --env-id RoadRunnerNoFrameskip-v4 & 4 | CUDA_VISIBLE_DEVICES=3 python train.py --use-gae --use-gpu --sticky-action --env-id BattleZoneNoFrameskip-v4 & 
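The exploration bonus that `RNDAgent.compute_intrinsic_reward` produces above is simply the prediction error of the trained predictor network against a fixed, randomly initialised target network (both defined in `model.py`, shown further below). The following is a minimal standalone sketch of that computation, not part of the repo: it assumes it is run from `code/RND` so that `model.py` is importable, and the batch shape and constructor sizes are illustrative only.

```python
import torch
from model import RNDModel  # defined in code/RND/model.py

# RNDModel stores input_size/output_size, but its layers are fixed for
# 1-channel 84x84 observations with a 512-dim feature head.
rnd = RNDModel(input_size=(1, 84, 84), output_size=512)

# A fake batch of 8 whitened next-observations, shaped like the ones train.py
# produces (normalised by a running mean/std and clipped to [-5, 5]).
next_obs = torch.randn(8, 1, 84, 84).clamp(-5, 5)

with torch.no_grad():
    predict_feature, target_feature = rnd(next_obs)
    # Same formula as RNDAgent.compute_intrinsic_reward: larger error = more novel state.
    intrinsic_reward = (target_feature - predict_feature).pow(2).sum(1) / 2

print(intrinsic_reward.shape)  # torch.Size([8])
```

During training, `train.py` whitens the next observations with a running mean/std (clipped to [-5, 5]) before calling this, and `train_model` minimises the same error for a random `update_proportion` of samples, so the bonus decays for states the predictor has seen often.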
-------------------------------------------------------------------------------- /code/RND/envs.py: -------------------------------------------------------------------------------- 1 | import gym 2 | import cv2 3 | import numpy as np 4 | from abc import abstractmethod 5 | from collections import deque 6 | from copy import copy 7 | from torch.multiprocessing import Process 8 | from PIL import Image 9 | 10 | class Environment(Process): 11 | @abstractmethod 12 | def run(self): 13 | pass 14 | 15 | @abstractmethod 16 | def reset(self): 17 | pass 18 | 19 | @abstractmethod 20 | def pre_proc(self, x): 21 | pass 22 | 23 | @abstractmethod 24 | def get_init_state(self, x): 25 | pass 26 | 27 | 28 | def unwrap(env): 29 | if hasattr(env, "unwrapped"): 30 | return env.unwrapped 31 | elif hasattr(env, "env"): 32 | return unwrap(env.env) 33 | elif hasattr(env, "leg_env"): 34 | return unwrap(env.leg_env) 35 | else: 36 | return env 37 | 38 | 39 | class MaxAndSkipEnv(gym.Wrapper): 40 | def __init__(self, env, is_render, skip=4): 41 | """Return only every `skip`-th frame""" 42 | gym.Wrapper.__init__(self, env) 43 | # most recent raw observations (for max pooling across time steps) 44 | self._obs_buffer = np.zeros((2,) + env.observation_space.shape, dtype=np.uint8) 45 | self._skip = skip 46 | self.is_render = is_render 47 | 48 | def step(self, action): 49 | """Repeat action, sum reward, and max over last observations.""" 50 | total_reward = 0.0 51 | done = None 52 | for i in range(self._skip): 53 | obs, reward, done, info = self.env.step(action) 54 | if self.is_render: 55 | self.env.render() 56 | if i == self._skip - 2: 57 | self._obs_buffer[0] = obs 58 | if i == self._skip - 1: 59 | self._obs_buffer[1] = obs 60 | total_reward += reward 61 | if done: 62 | break 63 | # Note that the observation on the done=True frame 64 | # doesn't matter 65 | max_frame = self._obs_buffer.max(axis=0) 66 | 67 | return max_frame, total_reward, done, info 68 | 69 | def reset(self, **kwargs): 70 | return self.env.reset(**kwargs) 71 | 72 | 73 | class AtariEnvironment(Environment): 74 | def __init__( 75 | self, 76 | env_id, 77 | is_render, 78 | env_idx, 79 | child_conn, 80 | history_size=4, 81 | h=84, 82 | w=84, 83 | life_done=True, 84 | sticky_action=True, 85 | p=0.25, 86 | max_step_per_episode=4500): 87 | super(AtariEnvironment, self).__init__() 88 | self.daemon = True 89 | self.env = MaxAndSkipEnv(gym.make(env_id), is_render) 90 | self.env_id = env_id 91 | self.is_render = is_render 92 | self.env_idx = env_idx 93 | self.steps = 0 94 | self.episode = 0 95 | self.rall = 0 96 | self.recent_rlist = deque(maxlen=100) 97 | self.child_conn = child_conn 98 | self.max_step_per_episode = max_step_per_episode 99 | 100 | self.sticky_action = sticky_action 101 | self.last_action = 0 102 | self.p = p 103 | 104 | self.history_size = history_size 105 | self.history = np.zeros([history_size, h, w]) 106 | self.h = h 107 | self.w = w 108 | 109 | self.reset() 110 | 111 | def run(self): 112 | super(AtariEnvironment, self).run() 113 | while True: 114 | action = self.child_conn.recv() 115 | 116 | if 'Breakout' in self.env_id: 117 | action += 1 118 | 119 | # sticky action 120 | if self.sticky_action: 121 | if np.random.rand() <= self.p: 122 | action = self.last_action 123 | self.last_action = action 124 | 125 | s, reward, done, info = self.env.step(action) 126 | 127 | if self.steps > self.max_step_per_episode: 128 | done = True 129 | 130 | log_reward = reward 131 | force_done = done 132 | 133 | self.history[:3, :, :] = self.history[1:, :, :] 134 | 
self.history[3, :, :] = self.pre_proc(s) 135 | 136 | self.rall += reward 137 | self.steps += 1 138 | 139 | if done: 140 | self.recent_rlist.append(self.rall) 141 | # print("[Episode {}({})] Step: {} Reward: {} Recent Reward: {}".format( 142 | # self.episode, self.env_idx, self.steps, self.rall, np.mean(self.recent_rlist))) 143 | self.history = self.reset() 144 | 145 | # if self.env_idx == 0: 146 | # print('env_idx=0 num_rooms={} done={}'.format(num_rooms, done)) 147 | 148 | self.child_conn.send( 149 | [self.history[:, :, :], reward, force_done, done, log_reward]) 150 | 151 | def reset(self): 152 | self.last_action = 0 153 | self.steps = 0 154 | self.episode += 1 155 | self.rall = 0 156 | s = self.env.reset() 157 | self.get_init_state( 158 | self.pre_proc(s)) 159 | return self.history[:, :, :] 160 | 161 | def pre_proc(self, X): 162 | X = np.array(Image.fromarray(X).convert('L')).astype('float32') 163 | x = cv2.resize(X, (self.h, self.w)) 164 | return x 165 | 166 | def get_init_state(self, s): 167 | for i in range(self.history_size): 168 | self.history[i, :, :] = self.pre_proc(s) 169 | -------------------------------------------------------------------------------- /code/RND/model.py: -------------------------------------------------------------------------------- 1 | import torch.nn.functional as F 2 | import torch.nn as nn 3 | import torch 4 | import torch.optim as optim 5 | import numpy as np 6 | import math 7 | from torch.nn import init 8 | 9 | 10 | class NoisyLinear(nn.Module): 11 | """Factorised Gaussian NoisyNet""" 12 | 13 | def __init__(self, in_features, out_features, sigma0=0.5): 14 | super().__init__() 15 | self.in_features = in_features 16 | self.out_features = out_features 17 | self.weight = nn.Parameter(torch.Tensor(out_features, in_features)) 18 | self.bias = nn.Parameter(torch.Tensor(out_features)) 19 | self.noisy_weight = nn.Parameter( 20 | torch.Tensor(out_features, in_features)) 21 | self.noisy_bias = nn.Parameter(torch.Tensor(out_features)) 22 | self.noise_std = sigma0 / math.sqrt(self.in_features) 23 | 24 | self.reset_parameters() 25 | self.register_noise() 26 | 27 | def register_noise(self): 28 | in_noise = torch.FloatTensor(self.in_features) 29 | out_noise = torch.FloatTensor(self.out_features) 30 | noise = torch.FloatTensor(self.out_features, self.in_features) 31 | self.register_buffer('in_noise', in_noise) 32 | self.register_buffer('out_noise', out_noise) 33 | self.register_buffer('noise', noise) 34 | 35 | def sample_noise(self): 36 | self.in_noise.normal_(0, self.noise_std) 37 | self.out_noise.normal_(0, self.noise_std) 38 | self.noise = torch.mm( 39 | self.out_noise.view(-1, 1), self.in_noise.view(1, -1)) 40 | 41 | def reset_parameters(self): 42 | stdv = 1. 
/ math.sqrt(self.weight.size(1)) 43 | self.weight.data.uniform_(-stdv, stdv) 44 | self.noisy_weight.data.uniform_(-stdv, stdv) 45 | if self.bias is not None: 46 | self.bias.data.uniform_(-stdv, stdv) 47 | self.noisy_bias.data.uniform_(-stdv, stdv) 48 | 49 | def forward(self, x): 50 | """ 51 | Note: noise will be updated if x is not volatile 52 | """ 53 | normal_y = nn.functional.linear(x, self.weight, self.bias) 54 | if self.training: 55 | # update the noise once per update 56 | self.sample_noise() 57 | 58 | noisy_weight = self.noisy_weight * self.noise 59 | noisy_bias = self.noisy_bias * self.out_noise 60 | noisy_y = nn.functional.linear(x, noisy_weight, noisy_bias) 61 | return noisy_y + normal_y 62 | 63 | def __repr__(self): 64 | return self.__class__.__name__ + '(' \ 65 | + 'in_features=' + str(self.in_features) \ 66 | + ', out_features=' + str(self.out_features) + ')' 67 | 68 | 69 | class Flatten(nn.Module): 70 | def forward(self, input): 71 | return input.view(input.size(0), -1) 72 | 73 | 74 | class CnnActorCriticNetwork(nn.Module): 75 | def __init__(self, input_size, output_size, use_noisy_net=False): 76 | super(CnnActorCriticNetwork, self).__init__() 77 | 78 | if use_noisy_net: 79 | print('use NoisyNet') 80 | linear = NoisyLinear 81 | else: 82 | linear = nn.Linear 83 | 84 | self.feature = nn.Sequential( 85 | nn.Conv2d( 86 | in_channels=4, 87 | out_channels=32, 88 | kernel_size=8, 89 | stride=4), 90 | nn.ReLU(), 91 | nn.Conv2d( 92 | in_channels=32, 93 | out_channels=64, 94 | kernel_size=4, 95 | stride=2), 96 | nn.ReLU(), 97 | nn.Conv2d( 98 | in_channels=64, 99 | out_channels=64, 100 | kernel_size=3, 101 | stride=1), 102 | nn.ReLU(), 103 | Flatten(), 104 | linear( 105 | 7 * 7 * 64, 106 | 256), 107 | nn.ReLU(), 108 | linear( 109 | 256, 110 | 448), 111 | nn.ReLU() 112 | ) 113 | 114 | self.actor = nn.Sequential( 115 | linear(448, 448), 116 | nn.ReLU(), 117 | linear(448, output_size) 118 | ) 119 | 120 | self.extra_layer = nn.Sequential( 121 | linear(448, 448), 122 | nn.ReLU() 123 | ) 124 | 125 | self.critic_ext = linear(448, 1) 126 | self.critic_int = linear(448, 1) 127 | 128 | for p in self.modules(): 129 | if isinstance(p, nn.Conv2d): 130 | init.orthogonal_(p.weight, np.sqrt(2)) 131 | p.bias.data.zero_() 132 | 133 | if isinstance(p, nn.Linear): 134 | init.orthogonal_(p.weight, np.sqrt(2)) 135 | p.bias.data.zero_() 136 | 137 | init.orthogonal_(self.critic_ext.weight, 0.01) 138 | self.critic_ext.bias.data.zero_() 139 | 140 | init.orthogonal_(self.critic_int.weight, 0.01) 141 | self.critic_int.bias.data.zero_() 142 | 143 | for i in range(len(self.actor)): 144 | if type(self.actor[i]) == nn.Linear: 145 | init.orthogonal_(self.actor[i].weight, 0.01) 146 | self.actor[i].bias.data.zero_() 147 | 148 | for i in range(len(self.extra_layer)): 149 | if type(self.extra_layer[i]) == nn.Linear: 150 | init.orthogonal_(self.extra_layer[i].weight, 0.1) 151 | self.extra_layer[i].bias.data.zero_() 152 | 153 | def forward(self, state): 154 | x = self.feature(state) 155 | policy = self.actor(x) 156 | value_ext = self.critic_ext(self.extra_layer(x) + x) 157 | value_int = self.critic_int(self.extra_layer(x) + x) 158 | return policy, value_ext, value_int 159 | 160 | 161 | class RNDModel(nn.Module): 162 | def __init__(self, input_size, output_size): 163 | super(RNDModel, self).__init__() 164 | 165 | self.input_size = input_size 166 | self.output_size = output_size 167 | 168 | feature_output = 7 * 7 * 64 169 | self.predictor = nn.Sequential( 170 | nn.Conv2d( 171 | in_channels=1, 172 | out_channels=32, 173 | 
kernel_size=8, 174 | stride=4), 175 | nn.LeakyReLU(), 176 | nn.Conv2d( 177 | in_channels=32, 178 | out_channels=64, 179 | kernel_size=4, 180 | stride=2), 181 | nn.LeakyReLU(), 182 | nn.Conv2d( 183 | in_channels=64, 184 | out_channels=64, 185 | kernel_size=3, 186 | stride=1), 187 | nn.LeakyReLU(), 188 | Flatten(), 189 | nn.Linear(feature_output, 512), 190 | nn.ReLU(), 191 | nn.Linear(512, 512), 192 | nn.ReLU(), 193 | nn.Linear(512, 512) 194 | ) 195 | 196 | self.target = nn.Sequential( 197 | nn.Conv2d( 198 | in_channels=1, 199 | out_channels=32, 200 | kernel_size=8, 201 | stride=4), 202 | nn.LeakyReLU(), 203 | nn.Conv2d( 204 | in_channels=32, 205 | out_channels=64, 206 | kernel_size=4, 207 | stride=2), 208 | nn.LeakyReLU(), 209 | nn.Conv2d( 210 | in_channels=64, 211 | out_channels=64, 212 | kernel_size=3, 213 | stride=1), 214 | nn.LeakyReLU(), 215 | Flatten(), 216 | nn.Linear(feature_output, 512) 217 | ) 218 | 219 | for p in self.modules(): 220 | if isinstance(p, nn.Conv2d): 221 | init.orthogonal_(p.weight, np.sqrt(2)) 222 | p.bias.data.zero_() 223 | 224 | if isinstance(p, nn.Linear): 225 | init.orthogonal_(p.weight, np.sqrt(2)) 226 | p.bias.data.zero_() 227 | 228 | for param in self.target.parameters(): 229 | param.requires_grad = False 230 | 231 | def forward(self, next_obs): 232 | target_feature = self.target(next_obs) 233 | predict_feature = self.predictor(next_obs) 234 | 235 | return predict_feature, target_feature 236 | -------------------------------------------------------------------------------- /code/RND/requirements.txt: -------------------------------------------------------------------------------- 1 | atari-py==0.2.6 2 | opencv-python==4.2.0.34 3 | plotly==4.8.1 4 | procgen==0.10.3 5 | tensorboardX==2.0 6 | torch==1.5.0 7 | tqdm==4.42.1 8 | tensorflow<2.0.0,>=1.4.0 9 | numpy<2.0.0,>=1.17.0 10 | pathos==0.2.6 11 | kornia==0.3.1 -------------------------------------------------------------------------------- /code/RND/train.py: -------------------------------------------------------------------------------- 1 | from agents import RNDAgent 2 | from envs import AtariEnvironment 3 | from utils import make_train_data, RunningMeanStd, RewardForwardFilter, softmax 4 | 5 | import torch 6 | from torch.multiprocessing import Pipe 7 | 8 | from tensorboardX import SummaryWriter 9 | from datetime import datetime 10 | import numpy as np 11 | import argparse 12 | import tqdm 13 | import gym 14 | import os 15 | 16 | # Enable multi-thread 17 | os.system("taskset -p 0xffffffff %d" % os.getpid()) 18 | torch.set_num_threads(128) 19 | 20 | def parse_arguments(): 21 | 22 | parser = argparse.ArgumentParser(description='RND') 23 | 24 | parser.add_argument('--train-method', type=str, default='RND') 25 | parser.add_argument('--env-type', type=str, default='atari') 26 | parser.add_argument('--env-id', type=str, default='MontezumaRevengeNoFrameskip-v4') 27 | parser.add_argument('--max-step-per-episode', type=int, default=4500) 28 | parser.add_argument('--total-frames', type=int, default=int(50e6)) 29 | parser.add_argument('--ext-coef', type=float, default=2.0) 30 | parser.add_argument('--learning-rate', type=float, default=1e-4) 31 | parser.add_argument('--num-worker', type=int, default=128) 32 | parser.add_argument('--num-step', type=int, default=128) 33 | parser.add_argument('--gamma', type=float, default=0.999) 34 | parser.add_argument('--int-gamma', type=float, default=0.99) 35 | parser.add_argument('--lam', type=float, default=0.95) 36 | parser.add_argument('--stable-eps', type=float, default=1e-8) 
37 | parser.add_argument('--stable-stack-size', type=int, default=4) 38 | parser.add_argument('--preproc-height', type=int, default=84) 39 | parser.add_argument('--preproc-width', type=int, default=84) 40 | parser.add_argument('--use-gae', action='store_true') 41 | parser.add_argument('--use-gpu', action='store_true') 42 | parser.add_argument('--use-norm', action='store_true') 43 | parser.add_argument('--use-noisynet', action='store_true') 44 | parser.add_argument('--clip-grad-norm', type=float, default=0.5) 45 | parser.add_argument('--entropy', type=float, default=0.001) 46 | parser.add_argument('--epoch', type=int, default=4) 47 | parser.add_argument('--minibatch', type=int, default=4) 48 | parser.add_argument('--ppo-eps', type=float, default=0.1) 49 | parser.add_argument('--int-coef', type=float, default=1.0) 50 | parser.add_argument('--sticky-action', action='store_true') 51 | parser.add_argument('--action-prob', type=float, default=0.25) 52 | parser.add_argument('--update-proportion', type=float, default=0.25) 53 | parser.add_argument('--life-done', action='store_true') 54 | parser.add_argument('--obs-norm-step', type=int, default=50) 55 | parser.add_argument('--save-models', action='store_true') 56 | 57 | # Setup 58 | args = parser.parse_args() 59 | 60 | return args 61 | 62 | class Logger(object): 63 | def __init__(self, path): 64 | self.path = path 65 | 66 | def info(self, s): 67 | string = '[' + str(datetime.now().strftime('%Y-%m-%dT%H:%M:%S')) + '] ' + s 68 | print(string) 69 | with open(os.path.join(self.path, 'log.txt'), 'a+') as f: 70 | f.writelines([string, '']) 71 | 72 | def main(): 73 | 74 | args = parse_arguments() 75 | 76 | train_method = args.train_method 77 | env_id = args.env_id 78 | env_type = args.env_type 79 | 80 | if env_type == 'atari': 81 | env = gym.make(env_id) 82 | input_size = env.observation_space.shape 83 | output_size = env.action_space.n 84 | env.close() 85 | else: 86 | raise NotImplementedError 87 | 88 | is_load_model = False 89 | is_render = False 90 | os.makedirs('models', exist_ok=True) 91 | model_path = 'models/{}.model'.format(env_id) 92 | predictor_path = 'models/{}.pred'.format(env_id) 93 | target_path = 'models/{}.target'.format(env_id) 94 | 95 | results_dir = os.path.join('outputs', args.env_id) 96 | os.makedirs(results_dir, exist_ok=True) 97 | logger = Logger(results_dir) 98 | writer = SummaryWriter(os.path.join(results_dir, 'tensorboard', args.env_id)) 99 | 100 | use_cuda = args.use_gpu 101 | use_gae = args.use_gae 102 | use_noisy_net = args.use_noisynet 103 | lam = args.lam 104 | num_worker = args.num_worker 105 | num_step = args.num_step 106 | ppo_eps = args.ppo_eps 107 | epoch = args.epoch 108 | mini_batch = args.minibatch 109 | batch_size = int(num_step * num_worker / mini_batch) 110 | learning_rate = args.learning_rate 111 | entropy_coef = args.entropy 112 | gamma = args.gamma 113 | int_gamma = args.int_gamma 114 | clip_grad_norm = args.clip_grad_norm 115 | ext_coef = args.ext_coef 116 | int_coef = args.int_coef 117 | sticky_action = args.sticky_action 118 | action_prob = args.action_prob 119 | life_done = args.life_done 120 | pre_obs_norm_step = args.obs_norm_step 121 | 122 | reward_rms = RunningMeanStd() 123 | obs_rms = RunningMeanStd(shape=(1, 1, 84, 84)) 124 | discounted_reward = RewardForwardFilter(int_gamma) 125 | 126 | if args.train_method == 'RND': 127 | agent = RNDAgent 128 | else: 129 | raise NotImplementedError 130 | 131 | if args.env_type == 'atari': 132 | env_type = AtariEnvironment 133 | else: 134 | raise NotImplementedError 
135 | 136 | agent = agent( 137 | input_size, 138 | output_size, 139 | num_worker, 140 | num_step, 141 | gamma, 142 | lam=lam, 143 | learning_rate=learning_rate, 144 | ent_coef=entropy_coef, 145 | clip_grad_norm=clip_grad_norm, 146 | epoch=epoch, 147 | batch_size=batch_size, 148 | ppo_eps=ppo_eps, 149 | use_cuda=use_cuda, 150 | use_gae=use_gae, 151 | use_noisy_net=use_noisy_net 152 | ) 153 | 154 | logger.info('Start to initialize workers') 155 | works = [] 156 | parent_conns = [] 157 | child_conns = [] 158 | for idx in range(num_worker): 159 | parent_conn, child_conn = Pipe() 160 | work = env_type(env_id, is_render, idx, child_conn, 161 | sticky_action=sticky_action, p=action_prob, life_done=life_done, 162 | max_step_per_episode=args.max_step_per_episode) 163 | work.start() 164 | works.append(work) 165 | parent_conns.append(parent_conn) 166 | child_conns.append(child_conn) 167 | 168 | states = np.zeros([num_worker, 4, 84, 84]) 169 | 170 | sample_episode = 0 171 | sample_rall = 0 172 | sample_step = 0 173 | sample_env_idx = 0 174 | sample_i_rall = 0 175 | global_update = 0 176 | global_step = 0 177 | 178 | # normalize obs 179 | logger.info('Start to initailize observation normalization parameter.....') 180 | next_obs = [] 181 | for step in range(num_step * pre_obs_norm_step): 182 | actions = np.random.randint(0, output_size, size=(num_worker,)) 183 | 184 | for parent_conn, action in zip(parent_conns, actions): 185 | parent_conn.send(action) 186 | 187 | for parent_conn in parent_conns: 188 | s, r, d, rd, lr = parent_conn.recv() 189 | next_obs.append(s[3, :, :].reshape([1, 84, 84])) 190 | 191 | if len(next_obs) % (num_step * num_worker) == 0: 192 | next_obs = np.stack(next_obs) 193 | obs_rms.update(next_obs) 194 | next_obs = [] 195 | logger.info('End to initalize...') 196 | 197 | pbar = tqdm.tqdm(total=args.total_frames) 198 | while True: 199 | logger.info('Iteration: {}'.format(global_update)) 200 | total_state, total_reward, total_done, total_next_state, \ 201 | total_action, total_int_reward, total_next_obs, total_ext_values, \ 202 | total_int_values, total_policy, total_policy_np = \ 203 | [], [], [], [], [], [], [], [], [], [], [] 204 | global_step += (num_worker * num_step) 205 | global_update += 1 206 | 207 | # Step 1. n-step rollout 208 | for _ in range(num_step): 209 | actions, value_ext, value_int, policy = agent.get_action(np.float32(states) / 255.) 
210 | 211 | for parent_conn, action in zip(parent_conns, actions): 212 | parent_conn.send(action) 213 | 214 | next_states, rewards, dones, real_dones, log_rewards, next_obs = \ 215 | [], [], [], [], [], [] 216 | for parent_conn in parent_conns: 217 | s, r, d, rd, lr = parent_conn.recv() 218 | next_states.append(s) 219 | rewards.append(r) 220 | dones.append(d) 221 | real_dones.append(rd) 222 | log_rewards.append(lr) 223 | next_obs.append(s[3, :, :].reshape([1, 84, 84])) 224 | 225 | next_states = np.stack(next_states) 226 | rewards = np.hstack(rewards) 227 | dones = np.hstack(dones) 228 | real_dones = np.hstack(real_dones) 229 | next_obs = np.stack(next_obs) 230 | 231 | # total reward = int reward + ext Reward 232 | intrinsic_reward = agent.compute_intrinsic_reward( 233 | ((next_obs - obs_rms.mean) / np.sqrt(obs_rms.var)).clip(-5, 5)) 234 | intrinsic_reward = np.hstack(intrinsic_reward) 235 | sample_i_rall += intrinsic_reward[sample_env_idx] 236 | 237 | total_next_obs.append(next_obs) 238 | total_int_reward.append(intrinsic_reward) 239 | total_state.append(states) 240 | total_reward.append(rewards) 241 | total_done.append(dones) 242 | total_action.append(actions) 243 | total_ext_values.append(value_ext) 244 | total_int_values.append(value_int) 245 | total_policy.append(policy) 246 | total_policy_np.append(policy.cpu().numpy()) 247 | 248 | states = next_states[:, :, :, :] 249 | 250 | sample_rall += log_rewards[sample_env_idx] 251 | 252 | sample_step += 1 253 | if real_dones[sample_env_idx]: 254 | sample_episode += 1 255 | writer.add_scalar('data/returns_vs_frames', sample_rall, global_step) 256 | writer.add_scalar('data/lengths_vs_frames', sample_step, global_step) 257 | writer.add_scalar('data/reward_per_epi', sample_rall, sample_episode) 258 | writer.add_scalar('data/reward_per_rollout', sample_rall, global_update) 259 | writer.add_scalar('data/step', sample_step, sample_episode) 260 | sample_rall = 0 261 | sample_step = 0 262 | sample_i_rall = 0 263 | 264 | # calculate last next value 265 | _, value_ext, value_int, _ = agent.get_action(np.float32(states) / 255.) 266 | total_ext_values.append(value_ext) 267 | total_int_values.append(value_int) 268 | 269 | total_state = np.stack(total_state).transpose([1, 0, 2, 3, 4]).reshape([-1, 4, 84, 84]) 270 | total_reward = np.stack(total_reward).transpose().clip(-1, 1) 271 | total_action = np.stack(total_action).transpose().reshape([-1]) 272 | total_done = np.stack(total_done).transpose() 273 | total_next_obs = np.stack(total_next_obs).transpose([1, 0, 2, 3, 4]).reshape([-1, 1, 84, 84]) 274 | total_ext_values = np.stack(total_ext_values).transpose() 275 | total_int_values = np.stack(total_int_values).transpose() 276 | total_logging_policy = np.vstack(total_policy_np) 277 | 278 | # Step 2. 
calculate intrinsic reward 279 | # running mean intrinsic reward 280 | total_int_reward = np.stack(total_int_reward).transpose() 281 | total_reward_per_env = np.array([discounted_reward.update(reward_per_step) for reward_per_step in 282 | total_int_reward.T]) 283 | mean, std, count = np.mean(total_reward_per_env), np.std(total_reward_per_env), len(total_reward_per_env) 284 | reward_rms.update_from_moments(mean, std ** 2, count) 285 | 286 | # normalize intrinsic reward 287 | total_int_reward /= np.sqrt(reward_rms.var) 288 | writer.add_scalar('data/int_reward_per_epi', np.sum(total_int_reward) / num_worker, sample_episode) 289 | writer.add_scalar('data/int_reward_per_rollout', np.sum(total_int_reward) / num_worker, global_update) 290 | 291 | # logging Max action probability 292 | writer.add_scalar('data/max_prob', softmax(total_logging_policy).max(1).mean(), sample_episode) 293 | 294 | # Step 3. make target and advantage 295 | # extrinsic reward calculate 296 | ext_target, ext_adv = make_train_data(total_reward, total_done, 297 | total_ext_values, gamma, num_step, num_worker) 298 | 299 | # intrinsic reward calculate 300 | # None Episodic 301 | int_target, int_adv = make_train_data(total_int_reward, np.zeros_like(total_int_reward), 302 | total_int_values, int_gamma, num_step, num_worker) 303 | 304 | # add ext adv and int adv 305 | total_adv = int_adv * int_coef + ext_adv * ext_coef 306 | 307 | # Step 4. update obs normalize param 308 | obs_rms.update(total_next_obs) 309 | 310 | # Step 5. Training! 311 | agent.train_model(np.float32(total_state) / 255., ext_target, int_target, total_action, 312 | total_adv, ((total_next_obs - obs_rms.mean) / np.sqrt(obs_rms.var)).clip(-5, 5), 313 | total_policy) 314 | 315 | if args.save_models and global_update % 1000 == 0: 316 | torch.save(agent.model.state_dict(), 'models/{}-{}.model'.format(env_id, global_update)) 317 | logger.info('Now Global Step :{}'.format(global_step)) 318 | torch.save(agent.model.state_dict(), model_path) 319 | torch.save(agent.rnd.predictor.state_dict(), predictor_path) 320 | torch.save(agent.rnd.target.state_dict(), target_path) 321 | 322 | pbar.update(num_worker * num_step) 323 | if global_step >= args.total_frames: 324 | break 325 | 326 | pbar.close() 327 | 328 | if __name__ == '__main__': 329 | main() 330 | -------------------------------------------------------------------------------- /code/RND/utils.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import torch 3 | from torch._six import inf 4 | 5 | 6 | def make_train_data(reward, done, value, gamma, num_step, num_worker, lam=0.95, use_gae=True): 7 | discounted_return = np.empty([num_worker, num_step]) 8 | 9 | # Discounted Return 10 | if use_gae: 11 | gae = np.zeros_like([num_worker, ]) 12 | for t in range(num_step - 1, -1, -1): 13 | delta = reward[:, t] + gamma * value[:, t + 1] * (1 - done[:, t]) - value[:, t] 14 | gae = delta + gamma * lam * (1 - done[:, t]) * gae 15 | 16 | discounted_return[:, t] = gae + value[:, t] 17 | 18 | # For Actor 19 | adv = discounted_return - value[:, :-1] 20 | 21 | else: 22 | running_add = value[:, -1] 23 | for t in range(num_step - 1, -1, -1): 24 | running_add = reward[:, t] + gamma * running_add * (1 - done[:, t]) 25 | discounted_return[:, t] = running_add 26 | 27 | # For Actor 28 | adv = discounted_return - value[:, :-1] 29 | 30 | return discounted_return.reshape([-1]), adv.reshape([-1]) 31 | 32 | 33 | class RunningMeanStd(object): 34 | # 
https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Parallel_algorithm 35 | def __init__(self, epsilon=1e-4, shape=()): 36 | self.mean = np.zeros(shape, 'float64') 37 | self.var = np.ones(shape, 'float64') 38 | self.count = epsilon 39 | 40 | def update(self, x): 41 | batch_mean = np.mean(x, axis=0) 42 | batch_var = np.var(x, axis=0) 43 | batch_count = x.shape[0] 44 | self.update_from_moments(batch_mean, batch_var, batch_count) 45 | 46 | def update_from_moments(self, batch_mean, batch_var, batch_count): 47 | delta = batch_mean - self.mean 48 | tot_count = self.count + batch_count 49 | 50 | new_mean = self.mean + delta * batch_count / tot_count 51 | m_a = self.var * (self.count) 52 | m_b = batch_var * (batch_count) 53 | M2 = m_a + m_b + np.square(delta) * self.count * batch_count / (self.count + batch_count) 54 | new_var = M2 / (self.count + batch_count) 55 | 56 | new_count = batch_count + self.count 57 | 58 | self.mean = new_mean 59 | self.var = new_var 60 | self.count = new_count 61 | 62 | 63 | class RewardForwardFilter(object): 64 | def __init__(self, gamma): 65 | self.rewems = None 66 | self.gamma = gamma 67 | 68 | def update(self, rews): 69 | if self.rewems is None: 70 | self.rewems = rews 71 | else: 72 | self.rewems = self.rewems * self.gamma + rews 73 | return self.rewems 74 | 75 | 76 | def softmax(z): 77 | assert len(z.shape) == 2 78 | s = np.max(z, axis=1) 79 | s = s[:, np.newaxis] # necessary step to do broadcasting 80 | e_x = np.exp(z - s) 81 | div = np.sum(e_x, axis=1) 82 | div = div[:, np.newaxis] # dito 83 | return e_x / div 84 | 85 | 86 | def global_grad_norm_(parameters, norm_type=2): 87 | """Clips gradient norm of an iterable of parameters. 88 | 89 | The norm is computed over all gradients together, as if they were 90 | concatenated into a single vector. Gradients are modified in-place. 91 | 92 | Arguments: 93 | parameters (Iterable[Tensor] or Tensor): an iterable of Tensors or a 94 | single Tensor that will have gradients normalized 95 | max_norm (float or int): max norm of the gradients 96 | norm_type (float or int): type of the used p-norm. Can be ``'inf'`` for 97 | infinity norm. 98 | 99 | Returns: 100 | Total norm of the parameters (viewed as a single vector). 101 | """ 102 | if isinstance(parameters, torch.Tensor): 103 | parameters = [parameters] 104 | parameters = list(filter(lambda p: p.grad is not None, parameters)) 105 | norm_type = float(norm_type) 106 | if norm_type == inf: 107 | total_norm = max(p.grad.data.abs().max() for p in parameters) 108 | else: 109 | total_norm = 0 110 | for p in parameters: 111 | param_norm = p.grad.data.norm(norm_type) 112 | total_norm += param_norm.item() ** norm_type 113 | total_norm = total_norm ** (1. / norm_type) 114 | 115 | return total_norm 116 | -------------------------------------------------------------------------------- /code/Rainbow/.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/zhangchuheng123/Reinforcement-Implementation/c04e0df10ec29ef775ea31395a8ad4b917302d24/code/Rainbow/.DS_Store -------------------------------------------------------------------------------- /code/Rainbow/README.md: -------------------------------------------------------------------------------- 1 | This is copied and modified from https://github.com/Kaixhin/Rainbow. 2 | 3 | I verified the performance. 
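The densest part of `agent.py` below is the categorical (C51) target projection: the Bellman-updated atoms Tz = R^n + γ^n z are clamped to [V_min, V_max] and their probability mass is split between the two nearest atoms of the fixed support. Here is a small standalone PyTorch sketch of that projection for a single transition; the toy 5-atom support, reward, and probabilities are illustrative and not taken from the repo.

```python
import torch

# Toy settings: 5 atoms on [-2, 2], one next-state distribution, reward 1, gamma 0.9.
atoms, v_min, v_max = 5, -2.0, 2.0
support = torch.linspace(v_min, v_max, atoms)        # tensor([-2., -1., 0., 1., 2.])
delta_z = (v_max - v_min) / (atoms - 1)              # 1.0
p_next = torch.tensor([0.1, 0.2, 0.4, 0.2, 0.1])     # p(s_t+n, a*) from the target net
reward, gamma_n, nonterminal = 1.0, 0.9, 1.0

Tz = (reward + nonterminal * gamma_n * support).clamp(v_min, v_max)  # Bellman-updated atoms
b = (Tz - v_min) / delta_z                           # fractional indices into the support
l, u = b.floor().long(), b.ceil().long()
l[(u > 0) & (l == u)] -= 1                           # keep mass when b lands exactly on an atom
u[(l < atoms - 1) & (l == u)] += 1

m = torch.zeros(atoms)                               # projected target distribution
m.index_add_(0, l, p_next * (u.float() - b))         # m_l += p * (u - b)
m.index_add_(0, u, p_next * (b - l.float()))         # m_u += p * (b - l)
print(m, m.sum())                                    # sums to 1 (up to float error)
```

`agent.py` performs the same projection batched: an `offset` tensor shifts each row's indices so a single `index_add_` on the flattened [batch, atoms] tensor distributes all rows at once, and the resulting cross-entropy -sum(m · log p) is both the training loss and the replay priority.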
-------------------------------------------------------------------------------- /code/Rainbow/agent.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | from __future__ import division 3 | import os 4 | import numpy as np 5 | import torch 6 | from torch import optim 7 | from torch.nn.utils import clip_grad_norm_ 8 | 9 | from model import DQN 10 | 11 | 12 | class Agent(): 13 | def __init__(self, args, env): 14 | self.action_space = env.action_space() 15 | self.atoms = args.atoms 16 | self.Vmin = args.V_min 17 | self.Vmax = args.V_max 18 | self.support = torch.linspace(args.V_min, args.V_max, self.atoms).to(device=args.device) # Support (range) of z 19 | self.delta_z = (args.V_max - args.V_min) / (self.atoms - 1) 20 | self.batch_size = args.batch_size 21 | self.n = args.multi_step 22 | self.discount = args.discount 23 | self.norm_clip = args.norm_clip 24 | 25 | self.online_net = DQN(args, self.action_space).to(device=args.device) 26 | if args.model: # Load pretrained model if provided 27 | if os.path.isfile(args.model): 28 | state_dict = torch.load(args.model, map_location='cpu') # Always load tensors onto CPU by default, will shift to GPU if necessary 29 | if 'conv1.weight' in state_dict.keys(): 30 | for old_key, new_key in (('conv1.weight', 'convs.0.weight'), ('conv1.bias', 'convs.0.bias'), ('conv2.weight', 'convs.2.weight'), ('conv2.bias', 'convs.2.bias'), ('conv3.weight', 'convs.4.weight'), ('conv3.bias', 'convs.4.bias')): 31 | state_dict[new_key] = state_dict[old_key] # Re-map state dict for old pretrained models 32 | del state_dict[old_key] # Delete old keys for strict load_state_dict 33 | self.online_net.load_state_dict(state_dict) 34 | print("Loading pretrained model: " + args.model) 35 | else: # Raise error if incorrect model path provided 36 | raise FileNotFoundError(args.model) 37 | 38 | self.online_net.train() 39 | 40 | self.target_net = DQN(args, self.action_space).to(device=args.device) 41 | self.update_target_net() 42 | self.target_net.train() 43 | for param in self.target_net.parameters(): 44 | param.requires_grad = False 45 | 46 | self.optimiser = optim.Adam(self.online_net.parameters(), lr=args.learning_rate, eps=args.adam_eps) 47 | 48 | # Resets noisy weights in all linear layers (of online net only) 49 | def reset_noise(self): 50 | self.online_net.reset_noise() 51 | 52 | # Acts based on single state (no batch) 53 | def act(self, state): 54 | with torch.no_grad(): 55 | return (self.online_net(state.unsqueeze(0)) * self.support).sum(2).argmax(1).item() 56 | 57 | # Acts with an ε-greedy policy (used for evaluation only) 58 | def act_e_greedy(self, state, epsilon=0.001): # High ε can reduce evaluation scores drastically 59 | return np.random.randint(0, self.action_space) if np.random.random() < epsilon else self.act(state) 60 | 61 | def learn(self, mem): 62 | # Sample transitions 63 | idxs, states, actions, returns, next_states, nonterminals, weights = mem.sample(self.batch_size) 64 | 65 | # Calculate current state probabilities (online network noise already sampled) 66 | log_ps = self.online_net(states, log=True) # Log probabilities log p(s_t, ·; θonline) 67 | log_ps_a = log_ps[range(self.batch_size), actions] # log p(s_t, a_t; θonline) 68 | 69 | with torch.no_grad(): 70 | # Calculate nth next state probabilities 71 | pns = self.online_net(next_states) # Probabilities p(s_t+n, ·; θonline) 72 | dns = self.support.expand_as(pns) * pns # Distribution d_t+n = (z, p(s_t+n, ·; θonline)) 73 | argmax_indices_ns = 
dns.sum(2).argmax(1) # Perform argmax action selection using online network: argmax_a[(z, p(s_t+n, a; θonline))] 74 | self.target_net.reset_noise() # Sample new target net noise 75 | pns = self.target_net(next_states) # Probabilities p(s_t+n, ·; θtarget) 76 | pns_a = pns[range(self.batch_size), argmax_indices_ns] # Double-Q probabilities p(s_t+n, argmax_a[(z, p(s_t+n, a; θonline))]; θtarget) 77 | 78 | # Compute Tz (Bellman operator T applied to z) 79 | Tz = returns.unsqueeze(1) + nonterminals * (self.discount ** self.n) * self.support.unsqueeze(0) # Tz = R^n + (γ^n)z (accounting for terminal states) 80 | Tz = Tz.clamp(min=self.Vmin, max=self.Vmax) # Clamp between supported values 81 | # Compute L2 projection of Tz onto fixed support z 82 | b = (Tz - self.Vmin) / self.delta_z # b = (Tz - Vmin) / Δz 83 | l, u = b.floor().to(torch.int64), b.ceil().to(torch.int64) 84 | # Fix disappearing probability mass when l = b = u (b is int) 85 | l[(u > 0) * (l == u)] -= 1 86 | u[(l < (self.atoms - 1)) * (l == u)] += 1 87 | 88 | # Distribute probability of Tz 89 | m = states.new_zeros(self.batch_size, self.atoms) 90 | offset = torch.linspace(0, ((self.batch_size - 1) * self.atoms), self.batch_size).unsqueeze(1).expand(self.batch_size, self.atoms).to(actions) 91 | m.view(-1).index_add_(0, (l + offset).view(-1), (pns_a * (u.float() - b)).view(-1)) # m_l = m_l + p(s_t+n, a*)(u - b) 92 | m.view(-1).index_add_(0, (u + offset).view(-1), (pns_a * (b - l.float())).view(-1)) # m_u = m_u + p(s_t+n, a*)(b - l) 93 | 94 | loss = -torch.sum(m * log_ps_a, 1) # Cross-entropy loss (minimises DKL(m||p(s_t, a_t))) 95 | self.online_net.zero_grad() 96 | (weights * loss).mean().backward() # Backpropagate importance-weighted minibatch loss 97 | clip_grad_norm_(self.online_net.parameters(), self.norm_clip) # Clip gradients by L2 norm 98 | self.optimiser.step() 99 | 100 | mem.update_priorities(idxs, loss.detach().cpu().numpy()) # Update priorities of sampled transitions 101 | 102 | def update_target_net(self): 103 | self.target_net.load_state_dict(self.online_net.state_dict()) 104 | 105 | # Save model parameters on current device (don't move model between devices) 106 | def save(self, path, name='model.pth'): 107 | torch.save(self.online_net.state_dict(), os.path.join(path, name)) 108 | 109 | # Evaluates Q-value based on single state (no batch) 110 | def evaluate_q(self, state): 111 | with torch.no_grad(): 112 | return (self.online_net(state.unsqueeze(0)) * self.support).sum(2).max(1)[0].item() 113 | 114 | def train(self): 115 | self.online_net.train() 116 | 117 | def eval(self): 118 | self.online_net.eval() 119 | -------------------------------------------------------------------------------- /code/Rainbow/bash.sh: -------------------------------------------------------------------------------- 1 | CUDA_VISIBLE_DEVICES=0 python main.py --id baseline --game alien --enable-cudnn --tensorboard-dir ./results/rainbow & 2 | CUDA_VISIBLE_DEVICES=1 python main.py --id baseline --game amidar --enable-cudnn --tensorboard-dir ./results/rainbow & 3 | CUDA_VISIBLE_DEVICES=2 python main.py --id baseline --game assault --enable-cudnn --tensorboard-dir ./results/rainbow & 4 | CUDA_VISIBLE_DEVICES=3 python main.py --id baseline --game asterix --enable-cudnn --tensorboard-dir ./results/rainbow & 5 | CUDA_VISIBLE_DEVICES=0 python main.py --id baseline --game bank_heist --enable-cudnn --tensorboard-dir ./results/rainbow & 6 | CUDA_VISIBLE_DEVICES=1 python main.py --id baseline --game battle_zone --enable-cudnn --tensorboard-dir ./results/rainbow & 
7 | CUDA_VISIBLE_DEVICES=2 python main.py --id baseline --game boxing --enable-cudnn --tensorboard-dir ./results/rainbow & 8 | CUDA_VISIBLE_DEVICES=3 python main.py --id baseline --game breakout --enable-cudnn --tensorboard-dir ./results/rainbow & 9 | # CUDA_VISIBLE_DEVICES=0 python main.py --id baseline --game chopper_command --enable-cudnn --tensorboard-dir ./results/rainbow & 10 | # CUDA_VISIBLE_DEVICES=1 python main.py --id baseline --game crazy_climber --enable-cudnn --tensorboard-dir ./results/rainbow & 11 | # CUDA_VISIBLE_DEVICES=2 python main.py --id baseline --game demon_attack --enable-cudnn --tensorboard-dir ./results/rainbow & 12 | # CUDA_VISIBLE_DEVICES=3 python main.py --id baseline --game freeway --enable-cudnn --tensorboard-dir ./results/rainbow & 13 | # CUDA_VISIBLE_DEVICES=0 python main.py --id baseline --game frostbite --enable-cudnn --tensorboard-dir ./results/rainbow & 14 | # CUDA_VISIBLE_DEVICES=1 python main.py --id baseline --game gopher --enable-cudnn --tensorboard-dir ./results/rainbow & 15 | # CUDA_VISIBLE_DEVICES=2 python main.py --id baseline --game hero --enable-cudnn --tensorboard-dir ./results/rainbow & 16 | # CUDA_VISIBLE_DEVICES=3 python main.py --id baseline --game jamesbond --enable-cudnn --tensorboard-dir ./results/rainbow & 17 | # CUDA_VISIBLE_DEVICES=0 python main.py --id baseline --game kangaroo --enable-cudnn --tensorboard-dir ./results/rainbow & 18 | # CUDA_VISIBLE_DEVICES=1 python main.py --id baseline --game krull --enable-cudnn --tensorboard-dir ./results/rainbow & 19 | # CUDA_VISIBLE_DEVICES=2 python main.py --id baseline --game kung_fu_master --enable-cudnn --tensorboard-dir ./results/rainbow & 20 | # CUDA_VISIBLE_DEVICES=3 python main.py --id baseline --game ms_pacman --enable-cudnn --tensorboard-dir ./results/rainbow & 21 | # CUDA_VISIBLE_DEVICES=0 python main.py --id baseline --game pong --enable-cudnn --tensorboard-dir ./results/rainbow & 22 | # CUDA_VISIBLE_DEVICES=1 python main.py --id baseline --game private_eye --enable-cudnn --tensorboard-dir ./results/rainbow & 23 | # CUDA_VISIBLE_DEVICES=2 python main.py --id baseline --game qbert --enable-cudnn --tensorboard-dir ./results/rainbow & 24 | # CUDA_VISIBLE_DEVICES=3 python main.py --id baseline --game road_runner --enable-cudnn --tensorboard-dir ./results/rainbow & 25 | # CUDA_VISIBLE_DEVICES=0 python main.py --id baseline --game seaquest --enable-cudnn --tensorboard-dir ./results/rainbow & 26 | # CUDA_VISIBLE_DEVICES=1 python main.py --id baseline --game up_n_down --enable-cudnn --tensorboard-dir ./results/rainbow &s -------------------------------------------------------------------------------- /code/Rainbow/env.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | from collections import deque 3 | import random 4 | import atari_py 5 | import cv2 6 | import torch 7 | 8 | 9 | class Env(): 10 | def __init__(self, args): 11 | self.device = args.device 12 | self.ale = atari_py.ALEInterface() 13 | self.ale.setInt('random_seed', args.seed) 14 | self.ale.setInt('max_num_frames_per_episode', args.max_episode_length) 15 | self.ale.setFloat('repeat_action_probability', 0) # Disable sticky actions 16 | self.ale.setInt('frame_skip', 0) 17 | self.ale.setBool('color_averaging', False) 18 | self.ale.loadROM(atari_py.get_game_path(args.game)) # ROM loading must be done after setting options 19 | actions = self.ale.getMinimalActionSet() 20 | self.actions = dict([i, e] for i, e in zip(range(len(actions)), actions)) 21 | self.lives = 0 # 
Life counter (used in DeepMind training) 22 | self.life_termination = False # Used to check if resetting only from loss of life 23 | self.window = args.history_length # Number of frames to concatenate 24 | self.state_buffer = deque([], maxlen=args.history_length) 25 | self.training = True # Consistent with model training mode 26 | 27 | def _get_state(self): 28 | state = cv2.resize(self.ale.getScreenGrayscale(), (84, 84), interpolation=cv2.INTER_LINEAR) 29 | return torch.tensor(state, dtype=torch.float32, device=self.device).div_(255) 30 | 31 | def _reset_buffer(self): 32 | for _ in range(self.window): 33 | self.state_buffer.append(torch.zeros(84, 84, device=self.device)) 34 | 35 | def reset(self): 36 | if self.life_termination: 37 | self.life_termination = False # Reset flag 38 | self.ale.act(0) # Use a no-op after loss of life 39 | else: 40 | # Reset internals 41 | self._reset_buffer() 42 | self.ale.reset_game() 43 | # Perform up to 30 random no-ops before starting 44 | for _ in range(random.randrange(30)): 45 | self.ale.act(0) # Assumes raw action 0 is always no-op 46 | if self.ale.game_over(): 47 | self.ale.reset_game() 48 | # Process and return "initial" state 49 | observation = self._get_state() 50 | self.state_buffer.append(observation) 51 | self.lives = self.ale.lives() 52 | return torch.stack(list(self.state_buffer), 0) 53 | 54 | def step(self, action): 55 | # Repeat action 4 times, max pool over last 2 frames 56 | frame_buffer = torch.zeros(2, 84, 84, device=self.device) 57 | reward, done = 0, False 58 | for t in range(4): 59 | reward += self.ale.act(self.actions.get(action)) 60 | if t == 2: 61 | frame_buffer[0] = self._get_state() 62 | elif t == 3: 63 | frame_buffer[1] = self._get_state() 64 | done = self.ale.game_over() 65 | if done: 66 | break 67 | observation = frame_buffer.max(0)[0] 68 | self.state_buffer.append(observation) 69 | # Detect loss of life as terminal in training mode 70 | if self.training: 71 | lives = self.ale.lives() 72 | if lives < self.lives and lives > 0: # Lives > 0 for Q*bert 73 | self.life_termination = not done # Only set flag when not truly done 74 | done = True 75 | self.lives = lives 76 | # Return state, reward, done 77 | return torch.stack(list(self.state_buffer), 0), reward, done 78 | 79 | # Uses loss of life as terminal signal 80 | def train(self): 81 | self.training = True 82 | 83 | # Uses standard terminal signal 84 | def eval(self): 85 | self.training = False 86 | 87 | def action_space(self): 88 | return len(self.actions) 89 | 90 | def render(self): 91 | cv2.imshow('screen', self.ale.getScreenRGB()[:, :, ::-1]) 92 | cv2.waitKey(1) 93 | 94 | def close(self): 95 | cv2.destroyAllWindows() 96 | -------------------------------------------------------------------------------- /code/Rainbow/main.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | """ 3 | TODO: Note that DeepMind's evaluation method is running the latest agent for 500K frames every 1M steps 4 | """ 5 | from __future__ import division 6 | 7 | import torch 8 | from torch.utils.tensorboard import SummaryWriter 9 | from tqdm import trange 10 | import numpy as np 11 | import atari_py 12 | 13 | from datetime import datetime 14 | import argparse 15 | import pickle 16 | import bz2 17 | import os 18 | 19 | from agent import Agent 20 | from env import Env 21 | from memory import ReplayMemory 22 | from test import test 23 | 24 | 25 | def parse_arguments(): 26 | 27 | parser = argparse.ArgumentParser(description='Rainbow') 28 | 
parser.add_argument('--id', type=str, default='default', help='Experiment ID') 29 | parser.add_argument('--seed', type=int, default=123, help='Random seed') 30 | parser.add_argument('--disable-cuda', action='store_true', help='Disable CUDA') 31 | parser.add_argument('--game', type=str, default='space_invaders', choices=atari_py.list_games(), help='ATARI game') 32 | parser.add_argument('--T-max', type=int, default=int(50e6), metavar='STEPS', help='Number of training steps (4x number of frames)') 33 | parser.add_argument('--max-episode-length', type=int, default=int(108e3), metavar='LENGTH', help='Max episode length in game frames (0 to disable)') 34 | parser.add_argument('--history-length', type=int, default=4, metavar='T', help='Number of consecutive states processed') 35 | parser.add_argument('--hidden-size', type=int, default=512, metavar='SIZE', help='Network hidden size') 36 | parser.add_argument('--noisy-std', type=float, default=0.1, metavar='σ', help='Initial standard deviation of noisy linear layers') 37 | parser.add_argument('--atoms', type=int, default=51, metavar='C', help='Discretised size of value distribution') 38 | parser.add_argument('--V-min', type=float, default=-10, metavar='V', help='Minimum of value distribution support') 39 | parser.add_argument('--V-max', type=float, default=10, metavar='V', help='Maximum of value distribution support') 40 | parser.add_argument('--model', type=str, metavar='PARAMS', help='Pretrained model (state dict)') 41 | parser.add_argument('--memory-capacity', type=int, default=int(1e6), metavar='CAPACITY', help='Experience replay memory capacity') 42 | parser.add_argument('--replay-frequency', type=int, default=4, metavar='k', help='Frequency of sampling from memory') 43 | parser.add_argument('--priority-exponent', type=float, default=0.5, metavar='ω', help='Prioritised experience replay exponent (originally denoted α)') 44 | parser.add_argument('--priority-weight', type=float, default=0.4, metavar='β', help='Initial prioritised experience replay importance sampling weight') 45 | parser.add_argument('--multi-step', type=int, default=3, metavar='n', help='Number of steps for multi-step return') 46 | parser.add_argument('--discount', type=float, default=0.99, metavar='γ', help='Discount factor') 47 | parser.add_argument('--target-update', type=int, default=int(8e3), metavar='τ', help='Number of steps after which to update target network') 48 | parser.add_argument('--reward-clip', type=int, default=1, metavar='VALUE', help='Reward clipping (0 to disable)') 49 | parser.add_argument('--learning-rate', type=float, default=0.0000625, metavar='η', help='Learning rate') 50 | parser.add_argument('--adam-eps', type=float, default=1.5e-4, metavar='ε', help='Adam epsilon') 51 | parser.add_argument('--batch-size', type=int, default=32, metavar='SIZE', help='Batch size') 52 | parser.add_argument('--norm-clip', type=float, default=10, metavar='NORM', help='Max L2 norm for gradient clipping') 53 | parser.add_argument('--learn-start', type=int, default=int(20e3), metavar='STEPS', help='Number of steps before starting training') 54 | parser.add_argument('--evaluate', action='store_true', help='Evaluate only') 55 | parser.add_argument('--evaluation-interval', type=int, default=100000, metavar='STEPS', help='Number of training steps between evaluations') 56 | parser.add_argument('--evaluation-episodes', type=int, default=10, metavar='N', help='Number of evaluation episodes to average over') 57 | parser.add_argument('--evaluation-size', type=int, default=500, 
metavar='N', help='Number of transitions to use for validating Q') 58 | parser.add_argument('--render', action='store_true', help='Display screen (testing only)') 59 | parser.add_argument('--enable-cudnn', action='store_true', help='Enable cuDNN (faster but nondeterministic)') 60 | parser.add_argument('--checkpoint-interval', default=int(20e3), help='How often to checkpoint the model, defaults to 0 (never checkpoint)') 61 | parser.add_argument('--memory', help='Path to save/load the memory from') 62 | parser.add_argument('--disable-bzip-memory', action='store_true', help='Don\'t zip the memory file. Not recommended (zipping is a bit slower and much, much smaller)') 63 | parser.add_argument('--tensorboard-dir', type=str, default=None, help='tensorboard directory') 64 | parser.add_argument('--architecture', type=str, default='canonical', choices=['canonical', 'data-efficient'], metavar='ARCH', help='Network architecture') 65 | 66 | args = parser.parse_args() 67 | 68 | return args 69 | 70 | def load_memory(memory_path, disable_bzip): 71 | if disable_bzip: 72 | with open(memory_path, 'rb') as pickle_file: 73 | return pickle.load(pickle_file) 74 | else: 75 | with bz2.open(memory_path, 'rb') as zipped_pickle_file: 76 | return pickle.load(zipped_pickle_file) 77 | 78 | def save_memory(memory, memory_path, disable_bzip): 79 | if disable_bzip: 80 | with open(memory_path, 'wb') as pickle_file: 81 | pickle.dump(memory, pickle_file) 82 | else: 83 | with bz2.open(memory_path, 'wb') as zipped_pickle_file: 84 | pickle.dump(memory, zipped_pickle_file) 85 | 86 | class Logger(object): 87 | def __init__(self, path): 88 | self.path = path 89 | 90 | def info(self, s): 91 | string = '[' + str(datetime.now().strftime('%Y-%m-%dT%H:%M:%S')) + '] ' + s 92 | print(string) 93 | with open(os.path.join(self.path, 'log.txt'), 'a+') as f: 94 | f.writelines([string, '']) 95 | 96 | def main(): 97 | args = parse_arguments() 98 | 99 | results_dir = os.path.join('results', args.id) 100 | os.makedirs(results_dir, exist_ok=True) 101 | logger = Logger(results_dir) 102 | 103 | metrics = {'steps': [], 'rewards': [], 'Qs': [], 'Qstds': [], 'best_avg_reward': -float('inf')} 104 | np.random.seed(args.seed) 105 | torch.manual_seed(np.random.randint(1, 10000)) 106 | if torch.cuda.is_available() and not args.disable_cuda: 107 | args.device = torch.device('cuda') 108 | torch.cuda.manual_seed(np.random.randint(1, 10000)) 109 | torch.backends.cudnn.enabled = args.enable_cudnn 110 | else: 111 | args.device = torch.device('cpu') 112 | 113 | if args.tensorboard_dir is None: 114 | writer = SummaryWriter(os.path.join(results_dir, 'tensorboard', args.game, args.architecture)) 115 | else: 116 | writer = SummaryWriter(os.path.join(args.tensorboard_dir, args.game, args.architecture)) 117 | 118 | # Environment 119 | env = Env(args) 120 | env.train() 121 | action_space = env.action_space() 122 | 123 | # Agent 124 | dqn = Agent(args, env) 125 | 126 | # If a model is provided, and evaluate is fale, presumably we want to resume, so try to load memory 127 | if args.model is not None and not args.evaluate: 128 | if not args.memory: 129 | raise ValueError('Cannot resume training without memory save path. Aborting...') 130 | elif not os.path.exists(args.memory): 131 | raise ValueError('Could not find memory file at {path}. 
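`save_memory` / `load_memory` above persist the replay buffer by pickling it, optionally through bz2 compression. The snippet below is a minimal round-trip sketch with a small stand-in object in place of the real `ReplayMemory`.

```python
import bz2
import os
import pickle
import tempfile

# Stand-in object; in main() this would be the ReplayMemory instance.
dummy_memory = {'transitions': list(range(10))}
path = os.path.join(tempfile.gettempdir(), 'mem.pkl.bz2')

with bz2.open(path, 'wb') as f:   # mirrors save_memory(..., disable_bzip=False)
    pickle.dump(dummy_memory, f)
with bz2.open(path, 'rb') as f:   # mirrors load_memory(..., disable_bzip=False)
    restored = pickle.load(f)

assert restored == dummy_memory
```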
Aborting...'.format(path=args.memory)) 132 | 133 | mem = load_memory(args.memory, args.disable_bzip_memory) 134 | 135 | else: 136 | mem = ReplayMemory(args, args.memory_capacity) 137 | 138 | priority_weight_increase = (1 - args.priority_weight) / (args.T_max - args.learn_start) 139 | 140 | # Construct validation memory 141 | val_mem = ReplayMemory(args, args.evaluation_size) 142 | T, done = 0, True 143 | while T < args.evaluation_size: 144 | if done: 145 | state, done = env.reset(), False 146 | 147 | next_state, _, done = env.step(np.random.randint(0, action_space)) 148 | val_mem.append(state, None, None, done) 149 | state = next_state 150 | T += 1 151 | 152 | if args.evaluate: 153 | dqn.eval() # Set DQN (online network) to evaluation mode 154 | test_result = test(args, 0, dqn, val_mem, metrics, results_dir, evaluate=True) # Test 155 | logger.info('Avg. reward: ' + str(test_result['avg_reward']) \ 156 | + ' | Avg. Q: ' + str(test_result['avg_Q'])) 157 | else: 158 | # Training loop 159 | dqn.train() 160 | T, done = 0, True 161 | accumulate_reward = 0 162 | for T in trange(1, args.T_max + 1): 163 | if done: 164 | state, done = env.reset(), False 165 | writer.add_scalar('Train/Reward', accumulate_reward, T) 166 | accumulate_reward = 0 167 | 168 | if T % args.replay_frequency == 0: 169 | dqn.reset_noise() # Draw a new set of noisy weights 170 | 171 | action = dqn.act(state) # Choose an action greedily (with noisy weights) 172 | next_state, reward, done = env.step(action) # Step 173 | accumulate_reward += reward 174 | if args.reward_clip > 0: 175 | reward = max(min(reward, args.reward_clip), -args.reward_clip) # Clip rewards 176 | mem.append(state, action, reward, done) # Append transition to memory 177 | 178 | # Train and test 179 | if T >= args.learn_start: 180 | mem.priority_weight = min(mem.priority_weight + priority_weight_increase, 1) # Anneal importance sampling weight β to 1 181 | 182 | if T % args.replay_frequency == 0: 183 | dqn.learn(mem) # Train with n-step distributional double-Q learning 184 | 185 | if T % args.evaluation_interval == 0: 186 | dqn.eval() # Set DQN (online network) to evaluation mode 187 | test_result = test(args, T, dqn, val_mem, metrics, results_dir) # Test 188 | for k, v in test_result.items(): 189 | writer.add_scalar('Eval/{}'.format(k), v, T) 190 | logger.info('T = ' + str(T) + ' / ' + str(args.T_max) + \ 191 | ' | Avg. reward: ' + str(test_result['avg_reward']) + \ 192 | ' | Avg. 
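In the training loop above, the prioritised-replay importance-sampling exponent β (`priority_weight`) is annealed linearly from its initial value to 1 between `learn_start` and `T_max`. The sketch below is a closed-form version of that per-step increment, evaluated at a few points using the parser defaults (β₀ = 0.4, T_max = 50M, learn_start = 20k).

```python
def annealed_beta(step, beta0=0.4, t_max=int(50e6), learn_start=int(20e3)):
    """Linear schedule equivalent to adding priority_weight_increase each step
    once step >= learn_start, clamped at 1 (mirrors the loop above)."""
    if step < learn_start:
        return beta0
    increase = (1.0 - beta0) / (t_max - learn_start)
    return min(beta0 + increase * (step - learn_start), 1.0)

for step in (0, int(20e3), int(25e6), int(50e6)):
    print(step, round(annealed_beta(step), 4))
# 0 -> 0.4, 20000 -> 0.4, 25M -> ~0.7, 50M -> 1.0
```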
Q: ' + str(test_result['avg_Q'])) 193 | dqn.train() # Set DQN (online network) back to training mode 194 | 195 | # If memory path provided, save it 196 | if args.memory is not None: 197 | save_memory(mem, args.memory, args.disable_bzip_memory) 198 | 199 | # Update target network 200 | if T % args.target_update == 0: 201 | dqn.update_target_net() 202 | 203 | # Checkpoint the network 204 | if T % args.checkpoint_interval == 0: 205 | dqn.save(results_dir, 'checkpoint.pth') 206 | 207 | state = next_state 208 | 209 | env.close() 210 | 211 | if __name__ == '__main__': 212 | main() 213 | -------------------------------------------------------------------------------- /code/Rainbow/memory.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | from __future__ import division 3 | from collections import namedtuple 4 | import numpy as np 5 | import torch 6 | 7 | 8 | Transition = namedtuple('Transition', ('timestep', 'state', 'action', 'reward', 'nonterminal')) 9 | blank_trans = Transition(0, torch.zeros(84, 84, dtype=torch.uint8), None, 0, False) 10 | 11 | 12 | # Segment tree data structure where parent node values are sum/max of children node values 13 | class SegmentTree(): 14 | def __init__(self, size): 15 | self.index = 0 16 | self.size = size 17 | self.full = False # Used to track actual capacity 18 | self.sum_tree = np.zeros((2 * size - 1, ), dtype=np.float32) # Initialise fixed size tree with all (priority) zeros 19 | self.data = np.array([None] * size) # Wrap-around cyclic buffer 20 | self.max = 1 # Initial max value to return (1 = 1^ω) 21 | 22 | # Propagates value up tree given a tree index 23 | def _propagate(self, index, value): 24 | parent = (index - 1) // 2 25 | left, right = 2 * parent + 1, 2 * parent + 2 26 | self.sum_tree[parent] = self.sum_tree[left] + self.sum_tree[right] 27 | if parent != 0: 28 | self._propagate(parent, value) 29 | 30 | # Updates value given a tree index 31 | def update(self, index, value): 32 | self.sum_tree[index] = value # Set new value 33 | self._propagate(index, value) # Propagate value 34 | self.max = max(value, self.max) 35 | 36 | def append(self, data, value): 37 | self.data[self.index] = data # Store data in underlying data structure 38 | self.update(self.index + self.size - 1, value) # Update tree 39 | self.index = (self.index + 1) % self.size # Update index 40 | self.full = self.full or self.index == 0 # Save when capacity reached 41 | self.max = max(value, self.max) 42 | 43 | # Searches for the location of a value in sum tree 44 | def _retrieve(self, index, value): 45 | left, right = 2 * index + 1, 2 * index + 2 46 | if left >= len(self.sum_tree): 47 | return index 48 | elif value <= self.sum_tree[left]: 49 | return self._retrieve(left, value) 50 | else: 51 | return self._retrieve(right, value - self.sum_tree[left]) 52 | 53 | # Searches for a value in sum tree and returns value, data index and tree index 54 | def find(self, value): 55 | index = self._retrieve(0, value) # Search for index of item from root 56 | data_index = index - self.size + 1 57 | return (self.sum_tree[index], data_index, index) # Return value, data index, tree index 58 | 59 | # Returns data given a data index 60 | def get(self, data_index): 61 | return self.data[data_index % self.size] 62 | 63 | def total(self): 64 | return self.sum_tree[0] 65 | 66 | class ReplayMemory(): 67 | def __init__(self, args, capacity): 68 | self.device = args.device 69 | self.capacity = capacity 70 | self.history = args.history_length 71 | 
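The `SegmentTree` above stores priorities in a binary sum tree so that sampling an index with probability proportional to its priority (via `_retrieve`/`find`) costs O(log n). For reference, the flat NumPy version below computes the same thing in O(n) using a cumulative sum; it is a semantic sketch, not a drop-in replacement.

```python
import numpy as np

priorities = np.array([1.0, 0.5, 3.0, 0.2, 1.3], dtype=np.float32)
total = priorities.sum()                      # what SegmentTree.total() returns

def find(value, priorities=priorities):
    """Index whose cumulative-priority interval contains `value`
    (flat equivalent of SegmentTree._retrieve)."""
    return int(np.searchsorted(np.cumsum(priorities), value, side='left'))

# Sampling value uniformly in [0, total) picks index i with probability p_i / total.
samples = [find(v) for v in np.random.uniform(0, total, size=10000)]
print(np.bincount(samples, minlength=len(priorities)) / 10000)  # ≈ priorities / total
```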
self.discount = args.discount 72 | self.n = args.multi_step 73 | self.priority_weight = args.priority_weight # Initial importance sampling weight β, annealed to 1 over course of training 74 | self.priority_exponent = args.priority_exponent 75 | self.t = 0 # Internal episode timestep counter 76 | self.transitions = SegmentTree(capacity) # Store transitions in a wrap-around cyclic buffer within a sum tree for querying priorities 77 | 78 | # Adds state and action at time t, reward and terminal at time t + 1 79 | def append(self, state, action, reward, terminal): 80 | state = state[-1].mul(255).to(dtype=torch.uint8, device=torch.device('cpu')) # Only store last frame and discretise to save memory 81 | self.transitions.append(Transition(self.t, state, action, reward, not terminal), self.transitions.max) # Store new transition with maximum priority 82 | self.t = 0 if terminal else self.t + 1 # Start new episodes with t = 0 83 | 84 | # Returns a transition with blank states where appropriate 85 | def _get_transition(self, idx): 86 | transition = np.array([None] * (self.history + self.n)) 87 | transition[self.history - 1] = self.transitions.get(idx) 88 | for t in range(self.history - 2, -1, -1): # e.g. 2 1 0 89 | if transition[t + 1].timestep == 0: 90 | transition[t] = blank_trans # If future frame has timestep 0 91 | else: 92 | transition[t] = self.transitions.get(idx - self.history + 1 + t) 93 | for t in range(self.history, self.history + self.n): # e.g. 4 5 6 94 | if transition[t - 1].nonterminal: 95 | transition[t] = self.transitions.get(idx - self.history + 1 + t) 96 | else: 97 | transition[t] = blank_trans # If prev (next) frame is terminal 98 | return transition 99 | 100 | # Returns a valid sample from a segment 101 | def _get_sample_from_segment(self, segment, i): 102 | valid = False 103 | while not valid: 104 | sample = np.random.uniform(i * segment, (i + 1) * segment) # Uniformly sample an element from within a segment 105 | prob, idx, tree_idx = self.transitions.find(sample) # Retrieve sample from tree with un-normalised probability 106 | # Resample if transition straddled current index or probablity 0 107 | if (self.transitions.index - idx) % self.capacity > self.n and (idx - self.transitions.index) % self.capacity >= self.history and prob != 0: 108 | valid = True # Note that conditions are valid but extra conservative around buffer index 0 109 | 110 | # Retrieve all required transition data (from t - h to t + n) 111 | transition = self._get_transition(idx) 112 | # Create un-discretised state and nth next state 113 | state = torch.stack([trans.state for trans in transition[:self.history]]).to(device=self.device).to(dtype=torch.float32).div_(255) 114 | next_state = torch.stack([trans.state for trans in transition[self.n:self.n + self.history]]).to(device=self.device).to(dtype=torch.float32).div_(255) 115 | # Discrete action to be used as index 116 | action = torch.tensor([transition[self.history - 1].action], dtype=torch.int64, device=self.device) 117 | # Calculate truncated n-step discounted return R^n = Σ_k=0->n-1 (γ^k)R_t+k+1 (note that invalid nth next states have reward 0) 118 | R = torch.tensor([sum(self.discount ** n * transition[self.history + n - 1].reward for n in range(self.n))], dtype=torch.float32, device=self.device) 119 | # Mask for non-terminal nth next states 120 | nonterminal = torch.tensor([transition[self.history + self.n - 1].nonterminal], dtype=torch.float32, device=self.device) 121 | 122 | return prob, idx, tree_idx, state, action, R, next_state, nonterminal 123 | 
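`_get_sample_from_segment` above assembles the truncated n-step return R(n)_t = Σ_{k=0}^{n-1} γ^k r_{t+k+1}, with rewards beyond a terminal transition replaced by the blank transition's reward of 0. A minimal sketch of that sum:

```python
def n_step_return(rewards, gamma=0.99, n=3):
    """Truncated n-step return, matching the R computed in
    _get_sample_from_segment above; rewards past a terminal step are 0."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards[:n]))

print(n_step_return([1.0, 0.0, 1.0], gamma=0.99, n=3))   # 1 + 0 + 0.99**2 = 1.9801
```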
124 | def sample(self, batch_size): 125 | p_total = self.transitions.total() # Retrieve sum of all priorities (used to create a normalised probability distribution) 126 | segment = p_total / batch_size # Batch size number of segments, based on sum over all probabilities 127 | batch = [self._get_sample_from_segment(segment, i) for i in range(batch_size)] # Get batch of valid samples 128 | probs, idxs, tree_idxs, states, actions, returns, next_states, nonterminals = zip(*batch) 129 | states, next_states, = torch.stack(states), torch.stack(next_states) 130 | actions, returns, nonterminals = torch.cat(actions), torch.cat(returns), torch.stack(nonterminals) 131 | probs = np.array(probs, dtype=np.float32) / p_total # Calculate normalised probabilities 132 | capacity = self.capacity if self.transitions.full else self.transitions.index 133 | weights = (capacity * probs) ** -self.priority_weight # Compute importance-sampling weights w 134 | weights = torch.tensor(weights / weights.max(), dtype=torch.float32, device=self.device) # Normalise by max importance-sampling weight from batch 135 | return tree_idxs, states, actions, returns, next_states, nonterminals, weights 136 | 137 | 138 | def update_priorities(self, idxs, priorities): 139 | priorities = np.power(priorities, self.priority_exponent) 140 | [self.transitions.update(idx, priority) for idx, priority in zip(idxs, priorities)] 141 | 142 | # Set up internal state for iterator 143 | def __iter__(self): 144 | self.current_idx = 0 145 | return self 146 | 147 | # Return valid states for validation 148 | def __next__(self): 149 | if self.current_idx == self.capacity: 150 | raise StopIteration 151 | # Create stack of states 152 | state_stack = [None] * self.history 153 | state_stack[-1] = self.transitions.data[self.current_idx].state 154 | prev_timestep = self.transitions.data[self.current_idx].timestep 155 | for t in reversed(range(self.history - 1)): 156 | if prev_timestep == 0: 157 | state_stack[t] = blank_trans.state # If future frame has timestep 0 158 | else: 159 | state_stack[t] = self.transitions.data[self.current_idx + t - self.history + 1].state 160 | prev_timestep -= 1 161 | state = torch.stack(state_stack, 0).to(dtype=torch.float32, device=self.device).div_(255) # Agent will turn into batch 162 | self.current_idx += 1 163 | return state 164 | 165 | next = __next__ # Alias __next__ for Python 2 compatibility 166 | -------------------------------------------------------------------------------- /code/Rainbow/model.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | from __future__ import division 3 | import math 4 | import torch 5 | from torch import nn 6 | from torch.nn import functional as F 7 | 8 | 9 | # Factorised NoisyLinear layer with bias 10 | class NoisyLinear(nn.Module): 11 | def __init__(self, in_features, out_features, std_init=0.5): 12 | super(NoisyLinear, self).__init__() 13 | self.in_features = in_features 14 | self.out_features = out_features 15 | self.std_init = std_init 16 | self.weight_mu = nn.Parameter(torch.empty(out_features, in_features)) 17 | self.weight_sigma = nn.Parameter(torch.empty(out_features, in_features)) 18 | self.register_buffer('weight_epsilon', torch.empty(out_features, in_features)) 19 | self.bias_mu = nn.Parameter(torch.empty(out_features)) 20 | self.bias_sigma = nn.Parameter(torch.empty(out_features)) 21 | self.register_buffer('bias_epsilon', torch.empty(out_features)) 22 | self.reset_parameters() 23 | self.reset_noise() 24 | 25 | def 
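`sample` above turns the sampled priorities into importance-sampling weights w_i = (N · P(i))^(−β), normalised by the largest weight in the batch so that updates are only ever scaled down. A small NumPy sketch of that computation:

```python
import numpy as np

def importance_weights(sampled_priorities, total_priority, capacity, beta):
    """w_i = (N * P(i))^(-beta), normalised by max(w), as in sample() above."""
    probs = np.asarray(sampled_priorities, dtype=np.float32) / total_priority
    weights = (capacity * probs) ** (-beta)
    return weights / weights.max()

print(importance_weights([3.0, 0.5, 1.0], total_priority=6.0,
                         capacity=1000, beta=0.4))
```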
reset_parameters(self): 26 | mu_range = 1 / math.sqrt(self.in_features) 27 | self.weight_mu.data.uniform_(-mu_range, mu_range) 28 | self.weight_sigma.data.fill_(self.std_init / math.sqrt(self.in_features)) 29 | self.bias_mu.data.uniform_(-mu_range, mu_range) 30 | self.bias_sigma.data.fill_(self.std_init / math.sqrt(self.out_features)) 31 | 32 | def _scale_noise(self, size): 33 | x = torch.randn(size) 34 | return x.sign().mul_(x.abs().sqrt_()) 35 | 36 | def reset_noise(self): 37 | epsilon_in = self._scale_noise(self.in_features) 38 | epsilon_out = self._scale_noise(self.out_features) 39 | self.weight_epsilon.copy_(epsilon_out.ger(epsilon_in)) 40 | self.bias_epsilon.copy_(epsilon_out) 41 | 42 | def forward(self, input): 43 | if self.training: 44 | return F.linear(input, self.weight_mu + self.weight_sigma * self.weight_epsilon, self.bias_mu + self.bias_sigma * self.bias_epsilon) 45 | else: 46 | return F.linear(input, self.weight_mu, self.bias_mu) 47 | 48 | 49 | class DQN(nn.Module): 50 | def __init__(self, args, action_space): 51 | super(DQN, self).__init__() 52 | self.atoms = args.atoms 53 | self.action_space = action_space 54 | 55 | if args.architecture == 'canonical': 56 | self.convs = nn.Sequential(nn.Conv2d(args.history_length, 32, 8, stride=4, padding=0), nn.ReLU(), 57 | nn.Conv2d(32, 64, 4, stride=2, padding=0), nn.ReLU(), 58 | nn.Conv2d(64, 64, 3, stride=1, padding=0), nn.ReLU()) 59 | self.conv_output_size = 3136 60 | elif args.architecture == 'data-efficient': 61 | self.convs = nn.Sequential(nn.Conv2d(args.history_length, 32, 5, stride=5, padding=0), nn.ReLU(), 62 | nn.Conv2d(32, 64, 5, stride=5, padding=0), nn.ReLU()) 63 | self.conv_output_size = 576 64 | 65 | self.fc_h_v = NoisyLinear(self.conv_output_size, args.hidden_size, std_init=args.noisy_std) 66 | self.fc_h_a = NoisyLinear(self.conv_output_size, args.hidden_size, std_init=args.noisy_std) 67 | self.fc_z_v = NoisyLinear(args.hidden_size, self.atoms, std_init=args.noisy_std) 68 | self.fc_z_a = NoisyLinear(args.hidden_size, action_space * self.atoms, std_init=args.noisy_std) 69 | 70 | def representation(self, x): 71 | x = self.convs(x) 72 | x = x.view(-1, self.conv_output_size) 73 | return x 74 | 75 | def forward(self, x, log=False): 76 | x = self.convs(x) 77 | x = x.view(-1, self.conv_output_size) 78 | v = self.fc_z_v(F.relu(self.fc_h_v(x))) # Value stream 79 | a = self.fc_z_a(F.relu(self.fc_h_a(x))) # Advantage stream 80 | v, a = v.view(-1, 1, self.atoms), a.view(-1, self.action_space, self.atoms) 81 | q = v + a - a.mean(1, keepdim=True) # Combine streams 82 | if log: # Use log softmax for numerical stability 83 | q = F.log_softmax(q, dim=2) # Log probabilities with action over second dimension 84 | else: 85 | q = F.softmax(q, dim=2) # Probabilities with action over second dimension 86 | return q 87 | 88 | def reset_noise(self): 89 | for name, module in self.named_children(): 90 | if 'fc' in name: 91 | module.reset_noise() 92 | -------------------------------------------------------------------------------- /code/Rainbow/requirements.txt: -------------------------------------------------------------------------------- 1 | atari-py==0.2.6 2 | opencv-python==4.2.0.34 3 | plotly==4.8.1 4 | procgen==0.10.3 5 | tensorboardX==2.0 6 | torch==1.5.1 7 | tqdm==4.42.1 8 | tensorflow<2.0.0,>=1.4.0 9 | numpy<2.0.0,>=1.17.0 10 | pathos==0.2.6 11 | -------------------------------------------------------------------------------- /code/Rainbow/test.py: -------------------------------------------------------------------------------- 1 | from 
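`NoisyLinear` above uses factorised Gaussian noise: two vectors ε_in and ε_out are transformed with f(x) = sign(x)·√|x| and combined by an outer product to form the weight noise, while the bias noise is f(ε_out) alone. The sketch below isolates just that noise generation.

```python
import torch

def factorised_noise(in_features, out_features):
    """Noise generation as in NoisyLinear.reset_noise above:
    f(x) = sign(x) * sqrt(|x|); weight noise is the outer product f(eps_out) f(eps_in)^T."""
    f = lambda x: x.sign().mul(x.abs().sqrt())
    eps_in, eps_out = f(torch.randn(in_features)), f(torch.randn(out_features))
    weight_eps = eps_out.unsqueeze(1) * eps_in.unsqueeze(0)   # same as eps_out.ger(eps_in)
    return weight_eps, eps_out

w_eps, b_eps = factorised_noise(3136, 512)
print(w_eps.shape, b_eps.shape)   # torch.Size([512, 3136]) torch.Size([512])
```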
__future__ import division 2 | import os 3 | import plotly 4 | from plotly.graph_objs import Scatter 5 | from plotly.graph_objs.scatter import Line 6 | import numpy as np 7 | import torch 8 | from env import Env 9 | 10 | 11 | # Test DQN 12 | def test(args, T, agent, val_mem, metrics, results_dir, evaluate=False, plot=False): 13 | 14 | env = Env(args) 15 | env.eval() 16 | metrics['steps'].append(T) 17 | T_rewards, T_Qs, T_Qstds = [], [], [] 18 | 19 | # Test performance over several episodes 20 | return_trajs = np.array([]) 21 | Q_trajs = np.array([]) 22 | Qstd_trajs = np.array([]) 23 | done = True 24 | for _ in range(args.evaluation_episodes): 25 | while True: 26 | if done: 27 | state, reward_traj, reward_sum, state_traj, done = env.reset(), [], 0, [], False 28 | 29 | state_traj.append(state) 30 | action = agent.act(state) # Choose an action greedily (possibly with noisy net) 31 | state, reward, done = env.step(action) # Step 32 | reward_traj.append(reward) 33 | reward_sum += reward 34 | if args.render: 35 | env.render() 36 | 37 | if done: 38 | T_rewards.append(reward_sum) 39 | reward_traj = np.array(reward_traj) 40 | return_trajs = np.append(return_trajs, np.cumsum(reward_traj[::-1])[::-1]) 41 | t_Qs, t_Qstds = [], [] 42 | for state in state_traj: 43 | res = agent.evaluate_q(state) 44 | t_Qs.append(res) 45 | t_Qstds.append(0) 46 | Q_trajs = np.append(Q_trajs, np.array(t_Qs)) 47 | Qstd_trajs = np.append(Qstd_trajs, np.array(t_Qstds)) 48 | break 49 | env.close() 50 | 51 | # Test Q-values over validation memory 52 | for state in val_mem: # Iterate over valid states 53 | res = agent.evaluate_q(state) 54 | T_Qs.append(res) 55 | T_Qstds.append(0) 56 | 57 | avg_reward = sum(T_rewards) / len(T_rewards) 58 | avg_Q = sum(T_Qs) / len(T_Qs) 59 | avg_Qstd = sum(T_Qstds) / len(T_Qstds) 60 | 61 | if not evaluate: 62 | # Save model parameters if improved 63 | if avg_reward > metrics['best_avg_reward']: 64 | metrics['best_avg_reward'] = avg_reward 65 | agent.save(results_dir) 66 | 67 | # Append to results and save metrics 68 | metrics['rewards'].append(T_rewards) 69 | metrics['Qs'].append(T_Qs) 70 | metrics['Qstds'].append(T_Qstds) 71 | torch.save(metrics, os.path.join(results_dir, 'metrics.pth')) 72 | 73 | # Plot 74 | if plot: 75 | _plot_line(metrics['steps'], metrics['rewards'], 'Reward', path=results_dir) 76 | _plot_line(metrics['steps'], metrics['Qs'], 'Q', path=results_dir) 77 | _plot_line(metrics['steps'], metrics['Qstds'], 'Qstd', path=results_dir) 78 | 79 | avg_R = np.mean(return_trajs) 80 | bias_trajs = (Q_trajs - return_trajs) / (np.abs(avg_R) + 1e-6) 81 | test_result = { 82 | 'avg_reward': avg_reward, 83 | 'avg_Q_fixed_set': avg_Q, 84 | 'avg_Qstd_fixed_set': avg_Qstd, 85 | 'avg_Q': np.mean(Q_trajs), 86 | 'avg_Qstd': np.mean(Qstd_trajs), 87 | 'avg_R': avg_R, 88 | 'mean_bias': np.mean(bias_trajs), 89 | 'std_bias': np.std(bias_trajs), 90 | } 91 | 92 | return test_result 93 | 94 | 95 | # Plots min, max and mean + standard deviation bars of a population over time 96 | def _plot_line(xs, ys_population, title, path=''): 97 | max_colour, mean_colour, std_colour, transparent = 'rgb(0, 132, 180)', 'rgb(0, 172, 237)', 'rgba(29, 202, 255, 0.2)', 'rgba(0, 0, 0, 0)' 98 | 99 | ys = torch.tensor(ys_population, dtype=torch.float32) 100 | ys_min, ys_max, ys_mean, ys_std = ys.min(1)[0].squeeze(), ys.max(1)[0].squeeze(), ys.mean(1).squeeze(), ys.std(1).squeeze() 101 | ys_upper, ys_lower = ys_mean + ys_std, ys_mean - ys_std 102 | 103 | trace_max = Scatter(x=xs, y=ys_max.numpy(), line=Line(color=max_colour, 
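`test` above compares on-trajectory Q estimates against undiscounted returns-to-go (reverse cumulative reward sums) and reports the normalised difference as `mean_bias` / `std_bias`. The sketch below reproduces those two ingredients for a single hypothetical episode; the Q values are placeholders, not outputs of the real agent.

```python
import numpy as np

rewards = np.array([1.0, 0.0, 2.0, 1.0])          # one evaluation episode
returns_to_go = np.cumsum(rewards[::-1])[::-1]    # same trick as in test() above
print(returns_to_go)                              # [4. 3. 3. 1.]

q_estimates = np.array([3.5, 3.2, 2.7, 1.1])      # hypothetical agent.evaluate_q outputs
avg_R = returns_to_go.mean()
bias = (q_estimates - returns_to_go) / (np.abs(avg_R) + 1e-6)
print(bias.mean(), bias.std())                    # -> mean_bias / std_bias in test_result
```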
dash='dash'), name='Max') 104 | trace_upper = Scatter(x=xs, y=ys_upper.numpy(), line=Line(color=transparent), name='+1 Std. Dev.', showlegend=False) 105 | trace_mean = Scatter(x=xs, y=ys_mean.numpy(), fill='tonexty', fillcolor=std_colour, line=Line(color=mean_colour), name='Mean') 106 | trace_lower = Scatter(x=xs, y=ys_lower.numpy(), fill='tonexty', fillcolor=std_colour, line=Line(color=transparent), name='-1 Std. Dev.', showlegend=False) 107 | trace_min = Scatter(x=xs, y=ys_min.numpy(), line=Line(color=max_colour, dash='dash'), name='Min') 108 | 109 | plotly.offline.plot({ 110 | 'data': [trace_upper, trace_mean, trace_lower, trace_min, trace_max], 111 | 'layout': dict(title=title, xaxis={'title': 'Step'}, yaxis={'title': title}) 112 | }, filename=os.path.join(path, title + '.html'), auto_open=False) 113 | -------------------------------------------------------------------------------- /code/SAC-discrete/.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/zhangchuheng123/Reinforcement-Implementation/c04e0df10ec29ef775ea31395a8ad4b917302d24/code/SAC-discrete/.DS_Store -------------------------------------------------------------------------------- /code/SAC-discrete/bash.sh: -------------------------------------------------------------------------------- 1 | CUDA_VISIBLE_DEVICES=0 python sac_discrete.py --config config/debug.yaml -------------------------------------------------------------------------------- /code/SAC-discrete/config/.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/zhangchuheng123/Reinforcement-Implementation/c04e0df10ec29ef775ea31395a8ad4b917302d24/code/SAC-discrete/config/.DS_Store -------------------------------------------------------------------------------- /code/SAC-discrete/config/default.yaml: -------------------------------------------------------------------------------- 1 | basic: 2 | device: cuda 3 | accuracy: float32 4 | seed: 6666 5 | verbose: 2 6 | 7 | algo: 8 | num_steps: 5000000 # 5M 9 | batch_size: 64 10 | lr: 0.0003 11 | memory_size: 300000 # 300k 12 | multi_step: 1 13 | target_entropy_ratio: 0.98 14 | target_update_interval: 8000 15 | use_per: False 16 | use_dueling: False 17 | start_steps: 5000 18 | normalization_steps: 2000 19 | lamda: 0.97 20 | clip_reward: True 21 | zscore_reward: False 22 | normalize_state: False 23 | hidden_size: 64 24 | update_interval: 4 25 | log_interval: 20 26 | eval_interval: 10000 27 | no_term: False 28 | evaluate_steps: 6000 29 | gamma: 0.98 30 | 31 | env: 32 | num_parallel_envs: 1 33 | name: PongNoFrameskip-v4 34 | state_dtype: uint8 35 | encoder: CNN -------------------------------------------------------------------------------- /code/SAC-discrete/env.py: -------------------------------------------------------------------------------- 1 | # NOTE: this code was mainly taken from: 2 | # https://github.com/hill-a/stable-baselines/blob/master/stable_baselines/common/atari_wrappers.py 3 | from collections import deque 4 | 5 | import numpy as np 6 | import gym 7 | from gym import spaces, wrappers 8 | import cv2 9 | cv2.ocl.setUseOpenCL(False) 10 | 11 | 12 | class NoopResetEnv(gym.Wrapper): 13 | def __init__(self, env, noop_max=30): 14 | """ 15 | Sample initial states by taking random number of no-ops on reset. 16 | No-op is assumed to be action 0. 
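`default.yaml` above groups the SAC-discrete hyperparameters into `basic`, `algo`, and `env` sections. One way to consume such a file is sketched below (an editor's sketch assuming PyYAML is installed and the script is run from `code/SAC-discrete/`; the actual loading code lives in `train.py`, which handles this its own way).

```python
import yaml  # PyYAML, assumed available

with open('config/default.yaml') as f:
    config = yaml.safe_load(f)

print(config['algo']['lr'])    # 0.0003
print(config['env']['name'])   # PongNoFrameskip-v4
# The Struct helper in utils.py (later in this dump) can wrap a section like
# config['algo'] to give attribute-style access, e.g. algo.lr.
```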
17 | :param env: (Gym Environment) the environment to wrap 18 | :param noop_max: (int) the maximum value of no-ops to run 19 | """ 20 | gym.Wrapper.__init__(self, env) 21 | self.noop_max = noop_max 22 | self.override_num_noops = None 23 | self.noop_action = 0 24 | assert env.unwrapped.get_action_meanings()[0] == 'NOOP' 25 | 26 | def reset(self, **kwargs): 27 | self.env.reset(**kwargs) 28 | if self.override_num_noops is not None: 29 | noops = self.override_num_noops 30 | else: 31 | noops = self.unwrapped.np_random.integers(1, self.noop_max + 1) 32 | assert noops > 0 33 | obs = None 34 | for _ in range(noops): 35 | obs, _, done, _ = self.env.step(self.noop_action) 36 | if done: 37 | obs = self.env.reset(**kwargs) 38 | return obs 39 | 40 | def step(self, action): 41 | return self.env.step(action) 42 | 43 | 44 | class FireResetEnv(gym.Wrapper): 45 | def __init__(self, env): 46 | """ 47 | Take action on reset for environments that are fixed until firing. 48 | :param env: (Gym Environment) the environment to wrap 49 | """ 50 | gym.Wrapper.__init__(self, env) 51 | assert env.unwrapped.get_action_meanings()[1] == 'FIRE' 52 | assert len(env.unwrapped.get_action_meanings()) >= 3 53 | 54 | def reset(self, **kwargs): 55 | self.env.reset(**kwargs) 56 | obs, _, done, _ = self.env.step(1) 57 | if done: 58 | self.env.reset(**kwargs) 59 | obs, _, done, _ = self.env.step(2) 60 | if done: 61 | self.env.reset(**kwargs) 62 | return obs 63 | 64 | def step(self, action): 65 | return self.env.step(action) 66 | 67 | 68 | class EpisodicLifeEnv(gym.Wrapper): 69 | def __init__(self, env): 70 | """ 71 | Make end-of-life == end-of-episode, but only reset on true game over. 72 | Done by DeepMind for the DQN and co. since it helps value estimation. 73 | :param env: (Gym Environment) the environment to wrap 74 | """ 75 | gym.Wrapper.__init__(self, env) 76 | self.lives = 0 77 | self.was_real_done = True 78 | 79 | def step(self, action): 80 | obs, reward, done, info = self.env.step(action) 81 | self.was_real_done = done 82 | # check current lives, make loss of life terminal, 83 | # then update lives to handle bonus lives 84 | lives = self.env.unwrapped.ale.lives() 85 | if 0 < lives < self.lives: 86 | # for Qbert sometimes we stay in lives == 0 condtion for a few 87 | # frames so its important to keep lives > 0, so that we only reset 88 | # once the environment advertises done. 89 | done = True 90 | self.lives = lives 91 | return obs, reward, done, info 92 | 93 | def reset(self, **kwargs): 94 | """ 95 | Calls the Gym environment reset, only when lives are exhausted. 96 | This way all states are still reachable even though lives are episodic, 97 | and the learner need not know about any of this behind-the-scenes. 
98 | :param kwargs: Extra keywords passed to env.reset() call 99 | :return: ([int] or [float]) the first observation of the environment 100 | """ 101 | if self.was_real_done: 102 | obs = self.env.reset(**kwargs) 103 | else: 104 | # no-op step to advance from terminal/lost life state 105 | obs, _, _, _ = self.env.step(0) 106 | self.lives = self.env.unwrapped.ale.lives() 107 | return obs 108 | 109 | 110 | class MaxAndSkipEnv(gym.Wrapper): 111 | def __init__(self, env, skip=4): 112 | """ 113 | Return only every `skip`-th frame (frameskipping) 114 | :param env: (Gym Environment) the environment 115 | :param skip: (int) number of `skip`-th frame 116 | """ 117 | gym.Wrapper.__init__(self, env) 118 | # most recent raw observations (for max pooling across time steps) 119 | self._obs_buffer = np.zeros( 120 | (2,)+env.observation_space.shape, 121 | dtype=env.observation_space.dtype) 122 | self._skip = skip 123 | 124 | def step(self, action): 125 | """ 126 | Step the environment with the given action 127 | Repeat action, sum reward, and max over last observations. 128 | :param action: ([int] or [float]) the action 129 | :return: ([int] or [float], [float], [bool], dict) observation, reward, 130 | done, information 131 | """ 132 | total_reward = 0.0 133 | done = None 134 | for i in range(self._skip): 135 | obs, reward, done, info = self.env.step(action) 136 | if i == self._skip - 2: 137 | self._obs_buffer[0] = obs 138 | if i == self._skip - 1: 139 | self._obs_buffer[1] = obs 140 | total_reward += reward 141 | if done: 142 | break 143 | # Note that the observation on the done=True frame 144 | # doesn't matter 145 | max_frame = self._obs_buffer.max(axis=0) 146 | 147 | return max_frame, total_reward, done, info 148 | 149 | def reset(self, **kwargs): 150 | return self.env.reset(**kwargs) 151 | 152 | 153 | class ClipRewardEnv(gym.RewardWrapper): 154 | def __init__(self, env): 155 | """ 156 | clips the reward to {+1, 0, -1} by its sign. 157 | :param env: (Gym Environment) the environment 158 | """ 159 | gym.RewardWrapper.__init__(self, env) 160 | 161 | def reward(self, reward): 162 | """ 163 | Bin reward to {+1, 0, -1} by its sign. 164 | :param reward: (float) 165 | """ 166 | return np.sign(reward) 167 | 168 | 169 | class WarpFramePyTorch(gym.ObservationWrapper): 170 | def __init__(self, env): 171 | """ 172 | Warp frames to 84x84 as done in the Nature paper and later work. 173 | :param env: (Gym Environment) the environment 174 | """ 175 | gym.ObservationWrapper.__init__(self, env) 176 | self.width = 84 177 | self.height = 84 178 | self.observation_space = spaces.Box( 179 | low=0, high=255, shape=(1, self.height, self.width), 180 | dtype=env.observation_space.dtype) 181 | 182 | def observation(self, frame): 183 | """ 184 | returns the current observation from a frame 185 | :param frame: ([int] or [float]) environment frame 186 | :return: ([int] or [float]) the observation 187 | """ 188 | frame = cv2.cvtColor(frame, cv2.COLOR_RGB2GRAY) 189 | frame = cv2.resize( 190 | frame, (self.width, self.height), interpolation=cv2.INTER_AREA) 191 | return frame[None, :, :] 192 | 193 | 194 | class FrameStackPyTorch(gym.Wrapper): 195 | def __init__(self, env, n_frames): 196 | """Stack n_frames last frames. 197 | Returns lazy array, which is much more memory efficient. 
198 | See Also 199 | -------- 200 | stable_baselines.common.atari_wrappers.LazyFrames 201 | :param env: (Gym Environment) the environment 202 | :param n_frames: (int) the number of frames to stack 203 | """ 204 | assert env.observation_space.dtype == np.uint8 205 | 206 | gym.Wrapper.__init__(self, env) 207 | self.n_frames = n_frames 208 | self.frames = deque([], maxlen=n_frames) 209 | shp = env.observation_space.shape 210 | 211 | self.observation_space = spaces.Box( 212 | low=np.min(env.observation_space.low), 213 | high=np.max(env.observation_space.high), 214 | shape=(shp[0] * n_frames, shp[1], shp[2]), 215 | dtype=env.observation_space.dtype) 216 | 217 | def reset(self): 218 | obs = self.env.reset() 219 | for _ in range(self.n_frames): 220 | self.frames.append(obs) 221 | return self._get_ob() 222 | 223 | def step(self, action): 224 | obs, reward, done, info = self.env.step(action) 225 | self.frames.append(obs) 226 | return self._get_ob(), reward, done, info 227 | 228 | def _get_ob(self): 229 | assert len(self.frames) == self.n_frames 230 | return LazyFrames(list(self.frames)) 231 | 232 | 233 | class ScaledFloatFrame(gym.ObservationWrapper): 234 | def __init__(self, env): 235 | gym.ObservationWrapper.__init__(self, env) 236 | self.observation_space = spaces.Box( 237 | low=0, high=1.0, shape=env.observation_space.shape, 238 | dtype=np.float32) 239 | 240 | def observation(self, observation): 241 | # careful! This undoes the memory optimization, use 242 | # with smaller replay buffers only. 243 | return np.array(observation).astype(np.float32) / 255.0 244 | 245 | 246 | class LazyFrames(object): 247 | def __init__(self, frames): 248 | self._frames = frames 249 | self.dtype = frames[0].dtype 250 | 251 | def _force(self): 252 | return np.concatenate( 253 | np.array(self._frames, dtype=self.dtype), axis=0) 254 | 255 | def __array__(self, dtype=None): 256 | out = self._force() 257 | if dtype is not None: 258 | out = out.astype(dtype) 259 | return out 260 | 261 | def __len__(self): 262 | return len(self._force()) 263 | 264 | def __getitem__(self, i): 265 | return self._force()[i] 266 | 267 | 268 | def make_atari(env_id): 269 | """ 270 | Create a wrapped atari envrionment 271 | :param env_id: (str) the environment ID 272 | :return: (Gym Environment) the wrapped atari environment 273 | """ 274 | env = gym.make(env_id) 275 | assert 'NoFrameskip' in env.spec.id 276 | env = NoopResetEnv(env, noop_max=30) 277 | env = MaxAndSkipEnv(env, skip=4) 278 | return env 279 | 280 | 281 | def wrap_deepmind_pytorch(env, episode_life=True, clip_rewards=True, 282 | frame_stack=True, scale=False): 283 | """ 284 | Configure environment for DeepMind-style Atari. 
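`LazyFrames` above avoids copying the stacked observation: the individual frames are only concatenated when the object is converted to an array, so transitions stored in the replay buffer can share frame memory. A small usage sketch with dummy frames, assuming the `LazyFrames` class defined above is in scope:

```python
import numpy as np

# Four dummy (1, 84, 84) uint8 frames, as produced by WarpFramePyTorch above.
frames = [np.full((1, 84, 84), i, dtype=np.uint8) for i in range(4)]
obs = LazyFrames(frames)              # frames are referenced, not copied, here

print(len(obs))                       # 4 stacked channels
stacked = np.array(obs)               # concatenation happens only on conversion
print(stacked.shape, stacked.dtype)   # (4, 84, 84) uint8
```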
285 | :param env: (Gym Environment) the atari environment 286 | :param episode_life: (bool) wrap the episode life wrapper 287 | :param clip_rewards: (bool) wrap the reward clipping wrapper 288 | :param frame_stack: (bool) wrap the frame stacking wrapper 289 | :param scale: (bool) wrap the scaling observation wrapper 290 | :return: (Gym Environment) the wrapped atari environment 291 | """ 292 | if episode_life: 293 | env = EpisodicLifeEnv(env) 294 | if 'FIRE' in env.unwrapped.get_action_meanings(): 295 | env = FireResetEnv(env) 296 | env = WarpFramePyTorch(env) 297 | if clip_rewards: 298 | env = ClipRewardEnv(env) 299 | if scale: 300 | env = ScaledFloatFrame(env) 301 | if frame_stack: 302 | env = FrameStackPyTorch(env, 4) 303 | return env 304 | 305 | 306 | def make_env(env_id, episode_life=True, clip_rewards=True, 307 | frame_stack=True, scale=False): 308 | env = make_atari(env_id) 309 | env = wrap_deepmind_pytorch( 310 | env, episode_life, clip_rewards, frame_stack, scale) 311 | return env 312 | 313 | 314 | def wrap_monitor(env, log_dir): 315 | env = wrappers.Monitor( 316 | env, log_dir, video_callable=lambda x: True) 317 | return env 318 | 319 | 320 | class VectorEnv(object): 321 | def __init__(self, n, env_func, parallel_init=False, **kwargs): 322 | 323 | print('[{}] Wait for initializing environments'.format(str(datetime.datetime.now()))) 324 | self.pool = Pool(NUM_CORES) 325 | if parallel_init: 326 | init_func = lambda x: env_func(**kwargs) 327 | self.envs = self.pool.map(init_func, list(range(n))) 328 | else: 329 | self.envs = tuple(env_func(**kwargs) for _ in range(n)) 330 | self.return_state_format = 'list' 331 | print('[{}] Finish initializing environments'.format(str(datetime.datetime.now()))) 332 | 333 | def set_return_state_format(self, fmt): 334 | self.return_state_format = fmt 335 | 336 | def seed(self, seeds): 337 | 338 | seed_func = lambda args: args[0].seed(args[1]) 339 | self.pool.map(seed_func, list(zip(self.envs, seeds))) 340 | 341 | def reset(self): 342 | 343 | reset_func = lambda env: env.reset() 344 | states = self.pool.map(reset_func, self.envs) 345 | if self.return_state_format == 'array': 346 | states = np.array(states) 347 | return states 348 | 349 | def step(self, actions): 350 | 351 | def step_func(args): 352 | env, a = args 353 | observation, reward, done, info = env.step(a) 354 | if done: 355 | observation = env.reset() 356 | return observation, reward, done, info 357 | 358 | res = self.pool.map(step_func, list(zip(self.envs, actions))) 359 | states, rewards, dones, infos = list(zip(*res)) 360 | if self.return_state_format == 'array': 361 | states = np.array(states) 362 | return states, rewards, dones, infos 363 | 364 | def __del__(self): 365 | self.close() 366 | 367 | def close(self): 368 | self.pool.close() -------------------------------------------------------------------------------- /code/SAC-discrete/memory.py: -------------------------------------------------------------------------------- 1 | import operator 2 | from collections import deque 3 | import numpy as np 4 | import torch 5 | 6 | 7 | class MultiStepBuff: 8 | 9 | def __init__(self, maxlen=3): 10 | super(MultiStepBuff, self).__init__() 11 | self.maxlen = int(maxlen) 12 | self.reset() 13 | 14 | def append(self, state, action, reward): 15 | self.states.append(state) 16 | self.actions.append(action) 17 | self.rewards.append(reward) 18 | 19 | def get(self, gamma=0.99): 20 | assert len(self.rewards) > 0 21 | state = self.states.popleft() 22 | action = self.actions.popleft() 23 | reward = 
self._nstep_return(gamma) 24 | return state, action, reward 25 | 26 | def _nstep_return(self, gamma): 27 | r = np.sum([r * (gamma ** i) for i, r in enumerate(self.rewards)]) 28 | self.rewards.popleft() 29 | return r 30 | 31 | def reset(self): 32 | # Buffer to store n-step transitions. 33 | self.states = deque(maxlen=self.maxlen) 34 | self.actions = deque(maxlen=self.maxlen) 35 | self.rewards = deque(maxlen=self.maxlen) 36 | 37 | def is_empty(self): 38 | return len(self.rewards) == 0 39 | 40 | def is_full(self): 41 | return len(self.rewards) == self.maxlen 42 | 43 | def __len__(self): 44 | return len(self.rewards) 45 | 46 | 47 | class LazyMemory(dict): 48 | 49 | def __init__(self, capacity, state_shape, device, state_dtype): 50 | super(LazyMemory, self).__init__() 51 | self.capacity = int(capacity) 52 | self.state_shape = state_shape 53 | self.device = device 54 | self.reset() 55 | if state_dtype == 'float32': 56 | self.state_dtype = np.float32 57 | elif state_dtype == 'float64': 58 | self.state_dtype = np.float64 59 | elif state_dtype == 'uint8': 60 | self.state_dtype = np.uint8 61 | 62 | def reset(self): 63 | self['state'] = [] 64 | self['next_state'] = [] 65 | 66 | self['action'] = np.empty((self.capacity, 1), dtype=np.int64) 67 | self['reward'] = np.empty((self.capacity, 1), dtype=np.float32) 68 | self['done'] = np.empty((self.capacity, 1), dtype=np.float32) 69 | 70 | self._n = 0 71 | self._p = 0 72 | 73 | def append(self, state, action, reward, next_state, done, 74 | episode_done=None): 75 | self._append(state, action, reward, next_state, done) 76 | 77 | def _append(self, state, action, reward, next_state, done): 78 | self['state'].append(state) 79 | self['next_state'].append(next_state) 80 | self['action'][self._p] = action 81 | self['reward'][self._p] = reward 82 | self['done'][self._p] = done 83 | 84 | self._n = min(self._n + 1, self.capacity) 85 | self._p = (self._p + 1) % self.capacity 86 | 87 | self.truncate() 88 | 89 | def truncate(self): 90 | while len(self['state']) > self.capacity: 91 | del self['state'][0] 92 | del self['next_state'][0] 93 | 94 | def sample(self, batch_size): 95 | indices = np.random.randint(low=0, high=len(self), size=batch_size) 96 | return self._sample(indices, batch_size) 97 | 98 | def _sample(self, indices, batch_size): 99 | bias = -self._p if self._n == self.capacity else 0 100 | 101 | states = np.empty((batch_size, *self.state_shape), dtype=self.state_dtype) 102 | next_states = np.empty((batch_size, *self.state_shape), dtype=self.state_dtype) 103 | 104 | for i, index in enumerate(indices): 105 | _index = np.mod(index+bias, self.capacity) 106 | states[i, ...] = self['state'][_index] 107 | next_states[i, ...] = self['next_state'][_index] 108 | 109 | if self.state_dtype is np.float32: 110 | states = torch.FloatTensor(states).to(self.device) 111 | next_states = torch.FloatTensor(next_states).to(self.device) 112 | elif self.state_dtype is np.float64: 113 | states = torch.DoubleTensor(states).to(self.device) 114 | next_states = torch.DoubleTensor(next_states).to(self.device) 115 | elif self.state_dtype is np.uint8: 116 | states = torch.ByteTensor(states).to(self.device).float() / 255. 117 | next_states = torch.ByteTensor(next_states).to(self.device).float() / 255. 
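`MultiStepBuff` above holds the last `multi_step` transitions and, once full, emits the oldest state and action together with their discounted n-step reward. A short usage sketch with placeholder states and actions, assuming the class above is in scope:

```python
buff = MultiStepBuff(maxlen=3)
gamma = 0.99

for t, reward in enumerate([1.0, 0.0, 2.0, 0.5]):
    buff.append(state=t, action=0, reward=reward)   # states/actions are placeholders
    if buff.is_full():
        s, a, r = buff.get(gamma)
        print(s, a, r)
# step 2: state 0, n-step reward 1 + 0 + 0.99**2 * 2   ≈ 2.96
# step 3: state 1, n-step reward 0 + 0.99*2 + 0.99**2 * 0.5 ≈ 2.47
```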
118 | actions = torch.LongTensor(self['action'][indices]).to(self.device) 119 | rewards = torch.FloatTensor(self['reward'][indices]).to(self.device) 120 | dones = torch.FloatTensor(self['done'][indices]).to(self.device) 121 | 122 | return states, actions, rewards, next_states, dones 123 | 124 | def __len__(self): 125 | return self._n 126 | 127 | 128 | class LazyMultiStepMemory(LazyMemory): 129 | 130 | def __init__(self, capacity, state_shape, device, gamma=0.99, 131 | multi_step=3, state_dtype='float32'): 132 | super(LazyMultiStepMemory, self).__init__( 133 | capacity, state_shape, device, state_dtype) 134 | 135 | self.gamma = gamma 136 | self.multi_step = int(multi_step) 137 | if self.multi_step != 1: 138 | self.buff = MultiStepBuff(maxlen=self.multi_step) 139 | 140 | def append(self, state, action, reward, next_state, done): 141 | if self.multi_step != 1: 142 | self.buff.append(state, action, reward) 143 | 144 | if self.buff.is_full(): 145 | state, action, reward = self.buff.get(self.gamma) 146 | self._append(state, action, reward, next_state, done) 147 | 148 | if done: 149 | while not self.buff.is_empty(): 150 | state, action, reward = self.buff.get(self.gamma) 151 | self._append(state, action, reward, next_state, done) 152 | else: 153 | self._append(state, action, reward, next_state, done) 154 | 155 | 156 | class LazyPrioritizedMultiStepMemory(LazyMultiStepMemory): 157 | 158 | def __init__(self, capacity, state_shape, device, state_dtype='float32', 159 | gamma=0.99, multi_step=3, alpha=0.6, beta=0.4, beta_steps=2e5, 160 | min_pa=0.0, max_pa=1.0, eps=0.01): 161 | 162 | super().__init__(capacity, state_shape, device, gamma, 163 | multi_step, state_dtype) 164 | 165 | self.alpha = alpha 166 | self.beta = beta 167 | self.beta_diff = (1.0 - beta) / beta_steps 168 | self.min_pa = min_pa 169 | self.max_pa = max_pa 170 | self.eps = eps 171 | self._cached = None 172 | 173 | it_capacity = 1 174 | while it_capacity < capacity: 175 | it_capacity *= 2 176 | self.it_sum = SumTree(it_capacity) 177 | self.it_min = MinTree(it_capacity) 178 | 179 | def _pa(self, p): 180 | return np.clip((p + self.eps) ** self.alpha, self.min_pa, self.max_pa) 181 | 182 | def append(self, state, action, reward, next_state, done, p=None): 183 | # Calculate priority. 184 | if p is None: 185 | pa = self.max_pa 186 | else: 187 | pa = self._pa(p) 188 | 189 | if self.multi_step != 1: 190 | self.buff.append(state, action, reward) 191 | 192 | if self.buff.is_full(): 193 | state, action, reward = self.buff.get(self.gamma) 194 | self._append(state, action, reward, next_state, done, pa) 195 | 196 | if done: 197 | while not self.buff.is_empty(): 198 | state, action, reward = self.buff.get(self.gamma) 199 | self._append(state, action, reward, next_state, done, pa) 200 | else: 201 | self._append(state, action, reward, next_state, done, pa) 202 | 203 | def _append(self, state, action, reward, next_state, done, pa): 204 | # Store priority, which is done efficiently by SegmentTree. 205 | self.it_min[self._p] = pa 206 | self.it_sum[self._p] = pa 207 | super()._append(state, action, reward, next_state, done) 208 | 209 | def _sample_idxes(self, batch_size): 210 | total_pa = self.it_sum.sum(0, self._n) 211 | rands = np.random.rand(batch_size) * total_pa 212 | indices = [self.it_sum.find_prefixsum_idx(r) for r in rands] 213 | self.beta = min(1., self.beta + self.beta_diff) 214 | return indices 215 | 216 | def sample(self, batch_size): 217 | assert self._cached is None, 'Update priorities before sampling.' 
218 | 219 | self._cached = self._sample_idxes(batch_size) 220 | batch = self._sample(self._cached, batch_size) 221 | weights = self._calc_weights(self._cached) 222 | return batch, weights 223 | 224 | def _calc_weights(self, indices): 225 | min_pa = self.it_min.min() 226 | weights = [(self.it_sum[i] / min_pa) ** -self.beta for i in indices] 227 | return torch.FloatTensor(weights).to(self.device).view(-1, 1) 228 | 229 | def update_priority(self, errors): 230 | assert self._cached is not None 231 | 232 | ps = errors.detach().cpu().abs().numpy().flatten() 233 | pas = self._pa(ps) 234 | 235 | for index, pa in zip(self._cached, pas): 236 | assert 0 <= index < self._n 237 | assert 0 < pa 238 | self.it_sum[index] = pa 239 | self.it_min[index] = pa 240 | 241 | self._cached = None 242 | 243 | 244 | class SegmentTree(object): 245 | 246 | def __init__(self, size, op, init_val): 247 | assert size > 0 and size & (size - 1) == 0 248 | self._size = size 249 | self._op = op 250 | self._init_val = init_val 251 | self._values = [init_val for _ in range(2 * size)] 252 | 253 | def _reduce(self, start=0, end=None): 254 | if end is None: 255 | end = self._size 256 | elif end < 0: 257 | end += self._size 258 | 259 | start += self._size 260 | end += self._size 261 | 262 | res = self._init_val 263 | while start < end: 264 | if start & 1: 265 | res = self._op(res, self._values[start]) 266 | start += 1 267 | 268 | if end & 1: 269 | end -= 1 270 | res = self._op(res, self._values[end]) 271 | 272 | start //= 2 273 | end //= 2 274 | 275 | return res 276 | 277 | def __setitem__(self, idx, val): 278 | assert 0 <= idx < self._size 279 | 280 | # Set value. 281 | idx += self._size 282 | self._values[idx] = val 283 | 284 | # Update its ancestors iteratively. 285 | idx = idx >> 1 286 | while idx >= 1: 287 | left = 2 * idx 288 | self._values[idx] = \ 289 | self._op(self._values[left], self._values[left + 1]) 290 | idx = idx >> 1 291 | 292 | def __getitem__(self, idx): 293 | assert 0 <= idx < self._size 294 | return self._values[idx + self._size] 295 | 296 | 297 | class SumTree(SegmentTree): 298 | 299 | def __init__(self, size): 300 | super().__init__(size, operator.add, 0.0) 301 | 302 | def sum(self, start=0, end=None): 303 | return self._reduce(start, end) 304 | 305 | def find_prefixsum_idx(self, prefixsum): 306 | assert 0 <= prefixsum <= self.sum() + 1e-5 307 | idx = 1 308 | 309 | # Traverse to the leaf. 
310 | while idx < self._size: 311 | left = 2 * idx 312 | if self._values[left] > prefixsum: 313 | idx = left 314 | else: 315 | prefixsum -= self._values[left] 316 | idx = left + 1 317 | return idx - self._size 318 | 319 | 320 | class MinTree(SegmentTree): 321 | 322 | def __init__(self, size): 323 | super().__init__(size, min, float("inf")) 324 | 325 | def min(self, start=0, end=None): 326 | return self._reduce(start, end) -------------------------------------------------------------------------------- /code/SAC-discrete/utils.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import torch.nn as nn 3 | from torch import Tensor 4 | import torch.optim as opt 5 | from torch.optim import Adam 6 | from torch.autograd import Variable 7 | from torch.nn import functional as F 8 | from torch.distributions import Categorical 9 | from torch.utils.tensorboard import SummaryWriter 10 | 11 | from sklearn.base import TransformerMixin 12 | from sklearn.preprocessing import StandardScaler 13 | 14 | from multiprocessing.pool import ThreadPool as Pool 15 | import numpy as np 16 | import argparse 17 | 18 | 19 | def update_params(optim, loss, retain_graph=False): 20 | optim.zero_grad() 21 | loss.backward(retain_graph=retain_graph) 22 | optim.step() 23 | 24 | def disable_gradients(network): 25 | # Disable calculations of gradients. 26 | for param in network.parameters(): 27 | param.requires_grad = False 28 | 29 | def initialize_weights_he(m): 30 | if isinstance(m, nn.Linear) or isinstance(m, nn.Conv2d): 31 | torch.nn.init.kaiming_uniform_(m.weight) 32 | if m.bias is not None: 33 | torch.nn.init.constant_(m.bias, 0) 34 | 35 | 36 | class Struct: 37 | def __init__(self, **entries): 38 | self.__dict__.update(entries) 39 | 40 | 41 | class RunningMeanStats(object): 42 | """ 43 | An inefficient estimator for calculating the mean scaler over a given horizon 44 | """ 45 | def __init__(self, n=10): 46 | self.n = n 47 | self.stats = deque(maxlen=n) 48 | 49 | def append(self, x): 50 | self.stats.append(x) 51 | 52 | def get(self): 53 | return np.mean(self.stats) 54 | 55 | 56 | class Flatten(nn.Module): 57 | def forward(self, x): 58 | return x.view(x.size(0), -1) 59 | 60 | 61 | class RunningStat(object): 62 | def __init__(self): 63 | self._n = 0 64 | self._M = 0 65 | self._S = 0 66 | 67 | def push(self, x): 68 | self._n += 1 69 | if self._n == 1: 70 | self._M = x 71 | else: 72 | oldM = self._M if type(self._M) is float else self._M.copy() 73 | self._M = oldM + (x - oldM) / self._n 74 | self._S = self._S + (x - oldM) * (x - self._M) 75 | 76 | @property 77 | def n(self): 78 | return self._n 79 | 80 | @property 81 | def mean(self): 82 | return self._M 83 | 84 | @property 85 | def var(self): 86 | return self._S / (self._n - 1) if self._n > 1 else np.square(self._M) 87 | 88 | @property 89 | def std(self): 90 | return np.sqrt(self.var) 91 | 92 | @property 93 | def shape(self): 94 | return self._M.shape 95 | 96 | 97 | class ZFilter(object): 98 | """ 99 | y = (x-mean)/std 100 | using running estimates of mean, std 101 | """ 102 | def __init__(self, demean=True, destd=True, clip=10.0): 103 | self.demean = demean 104 | self.destd = destd 105 | self.clip = clip 106 | 107 | self.rs = RunningStat() 108 | 109 | def __call__(self, x, update=True): 110 | if update: self.rs.push(x) 111 | if self.demean: 112 | x = x - self.rs.mean 113 | if self.destd: 114 | x = x / (self.rs.std + 1e-8) 115 | if self.clip: 116 | x = np.clip(x, -self.clip, self.clip) 117 | return x 118 | 119 | def 
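`RunningStat` in `utils.py` above maintains a running mean and variance with Welford's online update. The self-contained sketch below uses the same update rule and checks it against NumPy; it is a reference implementation, not part of the project.

```python
import numpy as np

def welford(xs):
    """Online mean/std with the same update rule as RunningStat.push above."""
    n, mean, m2 = 0, 0.0, 0.0
    for x in xs:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
    var = m2 / (n - 1) if n > 1 else 0.0
    return mean, np.sqrt(var)

xs = np.random.randn(1000) * 2.0 + 5.0
print(welford(xs))
print(xs.mean(), xs.std(ddof=1))   # should agree up to floating-point error
```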
update(self, x): 120 | self.rs.push(x) 121 | 122 | 123 | class Memory(object): 124 | def __init__(self): 125 | self.memory = [] 126 | 127 | def push(self, *args): 128 | self.memory.append(Transition(*args)) 129 | 130 | def sample(self): 131 | return Transition(*zip(*self.memory)) 132 | 133 | def __len__(self): 134 | return len(self.memory) 135 | 136 | 137 | class NDStandardScaler(TransformerMixin): 138 | def __init__(self, **kwargs): 139 | self._scaler = StandardScaler(copy=True, **kwargs) 140 | self._orig_shape = None 141 | 142 | def fit(self, X, **kwargs): 143 | X = np.array(X) 144 | if len(X.shape) > 1: 145 | self._orig_shape = X.shape[1:] 146 | X = self._flatten(X) 147 | self._scaler.fit(X, **kwargs) 148 | return self 149 | 150 | def transform(self, X, **kwargs): 151 | X = np.array(X) 152 | X = self._flatten(X) 153 | X = self._scaler.transform(X, **kwargs) 154 | X = self._reshape(X) 155 | return X 156 | 157 | def _flatten(self, X): 158 | # Reshape X to <= 2 dimensions 159 | if len(X.shape) > 2: 160 | n_dims = np.prod(self._orig_shape) 161 | X = X.reshape(-1, n_dims) 162 | return X 163 | 164 | def _reshape(self, X): 165 | # Reshape X back to it's original shape 166 | if len(X.shape) >= 2: 167 | X = X.reshape(-1, *self._orig_shape) 168 | return X 169 | -------------------------------------------------------------------------------- /code/a2c.py: -------------------------------------------------------------------------------- 1 | """ 2 | Implementation of A2C 3 | ref: Mnih, Volodymyr, et al. "Asynchronous methods for deep reinforcement learning." ICML. 2016. 4 | This one follows appendix C of A3C paper (continuous action domain) 5 | 6 | NOTICE: 7 | `Tensor2` means 2D-Tensor (num_samples, num_dims) 8 | """ 9 | 10 | import gym 11 | import torch 12 | import torch.nn as nn 13 | import torch.optim as opt 14 | from torch import Tensor 15 | from torch.autograd import Variable 16 | from collections import namedtuple 17 | from itertools import count 18 | import scipy.optimize as sciopt 19 | import matplotlib 20 | matplotlib.use('agg') 21 | import matplotlib.pyplot as plt 22 | from os.path import join as joindir 23 | import pandas as pd 24 | import numpy as np 25 | import argparse 26 | import datetime 27 | import math 28 | 29 | 30 | EPS = 1e-10 31 | RESULT_DIR = '../result' 32 | 33 | 34 | class args(object): 35 | env_name = 'Hopper-v2' 36 | seed = 1234 37 | num_episode = 100 38 | max_step_per_round = 2000 39 | gamma = 0.995 40 | lamda = 0.97 41 | log_num_episode = 1 42 | loss_coeff_value = 1.0 43 | loss_coeff_entropy = 1e-4 44 | lr = 5e-5 45 | hidden_size = 200 46 | lstm_size = 128 47 | num_parallel_run = 5 48 | 49 | 50 | def add_arguments(): 51 | parser = argparse.ArgumentParser() 52 | parser.add_argument('--env_name', type=str, default='Hopper-v2') 53 | parser.add_argument('--seed', type=int, default=1234) 54 | parser.add_argument('--num_episode', type=int, default=1000) 55 | parser.add_argument('--max_step_per_round', type=int, default=2000) 56 | parser.add_argument('--gamma', type=float, default=0.995) 57 | parser.add_argument('--lamda', type=float, default=0.97) 58 | parser.add_argument('--log_num_episode', type=int, default=1) 59 | parser.add_argument('--loss_coeff_value', type=float, default=1.0) 60 | parser.add_argument('--loss_coeff_entropy', type=float, default=1e-4) 61 | parser.add_argument('--lr', type=float, default=5e-5) 62 | parser.add_argument('--hidden_size', type=int, default=200) 63 | parser.add_argument('--lstm_size', type=int, default=128) 64 | 
parser.add_argument('--num_parallel_run', type=int, default=5) 65 | 66 | args = parser.parse_args() 67 | return args 68 | 69 | class RunningStat(object): 70 | def __init__(self, shape): 71 | self._n = 0 72 | self._M = np.zeros(shape) 73 | self._S = np.zeros(shape) 74 | 75 | def push(self, x): 76 | x = np.asarray(x) 77 | assert x.shape == self._M.shape 78 | self._n += 1 79 | if self._n == 1: 80 | self._M[...] = x 81 | else: 82 | oldM = self._M.copy() 83 | self._M[...] = oldM + (x - oldM) / self._n 84 | self._S[...] = self._S + (x - oldM) * (x - self._M) 85 | 86 | @property 87 | def n(self): 88 | return self._n 89 | 90 | @property 91 | def mean(self): 92 | return self._M 93 | 94 | @property 95 | def var(self): 96 | return self._S / (self._n - 1) if self._n > 1 else np.square(self._M) 97 | 98 | @property 99 | def std(self): 100 | return np.sqrt(self.var) 101 | 102 | @property 103 | def shape(self): 104 | return self._M.shape 105 | 106 | 107 | class ZFilter: 108 | """ 109 | y = (x-mean)/std 110 | using running estimates of mean,std 111 | """ 112 | 113 | def __init__(self, shape, demean=True, destd=True, clip=10.0): 114 | self.demean = demean 115 | self.destd = destd 116 | self.clip = clip 117 | 118 | self.rs = RunningStat(shape) 119 | 120 | def __call__(self, x, update=True): 121 | if update: self.rs.push(x) 122 | if self.demean: 123 | x = x - self.rs.mean 124 | if self.destd: 125 | x = x / (self.rs.std + 1e-8) 126 | if self.clip: 127 | x = np.clip(x, -self.clip, self.clip) 128 | return x 129 | 130 | def output_shape(self, input_space): 131 | return input_space.shape 132 | 133 | 134 | class ActorCritic(nn.Module): 135 | def __init__(self, num_inputs, num_outputs): 136 | super(ActorCritic, self).__init__() 137 | 138 | self.actor_fc1 = nn.Linear(num_inputs, args.hidden_size) 139 | self.actor_fc2 = nn.LSTM(args.hidden_size, args.lstm_size) 140 | self.actor_mu = nn.Linear(args.lstm_size, num_outputs) 141 | self.actor_sig = nn.Linear(args.lstm_size, num_outputs) 142 | self.actor_sig_activation = nn.Softplus() 143 | 144 | self.critic_fc1 = nn.Linear(num_inputs, args.hidden_size) 145 | self.critic_fc2 = nn.LSTM(args.hidden_size, args.lstm_size) 146 | self.critic_fc3 = nn.Linear(args.lstm_size, 1) 147 | 148 | def forward(self, states, actor_hidden, critic_hidden): 149 | """ 150 | run policy network (actor) as well as value network (critic) 151 | :param states: a Tensor2 represents states, 2 tuple represents hidden states 152 | :return: 5 Tensor2 153 | """ 154 | action_mean, action_std, actor_newh = self._forward_actor(states, actor_hidden) 155 | critic_value, critic_newh = self._forward_critic(states, critic_hidden) 156 | return action_mean, action_std, actor_newh, critic_value, critic_newh 157 | 158 | def _forward_actor(self, states, hidden): 159 | x = torch.relu(self.actor_fc1(states)).unsqueeze(0) 160 | x, newhidden = self.actor_fc2(x, hidden) 161 | x = x.squeeze(0) 162 | action_mean = self.actor_mu(x) 163 | action_std = self.actor_sig_activation(self.actor_sig(x)) 164 | return action_mean, action_std, newhidden 165 | 166 | def _forward_critic(self, states, hidden): 167 | x = torch.relu(self.critic_fc1(states)).unsqueeze(0) 168 | x, newhidden = self.critic_fc2(x, hidden) 169 | x = x.squeeze(0) 170 | critic_value = self.critic_fc3(x) 171 | return critic_value, newhidden 172 | 173 | def select_action(self, action_mean, action_std, return_logproba=True): 174 | """ 175 | given mean and std, sample an action from normal(mean, std) 176 | also returns probability of the given chosen 177 | """ 178 | 
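The actor and critic above push one timestep at a time through `nn.LSTM`, which expects input of shape (seq_len, batch, features); the `unsqueeze(0)` / `squeeze(0)` calls add and remove the length-1 sequence dimension. A minimal shape sketch using the same sizes as the defaults above:

```python
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=200, hidden_size=128)        # hidden_size / lstm_size above
x = torch.randn(1, 200)                                # one state after actor_fc1 + relu
hidden = (torch.zeros(1, 1, 128), torch.zeros(1, 1, 128))

out, new_hidden = lstm(x.unsqueeze(0), hidden)         # input shape (seq=1, batch=1, 200)
out = out.squeeze(0)                                   # back to (1, 128)
print(out.shape, new_hidden[0].shape)                  # torch.Size([1, 128]), torch.Size([1, 1, 128])
```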
action_logstd = torch.log(action_std) 179 | action = torch.normal(action_mean, action_std) 180 | if return_logproba: 181 | logproba = self._normal_logproba(action, action_mean, action_logstd, action_std) 182 | return action, logproba 183 | else: 184 | return action 185 | 186 | @staticmethod 187 | def _normal_logproba(x, mean, logstd, std=None): 188 | if std is None: 189 | std = torch.exp(logstd) 190 | 191 | std_sq = std.pow(2) 192 | logproba = - 0.5 * math.log(2 * math.pi) - logstd - (x - mean).pow(2) / (2 * std_sq) 193 | return logproba.sum(1) 194 | 195 | def get_logproba(self, states, actions): 196 | """ 197 | return probability of chosen the given actions under corresponding states of current network 198 | :param states: Tensor 199 | :param actions: Tensor 200 | """ 201 | action_mean, action_logstd = self._forward_actor(states) 202 | logproba = self._normal_logproba(actions, action_mean, action_logstd) 203 | return logproba 204 | 205 | def a2c(args): 206 | env = gym.make(args.env_name) 207 | num_inputs = env.observation_space.shape[0] 208 | num_actions = env.action_space.shape[0] 209 | 210 | env.seed(args.seed) 211 | torch.manual_seed(args.seed) 212 | 213 | network = ActorCritic(num_inputs, num_actions) 214 | optimizer = opt.RMSprop(network.parameters(), lr=args.lr) 215 | running_state = ZFilter((num_inputs,), clip=5) 216 | 217 | # record average 1-round cumulative reward in every episode 218 | reward_record = [] 219 | num_steps = 0 220 | 221 | for i_episode in range(args.num_episode): 222 | # step1: perform current policy to collect trajectories 223 | # this is an on-policy method! 224 | state = env.reset() 225 | state = running_state(state) 226 | actor_hidden = (torch.zeros(1, 1, args.lstm_size), torch.zeros(1, 1, args.lstm_size)) 227 | critic_hidden = (torch.zeros(1, 1, args.lstm_size), torch.zeros(1, 1, args.lstm_size)) 228 | reward_sum = 0 229 | states = [] 230 | values = [] 231 | actions = [] 232 | action_stds = [] 233 | logprobas = [] 234 | next_states = [] 235 | rewards = [] 236 | for t in range(args.max_step_per_round): 237 | action_mean, action_std, actor_hidden, value, critic_hidden = \ 238 | network(Tensor(state).unsqueeze(0), actor_hidden, critic_hidden) 239 | action, logproba = network.select_action(action_mean, action_std) 240 | action = action.data.numpy()[0] 241 | next_state, reward, done, _ = env.step(action) 242 | reward_sum += reward 243 | next_state = running_state(next_state) 244 | mask = 0 if done else 1 245 | 246 | states.append(state) 247 | values.append(value) 248 | actions.append(action) 249 | action_stds.append(action_std) 250 | logprobas.append(logproba) 251 | next_states.append(next_state) 252 | rewards.append(reward) 253 | 254 | if done: 255 | break 256 | 257 | state = next_state 258 | 259 | values = torch.cat(values) 260 | action_stds = torch.cat(action_stds) 261 | logprobas = torch.cat(logprobas).unsqueeze(1) 262 | num_steps += (t + 1) 263 | 264 | reward_record.append({'steps': num_steps, 'reward': reward_sum}) 265 | 266 | # step2: extract variables from trajectories 267 | batch_size = len(rewards) 268 | prev_return = 0 269 | prev_value = 0 270 | prev_advantage = 0 271 | returns = Tensor(batch_size, 1) 272 | deltas = Tensor(batch_size, 1) 273 | advantages = Tensor(batch_size, 1) 274 | for i in reversed(range(batch_size)): 275 | returns[i] = rewards[i] + args.gamma * prev_return 276 | deltas[i] = rewards[i] + args.gamma * prev_value - values[i].data.numpy()[0] 277 | # ref: https://arxiv.org/pdf/1506.02438.pdf (generalization advantage estimate) 278 | 
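                # Note (added): this is generalized advantage estimation (GAE) in its
                # backward-recursive form. With the TD residual computed above,
                #     delta_t = r_t + gamma * V(s_{t+1}) - V(s_t),
                # the advantage satisfies A_t = delta_t + gamma * lambda * A_{t+1},
                # which equals the discounted sum  sum_l (gamma * lambda)^l * delta_{t+l}.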
advantages[i] = deltas[i] + args.gamma * args.lamda * prev_advantage 279 | 280 | prev_return = returns[i] 281 | prev_value = values[i].data.numpy()[0] 282 | prev_advantage = advantages[i] 283 | advantages = (advantages - advantages.mean()) / (advantages.std() + EPS) 284 | 285 | # step3: construct loss functions 286 | loss_policy = torch.mean(- logprobas * advantages) 287 | loss_value = torch.mean((values - returns).pow(2)) 288 | loss_entropy = torch.mean(- (torch.log(2 * math.pi * action_stds.pow(2)) + 1) / 2) 289 | loss = loss_policy + args.loss_coeff_value * loss_value + args.loss_coeff_entropy * loss_entropy 290 | 291 | # step4: do gradient update 292 | optimizer.zero_grad() 293 | loss.backward() 294 | optimizer.step() 295 | 296 | # step5: do logging 297 | if i_episode % args.log_num_episode == 0: 298 | print('Finished episode: {} Mean Reward: {:.4f} total_loss = {:.4f} = {:.4f} + {} * {:.4f} + {} * {:.4f}' \ 299 | .format(i_episode, reward_record[-1]['reward'], loss.data, loss_policy.data, args.loss_coeff_value, 300 | loss_value.data, args.loss_coeff_entropy, loss_entropy.data)) 301 | print('-----------------') 302 | 303 | return reward_record 304 | 305 | if __name__ == '__main__': 306 | datestr = datetime.datetime.now().strftime('%Y-%m-%d') 307 | args = add_arguments() 308 | 309 | record_dfs = pd.DataFrame(columns=['steps', 'reward']) 310 | reward_cols = [] 311 | for i in range(args.num_parallel_run): 312 | args.seed += 1 313 | reward_record = pd.DataFrame(a2c(args)) 314 | record_dfs = record_dfs.merge(reward_record, how='outer', on='steps', suffixes=('', '_{}'.format(i))) 315 | reward_cols.append('reward_{}'.format(i)) 316 | 317 | record_dfs = record_dfs.drop(columns='reward').sort_values(by='steps', ascending=True).ffill().bfill() 318 | record_dfs['reward_mean'] = record_dfs[reward_cols].mean(axis=1) 319 | record_dfs['reward_std'] = record_dfs[reward_cols].std(axis=1) 320 | record_dfs['reward_smooth'] = record_dfs['reward_mean'].ewm(span=20).mean() 321 | record_dfs.to_csv(joindir(RESULT_DIR, 'a2c-record-{}-{}.csv'.format(args.env_name, datestr))) 322 | 323 | # Plot 324 | plt.figure(figsize=(12, 6)) 325 | plt.plot(record_dfs['steps'], record_dfs['reward_mean'], label='trajory reward') 326 | plt.plot(record_dfs['steps'], record_dfs['reward_smooth'], label='smoothed reward') 327 | plt.fill_between(record_dfs['steps'], record_dfs['reward_mean'] - record_dfs['reward_std'], 328 | record_dfs['reward_mean'] + record_dfs['reward_std'], color='b', alpha=0.2) 329 | plt.legend() 330 | plt.xlabel('steps of env interaction (sample complexity)') 331 | plt.ylabel('average reward') 332 | plt.title('A2C on {}'.format(args.env_name)) 333 | plt.savefig(joindir(RESULT_DIR, 'a2c-{}-{}.pdf'.format(args.env_name, datestr))) 334 | 335 | 336 | -------------------------------------------------------------------------------- /code/ars.py: -------------------------------------------------------------------------------- 1 | """ 2 | Implementation of Augmented Random Search 3 | 4 | This is a derivative-free method, probably hard to solve high 5 | dimensional problems. Following the paper, we use linear and 6 | deterministic policy. 7 | 8 | There are four versions of ARS in the paper. Here, we implement 9 | V2-t, the version that seems to have the best performance. 10 | 11 | ref: 12 | 13 | Mania, Horia, Aurelia Guy, and Benjamin Recht. "Simple random 14 | search provides a competitive approach to reinforcement learning." 15 | arXiv preprint arXiv:1803.07055 (2018). 
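In brief (added summary of the update implemented below): each iteration
samples N random directions delta, evaluates the perturbed policies
M +/- nu * delta on states whitened by a running mean/std (the "V2" part),
keeps the b directions whose better-performing perturbation scored highest
(the "-t" part), and updates

    M <- M + alpha / (b * sigma_R) * sum_k [R(M + nu*delta_k) - R(M - nu*delta_k)] * delta_k

where sigma_R is the standard deviation of the returns collected from the
selected directions.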
16 | 17 | https://github.com/modestyachts/ARS 18 | """ 19 | 20 | from pathos.multiprocessing import ProcessingPool as Pool 21 | from collections import namedtuple 22 | import gym 23 | import matplotlib 24 | matplotlib.use('agg') 25 | import matplotlib.pyplot as plt 26 | from os.path import join as joindir 27 | from functools import partial 28 | import pandas as pd 29 | import numpy as np 30 | import argparse 31 | import datetime 32 | import math 33 | import ray 34 | 35 | 36 | Stats = namedtuple('Stats', ('n', 'mean', 'svar')) 37 | EPS = 1e-8 38 | RESULT_DIR = '../result' 39 | 40 | 41 | class args(object): 42 | env_name = 'Hopper-v2' 43 | seed = 1234 44 | num_episode = 100 45 | max_step_per_round = 200 46 | random_table_size = 100000000 47 | num_round_avg = 5 48 | 49 | N = 8 50 | alpha = 0.01 51 | nu = 0.03 52 | b = 4 53 | 54 | log_num_episode = 1 55 | num_parallel_run = 5 56 | 57 | 58 | def add_arguments(): 59 | parser = argparse.ArgumentParser() 60 | parser.add_argument('--env_name', type=str, default='Hopper-v2') 61 | parser.add_argument('--seed', type=int, default=1234) 62 | parser.add_argument('--num_episode', type=int, default=1000) 63 | parser.add_argument('--max_step_per_round', type=int, default=200) 64 | parser.add_argument('--random_table_size', type=int, default=100000000) 65 | parser.add_argument('--num_round_avg', type=int, default=5) 66 | 67 | parser.add_argument('--N', type=int, default=8) 68 | parser.add_argument('--alpha', type=float, default=0.01) 69 | parser.add_argument('--nu', type=float, default=0.03) 70 | parser.add_argument('--b', type=int, default=4) 71 | 72 | parser.add_argument('--log_num_episode', type=int, default=1) 73 | parser.add_argument('--num_parallel_run', type=int, default=5) 74 | 75 | args = parser.parse_args() 76 | return args 77 | 78 | 79 | class RunningStat(object): 80 | def __init__(self, shape): 81 | self._n = 0 82 | self._mean = np.zeros(shape) 83 | self._svar = np.zeros(shape) 84 | 85 | def push(self, x): 86 | x = np.asarray(x).astype(float) 87 | assert x.shape == self._mean.shape 88 | self._n += 1 89 | if self._n == 1: 90 | self._mean[...] = x 91 | else: 92 | mean_old = self._mean.copy() 93 | self._mean[...] = self._mean + (x - mean_old) / self._n 94 | self._svar[...] = self._svar + (x - mean_old) * (x - self._mean) 95 | 96 | def update(self, stat_list): 97 | """ 98 | ref: https://www.emathzone.com/tutorials/basic-statistics/combined-variance.html 99 | """ 100 | n_old = self._n 101 | mean_old = self._mean.copy() 102 | svar_old = self._svar.copy() 103 | 104 | self._n += np.sum([stat.n for stat in stat_list]) 105 | self._mean[...] = self._mean \ 106 | + np.sum([stat.n * (stat.mean - mean_old) / self._n for stat in stat_list], axis=0) 107 | self._svar[...] 
= self._svar + n_old * np.square(mean_old - self._mean) \ 108 | + np.sum([stat.svar + stat.n * np.square(stat.mean - self._mean) for stat in stat_list], axis=0) 109 | 110 | @property 111 | def stat(self): 112 | return Stats(n=self._n, mean=self._mean, svar=self._svar) 113 | 114 | @property 115 | def n(self): 116 | return self._n 117 | 118 | @property 119 | def mean(self): 120 | return self._mean 121 | 122 | @property 123 | def var(self): 124 | return self._svar / (self._n - 1) + EPS if self._n > 1 else np.ones(self._svar.shape) 125 | 126 | @property 127 | def std(self): 128 | return np.sqrt(self.var) 129 | 130 | @property 131 | def shape(self): 132 | return self._mean.shape 133 | 134 | 135 | @ray.remote 136 | class Worker(object): 137 | def __init__(self, M, mean, var, deltas, args): 138 | """ 139 | initialize the agent 140 | """ 141 | self.actor = NaiveActor(M, mean, var) 142 | self.deltas = deltas 143 | 144 | self.num_round_avg = args['num_round_avg'] 145 | self.max_step_per_round = args['max_step_per_round'] 146 | self.env_name = args['env_name'] 147 | self.env_seed = args['seed'] 148 | self.nu = args['nu'] 149 | self.delta_shape = M.shape 150 | self.delta_dim = np.prod(self.delta_shape) 151 | self.rg = np.random.RandomState(args['seed']) 152 | 153 | def sync_actor_params(self, M, mean, var): 154 | self.actor.sync_params(M, mean, var) 155 | 156 | def rollout(self): 157 | """ 158 | test the transition matrix M by process several rollouts 159 | :return: mean rewards, states statistics (n, mean, var) 160 | """ 161 | # generate delta 162 | delta_ind = self.rg.randint(0, len(self.deltas) - self.delta_dim + 1) 163 | delta = self.deltas[delta_ind:delta_ind + self.delta_dim].reshape(self.delta_shape) 164 | self.actor.set_delta(delta) 165 | 166 | running_stat = RunningStat((self.delta_shape[1], )) 167 | env = gym.make(self.env_name) 168 | env.seed(self.env_seed) 169 | reward_neg_pos = [] 170 | 171 | # generate rollouts with M +/- nu * delta 172 | for nu in [-self.nu, self.nu]: 173 | self.actor.set_nu(nu) 174 | reward_sum_record = [] 175 | for i_run in range(self.num_round_avg): 176 | done = False 177 | num_steps = 0 178 | reward_sum = 0 179 | state = env.reset() 180 | running_stat.push(state) 181 | while (not done) and (num_steps < self.max_step_per_round): 182 | action = self.actor.forward(state) 183 | state, reward, done, _ = env.step(action) 184 | reward_sum += reward 185 | running_stat.push(state) 186 | num_steps += 1 187 | reward_sum_record.append(reward_sum) 188 | reward_neg_pos.append(np.mean(reward_sum_record)) 189 | return (delta_ind, reward_neg_pos, running_stat.stat) 190 | 191 | 192 | class NaiveActor(object): 193 | def __init__(self, M, mean, var): 194 | """ 195 | :param M: the tested transition matrix, of dimension (dim_actions, dim_states) 196 | :param mean: mean of all previous states 197 | :param var: diagonal of covariance (element-wise variance) of all previous states 198 | """ 199 | self._M = M 200 | self._mean = mean 201 | self._var = var 202 | self._delta = None 203 | self._pert_M = None 204 | 205 | def set_delta(self, delta): 206 | self._delta = delta 207 | 208 | def set_nu(self, nu): 209 | self._pert_M = self._M + nu * self._delta 210 | 211 | def sync_params(self, M, mean, var): 212 | self._M = M 213 | self._mean = mean 214 | self._var = var 215 | 216 | def forward(self, states): 217 | """ 218 | given a states returns the action, where M = self._M + nu * deltas 219 | :param states: a np.ndarray represents states 220 | :return: the deterministic action 221 | """ 222 | 
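        """
        # Note (added): ARS "V2" whitening. The state is normalized by the running
        # mean and element-wise variance of all states seen so far, then mapped
        # through the perturbed matrix:
        #     a = (M + nu * delta) @ (s - mean) / sqrt(var)
        # which is exactly what the return statement below computes.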
return np.matmul(self._pert_M, (states - self._mean) / np.sqrt(self._var)) 223 | 224 | 225 | class Master(object): 226 | """ 227 | A linear policy actor master 228 | Each weight is drawn from independent Gaussian distribution 229 | """ 230 | def __init__(self, args): 231 | self.dim_states, self.dim_actions = self._get_dimensions(args.env_name) 232 | self.dim_M = self.dim_states * self.dim_actions 233 | 234 | self.M = ray.put(np.zeros((self.dim_actions, self.dim_states))) 235 | self.running_stat = RunningStat((self.dim_states, )) 236 | 237 | self.deltas = ray.put(np.random.RandomState(args.seed).randn(args.random_table_size).astype(np.float64)) 238 | 239 | worker_args = { 240 | 'num_round_avg': args.num_round_avg, 241 | 'max_step_per_round': args.max_step_per_round, 242 | 'env_name': args.env_name, 243 | 'seed': args.seed, 244 | 'nu': args.nu, 245 | } 246 | self.workers = [Worker.remote(self.M, self.running_stat.mean, 247 | self.running_stat.var, self.deltas, worker_args) for i in range(args.N)] 248 | 249 | self.args = args 250 | 251 | self.reward_record = [] 252 | 253 | def _get_dimensions(self, env_name): 254 | env = gym.make(env_name) 255 | return env.observation_space.shape[0], env.action_space.shape[0] 256 | 257 | def run(self): 258 | for i in range(self.args.num_episode): 259 | rollout_results = ray.get([w.rollout.remote() for w in self.workers]) 260 | 261 | rollout_ids = [res[0] for res in rollout_results] 262 | rollout_rewards = np.array([res[1] for res in rollout_results]) 263 | rollout_stats = [res[2] for res in rollout_results] 264 | 265 | # update master policy 266 | self.update(rollout_ids, rollout_rewards) 267 | 268 | # update master state mean and variance 269 | self.running_stat.update(rollout_stats) 270 | 271 | # sync master policy and state statistics to workers 272 | ray.get([w.sync_actor_params.remote(self.M, self.running_stat.mean, self.running_stat.var) \ 273 | for w in self.workers]) 274 | 275 | self.reward_record.append({'steps': self.running_stat.n, 'reward': rollout_rewards.mean()}) 276 | 277 | if i % self.args.log_num_episode == 0: 278 | print('Finished episode: {} steps: {} AvgReward: {:.4f}' \ 279 | .format(i, self.reward_record[-1]['steps'], self.reward_record[-1]['reward'])) 280 | print('-----------------') 281 | 282 | def update(self, rollout_ids, rollout_rewards): 283 | max_rollout_reward = rollout_rewards.max(axis=1) 284 | selected_ind = np.argsort(max_rollout_reward)[::-1][:self.args.b] 285 | sig_reward = rollout_rewards[selected_ind].reshape(-1).std() 286 | 287 | new_M = np.copy(ray.get(self.M)) 288 | deltas = ray.get(self.deltas) 289 | for i in selected_ind: 290 | delta = deltas[rollout_ids[i]:rollout_ids[i] + self.dim_M].reshape((self.dim_actions, self.dim_states)) 291 | new_M += self.args.alpha / self.args.b / sig_reward \ 292 | * (rollout_rewards[i, 1] - rollout_rewards[i, 0]) * delta 293 | self.M = ray.put(new_M) 294 | 295 | if __name__ == '__main__': 296 | ray.init() 297 | datestr = datetime.datetime.now().strftime('%Y-%m-%d') 298 | args = add_arguments() 299 | 300 | record_dfs = pd.DataFrame(columns=['steps', 'reward']) 301 | reward_cols = [] 302 | for i in range(args.num_parallel_run): 303 | args.seed += 1 304 | master = Master(args) 305 | master.run() 306 | reward_record = pd.DataFrame(master.reward_record) 307 | record_dfs = record_dfs.merge(reward_record, how='outer', on='steps', suffixes=('', '_{}'.format(i))) 308 | reward_cols.append('reward_{}'.format(i)) 309 | 310 | record_dfs = record_dfs.drop(columns='reward').sort_values(by='steps', 
ascending=True).ffill().bfill() 311 | record_dfs['reward_mean'] = record_dfs[reward_cols].mean(axis=1) 312 | record_dfs['reward_std'] = record_dfs[reward_cols].std(axis=1) 313 | record_dfs['reward_smooth'] = record_dfs['reward_mean'].ewm(span=1000).mean() 314 | record_dfs['reward_smooth_std'] = record_dfs['reward_std'].ewm(span=1000).mean() 315 | record_dfs.to_csv(joindir(RESULT_DIR, 'ars-record-{}-{}.csv'.format(args.env_name, datestr))) 316 | 317 | # Plot 318 | plt.figure(figsize=(12, 6)) 319 | plt.plot(record_dfs['steps'], record_dfs['reward_smooth'], label='reward') 320 | plt.fill_between(record_dfs['steps'], record_dfs['reward_smooth'] - record_dfs['reward_smooth_std'], 321 | record_dfs['reward_smooth'] + record_dfs['reward_smooth_std'], color='b', alpha=0.2) 322 | plt.legend() 323 | plt.xlabel('steps of env interaction (sample complexity)') 324 | plt.ylabel('average reward') 325 | plt.title('ARS on {}'.format(args.env_name)) 326 | plt.savefig(joindir(RESULT_DIR, 'ars-plot-{}-{}.pdf'.format(args.env_name, datestr))) 327 | -------------------------------------------------------------------------------- /code/ars_tune.py: -------------------------------------------------------------------------------- 1 | """ 2 | Implementation of Augmented Random Search 3 | 4 | This is a derivative-free method, probably hard to solve high 5 | dimensional problems. Following the paper, we use linear and 6 | deterministic policy. 7 | 8 | There are four versions of ARS in the paper. Here, we implement 9 | V2-t, the version that seems to have the best performance. 10 | 11 | ref: 12 | 13 | Mania, Horia, Aurelia Guy, and Benjamin Recht. "Simple random 14 | search provides a competitive approach to reinforcement learning." 15 | arXiv preprint arXiv:1803.07055 (2018). 16 | 17 | https://github.com/modestyachts/ARS 18 | 19 | Notice: 20 | This is a _tune version, which means that it finds an optimal 21 | configuration of hyperparameters (by running the algorithm 22 | multiple times) and run with this found configuration. 
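Added note on scoring: every hyperparameter given as a list in
`config_hopper` spans a grid; each grid point is run `num_trials` times
with random seeds, each run is scored by the average reward over the last
10% of episodes, and the configuration keeps mean(score) - std(score), so
stable settings are preferred over merely lucky ones.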
23 | """ 24 | 25 | from pathos.multiprocessing import ProcessingPool as Pool 26 | from collections import namedtuple 27 | import gym 28 | import matplotlib 29 | matplotlib.use('agg') 30 | import matplotlib.pyplot as plt 31 | from os.path import join as joindir 32 | from functools import partial 33 | from itertools import product 34 | import pandas as pd 35 | import numpy as np 36 | import argparse 37 | import datetime 38 | import json 39 | import math 40 | import ray 41 | import ray.tune as tune 42 | 43 | 44 | EPS = 1e-8 45 | RESULT_DIR = '../result' 46 | TUNE_DIR = '../tune' 47 | LOG_DIR = '../log' 48 | 49 | 50 | config_hopper = { 51 | 'env_name': 'Hopper-v2', 52 | 'seed': 'auto', 53 | 'num_episode': 1000, 54 | 'max_step_per_round': 200, 55 | 'random_table_size': 100000000, 56 | 'num_round_avg': 5, 57 | 'N': [8, 16, 32], 58 | 'alpha': [0.01, 0.02, 0.025], 59 | 'nu': [0.03, 0.025, 0.02, 0.01], 60 | 'b_ratio': [0.5, 1.0], 61 | # 'N': 32, 62 | # 'alpha': 0.01, 63 | # 'nu': [0.02, 0.01], 64 | # 'b_ratio': 0.5, 65 | 'num_trials': 5, 66 | } 67 | 68 | class Logger(object): 69 | def __init__(self, logfile='log.txt'): 70 | super(Logger, self).__init__() 71 | self.logfile = logfile 72 | 73 | def info(self, msg): 74 | timestr = datetime.datetime.now().strftime('%Y-%m-%d-%H-%M-%S') 75 | print('[info {}] {}'.format(timestr, msg)) 76 | with open(self.logfile, 'a+') as f: 77 | f.write('[info {}] {}\n'.format(timestr, msg)) 78 | 79 | class RunningStat(object): 80 | def __init__(self, shape): 81 | self._n = 0 82 | self._mean = np.zeros(shape) 83 | self._svar = np.zeros(shape) 84 | 85 | def push(self, x): 86 | x = np.asarray(x).astype(float) 87 | assert x.shape == self._mean.shape 88 | self._n += 1 89 | if self._n == 1: 90 | self._mean[...] = x 91 | else: 92 | mean_old = self._mean.copy() 93 | self._mean[...] = self._mean + (x - mean_old) / self._n 94 | self._svar[...] = self._svar + (x - mean_old) * (x - self._mean) 95 | 96 | def update(self, stat_list): 97 | """ 98 | ref: https://www.emathzone.com/tutorials/basic-statistics/combined-variance.html 99 | """ 100 | n_old = self._n 101 | mean_old = self._mean.copy() 102 | svar_old = self._svar.copy() 103 | 104 | self._n += np.sum([stat['n'] for stat in stat_list]) 105 | self._mean[...] = self._mean \ 106 | + np.sum([stat['n'] * (stat['mean'] - mean_old) / self._n for stat in stat_list], axis=0) 107 | self._svar[...] 
= self._svar + n_old * np.square(mean_old - self._mean) \ 108 | + np.sum([stat['svar'] + stat['n'] * np.square(stat['mean'] - self._mean) for stat in stat_list], axis=0) 109 | 110 | @property 111 | def stat(self): 112 | return {'n': self._n, 'mean': self._mean, 'svar': self._svar} 113 | 114 | @property 115 | def n(self): 116 | return self._n 117 | 118 | @property 119 | def mean(self): 120 | return self._mean 121 | 122 | @property 123 | def var(self): 124 | return self._svar / (self._n - 1) + EPS if self._n > 1 else np.ones(self._svar.shape) 125 | 126 | @property 127 | def std(self): 128 | return np.sqrt(self.var) 129 | 130 | @property 131 | def shape(self): 132 | return self._mean.shape 133 | 134 | 135 | @ray.remote 136 | class Worker(object): 137 | def __init__(self, M, mean, var, deltas, config): 138 | """ 139 | initialize the agent 140 | """ 141 | self.actor = NaiveActor(M, mean, var) 142 | self.deltas = deltas 143 | 144 | self.num_round_avg = config['num_round_avg'] 145 | self.max_step_per_round = config['max_step_per_round'] 146 | self.env_name = config['env_name'] 147 | self.env_seed = config['seed'] 148 | self.nu = config['nu'] 149 | self.delta_shape = M.shape 150 | self.delta_dim = np.prod(self.delta_shape) 151 | self.rg = np.random.RandomState(config['seed']) 152 | 153 | def sync_actor_params(self, M, mean, var): 154 | self.actor.sync_params(M, mean, var) 155 | 156 | def rollout(self): 157 | """ 158 | test the transition matrix M by process several rollouts 159 | :return: mean rewards, states statistics (n, mean, var) 160 | """ 161 | # generate delta 162 | delta_ind = self.rg.randint(0, len(self.deltas) - self.delta_dim + 1) 163 | delta = self.deltas[delta_ind:delta_ind + self.delta_dim].reshape(self.delta_shape) 164 | self.actor.set_delta(delta) 165 | 166 | running_stat = RunningStat((self.delta_shape[1], )) 167 | env = gym.make(self.env_name) 168 | env.seed(self.env_seed) 169 | reward_neg_pos = [] 170 | 171 | # generate rollouts with M +/- nu * delta 172 | for nu in [-self.nu, self.nu]: 173 | self.actor.set_nu(nu) 174 | reward_sum_record = [] 175 | for i_run in range(self.num_round_avg): 176 | done = False 177 | num_steps = 0 178 | reward_sum = 0 179 | state = env.reset() 180 | running_stat.push(state) 181 | while (not done) and (num_steps < self.max_step_per_round): 182 | action = self.actor.forward(state) 183 | state, reward, done, _ = env.step(action) 184 | reward_sum += reward 185 | running_stat.push(state) 186 | num_steps += 1 187 | reward_sum_record.append(reward_sum) 188 | reward_neg_pos.append(np.mean(reward_sum_record)) 189 | return (delta_ind, reward_neg_pos, running_stat.stat) 190 | 191 | 192 | class NaiveActor(object): 193 | def __init__(self, M, mean, var): 194 | """ 195 | :param M: the tested transition matrix, of dimension (dim_actions, dim_states) 196 | :param mean: mean of all previous states 197 | :param var: diagonal of covariance (element-wise variance) of all previous states 198 | """ 199 | self._M = M 200 | self._mean = mean 201 | self._var = var 202 | self._delta = None 203 | self._pert_M = None 204 | 205 | def set_delta(self, delta): 206 | self._delta = delta 207 | 208 | def set_nu(self, nu): 209 | self._pert_M = self._M + nu * self._delta 210 | 211 | def sync_params(self, M, mean, var): 212 | self._M = M 213 | self._mean = mean 214 | self._var = var 215 | 216 | def forward(self, states): 217 | """ 218 | given a states returns the action, where M = self._M + nu * deltas 219 | :param states: a np.ndarray represents states 220 | :return: the deterministic 
action 221 | """ 222 | return np.matmul(self._pert_M, (states - self._mean) / np.sqrt(self._var)) 223 | 224 | 225 | class Master(object): 226 | """ 227 | A linear policy actor master 228 | Each weight is drawn from independent Gaussian distribution 229 | """ 230 | def __init__(self, config, verbose=1): 231 | self.dim_states, self.dim_actions = self._get_dimensions(config['env_name']) 232 | self.dim_M = self.dim_states * self.dim_actions 233 | 234 | self.M = ray.put(np.zeros((self.dim_actions, self.dim_states))) 235 | self.running_stat = RunningStat((self.dim_states, )) 236 | 237 | self.deltas = ray.put(np.random.RandomState(config['seed']).randn(config['random_table_size']).astype(np.float64)) 238 | 239 | worker_config = { 240 | 'num_round_avg': config['num_round_avg'], 241 | 'max_step_per_round': config['max_step_per_round'], 242 | 'env_name': config['env_name'], 243 | 'seed': config['seed'], 244 | 'nu': config['nu'], 245 | } 246 | self.workers = [Worker.remote(self.M, self.running_stat.mean, 247 | self.running_stat.var, self.deltas, worker_config) for i in range(config['N'])] 248 | 249 | self.config = config 250 | self.config.update({'b': int(config['N'] * config['b_ratio'])}) 251 | 252 | self.reward_record = [] 253 | 254 | self.verbose = verbose 255 | 256 | def _get_dimensions(self, env_name): 257 | env = gym.make(env_name) 258 | return env.observation_space.shape[0], env.action_space.shape[0] 259 | 260 | def run(self): 261 | for i in range(self.config['num_episode']): 262 | rollout_results = ray.get([w.rollout.remote() for w in self.workers]) 263 | 264 | rollout_ids = [res[0] for res in rollout_results] 265 | rollout_rewards = np.array([res[1] for res in rollout_results]) 266 | rollout_stats = [res[2] for res in rollout_results] 267 | 268 | # update master policy 269 | self.update(rollout_ids, rollout_rewards) 270 | 271 | # update master state mean and variance 272 | self.running_stat.update(rollout_stats) 273 | 274 | # sync master policy and state statistics to workers 275 | ray.get([w.sync_actor_params.remote(self.M, self.running_stat.mean, self.running_stat.var) \ 276 | for w in self.workers]) 277 | 278 | self.reward_record.append({'steps': self.running_stat.n, 'reward': rollout_rewards.mean()}) 279 | 280 | if self.verbose >= 1: 281 | logger.info('Finished episode: {} steps: {} AvgReward: {:.4f}' \ 282 | .format(i, self.reward_record[-1]['steps'], self.reward_record[-1]['reward'])) 283 | logger.info('-----------------') 284 | 285 | def update(self, rollout_ids, rollout_rewards): 286 | max_rollout_reward = rollout_rewards.max(axis=1) 287 | selected_ind = np.argsort(max_rollout_reward)[::-1][:self.config['b']] 288 | sig_reward = rollout_rewards[selected_ind].reshape(-1).std() 289 | 290 | new_M = np.copy(ray.get(self.M)) 291 | deltas = ray.get(self.deltas) 292 | for i in selected_ind: 293 | delta = deltas[rollout_ids[i]:rollout_ids[i] + self.dim_M].reshape((self.dim_actions, self.dim_states)) 294 | new_M += self.config['alpha'] / self.config['b'] / sig_reward \ 295 | * (rollout_rewards[i, 1] - rollout_rewards[i, 0]) * delta 296 | self.M = ray.put(new_M) 297 | 298 | def grid_search(func, config): 299 | auto_seed = False 300 | if 'seed' in config and config['seed'] == 'auto': 301 | auto_seed = True 302 | if 'num_trials' in config: 303 | num_trials = config['num_trials'] 304 | else: 305 | num_trials = 1 306 | list_elements = [config[d] for d in config if type(config[d]) is list] 307 | list_names = [d for d in config if type(config[d]) is list] 308 | trials = [] 309 | for values in 
product(*list_elements): 310 | config.update({name: val for val, name in zip(values, list_names)}) 311 | logger.info('========try new config========') 312 | logger.info('config: {}'.format(config)) 313 | scores = [] 314 | for i in range(num_trials): 315 | try_config = config.copy() 316 | if auto_seed: 317 | try_config['seed'] = np.random.randint(1000) 318 | scores.append(func(try_config)) 319 | trials.append({'config': try_config, 'score': np.mean(scores) - np.std(scores)}) 320 | logger.info('score: {} (+/- {})'.format(np.mean(scores), np.std(scores))) 321 | return trials 322 | 323 | def run_ars(config): 324 | master = Master(config, verbose=0) 325 | master.run() 326 | num_last_episodes = int(config['num_episode'] * 0.1) 327 | score = np.mean([x['reward'] for x in master.reward_record[-num_last_episodes:]]) 328 | return score 329 | 330 | def run_single_and_plot(config): 331 | record_dfs = pd.DataFrame(columns=['steps', 'reward']) 332 | reward_cols = [] 333 | for i in range(config['num_trials']): 334 | config['seed'] = np.random.randint(1000) 335 | master = Master(config) 336 | master.run() 337 | reward_record = pd.DataFrame(master.reward_record) 338 | record_dfs = record_dfs.merge(reward_record, how='outer', on='steps', suffixes=('', '_{}'.format(i))) 339 | reward_cols.append('reward_{}'.format(i)) 340 | 341 | record_dfs = record_dfs.drop(columns='reward').sort_values(by='steps', ascending=True).ffill().bfill() 342 | record_dfs['reward_mean'] = record_dfs[reward_cols].mean(axis=1) 343 | record_dfs['reward_std'] = record_dfs[reward_cols].std(axis=1) 344 | record_dfs['reward_smooth'] = record_dfs['reward_mean'].ewm(span=1000).mean() 345 | record_dfs['reward_smooth_std'] = record_dfs['reward_std'].ewm(span=1000).mean() 346 | record_dfs.to_csv(joindir(TUNE_DIR, 'ARS-record-{}.csv'.format(config['env_name']))) 347 | 348 | # Plot 349 | plt.figure(figsize=(12, 6)) 350 | plt.plot(record_dfs['steps'], record_dfs['reward_smooth'], label='reward') 351 | plt.fill_between(record_dfs['steps'], record_dfs['reward_smooth'] - record_dfs['reward_smooth_std'], 352 | record_dfs['reward_smooth'] + record_dfs['reward_smooth_std'], color='b', alpha=0.2) 353 | plt.legend() 354 | plt.xlabel('steps of env interaction (sample complexity)') 355 | plt.ylabel('average reward') 356 | plt.title('ARS on {}'.format(config['env_name'])) 357 | plt.savefig(joindir(TUNE_DIR, 'ARS-plot-{}.pdf'.format(config['env_name']))) 358 | 359 | if __name__ == '__main__': 360 | logger = Logger(joindir(LOG_DIR, 'log_ars.txt')) 361 | ray.init() 362 | 363 | trials = grid_search(run_ars, config_hopper) 364 | 365 | best_trial = sorted(trials, key=lambda x: x['score'])[-1] 366 | best_config = best_trial['config'] 367 | best_score = best_trial['score'] 368 | 369 | with open(joindir(TUNE_DIR, 'ARS-{}.json'.format(best_config['env_name'])), 'w') as f: 370 | json.dump(best_config, f, indent=4, sort_keys=True) 371 | 372 | logger.info('========best solution found========') 373 | logger.info('best score: {}'.format(best_score)) 374 | logger.info('best config: {}'.format(best_config)) 375 | 376 | run_single_and_plot(best_config) 377 | 378 | -------------------------------------------------------------------------------- /code/cem.py: -------------------------------------------------------------------------------- 1 | """ 2 | Implementation of Cross Entropy Method 3 | 4 | This is a derivative-free method. It seems hard to solve high 5 | dimensional problems. Here, we use linear and deterministic 6 | policy. 
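Added summary of the update implemented below: each generation samples
`num_samples` weight matrices from a factorized Gaussian N(mu, diag(sig^2)),
scores each by the average return of several rollouts, keeps the top
`best_ratio` fraction, and refits

    mu  <- mean(elite weights)
    sig <- sqrt(var(elite weights) + const_noise_sig2)

where the constant extra noise is the trick from Szita & Lorincz (2006)
that keeps the search distribution from collapsing too early.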
7 | 8 | ref: 9 | Szita, István, and András Lörincz. "Learning Tetris using the 10 | noisy cross-entropy method." Neural computation 18.12 (2006): 11 | 2936-2941. 12 | """ 13 | 14 | from pathos.multiprocessing import ProcessingPool as Pool 15 | import gym 16 | import matplotlib 17 | matplotlib.use('agg') 18 | import matplotlib.pyplot as plt 19 | from os.path import join as joindir 20 | from functools import partial 21 | import pandas as pd 22 | import numpy as np 23 | import argparse 24 | import datetime 25 | import math 26 | 27 | 28 | EPS = 1e-10 29 | RESULT_DIR = '../result' 30 | 31 | 32 | class args(object): 33 | env_name = 'Hopper-v2' 34 | seed = 1234 35 | num_episode = 100 36 | max_step_per_round = 200 37 | log_num_episode = 1 38 | num_parallel_run = 5 39 | init_sig = 10.0 40 | const_noise_sig2 = 4.0 41 | num_samples = 100 42 | best_ratio = 0.1 43 | num_round_avg = 30 44 | num_cores = 10 45 | 46 | 47 | def add_arguments(): 48 | parser = argparse.ArgumentParser() 49 | parser.add_argument('--env_name', type=str, default='Hopper-v2') 50 | parser.add_argument('--seed', type=int, default=1234) 51 | parser.add_argument('--num_episode', type=int, default=1000) 52 | parser.add_argument('--max_step_per_round', type=int, default=200) 53 | parser.add_argument('--log_num_episode', type=int, default=1) 54 | parser.add_argument('--num_parallel_run', type=int, default=5) 55 | parser.add_argument('--init_sig', type=float, default=10.0) 56 | parser.add_argument('--const_noise_sig2', type=float, default=4.0) 57 | parser.add_argument('--num_samples', type=int, default=100) 58 | parser.add_argument('--best_ratio', type=float, default=0.1) 59 | parser.add_argument('--num_round_avg', type=int, default=30) 60 | parser.add_argument('--num_cores', type=int, default=10) 61 | 62 | args = parser.parse_args() 63 | return args 64 | 65 | class RunningStat(object): 66 | def __init__(self, shape): 67 | self._n = 0 68 | self._M = np.zeros(shape) 69 | self._S = np.zeros(shape) 70 | 71 | def push(self, x): 72 | x = np.asarray(x) 73 | assert x.shape == self._M.shape 74 | self._n += 1 75 | if self._n == 1: 76 | self._M[...] = x 77 | else: 78 | oldM = self._M.copy() 79 | self._M[...] = oldM + (x - oldM) / self._n 80 | self._S[...] 
= self._S + (x - oldM) * (x - self._M) 81 | 82 | @property 83 | def n(self): 84 | return self._n 85 | 86 | @property 87 | def mean(self): 88 | return self._M 89 | 90 | @property 91 | def var(self): 92 | return self._S / (self._n - 1) if self._n > 1 else np.square(self._M) 93 | 94 | @property 95 | def std(self): 96 | return np.sqrt(self.var) 97 | 98 | @property 99 | def shape(self): 100 | return self._M.shape 101 | 102 | 103 | class ZFilter: 104 | """ 105 | y = (x-mean)/std 106 | using running estimates of mean,std 107 | """ 108 | 109 | def __init__(self, shape, demean=True, destd=True, clip=10.0): 110 | self.demean = demean 111 | self.destd = destd 112 | self.clip = clip 113 | 114 | self.rs = RunningStat(shape) 115 | 116 | def __call__(self, x, update=True): 117 | if update: self.rs.push(x) 118 | if self.demean: 119 | x = x - self.rs.mean 120 | if self.destd: 121 | x = x / (self.rs.std + 1e-8) 122 | if self.clip: 123 | x = np.clip(x, -self.clip, self.clip) 124 | return x 125 | 126 | def output_shape(self, input_space): 127 | return input_space.shape 128 | 129 | 130 | class Agent(object): 131 | def __init__(self, M): 132 | self.actor = NaiveActor(M) 133 | 134 | def run(self, num_round_avg, env_name, env_seed): 135 | env = gym.make(env_name) 136 | env.seed(env_seed) 137 | total_steps = 0 138 | reward_sum_record = [] 139 | for i_run in range(num_round_avg): 140 | done = False 141 | num_steps = 0 142 | reward_sum = 0 143 | state = env.reset() 144 | # state = running_state(state) 145 | while (not done) and (num_steps < args.max_step_per_round): 146 | action = self.actor.forward(state) 147 | state, reward, done, _ = env.step(action) 148 | reward_sum += reward 149 | # state = running_state(state) 150 | num_steps += 1 151 | total_steps += num_steps 152 | reward_sum_record.append(reward_sum) 153 | return (total_steps, np.mean(reward_sum_record)) 154 | 155 | 156 | class NaiveActor(object): 157 | def __init__(self, M): 158 | self._M = M 159 | 160 | def forward(self, states): 161 | """ 162 | given a states returns the action 163 | :param states: a np.ndarray represents states 164 | :return: the deterministic action 165 | """ 166 | return np.matmul(states, self._M) 167 | 168 | 169 | class Actor(object): 170 | """ 171 | A linear policy actor 172 | Each weight is drawn from independent Gaussian distribution 173 | """ 174 | def __init__(self, dim_states, dim_actions): 175 | self.shape = (dim_states, dim_actions) 176 | self._mu = np.zeros(self.shape) 177 | self._sig = np.ones(np.prod(self.shape)) * args.init_sig 178 | 179 | def sample(self): 180 | """ 181 | give one sample of transition matrix self._M and set to itself 182 | """ 183 | M = np.random.normal(self._mu.reshape(-1), self._sig).reshape(self.shape) 184 | return M 185 | 186 | def update(self, weights): 187 | """ 188 | given the selected good samples of weights, update according 189 | to CEM formula 190 | :param weights: list of weights, each is the same size of self._M 191 | """ 192 | self._mu = np.mean(weights, axis=0) 193 | self._sig = np.sqrt(np.array([np.square((w - self._mu).reshape(-1)) for w in weights]).mean(axis=0) \ 194 | + args.const_noise_sig2) 195 | 196 | 197 | def get_score_of_weight(M): 198 | agent = Agent(M) 199 | return agent.run(args.num_round_avg, args.env_name, args.seed) 200 | 201 | 202 | def cem(): 203 | env = gym.make(args.env_name) 204 | dim_states = env.observation_space.shape[0] 205 | dim_actions = env.action_space.shape[0] 206 | del env 207 | p = Pool(args.num_cores) 208 | 209 | actor = Actor(dim_states, dim_actions) 210 
| # running_state = ZFilter((dim_states,), clip=5) 211 | 212 | reward_record = [] 213 | global_steps = 0 214 | num_top_samples = int(max(1, np.floor(args.num_samples * args.best_ratio))) 215 | 216 | for i_episode in range(args.num_episode): 217 | 218 | # sample several weights and perform multiple times each 219 | weights = [] 220 | scores = [] 221 | for i_sample in range(args.num_samples): 222 | weights.append(actor.sample()) 223 | 224 | res = p.map(get_score_of_weight, weights) 225 | scores = [score for _, score in res] 226 | steps = [step for step, _ in res] 227 | 228 | global_steps += np.sum(steps) 229 | reward_record.append({'steps': global_steps, 'reward': np.mean(scores)}) 230 | 231 | # sort weights according to scores in decreasing order 232 | # ref: https://stackoverflow.com/questions/6618515/sorting-list-based-on-values-from-another-list 233 | selected_weights = [x for _, x in sorted(zip(scores, weights), reverse=True)][:num_top_samples] 234 | actor.update(selected_weights) 235 | 236 | if i_episode % args.log_num_episode == 0: 237 | print('Finished episode: {} steps: {} AvgReward: {:.4f}' \ 238 | .format(i_episode, reward_record[-1]['steps'], reward_record[-1]['reward'])) 239 | print('-----------------') 240 | 241 | return reward_record 242 | 243 | if __name__ == '__main__': 244 | datestr = datetime.datetime.now().strftime('%Y-%m-%d') 245 | args = add_arguments() 246 | 247 | record_dfs = pd.DataFrame(columns=['steps', 'reward']) 248 | reward_cols = [] 249 | for i in range(args.num_parallel_run): 250 | args.seed += 1 251 | reward_record = pd.DataFrame(cem()) 252 | record_dfs = record_dfs.merge(reward_record, how='outer', on='steps', suffixes=('', '_{}'.format(i))) 253 | reward_cols.append('reward_{}'.format(i)) 254 | 255 | record_dfs = record_dfs.drop(columns='reward').sort_values(by='steps', ascending=True).ffill().bfill() 256 | record_dfs['reward_mean'] = record_dfs[reward_cols].mean(axis=1) 257 | record_dfs['reward_std'] = record_dfs[reward_cols].std(axis=1) 258 | record_dfs['reward_smooth'] = record_dfs['reward_mean'].ewm(span=1000).mean() 259 | record_dfs['reward_smooth_std'] = record_dfs['reward_std'].ewm(span=1000).mean() 260 | record_dfs.to_csv(joindir(RESULT_DIR, 'cem-record-{}-{}.csv'.format(args.env_name, datestr))) 261 | 262 | # Plot 263 | plt.figure(figsize=(12, 6)) 264 | plt.plot(record_dfs['steps'], record_dfs['reward_smooth'], label='reward') 265 | plt.fill_between(record_dfs['steps'], record_dfs['reward_smooth'] - record_dfs['reward_smooth_std'], 266 | record_dfs['reward_smooth'] + record_dfs['reward_smooth_std'], color='b', alpha=0.2) 267 | plt.legend() 268 | plt.xlabel('steps of env interaction (sample complexity)') 269 | plt.ylabel('average reward') 270 | plt.title('CEM on {}'.format(args.env_name)) 271 | plt.savefig(joindir(RESULT_DIR, 'cem-plot-{}-{}.pdf'.format(args.env_name, datestr))) 272 | -------------------------------------------------------------------------------- /code/cem_tune.py: -------------------------------------------------------------------------------- 1 | """ 2 | Implementation of Cross Entropy Method 3 | 4 | This is a derivative-free method. It seems hard to solve high 5 | dimensional problems. Here, we use linear and deterministic 6 | policy. 7 | 8 | ref: 9 | Szita, Istvan, and Andras Lorincz. "Learning Tetris using the 10 | noisy cross-entropy method." Neural computation 18.12 (2006): 11 | 2936-2941. 
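Added note: compared with cem.py, the sampling and evaluation logic is
reorganized into Master and Worker classes and all hyperparameters are
passed through a config dict, so that grid_search below can sweep over the
list-valued entries of `config_hopper`.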
12 | 13 | Notice: 14 | This is a _tune version, which means that it finds an optimal 15 | configuration of hyperparameters (by running the algorithm 16 | multiple times) and run with this found configuration. 17 | """ 18 | 19 | from pathos.multiprocessing import ProcessingPool as Pool 20 | import gym 21 | import matplotlib 22 | matplotlib.use('agg') 23 | import matplotlib.pyplot as plt 24 | from os.path import join as joindir 25 | from functools import partial 26 | import pandas as pd 27 | import numpy as np 28 | from itertools import product 29 | import argparse 30 | import datetime 31 | import math 32 | 33 | 34 | EPS = 1e-10 35 | RESULT_DIR = '../result' 36 | TUNE_DIR = '../tune' 37 | LOG_DIR = '../log' 38 | 39 | 40 | config_hopper = { 41 | 'env_name': 'Hopper-v2', 42 | 'seed': 'auto', 43 | 'num_episode': 100, 44 | 'max_step_per_round': 200, 45 | 'init_sig': [1.0, 10.0], 46 | 'const_noise_sig2': [0.0, 4.0], 47 | 'num_samples': [50, 100], 48 | 'best_ratio': [0.1, 0.2], 49 | 'num_round_avg': 30, 50 | 'num_cores': 10, 51 | 'num_trials': 5, 52 | } 53 | 54 | 55 | class Logger(object): 56 | def __init__(self, logfile='log.txt'): 57 | super(Logger, self).__init__() 58 | self.logfile = logfile 59 | 60 | def info(self, msg): 61 | timestr = datetime.datetime.now().strftime('%Y-%m-%d-%H-%M-%S') 62 | print('[info {}] {}'.format(timestr, msg)) 63 | with open(self.logfile, 'a+') as f: 64 | f.write('[info {}] {}\n'.format(timestr, msg)) 65 | 66 | 67 | class Worker(object): 68 | def __init__(self, M, config): 69 | self.actor = NaiveActor(M) 70 | self.num_round_avg = config['num_round_avg'] 71 | self.env_name = config['env_name'] 72 | self.env_seed = config['seed'] 73 | self.max_step_per_round = config['max_step_per_round'] 74 | 75 | def rollout(self): 76 | env = gym.make(self.env_name) 77 | env.seed(self.env_seed) 78 | total_steps = 0 79 | reward_sum_record = [] 80 | for i_run in range(self.num_round_avg): 81 | done = False 82 | num_steps = 0 83 | reward_sum = 0 84 | state = env.reset() 85 | while (not done) and (num_steps < self.max_step_per_round): 86 | action = self.actor.forward(state) 87 | state, reward, done, _ = env.step(action) 88 | reward_sum += reward 89 | num_steps += 1 90 | total_steps += num_steps 91 | reward_sum_record.append(reward_sum) 92 | return (total_steps, np.mean(reward_sum_record)) 93 | 94 | 95 | class NaiveActor(object): 96 | def __init__(self, M): 97 | self._M = M 98 | 99 | def forward(self, states): 100 | """ 101 | given a states returns the action 102 | :param states: a np.ndarray represents states 103 | :return: the deterministic action 104 | """ 105 | return np.matmul(states, self._M) 106 | 107 | 108 | class Master(object): 109 | """ 110 | A linear policy actor 111 | Each weight is drawn from independent Gaussian distribution 112 | """ 113 | def __init__(self, config, verbose=1): 114 | self.dim_states, self.dim_actions = self._get_dimensions(config['env_name']) 115 | self.shape = (self.dim_states, self.dim_actions) 116 | self._mu = np.zeros(self.shape) 117 | self._sig = np.ones(np.prod(self.shape)) * config['init_sig'] 118 | 119 | self.verbose = verbose 120 | self.config = config 121 | self.reward_record = [] 122 | self.pool = Pool(config['num_cores']) 123 | 124 | def run(self): 125 | global_steps = 0 126 | num_top_samples = int(max(1, np.floor(self.config['num_samples'] * self.config['best_ratio']))) 127 | 128 | for i_episode in range(self.config['num_episode']): 129 | 130 | # sample several weights and perform multiple times each 131 | weights = [] 132 | scores = [] 133 | 
for i_sample in range(self.config['num_samples']): 134 | weights.append(self.sample()) 135 | 136 | res = self.pool.map(self._get_score_of_weight, weights) 137 | scores = [score for _, score in res] 138 | steps = [step for step, _ in res] 139 | 140 | global_steps += np.sum(steps) 141 | self.reward_record.append({'steps': global_steps, 'reward': np.mean(scores)}) 142 | 143 | # sort weights according to scores in decreasing order 144 | # ref: https://stackoverflow.com/questions/6618515/sorting-list-based-on-values-from-another-list 145 | selected_weights = [x for _, x in sorted(zip(scores, weights), reverse=True)][:num_top_samples] 146 | self.update(selected_weights) 147 | 148 | if self.verbose >= 1: 149 | logger.info('Finished episode: {} steps: {} AvgReward: {:.4f}' \ 150 | .format(i_episode, self.reward_record[-1]['steps'], self.reward_record[-1]['reward'])) 151 | logger.info('-----------------') 152 | 153 | def sample(self): 154 | """ 155 | give one sample of transition matrix self._M and set to itself 156 | """ 157 | M = np.random.normal(self._mu.reshape(-1), self._sig).reshape(self.shape) 158 | return M 159 | 160 | def update(self, weights): 161 | """ 162 | given the selected good samples of weights, update according 163 | to CEM formula 164 | :param weights: list of weights, each is the same size of self._M 165 | """ 166 | self._mu = np.mean(weights, axis=0) 167 | self._sig = np.sqrt(np.array([np.square((w - self._mu).reshape(-1)) for w in weights]).mean(axis=0) \ 168 | + self.config['const_noise_sig2']) 169 | 170 | def _get_dimensions(self, env_name): 171 | env = gym.make(env_name) 172 | return env.observation_space.shape[0], env.action_space.shape[0] 173 | 174 | def _get_score_of_weight(self, M): 175 | return Worker(M, self.config).rollout() 176 | 177 | def run_cem(config): 178 | master = Master(config, verbose=0) 179 | master.run() 180 | num_last_episodes = int(config['num_episode'] * 0.1) 181 | score = np.mean([x['reward'] for x in master.reward_record[-num_last_episodes:]]) 182 | return score 183 | 184 | def grid_search(func, config): 185 | auto_seed = False 186 | if 'seed' in config and config['seed'] == 'auto': 187 | auto_seed = True 188 | if 'num_trials' in config: 189 | num_trials = config['num_trials'] 190 | else: 191 | num_trials = 1 192 | list_elements = [config[d] for d in config if type(config[d]) is list] 193 | list_names = [d for d in config if type(config[d]) is list] 194 | trials = [] 195 | for values in product(*list_elements): 196 | config.update({name: val for val, name in zip(values, list_names)}) 197 | logger.info('========try new config========') 198 | logger.info('config: {}'.format(config)) 199 | scores = [] 200 | for i in range(num_trials): 201 | try_config = config.copy() 202 | if auto_seed: 203 | try_config['seed'] = np.random.randint(1000) 204 | scores.append(func(try_config)) 205 | trials.append({'config': try_config, 'score': np.mean(scores) - np.std(scores)}) 206 | logger.info('score: {} (+/- {})'.format(np.mean(scores), np.std(scores))) 207 | return trials 208 | 209 | def run_single_and_plot(config, algo_name='CEM'): 210 | record_dfs = pd.DataFrame(columns=['steps', 'reward']) 211 | reward_cols = [] 212 | for i in range(config['num_trials']): 213 | config['seed'] = np.random.randint(1000) 214 | master = Master(config) 215 | master.run() 216 | reward_record = pd.DataFrame(master.reward_record) 217 | record_dfs = record_dfs.merge(reward_record, how='outer', on='steps', suffixes=('', '_{}'.format(i))) 218 | reward_cols.append('reward_{}'.format(i)) 219 | 
220 | record_dfs = record_dfs.drop(columns='reward').sort_values(by='steps', ascending=True).ffill().bfill() 221 | record_dfs['reward_mean'] = record_dfs[reward_cols].mean(axis=1) 222 | record_dfs['reward_std'] = record_dfs[reward_cols].std(axis=1) 223 | record_dfs['reward_smooth'] = record_dfs['reward_mean'].ewm(span=1000).mean() 224 | record_dfs['reward_smooth_std'] = record_dfs['reward_std'].ewm(span=1000).mean() 225 | record_dfs.to_csv(joindir(TUNE_DIR, '{}-record-{}.csv'.format(algo_name, config['env_name']))) 226 | 227 | # Plot 228 | plt.figure(figsize=(12, 6)) 229 | plt.plot(record_dfs['steps'], record_dfs['reward_smooth'], label='reward') 230 | plt.fill_between(record_dfs['steps'], record_dfs['reward_smooth'] - record_dfs['reward_smooth_std'], 231 | record_dfs['reward_smooth'] + record_dfs['reward_smooth_std'], color='b', alpha=0.2) 232 | plt.legend() 233 | plt.xlabel('steps of env interaction (sample complexity)') 234 | plt.ylabel('average reward') 235 | plt.title('{} on {}'.format(algo_name, config['env_name'])) 236 | plt.savefig(joindir(TUNE_DIR, '{}-plot-{}.pdf'.format(algo_name, config['env_name']))) 237 | 238 | if __name__ == '__main__': 239 | 240 | logger = Logger(joindir(LOG_DIR, 'log_cem.txt')) 241 | 242 | trials = grid_search(run_cem, config_hopper) 243 | 244 | best_trial = sorted(trials, key=lambda x: x['score'])[-1] 245 | best_config = best_trial['config'] 246 | best_score = best_trial['score'] 247 | 248 | with open(joindir(TUNE_DIR, 'ARS-{}.json'.format(best_config['env_name'])), 'w') as f: 249 | json.dump(best_config, f, indent=4, sort_keys=True) 250 | 251 | logger.info('========best solution found========') 252 | logger.info('best score: {}'.format(best_score)) 253 | logger.info('best config: {}'.format(best_config)) 254 | 255 | run_single_and_plot(best_config) 256 | -------------------------------------------------------------------------------- /code/dqn.py: -------------------------------------------------------------------------------- 1 | from pathos.multiprocessing import ProcessingPool as Pool 2 | import matplotlib 3 | matplotlib.use('agg') 4 | import matplotlib.pyplot as plt 5 | import matplotlib.patches as mpatches 6 | from matplotlib.colors import ListedColormap 7 | 8 | from tqdm import trange 9 | import pandas as pd 10 | import gym 11 | 12 | import torch 13 | from torch import nn 14 | from torch import optim 15 | from torch import Tensor 16 | from torch.autograd import Variable 17 | from torch.nn import functional as F 18 | from collections import deque 19 | import numpy as np 20 | import pdb 21 | import os 22 | 23 | env = gym.make('MountainCar-v0') 24 | sample_func = env.action_space.sample 25 | 26 | class MLP(nn.Module): 27 | def __init__(self): 28 | super(MLP, self).__init__() 29 | 30 | self.state_space = env.observation_space.shape[0] 31 | self.action_space = env.action_space.n 32 | self.hidden = 200 33 | self.fc1 = nn.Linear(self.state_space, self.hidden, bias=False) 34 | self.fc2 = nn.Linear(self.hidden, self.action_space, bias=False) 35 | 36 | def forward(self, x): 37 | model = torch.nn.Sequential( 38 | self.fc1, 39 | self.fc2, 40 | ) 41 | return model(x) 42 | 43 | def act(self, x): 44 | x = Tensor(x).unsqueeze(0) 45 | return int(self.forward(x).argmax(1)[0]) 46 | 47 | def act_egreedy(self, x, e=0.7, sample=sample_func): 48 | return self.act(x) if np.random.rand() > e else sample() 49 | 50 | def dqn(loss_type, target_freq, epsilon_decay, lr_decay_freq, lr): 51 | 52 | DIRS = 'test' 53 | os.makedirs(DIRS, exist_ok=True) 54 | 55 | identifier = 
'{}/{}_{}_{}_{}_{}'.format(DIRS, loss_type, target_freq, epsilon_decay, lr_decay_freq, lr) 56 | record = [] 57 | evaluate = [] 58 | buffer = deque(maxlen=100000) 59 | agent = MLP() 60 | agent_target = MLP() 61 | agent_target.load_state_dict(agent.state_dict()) 62 | opt = optim.SGD(agent.parameters(), lr=lr) 63 | sch = optim.lr_scheduler.StepLR(opt, step_size=lr_decay_freq, gamma=0.998) 64 | mseloss = nn.MSELoss() 65 | env = gym.make('MountainCar-v0') 66 | s = env.reset() 67 | batch_size = 128 68 | learn_start = 1000 69 | gamma = 0.998 70 | epsilon = 0.7 71 | total_steps = 200 * 100000 72 | 73 | maxpos = - 4 74 | reward = 0 75 | eplen = 0 76 | success_sofar = 0 77 | loss = 0 78 | 79 | for i in trange(total_steps): 80 | 81 | # sample a transition and store it to the replay buffer 82 | if i < learn_start: 83 | a = env.action_space.sample() 84 | else: 85 | a = agent.act_egreedy(s, e=epsilon) 86 | ns, r, d, info = env.step(a) 87 | buffer.append([s, a, r, ns, d]) 88 | reward += r 89 | eplen += 1 90 | if ns[0] > maxpos: 91 | maxpos = ns[0] 92 | if d: 93 | if reward != -200: 94 | success_sofar += 1 95 | evaluate.append(dict(i=i, reward=reward, eplen=eplen, maxpos=maxpos, 96 | epsilon=epsilon, success_sofar=success_sofar, lr=opt.param_groups[0]['lr'], 97 | loss=float(loss))) 98 | reward = 0 99 | eplen = 0 100 | maxpos = -4 101 | epsilon = max(0.01, epsilon * epsilon_decay) 102 | s = env.reset() 103 | else: 104 | s = ns 105 | 106 | if i >= learn_start and i % 4 == 0: 107 | 108 | # sample a batch from the replay buffer 109 | inds = np.random.choice(len(buffer), batch_size, replace=False) 110 | bs, ba, br, bns, bd = [], [], [], [], [] 111 | for ind in inds: 112 | ss, aa, rr, nsns, dd = buffer[ind] 113 | bs.append(ss) 114 | ba.append(aa) 115 | br.append(rr) 116 | bns.append(nsns) 117 | bd.append(dd) 118 | bs = Tensor(np.array(bs)) 119 | ba = torch.tensor(np.array(ba), dtype=torch.long) 120 | br = Tensor(np.array(br)) 121 | bns = Tensor(np.array(bns)) 122 | masks = Tensor(1 - np.array(bd) * 1) 123 | 124 | nsaction = agent(bns).argmax(1) 125 | Qtarget = (br + masks * gamma * agent_target(bns)[range(batch_size), nsaction]).detach() 126 | Qvalue = agent(bs)[range(batch_size), ba] 127 | if loss_type == 'MSE': 128 | loss = mseloss(Qvalue, Qtarget) 129 | elif loss_type == 'SL1': 130 | loss = F.smooth_l1_loss(Qvalue, Qtarget) 131 | agent.zero_grad() 132 | loss.backward() 133 | for param in agent.parameters(): 134 | param.grad.data.clamp_(-1, 1) 135 | # print('Finish the {}-th iteration, the loss = {}'.format(i, float(loss))) 136 | opt.step() 137 | sch.step() 138 | 139 | if i % target_freq == 0: 140 | agent_target.load_state_dict(agent.state_dict()) 141 | 142 | record.append(dict(i=i, loss=float(loss))) 143 | 144 | record = pd.DataFrame(record) 145 | evaluate = pd.DataFrame(evaluate) 146 | evaluate.to_csv('{}_episode.csv'.format(identifier)) 147 | 148 | # Plot training process 149 | plt.figure(figsize=(15, 5)) 150 | plt.subplot(241) 151 | plt.plot(record['i'][::10000], record['loss'][::10000]) 152 | plt.title('loss') 153 | plt.subplot(242) 154 | plt.plot(evaluate['i'][::200], evaluate['reward'][::200]) 155 | plt.title('reward') 156 | plt.subplot(243) 157 | plt.plot(evaluate['i'][::200], evaluate['eplen'][::200]) 158 | plt.title('eplen') 159 | plt.subplot(244) 160 | plt.plot(evaluate['i'][::200], evaluate['maxpos'][::200]) 161 | plt.title('maxpos') 162 | plt.subplot(245) 163 | plt.plot(evaluate['i'][::200], evaluate['epsilon'][::200]) 164 | plt.title('epsilon') 165 | plt.subplot(246) 166 | 
plt.plot(evaluate['i'][::200], evaluate['success_sofar'][::200]) 167 | plt.title('success_sofar') 168 | plt.subplot(247) 169 | plt.plot(evaluate['i'][::200], evaluate['lr'][::200]) 170 | plt.title('lr') 171 | plt.subplot(248) 172 | plt.plot(evaluate['i'][::200], evaluate['loss'][::200]) 173 | plt.title('loss') 174 | plt.savefig('{}_fig1.png'.format(identifier)) 175 | 176 | # Plot policy 177 | X = np.random.uniform(-1.2, 0.6, 10000) 178 | Y = np.random.uniform(-0.07, 0.07, 10000) 179 | Z = [] 180 | for i in range(len(X)): 181 | _, temp = torch.max( 182 | agent(Variable(torch.from_numpy(np.array([X[i],Y[i]]))).type(torch.FloatTensor)), dim =-1) 183 | z = temp.item() 184 | Z.append(z) 185 | Z = pd.Series(Z) 186 | colors = {0:'blue',1:'lime',2:'red'} 187 | colors = Z.apply(lambda x:colors[x]) 188 | labels = ['Left','Right','Nothing'] 189 | 190 | fig = plt.figure(3, figsize=[7,7]) 191 | ax = fig.gca() 192 | plt.set_cmap('brg') 193 | surf = ax.scatter(X,Y, c=Z) 194 | ax.set_xlabel('Position') 195 | ax.set_ylabel('Velocity') 196 | ax.set_title('Policy') 197 | recs = [] 198 | for i in range(0,3): 199 | recs.append(mpatches.Rectangle((0,0),1,1,fc=sorted(colors.unique())[i])) 200 | plt.legend(recs,labels,loc=4,ncol=3) 201 | plt.savefig('{}_fig2.png'.format(identifier)) 202 | 203 | if __name__ == '__main__': 204 | 205 | dqn(loss_type='MSE', target_freq=2000, epsilon_decay=0.998, lr_decay_freq=2000, lr=1e-4) 206 | -------------------------------------------------------------------------------- /code/ppo.py: -------------------------------------------------------------------------------- 1 | """ 2 | Implementation of PPO 3 | ref: Schulman, John, et al. "Proximal policy optimization algorithms." arXiv preprint arXiv:1707.06347 (2017). 4 | ref: https://github.com/Jiankai-Sun/Proximal-Policy-Optimization-in-Pytorch/blob/master/ppo.py 5 | ref: https://github.com/openai/baselines/tree/master/baselines/ppo2 6 | 7 | NOTICE: 8 | `Tensor2` means 2D-Tensor (num_samples, num_dims) 9 | """ 10 | 11 | import gym 12 | import torch 13 | import torch.nn as nn 14 | import torch.optim as opt 15 | from torch import Tensor 16 | from torch.autograd import Variable 17 | from collections import namedtuple 18 | from itertools import count 19 | import matplotlib 20 | matplotlib.use('agg') 21 | import matplotlib.pyplot as plt 22 | from os.path import join as joindir 23 | from os import makedirs as mkdir 24 | import pandas as pd 25 | import numpy as np 26 | import argparse 27 | import datetime 28 | import math 29 | 30 | 31 | Transition = namedtuple('Transition', ('state', 'value', 'action', 'logproba', 'mask', 'next_state', 'reward')) 32 | EPS = 1e-10 33 | RESULT_DIR = joindir('../result', '.'.join(__file__.split('.')[:-1])) 34 | mkdir(RESULT_DIR, exist_ok=True) 35 | 36 | 37 | class args(object): 38 | env_name = 'Hopper-v2' 39 | seed = 1234 40 | num_episode = 2000 41 | batch_size = 2048 42 | max_step_per_round = 2000 43 | gamma = 0.995 44 | lamda = 0.97 45 | log_num_episode = 1 46 | num_epoch = 10 47 | minibatch_size = 256 48 | clip = 0.2 49 | loss_coeff_value = 0.5 50 | loss_coeff_entropy = 0.01 51 | lr = 3e-4 52 | num_parallel_run = 5 53 | # tricks 54 | schedule_adam = 'linear' 55 | schedule_clip = 'linear' 56 | layer_norm = True 57 | state_norm = True 58 | advantage_norm = True 59 | lossvalue_norm = True 60 | 61 | 62 | class RunningStat(object): 63 | def __init__(self, shape): 64 | self._n = 0 65 | self._M = np.zeros(shape) 66 | self._S = np.zeros(shape) 67 | 68 | def push(self, x): 69 | x = np.asarray(x) 70 | assert x.shape 
== self._M.shape 71 | self._n += 1 72 | if self._n == 1: 73 | self._M[...] = x 74 | else: 75 | oldM = self._M.copy() 76 | self._M[...] = oldM + (x - oldM) / self._n 77 | self._S[...] = self._S + (x - oldM) * (x - self._M) 78 | 79 | @property 80 | def n(self): 81 | return self._n 82 | 83 | @property 84 | def mean(self): 85 | return self._M 86 | 87 | @property 88 | def var(self): 89 | return self._S / (self._n - 1) if self._n > 1 else np.square(self._M) 90 | 91 | @property 92 | def std(self): 93 | return np.sqrt(self.var) 94 | 95 | @property 96 | def shape(self): 97 | return self._M.shape 98 | 99 | 100 | class ZFilter: 101 | """ 102 | y = (x-mean)/std 103 | using running estimates of mean,std 104 | """ 105 | 106 | def __init__(self, shape, demean=True, destd=True, clip=10.0): 107 | self.demean = demean 108 | self.destd = destd 109 | self.clip = clip 110 | 111 | self.rs = RunningStat(shape) 112 | 113 | def __call__(self, x, update=True): 114 | if update: self.rs.push(x) 115 | if self.demean: 116 | x = x - self.rs.mean 117 | if self.destd: 118 | x = x / (self.rs.std + 1e-8) 119 | if self.clip: 120 | x = np.clip(x, -self.clip, self.clip) 121 | return x 122 | 123 | def output_shape(self, input_space): 124 | return input_space.shape 125 | 126 | 127 | class ActorCritic(nn.Module): 128 | def __init__(self, num_inputs, num_outputs, layer_norm=True): 129 | super(ActorCritic, self).__init__() 130 | 131 | self.actor_fc1 = nn.Linear(num_inputs, 64) 132 | self.actor_fc2 = nn.Linear(64, 64) 133 | self.actor_fc3 = nn.Linear(64, num_outputs) 134 | self.actor_logstd = nn.Parameter(torch.zeros(1, num_outputs)) 135 | 136 | self.critic_fc1 = nn.Linear(num_inputs, 64) 137 | self.critic_fc2 = nn.Linear(64, 64) 138 | self.critic_fc3 = nn.Linear(64, 1) 139 | 140 | if layer_norm: 141 | self.layer_norm(self.actor_fc1, std=1.0) 142 | self.layer_norm(self.actor_fc2, std=1.0) 143 | self.layer_norm(self.actor_fc3, std=0.01) 144 | 145 | self.layer_norm(self.critic_fc1, std=1.0) 146 | self.layer_norm(self.critic_fc2, std=1.0) 147 | self.layer_norm(self.critic_fc3, std=1.0) 148 | 149 | @staticmethod 150 | def layer_norm(layer, std=1.0, bias_const=0.0): 151 | torch.nn.init.orthogonal_(layer.weight, std) 152 | torch.nn.init.constant_(layer.bias, bias_const) 153 | 154 | def forward(self, states): 155 | """ 156 | run policy network (actor) as well as value network (critic) 157 | :param states: a Tensor2 represents states 158 | :return: 3 Tensor2 159 | """ 160 | action_mean, action_logstd = self._forward_actor(states) 161 | critic_value = self._forward_critic(states) 162 | return action_mean, action_logstd, critic_value 163 | 164 | def _forward_actor(self, states): 165 | x = torch.tanh(self.actor_fc1(states)) 166 | x = torch.tanh(self.actor_fc2(x)) 167 | action_mean = self.actor_fc3(x) 168 | action_logstd = self.actor_logstd.expand_as(action_mean) 169 | return action_mean, action_logstd 170 | 171 | def _forward_critic(self, states): 172 | x = torch.tanh(self.critic_fc1(states)) 173 | x = torch.tanh(self.critic_fc2(x)) 174 | critic_value = self.critic_fc3(x) 175 | return critic_value 176 | 177 | def select_action(self, action_mean, action_logstd, return_logproba=True): 178 | """ 179 | given mean and std, sample an action from normal(mean, std) 180 | also returns probability of the given chosen 181 | """ 182 | action_std = torch.exp(action_logstd) 183 | action = torch.normal(action_mean, action_std) 184 | if return_logproba: 185 | logproba = self._normal_logproba(action, action_mean, action_logstd, action_std) 186 | return action, 
logproba 187 | 188 | @staticmethod 189 | def _normal_logproba(x, mean, logstd, std=None): 190 | if std is None: 191 | std = torch.exp(logstd) 192 | 193 | std_sq = std.pow(2) 194 | logproba = - 0.5 * math.log(2 * math.pi) - logstd - (x - mean).pow(2) / (2 * std_sq) 195 | return logproba.sum(1) 196 | 197 | def get_logproba(self, states, actions): 198 | """ 199 | return probability of chosen the given actions under corresponding states of current network 200 | :param states: Tensor 201 | :param actions: Tensor 202 | """ 203 | action_mean, action_logstd = self._forward_actor(states) 204 | logproba = self._normal_logproba(actions, action_mean, action_logstd) 205 | return logproba 206 | 207 | 208 | class Memory(object): 209 | def __init__(self): 210 | self.memory = [] 211 | 212 | def push(self, *args): 213 | self.memory.append(Transition(*args)) 214 | 215 | def sample(self): 216 | return Transition(*zip(*self.memory)) 217 | 218 | def __len__(self): 219 | return len(self.memory) 220 | 221 | def ppo(args): 222 | env = gym.make(args.env_name) 223 | num_inputs = env.observation_space.shape[0] 224 | num_actions = env.action_space.shape[0] 225 | 226 | env.seed(args.seed) 227 | torch.manual_seed(args.seed) 228 | 229 | network = ActorCritic(num_inputs, num_actions, layer_norm=args.layer_norm) 230 | optimizer = opt.Adam(network.parameters(), lr=args.lr) 231 | 232 | running_state = ZFilter((num_inputs,), clip=5.0) 233 | 234 | # record average 1-round cumulative reward in every episode 235 | reward_record = [] 236 | global_steps = 0 237 | 238 | lr_now = args.lr 239 | clip_now = args.clip 240 | 241 | for i_episode in range(args.num_episode): 242 | # step1: perform current policy to collect trajectories 243 | # this is an on-policy method! 244 | memory = Memory() 245 | num_steps = 0 246 | reward_list = [] 247 | len_list = [] 248 | while num_steps < args.batch_size: 249 | state = env.reset() 250 | if args.state_norm: 251 | state = running_state(state) 252 | reward_sum = 0 253 | for t in range(args.max_step_per_round): 254 | action_mean, action_logstd, value = network(Tensor(state).unsqueeze(0)) 255 | action, logproba = network.select_action(action_mean, action_logstd) 256 | action = action.data.numpy()[0] 257 | logproba = logproba.data.numpy()[0] 258 | next_state, reward, done, _ = env.step(action) 259 | reward_sum += reward 260 | if args.state_norm: 261 | next_state = running_state(next_state) 262 | mask = 0 if done else 1 263 | 264 | memory.push(state, value, action, logproba, mask, next_state, reward) 265 | 266 | if done: 267 | break 268 | 269 | state = next_state 270 | 271 | num_steps += (t + 1) 272 | global_steps += (t + 1) 273 | reward_list.append(reward_sum) 274 | len_list.append(t + 1) 275 | reward_record.append({ 276 | 'episode': i_episode, 277 | 'steps': global_steps, 278 | 'meanepreward': np.mean(reward_list), 279 | 'meaneplen': np.mean(len_list)}) 280 | 281 | batch = memory.sample() 282 | batch_size = len(memory) 283 | 284 | # step2: extract variables from trajectories 285 | rewards = Tensor(batch.reward) 286 | values = Tensor(batch.value) 287 | masks = Tensor(batch.mask) 288 | actions = Tensor(batch.action) 289 | states = Tensor(batch.state) 290 | oldlogproba = Tensor(batch.logproba) 291 | 292 | returns = Tensor(batch_size) 293 | deltas = Tensor(batch_size) 294 | advantages = Tensor(batch_size) 295 | 296 | prev_return = 0 297 | prev_value = 0 298 | prev_advantage = 0 299 | for i in reversed(range(batch_size)): 300 | returns[i] = rewards[i] + args.gamma * prev_return * masks[i] 301 | deltas[i] = 
rewards[i] + args.gamma * prev_value * masks[i] - values[i] 302 | # ref: https://arxiv.org/pdf/1506.02438.pdf (generalization advantage estimate) 303 | advantages[i] = deltas[i] + args.gamma * args.lamda * prev_advantage * masks[i] 304 | 305 | prev_return = returns[i] 306 | prev_value = values[i] 307 | prev_advantage = advantages[i] 308 | if args.advantage_norm: 309 | advantages = (advantages - advantages.mean()) / (advantages.std() + EPS) 310 | 311 | for i_epoch in range(int(args.num_epoch * batch_size / args.minibatch_size)): 312 | # sample from current batch 313 | minibatch_ind = np.random.choice(batch_size, args.minibatch_size, replace=False) 314 | minibatch_states = states[minibatch_ind] 315 | minibatch_actions = actions[minibatch_ind] 316 | minibatch_oldlogproba = oldlogproba[minibatch_ind] 317 | minibatch_newlogproba = network.get_logproba(minibatch_states, minibatch_actions) 318 | minibatch_advantages = advantages[minibatch_ind] 319 | minibatch_returns = returns[minibatch_ind] 320 | minibatch_newvalues = network._forward_critic(minibatch_states).flatten() 321 | 322 | ratio = torch.exp(minibatch_newlogproba - minibatch_oldlogproba) 323 | surr1 = ratio * minibatch_advantages 324 | surr2 = ratio.clamp(1 - clip_now, 1 + clip_now) * minibatch_advantages 325 | loss_surr = - torch.mean(torch.min(surr1, surr2)) 326 | 327 | # not sure the value loss should be clipped as well 328 | # clip example: https://github.com/Jiankai-Sun/Proximal-Policy-Optimization-in-Pytorch/blob/master/ppo.py 329 | # however, it does not make sense to clip score-like value by a dimensionless clipping parameter 330 | # moreover, original paper does not mention clipped value 331 | if args.lossvalue_norm: 332 | minibatch_return_6std = 6 * minibatch_returns.std() 333 | loss_value = torch.mean((minibatch_newvalues - minibatch_returns).pow(2)) / minibatch_return_6std 334 | else: 335 | loss_value = torch.mean((minibatch_newvalues - minibatch_returns).pow(2)) 336 | 337 | loss_entropy = torch.mean(torch.exp(minibatch_newlogproba) * minibatch_newlogproba) 338 | 339 | total_loss = loss_surr + args.loss_coeff_value * loss_value + args.loss_coeff_entropy * loss_entropy 340 | optimizer.zero_grad() 341 | total_loss.backward() 342 | optimizer.step() 343 | 344 | if args.schedule_clip == 'linear': 345 | ep_ratio = 1 - (i_episode / args.num_episode) 346 | clip_now = args.clip * ep_ratio 347 | 348 | if args.schedule_adam == 'linear': 349 | ep_ratio = 1 - (i_episode / args.num_episode) 350 | lr_now = args.lr * ep_ratio 351 | # set learning rate 352 | # ref: https://stackoverflow.com/questions/48324152/ 353 | for g in optimizer.param_groups: 354 | g['lr'] = lr_now 355 | 356 | if i_episode % args.log_num_episode == 0: 357 | print('Finished episode: {} Reward: {:.4f} total_loss = {:.4f} = {:.4f} + {} * {:.4f} + {} * {:.4f}' \ 358 | .format(i_episode, reward_record[-1]['meanepreward'], total_loss.data, loss_surr.data, args.loss_coeff_value, 359 | loss_value.data, args.loss_coeff_entropy, loss_entropy.data)) 360 | print('-----------------') 361 | 362 | return reward_record 363 | 364 | def test(args): 365 | record_dfs = [] 366 | for i in range(args.num_parallel_run): 367 | args.seed += 1 368 | reward_record = pd.DataFrame(ppo(args)) 369 | reward_record['#parallel_run'] = i 370 | record_dfs.append(reward_record) 371 | record_dfs = pd.concat(record_dfs, axis=0) 372 | record_dfs.to_csv(joindir(RESULT_DIR, 'ppo-record-{}.csv'.format(args.env_name))) 373 | 374 | if __name__ == '__main__': 375 | 376 | for env in ['Walker2d-v2', 'Swimmer-v2', 
'Hopper-v2', 'Humanoid-v2', 'HalfCheetah-v2', 'Reacher-v2']:
377 |         args.env_name = env
378 |         test(args)
379 | 
380 | 
--------------------------------------------------------------------------------
/code/vpg.py:
--------------------------------------------------------------------------------
1 | """
2 | Implementation of Vanilla Policy Gradient
3 | 
4 | This is a policy gradient with a state value function baseline.
5 | Each iteration, trajectories are sampled and the returns are calculated.
6 | The state value function approximator is fitted to these returns, and
7 | the policy gradient is taken with this baseline subtracted.
8 | 
9 | The actor outputs a mean and std. To maintain exploration, we add an
10 | entropy loss to the actor.
11 | 
12 | ref: http://rail.eecs.berkeley.edu/deeprlcourse/static/homeworks/hw2.pdf
13 | 
14 | NOTICE:
15 |     `Tensor2` means 2D-Tensor (num_samples, num_dims)
16 | """
17 | 
18 | import gym
19 | import torch
20 | import torch.nn as nn
21 | import torch.optim as opt
22 | from torch import Tensor
23 | from torch.autograd import Variable
24 | from collections import deque, namedtuple
25 | from itertools import count
26 | import scipy.optimize as sciopt
27 | import matplotlib
28 | matplotlib.use('agg')
29 | import matplotlib.pyplot as plt
30 | from os.path import join as joindir
31 | import pandas as pd
32 | import numpy as np
33 | import argparse
34 | import datetime
35 | import math
36 | 
37 | 
38 | Transition = namedtuple('Transition', ('state', 'action', 'action_mean', 'action_logstd', 'mask', 'next_state', 'reward'))
39 | EPS = 1e-10
40 | RESULT_DIR = '../result'
41 | 
42 | 
43 | class args(object):
44 |     env_name = 'Hopper-v2'
45 |     seed = 1234
46 |     num_episode = 100
47 |     max_step_per_round = 200
48 |     batch_size = 5000
49 |     gamma = 0.995
50 |     log_num_episode = 1
51 |     loss_coeff_entropy = 1e-3
52 |     lr = 1e-4
53 |     hidden_size = 32
54 |     initial_policy_logstd = -1.20397
55 |     num_opt_value_each_episode = 100
56 |     num_opt_actor_each_episode = 10
57 |     num_parallel_run = 5
58 | 
59 | 
60 | def add_arguments():
61 |     parser = argparse.ArgumentParser()
62 |     parser.add_argument('--env_name', type=str, default='Hopper-v2')
63 |     parser.add_argument('--seed', type=int, default=1234)
64 |     parser.add_argument('--num_episode', type=int, default=1000)
65 |     parser.add_argument('--max_step_per_round', type=int, default=200)
66 |     parser.add_argument('--batch_size', type=int, default=5000)
67 |     parser.add_argument('--gamma', type=float, default=0.995)
68 |     parser.add_argument('--log_num_episode', type=int, default=1)
69 |     parser.add_argument('--loss_coeff_entropy', type=float, default=1e-3)
70 |     parser.add_argument('--lr', type=float, default=1e-4)
71 |     parser.add_argument('--hidden_size', type=int, default=32)
72 |     parser.add_argument('--initial_policy_logstd', type=float, default=-1.20397)
73 |     parser.add_argument('--num_opt_value_each_episode', type=int, default=100)
74 |     parser.add_argument('--num_opt_actor_each_episode', type=int, default=10)
75 |     parser.add_argument('--num_parallel_run', type=int, default=5)
76 | 
77 |     args = parser.parse_args()
78 |     return args
79 | 
80 | class RunningStat(object):
81 |     def __init__(self, shape):
82 |         self._n = 0
83 |         self._M = np.zeros(shape)
84 |         self._S = np.zeros(shape)
85 | 
86 |     def push(self, x):
87 |         x = np.asarray(x)
88 |         assert x.shape == self._M.shape
89 |         self._n += 1
90 |         if self._n == 1:
91 |             self._M[...] = x
92 |         else:
93 |             oldM = self._M.copy()
94 |             self._M[...] = oldM + (x - oldM) / self._n
95 |             self._S[...] 
= self._S + (x - oldM) * (x - self._M) 96 | 97 | @property 98 | def n(self): 99 | return self._n 100 | 101 | @property 102 | def mean(self): 103 | return self._M 104 | 105 | @property 106 | def var(self): 107 | return self._S / (self._n - 1) if self._n > 1 else np.square(self._M) 108 | 109 | @property 110 | def std(self): 111 | return np.sqrt(self.var) 112 | 113 | @property 114 | def shape(self): 115 | return self._M.shape 116 | 117 | 118 | class ZFilter: 119 | """ 120 | y = (x-mean)/std 121 | using running estimates of mean,std 122 | """ 123 | 124 | def __init__(self, shape, demean=True, destd=True, clip=10.0): 125 | self.demean = demean 126 | self.destd = destd 127 | self.clip = clip 128 | 129 | self.rs = RunningStat(shape) 130 | 131 | def __call__(self, x, update=True): 132 | if update: self.rs.push(x) 133 | if self.demean: 134 | x = x - self.rs.mean 135 | if self.destd: 136 | x = x / (self.rs.std + 1e-8) 137 | if self.clip: 138 | x = np.clip(x, -self.clip, self.clip) 139 | return x 140 | 141 | def output_shape(self, input_space): 142 | return input_space.shape 143 | 144 | 145 | class Memory(object): 146 | def __init__(self): 147 | self.memory = [] 148 | 149 | def push(self, transition): 150 | self.memory.append(transition) 151 | 152 | def sample(self, do_reverse=True): 153 | if do_reverse: 154 | return Transition(*zip(*reversed(self.memory))) 155 | else: 156 | return Transition(*zip(*self.memory)) 157 | 158 | def __len__(self): 159 | return len(self.memory) 160 | 161 | 162 | class Actor(nn.Module): 163 | def __init__(self, dim_states, dim_actions): 164 | super(Actor, self).__init__() 165 | 166 | self.fc1 = nn.Linear(dim_states, args.hidden_size) 167 | self.fc2 = nn.Linear(args.hidden_size, args.hidden_size) 168 | self.fc_mean = nn.Linear(args.hidden_size, dim_actions) 169 | self.fc_logstd = nn.Parameter(args.initial_policy_logstd * torch.ones(1, dim_actions), requires_grad=False) 170 | 171 | def forward(self, states): 172 | """ 173 | given a states returns the action distribution (gaussian) with mean and logstd 174 | :param states: a Tensor2 represents states 175 | :return: Tensor2 action mean and logstd 176 | """ 177 | x = torch.relu(self.fc1(states)) 178 | x = torch.relu(self.fc2(x)) 179 | action_mean = self.fc_mean(x) 180 | action_logstd = self.fc_logstd.expand_as(action_mean) 181 | return action_mean, action_logstd 182 | 183 | @ staticmethod 184 | def select_action(action_mean, action_logstd): 185 | """ 186 | given mean and std, sample an action from normal(mean, std) 187 | also returns probability of the given chosen 188 | :param action_mean: Tensor2 189 | :param action_logstd: Tensor2 190 | :return: Tensor2 action 191 | """ 192 | action_std = torch.exp(action_logstd) 193 | action = torch.normal(action_mean, action_std) 194 | return action 195 | 196 | @staticmethod 197 | def normal_logproba(x, mean, logstd, std=None): 198 | if std is None: 199 | std = torch.exp(logstd) 200 | 201 | std_sq = std.pow(2) 202 | logproba = - 0.5 * math.log(2 * math.pi) - logstd - (x - mean).pow(2) / (2 * std_sq) 203 | return logproba.sum(1).view(-1, 1) 204 | 205 | class Baseline(nn.Module): 206 | def __init__(self, dim_states): 207 | super(Baseline, self).__init__() 208 | 209 | self.fc1 = nn.Linear(dim_states, args.hidden_size) 210 | self.fc2 = nn.Linear(args.hidden_size, args.hidden_size) 211 | self.fc3 = nn.Linear(args.hidden_size, 1) 212 | 213 | def forward(self, states): 214 | """ 215 | given states returns its approximated state value function 216 | :param states: a Tensor2 represents states 217 | 
:return: Tensor2 state value function
218 |         """
219 |         x = torch.relu(self.fc1(states))
220 |         x = torch.relu(self.fc2(x))
221 |         values = torch.relu(self.fc3(x))
222 |         return values
223 | 
224 | 
225 | def vpg():
226 |     env = gym.make(args.env_name)
227 |     dim_states = env.observation_space.shape[0]
228 |     dim_actions = env.action_space.shape[0]
229 | 
230 |     env.seed(args.seed)
231 |     torch.manual_seed(args.seed)
232 | 
233 |     actor = Actor(dim_states, dim_actions)
234 |     baseline = Baseline(dim_states)
235 |     optimizer_a = opt.Adam(actor.parameters(), lr=args.lr)
236 |     optimizer_b = opt.Adam(baseline.parameters(), lr=args.lr)
237 |     running_state = ZFilter((dim_states,), clip=5)
238 | 
239 |     reward_record = []
240 |     global_steps = 0
241 | 
242 |     for i_episode in range(args.num_episode):
243 | 
244 |         memory = Memory()
245 |         num_steps = 0
246 |         reward_sum_list = []
247 |         while num_steps < args.batch_size:
248 |             state = env.reset()
249 |             state = running_state(state)
250 |             reward_sum = 0
251 |             for t in range(args.max_step_per_round):
252 |                 action_mean, action_logstd = actor(Tensor(state).unsqueeze(0))
253 |                 action = actor.select_action(action_mean, action_logstd)
254 |                 action = action.data.numpy()[0]
255 |                 next_state, reward, done, info = env.step(action)
256 |                 reward_sum += reward
257 |                 next_state = running_state(next_state)
258 |                 mask = 0 if done else 1
259 | 
260 |                 memory.push(Transition(
261 |                     state=state, action=action, action_mean=action_mean, action_logstd=action_logstd,
262 |                     mask=mask, next_state=next_state, reward=reward
263 |                 ))
264 | 
265 |                 if done:
266 |                     break
267 | 
268 |                 state = next_state
269 | 
270 |             reward_sum_list.append(reward_sum)
271 | 
272 |             num_steps += (t + 1)
273 |             global_steps += (t + 1)
274 |             reward_record.append({'steps': global_steps, 'reward': reward_sum})
275 | 
276 |         batch = memory.sample()
277 |         batch_size = len(memory)
278 | 
279 |         states = Tensor(batch.state)
280 |         actions = Tensor(batch.action)
281 |         action_means = torch.cat(batch.action_mean)
282 |         action_logstds = torch.cat(batch.action_logstd)
283 |         masks = Tensor(batch.mask).view(-1, 1)
284 |         next_states = Tensor(batch.next_state)
285 |         rewards = Tensor(batch.reward).view(-1, 1)
286 | 
287 |         returns = torch.zeros(batch_size, 1)
288 |         returns[0] = rewards[0]
289 |         # notice the trajectory is already reversed
290 |         for i in range(1, batch_size):
291 |             returns[i] = rewards[i] + args.gamma * returns[i - 1] * masks[i]
292 | 
293 |         for i in range(args.num_opt_value_each_episode):
294 |             optimizer_b.zero_grad()
295 |             values = baseline(Variable(states))
296 |             loss_value = (Variable(returns) - values).pow(2).mean()
297 |             loss_value.backward()
298 |             optimizer_b.step()
299 | 
300 |         for i in range(args.num_opt_actor_each_episode):
301 |             optimizer_a.zero_grad()
302 |             action_means, action_logstds = actor(Variable(states))
303 |             logprobas = actor.normal_logproba(Variable(actions), action_means, action_logstds)
304 |             loss_policy = - ((returns - values).detach() * logprobas).mean()
305 |             loss_policy.backward()
306 |             optimizer_a.step()
307 | 
308 |         if i_episode % args.log_num_episode == 0:
309 |             print('Finished episode: {} steps: {} AvgReward: {:.4f} loss = value({:.4f}) + policy({:.4f})' \
310 |                 .format(i_episode, reward_record[-1]['steps'], np.mean(reward_sum_list), loss_value.data, loss_policy.data))
311 |             print('-----------------')
312 | 
313 |     return reward_record
314 | 
315 | if __name__ == '__main__':
316 |     datestr = datetime.datetime.now().strftime('%Y-%m-%d')
317 |     args = add_arguments()
318 | 
319 |     record_dfs = pd.DataFrame(columns=['steps', 
'reward']) 320 | reward_cols = [] 321 | for i in range(args.num_parallel_run): 322 | args.seed += 1 323 | reward_record = pd.DataFrame(vpg()) 324 | record_dfs = record_dfs.merge(reward_record, how='outer', on='steps', suffixes=('', '_{}'.format(i))) 325 | reward_cols.append('reward_{}'.format(i)) 326 | 327 | record_dfs = record_dfs.drop(columns='reward').sort_values(by='steps', ascending=True).ffill().bfill() 328 | record_dfs['reward_mean'] = record_dfs[reward_cols].mean(axis=1) 329 | record_dfs['reward_std'] = record_dfs[reward_cols].std(axis=1) 330 | record_dfs['reward_smooth'] = record_dfs['reward_mean'].ewm(span=1000).mean() 331 | record_dfs['reward_smooth_std'] = record_dfs['reward_std'].ewm(span=1000).mean() 332 | record_dfs.to_csv(joindir(RESULT_DIR, 'vpg-record-{}-{}.csv'.format(args.env_name, datestr))) 333 | 334 | # Plot 335 | plt.figure(figsize=(12, 6)) 336 | plt.plot(record_dfs['steps'], record_dfs['reward_smooth'], label='reward') 337 | plt.fill_between(record_dfs['steps'], record_dfs['reward_smooth'] - record_dfs['reward_smooth_std'], 338 | record_dfs['reward_smooth'] + record_dfs['reward_smooth_std'], color='b', alpha=0.2) 339 | plt.legend() 340 | plt.xlabel('steps of env interaction (sample complexity)') 341 | plt.ylabel('average reward') 342 | plt.title('VPG on {}'.format(args.env_name)) 343 | plt.savefig(joindir(RESULT_DIR, 'vpg-{}-{}.pdf'.format(args.env_name, datestr))) 344 | -------------------------------------------------------------------------------- /docs/ppo_experiments.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/zhangchuheng123/Reinforcement-Implementation/c04e0df10ec29ef775ea31395a8ad4b917302d24/docs/ppo_experiments.png -------------------------------------------------------------------------------- /docs/rainbow.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/zhangchuheng123/Reinforcement-Implementation/c04e0df10ec29ef775ea31395a8ad4b917302d24/docs/rainbow.png --------------------------------------------------------------------------------
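
A few illustrative sketches follow; none of them are files in this repository. The first exercises the advantage computation and clipped surrogate used in code/ppo.py (the reversed GAE recursion and the ratio clamp) in isolation on synthetic data. The helper names `compute_gae` and `ppo_clip_loss` are hypothetical, and the gamma/lamda/clip values simply mirror the ppo.py defaults.

```python
# Minimal, self-contained sketch of the GAE recursion and PPO clipped
# surrogate as used in code/ppo.py. Helper names are hypothetical and
# not part of this repository; the data below is synthetic.
import torch

def compute_gae(rewards, values, masks, gamma=0.995, lam=0.97):
    # Reversed recursion over one batch of transitions, mirroring ppo.py:
    #   delta_t = r_t + gamma * V(s_{t+1}) * mask_t - V(s_t)
    #   A_t     = delta_t + gamma * lam * mask_t * A_{t+1}
    T = rewards.shape[0]
    advantages = torch.zeros(T)
    returns = torch.zeros(T)
    prev_return, prev_value, prev_adv = 0.0, 0.0, 0.0
    for t in reversed(range(T)):
        returns[t] = rewards[t] + gamma * prev_return * masks[t]
        delta = rewards[t] + gamma * prev_value * masks[t] - values[t]
        advantages[t] = delta + gamma * lam * prev_adv * masks[t]
        prev_return, prev_value, prev_adv = returns[t], values[t], advantages[t]
    return advantages, returns

def ppo_clip_loss(new_logproba, old_logproba, advantages, clip=0.2):
    # Clipped surrogate objective: take the minimum of the unclipped and
    # clipped probability-ratio terms, then negate to obtain a loss.
    ratio = torch.exp(new_logproba - old_logproba)
    surr1 = ratio * advantages
    surr2 = ratio.clamp(1.0 - clip, 1.0 + clip) * advantages
    return -torch.min(surr1, surr2).mean()

if __name__ == '__main__':
    torch.manual_seed(0)
    T = 8
    rewards = torch.randn(T)
    values = torch.randn(T)
    masks = torch.ones(T)                            # no episode boundary in this toy batch
    adv, ret = compute_gae(rewards, values, masks)
    adv = (adv - adv.mean()) / (adv.std() + 1e-10)   # advantage normalization, as in ppo.py
    old_lp = torch.randn(T)
    new_lp = old_lp + 0.05 * torch.randn(T)
    print('loss_surr =', float(ppo_clip_loss(new_lp, old_lp, adv)))
```

Taking the minimum of the unclipped and clipped surrogates gives a pessimistic estimate of the policy improvement, which is what removes the incentive for large probability-ratio updates.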
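
Similarly, the target construction in code/dqn.py selects next-state actions with the online network and evaluates them with the target network (double Q-learning). The sketch below is a toy illustration, not repository code: `double_dqn_target` is a hypothetical helper, and the linear networks and shapes are placeholders.

```python
# Toy sketch of the double-DQN target used in code/dqn.py: the online
# network picks argmax actions for the next states, the target network
# evaluates them. Names and shapes here are illustrative only.
import torch
import torch.nn as nn

def double_dqn_target(online, target, br, bns, masks, gamma=0.998):
    with torch.no_grad():
        idx = torch.arange(bns.shape[0])
        next_actions = online(bns).argmax(dim=1)   # action selection: online net
        next_q = target(bns)[idx, next_actions]    # action evaluation: target net
        return br + masks * gamma * next_q

if __name__ == '__main__':
    torch.manual_seed(0)
    batch, state_dim, n_actions = 4, 2, 3           # MountainCar-like dimensions
    online = nn.Linear(state_dim, n_actions)
    target = nn.Linear(state_dim, n_actions)
    target.load_state_dict(online.state_dict())     # start with a synced target, as dqn.py does
    bns = torch.randn(batch, state_dim)
    br = torch.randn(batch)
    masks = torch.ones(batch)
    q_target = double_dqn_target(online, target, br, bns, masks)
    print(q_target.shape)                           # torch.Size([4])
```

Decoupling action selection from evaluation is what reduces the overestimation bias of the vanilla DQN target.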
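
Finally, both ppo.py and vpg.py normalize observations with the ZFilter/RunningStat pair, which maintains a running mean and variance via a Welford-style update. The snippet below is an illustrative re-implementation checked against numpy batch statistics; `RunningMeanVar` is not a class from this repository.

```python
# Minimal check of the Welford-style running mean/variance update used by
# RunningStat in code/ppo.py and code/vpg.py, compared against numpy.
# This is an illustrative re-implementation, not an import from the repo.
import numpy as np

class RunningMeanVar:
    def __init__(self, shape):
        self.n = 0
        self.mean = np.zeros(shape)
        self.m2 = np.zeros(shape)                # running sum of squared deviations

    def push(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n              # update the mean incrementally
        self.m2 += delta * (x - self.mean)       # uses old and new mean, as in RunningStat.push

    @property
    def var(self):
        return self.m2 / (self.n - 1) if self.n > 1 else np.square(self.mean)

if __name__ == '__main__':
    rng = np.random.default_rng(0)
    data = rng.normal(size=(1000, 3))
    rs = RunningMeanVar(3)
    for row in data:
        rs.push(row)
    assert np.allclose(rs.mean, data.mean(axis=0))
    assert np.allclose(rs.var, data.var(axis=0, ddof=1))
    print('running mean/var matches batch statistics')
```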