├── .github └── workflows │ └── codeql-analysis.yml ├── README.md ├── SECURITY.md ├── code ├── .DS_Store ├── RND │ ├── agents.py │ ├── bash.sh │ ├── envs.py │ ├── model.py │ ├── requirements.txt │ ├── train.py │ └── utils.py ├── Rainbow │ ├── .DS_Store │ ├── README.md │ ├── agent.py │ ├── bash.sh │ ├── env.py │ ├── main.py │ ├── memory.py │ ├── model.py │ ├── requirements.txt │ └── test.py ├── SAC-discrete │ ├── .DS_Store │ ├── bash.sh │ ├── config │ │ ├── .DS_Store │ │ └── default.yaml │ ├── env.py │ ├── memory.py │ ├── train.py │ └── utils.py ├── a2c.py ├── acer.py ├── ars.py ├── ars_tune.py ├── cem.py ├── cem_tune.py ├── dqn.py ├── ppo.py ├── trpo.py └── vpg.py └── docs ├── ppo_experiments.png └── rainbow.png /.github/workflows/codeql-analysis.yml: -------------------------------------------------------------------------------- 1 | # For most projects, this workflow file will not need changing; you simply need 2 | # to commit it to your repository. 3 | # 4 | # You may wish to alter this file to override the set of languages analyzed, 5 | # or to provide custom queries or build logic. 6 | # 7 | # ******** NOTE ******** 8 | # We have attempted to detect the languages in your repository. Please check 9 | # the `language` matrix defined below to confirm you have the correct set of 10 | # supported CodeQL languages. 11 | # 12 | name: "CodeQL" 13 | 14 | on: 15 | push: 16 | branches: [ master ] 17 | pull_request: 18 | # The branches below must be a subset of the branches above 19 | branches: [ master ] 20 | schedule: 21 | - cron: '20 8 * * 1' 22 | 23 | jobs: 24 | analyze: 25 | name: Analyze 26 | runs-on: ubuntu-latest 27 | 28 | strategy: 29 | fail-fast: false 30 | matrix: 31 | language: [ 'python' ] 32 | # CodeQL supports [ 'cpp', 'csharp', 'go', 'java', 'javascript', 'python' ] 33 | # Learn more: 34 | # https://docs.github.com/en/free-pro-team@latest/github/finding-security-vulnerabilities-and-errors-in-your-code/configuring-code-scanning#changing-the-languages-that-are-analyzed 35 | 36 | steps: 37 | - name: Checkout repository 38 | uses: actions/checkout@v2 39 | 40 | # Initializes the CodeQL tools for scanning. 41 | - name: Initialize CodeQL 42 | uses: github/codeql-action/init@v1 43 | with: 44 | languages: ${{ matrix.language }} 45 | # If you wish to specify custom queries, you can do so here or in a config file. 46 | # By default, queries listed here will override any specified in a config file. 47 | # Prefix the list here with "+" to use these queries and those in the config file. 48 | # queries: ./path/to/local/query, your-org/your-repo/queries@main 49 | 50 | # Autobuild attempts to build any compiled languages (C/C++, C#, or Java). 51 | # If this step fails, then you should remove it and run the build manually (see below) 52 | - name: Autobuild 53 | uses: github/codeql-action/autobuild@v1 54 | 55 | # ℹ️ Command-line programs to run using the OS shell. 
56 | # 📚 https://git.io/JvXDl 57 | 58 | # ✏️ If the Autobuild fails above, remove it and uncomment the following three lines 59 | # and modify them (or add more) to build your code if your project 60 | # uses a compiled language 61 | 62 | #- run: | 63 | # make bootstrap 64 | # make release 65 | 66 | - name: Perform CodeQL Analysis 67 | uses: github/codeql-action/analyze@v1 68 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Reinforcement-Implementation 2 | 3 | This project aims to reproduce the results of several model-free RL algorithms in continuous action domains (MuJoCo environments). 4 | 5 | This project 6 | * uses the PyTorch package 7 | * implements different algorithms independently in separate, minimal files 8 | * is written in the simplest style 9 | * tries to follow the original papers and reproduce their results 10 | 11 | My first stage of work is to reproduce this figure from the PPO paper. 12 | 13 | ![](docs/ppo_experiments.png) 14 | 15 | - [x] A2C 16 | - [x] ACER (A2C + Trust Region): It seems that this implementation has some problems ... (bug reports welcome) 17 | - [X] CEM 18 | - [x] TRPO (TRPO single path) 19 | - [x] PPO (PPO clip) 20 | - [x] Vanilla PG 21 | 22 | In the next stage, I want to implement 23 | 24 | - [ ] DDPG 25 | - [X] Random Search (see [Simple random search provides a competitive approach to reinforcement learning](https://arxiv.org/pdf/1803.07055.pdf)) 26 | - [ ] SAC (soft actor-critic) with continuous action space 27 | - [X] SAC (soft actor-critic) with discrete action space 28 | - [X] DQN 29 | 30 | The stage after that covers discrete action space problems and raw video input (Atari) problems: 31 | 32 | - [X] Rainbow: DQN and relevant techniques (target network / double Q-learning / prioritized experience replay / dueling network structure / distributional RL) 33 | - [X] PPO with random network distillation (RND) 34 | 35 | Rainbow on Atari with only 3M: It works but may need further tuning. 36 | 37 | ![](docs/rainbow.png) 38 | 39 | And then model-based algorithms (not planned): 40 | 41 | - [ ] PILCO 42 | - [ ] PE-TS 43 | 44 | TODOs: 45 | - [ ] change the way the reward is counted; the current way may underestimate the reward (evaluate a deterministic policy rather than a stochastic/exploratory one) 46 | 47 | ## PPO Implementation 48 | 49 | The PPO implementation is of high quality - it matches the performance of OpenAI Baselines. 50 | 51 | ## Update 52 | 53 | Recently, I added Rainbow and DQN. The Rainbow implementation is of high quality on Atari games - enough for you to modify and write your own research paper. The DQN implementation is a minimal working version and reaches good performance on MountainCar (which is a simple task, but many implementations on GitHub do not achieve good performance or need additional reward/environment engineering). This is enough for you to quickly test your research ideas. 54 | -------------------------------------------------------------------------------- /SECURITY.md: -------------------------------------------------------------------------------- 1 | # Security Policy 2 | 3 | ## Supported Versions 4 | 5 | Use this section to tell people about which versions of your project are 6 | currently being supported with security updates.
7 | 8 | | Version | Supported | 9 | | ------- | ------------------ | 10 | | 5.1.x | :white_check_mark: | 11 | | 5.0.x | :x: | 12 | | 4.0.x | :white_check_mark: | 13 | | < 4.0 | :x: | 14 | 15 | ## Reporting a Vulnerability 16 | 17 | Use this section to tell people how to report a vulnerability. 18 | 19 | Tell them where to go, how often they can expect to get an update on a 20 | reported vulnerability, what to expect if the vulnerability is accepted or 21 | declined, etc. 22 | -------------------------------------------------------------------------------- /code/.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/zhangchuheng123/Reinforcement-Implementation/c04e0df10ec29ef775ea31395a8ad4b917302d24/code/.DS_Store -------------------------------------------------------------------------------- /code/RND/agents.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import torch.nn.functional as F 3 | import torch.nn as nn 4 | import torch 5 | import torch.optim as optim 6 | from torch.distributions.categorical import Categorical 7 | from model import CnnActorCriticNetwork, RNDModel 8 | from utils import global_grad_norm_ 9 | 10 | 11 | class RNDAgent(object): 12 | def __init__( 13 | self, 14 | input_size, 15 | output_size, 16 | num_env, 17 | num_step, 18 | gamma, 19 | lam=0.95, 20 | learning_rate=1e-4, 21 | ent_coef=0.01, 22 | clip_grad_norm=0.5, 23 | epoch=3, 24 | batch_size=128, 25 | ppo_eps=0.1, 26 | update_proportion=0.25, 27 | use_gae=True, 28 | use_cuda=False, 29 | use_noisy_net=False): 30 | self.model = CnnActorCriticNetwork(input_size, output_size, use_noisy_net) 31 | self.num_env = num_env 32 | self.output_size = output_size 33 | self.input_size = input_size 34 | self.num_step = num_step 35 | self.gamma = gamma 36 | self.lam = lam 37 | self.epoch = epoch 38 | self.batch_size = batch_size 39 | self.use_gae = use_gae 40 | self.ent_coef = ent_coef 41 | self.ppo_eps = ppo_eps 42 | self.clip_grad_norm = clip_grad_norm 43 | self.update_proportion = update_proportion 44 | self.device = torch.device('cuda' if use_cuda else 'cpu') 45 | 46 | self.rnd = RNDModel(input_size, output_size) 47 | self.optimizer = optim.Adam(list(self.model.parameters()) + list(self.rnd.predictor.parameters()), 48 | lr=learning_rate) 49 | self.rnd = self.rnd.to(self.device) 50 | 51 | self.model = self.model.to(self.device) 52 | 53 | def get_action(self, state): 54 | state = torch.Tensor(state).to(self.device) 55 | state = state.float() 56 | policy, value_ext, value_int = self.model(state) 57 | action_prob = F.softmax(policy, dim=-1).data.cpu().numpy() 58 | 59 | action = self.random_choice_prob_index(action_prob) 60 | 61 | return action, value_ext.data.cpu().numpy().squeeze(), value_int.data.cpu().numpy().squeeze(), policy.detach() 62 | 63 | @staticmethod 64 | def random_choice_prob_index(p, axis=1): 65 | r = np.expand_dims(np.random.rand(p.shape[1 - axis]), axis=axis) 66 | return (p.cumsum(axis=axis) > r).argmax(axis=axis) 67 | 68 | def compute_intrinsic_reward(self, next_obs): 69 | next_obs = torch.FloatTensor(next_obs).to(self.device) 70 | 71 | target_next_feature = self.rnd.target(next_obs) 72 | predict_next_feature = self.rnd.predictor(next_obs) 73 | intrinsic_reward = (target_next_feature - predict_next_feature).pow(2).sum(1) / 2 74 | 75 | return intrinsic_reward.data.cpu().numpy() 76 | 77 | def train_model(self, s_batch, target_ext_batch, target_int_batch, y_batch, adv_batch, 
next_obs_batch, old_policy): 78 | s_batch = torch.FloatTensor(s_batch).to(self.device) 79 | target_ext_batch = torch.FloatTensor(target_ext_batch).to(self.device) 80 | target_int_batch = torch.FloatTensor(target_int_batch).to(self.device) 81 | y_batch = torch.LongTensor(y_batch).to(self.device) 82 | adv_batch = torch.FloatTensor(adv_batch).to(self.device) 83 | next_obs_batch = torch.FloatTensor(next_obs_batch).to(self.device) 84 | 85 | sample_range = np.arange(len(s_batch)) 86 | forward_mse = nn.MSELoss(reduction='none') 87 | 88 | with torch.no_grad(): 89 | policy_old_list = torch.stack(old_policy).permute(1, 0, 2).contiguous().view(-1, self.output_size).to( 90 | self.device) 91 | 92 | m_old = Categorical(F.softmax(policy_old_list, dim=-1)) 93 | log_prob_old = m_old.log_prob(y_batch) 94 | # ------------------------------------------------------------ 95 | 96 | for i in range(self.epoch): 97 | np.random.shuffle(sample_range) 98 | for j in range(int(len(s_batch) / self.batch_size)): 99 | sample_idx = sample_range[self.batch_size * j:self.batch_size * (j + 1)] 100 | 101 | # -------------------------------------------------------------------------------- 102 | # for Curiosity-driven(Random Network Distillation) 103 | predict_next_state_feature, target_next_state_feature = self.rnd(next_obs_batch[sample_idx]) 104 | 105 | forward_loss = forward_mse(predict_next_state_feature, target_next_state_feature.detach()).mean(-1) 106 | # Proportion of exp used for predictor update 107 | mask = torch.rand(len(forward_loss)).to(self.device) 108 | mask = (mask < self.update_proportion).type(torch.FloatTensor).to(self.device) 109 | forward_loss = (forward_loss * mask).sum() / torch.max(mask.sum(), torch.Tensor([1]).to(self.device)) 110 | # --------------------------------------------------------------------------------- 111 | 112 | policy, value_ext, value_int = self.model(s_batch[sample_idx]) 113 | m = Categorical(F.softmax(policy, dim=-1)) 114 | log_prob = m.log_prob(y_batch[sample_idx]) 115 | 116 | ratio = torch.exp(log_prob - log_prob_old[sample_idx]) 117 | 118 | surr1 = ratio * adv_batch[sample_idx] 119 | surr2 = torch.clamp( 120 | ratio, 121 | 1.0 - self.ppo_eps, 122 | 1.0 + self.ppo_eps) * adv_batch[sample_idx] 123 | 124 | actor_loss = -torch.min(surr1, surr2).mean() 125 | critic_ext_loss = F.mse_loss(value_ext.sum(1), target_ext_batch[sample_idx]) 126 | critic_int_loss = F.mse_loss(value_int.sum(1), target_int_batch[sample_idx]) 127 | 128 | critic_loss = critic_ext_loss + critic_int_loss 129 | 130 | entropy = m.entropy().mean() 131 | 132 | self.optimizer.zero_grad() 133 | loss = actor_loss + 0.5 * critic_loss - self.ent_coef * entropy + forward_loss 134 | loss.backward() 135 | global_grad_norm_(list(self.model.parameters())+list(self.rnd.predictor.parameters())) 136 | self.optimizer.step() 137 | -------------------------------------------------------------------------------- /code/RND/bash.sh: -------------------------------------------------------------------------------- 1 | CUDA_VISIBLE_DEVICES=0 python train.py --use-gae --use-gpu --sticky-action --env-id SeaquestNoFrameskip-v4 & 2 | CUDA_VISIBLE_DEVICES=1 python train.py --use-gae --use-gpu --sticky-action --env-id MontezumaRevengeNoFrameskip-v4 & 3 | CUDA_VISIBLE_DEVICES=2 python train.py --use-gae --use-gpu --sticky-action --env-id RoadRunnerNoFrameskip-v4 & 4 | CUDA_VISIBLE_DEVICES=3 python train.py --use-gae --use-gpu --sticky-action --env-id BattleZoneNoFrameskip-v4 & 
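The exploration bonus that `RNDAgent.compute_intrinsic_reward` produces above is simply the prediction error of the trained predictor network against a fixed, randomly initialised target network (both defined in `model.py`, shown further below). The following is a minimal standalone sketch of that computation, not part of the repo: it assumes it is run from `code/RND` so that `model.py` is importable, and the batch shape and constructor sizes are illustrative only.

```python
import torch
from model import RNDModel  # defined in code/RND/model.py

# RNDModel stores input_size/output_size, but its layers are fixed for
# 1-channel 84x84 observations with a 512-dim feature head.
rnd = RNDModel(input_size=(1, 84, 84), output_size=512)

# A fake batch of 8 whitened next-observations, shaped like the ones train.py
# produces (normalised by a running mean/std and clipped to [-5, 5]).
next_obs = torch.randn(8, 1, 84, 84).clamp(-5, 5)

with torch.no_grad():
    predict_feature, target_feature = rnd(next_obs)
    # Same formula as RNDAgent.compute_intrinsic_reward: larger error = more novel state.
    intrinsic_reward = (target_feature - predict_feature).pow(2).sum(1) / 2

print(intrinsic_reward.shape)  # torch.Size([8])
```

During training, `train.py` whitens the next observations with a running mean/std (clipped to [-5, 5]) before calling this, and `train_model` minimises the same error for a random `update_proportion` of samples, so the bonus decays for states the predictor has seen often.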
-------------------------------------------------------------------------------- /code/RND/envs.py: -------------------------------------------------------------------------------- 1 | import gym 2 | import cv2 3 | import numpy as np 4 | from abc import abstractmethod 5 | from collections import deque 6 | from copy import copy 7 | from torch.multiprocessing import Process 8 | from PIL import Image 9 | 10 | class Environment(Process): 11 | @abstractmethod 12 | def run(self): 13 | pass 14 | 15 | @abstractmethod 16 | def reset(self): 17 | pass 18 | 19 | @abstractmethod 20 | def pre_proc(self, x): 21 | pass 22 | 23 | @abstractmethod 24 | def get_init_state(self, x): 25 | pass 26 | 27 | 28 | def unwrap(env): 29 | if hasattr(env, "unwrapped"): 30 | return env.unwrapped 31 | elif hasattr(env, "env"): 32 | return unwrap(env.env) 33 | elif hasattr(env, "leg_env"): 34 | return unwrap(env.leg_env) 35 | else: 36 | return env 37 | 38 | 39 | class MaxAndSkipEnv(gym.Wrapper): 40 | def __init__(self, env, is_render, skip=4): 41 | """Return only every `skip`-th frame""" 42 | gym.Wrapper.__init__(self, env) 43 | # most recent raw observations (for max pooling across time steps) 44 | self._obs_buffer = np.zeros((2,) + env.observation_space.shape, dtype=np.uint8) 45 | self._skip = skip 46 | self.is_render = is_render 47 | 48 | def step(self, action): 49 | """Repeat action, sum reward, and max over last observations.""" 50 | total_reward = 0.0 51 | done = None 52 | for i in range(self._skip): 53 | obs, reward, done, info = self.env.step(action) 54 | if self.is_render: 55 | self.env.render() 56 | if i == self._skip - 2: 57 | self._obs_buffer[0] = obs 58 | if i == self._skip - 1: 59 | self._obs_buffer[1] = obs 60 | total_reward += reward 61 | if done: 62 | break 63 | # Note that the observation on the done=True frame 64 | # doesn't matter 65 | max_frame = self._obs_buffer.max(axis=0) 66 | 67 | return max_frame, total_reward, done, info 68 | 69 | def reset(self, **kwargs): 70 | return self.env.reset(**kwargs) 71 | 72 | 73 | class AtariEnvironment(Environment): 74 | def __init__( 75 | self, 76 | env_id, 77 | is_render, 78 | env_idx, 79 | child_conn, 80 | history_size=4, 81 | h=84, 82 | w=84, 83 | life_done=True, 84 | sticky_action=True, 85 | p=0.25, 86 | max_step_per_episode=4500): 87 | super(AtariEnvironment, self).__init__() 88 | self.daemon = True 89 | self.env = MaxAndSkipEnv(gym.make(env_id), is_render) 90 | self.env_id = env_id 91 | self.is_render = is_render 92 | self.env_idx = env_idx 93 | self.steps = 0 94 | self.episode = 0 95 | self.rall = 0 96 | self.recent_rlist = deque(maxlen=100) 97 | self.child_conn = child_conn 98 | self.max_step_per_episode = max_step_per_episode 99 | 100 | self.sticky_action = sticky_action 101 | self.last_action = 0 102 | self.p = p 103 | 104 | self.history_size = history_size 105 | self.history = np.zeros([history_size, h, w]) 106 | self.h = h 107 | self.w = w 108 | 109 | self.reset() 110 | 111 | def run(self): 112 | super(AtariEnvironment, self).run() 113 | while True: 114 | action = self.child_conn.recv() 115 | 116 | if 'Breakout' in self.env_id: 117 | action += 1 118 | 119 | # sticky action 120 | if self.sticky_action: 121 | if np.random.rand() <= self.p: 122 | action = self.last_action 123 | self.last_action = action 124 | 125 | s, reward, done, info = self.env.step(action) 126 | 127 | if self.steps > self.max_step_per_episode: 128 | done = True 129 | 130 | log_reward = reward 131 | force_done = done 132 | 133 | self.history[:3, :, :] = self.history[1:, :, :] 134 | 
self.history[3, :, :] = self.pre_proc(s) 135 | 136 | self.rall += reward 137 | self.steps += 1 138 | 139 | if done: 140 | self.recent_rlist.append(self.rall) 141 | # print("[Episode {}({})] Step: {} Reward: {} Recent Reward: {}".format( 142 | # self.episode, self.env_idx, self.steps, self.rall, np.mean(self.recent_rlist))) 143 | self.history = self.reset() 144 | 145 | # if self.env_idx == 0: 146 | # print('env_idx=0 num_rooms={} done={}'.format(num_rooms, done)) 147 | 148 | self.child_conn.send( 149 | [self.history[:, :, :], reward, force_done, done, log_reward]) 150 | 151 | def reset(self): 152 | self.last_action = 0 153 | self.steps = 0 154 | self.episode += 1 155 | self.rall = 0 156 | s = self.env.reset() 157 | self.get_init_state( 158 | self.pre_proc(s)) 159 | return self.history[:, :, :] 160 | 161 | def pre_proc(self, X): 162 | X = np.array(Image.fromarray(X).convert('L')).astype('float32') 163 | x = cv2.resize(X, (self.h, self.w)) 164 | return x 165 | 166 | def get_init_state(self, s): 167 | for i in range(self.history_size): 168 | self.history[i, :, :] = self.pre_proc(s) 169 | -------------------------------------------------------------------------------- /code/RND/model.py: -------------------------------------------------------------------------------- 1 | import torch.nn.functional as F 2 | import torch.nn as nn 3 | import torch 4 | import torch.optim as optim 5 | import numpy as np 6 | import math 7 | from torch.nn import init 8 | 9 | 10 | class NoisyLinear(nn.Module): 11 | """Factorised Gaussian NoisyNet""" 12 | 13 | def __init__(self, in_features, out_features, sigma0=0.5): 14 | super().__init__() 15 | self.in_features = in_features 16 | self.out_features = out_features 17 | self.weight = nn.Parameter(torch.Tensor(out_features, in_features)) 18 | self.bias = nn.Parameter(torch.Tensor(out_features)) 19 | self.noisy_weight = nn.Parameter( 20 | torch.Tensor(out_features, in_features)) 21 | self.noisy_bias = nn.Parameter(torch.Tensor(out_features)) 22 | self.noise_std = sigma0 / math.sqrt(self.in_features) 23 | 24 | self.reset_parameters() 25 | self.register_noise() 26 | 27 | def register_noise(self): 28 | in_noise = torch.FloatTensor(self.in_features) 29 | out_noise = torch.FloatTensor(self.out_features) 30 | noise = torch.FloatTensor(self.out_features, self.in_features) 31 | self.register_buffer('in_noise', in_noise) 32 | self.register_buffer('out_noise', out_noise) 33 | self.register_buffer('noise', noise) 34 | 35 | def sample_noise(self): 36 | self.in_noise.normal_(0, self.noise_std) 37 | self.out_noise.normal_(0, self.noise_std) 38 | self.noise = torch.mm( 39 | self.out_noise.view(-1, 1), self.in_noise.view(1, -1)) 40 | 41 | def reset_parameters(self): 42 | stdv = 1. 
/ math.sqrt(self.weight.size(1)) 43 | self.weight.data.uniform_(-stdv, stdv) 44 | self.noisy_weight.data.uniform_(-stdv, stdv) 45 | if self.bias is not None: 46 | self.bias.data.uniform_(-stdv, stdv) 47 | self.noisy_bias.data.uniform_(-stdv, stdv) 48 | 49 | def forward(self, x): 50 | """ 51 | Note: noise will be updated if x is not volatile 52 | """ 53 | normal_y = nn.functional.linear(x, self.weight, self.bias) 54 | if self.training: 55 | # update the noise once per update 56 | self.sample_noise() 57 | 58 | noisy_weight = self.noisy_weight * self.noise 59 | noisy_bias = self.noisy_bias * self.out_noise 60 | noisy_y = nn.functional.linear(x, noisy_weight, noisy_bias) 61 | return noisy_y + normal_y 62 | 63 | def __repr__(self): 64 | return self.__class__.__name__ + '(' \ 65 | + 'in_features=' + str(self.in_features) \ 66 | + ', out_features=' + str(self.out_features) + ')' 67 | 68 | 69 | class Flatten(nn.Module): 70 | def forward(self, input): 71 | return input.view(input.size(0), -1) 72 | 73 | 74 | class CnnActorCriticNetwork(nn.Module): 75 | def __init__(self, input_size, output_size, use_noisy_net=False): 76 | super(CnnActorCriticNetwork, self).__init__() 77 | 78 | if use_noisy_net: 79 | print('use NoisyNet') 80 | linear = NoisyLinear 81 | else: 82 | linear = nn.Linear 83 | 84 | self.feature = nn.Sequential( 85 | nn.Conv2d( 86 | in_channels=4, 87 | out_channels=32, 88 | kernel_size=8, 89 | stride=4), 90 | nn.ReLU(), 91 | nn.Conv2d( 92 | in_channels=32, 93 | out_channels=64, 94 | kernel_size=4, 95 | stride=2), 96 | nn.ReLU(), 97 | nn.Conv2d( 98 | in_channels=64, 99 | out_channels=64, 100 | kernel_size=3, 101 | stride=1), 102 | nn.ReLU(), 103 | Flatten(), 104 | linear( 105 | 7 * 7 * 64, 106 | 256), 107 | nn.ReLU(), 108 | linear( 109 | 256, 110 | 448), 111 | nn.ReLU() 112 | ) 113 | 114 | self.actor = nn.Sequential( 115 | linear(448, 448), 116 | nn.ReLU(), 117 | linear(448, output_size) 118 | ) 119 | 120 | self.extra_layer = nn.Sequential( 121 | linear(448, 448), 122 | nn.ReLU() 123 | ) 124 | 125 | self.critic_ext = linear(448, 1) 126 | self.critic_int = linear(448, 1) 127 | 128 | for p in self.modules(): 129 | if isinstance(p, nn.Conv2d): 130 | init.orthogonal_(p.weight, np.sqrt(2)) 131 | p.bias.data.zero_() 132 | 133 | if isinstance(p, nn.Linear): 134 | init.orthogonal_(p.weight, np.sqrt(2)) 135 | p.bias.data.zero_() 136 | 137 | init.orthogonal_(self.critic_ext.weight, 0.01) 138 | self.critic_ext.bias.data.zero_() 139 | 140 | init.orthogonal_(self.critic_int.weight, 0.01) 141 | self.critic_int.bias.data.zero_() 142 | 143 | for i in range(len(self.actor)): 144 | if type(self.actor[i]) == nn.Linear: 145 | init.orthogonal_(self.actor[i].weight, 0.01) 146 | self.actor[i].bias.data.zero_() 147 | 148 | for i in range(len(self.extra_layer)): 149 | if type(self.extra_layer[i]) == nn.Linear: 150 | init.orthogonal_(self.extra_layer[i].weight, 0.1) 151 | self.extra_layer[i].bias.data.zero_() 152 | 153 | def forward(self, state): 154 | x = self.feature(state) 155 | policy = self.actor(x) 156 | value_ext = self.critic_ext(self.extra_layer(x) + x) 157 | value_int = self.critic_int(self.extra_layer(x) + x) 158 | return policy, value_ext, value_int 159 | 160 | 161 | class RNDModel(nn.Module): 162 | def __init__(self, input_size, output_size): 163 | super(RNDModel, self).__init__() 164 | 165 | self.input_size = input_size 166 | self.output_size = output_size 167 | 168 | feature_output = 7 * 7 * 64 169 | self.predictor = nn.Sequential( 170 | nn.Conv2d( 171 | in_channels=1, 172 | out_channels=32, 173 | 
kernel_size=8, 174 | stride=4), 175 | nn.LeakyReLU(), 176 | nn.Conv2d( 177 | in_channels=32, 178 | out_channels=64, 179 | kernel_size=4, 180 | stride=2), 181 | nn.LeakyReLU(), 182 | nn.Conv2d( 183 | in_channels=64, 184 | out_channels=64, 185 | kernel_size=3, 186 | stride=1), 187 | nn.LeakyReLU(), 188 | Flatten(), 189 | nn.Linear(feature_output, 512), 190 | nn.ReLU(), 191 | nn.Linear(512, 512), 192 | nn.ReLU(), 193 | nn.Linear(512, 512) 194 | ) 195 | 196 | self.target = nn.Sequential( 197 | nn.Conv2d( 198 | in_channels=1, 199 | out_channels=32, 200 | kernel_size=8, 201 | stride=4), 202 | nn.LeakyReLU(), 203 | nn.Conv2d( 204 | in_channels=32, 205 | out_channels=64, 206 | kernel_size=4, 207 | stride=2), 208 | nn.LeakyReLU(), 209 | nn.Conv2d( 210 | in_channels=64, 211 | out_channels=64, 212 | kernel_size=3, 213 | stride=1), 214 | nn.LeakyReLU(), 215 | Flatten(), 216 | nn.Linear(feature_output, 512) 217 | ) 218 | 219 | for p in self.modules(): 220 | if isinstance(p, nn.Conv2d): 221 | init.orthogonal_(p.weight, np.sqrt(2)) 222 | p.bias.data.zero_() 223 | 224 | if isinstance(p, nn.Linear): 225 | init.orthogonal_(p.weight, np.sqrt(2)) 226 | p.bias.data.zero_() 227 | 228 | for param in self.target.parameters(): 229 | param.requires_grad = False 230 | 231 | def forward(self, next_obs): 232 | target_feature = self.target(next_obs) 233 | predict_feature = self.predictor(next_obs) 234 | 235 | return predict_feature, target_feature 236 | -------------------------------------------------------------------------------- /code/RND/requirements.txt: -------------------------------------------------------------------------------- 1 | atari-py==0.2.6 2 | opencv-python==4.2.0.34 3 | plotly==4.8.1 4 | procgen==0.10.3 5 | tensorboardX==2.0 6 | torch==1.5.0 7 | tqdm==4.42.1 8 | tensorflow<2.0.0,>=1.4.0 9 | numpy<2.0.0,>=1.17.0 10 | pathos==0.2.6 11 | kornia==0.3.1 -------------------------------------------------------------------------------- /code/RND/train.py: -------------------------------------------------------------------------------- 1 | from agents import RNDAgent 2 | from envs import AtariEnvironment 3 | from utils import make_train_data, RunningMeanStd, RewardForwardFilter, softmax 4 | 5 | import torch 6 | from torch.multiprocessing import Pipe 7 | 8 | from tensorboardX import SummaryWriter 9 | from datetime import datetime 10 | import numpy as np 11 | import argparse 12 | import tqdm 13 | import gym 14 | import os 15 | 16 | # Enable multi-thread 17 | os.system("taskset -p 0xffffffff %d" % os.getpid()) 18 | torch.set_num_threads(128) 19 | 20 | def parse_arguments(): 21 | 22 | parser = argparse.ArgumentParser(description='RND') 23 | 24 | parser.add_argument('--train-method', type=str, default='RND') 25 | parser.add_argument('--env-type', type=str, default='atari') 26 | parser.add_argument('--env-id', type=str, default='MontezumaRevengeNoFrameskip-v4') 27 | parser.add_argument('--max-step-per-episode', type=int, default=4500) 28 | parser.add_argument('--total-frames', type=int, default=int(50e6)) 29 | parser.add_argument('--ext-coef', type=float, default=2.0) 30 | parser.add_argument('--learning-rate', type=float, default=1e-4) 31 | parser.add_argument('--num-worker', type=int, default=128) 32 | parser.add_argument('--num-step', type=int, default=128) 33 | parser.add_argument('--gamma', type=float, default=0.999) 34 | parser.add_argument('--int-gamma', type=float, default=0.99) 35 | parser.add_argument('--lam', type=float, default=0.95) 36 | parser.add_argument('--stable-eps', type=float, default=1e-8) 
37 | parser.add_argument('--stable-stack-size', type=int, default=4) 38 | parser.add_argument('--preproc-height', type=int, default=84) 39 | parser.add_argument('--preproc-width', type=int, default=84) 40 | parser.add_argument('--use-gae', action='store_true') 41 | parser.add_argument('--use-gpu', action='store_true') 42 | parser.add_argument('--use-norm', action='store_true') 43 | parser.add_argument('--use-noisynet', action='store_true') 44 | parser.add_argument('--clip-grad-norm', type=float, default=0.5) 45 | parser.add_argument('--entropy', type=float, default=0.001) 46 | parser.add_argument('--epoch', type=int, default=4) 47 | parser.add_argument('--minibatch', type=int, default=4) 48 | parser.add_argument('--ppo-eps', type=float, default=0.1) 49 | parser.add_argument('--int-coef', type=float, default=1.0) 50 | parser.add_argument('--sticky-action', action='store_true') 51 | parser.add_argument('--action-prob', type=float, default=0.25) 52 | parser.add_argument('--update-proportion', type=float, default=0.25) 53 | parser.add_argument('--life-done', action='store_true') 54 | parser.add_argument('--obs-norm-step', type=int, default=50) 55 | parser.add_argument('--save-models', action='store_true') 56 | 57 | # Setup 58 | args = parser.parse_args() 59 | 60 | return args 61 | 62 | class Logger(object): 63 | def __init__(self, path): 64 | self.path = path 65 | 66 | def info(self, s): 67 | string = '[' + str(datetime.now().strftime('%Y-%m-%dT%H:%M:%S')) + '] ' + s 68 | print(string) 69 | with open(os.path.join(self.path, 'log.txt'), 'a+') as f: 70 | f.writelines([string, '']) 71 | 72 | def main(): 73 | 74 | args = parse_arguments() 75 | 76 | train_method = args.train_method 77 | env_id = args.env_id 78 | env_type = args.env_type 79 | 80 | if env_type == 'atari': 81 | env = gym.make(env_id) 82 | input_size = env.observation_space.shape 83 | output_size = env.action_space.n 84 | env.close() 85 | else: 86 | raise NotImplementedError 87 | 88 | is_load_model = False 89 | is_render = False 90 | os.makedirs('models', exist_ok=True) 91 | model_path = 'models/{}.model'.format(env_id) 92 | predictor_path = 'models/{}.pred'.format(env_id) 93 | target_path = 'models/{}.target'.format(env_id) 94 | 95 | results_dir = os.path.join('outputs', args.env_id) 96 | os.makedirs(results_dir, exist_ok=True) 97 | logger = Logger(results_dir) 98 | writer = SummaryWriter(os.path.join(results_dir, 'tensorboard', args.env_id)) 99 | 100 | use_cuda = args.use_gpu 101 | use_gae = args.use_gae 102 | use_noisy_net = args.use_noisynet 103 | lam = args.lam 104 | num_worker = args.num_worker 105 | num_step = args.num_step 106 | ppo_eps = args.ppo_eps 107 | epoch = args.epoch 108 | mini_batch = args.minibatch 109 | batch_size = int(num_step * num_worker / mini_batch) 110 | learning_rate = args.learning_rate 111 | entropy_coef = args.entropy 112 | gamma = args.gamma 113 | int_gamma = args.int_gamma 114 | clip_grad_norm = args.clip_grad_norm 115 | ext_coef = args.ext_coef 116 | int_coef = args.int_coef 117 | sticky_action = args.sticky_action 118 | action_prob = args.action_prob 119 | life_done = args.life_done 120 | pre_obs_norm_step = args.obs_norm_step 121 | 122 | reward_rms = RunningMeanStd() 123 | obs_rms = RunningMeanStd(shape=(1, 1, 84, 84)) 124 | discounted_reward = RewardForwardFilter(int_gamma) 125 | 126 | if args.train_method == 'RND': 127 | agent = RNDAgent 128 | else: 129 | raise NotImplementedError 130 | 131 | if args.env_type == 'atari': 132 | env_type = AtariEnvironment 133 | else: 134 | raise NotImplementedError 
135 | 136 | agent = agent( 137 | input_size, 138 | output_size, 139 | num_worker, 140 | num_step, 141 | gamma, 142 | lam=lam, 143 | learning_rate=learning_rate, 144 | ent_coef=entropy_coef, 145 | clip_grad_norm=clip_grad_norm, 146 | epoch=epoch, 147 | batch_size=batch_size, 148 | ppo_eps=ppo_eps, 149 | use_cuda=use_cuda, 150 | use_gae=use_gae, 151 | use_noisy_net=use_noisy_net 152 | ) 153 | 154 | logger.info('Start to initialize workers') 155 | works = [] 156 | parent_conns = [] 157 | child_conns = [] 158 | for idx in range(num_worker): 159 | parent_conn, child_conn = Pipe() 160 | work = env_type(env_id, is_render, idx, child_conn, 161 | sticky_action=sticky_action, p=action_prob, life_done=life_done, 162 | max_step_per_episode=args.max_step_per_episode) 163 | work.start() 164 | works.append(work) 165 | parent_conns.append(parent_conn) 166 | child_conns.append(child_conn) 167 | 168 | states = np.zeros([num_worker, 4, 84, 84]) 169 | 170 | sample_episode = 0 171 | sample_rall = 0 172 | sample_step = 0 173 | sample_env_idx = 0 174 | sample_i_rall = 0 175 | global_update = 0 176 | global_step = 0 177 | 178 | # normalize obs 179 | logger.info('Start to initailize observation normalization parameter.....') 180 | next_obs = [] 181 | for step in range(num_step * pre_obs_norm_step): 182 | actions = np.random.randint(0, output_size, size=(num_worker,)) 183 | 184 | for parent_conn, action in zip(parent_conns, actions): 185 | parent_conn.send(action) 186 | 187 | for parent_conn in parent_conns: 188 | s, r, d, rd, lr = parent_conn.recv() 189 | next_obs.append(s[3, :, :].reshape([1, 84, 84])) 190 | 191 | if len(next_obs) % (num_step * num_worker) == 0: 192 | next_obs = np.stack(next_obs) 193 | obs_rms.update(next_obs) 194 | next_obs = [] 195 | logger.info('End to initalize...') 196 | 197 | pbar = tqdm.tqdm(total=args.total_frames) 198 | while True: 199 | logger.info('Iteration: {}'.format(global_update)) 200 | total_state, total_reward, total_done, total_next_state, \ 201 | total_action, total_int_reward, total_next_obs, total_ext_values, \ 202 | total_int_values, total_policy, total_policy_np = \ 203 | [], [], [], [], [], [], [], [], [], [], [] 204 | global_step += (num_worker * num_step) 205 | global_update += 1 206 | 207 | # Step 1. n-step rollout 208 | for _ in range(num_step): 209 | actions, value_ext, value_int, policy = agent.get_action(np.float32(states) / 255.) 
210 | 211 | for parent_conn, action in zip(parent_conns, actions): 212 | parent_conn.send(action) 213 | 214 | next_states, rewards, dones, real_dones, log_rewards, next_obs = \ 215 | [], [], [], [], [], [] 216 | for parent_conn in parent_conns: 217 | s, r, d, rd, lr = parent_conn.recv() 218 | next_states.append(s) 219 | rewards.append(r) 220 | dones.append(d) 221 | real_dones.append(rd) 222 | log_rewards.append(lr) 223 | next_obs.append(s[3, :, :].reshape([1, 84, 84])) 224 | 225 | next_states = np.stack(next_states) 226 | rewards = np.hstack(rewards) 227 | dones = np.hstack(dones) 228 | real_dones = np.hstack(real_dones) 229 | next_obs = np.stack(next_obs) 230 | 231 | # total reward = int reward + ext Reward 232 | intrinsic_reward = agent.compute_intrinsic_reward( 233 | ((next_obs - obs_rms.mean) / np.sqrt(obs_rms.var)).clip(-5, 5)) 234 | intrinsic_reward = np.hstack(intrinsic_reward) 235 | sample_i_rall += intrinsic_reward[sample_env_idx] 236 | 237 | total_next_obs.append(next_obs) 238 | total_int_reward.append(intrinsic_reward) 239 | total_state.append(states) 240 | total_reward.append(rewards) 241 | total_done.append(dones) 242 | total_action.append(actions) 243 | total_ext_values.append(value_ext) 244 | total_int_values.append(value_int) 245 | total_policy.append(policy) 246 | total_policy_np.append(policy.cpu().numpy()) 247 | 248 | states = next_states[:, :, :, :] 249 | 250 | sample_rall += log_rewards[sample_env_idx] 251 | 252 | sample_step += 1 253 | if real_dones[sample_env_idx]: 254 | sample_episode += 1 255 | writer.add_scalar('data/returns_vs_frames', sample_rall, global_step) 256 | writer.add_scalar('data/lengths_vs_frames', sample_step, global_step) 257 | writer.add_scalar('data/reward_per_epi', sample_rall, sample_episode) 258 | writer.add_scalar('data/reward_per_rollout', sample_rall, global_update) 259 | writer.add_scalar('data/step', sample_step, sample_episode) 260 | sample_rall = 0 261 | sample_step = 0 262 | sample_i_rall = 0 263 | 264 | # calculate last next value 265 | _, value_ext, value_int, _ = agent.get_action(np.float32(states) / 255.) 266 | total_ext_values.append(value_ext) 267 | total_int_values.append(value_int) 268 | 269 | total_state = np.stack(total_state).transpose([1, 0, 2, 3, 4]).reshape([-1, 4, 84, 84]) 270 | total_reward = np.stack(total_reward).transpose().clip(-1, 1) 271 | total_action = np.stack(total_action).transpose().reshape([-1]) 272 | total_done = np.stack(total_done).transpose() 273 | total_next_obs = np.stack(total_next_obs).transpose([1, 0, 2, 3, 4]).reshape([-1, 1, 84, 84]) 274 | total_ext_values = np.stack(total_ext_values).transpose() 275 | total_int_values = np.stack(total_int_values).transpose() 276 | total_logging_policy = np.vstack(total_policy_np) 277 | 278 | # Step 2. 
calculate intrinsic reward 279 | # running mean intrinsic reward 280 | total_int_reward = np.stack(total_int_reward).transpose() 281 | total_reward_per_env = np.array([discounted_reward.update(reward_per_step) for reward_per_step in 282 | total_int_reward.T]) 283 | mean, std, count = np.mean(total_reward_per_env), np.std(total_reward_per_env), len(total_reward_per_env) 284 | reward_rms.update_from_moments(mean, std ** 2, count) 285 | 286 | # normalize intrinsic reward 287 | total_int_reward /= np.sqrt(reward_rms.var) 288 | writer.add_scalar('data/int_reward_per_epi', np.sum(total_int_reward) / num_worker, sample_episode) 289 | writer.add_scalar('data/int_reward_per_rollout', np.sum(total_int_reward) / num_worker, global_update) 290 | 291 | # logging Max action probability 292 | writer.add_scalar('data/max_prob', softmax(total_logging_policy).max(1).mean(), sample_episode) 293 | 294 | # Step 3. make target and advantage 295 | # extrinsic reward calculate 296 | ext_target, ext_adv = make_train_data(total_reward, total_done, 297 | total_ext_values, gamma, num_step, num_worker) 298 | 299 | # intrinsic reward calculate 300 | # None Episodic 301 | int_target, int_adv = make_train_data(total_int_reward, np.zeros_like(total_int_reward), 302 | total_int_values, int_gamma, num_step, num_worker) 303 | 304 | # add ext adv and int adv 305 | total_adv = int_adv * int_coef + ext_adv * ext_coef 306 | 307 | # Step 4. update obs normalize param 308 | obs_rms.update(total_next_obs) 309 | 310 | # Step 5. Training! 311 | agent.train_model(np.float32(total_state) / 255., ext_target, int_target, total_action, 312 | total_adv, ((total_next_obs - obs_rms.mean) / np.sqrt(obs_rms.var)).clip(-5, 5), 313 | total_policy) 314 | 315 | if args.save_models and global_update % 1000 == 0: 316 | torch.save(agent.model.state_dict(), 'models/{}-{}.model'.format(env_id, global_update)) 317 | logger.info('Now Global Step :{}'.format(global_step)) 318 | torch.save(agent.model.state_dict(), model_path) 319 | torch.save(agent.rnd.predictor.state_dict(), predictor_path) 320 | torch.save(agent.rnd.target.state_dict(), target_path) 321 | 322 | pbar.update(num_worker * num_step) 323 | if global_step >= args.total_frames: 324 | break 325 | 326 | pbar.close() 327 | 328 | if __name__ == '__main__': 329 | main() 330 | -------------------------------------------------------------------------------- /code/RND/utils.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import torch 3 | from torch._six import inf 4 | 5 | 6 | def make_train_data(reward, done, value, gamma, num_step, num_worker, lam=0.95, use_gae=True): 7 | discounted_return = np.empty([num_worker, num_step]) 8 | 9 | # Discounted Return 10 | if use_gae: 11 | gae = np.zeros_like([num_worker, ]) 12 | for t in range(num_step - 1, -1, -1): 13 | delta = reward[:, t] + gamma * value[:, t + 1] * (1 - done[:, t]) - value[:, t] 14 | gae = delta + gamma * lam * (1 - done[:, t]) * gae 15 | 16 | discounted_return[:, t] = gae + value[:, t] 17 | 18 | # For Actor 19 | adv = discounted_return - value[:, :-1] 20 | 21 | else: 22 | running_add = value[:, -1] 23 | for t in range(num_step - 1, -1, -1): 24 | running_add = reward[:, t] + gamma * running_add * (1 - done[:, t]) 25 | discounted_return[:, t] = running_add 26 | 27 | # For Actor 28 | adv = discounted_return - value[:, :-1] 29 | 30 | return discounted_return.reshape([-1]), adv.reshape([-1]) 31 | 32 | 33 | class RunningMeanStd(object): 34 | # 
https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Parallel_algorithm 35 | def __init__(self, epsilon=1e-4, shape=()): 36 | self.mean = np.zeros(shape, 'float64') 37 | self.var = np.ones(shape, 'float64') 38 | self.count = epsilon 39 | 40 | def update(self, x): 41 | batch_mean = np.mean(x, axis=0) 42 | batch_var = np.var(x, axis=0) 43 | batch_count = x.shape[0] 44 | self.update_from_moments(batch_mean, batch_var, batch_count) 45 | 46 | def update_from_moments(self, batch_mean, batch_var, batch_count): 47 | delta = batch_mean - self.mean 48 | tot_count = self.count + batch_count 49 | 50 | new_mean = self.mean + delta * batch_count / tot_count 51 | m_a = self.var * (self.count) 52 | m_b = batch_var * (batch_count) 53 | M2 = m_a + m_b + np.square(delta) * self.count * batch_count / (self.count + batch_count) 54 | new_var = M2 / (self.count + batch_count) 55 | 56 | new_count = batch_count + self.count 57 | 58 | self.mean = new_mean 59 | self.var = new_var 60 | self.count = new_count 61 | 62 | 63 | class RewardForwardFilter(object): 64 | def __init__(self, gamma): 65 | self.rewems = None 66 | self.gamma = gamma 67 | 68 | def update(self, rews): 69 | if self.rewems is None: 70 | self.rewems = rews 71 | else: 72 | self.rewems = self.rewems * self.gamma + rews 73 | return self.rewems 74 | 75 | 76 | def softmax(z): 77 | assert len(z.shape) == 2 78 | s = np.max(z, axis=1) 79 | s = s[:, np.newaxis] # necessary step to do broadcasting 80 | e_x = np.exp(z - s) 81 | div = np.sum(e_x, axis=1) 82 | div = div[:, np.newaxis] # dito 83 | return e_x / div 84 | 85 | 86 | def global_grad_norm_(parameters, norm_type=2): 87 | """Clips gradient norm of an iterable of parameters. 88 | 89 | The norm is computed over all gradients together, as if they were 90 | concatenated into a single vector. Gradients are modified in-place. 91 | 92 | Arguments: 93 | parameters (Iterable[Tensor] or Tensor): an iterable of Tensors or a 94 | single Tensor that will have gradients normalized 95 | max_norm (float or int): max norm of the gradients 96 | norm_type (float or int): type of the used p-norm. Can be ``'inf'`` for 97 | infinity norm. 98 | 99 | Returns: 100 | Total norm of the parameters (viewed as a single vector). 101 | """ 102 | if isinstance(parameters, torch.Tensor): 103 | parameters = [parameters] 104 | parameters = list(filter(lambda p: p.grad is not None, parameters)) 105 | norm_type = float(norm_type) 106 | if norm_type == inf: 107 | total_norm = max(p.grad.data.abs().max() for p in parameters) 108 | else: 109 | total_norm = 0 110 | for p in parameters: 111 | param_norm = p.grad.data.norm(norm_type) 112 | total_norm += param_norm.item() ** norm_type 113 | total_norm = total_norm ** (1. / norm_type) 114 | 115 | return total_norm 116 | -------------------------------------------------------------------------------- /code/Rainbow/.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/zhangchuheng123/Reinforcement-Implementation/c04e0df10ec29ef775ea31395a8ad4b917302d24/code/Rainbow/.DS_Store -------------------------------------------------------------------------------- /code/Rainbow/README.md: -------------------------------------------------------------------------------- 1 | This is copied and modified from https://github.com/Kaixhin/Rainbow. 2 | 3 | I verified the performance. 
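The densest part of `agent.py` below is the categorical (C51) target projection: the Bellman-updated atoms Tz = R^n + γ^n z are clamped to [V_min, V_max] and their probability mass is split between the two nearest atoms of the fixed support. Here is a small standalone PyTorch sketch of that projection for a single transition; the toy 5-atom support, reward, and probabilities are illustrative and not taken from the repo.

```python
import torch

# Toy settings: 5 atoms on [-2, 2], one next-state distribution, reward 1, gamma 0.9.
atoms, v_min, v_max = 5, -2.0, 2.0
support = torch.linspace(v_min, v_max, atoms)        # tensor([-2., -1., 0., 1., 2.])
delta_z = (v_max - v_min) / (atoms - 1)              # 1.0
p_next = torch.tensor([0.1, 0.2, 0.4, 0.2, 0.1])     # p(s_t+n, a*) from the target net
reward, gamma_n, nonterminal = 1.0, 0.9, 1.0

Tz = (reward + nonterminal * gamma_n * support).clamp(v_min, v_max)  # Bellman-updated atoms
b = (Tz - v_min) / delta_z                           # fractional indices into the support
l, u = b.floor().long(), b.ceil().long()
l[(u > 0) & (l == u)] -= 1                           # keep mass when b lands exactly on an atom
u[(l < atoms - 1) & (l == u)] += 1

m = torch.zeros(atoms)                               # projected target distribution
m.index_add_(0, l, p_next * (u.float() - b))         # m_l += p * (u - b)
m.index_add_(0, u, p_next * (b - l.float()))         # m_u += p * (b - l)
print(m, m.sum())                                    # sums to 1 (up to float error)
```

`agent.py` performs the same projection batched: an `offset` tensor shifts each row's indices so a single `index_add_` on the flattened [batch, atoms] tensor distributes all rows at once, and the resulting cross-entropy -sum(m · log p) is both the training loss and the replay priority.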
-------------------------------------------------------------------------------- /code/Rainbow/agent.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | from __future__ import division 3 | import os 4 | import numpy as np 5 | import torch 6 | from torch import optim 7 | from torch.nn.utils import clip_grad_norm_ 8 | 9 | from model import DQN 10 | 11 | 12 | class Agent(): 13 | def __init__(self, args, env): 14 | self.action_space = env.action_space() 15 | self.atoms = args.atoms 16 | self.Vmin = args.V_min 17 | self.Vmax = args.V_max 18 | self.support = torch.linspace(args.V_min, args.V_max, self.atoms).to(device=args.device) # Support (range) of z 19 | self.delta_z = (args.V_max - args.V_min) / (self.atoms - 1) 20 | self.batch_size = args.batch_size 21 | self.n = args.multi_step 22 | self.discount = args.discount 23 | self.norm_clip = args.norm_clip 24 | 25 | self.online_net = DQN(args, self.action_space).to(device=args.device) 26 | if args.model: # Load pretrained model if provided 27 | if os.path.isfile(args.model): 28 | state_dict = torch.load(args.model, map_location='cpu') # Always load tensors onto CPU by default, will shift to GPU if necessary 29 | if 'conv1.weight' in state_dict.keys(): 30 | for old_key, new_key in (('conv1.weight', 'convs.0.weight'), ('conv1.bias', 'convs.0.bias'), ('conv2.weight', 'convs.2.weight'), ('conv2.bias', 'convs.2.bias'), ('conv3.weight', 'convs.4.weight'), ('conv3.bias', 'convs.4.bias')): 31 | state_dict[new_key] = state_dict[old_key] # Re-map state dict for old pretrained models 32 | del state_dict[old_key] # Delete old keys for strict load_state_dict 33 | self.online_net.load_state_dict(state_dict) 34 | print("Loading pretrained model: " + args.model) 35 | else: # Raise error if incorrect model path provided 36 | raise FileNotFoundError(args.model) 37 | 38 | self.online_net.train() 39 | 40 | self.target_net = DQN(args, self.action_space).to(device=args.device) 41 | self.update_target_net() 42 | self.target_net.train() 43 | for param in self.target_net.parameters(): 44 | param.requires_grad = False 45 | 46 | self.optimiser = optim.Adam(self.online_net.parameters(), lr=args.learning_rate, eps=args.adam_eps) 47 | 48 | # Resets noisy weights in all linear layers (of online net only) 49 | def reset_noise(self): 50 | self.online_net.reset_noise() 51 | 52 | # Acts based on single state (no batch) 53 | def act(self, state): 54 | with torch.no_grad(): 55 | return (self.online_net(state.unsqueeze(0)) * self.support).sum(2).argmax(1).item() 56 | 57 | # Acts with an ε-greedy policy (used for evaluation only) 58 | def act_e_greedy(self, state, epsilon=0.001): # High ε can reduce evaluation scores drastically 59 | return np.random.randint(0, self.action_space) if np.random.random() < epsilon else self.act(state) 60 | 61 | def learn(self, mem): 62 | # Sample transitions 63 | idxs, states, actions, returns, next_states, nonterminals, weights = mem.sample(self.batch_size) 64 | 65 | # Calculate current state probabilities (online network noise already sampled) 66 | log_ps = self.online_net(states, log=True) # Log probabilities log p(s_t, ·; θonline) 67 | log_ps_a = log_ps[range(self.batch_size), actions] # log p(s_t, a_t; θonline) 68 | 69 | with torch.no_grad(): 70 | # Calculate nth next state probabilities 71 | pns = self.online_net(next_states) # Probabilities p(s_t+n, ·; θonline) 72 | dns = self.support.expand_as(pns) * pns # Distribution d_t+n = (z, p(s_t+n, ·; θonline)) 73 | argmax_indices_ns = 
dns.sum(2).argmax(1) # Perform argmax action selection using online network: argmax_a[(z, p(s_t+n, a; θonline))] 74 | self.target_net.reset_noise() # Sample new target net noise 75 | pns = self.target_net(next_states) # Probabilities p(s_t+n, ·; θtarget) 76 | pns_a = pns[range(self.batch_size), argmax_indices_ns] # Double-Q probabilities p(s_t+n, argmax_a[(z, p(s_t+n, a; θonline))]; θtarget) 77 | 78 | # Compute Tz (Bellman operator T applied to z) 79 | Tz = returns.unsqueeze(1) + nonterminals * (self.discount ** self.n) * self.support.unsqueeze(0) # Tz = R^n + (γ^n)z (accounting for terminal states) 80 | Tz = Tz.clamp(min=self.Vmin, max=self.Vmax) # Clamp between supported values 81 | # Compute L2 projection of Tz onto fixed support z 82 | b = (Tz - self.Vmin) / self.delta_z # b = (Tz - Vmin) / Δz 83 | l, u = b.floor().to(torch.int64), b.ceil().to(torch.int64) 84 | # Fix disappearing probability mass when l = b = u (b is int) 85 | l[(u > 0) * (l == u)] -= 1 86 | u[(l < (self.atoms - 1)) * (l == u)] += 1 87 | 88 | # Distribute probability of Tz 89 | m = states.new_zeros(self.batch_size, self.atoms) 90 | offset = torch.linspace(0, ((self.batch_size - 1) * self.atoms), self.batch_size).unsqueeze(1).expand(self.batch_size, self.atoms).to(actions) 91 | m.view(-1).index_add_(0, (l + offset).view(-1), (pns_a * (u.float() - b)).view(-1)) # m_l = m_l + p(s_t+n, a*)(u - b) 92 | m.view(-1).index_add_(0, (u + offset).view(-1), (pns_a * (b - l.float())).view(-1)) # m_u = m_u + p(s_t+n, a*)(b - l) 93 | 94 | loss = -torch.sum(m * log_ps_a, 1) # Cross-entropy loss (minimises DKL(m||p(s_t, a_t))) 95 | self.online_net.zero_grad() 96 | (weights * loss).mean().backward() # Backpropagate importance-weighted minibatch loss 97 | clip_grad_norm_(self.online_net.parameters(), self.norm_clip) # Clip gradients by L2 norm 98 | self.optimiser.step() 99 | 100 | mem.update_priorities(idxs, loss.detach().cpu().numpy()) # Update priorities of sampled transitions 101 | 102 | def update_target_net(self): 103 | self.target_net.load_state_dict(self.online_net.state_dict()) 104 | 105 | # Save model parameters on current device (don't move model between devices) 106 | def save(self, path, name='model.pth'): 107 | torch.save(self.online_net.state_dict(), os.path.join(path, name)) 108 | 109 | # Evaluates Q-value based on single state (no batch) 110 | def evaluate_q(self, state): 111 | with torch.no_grad(): 112 | return (self.online_net(state.unsqueeze(0)) * self.support).sum(2).max(1)[0].item() 113 | 114 | def train(self): 115 | self.online_net.train() 116 | 117 | def eval(self): 118 | self.online_net.eval() 119 | -------------------------------------------------------------------------------- /code/Rainbow/bash.sh: -------------------------------------------------------------------------------- 1 | CUDA_VISIBLE_DEVICES=0 python main.py --id baseline --game alien --enable-cudnn --tensorboard-dir ./results/rainbow & 2 | CUDA_VISIBLE_DEVICES=1 python main.py --id baseline --game amidar --enable-cudnn --tensorboard-dir ./results/rainbow & 3 | CUDA_VISIBLE_DEVICES=2 python main.py --id baseline --game assault --enable-cudnn --tensorboard-dir ./results/rainbow & 4 | CUDA_VISIBLE_DEVICES=3 python main.py --id baseline --game asterix --enable-cudnn --tensorboard-dir ./results/rainbow & 5 | CUDA_VISIBLE_DEVICES=0 python main.py --id baseline --game bank_heist --enable-cudnn --tensorboard-dir ./results/rainbow & 6 | CUDA_VISIBLE_DEVICES=1 python main.py --id baseline --game battle_zone --enable-cudnn --tensorboard-dir ./results/rainbow & 
7 | CUDA_VISIBLE_DEVICES=2 python main.py --id baseline --game boxing --enable-cudnn --tensorboard-dir ./results/rainbow & 8 | CUDA_VISIBLE_DEVICES=3 python main.py --id baseline --game breakout --enable-cudnn --tensorboard-dir ./results/rainbow & 9 | # CUDA_VISIBLE_DEVICES=0 python main.py --id baseline --game chopper_command --enable-cudnn --tensorboard-dir ./results/rainbow & 10 | # CUDA_VISIBLE_DEVICES=1 python main.py --id baseline --game crazy_climber --enable-cudnn --tensorboard-dir ./results/rainbow & 11 | # CUDA_VISIBLE_DEVICES=2 python main.py --id baseline --game demon_attack --enable-cudnn --tensorboard-dir ./results/rainbow & 12 | # CUDA_VISIBLE_DEVICES=3 python main.py --id baseline --game freeway --enable-cudnn --tensorboard-dir ./results/rainbow & 13 | # CUDA_VISIBLE_DEVICES=0 python main.py --id baseline --game frostbite --enable-cudnn --tensorboard-dir ./results/rainbow & 14 | # CUDA_VISIBLE_DEVICES=1 python main.py --id baseline --game gopher --enable-cudnn --tensorboard-dir ./results/rainbow & 15 | # CUDA_VISIBLE_DEVICES=2 python main.py --id baseline --game hero --enable-cudnn --tensorboard-dir ./results/rainbow & 16 | # CUDA_VISIBLE_DEVICES=3 python main.py --id baseline --game jamesbond --enable-cudnn --tensorboard-dir ./results/rainbow & 17 | # CUDA_VISIBLE_DEVICES=0 python main.py --id baseline --game kangaroo --enable-cudnn --tensorboard-dir ./results/rainbow & 18 | # CUDA_VISIBLE_DEVICES=1 python main.py --id baseline --game krull --enable-cudnn --tensorboard-dir ./results/rainbow & 19 | # CUDA_VISIBLE_DEVICES=2 python main.py --id baseline --game kung_fu_master --enable-cudnn --tensorboard-dir ./results/rainbow & 20 | # CUDA_VISIBLE_DEVICES=3 python main.py --id baseline --game ms_pacman --enable-cudnn --tensorboard-dir ./results/rainbow & 21 | # CUDA_VISIBLE_DEVICES=0 python main.py --id baseline --game pong --enable-cudnn --tensorboard-dir ./results/rainbow & 22 | # CUDA_VISIBLE_DEVICES=1 python main.py --id baseline --game private_eye --enable-cudnn --tensorboard-dir ./results/rainbow & 23 | # CUDA_VISIBLE_DEVICES=2 python main.py --id baseline --game qbert --enable-cudnn --tensorboard-dir ./results/rainbow & 24 | # CUDA_VISIBLE_DEVICES=3 python main.py --id baseline --game road_runner --enable-cudnn --tensorboard-dir ./results/rainbow & 25 | # CUDA_VISIBLE_DEVICES=0 python main.py --id baseline --game seaquest --enable-cudnn --tensorboard-dir ./results/rainbow & 26 | # CUDA_VISIBLE_DEVICES=1 python main.py --id baseline --game up_n_down --enable-cudnn --tensorboard-dir ./results/rainbow &s -------------------------------------------------------------------------------- /code/Rainbow/env.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | from collections import deque 3 | import random 4 | import atari_py 5 | import cv2 6 | import torch 7 | 8 | 9 | class Env(): 10 | def __init__(self, args): 11 | self.device = args.device 12 | self.ale = atari_py.ALEInterface() 13 | self.ale.setInt('random_seed', args.seed) 14 | self.ale.setInt('max_num_frames_per_episode', args.max_episode_length) 15 | self.ale.setFloat('repeat_action_probability', 0) # Disable sticky actions 16 | self.ale.setInt('frame_skip', 0) 17 | self.ale.setBool('color_averaging', False) 18 | self.ale.loadROM(atari_py.get_game_path(args.game)) # ROM loading must be done after setting options 19 | actions = self.ale.getMinimalActionSet() 20 | self.actions = dict([i, e] for i, e in zip(range(len(actions)), actions)) 21 | self.lives = 0 # 
Life counter (used in DeepMind training) 22 | self.life_termination = False # Used to check if resetting only from loss of life 23 | self.window = args.history_length # Number of frames to concatenate 24 | self.state_buffer = deque([], maxlen=args.history_length) 25 | self.training = True # Consistent with model training mode 26 | 27 | def _get_state(self): 28 | state = cv2.resize(self.ale.getScreenGrayscale(), (84, 84), interpolation=cv2.INTER_LINEAR) 29 | return torch.tensor(state, dtype=torch.float32, device=self.device).div_(255) 30 | 31 | def _reset_buffer(self): 32 | for _ in range(self.window): 33 | self.state_buffer.append(torch.zeros(84, 84, device=self.device)) 34 | 35 | def reset(self): 36 | if self.life_termination: 37 | self.life_termination = False # Reset flag 38 | self.ale.act(0) # Use a no-op after loss of life 39 | else: 40 | # Reset internals 41 | self._reset_buffer() 42 | self.ale.reset_game() 43 | # Perform up to 30 random no-ops before starting 44 | for _ in range(random.randrange(30)): 45 | self.ale.act(0) # Assumes raw action 0 is always no-op 46 | if self.ale.game_over(): 47 | self.ale.reset_game() 48 | # Process and return "initial" state 49 | observation = self._get_state() 50 | self.state_buffer.append(observation) 51 | self.lives = self.ale.lives() 52 | return torch.stack(list(self.state_buffer), 0) 53 | 54 | def step(self, action): 55 | # Repeat action 4 times, max pool over last 2 frames 56 | frame_buffer = torch.zeros(2, 84, 84, device=self.device) 57 | reward, done = 0, False 58 | for t in range(4): 59 | reward += self.ale.act(self.actions.get(action)) 60 | if t == 2: 61 | frame_buffer[0] = self._get_state() 62 | elif t == 3: 63 | frame_buffer[1] = self._get_state() 64 | done = self.ale.game_over() 65 | if done: 66 | break 67 | observation = frame_buffer.max(0)[0] 68 | self.state_buffer.append(observation) 69 | # Detect loss of life as terminal in training mode 70 | if self.training: 71 | lives = self.ale.lives() 72 | if lives < self.lives and lives > 0: # Lives > 0 for Q*bert 73 | self.life_termination = not done # Only set flag when not truly done 74 | done = True 75 | self.lives = lives 76 | # Return state, reward, done 77 | return torch.stack(list(self.state_buffer), 0), reward, done 78 | 79 | # Uses loss of life as terminal signal 80 | def train(self): 81 | self.training = True 82 | 83 | # Uses standard terminal signal 84 | def eval(self): 85 | self.training = False 86 | 87 | def action_space(self): 88 | return len(self.actions) 89 | 90 | def render(self): 91 | cv2.imshow('screen', self.ale.getScreenRGB()[:, :, ::-1]) 92 | cv2.waitKey(1) 93 | 94 | def close(self): 95 | cv2.destroyAllWindows() 96 | -------------------------------------------------------------------------------- /code/Rainbow/main.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | """ 3 | TODO: Note that DeepMind's evaluation method is running the latest agent for 500K frames every 1M steps 4 | """ 5 | from __future__ import division 6 | 7 | import torch 8 | from torch.utils.tensorboard import SummaryWriter 9 | from tqdm import trange 10 | import numpy as np 11 | import atari_py 12 | 13 | from datetime import datetime 14 | import argparse 15 | import pickle 16 | import bz2 17 | import os 18 | 19 | from agent import Agent 20 | from env import Env 21 | from memory import ReplayMemory 22 | from test import test 23 | 24 | 25 | def parse_arguments(): 26 | 27 | parser = argparse.ArgumentParser(description='Rainbow') 28 | 
parser.add_argument('--id', type=str, default='default', help='Experiment ID') 29 | parser.add_argument('--seed', type=int, default=123, help='Random seed') 30 | parser.add_argument('--disable-cuda', action='store_true', help='Disable CUDA') 31 | parser.add_argument('--game', type=str, default='space_invaders', choices=atari_py.list_games(), help='ATARI game') 32 | parser.add_argument('--T-max', type=int, default=int(50e6), metavar='STEPS', help='Number of training steps (4x number of frames)') 33 | parser.add_argument('--max-episode-length', type=int, default=int(108e3), metavar='LENGTH', help='Max episode length in game frames (0 to disable)') 34 | parser.add_argument('--history-length', type=int, default=4, metavar='T', help='Number of consecutive states processed') 35 | parser.add_argument('--hidden-size', type=int, default=512, metavar='SIZE', help='Network hidden size') 36 | parser.add_argument('--noisy-std', type=float, default=0.1, metavar='σ', help='Initial standard deviation of noisy linear layers') 37 | parser.add_argument('--atoms', type=int, default=51, metavar='C', help='Discretised size of value distribution') 38 | parser.add_argument('--V-min', type=float, default=-10, metavar='V', help='Minimum of value distribution support') 39 | parser.add_argument('--V-max', type=float, default=10, metavar='V', help='Maximum of value distribution support') 40 | parser.add_argument('--model', type=str, metavar='PARAMS', help='Pretrained model (state dict)') 41 | parser.add_argument('--memory-capacity', type=int, default=int(1e6), metavar='CAPACITY', help='Experience replay memory capacity') 42 | parser.add_argument('--replay-frequency', type=int, default=4, metavar='k', help='Frequency of sampling from memory') 43 | parser.add_argument('--priority-exponent', type=float, default=0.5, metavar='ω', help='Prioritised experience replay exponent (originally denoted α)') 44 | parser.add_argument('--priority-weight', type=float, default=0.4, metavar='β', help='Initial prioritised experience replay importance sampling weight') 45 | parser.add_argument('--multi-step', type=int, default=3, metavar='n', help='Number of steps for multi-step return') 46 | parser.add_argument('--discount', type=float, default=0.99, metavar='γ', help='Discount factor') 47 | parser.add_argument('--target-update', type=int, default=int(8e3), metavar='τ', help='Number of steps after which to update target network') 48 | parser.add_argument('--reward-clip', type=int, default=1, metavar='VALUE', help='Reward clipping (0 to disable)') 49 | parser.add_argument('--learning-rate', type=float, default=0.0000625, metavar='η', help='Learning rate') 50 | parser.add_argument('--adam-eps', type=float, default=1.5e-4, metavar='ε', help='Adam epsilon') 51 | parser.add_argument('--batch-size', type=int, default=32, metavar='SIZE', help='Batch size') 52 | parser.add_argument('--norm-clip', type=float, default=10, metavar='NORM', help='Max L2 norm for gradient clipping') 53 | parser.add_argument('--learn-start', type=int, default=int(20e3), metavar='STEPS', help='Number of steps before starting training') 54 | parser.add_argument('--evaluate', action='store_true', help='Evaluate only') 55 | parser.add_argument('--evaluation-interval', type=int, default=100000, metavar='STEPS', help='Number of training steps between evaluations') 56 | parser.add_argument('--evaluation-episodes', type=int, default=10, metavar='N', help='Number of evaluation episodes to average over') 57 | parser.add_argument('--evaluation-size', type=int, default=500, 
metavar='N', help='Number of transitions to use for validating Q') 58 | parser.add_argument('--render', action='store_true', help='Display screen (testing only)') 59 | parser.add_argument('--enable-cudnn', action='store_true', help='Enable cuDNN (faster but nondeterministic)') 60 | parser.add_argument('--checkpoint-interval', default=int(20e3), help='How often to checkpoint the model, defaults to 0 (never checkpoint)') 61 | parser.add_argument('--memory', help='Path to save/load the memory from') 62 | parser.add_argument('--disable-bzip-memory', action='store_true', help='Don\'t zip the memory file. Not recommended (zipping is a bit slower and much, much smaller)') 63 | parser.add_argument('--tensorboard-dir', type=str, default=None, help='tensorboard directory') 64 | parser.add_argument('--architecture', type=str, default='canonical', choices=['canonical', 'data-efficient'], metavar='ARCH', help='Network architecture') 65 | 66 | args = parser.parse_args() 67 | 68 | return args 69 | 70 | def load_memory(memory_path, disable_bzip): 71 | if disable_bzip: 72 | with open(memory_path, 'rb') as pickle_file: 73 | return pickle.load(pickle_file) 74 | else: 75 | with bz2.open(memory_path, 'rb') as zipped_pickle_file: 76 | return pickle.load(zipped_pickle_file) 77 | 78 | def save_memory(memory, memory_path, disable_bzip): 79 | if disable_bzip: 80 | with open(memory_path, 'wb') as pickle_file: 81 | pickle.dump(memory, pickle_file) 82 | else: 83 | with bz2.open(memory_path, 'wb') as zipped_pickle_file: 84 | pickle.dump(memory, zipped_pickle_file) 85 | 86 | class Logger(object): 87 | def __init__(self, path): 88 | self.path = path 89 | 90 | def info(self, s): 91 | string = '[' + str(datetime.now().strftime('%Y-%m-%dT%H:%M:%S')) + '] ' + s 92 | print(string) 93 | with open(os.path.join(self.path, 'log.txt'), 'a+') as f: 94 | f.writelines([string, '']) 95 | 96 | def main(): 97 | args = parse_arguments() 98 | 99 | results_dir = os.path.join('results', args.id) 100 | os.makedirs(results_dir, exist_ok=True) 101 | logger = Logger(results_dir) 102 | 103 | metrics = {'steps': [], 'rewards': [], 'Qs': [], 'Qstds': [], 'best_avg_reward': -float('inf')} 104 | np.random.seed(args.seed) 105 | torch.manual_seed(np.random.randint(1, 10000)) 106 | if torch.cuda.is_available() and not args.disable_cuda: 107 | args.device = torch.device('cuda') 108 | torch.cuda.manual_seed(np.random.randint(1, 10000)) 109 | torch.backends.cudnn.enabled = args.enable_cudnn 110 | else: 111 | args.device = torch.device('cpu') 112 | 113 | if args.tensorboard_dir is None: 114 | writer = SummaryWriter(os.path.join(results_dir, 'tensorboard', args.game, args.architecture)) 115 | else: 116 | writer = SummaryWriter(os.path.join(args.tensorboard_dir, args.game, args.architecture)) 117 | 118 | # Environment 119 | env = Env(args) 120 | env.train() 121 | action_space = env.action_space() 122 | 123 | # Agent 124 | dqn = Agent(args, env) 125 | 126 | # If a model is provided, and evaluate is fale, presumably we want to resume, so try to load memory 127 | if args.model is not None and not args.evaluate: 128 | if not args.memory: 129 | raise ValueError('Cannot resume training without memory save path. Aborting...') 130 | elif not os.path.exists(args.memory): 131 | raise ValueError('Could not find memory file at {path}. 
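`save_memory` / `load_memory` above persist the replay buffer by pickling it, optionally through bz2 compression. The snippet below is a minimal round-trip sketch with a small stand-in object in place of the real `ReplayMemory`.

```python
import bz2
import os
import pickle
import tempfile

# Stand-in object; in main() this would be the ReplayMemory instance.
dummy_memory = {'transitions': list(range(10))}
path = os.path.join(tempfile.gettempdir(), 'mem.pkl.bz2')

with bz2.open(path, 'wb') as f:   # mirrors save_memory(..., disable_bzip=False)
    pickle.dump(dummy_memory, f)
with bz2.open(path, 'rb') as f:   # mirrors load_memory(..., disable_bzip=False)
    restored = pickle.load(f)

assert restored == dummy_memory
```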
Aborting...'.format(path=args.memory)) 132 | 133 | mem = load_memory(args.memory, args.disable_bzip_memory) 134 | 135 | else: 136 | mem = ReplayMemory(args, args.memory_capacity) 137 | 138 | priority_weight_increase = (1 - args.priority_weight) / (args.T_max - args.learn_start) 139 | 140 | # Construct validation memory 141 | val_mem = ReplayMemory(args, args.evaluation_size) 142 | T, done = 0, True 143 | while T < args.evaluation_size: 144 | if done: 145 | state, done = env.reset(), False 146 | 147 | next_state, _, done = env.step(np.random.randint(0, action_space)) 148 | val_mem.append(state, None, None, done) 149 | state = next_state 150 | T += 1 151 | 152 | if args.evaluate: 153 | dqn.eval() # Set DQN (online network) to evaluation mode 154 | test_result = test(args, 0, dqn, val_mem, metrics, results_dir, evaluate=True) # Test 155 | logger.info('Avg. reward: ' + str(test_result['avg_reward']) \ 156 | + ' | Avg. Q: ' + str(test_result['avg_Q'])) 157 | else: 158 | # Training loop 159 | dqn.train() 160 | T, done = 0, True 161 | accumulate_reward = 0 162 | for T in trange(1, args.T_max + 1): 163 | if done: 164 | state, done = env.reset(), False 165 | writer.add_scalar('Train/Reward', accumulate_reward, T) 166 | accumulate_reward = 0 167 | 168 | if T % args.replay_frequency == 0: 169 | dqn.reset_noise() # Draw a new set of noisy weights 170 | 171 | action = dqn.act(state) # Choose an action greedily (with noisy weights) 172 | next_state, reward, done = env.step(action) # Step 173 | accumulate_reward += reward 174 | if args.reward_clip > 0: 175 | reward = max(min(reward, args.reward_clip), -args.reward_clip) # Clip rewards 176 | mem.append(state, action, reward, done) # Append transition to memory 177 | 178 | # Train and test 179 | if T >= args.learn_start: 180 | mem.priority_weight = min(mem.priority_weight + priority_weight_increase, 1) # Anneal importance sampling weight β to 1 181 | 182 | if T % args.replay_frequency == 0: 183 | dqn.learn(mem) # Train with n-step distributional double-Q learning 184 | 185 | if T % args.evaluation_interval == 0: 186 | dqn.eval() # Set DQN (online network) to evaluation mode 187 | test_result = test(args, T, dqn, val_mem, metrics, results_dir) # Test 188 | for k, v in test_result.items(): 189 | writer.add_scalar('Eval/{}'.format(k), v, T) 190 | logger.info('T = ' + str(T) + ' / ' + str(args.T_max) + \ 191 | ' | Avg. reward: ' + str(test_result['avg_reward']) + \ 192 | ' | Avg. 
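In the training loop above, the prioritised-replay importance-sampling exponent β (`priority_weight`) is annealed linearly from its initial value to 1 between `learn_start` and `T_max`. The sketch below is a closed-form version of that per-step increment, evaluated at a few points using the parser defaults (β₀ = 0.4, T_max = 50M, learn_start = 20k).

```python
def annealed_beta(step, beta0=0.4, t_max=int(50e6), learn_start=int(20e3)):
    """Linear schedule equivalent to adding priority_weight_increase each step
    once step >= learn_start, clamped at 1 (mirrors the loop above)."""
    if step < learn_start:
        return beta0
    increase = (1.0 - beta0) / (t_max - learn_start)
    return min(beta0 + increase * (step - learn_start), 1.0)

for step in (0, int(20e3), int(25e6), int(50e6)):
    print(step, round(annealed_beta(step), 4))
# 0 -> 0.4, 20000 -> 0.4, 25M -> ~0.7, 50M -> 1.0
```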
Q: ' + str(test_result['avg_Q'])) 193 | dqn.train() # Set DQN (online network) back to training mode 194 | 195 | # If memory path provided, save it 196 | if args.memory is not None: 197 | save_memory(mem, args.memory, args.disable_bzip_memory) 198 | 199 | # Update target network 200 | if T % args.target_update == 0: 201 | dqn.update_target_net() 202 | 203 | # Checkpoint the network 204 | if T % args.checkpoint_interval == 0: 205 | dqn.save(results_dir, 'checkpoint.pth') 206 | 207 | state = next_state 208 | 209 | env.close() 210 | 211 | if __name__ == '__main__': 212 | main() 213 | -------------------------------------------------------------------------------- /code/Rainbow/memory.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | from __future__ import division 3 | from collections import namedtuple 4 | import numpy as np 5 | import torch 6 | 7 | 8 | Transition = namedtuple('Transition', ('timestep', 'state', 'action', 'reward', 'nonterminal')) 9 | blank_trans = Transition(0, torch.zeros(84, 84, dtype=torch.uint8), None, 0, False) 10 | 11 | 12 | # Segment tree data structure where parent node values are sum/max of children node values 13 | class SegmentTree(): 14 | def __init__(self, size): 15 | self.index = 0 16 | self.size = size 17 | self.full = False # Used to track actual capacity 18 | self.sum_tree = np.zeros((2 * size - 1, ), dtype=np.float32) # Initialise fixed size tree with all (priority) zeros 19 | self.data = np.array([None] * size) # Wrap-around cyclic buffer 20 | self.max = 1 # Initial max value to return (1 = 1^ω) 21 | 22 | # Propagates value up tree given a tree index 23 | def _propagate(self, index, value): 24 | parent = (index - 1) // 2 25 | left, right = 2 * parent + 1, 2 * parent + 2 26 | self.sum_tree[parent] = self.sum_tree[left] + self.sum_tree[right] 27 | if parent != 0: 28 | self._propagate(parent, value) 29 | 30 | # Updates value given a tree index 31 | def update(self, index, value): 32 | self.sum_tree[index] = value # Set new value 33 | self._propagate(index, value) # Propagate value 34 | self.max = max(value, self.max) 35 | 36 | def append(self, data, value): 37 | self.data[self.index] = data # Store data in underlying data structure 38 | self.update(self.index + self.size - 1, value) # Update tree 39 | self.index = (self.index + 1) % self.size # Update index 40 | self.full = self.full or self.index == 0 # Save when capacity reached 41 | self.max = max(value, self.max) 42 | 43 | # Searches for the location of a value in sum tree 44 | def _retrieve(self, index, value): 45 | left, right = 2 * index + 1, 2 * index + 2 46 | if left >= len(self.sum_tree): 47 | return index 48 | elif value <= self.sum_tree[left]: 49 | return self._retrieve(left, value) 50 | else: 51 | return self._retrieve(right, value - self.sum_tree[left]) 52 | 53 | # Searches for a value in sum tree and returns value, data index and tree index 54 | def find(self, value): 55 | index = self._retrieve(0, value) # Search for index of item from root 56 | data_index = index - self.size + 1 57 | return (self.sum_tree[index], data_index, index) # Return value, data index, tree index 58 | 59 | # Returns data given a data index 60 | def get(self, data_index): 61 | return self.data[data_index % self.size] 62 | 63 | def total(self): 64 | return self.sum_tree[0] 65 | 66 | class ReplayMemory(): 67 | def __init__(self, args, capacity): 68 | self.device = args.device 69 | self.capacity = capacity 70 | self.history = args.history_length 71 | 
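The `SegmentTree` above stores priorities in a binary sum tree so that sampling an index with probability proportional to its priority (via `_retrieve`/`find`) costs O(log n). For reference, the flat NumPy version below computes the same thing in O(n) using a cumulative sum; it is a semantic sketch, not a drop-in replacement.

```python
import numpy as np

priorities = np.array([1.0, 0.5, 3.0, 0.2, 1.3], dtype=np.float32)
total = priorities.sum()                      # what SegmentTree.total() returns

def find(value, priorities=priorities):
    """Index whose cumulative-priority interval contains `value`
    (flat equivalent of SegmentTree._retrieve)."""
    return int(np.searchsorted(np.cumsum(priorities), value, side='left'))

# Sampling value uniformly in [0, total) picks index i with probability p_i / total.
samples = [find(v) for v in np.random.uniform(0, total, size=10000)]
print(np.bincount(samples, minlength=len(priorities)) / 10000)  # ≈ priorities / total
```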
self.discount = args.discount 72 | self.n = args.multi_step 73 | self.priority_weight = args.priority_weight # Initial importance sampling weight β, annealed to 1 over course of training 74 | self.priority_exponent = args.priority_exponent 75 | self.t = 0 # Internal episode timestep counter 76 | self.transitions = SegmentTree(capacity) # Store transitions in a wrap-around cyclic buffer within a sum tree for querying priorities 77 | 78 | # Adds state and action at time t, reward and terminal at time t + 1 79 | def append(self, state, action, reward, terminal): 80 | state = state[-1].mul(255).to(dtype=torch.uint8, device=torch.device('cpu')) # Only store last frame and discretise to save memory 81 | self.transitions.append(Transition(self.t, state, action, reward, not terminal), self.transitions.max) # Store new transition with maximum priority 82 | self.t = 0 if terminal else self.t + 1 # Start new episodes with t = 0 83 | 84 | # Returns a transition with blank states where appropriate 85 | def _get_transition(self, idx): 86 | transition = np.array([None] * (self.history + self.n)) 87 | transition[self.history - 1] = self.transitions.get(idx) 88 | for t in range(self.history - 2, -1, -1): # e.g. 2 1 0 89 | if transition[t + 1].timestep == 0: 90 | transition[t] = blank_trans # If future frame has timestep 0 91 | else: 92 | transition[t] = self.transitions.get(idx - self.history + 1 + t) 93 | for t in range(self.history, self.history + self.n): # e.g. 4 5 6 94 | if transition[t - 1].nonterminal: 95 | transition[t] = self.transitions.get(idx - self.history + 1 + t) 96 | else: 97 | transition[t] = blank_trans # If prev (next) frame is terminal 98 | return transition 99 | 100 | # Returns a valid sample from a segment 101 | def _get_sample_from_segment(self, segment, i): 102 | valid = False 103 | while not valid: 104 | sample = np.random.uniform(i * segment, (i + 1) * segment) # Uniformly sample an element from within a segment 105 | prob, idx, tree_idx = self.transitions.find(sample) # Retrieve sample from tree with un-normalised probability 106 | # Resample if transition straddled current index or probablity 0 107 | if (self.transitions.index - idx) % self.capacity > self.n and (idx - self.transitions.index) % self.capacity >= self.history and prob != 0: 108 | valid = True # Note that conditions are valid but extra conservative around buffer index 0 109 | 110 | # Retrieve all required transition data (from t - h to t + n) 111 | transition = self._get_transition(idx) 112 | # Create un-discretised state and nth next state 113 | state = torch.stack([trans.state for trans in transition[:self.history]]).to(device=self.device).to(dtype=torch.float32).div_(255) 114 | next_state = torch.stack([trans.state for trans in transition[self.n:self.n + self.history]]).to(device=self.device).to(dtype=torch.float32).div_(255) 115 | # Discrete action to be used as index 116 | action = torch.tensor([transition[self.history - 1].action], dtype=torch.int64, device=self.device) 117 | # Calculate truncated n-step discounted return R^n = Σ_k=0->n-1 (γ^k)R_t+k+1 (note that invalid nth next states have reward 0) 118 | R = torch.tensor([sum(self.discount ** n * transition[self.history + n - 1].reward for n in range(self.n))], dtype=torch.float32, device=self.device) 119 | # Mask for non-terminal nth next states 120 | nonterminal = torch.tensor([transition[self.history + self.n - 1].nonterminal], dtype=torch.float32, device=self.device) 121 | 122 | return prob, idx, tree_idx, state, action, R, next_state, nonterminal 123 | 
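`_get_sample_from_segment` above assembles the truncated n-step return R(n)_t = Σ_{k=0}^{n-1} γ^k r_{t+k+1}, with rewards beyond a terminal transition replaced by the blank transition's reward of 0. A minimal sketch of that sum:

```python
def n_step_return(rewards, gamma=0.99, n=3):
    """Truncated n-step return, matching the R computed in
    _get_sample_from_segment above; rewards past a terminal step are 0."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards[:n]))

print(n_step_return([1.0, 0.0, 1.0], gamma=0.99, n=3))   # 1 + 0 + 0.99**2 = 1.9801
```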
124 | def sample(self, batch_size): 125 | p_total = self.transitions.total() # Retrieve sum of all priorities (used to create a normalised probability distribution) 126 | segment = p_total / batch_size # Batch size number of segments, based on sum over all probabilities 127 | batch = [self._get_sample_from_segment(segment, i) for i in range(batch_size)] # Get batch of valid samples 128 | probs, idxs, tree_idxs, states, actions, returns, next_states, nonterminals = zip(*batch) 129 | states, next_states, = torch.stack(states), torch.stack(next_states) 130 | actions, returns, nonterminals = torch.cat(actions), torch.cat(returns), torch.stack(nonterminals) 131 | probs = np.array(probs, dtype=np.float32) / p_total # Calculate normalised probabilities 132 | capacity = self.capacity if self.transitions.full else self.transitions.index 133 | weights = (capacity * probs) ** -self.priority_weight # Compute importance-sampling weights w 134 | weights = torch.tensor(weights / weights.max(), dtype=torch.float32, device=self.device) # Normalise by max importance-sampling weight from batch 135 | return tree_idxs, states, actions, returns, next_states, nonterminals, weights 136 | 137 | 138 | def update_priorities(self, idxs, priorities): 139 | priorities = np.power(priorities, self.priority_exponent) 140 | [self.transitions.update(idx, priority) for idx, priority in zip(idxs, priorities)] 141 | 142 | # Set up internal state for iterator 143 | def __iter__(self): 144 | self.current_idx = 0 145 | return self 146 | 147 | # Return valid states for validation 148 | def __next__(self): 149 | if self.current_idx == self.capacity: 150 | raise StopIteration 151 | # Create stack of states 152 | state_stack = [None] * self.history 153 | state_stack[-1] = self.transitions.data[self.current_idx].state 154 | prev_timestep = self.transitions.data[self.current_idx].timestep 155 | for t in reversed(range(self.history - 1)): 156 | if prev_timestep == 0: 157 | state_stack[t] = blank_trans.state # If future frame has timestep 0 158 | else: 159 | state_stack[t] = self.transitions.data[self.current_idx + t - self.history + 1].state 160 | prev_timestep -= 1 161 | state = torch.stack(state_stack, 0).to(dtype=torch.float32, device=self.device).div_(255) # Agent will turn into batch 162 | self.current_idx += 1 163 | return state 164 | 165 | next = __next__ # Alias __next__ for Python 2 compatibility 166 | -------------------------------------------------------------------------------- /code/Rainbow/model.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | from __future__ import division 3 | import math 4 | import torch 5 | from torch import nn 6 | from torch.nn import functional as F 7 | 8 | 9 | # Factorised NoisyLinear layer with bias 10 | class NoisyLinear(nn.Module): 11 | def __init__(self, in_features, out_features, std_init=0.5): 12 | super(NoisyLinear, self).__init__() 13 | self.in_features = in_features 14 | self.out_features = out_features 15 | self.std_init = std_init 16 | self.weight_mu = nn.Parameter(torch.empty(out_features, in_features)) 17 | self.weight_sigma = nn.Parameter(torch.empty(out_features, in_features)) 18 | self.register_buffer('weight_epsilon', torch.empty(out_features, in_features)) 19 | self.bias_mu = nn.Parameter(torch.empty(out_features)) 20 | self.bias_sigma = nn.Parameter(torch.empty(out_features)) 21 | self.register_buffer('bias_epsilon', torch.empty(out_features)) 22 | self.reset_parameters() 23 | self.reset_noise() 24 | 25 | def 
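`sample` above turns the sampled priorities into importance-sampling weights w_i = (N · P(i))^(−β), normalised by the largest weight in the batch so that updates are only ever scaled down. A small NumPy sketch of that computation:

```python
import numpy as np

def importance_weights(sampled_priorities, total_priority, capacity, beta):
    """w_i = (N * P(i))^(-beta), normalised by max(w), as in sample() above."""
    probs = np.asarray(sampled_priorities, dtype=np.float32) / total_priority
    weights = (capacity * probs) ** (-beta)
    return weights / weights.max()

print(importance_weights([3.0, 0.5, 1.0], total_priority=6.0,
                         capacity=1000, beta=0.4))
```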
reset_parameters(self): 26 | mu_range = 1 / math.sqrt(self.in_features) 27 | self.weight_mu.data.uniform_(-mu_range, mu_range) 28 | self.weight_sigma.data.fill_(self.std_init / math.sqrt(self.in_features)) 29 | self.bias_mu.data.uniform_(-mu_range, mu_range) 30 | self.bias_sigma.data.fill_(self.std_init / math.sqrt(self.out_features)) 31 | 32 | def _scale_noise(self, size): 33 | x = torch.randn(size) 34 | return x.sign().mul_(x.abs().sqrt_()) 35 | 36 | def reset_noise(self): 37 | epsilon_in = self._scale_noise(self.in_features) 38 | epsilon_out = self._scale_noise(self.out_features) 39 | self.weight_epsilon.copy_(epsilon_out.ger(epsilon_in)) 40 | self.bias_epsilon.copy_(epsilon_out) 41 | 42 | def forward(self, input): 43 | if self.training: 44 | return F.linear(input, self.weight_mu + self.weight_sigma * self.weight_epsilon, self.bias_mu + self.bias_sigma * self.bias_epsilon) 45 | else: 46 | return F.linear(input, self.weight_mu, self.bias_mu) 47 | 48 | 49 | class DQN(nn.Module): 50 | def __init__(self, args, action_space): 51 | super(DQN, self).__init__() 52 | self.atoms = args.atoms 53 | self.action_space = action_space 54 | 55 | if args.architecture == 'canonical': 56 | self.convs = nn.Sequential(nn.Conv2d(args.history_length, 32, 8, stride=4, padding=0), nn.ReLU(), 57 | nn.Conv2d(32, 64, 4, stride=2, padding=0), nn.ReLU(), 58 | nn.Conv2d(64, 64, 3, stride=1, padding=0), nn.ReLU()) 59 | self.conv_output_size = 3136 60 | elif args.architecture == 'data-efficient': 61 | self.convs = nn.Sequential(nn.Conv2d(args.history_length, 32, 5, stride=5, padding=0), nn.ReLU(), 62 | nn.Conv2d(32, 64, 5, stride=5, padding=0), nn.ReLU()) 63 | self.conv_output_size = 576 64 | 65 | self.fc_h_v = NoisyLinear(self.conv_output_size, args.hidden_size, std_init=args.noisy_std) 66 | self.fc_h_a = NoisyLinear(self.conv_output_size, args.hidden_size, std_init=args.noisy_std) 67 | self.fc_z_v = NoisyLinear(args.hidden_size, self.atoms, std_init=args.noisy_std) 68 | self.fc_z_a = NoisyLinear(args.hidden_size, action_space * self.atoms, std_init=args.noisy_std) 69 | 70 | def representation(self, x): 71 | x = self.convs(x) 72 | x = x.view(-1, self.conv_output_size) 73 | return x 74 | 75 | def forward(self, x, log=False): 76 | x = self.convs(x) 77 | x = x.view(-1, self.conv_output_size) 78 | v = self.fc_z_v(F.relu(self.fc_h_v(x))) # Value stream 79 | a = self.fc_z_a(F.relu(self.fc_h_a(x))) # Advantage stream 80 | v, a = v.view(-1, 1, self.atoms), a.view(-1, self.action_space, self.atoms) 81 | q = v + a - a.mean(1, keepdim=True) # Combine streams 82 | if log: # Use log softmax for numerical stability 83 | q = F.log_softmax(q, dim=2) # Log probabilities with action over second dimension 84 | else: 85 | q = F.softmax(q, dim=2) # Probabilities with action over second dimension 86 | return q 87 | 88 | def reset_noise(self): 89 | for name, module in self.named_children(): 90 | if 'fc' in name: 91 | module.reset_noise() 92 | -------------------------------------------------------------------------------- /code/Rainbow/requirements.txt: -------------------------------------------------------------------------------- 1 | atari-py==0.2.6 2 | opencv-python==4.2.0.34 3 | plotly==4.8.1 4 | procgen==0.10.3 5 | tensorboardX==2.0 6 | torch==1.5.1 7 | tqdm==4.42.1 8 | tensorflow<2.0.0,>=1.4.0 9 | numpy<2.0.0,>=1.17.0 10 | pathos==0.2.6 11 | -------------------------------------------------------------------------------- /code/Rainbow/test.py: -------------------------------------------------------------------------------- 1 | from 
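`NoisyLinear` above uses factorised Gaussian noise: two vectors ε_in and ε_out are transformed with f(x) = sign(x)·√|x| and combined by an outer product to form the weight noise, while the bias noise is f(ε_out) alone. The sketch below isolates just that noise generation.

```python
import torch

def factorised_noise(in_features, out_features):
    """Noise generation as in NoisyLinear.reset_noise above:
    f(x) = sign(x) * sqrt(|x|); weight noise is the outer product f(eps_out) f(eps_in)^T."""
    f = lambda x: x.sign().mul(x.abs().sqrt())
    eps_in, eps_out = f(torch.randn(in_features)), f(torch.randn(out_features))
    weight_eps = eps_out.unsqueeze(1) * eps_in.unsqueeze(0)   # same as eps_out.ger(eps_in)
    return weight_eps, eps_out

w_eps, b_eps = factorised_noise(3136, 512)
print(w_eps.shape, b_eps.shape)   # torch.Size([512, 3136]) torch.Size([512])
```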
__future__ import division 2 | import os 3 | import plotly 4 | from plotly.graph_objs import Scatter 5 | from plotly.graph_objs.scatter import Line 6 | import numpy as np 7 | import torch 8 | from env import Env 9 | 10 | 11 | # Test DQN 12 | def test(args, T, agent, val_mem, metrics, results_dir, evaluate=False, plot=False): 13 | 14 | env = Env(args) 15 | env.eval() 16 | metrics['steps'].append(T) 17 | T_rewards, T_Qs, T_Qstds = [], [], [] 18 | 19 | # Test performance over several episodes 20 | return_trajs = np.array([]) 21 | Q_trajs = np.array([]) 22 | Qstd_trajs = np.array([]) 23 | done = True 24 | for _ in range(args.evaluation_episodes): 25 | while True: 26 | if done: 27 | state, reward_traj, reward_sum, state_traj, done = env.reset(), [], 0, [], False 28 | 29 | state_traj.append(state) 30 | action = agent.act(state) # Choose an action greedily (possibly with noisy net) 31 | state, reward, done = env.step(action) # Step 32 | reward_traj.append(reward) 33 | reward_sum += reward 34 | if args.render: 35 | env.render() 36 | 37 | if done: 38 | T_rewards.append(reward_sum) 39 | reward_traj = np.array(reward_traj) 40 | return_trajs = np.append(return_trajs, np.cumsum(reward_traj[::-1])[::-1]) 41 | t_Qs, t_Qstds = [], [] 42 | for state in state_traj: 43 | res = agent.evaluate_q(state) 44 | t_Qs.append(res) 45 | t_Qstds.append(0) 46 | Q_trajs = np.append(Q_trajs, np.array(t_Qs)) 47 | Qstd_trajs = np.append(Qstd_trajs, np.array(t_Qstds)) 48 | break 49 | env.close() 50 | 51 | # Test Q-values over validation memory 52 | for state in val_mem: # Iterate over valid states 53 | res = agent.evaluate_q(state) 54 | T_Qs.append(res) 55 | T_Qstds.append(0) 56 | 57 | avg_reward = sum(T_rewards) / len(T_rewards) 58 | avg_Q = sum(T_Qs) / len(T_Qs) 59 | avg_Qstd = sum(T_Qstds) / len(T_Qstds) 60 | 61 | if not evaluate: 62 | # Save model parameters if improved 63 | if avg_reward > metrics['best_avg_reward']: 64 | metrics['best_avg_reward'] = avg_reward 65 | agent.save(results_dir) 66 | 67 | # Append to results and save metrics 68 | metrics['rewards'].append(T_rewards) 69 | metrics['Qs'].append(T_Qs) 70 | metrics['Qstds'].append(T_Qstds) 71 | torch.save(metrics, os.path.join(results_dir, 'metrics.pth')) 72 | 73 | # Plot 74 | if plot: 75 | _plot_line(metrics['steps'], metrics['rewards'], 'Reward', path=results_dir) 76 | _plot_line(metrics['steps'], metrics['Qs'], 'Q', path=results_dir) 77 | _plot_line(metrics['steps'], metrics['Qstds'], 'Qstd', path=results_dir) 78 | 79 | avg_R = np.mean(return_trajs) 80 | bias_trajs = (Q_trajs - return_trajs) / (np.abs(avg_R) + 1e-6) 81 | test_result = { 82 | 'avg_reward': avg_reward, 83 | 'avg_Q_fixed_set': avg_Q, 84 | 'avg_Qstd_fixed_set': avg_Qstd, 85 | 'avg_Q': np.mean(Q_trajs), 86 | 'avg_Qstd': np.mean(Qstd_trajs), 87 | 'avg_R': avg_R, 88 | 'mean_bias': np.mean(bias_trajs), 89 | 'std_bias': np.std(bias_trajs), 90 | } 91 | 92 | return test_result 93 | 94 | 95 | # Plots min, max and mean + standard deviation bars of a population over time 96 | def _plot_line(xs, ys_population, title, path=''): 97 | max_colour, mean_colour, std_colour, transparent = 'rgb(0, 132, 180)', 'rgb(0, 172, 237)', 'rgba(29, 202, 255, 0.2)', 'rgba(0, 0, 0, 0)' 98 | 99 | ys = torch.tensor(ys_population, dtype=torch.float32) 100 | ys_min, ys_max, ys_mean, ys_std = ys.min(1)[0].squeeze(), ys.max(1)[0].squeeze(), ys.mean(1).squeeze(), ys.std(1).squeeze() 101 | ys_upper, ys_lower = ys_mean + ys_std, ys_mean - ys_std 102 | 103 | trace_max = Scatter(x=xs, y=ys_max.numpy(), line=Line(color=max_colour, 
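`test` above compares on-trajectory Q estimates against undiscounted returns-to-go (reverse cumulative reward sums) and reports the normalised difference as `mean_bias` / `std_bias`. The sketch below reproduces those two ingredients for a single hypothetical episode; the Q values are placeholders, not outputs of the real agent.

```python
import numpy as np

rewards = np.array([1.0, 0.0, 2.0, 1.0])          # one evaluation episode
returns_to_go = np.cumsum(rewards[::-1])[::-1]    # same trick as in test() above
print(returns_to_go)                              # [4. 3. 3. 1.]

q_estimates = np.array([3.5, 3.2, 2.7, 1.1])      # hypothetical agent.evaluate_q outputs
avg_R = returns_to_go.mean()
bias = (q_estimates - returns_to_go) / (np.abs(avg_R) + 1e-6)
print(bias.mean(), bias.std())                    # -> mean_bias / std_bias in test_result
```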
dash='dash'), name='Max') 104 | trace_upper = Scatter(x=xs, y=ys_upper.numpy(), line=Line(color=transparent), name='+1 Std. Dev.', showlegend=False) 105 | trace_mean = Scatter(x=xs, y=ys_mean.numpy(), fill='tonexty', fillcolor=std_colour, line=Line(color=mean_colour), name='Mean') 106 | trace_lower = Scatter(x=xs, y=ys_lower.numpy(), fill='tonexty', fillcolor=std_colour, line=Line(color=transparent), name='-1 Std. Dev.', showlegend=False) 107 | trace_min = Scatter(x=xs, y=ys_min.numpy(), line=Line(color=max_colour, dash='dash'), name='Min') 108 | 109 | plotly.offline.plot({ 110 | 'data': [trace_upper, trace_mean, trace_lower, trace_min, trace_max], 111 | 'layout': dict(title=title, xaxis={'title': 'Step'}, yaxis={'title': title}) 112 | }, filename=os.path.join(path, title + '.html'), auto_open=False) 113 | -------------------------------------------------------------------------------- /code/SAC-discrete/.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/zhangchuheng123/Reinforcement-Implementation/c04e0df10ec29ef775ea31395a8ad4b917302d24/code/SAC-discrete/.DS_Store -------------------------------------------------------------------------------- /code/SAC-discrete/bash.sh: -------------------------------------------------------------------------------- 1 | CUDA_VISIBLE_DEVICES=0 python sac_discrete.py --config config/debug.yaml -------------------------------------------------------------------------------- /code/SAC-discrete/config/.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/zhangchuheng123/Reinforcement-Implementation/c04e0df10ec29ef775ea31395a8ad4b917302d24/code/SAC-discrete/config/.DS_Store -------------------------------------------------------------------------------- /code/SAC-discrete/config/default.yaml: -------------------------------------------------------------------------------- 1 | basic: 2 | device: cuda 3 | accuracy: float32 4 | seed: 6666 5 | verbose: 2 6 | 7 | algo: 8 | num_steps: 5000000 # 5M 9 | batch_size: 64 10 | lr: 0.0003 11 | memory_size: 300000 # 300k 12 | multi_step: 1 13 | target_entropy_ratio: 0.98 14 | target_update_interval: 8000 15 | use_per: False 16 | use_dueling: False 17 | start_steps: 5000 18 | normalization_steps: 2000 19 | lamda: 0.97 20 | clip_reward: True 21 | zscore_reward: False 22 | normalize_state: False 23 | hidden_size: 64 24 | update_interval: 4 25 | log_interval: 20 26 | eval_interval: 10000 27 | no_term: False 28 | evaluate_steps: 6000 29 | gamma: 0.98 30 | 31 | env: 32 | num_parallel_envs: 1 33 | name: PongNoFrameskip-v4 34 | state_dtype: uint8 35 | encoder: CNN -------------------------------------------------------------------------------- /code/SAC-discrete/env.py: -------------------------------------------------------------------------------- 1 | # NOTE: this code was mainly taken from: 2 | # https://github.com/hill-a/stable-baselines/blob/master/stable_baselines/common/atari_wrappers.py 3 | from collections import deque 4 | 5 | import numpy as np 6 | import gym 7 | from gym import spaces, wrappers 8 | import cv2 9 | cv2.ocl.setUseOpenCL(False) 10 | 11 | 12 | class NoopResetEnv(gym.Wrapper): 13 | def __init__(self, env, noop_max=30): 14 | """ 15 | Sample initial states by taking random number of no-ops on reset. 16 | No-op is assumed to be action 0. 
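`default.yaml` above groups the SAC-discrete hyperparameters into `basic`, `algo`, and `env` sections. One way to consume such a file is sketched below (an editor's sketch assuming PyYAML is installed and the script is run from `code/SAC-discrete/`; the actual loading code lives in `train.py`, which handles this its own way).

```python
import yaml  # PyYAML, assumed available

with open('config/default.yaml') as f:
    config = yaml.safe_load(f)

print(config['algo']['lr'])    # 0.0003
print(config['env']['name'])   # PongNoFrameskip-v4
# The Struct helper in utils.py (later in this dump) can wrap a section like
# config['algo'] to give attribute-style access, e.g. algo.lr.
```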
17 | :param env: (Gym Environment) the environment to wrap 18 | :param noop_max: (int) the maximum value of no-ops to run 19 | """ 20 | gym.Wrapper.__init__(self, env) 21 | self.noop_max = noop_max 22 | self.override_num_noops = None 23 | self.noop_action = 0 24 | assert env.unwrapped.get_action_meanings()[0] == 'NOOP' 25 | 26 | def reset(self, **kwargs): 27 | self.env.reset(**kwargs) 28 | if self.override_num_noops is not None: 29 | noops = self.override_num_noops 30 | else: 31 | noops = self.unwrapped.np_random.integers(1, self.noop_max + 1) 32 | assert noops > 0 33 | obs = None 34 | for _ in range(noops): 35 | obs, _, done, _ = self.env.step(self.noop_action) 36 | if done: 37 | obs = self.env.reset(**kwargs) 38 | return obs 39 | 40 | def step(self, action): 41 | return self.env.step(action) 42 | 43 | 44 | class FireResetEnv(gym.Wrapper): 45 | def __init__(self, env): 46 | """ 47 | Take action on reset for environments that are fixed until firing. 48 | :param env: (Gym Environment) the environment to wrap 49 | """ 50 | gym.Wrapper.__init__(self, env) 51 | assert env.unwrapped.get_action_meanings()[1] == 'FIRE' 52 | assert len(env.unwrapped.get_action_meanings()) >= 3 53 | 54 | def reset(self, **kwargs): 55 | self.env.reset(**kwargs) 56 | obs, _, done, _ = self.env.step(1) 57 | if done: 58 | self.env.reset(**kwargs) 59 | obs, _, done, _ = self.env.step(2) 60 | if done: 61 | self.env.reset(**kwargs) 62 | return obs 63 | 64 | def step(self, action): 65 | return self.env.step(action) 66 | 67 | 68 | class EpisodicLifeEnv(gym.Wrapper): 69 | def __init__(self, env): 70 | """ 71 | Make end-of-life == end-of-episode, but only reset on true game over. 72 | Done by DeepMind for the DQN and co. since it helps value estimation. 73 | :param env: (Gym Environment) the environment to wrap 74 | """ 75 | gym.Wrapper.__init__(self, env) 76 | self.lives = 0 77 | self.was_real_done = True 78 | 79 | def step(self, action): 80 | obs, reward, done, info = self.env.step(action) 81 | self.was_real_done = done 82 | # check current lives, make loss of life terminal, 83 | # then update lives to handle bonus lives 84 | lives = self.env.unwrapped.ale.lives() 85 | if 0 < lives < self.lives: 86 | # for Qbert sometimes we stay in lives == 0 condtion for a few 87 | # frames so its important to keep lives > 0, so that we only reset 88 | # once the environment advertises done. 89 | done = True 90 | self.lives = lives 91 | return obs, reward, done, info 92 | 93 | def reset(self, **kwargs): 94 | """ 95 | Calls the Gym environment reset, only when lives are exhausted. 96 | This way all states are still reachable even though lives are episodic, 97 | and the learner need not know about any of this behind-the-scenes. 
98 | :param kwargs: Extra keywords passed to env.reset() call 99 | :return: ([int] or [float]) the first observation of the environment 100 | """ 101 | if self.was_real_done: 102 | obs = self.env.reset(**kwargs) 103 | else: 104 | # no-op step to advance from terminal/lost life state 105 | obs, _, _, _ = self.env.step(0) 106 | self.lives = self.env.unwrapped.ale.lives() 107 | return obs 108 | 109 | 110 | class MaxAndSkipEnv(gym.Wrapper): 111 | def __init__(self, env, skip=4): 112 | """ 113 | Return only every `skip`-th frame (frameskipping) 114 | :param env: (Gym Environment) the environment 115 | :param skip: (int) number of `skip`-th frame 116 | """ 117 | gym.Wrapper.__init__(self, env) 118 | # most recent raw observations (for max pooling across time steps) 119 | self._obs_buffer = np.zeros( 120 | (2,)+env.observation_space.shape, 121 | dtype=env.observation_space.dtype) 122 | self._skip = skip 123 | 124 | def step(self, action): 125 | """ 126 | Step the environment with the given action 127 | Repeat action, sum reward, and max over last observations. 128 | :param action: ([int] or [float]) the action 129 | :return: ([int] or [float], [float], [bool], dict) observation, reward, 130 | done, information 131 | """ 132 | total_reward = 0.0 133 | done = None 134 | for i in range(self._skip): 135 | obs, reward, done, info = self.env.step(action) 136 | if i == self._skip - 2: 137 | self._obs_buffer[0] = obs 138 | if i == self._skip - 1: 139 | self._obs_buffer[1] = obs 140 | total_reward += reward 141 | if done: 142 | break 143 | # Note that the observation on the done=True frame 144 | # doesn't matter 145 | max_frame = self._obs_buffer.max(axis=0) 146 | 147 | return max_frame, total_reward, done, info 148 | 149 | def reset(self, **kwargs): 150 | return self.env.reset(**kwargs) 151 | 152 | 153 | class ClipRewardEnv(gym.RewardWrapper): 154 | def __init__(self, env): 155 | """ 156 | clips the reward to {+1, 0, -1} by its sign. 157 | :param env: (Gym Environment) the environment 158 | """ 159 | gym.RewardWrapper.__init__(self, env) 160 | 161 | def reward(self, reward): 162 | """ 163 | Bin reward to {+1, 0, -1} by its sign. 164 | :param reward: (float) 165 | """ 166 | return np.sign(reward) 167 | 168 | 169 | class WarpFramePyTorch(gym.ObservationWrapper): 170 | def __init__(self, env): 171 | """ 172 | Warp frames to 84x84 as done in the Nature paper and later work. 173 | :param env: (Gym Environment) the environment 174 | """ 175 | gym.ObservationWrapper.__init__(self, env) 176 | self.width = 84 177 | self.height = 84 178 | self.observation_space = spaces.Box( 179 | low=0, high=255, shape=(1, self.height, self.width), 180 | dtype=env.observation_space.dtype) 181 | 182 | def observation(self, frame): 183 | """ 184 | returns the current observation from a frame 185 | :param frame: ([int] or [float]) environment frame 186 | :return: ([int] or [float]) the observation 187 | """ 188 | frame = cv2.cvtColor(frame, cv2.COLOR_RGB2GRAY) 189 | frame = cv2.resize( 190 | frame, (self.width, self.height), interpolation=cv2.INTER_AREA) 191 | return frame[None, :, :] 192 | 193 | 194 | class FrameStackPyTorch(gym.Wrapper): 195 | def __init__(self, env, n_frames): 196 | """Stack n_frames last frames. 197 | Returns lazy array, which is much more memory efficient. 
198 | See Also 199 | -------- 200 | stable_baselines.common.atari_wrappers.LazyFrames 201 | :param env: (Gym Environment) the environment 202 | :param n_frames: (int) the number of frames to stack 203 | """ 204 | assert env.observation_space.dtype == np.uint8 205 | 206 | gym.Wrapper.__init__(self, env) 207 | self.n_frames = n_frames 208 | self.frames = deque([], maxlen=n_frames) 209 | shp = env.observation_space.shape 210 | 211 | self.observation_space = spaces.Box( 212 | low=np.min(env.observation_space.low), 213 | high=np.max(env.observation_space.high), 214 | shape=(shp[0] * n_frames, shp[1], shp[2]), 215 | dtype=env.observation_space.dtype) 216 | 217 | def reset(self): 218 | obs = self.env.reset() 219 | for _ in range(self.n_frames): 220 | self.frames.append(obs) 221 | return self._get_ob() 222 | 223 | def step(self, action): 224 | obs, reward, done, info = self.env.step(action) 225 | self.frames.append(obs) 226 | return self._get_ob(), reward, done, info 227 | 228 | def _get_ob(self): 229 | assert len(self.frames) == self.n_frames 230 | return LazyFrames(list(self.frames)) 231 | 232 | 233 | class ScaledFloatFrame(gym.ObservationWrapper): 234 | def __init__(self, env): 235 | gym.ObservationWrapper.__init__(self, env) 236 | self.observation_space = spaces.Box( 237 | low=0, high=1.0, shape=env.observation_space.shape, 238 | dtype=np.float32) 239 | 240 | def observation(self, observation): 241 | # careful! This undoes the memory optimization, use 242 | # with smaller replay buffers only. 243 | return np.array(observation).astype(np.float32) / 255.0 244 | 245 | 246 | class LazyFrames(object): 247 | def __init__(self, frames): 248 | self._frames = frames 249 | self.dtype = frames[0].dtype 250 | 251 | def _force(self): 252 | return np.concatenate( 253 | np.array(self._frames, dtype=self.dtype), axis=0) 254 | 255 | def __array__(self, dtype=None): 256 | out = self._force() 257 | if dtype is not None: 258 | out = out.astype(dtype) 259 | return out 260 | 261 | def __len__(self): 262 | return len(self._force()) 263 | 264 | def __getitem__(self, i): 265 | return self._force()[i] 266 | 267 | 268 | def make_atari(env_id): 269 | """ 270 | Create a wrapped atari envrionment 271 | :param env_id: (str) the environment ID 272 | :return: (Gym Environment) the wrapped atari environment 273 | """ 274 | env = gym.make(env_id) 275 | assert 'NoFrameskip' in env.spec.id 276 | env = NoopResetEnv(env, noop_max=30) 277 | env = MaxAndSkipEnv(env, skip=4) 278 | return env 279 | 280 | 281 | def wrap_deepmind_pytorch(env, episode_life=True, clip_rewards=True, 282 | frame_stack=True, scale=False): 283 | """ 284 | Configure environment for DeepMind-style Atari. 
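`LazyFrames` above avoids copying the stacked observation: the individual frames are only concatenated when the object is converted to an array, so transitions stored in the replay buffer can share frame memory. A small usage sketch with dummy frames, assuming the `LazyFrames` class defined above is in scope:

```python
import numpy as np

# Four dummy (1, 84, 84) uint8 frames, as produced by WarpFramePyTorch above.
frames = [np.full((1, 84, 84), i, dtype=np.uint8) for i in range(4)]
obs = LazyFrames(frames)              # frames are referenced, not copied, here

print(len(obs))                       # 4 stacked channels
stacked = np.array(obs)               # concatenation happens only on conversion
print(stacked.shape, stacked.dtype)   # (4, 84, 84) uint8
```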
285 | :param env: (Gym Environment) the atari environment 286 | :param episode_life: (bool) wrap the episode life wrapper 287 | :param clip_rewards: (bool) wrap the reward clipping wrapper 288 | :param frame_stack: (bool) wrap the frame stacking wrapper 289 | :param scale: (bool) wrap the scaling observation wrapper 290 | :return: (Gym Environment) the wrapped atari environment 291 | """ 292 | if episode_life: 293 | env = EpisodicLifeEnv(env) 294 | if 'FIRE' in env.unwrapped.get_action_meanings(): 295 | env = FireResetEnv(env) 296 | env = WarpFramePyTorch(env) 297 | if clip_rewards: 298 | env = ClipRewardEnv(env) 299 | if scale: 300 | env = ScaledFloatFrame(env) 301 | if frame_stack: 302 | env = FrameStackPyTorch(env, 4) 303 | return env 304 | 305 | 306 | def make_env(env_id, episode_life=True, clip_rewards=True, 307 | frame_stack=True, scale=False): 308 | env = make_atari(env_id) 309 | env = wrap_deepmind_pytorch( 310 | env, episode_life, clip_rewards, frame_stack, scale) 311 | return env 312 | 313 | 314 | def wrap_monitor(env, log_dir): 315 | env = wrappers.Monitor( 316 | env, log_dir, video_callable=lambda x: True) 317 | return env 318 | 319 | 320 | class VectorEnv(object): 321 | def __init__(self, n, env_func, parallel_init=False, **kwargs): 322 | 323 | print('[{}] Wait for initializing environments'.format(str(datetime.datetime.now()))) 324 | self.pool = Pool(NUM_CORES) 325 | if parallel_init: 326 | init_func = lambda x: env_func(**kwargs) 327 | self.envs = self.pool.map(init_func, list(range(n))) 328 | else: 329 | self.envs = tuple(env_func(**kwargs) for _ in range(n)) 330 | self.return_state_format = 'list' 331 | print('[{}] Finish initializing environments'.format(str(datetime.datetime.now()))) 332 | 333 | def set_return_state_format(self, fmt): 334 | self.return_state_format = fmt 335 | 336 | def seed(self, seeds): 337 | 338 | seed_func = lambda args: args[0].seed(args[1]) 339 | self.pool.map(seed_func, list(zip(self.envs, seeds))) 340 | 341 | def reset(self): 342 | 343 | reset_func = lambda env: env.reset() 344 | states = self.pool.map(reset_func, self.envs) 345 | if self.return_state_format == 'array': 346 | states = np.array(states) 347 | return states 348 | 349 | def step(self, actions): 350 | 351 | def step_func(args): 352 | env, a = args 353 | observation, reward, done, info = env.step(a) 354 | if done: 355 | observation = env.reset() 356 | return observation, reward, done, info 357 | 358 | res = self.pool.map(step_func, list(zip(self.envs, actions))) 359 | states, rewards, dones, infos = list(zip(*res)) 360 | if self.return_state_format == 'array': 361 | states = np.array(states) 362 | return states, rewards, dones, infos 363 | 364 | def __del__(self): 365 | self.close() 366 | 367 | def close(self): 368 | self.pool.close() -------------------------------------------------------------------------------- /code/SAC-discrete/memory.py: -------------------------------------------------------------------------------- 1 | import operator 2 | from collections import deque 3 | import numpy as np 4 | import torch 5 | 6 | 7 | class MultiStepBuff: 8 | 9 | def __init__(self, maxlen=3): 10 | super(MultiStepBuff, self).__init__() 11 | self.maxlen = int(maxlen) 12 | self.reset() 13 | 14 | def append(self, state, action, reward): 15 | self.states.append(state) 16 | self.actions.append(action) 17 | self.rewards.append(reward) 18 | 19 | def get(self, gamma=0.99): 20 | assert len(self.rewards) > 0 21 | state = self.states.popleft() 22 | action = self.actions.popleft() 23 | reward = 
self._nstep_return(gamma) 24 | return state, action, reward 25 | 26 | def _nstep_return(self, gamma): 27 | r = np.sum([r * (gamma ** i) for i, r in enumerate(self.rewards)]) 28 | self.rewards.popleft() 29 | return r 30 | 31 | def reset(self): 32 | # Buffer to store n-step transitions. 33 | self.states = deque(maxlen=self.maxlen) 34 | self.actions = deque(maxlen=self.maxlen) 35 | self.rewards = deque(maxlen=self.maxlen) 36 | 37 | def is_empty(self): 38 | return len(self.rewards) == 0 39 | 40 | def is_full(self): 41 | return len(self.rewards) == self.maxlen 42 | 43 | def __len__(self): 44 | return len(self.rewards) 45 | 46 | 47 | class LazyMemory(dict): 48 | 49 | def __init__(self, capacity, state_shape, device, state_dtype): 50 | super(LazyMemory, self).__init__() 51 | self.capacity = int(capacity) 52 | self.state_shape = state_shape 53 | self.device = device 54 | self.reset() 55 | if state_dtype == 'float32': 56 | self.state_dtype = np.float32 57 | elif state_dtype == 'float64': 58 | self.state_dtype = np.float64 59 | elif state_dtype == 'uint8': 60 | self.state_dtype = np.uint8 61 | 62 | def reset(self): 63 | self['state'] = [] 64 | self['next_state'] = [] 65 | 66 | self['action'] = np.empty((self.capacity, 1), dtype=np.int64) 67 | self['reward'] = np.empty((self.capacity, 1), dtype=np.float32) 68 | self['done'] = np.empty((self.capacity, 1), dtype=np.float32) 69 | 70 | self._n = 0 71 | self._p = 0 72 | 73 | def append(self, state, action, reward, next_state, done, 74 | episode_done=None): 75 | self._append(state, action, reward, next_state, done) 76 | 77 | def _append(self, state, action, reward, next_state, done): 78 | self['state'].append(state) 79 | self['next_state'].append(next_state) 80 | self['action'][self._p] = action 81 | self['reward'][self._p] = reward 82 | self['done'][self._p] = done 83 | 84 | self._n = min(self._n + 1, self.capacity) 85 | self._p = (self._p + 1) % self.capacity 86 | 87 | self.truncate() 88 | 89 | def truncate(self): 90 | while len(self['state']) > self.capacity: 91 | del self['state'][0] 92 | del self['next_state'][0] 93 | 94 | def sample(self, batch_size): 95 | indices = np.random.randint(low=0, high=len(self), size=batch_size) 96 | return self._sample(indices, batch_size) 97 | 98 | def _sample(self, indices, batch_size): 99 | bias = -self._p if self._n == self.capacity else 0 100 | 101 | states = np.empty((batch_size, *self.state_shape), dtype=self.state_dtype) 102 | next_states = np.empty((batch_size, *self.state_shape), dtype=self.state_dtype) 103 | 104 | for i, index in enumerate(indices): 105 | _index = np.mod(index+bias, self.capacity) 106 | states[i, ...] = self['state'][_index] 107 | next_states[i, ...] = self['next_state'][_index] 108 | 109 | if self.state_dtype is np.float32: 110 | states = torch.FloatTensor(states).to(self.device) 111 | next_states = torch.FloatTensor(next_states).to(self.device) 112 | elif self.state_dtype is np.float64: 113 | states = torch.DoubleTensor(states).to(self.device) 114 | next_states = torch.DoubleTensor(next_states).to(self.device) 115 | elif self.state_dtype is np.uint8: 116 | states = torch.ByteTensor(states).to(self.device).float() / 255. 117 | next_states = torch.ByteTensor(next_states).to(self.device).float() / 255. 
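`MultiStepBuff` above holds the last `multi_step` transitions and, once full, emits the oldest state and action together with their discounted n-step reward. A short usage sketch with placeholder states and actions, assuming the class above is in scope:

```python
buff = MultiStepBuff(maxlen=3)
gamma = 0.99

for t, reward in enumerate([1.0, 0.0, 2.0, 0.5]):
    buff.append(state=t, action=0, reward=reward)   # states/actions are placeholders
    if buff.is_full():
        s, a, r = buff.get(gamma)
        print(s, a, r)
# step 2: state 0, n-step reward 1 + 0 + 0.99**2 * 2   ≈ 2.96
# step 3: state 1, n-step reward 0 + 0.99*2 + 0.99**2 * 0.5 ≈ 2.47
```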
118 | actions = torch.LongTensor(self['action'][indices]).to(self.device) 119 | rewards = torch.FloatTensor(self['reward'][indices]).to(self.device) 120 | dones = torch.FloatTensor(self['done'][indices]).to(self.device) 121 | 122 | return states, actions, rewards, next_states, dones 123 | 124 | def __len__(self): 125 | return self._n 126 | 127 | 128 | class LazyMultiStepMemory(LazyMemory): 129 | 130 | def __init__(self, capacity, state_shape, device, gamma=0.99, 131 | multi_step=3, state_dtype='float32'): 132 | super(LazyMultiStepMemory, self).__init__( 133 | capacity, state_shape, device, state_dtype) 134 | 135 | self.gamma = gamma 136 | self.multi_step = int(multi_step) 137 | if self.multi_step != 1: 138 | self.buff = MultiStepBuff(maxlen=self.multi_step) 139 | 140 | def append(self, state, action, reward, next_state, done): 141 | if self.multi_step != 1: 142 | self.buff.append(state, action, reward) 143 | 144 | if self.buff.is_full(): 145 | state, action, reward = self.buff.get(self.gamma) 146 | self._append(state, action, reward, next_state, done) 147 | 148 | if done: 149 | while not self.buff.is_empty(): 150 | state, action, reward = self.buff.get(self.gamma) 151 | self._append(state, action, reward, next_state, done) 152 | else: 153 | self._append(state, action, reward, next_state, done) 154 | 155 | 156 | class LazyPrioritizedMultiStepMemory(LazyMultiStepMemory): 157 | 158 | def __init__(self, capacity, state_shape, device, state_dtype='float32', 159 | gamma=0.99, multi_step=3, alpha=0.6, beta=0.4, beta_steps=2e5, 160 | min_pa=0.0, max_pa=1.0, eps=0.01): 161 | 162 | super().__init__(capacity, state_shape, device, gamma, 163 | multi_step, state_dtype) 164 | 165 | self.alpha = alpha 166 | self.beta = beta 167 | self.beta_diff = (1.0 - beta) / beta_steps 168 | self.min_pa = min_pa 169 | self.max_pa = max_pa 170 | self.eps = eps 171 | self._cached = None 172 | 173 | it_capacity = 1 174 | while it_capacity < capacity: 175 | it_capacity *= 2 176 | self.it_sum = SumTree(it_capacity) 177 | self.it_min = MinTree(it_capacity) 178 | 179 | def _pa(self, p): 180 | return np.clip((p + self.eps) ** self.alpha, self.min_pa, self.max_pa) 181 | 182 | def append(self, state, action, reward, next_state, done, p=None): 183 | # Calculate priority. 184 | if p is None: 185 | pa = self.max_pa 186 | else: 187 | pa = self._pa(p) 188 | 189 | if self.multi_step != 1: 190 | self.buff.append(state, action, reward) 191 | 192 | if self.buff.is_full(): 193 | state, action, reward = self.buff.get(self.gamma) 194 | self._append(state, action, reward, next_state, done, pa) 195 | 196 | if done: 197 | while not self.buff.is_empty(): 198 | state, action, reward = self.buff.get(self.gamma) 199 | self._append(state, action, reward, next_state, done, pa) 200 | else: 201 | self._append(state, action, reward, next_state, done, pa) 202 | 203 | def _append(self, state, action, reward, next_state, done, pa): 204 | # Store priority, which is done efficiently by SegmentTree. 205 | self.it_min[self._p] = pa 206 | self.it_sum[self._p] = pa 207 | super()._append(state, action, reward, next_state, done) 208 | 209 | def _sample_idxes(self, batch_size): 210 | total_pa = self.it_sum.sum(0, self._n) 211 | rands = np.random.rand(batch_size) * total_pa 212 | indices = [self.it_sum.find_prefixsum_idx(r) for r in rands] 213 | self.beta = min(1., self.beta + self.beta_diff) 214 | return indices 215 | 216 | def sample(self, batch_size): 217 | assert self._cached is None, 'Update priorities before sampling.' 
218 | 219 | self._cached = self._sample_idxes(batch_size) 220 | batch = self._sample(self._cached, batch_size) 221 | weights = self._calc_weights(self._cached) 222 | return batch, weights 223 | 224 | def _calc_weights(self, indices): 225 | min_pa = self.it_min.min() 226 | weights = [(self.it_sum[i] / min_pa) ** -self.beta for i in indices] 227 | return torch.FloatTensor(weights).to(self.device).view(-1, 1) 228 | 229 | def update_priority(self, errors): 230 | assert self._cached is not None 231 | 232 | ps = errors.detach().cpu().abs().numpy().flatten() 233 | pas = self._pa(ps) 234 | 235 | for index, pa in zip(self._cached, pas): 236 | assert 0 <= index < self._n 237 | assert 0 < pa 238 | self.it_sum[index] = pa 239 | self.it_min[index] = pa 240 | 241 | self._cached = None 242 | 243 | 244 | class SegmentTree(object): 245 | 246 | def __init__(self, size, op, init_val): 247 | assert size > 0 and size & (size - 1) == 0 248 | self._size = size 249 | self._op = op 250 | self._init_val = init_val 251 | self._values = [init_val for _ in range(2 * size)] 252 | 253 | def _reduce(self, start=0, end=None): 254 | if end is None: 255 | end = self._size 256 | elif end < 0: 257 | end += self._size 258 | 259 | start += self._size 260 | end += self._size 261 | 262 | res = self._init_val 263 | while start < end: 264 | if start & 1: 265 | res = self._op(res, self._values[start]) 266 | start += 1 267 | 268 | if end & 1: 269 | end -= 1 270 | res = self._op(res, self._values[end]) 271 | 272 | start //= 2 273 | end //= 2 274 | 275 | return res 276 | 277 | def __setitem__(self, idx, val): 278 | assert 0 <= idx < self._size 279 | 280 | # Set value. 281 | idx += self._size 282 | self._values[idx] = val 283 | 284 | # Update its ancestors iteratively. 285 | idx = idx >> 1 286 | while idx >= 1: 287 | left = 2 * idx 288 | self._values[idx] = \ 289 | self._op(self._values[left], self._values[left + 1]) 290 | idx = idx >> 1 291 | 292 | def __getitem__(self, idx): 293 | assert 0 <= idx < self._size 294 | return self._values[idx + self._size] 295 | 296 | 297 | class SumTree(SegmentTree): 298 | 299 | def __init__(self, size): 300 | super().__init__(size, operator.add, 0.0) 301 | 302 | def sum(self, start=0, end=None): 303 | return self._reduce(start, end) 304 | 305 | def find_prefixsum_idx(self, prefixsum): 306 | assert 0 <= prefixsum <= self.sum() + 1e-5 307 | idx = 1 308 | 309 | # Traverse to the leaf. 
310 | while idx < self._size: 311 | left = 2 * idx 312 | if self._values[left] > prefixsum: 313 | idx = left 314 | else: 315 | prefixsum -= self._values[left] 316 | idx = left + 1 317 | return idx - self._size 318 | 319 | 320 | class MinTree(SegmentTree): 321 | 322 | def __init__(self, size): 323 | super().__init__(size, min, float("inf")) 324 | 325 | def min(self, start=0, end=None): 326 | return self._reduce(start, end) -------------------------------------------------------------------------------- /code/SAC-discrete/utils.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import torch.nn as nn 3 | from torch import Tensor 4 | import torch.optim as opt 5 | from torch.optim import Adam 6 | from torch.autograd import Variable 7 | from torch.nn import functional as F 8 | from torch.distributions import Categorical 9 | from torch.utils.tensorboard import SummaryWriter 10 | 11 | from sklearn.base import TransformerMixin 12 | from sklearn.preprocessing import StandardScaler 13 | 14 | from multiprocessing.pool import ThreadPool as Pool 15 | import numpy as np 16 | import argparse 17 | 18 | 19 | def update_params(optim, loss, retain_graph=False): 20 | optim.zero_grad() 21 | loss.backward(retain_graph=retain_graph) 22 | optim.step() 23 | 24 | def disable_gradients(network): 25 | # Disable calculations of gradients. 26 | for param in network.parameters(): 27 | param.requires_grad = False 28 | 29 | def initialize_weights_he(m): 30 | if isinstance(m, nn.Linear) or isinstance(m, nn.Conv2d): 31 | torch.nn.init.kaiming_uniform_(m.weight) 32 | if m.bias is not None: 33 | torch.nn.init.constant_(m.bias, 0) 34 | 35 | 36 | class Struct: 37 | def __init__(self, **entries): 38 | self.__dict__.update(entries) 39 | 40 | 41 | class RunningMeanStats(object): 42 | """ 43 | An inefficient estimator for calculating the mean scaler over a given horizon 44 | """ 45 | def __init__(self, n=10): 46 | self.n = n 47 | self.stats = deque(maxlen=n) 48 | 49 | def append(self, x): 50 | self.stats.append(x) 51 | 52 | def get(self): 53 | return np.mean(self.stats) 54 | 55 | 56 | class Flatten(nn.Module): 57 | def forward(self, x): 58 | return x.view(x.size(0), -1) 59 | 60 | 61 | class RunningStat(object): 62 | def __init__(self): 63 | self._n = 0 64 | self._M = 0 65 | self._S = 0 66 | 67 | def push(self, x): 68 | self._n += 1 69 | if self._n == 1: 70 | self._M = x 71 | else: 72 | oldM = self._M if type(self._M) is float else self._M.copy() 73 | self._M = oldM + (x - oldM) / self._n 74 | self._S = self._S + (x - oldM) * (x - self._M) 75 | 76 | @property 77 | def n(self): 78 | return self._n 79 | 80 | @property 81 | def mean(self): 82 | return self._M 83 | 84 | @property 85 | def var(self): 86 | return self._S / (self._n - 1) if self._n > 1 else np.square(self._M) 87 | 88 | @property 89 | def std(self): 90 | return np.sqrt(self.var) 91 | 92 | @property 93 | def shape(self): 94 | return self._M.shape 95 | 96 | 97 | class ZFilter(object): 98 | """ 99 | y = (x-mean)/std 100 | using running estimates of mean, std 101 | """ 102 | def __init__(self, demean=True, destd=True, clip=10.0): 103 | self.demean = demean 104 | self.destd = destd 105 | self.clip = clip 106 | 107 | self.rs = RunningStat() 108 | 109 | def __call__(self, x, update=True): 110 | if update: self.rs.push(x) 111 | if self.demean: 112 | x = x - self.rs.mean 113 | if self.destd: 114 | x = x / (self.rs.std + 1e-8) 115 | if self.clip: 116 | x = np.clip(x, -self.clip, self.clip) 117 | return x 118 | 119 | def 
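`RunningStat` in `utils.py` above maintains a running mean and variance with Welford's online update. The self-contained sketch below uses the same update rule and checks it against NumPy; it is a reference implementation, not part of the project.

```python
import numpy as np

def welford(xs):
    """Online mean/std with the same update rule as RunningStat.push above."""
    n, mean, m2 = 0, 0.0, 0.0
    for x in xs:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
    var = m2 / (n - 1) if n > 1 else 0.0
    return mean, np.sqrt(var)

xs = np.random.randn(1000) * 2.0 + 5.0
print(welford(xs))
print(xs.mean(), xs.std(ddof=1))   # should agree up to floating-point error
```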
update(self, x): 120 | self.rs.push(x) 121 | 122 | 123 | class Memory(object): 124 | def __init__(self): 125 | self.memory = [] 126 | 127 | def push(self, *args): 128 | self.memory.append(Transition(*args)) 129 | 130 | def sample(self): 131 | return Transition(*zip(*self.memory)) 132 | 133 | def __len__(self): 134 | return len(self.memory) 135 | 136 | 137 | class NDStandardScaler(TransformerMixin): 138 | def __init__(self, **kwargs): 139 | self._scaler = StandardScaler(copy=True, **kwargs) 140 | self._orig_shape = None 141 | 142 | def fit(self, X, **kwargs): 143 | X = np.array(X) 144 | if len(X.shape) > 1: 145 | self._orig_shape = X.shape[1:] 146 | X = self._flatten(X) 147 | self._scaler.fit(X, **kwargs) 148 | return self 149 | 150 | def transform(self, X, **kwargs): 151 | X = np.array(X) 152 | X = self._flatten(X) 153 | X = self._scaler.transform(X, **kwargs) 154 | X = self._reshape(X) 155 | return X 156 | 157 | def _flatten(self, X): 158 | # Reshape X to <= 2 dimensions 159 | if len(X.shape) > 2: 160 | n_dims = np.prod(self._orig_shape) 161 | X = X.reshape(-1, n_dims) 162 | return X 163 | 164 | def _reshape(self, X): 165 | # Reshape X back to it's original shape 166 | if len(X.shape) >= 2: 167 | X = X.reshape(-1, *self._orig_shape) 168 | return X 169 | -------------------------------------------------------------------------------- /code/a2c.py: -------------------------------------------------------------------------------- 1 | """ 2 | Implementation of A2C 3 | ref: Mnih, Volodymyr, et al. "Asynchronous methods for deep reinforcement learning." ICML. 2016. 4 | This one follows appendix C of A3C paper (continuous action domain) 5 | 6 | NOTICE: 7 | `Tensor2` means 2D-Tensor (num_samples, num_dims) 8 | """ 9 | 10 | import gym 11 | import torch 12 | import torch.nn as nn 13 | import torch.optim as opt 14 | from torch import Tensor 15 | from torch.autograd import Variable 16 | from collections import namedtuple 17 | from itertools import count 18 | import scipy.optimize as sciopt 19 | import matplotlib 20 | matplotlib.use('agg') 21 | import matplotlib.pyplot as plt 22 | from os.path import join as joindir 23 | import pandas as pd 24 | import numpy as np 25 | import argparse 26 | import datetime 27 | import math 28 | 29 | 30 | EPS = 1e-10 31 | RESULT_DIR = '../result' 32 | 33 | 34 | class args(object): 35 | env_name = 'Hopper-v2' 36 | seed = 1234 37 | num_episode = 100 38 | max_step_per_round = 2000 39 | gamma = 0.995 40 | lamda = 0.97 41 | log_num_episode = 1 42 | loss_coeff_value = 1.0 43 | loss_coeff_entropy = 1e-4 44 | lr = 5e-5 45 | hidden_size = 200 46 | lstm_size = 128 47 | num_parallel_run = 5 48 | 49 | 50 | def add_arguments(): 51 | parser = argparse.ArgumentParser() 52 | parser.add_argument('--env_name', type=str, default='Hopper-v2') 53 | parser.add_argument('--seed', type=int, default=1234) 54 | parser.add_argument('--num_episode', type=int, default=1000) 55 | parser.add_argument('--max_step_per_round', type=int, default=2000) 56 | parser.add_argument('--gamma', type=float, default=0.995) 57 | parser.add_argument('--lamda', type=float, default=0.97) 58 | parser.add_argument('--log_num_episode', type=int, default=1) 59 | parser.add_argument('--loss_coeff_value', type=float, default=1.0) 60 | parser.add_argument('--loss_coeff_entropy', type=float, default=1e-4) 61 | parser.add_argument('--lr', type=float, default=5e-5) 62 | parser.add_argument('--hidden_size', type=int, default=200) 63 | parser.add_argument('--lstm_size', type=int, default=128) 64 | 
parser.add_argument('--num_parallel_run', type=int, default=5) 65 | 66 | args = parser.parse_args() 67 | return args 68 | 69 | class RunningStat(object): 70 | def __init__(self, shape): 71 | self._n = 0 72 | self._M = np.zeros(shape) 73 | self._S = np.zeros(shape) 74 | 75 | def push(self, x): 76 | x = np.asarray(x) 77 | assert x.shape == self._M.shape 78 | self._n += 1 79 | if self._n == 1: 80 | self._M[...] = x 81 | else: 82 | oldM = self._M.copy() 83 | self._M[...] = oldM + (x - oldM) / self._n 84 | self._S[...] = self._S + (x - oldM) * (x - self._M) 85 | 86 | @property 87 | def n(self): 88 | return self._n 89 | 90 | @property 91 | def mean(self): 92 | return self._M 93 | 94 | @property 95 | def var(self): 96 | return self._S / (self._n - 1) if self._n > 1 else np.square(self._M) 97 | 98 | @property 99 | def std(self): 100 | return np.sqrt(self.var) 101 | 102 | @property 103 | def shape(self): 104 | return self._M.shape 105 | 106 | 107 | class ZFilter: 108 | """ 109 | y = (x-mean)/std 110 | using running estimates of mean,std 111 | """ 112 | 113 | def __init__(self, shape, demean=True, destd=True, clip=10.0): 114 | self.demean = demean 115 | self.destd = destd 116 | self.clip = clip 117 | 118 | self.rs = RunningStat(shape) 119 | 120 | def __call__(self, x, update=True): 121 | if update: self.rs.push(x) 122 | if self.demean: 123 | x = x - self.rs.mean 124 | if self.destd: 125 | x = x / (self.rs.std + 1e-8) 126 | if self.clip: 127 | x = np.clip(x, -self.clip, self.clip) 128 | return x 129 | 130 | def output_shape(self, input_space): 131 | return input_space.shape 132 | 133 | 134 | class ActorCritic(nn.Module): 135 | def __init__(self, num_inputs, num_outputs): 136 | super(ActorCritic, self).__init__() 137 | 138 | self.actor_fc1 = nn.Linear(num_inputs, args.hidden_size) 139 | self.actor_fc2 = nn.LSTM(args.hidden_size, args.lstm_size) 140 | self.actor_mu = nn.Linear(args.lstm_size, num_outputs) 141 | self.actor_sig = nn.Linear(args.lstm_size, num_outputs) 142 | self.actor_sig_activation = nn.Softplus() 143 | 144 | self.critic_fc1 = nn.Linear(num_inputs, args.hidden_size) 145 | self.critic_fc2 = nn.LSTM(args.hidden_size, args.lstm_size) 146 | self.critic_fc3 = nn.Linear(args.lstm_size, 1) 147 | 148 | def forward(self, states, actor_hidden, critic_hidden): 149 | """ 150 | run policy network (actor) as well as value network (critic) 151 | :param states: a Tensor2 represents states, 2 tuple represents hidden states 152 | :return: 5 Tensor2 153 | """ 154 | action_mean, action_std, actor_newh = self._forward_actor(states, actor_hidden) 155 | critic_value, critic_newh = self._forward_critic(states, critic_hidden) 156 | return action_mean, action_std, actor_newh, critic_value, critic_newh 157 | 158 | def _forward_actor(self, states, hidden): 159 | x = torch.relu(self.actor_fc1(states)).unsqueeze(0) 160 | x, newhidden = self.actor_fc2(x, hidden) 161 | x = x.squeeze(0) 162 | action_mean = self.actor_mu(x) 163 | action_std = self.actor_sig_activation(self.actor_sig(x)) 164 | return action_mean, action_std, newhidden 165 | 166 | def _forward_critic(self, states, hidden): 167 | x = torch.relu(self.critic_fc1(states)).unsqueeze(0) 168 | x, newhidden = self.critic_fc2(x, hidden) 169 | x = x.squeeze(0) 170 | critic_value = self.critic_fc3(x) 171 | return critic_value, newhidden 172 | 173 | def select_action(self, action_mean, action_std, return_logproba=True): 174 | """ 175 | given mean and std, sample an action from normal(mean, std) 176 | also returns probability of the given chosen 177 | """ 178 | 
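The actor and critic above push one timestep at a time through `nn.LSTM`, which expects input of shape (seq_len, batch, features); the `unsqueeze(0)` / `squeeze(0)` calls add and remove the length-1 sequence dimension. A minimal shape sketch using the same sizes as the defaults above:

```python
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=200, hidden_size=128)        # hidden_size / lstm_size above
x = torch.randn(1, 200)                                # one state after actor_fc1 + relu
hidden = (torch.zeros(1, 1, 128), torch.zeros(1, 1, 128))

out, new_hidden = lstm(x.unsqueeze(0), hidden)         # input shape (seq=1, batch=1, 200)
out = out.squeeze(0)                                   # back to (1, 128)
print(out.shape, new_hidden[0].shape)                  # torch.Size([1, 128]), torch.Size([1, 1, 128])
```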
action_logstd = torch.log(action_std) 179 | action = torch.normal(action_mean, action_std) 180 | if return_logproba: 181 | logproba = self._normal_logproba(action, action_mean, action_logstd, action_std) 182 | return action, logproba 183 | else: 184 | return action 185 | 186 | @staticmethod 187 | def _normal_logproba(x, mean, logstd, std=None): 188 | if std is None: 189 | std = torch.exp(logstd) 190 | 191 | std_sq = std.pow(2) 192 | logproba = - 0.5 * math.log(2 * math.pi) - logstd - (x - mean).pow(2) / (2 * std_sq) 193 | return logproba.sum(1) 194 | 195 | def get_logproba(self, states, actions): 196 | """ 197 | return probability of chosen the given actions under corresponding states of current network 198 | :param states: Tensor 199 | :param actions: Tensor 200 | """ 201 | action_mean, action_logstd = self._forward_actor(states) 202 | logproba = self._normal_logproba(actions, action_mean, action_logstd) 203 | return logproba 204 | 205 | def a2c(args): 206 | env = gym.make(args.env_name) 207 | num_inputs = env.observation_space.shape[0] 208 | num_actions = env.action_space.shape[0] 209 | 210 | env.seed(args.seed) 211 | torch.manual_seed(args.seed) 212 | 213 | network = ActorCritic(num_inputs, num_actions) 214 | optimizer = opt.RMSprop(network.parameters(), lr=args.lr) 215 | running_state = ZFilter((num_inputs,), clip=5) 216 | 217 | # record average 1-round cumulative reward in every episode 218 | reward_record = [] 219 | num_steps = 0 220 | 221 | for i_episode in range(args.num_episode): 222 | # step1: perform current policy to collect trajectories 223 | # this is an on-policy method! 224 | state = env.reset() 225 | state = running_state(state) 226 | actor_hidden = (torch.zeros(1, 1, args.lstm_size), torch.zeros(1, 1, args.lstm_size)) 227 | critic_hidden = (torch.zeros(1, 1, args.lstm_size), torch.zeros(1, 1, args.lstm_size)) 228 | reward_sum = 0 229 | states = [] 230 | values = [] 231 | actions = [] 232 | action_stds = [] 233 | logprobas = [] 234 | next_states = [] 235 | rewards = [] 236 | for t in range(args.max_step_per_round): 237 | action_mean, action_std, actor_hidden, value, critic_hidden = \ 238 | network(Tensor(state).unsqueeze(0), actor_hidden, critic_hidden) 239 | action, logproba = network.select_action(action_mean, action_std) 240 | action = action.data.numpy()[0] 241 | next_state, reward, done, _ = env.step(action) 242 | reward_sum += reward 243 | next_state = running_state(next_state) 244 | mask = 0 if done else 1 245 | 246 | states.append(state) 247 | values.append(value) 248 | actions.append(action) 249 | action_stds.append(action_std) 250 | logprobas.append(logproba) 251 | next_states.append(next_state) 252 | rewards.append(reward) 253 | 254 | if done: 255 | break 256 | 257 | state = next_state 258 | 259 | values = torch.cat(values) 260 | action_stds = torch.cat(action_stds) 261 | logprobas = torch.cat(logprobas).unsqueeze(1) 262 | num_steps += (t + 1) 263 | 264 | reward_record.append({'steps': num_steps, 'reward': reward_sum}) 265 | 266 | # step2: extract variables from trajectories 267 | batch_size = len(rewards) 268 | prev_return = 0 269 | prev_value = 0 270 | prev_advantage = 0 271 | returns = Tensor(batch_size, 1) 272 | deltas = Tensor(batch_size, 1) 273 | advantages = Tensor(batch_size, 1) 274 | for i in reversed(range(batch_size)): 275 | returns[i] = rewards[i] + args.gamma * prev_return 276 | deltas[i] = rewards[i] + args.gamma * prev_value - values[i].data.numpy()[0] 277 | # ref: https://arxiv.org/pdf/1506.02438.pdf (generalization advantage estimate) 278 | 
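                # Note (added): this is generalized advantage estimation (GAE) in its
                # backward-recursive form. With the TD residual computed above,
                #     delta_t = r_t + gamma * V(s_{t+1}) - V(s_t),
                # the advantage satisfies A_t = delta_t + gamma * lambda * A_{t+1},
                # which equals the discounted sum  sum_l (gamma * lambda)^l * delta_{t+l}.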
advantages[i] = deltas[i] + args.gamma * args.lamda * prev_advantage 279 | 280 | prev_return = returns[i] 281 | prev_value = values[i].data.numpy()[0] 282 | prev_advantage = advantages[i] 283 | advantages = (advantages - advantages.mean()) / (advantages.std() + EPS) 284 | 285 | # step3: construct loss functions 286 | loss_policy = torch.mean(- logprobas * advantages) 287 | loss_value = torch.mean((values - returns).pow(2)) 288 | loss_entropy = torch.mean(- (torch.log(2 * math.pi * action_stds.pow(2)) + 1) / 2) 289 | loss = loss_policy + args.loss_coeff_value * loss_value + args.loss_coeff_entropy * loss_entropy 290 | 291 | # step4: do gradient update 292 | optimizer.zero_grad() 293 | loss.backward() 294 | optimizer.step() 295 | 296 | # step5: do logging 297 | if i_episode % args.log_num_episode == 0: 298 | print('Finished episode: {} Mean Reward: {:.4f} total_loss = {:.4f} = {:.4f} + {} * {:.4f} + {} * {:.4f}' \ 299 | .format(i_episode, reward_record[-1]['reward'], loss.data, loss_policy.data, args.loss_coeff_value, 300 | loss_value.data, args.loss_coeff_entropy, loss_entropy.data)) 301 | print('-----------------') 302 | 303 | return reward_record 304 | 305 | if __name__ == '__main__': 306 | datestr = datetime.datetime.now().strftime('%Y-%m-%d') 307 | args = add_arguments() 308 | 309 | record_dfs = pd.DataFrame(columns=['steps', 'reward']) 310 | reward_cols = [] 311 | for i in range(args.num_parallel_run): 312 | args.seed += 1 313 | reward_record = pd.DataFrame(a2c(args)) 314 | record_dfs = record_dfs.merge(reward_record, how='outer', on='steps', suffixes=('', '_{}'.format(i))) 315 | reward_cols.append('reward_{}'.format(i)) 316 | 317 | record_dfs = record_dfs.drop(columns='reward').sort_values(by='steps', ascending=True).ffill().bfill() 318 | record_dfs['reward_mean'] = record_dfs[reward_cols].mean(axis=1) 319 | record_dfs['reward_std'] = record_dfs[reward_cols].std(axis=1) 320 | record_dfs['reward_smooth'] = record_dfs['reward_mean'].ewm(span=20).mean() 321 | record_dfs.to_csv(joindir(RESULT_DIR, 'a2c-record-{}-{}.csv'.format(args.env_name, datestr))) 322 | 323 | # Plot 324 | plt.figure(figsize=(12, 6)) 325 | plt.plot(record_dfs['steps'], record_dfs['reward_mean'], label='trajory reward') 326 | plt.plot(record_dfs['steps'], record_dfs['reward_smooth'], label='smoothed reward') 327 | plt.fill_between(record_dfs['steps'], record_dfs['reward_mean'] - record_dfs['reward_std'], 328 | record_dfs['reward_mean'] + record_dfs['reward_std'], color='b', alpha=0.2) 329 | plt.legend() 330 | plt.xlabel('steps of env interaction (sample complexity)') 331 | plt.ylabel('average reward') 332 | plt.title('A2C on {}'.format(args.env_name)) 333 | plt.savefig(joindir(RESULT_DIR, 'a2c-{}-{}.pdf'.format(args.env_name, datestr))) 334 | 335 | 336 | -------------------------------------------------------------------------------- /code/ars.py: -------------------------------------------------------------------------------- 1 | """ 2 | Implementation of Augmented Random Search 3 | 4 | This is a derivative-free method, probably hard to solve high 5 | dimensional problems. Following the paper, we use linear and 6 | deterministic policy. 7 | 8 | There are four versions of ARS in the paper. Here, we implement 9 | V2-t, the version that seems to have the best performance. 10 | 11 | ref: 12 | 13 | Mania, Horia, Aurelia Guy, and Benjamin Recht. "Simple random 14 | search provides a competitive approach to reinforcement learning." 15 | arXiv preprint arXiv:1803.07055 (2018). 
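In brief (added summary of the update implemented below): each iteration
samples N random directions delta, evaluates the perturbed policies
M +/- nu * delta on states whitened by a running mean/std (the "V2" part),
keeps the b directions whose better-performing perturbation scored highest
(the "-t" part), and updates

    M <- M + alpha / (b * sigma_R) * sum_k [R(M + nu*delta_k) - R(M - nu*delta_k)] * delta_k

where sigma_R is the standard deviation of the returns collected from the
selected directions.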
16 | 17 | https://github.com/modestyachts/ARS 18 | """ 19 | 20 | from pathos.multiprocessing import ProcessingPool as Pool 21 | from collections import namedtuple 22 | import gym 23 | import matplotlib 24 | matplotlib.use('agg') 25 | import matplotlib.pyplot as plt 26 | from os.path import join as joindir 27 | from functools import partial 28 | import pandas as pd 29 | import numpy as np 30 | import argparse 31 | import datetime 32 | import math 33 | import ray 34 | 35 | 36 | Stats = namedtuple('Stats', ('n', 'mean', 'svar')) 37 | EPS = 1e-8 38 | RESULT_DIR = '../result' 39 | 40 | 41 | class args(object): 42 | env_name = 'Hopper-v2' 43 | seed = 1234 44 | num_episode = 100 45 | max_step_per_round = 200 46 | random_table_size = 100000000 47 | num_round_avg = 5 48 | 49 | N = 8 50 | alpha = 0.01 51 | nu = 0.03 52 | b = 4 53 | 54 | log_num_episode = 1 55 | num_parallel_run = 5 56 | 57 | 58 | def add_arguments(): 59 | parser = argparse.ArgumentParser() 60 | parser.add_argument('--env_name', type=str, default='Hopper-v2') 61 | parser.add_argument('--seed', type=int, default=1234) 62 | parser.add_argument('--num_episode', type=int, default=1000) 63 | parser.add_argument('--max_step_per_round', type=int, default=200) 64 | parser.add_argument('--random_table_size', type=int, default=100000000) 65 | parser.add_argument('--num_round_avg', type=int, default=5) 66 | 67 | parser.add_argument('--N', type=int, default=8) 68 | parser.add_argument('--alpha', type=float, default=0.01) 69 | parser.add_argument('--nu', type=float, default=0.03) 70 | parser.add_argument('--b', type=int, default=4) 71 | 72 | parser.add_argument('--log_num_episode', type=int, default=1) 73 | parser.add_argument('--num_parallel_run', type=int, default=5) 74 | 75 | args = parser.parse_args() 76 | return args 77 | 78 | 79 | class RunningStat(object): 80 | def __init__(self, shape): 81 | self._n = 0 82 | self._mean = np.zeros(shape) 83 | self._svar = np.zeros(shape) 84 | 85 | def push(self, x): 86 | x = np.asarray(x).astype(float) 87 | assert x.shape == self._mean.shape 88 | self._n += 1 89 | if self._n == 1: 90 | self._mean[...] = x 91 | else: 92 | mean_old = self._mean.copy() 93 | self._mean[...] = self._mean + (x - mean_old) / self._n 94 | self._svar[...] = self._svar + (x - mean_old) * (x - self._mean) 95 | 96 | def update(self, stat_list): 97 | """ 98 | ref: https://www.emathzone.com/tutorials/basic-statistics/combined-variance.html 99 | """ 100 | n_old = self._n 101 | mean_old = self._mean.copy() 102 | svar_old = self._svar.copy() 103 | 104 | self._n += np.sum([stat.n for stat in stat_list]) 105 | self._mean[...] = self._mean \ 106 | + np.sum([stat.n * (stat.mean - mean_old) / self._n for stat in stat_list], axis=0) 107 | self._svar[...] 
= self._svar + n_old * np.square(mean_old - self._mean) \ 108 | + np.sum([stat.svar + stat.n * np.square(stat.mean - self._mean) for stat in stat_list], axis=0) 109 | 110 | @property 111 | def stat(self): 112 | return Stats(n=self._n, mean=self._mean, svar=self._svar) 113 | 114 | @property 115 | def n(self): 116 | return self._n 117 | 118 | @property 119 | def mean(self): 120 | return self._mean 121 | 122 | @property 123 | def var(self): 124 | return self._svar / (self._n - 1) + EPS if self._n > 1 else np.ones(self._svar.shape) 125 | 126 | @property 127 | def std(self): 128 | return np.sqrt(self.var) 129 | 130 | @property 131 | def shape(self): 132 | return self._mean.shape 133 | 134 | 135 | @ray.remote 136 | class Worker(object): 137 | def __init__(self, M, mean, var, deltas, args): 138 | """ 139 | initialize the agent 140 | """ 141 | self.actor = NaiveActor(M, mean, var) 142 | self.deltas = deltas 143 | 144 | self.num_round_avg = args['num_round_avg'] 145 | self.max_step_per_round = args['max_step_per_round'] 146 | self.env_name = args['env_name'] 147 | self.env_seed = args['seed'] 148 | self.nu = args['nu'] 149 | self.delta_shape = M.shape 150 | self.delta_dim = np.prod(self.delta_shape) 151 | self.rg = np.random.RandomState(args['seed']) 152 | 153 | def sync_actor_params(self, M, mean, var): 154 | self.actor.sync_params(M, mean, var) 155 | 156 | def rollout(self): 157 | """ 158 | test the transition matrix M by process several rollouts 159 | :return: mean rewards, states statistics (n, mean, var) 160 | """ 161 | # generate delta 162 | delta_ind = self.rg.randint(0, len(self.deltas) - self.delta_dim + 1) 163 | delta = self.deltas[delta_ind:delta_ind + self.delta_dim].reshape(self.delta_shape) 164 | self.actor.set_delta(delta) 165 | 166 | running_stat = RunningStat((self.delta_shape[1], )) 167 | env = gym.make(self.env_name) 168 | env.seed(self.env_seed) 169 | reward_neg_pos = [] 170 | 171 | # generate rollouts with M +/- nu * delta 172 | for nu in [-self.nu, self.nu]: 173 | self.actor.set_nu(nu) 174 | reward_sum_record = [] 175 | for i_run in range(self.num_round_avg): 176 | done = False 177 | num_steps = 0 178 | reward_sum = 0 179 | state = env.reset() 180 | running_stat.push(state) 181 | while (not done) and (num_steps < self.max_step_per_round): 182 | action = self.actor.forward(state) 183 | state, reward, done, _ = env.step(action) 184 | reward_sum += reward 185 | running_stat.push(state) 186 | num_steps += 1 187 | reward_sum_record.append(reward_sum) 188 | reward_neg_pos.append(np.mean(reward_sum_record)) 189 | return (delta_ind, reward_neg_pos, running_stat.stat) 190 | 191 | 192 | class NaiveActor(object): 193 | def __init__(self, M, mean, var): 194 | """ 195 | :param M: the tested transition matrix, of dimension (dim_actions, dim_states) 196 | :param mean: mean of all previous states 197 | :param var: diagonal of covariance (element-wise variance) of all previous states 198 | """ 199 | self._M = M 200 | self._mean = mean 201 | self._var = var 202 | self._delta = None 203 | self._pert_M = None 204 | 205 | def set_delta(self, delta): 206 | self._delta = delta 207 | 208 | def set_nu(self, nu): 209 | self._pert_M = self._M + nu * self._delta 210 | 211 | def sync_params(self, M, mean, var): 212 | self._M = M 213 | self._mean = mean 214 | self._var = var 215 | 216 | def forward(self, states): 217 | """ 218 | given a states returns the action, where M = self._M + nu * deltas 219 | :param states: a np.ndarray represents states 220 | :return: the deterministic action 221 | """ 222 | 
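        """
        # Note (added): ARS "V2" whitening. The state is normalized by the running
        # mean and element-wise variance of all states seen so far, then mapped
        # through the perturbed matrix:
        #     a = (M + nu * delta) @ (s - mean) / sqrt(var)
        # which is exactly what the return statement below computes.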
return np.matmul(self._pert_M, (states - self._mean) / np.sqrt(self._var)) 223 | 224 | 225 | class Master(object): 226 | """ 227 | A linear policy actor master 228 | Each weight is drawn from independent Gaussian distribution 229 | """ 230 | def __init__(self, args): 231 | self.dim_states, self.dim_actions = self._get_dimensions(args.env_name) 232 | self.dim_M = self.dim_states * self.dim_actions 233 | 234 | self.M = ray.put(np.zeros((self.dim_actions, self.dim_states))) 235 | self.running_stat = RunningStat((self.dim_states, )) 236 | 237 | self.deltas = ray.put(np.random.RandomState(args.seed).randn(args.random_table_size).astype(np.float64)) 238 | 239 | worker_args = { 240 | 'num_round_avg': args.num_round_avg, 241 | 'max_step_per_round': args.max_step_per_round, 242 | 'env_name': args.env_name, 243 | 'seed': args.seed, 244 | 'nu': args.nu, 245 | } 246 | self.workers = [Worker.remote(self.M, self.running_stat.mean, 247 | self.running_stat.var, self.deltas, worker_args) for i in range(args.N)] 248 | 249 | self.args = args 250 | 251 | self.reward_record = [] 252 | 253 | def _get_dimensions(self, env_name): 254 | env = gym.make(env_name) 255 | return env.observation_space.shape[0], env.action_space.shape[0] 256 | 257 | def run(self): 258 | for i in range(self.args.num_episode): 259 | rollout_results = ray.get([w.rollout.remote() for w in self.workers]) 260 | 261 | rollout_ids = [res[0] for res in rollout_results] 262 | rollout_rewards = np.array([res[1] for res in rollout_results]) 263 | rollout_stats = [res[2] for res in rollout_results] 264 | 265 | # update master policy 266 | self.update(rollout_ids, rollout_rewards) 267 | 268 | # update master state mean and variance 269 | self.running_stat.update(rollout_stats) 270 | 271 | # sync master policy and state statistics to workers 272 | ray.get([w.sync_actor_params.remote(self.M, self.running_stat.mean, self.running_stat.var) \ 273 | for w in self.workers]) 274 | 275 | self.reward_record.append({'steps': self.running_stat.n, 'reward': rollout_rewards.mean()}) 276 | 277 | if i % self.args.log_num_episode == 0: 278 | print('Finished episode: {} steps: {} AvgReward: {:.4f}' \ 279 | .format(i, self.reward_record[-1]['steps'], self.reward_record[-1]['reward'])) 280 | print('-----------------') 281 | 282 | def update(self, rollout_ids, rollout_rewards): 283 | max_rollout_reward = rollout_rewards.max(axis=1) 284 | selected_ind = np.argsort(max_rollout_reward)[::-1][:self.args.b] 285 | sig_reward = rollout_rewards[selected_ind].reshape(-1).std() 286 | 287 | new_M = np.copy(ray.get(self.M)) 288 | deltas = ray.get(self.deltas) 289 | for i in selected_ind: 290 | delta = deltas[rollout_ids[i]:rollout_ids[i] + self.dim_M].reshape((self.dim_actions, self.dim_states)) 291 | new_M += self.args.alpha / self.args.b / sig_reward \ 292 | * (rollout_rewards[i, 1] - rollout_rewards[i, 0]) * delta 293 | self.M = ray.put(new_M) 294 | 295 | if __name__ == '__main__': 296 | ray.init() 297 | datestr = datetime.datetime.now().strftime('%Y-%m-%d') 298 | args = add_arguments() 299 | 300 | record_dfs = pd.DataFrame(columns=['steps', 'reward']) 301 | reward_cols = [] 302 | for i in range(args.num_parallel_run): 303 | args.seed += 1 304 | master = Master(args) 305 | master.run() 306 | reward_record = pd.DataFrame(master.reward_record) 307 | record_dfs = record_dfs.merge(reward_record, how='outer', on='steps', suffixes=('', '_{}'.format(i))) 308 | reward_cols.append('reward_{}'.format(i)) 309 | 310 | record_dfs = record_dfs.drop(columns='reward').sort_values(by='steps', 
ascending=True).ffill().bfill() 311 | record_dfs['reward_mean'] = record_dfs[reward_cols].mean(axis=1) 312 | record_dfs['reward_std'] = record_dfs[reward_cols].std(axis=1) 313 | record_dfs['reward_smooth'] = record_dfs['reward_mean'].ewm(span=1000).mean() 314 | record_dfs['reward_smooth_std'] = record_dfs['reward_std'].ewm(span=1000).mean() 315 | record_dfs.to_csv(joindir(RESULT_DIR, 'ars-record-{}-{}.csv'.format(args.env_name, datestr))) 316 | 317 | # Plot 318 | plt.figure(figsize=(12, 6)) 319 | plt.plot(record_dfs['steps'], record_dfs['reward_smooth'], label='reward') 320 | plt.fill_between(record_dfs['steps'], record_dfs['reward_smooth'] - record_dfs['reward_smooth_std'], 321 | record_dfs['reward_smooth'] + record_dfs['reward_smooth_std'], color='b', alpha=0.2) 322 | plt.legend() 323 | plt.xlabel('steps of env interaction (sample complexity)') 324 | plt.ylabel('average reward') 325 | plt.title('ARS on {}'.format(args.env_name)) 326 | plt.savefig(joindir(RESULT_DIR, 'ars-plot-{}-{}.pdf'.format(args.env_name, datestr))) 327 | -------------------------------------------------------------------------------- /code/ars_tune.py: -------------------------------------------------------------------------------- 1 | """ 2 | Implementation of Augmented Random Search 3 | 4 | This is a derivative-free method, probably hard to solve high 5 | dimensional problems. Following the paper, we use linear and 6 | deterministic policy. 7 | 8 | There are four versions of ARS in the paper. Here, we implement 9 | V2-t, the version that seems to have the best performance. 10 | 11 | ref: 12 | 13 | Mania, Horia, Aurelia Guy, and Benjamin Recht. "Simple random 14 | search provides a competitive approach to reinforcement learning." 15 | arXiv preprint arXiv:1803.07055 (2018). 16 | 17 | https://github.com/modestyachts/ARS 18 | 19 | Notice: 20 | This is a _tune version, which means that it finds an optimal 21 | configuration of hyperparameters (by running the algorithm 22 | multiple times) and run with this found configuration. 
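Added note on scoring: every hyperparameter given as a list in
`config_hopper` spans a grid; each grid point is run `num_trials` times
with random seeds, each run is scored by the average reward over the last
10% of episodes, and the configuration keeps mean(score) - std(score), so
stable settings are preferred over merely lucky ones.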
23 | """ 24 | 25 | from pathos.multiprocessing import ProcessingPool as Pool 26 | from collections import namedtuple 27 | import gym 28 | import matplotlib 29 | matplotlib.use('agg') 30 | import matplotlib.pyplot as plt 31 | from os.path import join as joindir 32 | from functools import partial 33 | from itertools import product 34 | import pandas as pd 35 | import numpy as np 36 | import argparse 37 | import datetime 38 | import json 39 | import math 40 | import ray 41 | import ray.tune as tune 42 | 43 | 44 | EPS = 1e-8 45 | RESULT_DIR = '../result' 46 | TUNE_DIR = '../tune' 47 | LOG_DIR = '../log' 48 | 49 | 50 | config_hopper = { 51 | 'env_name': 'Hopper-v2', 52 | 'seed': 'auto', 53 | 'num_episode': 1000, 54 | 'max_step_per_round': 200, 55 | 'random_table_size': 100000000, 56 | 'num_round_avg': 5, 57 | 'N': [8, 16, 32], 58 | 'alpha': [0.01, 0.02, 0.025], 59 | 'nu': [0.03, 0.025, 0.02, 0.01], 60 | 'b_ratio': [0.5, 1.0], 61 | # 'N': 32, 62 | # 'alpha': 0.01, 63 | # 'nu': [0.02, 0.01], 64 | # 'b_ratio': 0.5, 65 | 'num_trials': 5, 66 | } 67 | 68 | class Logger(object): 69 | def __init__(self, logfile='log.txt'): 70 | super(Logger, self).__init__() 71 | self.logfile = logfile 72 | 73 | def info(self, msg): 74 | timestr = datetime.datetime.now().strftime('%Y-%m-%d-%H-%M-%S') 75 | print('[info {}] {}'.format(timestr, msg)) 76 | with open(self.logfile, 'a+') as f: 77 | f.write('[info {}] {}\n'.format(timestr, msg)) 78 | 79 | class RunningStat(object): 80 | def __init__(self, shape): 81 | self._n = 0 82 | self._mean = np.zeros(shape) 83 | self._svar = np.zeros(shape) 84 | 85 | def push(self, x): 86 | x = np.asarray(x).astype(float) 87 | assert x.shape == self._mean.shape 88 | self._n += 1 89 | if self._n == 1: 90 | self._mean[...] = x 91 | else: 92 | mean_old = self._mean.copy() 93 | self._mean[...] = self._mean + (x - mean_old) / self._n 94 | self._svar[...] = self._svar + (x - mean_old) * (x - self._mean) 95 | 96 | def update(self, stat_list): 97 | """ 98 | ref: https://www.emathzone.com/tutorials/basic-statistics/combined-variance.html 99 | """ 100 | n_old = self._n 101 | mean_old = self._mean.copy() 102 | svar_old = self._svar.copy() 103 | 104 | self._n += np.sum([stat['n'] for stat in stat_list]) 105 | self._mean[...] = self._mean \ 106 | + np.sum([stat['n'] * (stat['mean'] - mean_old) / self._n for stat in stat_list], axis=0) 107 | self._svar[...] 
= self._svar + n_old * np.square(mean_old - self._mean) \ 108 | + np.sum([stat['svar'] + stat['n'] * np.square(stat['mean'] - self._mean) for stat in stat_list], axis=0) 109 | 110 | @property 111 | def stat(self): 112 | return {'n': self._n, 'mean': self._mean, 'svar': self._svar} 113 | 114 | @property 115 | def n(self): 116 | return self._n 117 | 118 | @property 119 | def mean(self): 120 | return self._mean 121 | 122 | @property 123 | def var(self): 124 | return self._svar / (self._n - 1) + EPS if self._n > 1 else np.ones(self._svar.shape) 125 | 126 | @property 127 | def std(self): 128 | return np.sqrt(self.var) 129 | 130 | @property 131 | def shape(self): 132 | return self._mean.shape 133 | 134 | 135 | @ray.remote 136 | class Worker(object): 137 | def __init__(self, M, mean, var, deltas, config): 138 | """ 139 | initialize the agent 140 | """ 141 | self.actor = NaiveActor(M, mean, var) 142 | self.deltas = deltas 143 | 144 | self.num_round_avg = config['num_round_avg'] 145 | self.max_step_per_round = config['max_step_per_round'] 146 | self.env_name = config['env_name'] 147 | self.env_seed = config['seed'] 148 | self.nu = config['nu'] 149 | self.delta_shape = M.shape 150 | self.delta_dim = np.prod(self.delta_shape) 151 | self.rg = np.random.RandomState(config['seed']) 152 | 153 | def sync_actor_params(self, M, mean, var): 154 | self.actor.sync_params(M, mean, var) 155 | 156 | def rollout(self): 157 | """ 158 | test the transition matrix M by process several rollouts 159 | :return: mean rewards, states statistics (n, mean, var) 160 | """ 161 | # generate delta 162 | delta_ind = self.rg.randint(0, len(self.deltas) - self.delta_dim + 1) 163 | delta = self.deltas[delta_ind:delta_ind + self.delta_dim].reshape(self.delta_shape) 164 | self.actor.set_delta(delta) 165 | 166 | running_stat = RunningStat((self.delta_shape[1], )) 167 | env = gym.make(self.env_name) 168 | env.seed(self.env_seed) 169 | reward_neg_pos = [] 170 | 171 | # generate rollouts with M +/- nu * delta 172 | for nu in [-self.nu, self.nu]: 173 | self.actor.set_nu(nu) 174 | reward_sum_record = [] 175 | for i_run in range(self.num_round_avg): 176 | done = False 177 | num_steps = 0 178 | reward_sum = 0 179 | state = env.reset() 180 | running_stat.push(state) 181 | while (not done) and (num_steps < self.max_step_per_round): 182 | action = self.actor.forward(state) 183 | state, reward, done, _ = env.step(action) 184 | reward_sum += reward 185 | running_stat.push(state) 186 | num_steps += 1 187 | reward_sum_record.append(reward_sum) 188 | reward_neg_pos.append(np.mean(reward_sum_record)) 189 | return (delta_ind, reward_neg_pos, running_stat.stat) 190 | 191 | 192 | class NaiveActor(object): 193 | def __init__(self, M, mean, var): 194 | """ 195 | :param M: the tested transition matrix, of dimension (dim_actions, dim_states) 196 | :param mean: mean of all previous states 197 | :param var: diagonal of covariance (element-wise variance) of all previous states 198 | """ 199 | self._M = M 200 | self._mean = mean 201 | self._var = var 202 | self._delta = None 203 | self._pert_M = None 204 | 205 | def set_delta(self, delta): 206 | self._delta = delta 207 | 208 | def set_nu(self, nu): 209 | self._pert_M = self._M + nu * self._delta 210 | 211 | def sync_params(self, M, mean, var): 212 | self._M = M 213 | self._mean = mean 214 | self._var = var 215 | 216 | def forward(self, states): 217 | """ 218 | given a states returns the action, where M = self._M + nu * deltas 219 | :param states: a np.ndarray represents states 220 | :return: the deterministic 
action 221 | """ 222 | return np.matmul(self._pert_M, (states - self._mean) / np.sqrt(self._var)) 223 | 224 | 225 | class Master(object): 226 | """ 227 | A linear policy actor master 228 | Each weight is drawn from independent Gaussian distribution 229 | """ 230 | def __init__(self, config, verbose=1): 231 | self.dim_states, self.dim_actions = self._get_dimensions(config['env_name']) 232 | self.dim_M = self.dim_states * self.dim_actions 233 | 234 | self.M = ray.put(np.zeros((self.dim_actions, self.dim_states))) 235 | self.running_stat = RunningStat((self.dim_states, )) 236 | 237 | self.deltas = ray.put(np.random.RandomState(config['seed']).randn(config['random_table_size']).astype(np.float64)) 238 | 239 | worker_config = { 240 | 'num_round_avg': config['num_round_avg'], 241 | 'max_step_per_round': config['max_step_per_round'], 242 | 'env_name': config['env_name'], 243 | 'seed': config['seed'], 244 | 'nu': config['nu'], 245 | } 246 | self.workers = [Worker.remote(self.M, self.running_stat.mean, 247 | self.running_stat.var, self.deltas, worker_config) for i in range(config['N'])] 248 | 249 | self.config = config 250 | self.config.update({'b': int(config['N'] * config['b_ratio'])}) 251 | 252 | self.reward_record = [] 253 | 254 | self.verbose = verbose 255 | 256 | def _get_dimensions(self, env_name): 257 | env = gym.make(env_name) 258 | return env.observation_space.shape[0], env.action_space.shape[0] 259 | 260 | def run(self): 261 | for i in range(self.config['num_episode']): 262 | rollout_results = ray.get([w.rollout.remote() for w in self.workers]) 263 | 264 | rollout_ids = [res[0] for res in rollout_results] 265 | rollout_rewards = np.array([res[1] for res in rollout_results]) 266 | rollout_stats = [res[2] for res in rollout_results] 267 | 268 | # update master policy 269 | self.update(rollout_ids, rollout_rewards) 270 | 271 | # update master state mean and variance 272 | self.running_stat.update(rollout_stats) 273 | 274 | # sync master policy and state statistics to workers 275 | ray.get([w.sync_actor_params.remote(self.M, self.running_stat.mean, self.running_stat.var) \ 276 | for w in self.workers]) 277 | 278 | self.reward_record.append({'steps': self.running_stat.n, 'reward': rollout_rewards.mean()}) 279 | 280 | if self.verbose >= 1: 281 | logger.info('Finished episode: {} steps: {} AvgReward: {:.4f}' \ 282 | .format(i, self.reward_record[-1]['steps'], self.reward_record[-1]['reward'])) 283 | logger.info('-----------------') 284 | 285 | def update(self, rollout_ids, rollout_rewards): 286 | max_rollout_reward = rollout_rewards.max(axis=1) 287 | selected_ind = np.argsort(max_rollout_reward)[::-1][:self.config['b']] 288 | sig_reward = rollout_rewards[selected_ind].reshape(-1).std() 289 | 290 | new_M = np.copy(ray.get(self.M)) 291 | deltas = ray.get(self.deltas) 292 | for i in selected_ind: 293 | delta = deltas[rollout_ids[i]:rollout_ids[i] + self.dim_M].reshape((self.dim_actions, self.dim_states)) 294 | new_M += self.config['alpha'] / self.config['b'] / sig_reward \ 295 | * (rollout_rewards[i, 1] - rollout_rewards[i, 0]) * delta 296 | self.M = ray.put(new_M) 297 | 298 | def grid_search(func, config): 299 | auto_seed = False 300 | if 'seed' in config and config['seed'] == 'auto': 301 | auto_seed = True 302 | if 'num_trials' in config: 303 | num_trials = config['num_trials'] 304 | else: 305 | num_trials = 1 306 | list_elements = [config[d] for d in config if type(config[d]) is list] 307 | list_names = [d for d in config if type(config[d]) is list] 308 | trials = [] 309 | for values in 
product(*list_elements): 310 | config.update({name: val for val, name in zip(values, list_names)}) 311 | logger.info('========try new config========') 312 | logger.info('config: {}'.format(config)) 313 | scores = [] 314 | for i in range(num_trials): 315 | try_config = config.copy() 316 | if auto_seed: 317 | try_config['seed'] = np.random.randint(1000) 318 | scores.append(func(try_config)) 319 | trials.append({'config': try_config, 'score': np.mean(scores) - np.std(scores)}) 320 | logger.info('score: {} (+/- {})'.format(np.mean(scores), np.std(scores))) 321 | return trials 322 | 323 | def run_ars(config): 324 | master = Master(config, verbose=0) 325 | master.run() 326 | num_last_episodes = int(config['num_episode'] * 0.1) 327 | score = np.mean([x['reward'] for x in master.reward_record[-num_last_episodes:]]) 328 | return score 329 | 330 | def run_single_and_plot(config): 331 | record_dfs = pd.DataFrame(columns=['steps', 'reward']) 332 | reward_cols = [] 333 | for i in range(config['num_trials']): 334 | config['seed'] = np.random.randint(1000) 335 | master = Master(config) 336 | master.run() 337 | reward_record = pd.DataFrame(master.reward_record) 338 | record_dfs = record_dfs.merge(reward_record, how='outer', on='steps', suffixes=('', '_{}'.format(i))) 339 | reward_cols.append('reward_{}'.format(i)) 340 | 341 | record_dfs = record_dfs.drop(columns='reward').sort_values(by='steps', ascending=True).ffill().bfill() 342 | record_dfs['reward_mean'] = record_dfs[reward_cols].mean(axis=1) 343 | record_dfs['reward_std'] = record_dfs[reward_cols].std(axis=1) 344 | record_dfs['reward_smooth'] = record_dfs['reward_mean'].ewm(span=1000).mean() 345 | record_dfs['reward_smooth_std'] = record_dfs['reward_std'].ewm(span=1000).mean() 346 | record_dfs.to_csv(joindir(TUNE_DIR, 'ARS-record-{}.csv'.format(config['env_name']))) 347 | 348 | # Plot 349 | plt.figure(figsize=(12, 6)) 350 | plt.plot(record_dfs['steps'], record_dfs['reward_smooth'], label='reward') 351 | plt.fill_between(record_dfs['steps'], record_dfs['reward_smooth'] - record_dfs['reward_smooth_std'], 352 | record_dfs['reward_smooth'] + record_dfs['reward_smooth_std'], color='b', alpha=0.2) 353 | plt.legend() 354 | plt.xlabel('steps of env interaction (sample complexity)') 355 | plt.ylabel('average reward') 356 | plt.title('ARS on {}'.format(config['env_name'])) 357 | plt.savefig(joindir(TUNE_DIR, 'ARS-plot-{}.pdf'.format(config['env_name']))) 358 | 359 | if __name__ == '__main__': 360 | logger = Logger(joindir(LOG_DIR, 'log_ars.txt')) 361 | ray.init() 362 | 363 | trials = grid_search(run_ars, config_hopper) 364 | 365 | best_trial = sorted(trials, key=lambda x: x['score'])[-1] 366 | best_config = best_trial['config'] 367 | best_score = best_trial['score'] 368 | 369 | with open(joindir(TUNE_DIR, 'ARS-{}.json'.format(best_config['env_name'])), 'w') as f: 370 | json.dump(best_config, f, indent=4, sort_keys=True) 371 | 372 | logger.info('========best solution found========') 373 | logger.info('best score: {}'.format(best_score)) 374 | logger.info('best config: {}'.format(best_config)) 375 | 376 | run_single_and_plot(best_config) 377 | 378 | -------------------------------------------------------------------------------- /code/cem.py: -------------------------------------------------------------------------------- 1 | """ 2 | Implementation of Cross Entropy Method 3 | 4 | This is a derivative-free method. It seems hard to solve high 5 | dimensional problems. Here, we use linear and deterministic 6 | policy. 
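Added summary of the update implemented below: each generation samples
`num_samples` weight matrices from a factorized Gaussian N(mu, diag(sig^2)),
scores each by the average return of several rollouts, keeps the top
`best_ratio` fraction, and refits

    mu  <- mean(elite weights)
    sig <- sqrt(var(elite weights) + const_noise_sig2)

where the constant extra noise is the trick from Szita & Lorincz (2006)
that keeps the search distribution from collapsing too early.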
7 | 8 | ref: 9 | Szita, István, and András Lörincz. "Learning Tetris using the 10 | noisy cross-entropy method." Neural computation 18.12 (2006): 11 | 2936-2941. 12 | """ 13 | 14 | from pathos.multiprocessing import ProcessingPool as Pool 15 | import gym 16 | import matplotlib 17 | matplotlib.use('agg') 18 | import matplotlib.pyplot as plt 19 | from os.path import join as joindir 20 | from functools import partial 21 | import pandas as pd 22 | import numpy as np 23 | import argparse 24 | import datetime 25 | import math 26 | 27 | 28 | EPS = 1e-10 29 | RESULT_DIR = '../result' 30 | 31 | 32 | class args(object): 33 | env_name = 'Hopper-v2' 34 | seed = 1234 35 | num_episode = 100 36 | max_step_per_round = 200 37 | log_num_episode = 1 38 | num_parallel_run = 5 39 | init_sig = 10.0 40 | const_noise_sig2 = 4.0 41 | num_samples = 100 42 | best_ratio = 0.1 43 | num_round_avg = 30 44 | num_cores = 10 45 | 46 | 47 | def add_arguments(): 48 | parser = argparse.ArgumentParser() 49 | parser.add_argument('--env_name', type=str, default='Hopper-v2') 50 | parser.add_argument('--seed', type=int, default=1234) 51 | parser.add_argument('--num_episode', type=int, default=1000) 52 | parser.add_argument('--max_step_per_round', type=int, default=200) 53 | parser.add_argument('--log_num_episode', type=int, default=1) 54 | parser.add_argument('--num_parallel_run', type=int, default=5) 55 | parser.add_argument('--init_sig', type=float, default=10.0) 56 | parser.add_argument('--const_noise_sig2', type=float, default=4.0) 57 | parser.add_argument('--num_samples', type=int, default=100) 58 | parser.add_argument('--best_ratio', type=float, default=0.1) 59 | parser.add_argument('--num_round_avg', type=int, default=30) 60 | parser.add_argument('--num_cores', type=int, default=10) 61 | 62 | args = parser.parse_args() 63 | return args 64 | 65 | class RunningStat(object): 66 | def __init__(self, shape): 67 | self._n = 0 68 | self._M = np.zeros(shape) 69 | self._S = np.zeros(shape) 70 | 71 | def push(self, x): 72 | x = np.asarray(x) 73 | assert x.shape == self._M.shape 74 | self._n += 1 75 | if self._n == 1: 76 | self._M[...] = x 77 | else: 78 | oldM = self._M.copy() 79 | self._M[...] = oldM + (x - oldM) / self._n 80 | self._S[...] 
= self._S + (x - oldM) * (x - self._M) 81 | 82 | @property 83 | def n(self): 84 | return self._n 85 | 86 | @property 87 | def mean(self): 88 | return self._M 89 | 90 | @property 91 | def var(self): 92 | return self._S / (self._n - 1) if self._n > 1 else np.square(self._M) 93 | 94 | @property 95 | def std(self): 96 | return np.sqrt(self.var) 97 | 98 | @property 99 | def shape(self): 100 | return self._M.shape 101 | 102 | 103 | class ZFilter: 104 | """ 105 | y = (x-mean)/std 106 | using running estimates of mean,std 107 | """ 108 | 109 | def __init__(self, shape, demean=True, destd=True, clip=10.0): 110 | self.demean = demean 111 | self.destd = destd 112 | self.clip = clip 113 | 114 | self.rs = RunningStat(shape) 115 | 116 | def __call__(self, x, update=True): 117 | if update: self.rs.push(x) 118 | if self.demean: 119 | x = x - self.rs.mean 120 | if self.destd: 121 | x = x / (self.rs.std + 1e-8) 122 | if self.clip: 123 | x = np.clip(x, -self.clip, self.clip) 124 | return x 125 | 126 | def output_shape(self, input_space): 127 | return input_space.shape 128 | 129 | 130 | class Agent(object): 131 | def __init__(self, M): 132 | self.actor = NaiveActor(M) 133 | 134 | def run(self, num_round_avg, env_name, env_seed): 135 | env = gym.make(env_name) 136 | env.seed(env_seed) 137 | total_steps = 0 138 | reward_sum_record = [] 139 | for i_run in range(num_round_avg): 140 | done = False 141 | num_steps = 0 142 | reward_sum = 0 143 | state = env.reset() 144 | # state = running_state(state) 145 | while (not done) and (num_steps < args.max_step_per_round): 146 | action = self.actor.forward(state) 147 | state, reward, done, _ = env.step(action) 148 | reward_sum += reward 149 | # state = running_state(state) 150 | num_steps += 1 151 | total_steps += num_steps 152 | reward_sum_record.append(reward_sum) 153 | return (total_steps, np.mean(reward_sum_record)) 154 | 155 | 156 | class NaiveActor(object): 157 | def __init__(self, M): 158 | self._M = M 159 | 160 | def forward(self, states): 161 | """ 162 | given a states returns the action 163 | :param states: a np.ndarray represents states 164 | :return: the deterministic action 165 | """ 166 | return np.matmul(states, self._M) 167 | 168 | 169 | class Actor(object): 170 | """ 171 | A linear policy actor 172 | Each weight is drawn from independent Gaussian distribution 173 | """ 174 | def __init__(self, dim_states, dim_actions): 175 | self.shape = (dim_states, dim_actions) 176 | self._mu = np.zeros(self.shape) 177 | self._sig = np.ones(np.prod(self.shape)) * args.init_sig 178 | 179 | def sample(self): 180 | """ 181 | give one sample of transition matrix self._M and set to itself 182 | """ 183 | M = np.random.normal(self._mu.reshape(-1), self._sig).reshape(self.shape) 184 | return M 185 | 186 | def update(self, weights): 187 | """ 188 | given the selected good samples of weights, update according 189 | to CEM formula 190 | :param weights: list of weights, each is the same size of self._M 191 | """ 192 | self._mu = np.mean(weights, axis=0) 193 | self._sig = np.sqrt(np.array([np.square((w - self._mu).reshape(-1)) for w in weights]).mean(axis=0) \ 194 | + args.const_noise_sig2) 195 | 196 | 197 | def get_score_of_weight(M): 198 | agent = Agent(M) 199 | return agent.run(args.num_round_avg, args.env_name, args.seed) 200 | 201 | 202 | def cem(): 203 | env = gym.make(args.env_name) 204 | dim_states = env.observation_space.shape[0] 205 | dim_actions = env.action_space.shape[0] 206 | del env 207 | p = Pool(args.num_cores) 208 | 209 | actor = Actor(dim_states, dim_actions) 210 
| # running_state = ZFilter((dim_states,), clip=5) 211 | 212 | reward_record = [] 213 | global_steps = 0 214 | num_top_samples = int(max(1, np.floor(args.num_samples * args.best_ratio))) 215 | 216 | for i_episode in range(args.num_episode): 217 | 218 | # sample several weights and perform multiple times each 219 | weights = [] 220 | scores = [] 221 | for i_sample in range(args.num_samples): 222 | weights.append(actor.sample()) 223 | 224 | res = p.map(get_score_of_weight, weights) 225 | scores = [score for _, score in res] 226 | steps = [step for step, _ in res] 227 | 228 | global_steps += np.sum(steps) 229 | reward_record.append({'steps': global_steps, 'reward': np.mean(scores)}) 230 | 231 | # sort weights according to scores in decreasing order 232 | # ref: https://stackoverflow.com/questions/6618515/sorting-list-based-on-values-from-another-list 233 | selected_weights = [x for _, x in sorted(zip(scores, weights), reverse=True)][:num_top_samples] 234 | actor.update(selected_weights) 235 | 236 | if i_episode % args.log_num_episode == 0: 237 | print('Finished episode: {} steps: {} AvgReward: {:.4f}' \ 238 | .format(i_episode, reward_record[-1]['steps'], reward_record[-1]['reward'])) 239 | print('-----------------') 240 | 241 | return reward_record 242 | 243 | if __name__ == '__main__': 244 | datestr = datetime.datetime.now().strftime('%Y-%m-%d') 245 | args = add_arguments() 246 | 247 | record_dfs = pd.DataFrame(columns=['steps', 'reward']) 248 | reward_cols = [] 249 | for i in range(args.num_parallel_run): 250 | args.seed += 1 251 | reward_record = pd.DataFrame(cem()) 252 | record_dfs = record_dfs.merge(reward_record, how='outer', on='steps', suffixes=('', '_{}'.format(i))) 253 | reward_cols.append('reward_{}'.format(i)) 254 | 255 | record_dfs = record_dfs.drop(columns='reward').sort_values(by='steps', ascending=True).ffill().bfill() 256 | record_dfs['reward_mean'] = record_dfs[reward_cols].mean(axis=1) 257 | record_dfs['reward_std'] = record_dfs[reward_cols].std(axis=1) 258 | record_dfs['reward_smooth'] = record_dfs['reward_mean'].ewm(span=1000).mean() 259 | record_dfs['reward_smooth_std'] = record_dfs['reward_std'].ewm(span=1000).mean() 260 | record_dfs.to_csv(joindir(RESULT_DIR, 'cem-record-{}-{}.csv'.format(args.env_name, datestr))) 261 | 262 | # Plot 263 | plt.figure(figsize=(12, 6)) 264 | plt.plot(record_dfs['steps'], record_dfs['reward_smooth'], label='reward') 265 | plt.fill_between(record_dfs['steps'], record_dfs['reward_smooth'] - record_dfs['reward_smooth_std'], 266 | record_dfs['reward_smooth'] + record_dfs['reward_smooth_std'], color='b', alpha=0.2) 267 | plt.legend() 268 | plt.xlabel('steps of env interaction (sample complexity)') 269 | plt.ylabel('average reward') 270 | plt.title('CEM on {}'.format(args.env_name)) 271 | plt.savefig(joindir(RESULT_DIR, 'cem-plot-{}-{}.pdf'.format(args.env_name, datestr))) 272 | -------------------------------------------------------------------------------- /code/cem_tune.py: -------------------------------------------------------------------------------- 1 | """ 2 | Implementation of Cross Entropy Method 3 | 4 | This is a derivative-free method. It seems hard to solve high 5 | dimensional problems. Here, we use linear and deterministic 6 | policy. 7 | 8 | ref: 9 | Szita, Istvan, and Andras Lorincz. "Learning Tetris using the 10 | noisy cross-entropy method." Neural computation 18.12 (2006): 11 | 2936-2941. 
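Added note: compared with cem.py, the sampling and evaluation logic is
reorganized into Master and Worker classes and all hyperparameters are
passed through a config dict, so that grid_search below can sweep over the
list-valued entries of `config_hopper`.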
12 | 13 | Notice: 14 | This is a _tune version, which means that it finds an optimal 15 | configuration of hyperparameters (by running the algorithm 16 | multiple times) and run with this found configuration. 17 | """ 18 | 19 | from pathos.multiprocessing import ProcessingPool as Pool 20 | import gym 21 | import matplotlib 22 | matplotlib.use('agg') 23 | import matplotlib.pyplot as plt 24 | from os.path import join as joindir 25 | from functools import partial 26 | import pandas as pd 27 | import numpy as np 28 | from itertools import product 29 | import argparse 30 | import datetime 31 | import math 32 | 33 | 34 | EPS = 1e-10 35 | RESULT_DIR = '../result' 36 | TUNE_DIR = '../tune' 37 | LOG_DIR = '../log' 38 | 39 | 40 | config_hopper = { 41 | 'env_name': 'Hopper-v2', 42 | 'seed': 'auto', 43 | 'num_episode': 100, 44 | 'max_step_per_round': 200, 45 | 'init_sig': [1.0, 10.0], 46 | 'const_noise_sig2': [0.0, 4.0], 47 | 'num_samples': [50, 100], 48 | 'best_ratio': [0.1, 0.2], 49 | 'num_round_avg': 30, 50 | 'num_cores': 10, 51 | 'num_trials': 5, 52 | } 53 | 54 | 55 | class Logger(object): 56 | def __init__(self, logfile='log.txt'): 57 | super(Logger, self).__init__() 58 | self.logfile = logfile 59 | 60 | def info(self, msg): 61 | timestr = datetime.datetime.now().strftime('%Y-%m-%d-%H-%M-%S') 62 | print('[info {}] {}'.format(timestr, msg)) 63 | with open(self.logfile, 'a+') as f: 64 | f.write('[info {}] {}\n'.format(timestr, msg)) 65 | 66 | 67 | class Worker(object): 68 | def __init__(self, M, config): 69 | self.actor = NaiveActor(M) 70 | self.num_round_avg = config['num_round_avg'] 71 | self.env_name = config['env_name'] 72 | self.env_seed = config['seed'] 73 | self.max_step_per_round = config['max_step_per_round'] 74 | 75 | def rollout(self): 76 | env = gym.make(self.env_name) 77 | env.seed(self.env_seed) 78 | total_steps = 0 79 | reward_sum_record = [] 80 | for i_run in range(self.num_round_avg): 81 | done = False 82 | num_steps = 0 83 | reward_sum = 0 84 | state = env.reset() 85 | while (not done) and (num_steps < self.max_step_per_round): 86 | action = self.actor.forward(state) 87 | state, reward, done, _ = env.step(action) 88 | reward_sum += reward 89 | num_steps += 1 90 | total_steps += num_steps 91 | reward_sum_record.append(reward_sum) 92 | return (total_steps, np.mean(reward_sum_record)) 93 | 94 | 95 | class NaiveActor(object): 96 | def __init__(self, M): 97 | self._M = M 98 | 99 | def forward(self, states): 100 | """ 101 | given a states returns the action 102 | :param states: a np.ndarray represents states 103 | :return: the deterministic action 104 | """ 105 | return np.matmul(states, self._M) 106 | 107 | 108 | class Master(object): 109 | """ 110 | A linear policy actor 111 | Each weight is drawn from independent Gaussian distribution 112 | """ 113 | def __init__(self, config, verbose=1): 114 | self.dim_states, self.dim_actions = self._get_dimensions(config['env_name']) 115 | self.shape = (self.dim_states, self.dim_actions) 116 | self._mu = np.zeros(self.shape) 117 | self._sig = np.ones(np.prod(self.shape)) * config['init_sig'] 118 | 119 | self.verbose = verbose 120 | self.config = config 121 | self.reward_record = [] 122 | self.pool = Pool(config['num_cores']) 123 | 124 | def run(self): 125 | global_steps = 0 126 | num_top_samples = int(max(1, np.floor(self.config['num_samples'] * self.config['best_ratio']))) 127 | 128 | for i_episode in range(self.config['num_episode']): 129 | 130 | # sample several weights and perform multiple times each 131 | weights = [] 132 | scores = [] 133 | 
for i_sample in range(self.config['num_samples']): 134 | weights.append(self.sample()) 135 | 136 | res = self.pool.map(self._get_score_of_weight, weights) 137 | scores = [score for _, score in res] 138 | steps = [step for step, _ in res] 139 | 140 | global_steps += np.sum(steps) 141 | self.reward_record.append({'steps': global_steps, 'reward': np.mean(scores)}) 142 | 143 | # sort weights according to scores in decreasing order 144 | # ref: https://stackoverflow.com/questions/6618515/sorting-list-based-on-values-from-another-list 145 | selected_weights = [x for _, x in sorted(zip(scores, weights), reverse=True)][:num_top_samples] 146 | self.update(selected_weights) 147 | 148 | if self.verbose >= 1: 149 | logger.info('Finished episode: {} steps: {} AvgReward: {:.4f}' \ 150 | .format(i_episode, self.reward_record[-1]['steps'], self.reward_record[-1]['reward'])) 151 | logger.info('-----------------') 152 | 153 | def sample(self): 154 | """ 155 | give one sample of transition matrix self._M and set to itself 156 | """ 157 | M = np.random.normal(self._mu.reshape(-1), self._sig).reshape(self.shape) 158 | return M 159 | 160 | def update(self, weights): 161 | """ 162 | given the selected good samples of weights, update according 163 | to CEM formula 164 | :param weights: list of weights, each is the same size of self._M 165 | """ 166 | self._mu = np.mean(weights, axis=0) 167 | self._sig = np.sqrt(np.array([np.square((w - self._mu).reshape(-1)) for w in weights]).mean(axis=0) \ 168 | + self.config['const_noise_sig2']) 169 | 170 | def _get_dimensions(self, env_name): 171 | env = gym.make(env_name) 172 | return env.observation_space.shape[0], env.action_space.shape[0] 173 | 174 | def _get_score_of_weight(self, M): 175 | return Worker(M, self.config).rollout() 176 | 177 | def run_cem(config): 178 | master = Master(config, verbose=0) 179 | master.run() 180 | num_last_episodes = int(config['num_episode'] * 0.1) 181 | score = np.mean([x['reward'] for x in master.reward_record[-num_last_episodes:]]) 182 | return score 183 | 184 | def grid_search(func, config): 185 | auto_seed = False 186 | if 'seed' in config and config['seed'] == 'auto': 187 | auto_seed = True 188 | if 'num_trials' in config: 189 | num_trials = config['num_trials'] 190 | else: 191 | num_trials = 1 192 | list_elements = [config[d] for d in config if type(config[d]) is list] 193 | list_names = [d for d in config if type(config[d]) is list] 194 | trials = [] 195 | for values in product(*list_elements): 196 | config.update({name: val for val, name in zip(values, list_names)}) 197 | logger.info('========try new config========') 198 | logger.info('config: {}'.format(config)) 199 | scores = [] 200 | for i in range(num_trials): 201 | try_config = config.copy() 202 | if auto_seed: 203 | try_config['seed'] = np.random.randint(1000) 204 | scores.append(func(try_config)) 205 | trials.append({'config': try_config, 'score': np.mean(scores) - np.std(scores)}) 206 | logger.info('score: {} (+/- {})'.format(np.mean(scores), np.std(scores))) 207 | return trials 208 | 209 | def run_single_and_plot(config, algo_name='CEM'): 210 | record_dfs = pd.DataFrame(columns=['steps', 'reward']) 211 | reward_cols = [] 212 | for i in range(config['num_trials']): 213 | config['seed'] = np.random.randint(1000) 214 | master = Master(config) 215 | master.run() 216 | reward_record = pd.DataFrame(master.reward_record) 217 | record_dfs = record_dfs.merge(reward_record, how='outer', on='steps', suffixes=('', '_{}'.format(i))) 218 | reward_cols.append('reward_{}'.format(i)) 219 | 
220 | record_dfs = record_dfs.drop(columns='reward').sort_values(by='steps', ascending=True).ffill().bfill() 221 | record_dfs['reward_mean'] = record_dfs[reward_cols].mean(axis=1) 222 | record_dfs['reward_std'] = record_dfs[reward_cols].std(axis=1) 223 | record_dfs['reward_smooth'] = record_dfs['reward_mean'].ewm(span=1000).mean() 224 | record_dfs['reward_smooth_std'] = record_dfs['reward_std'].ewm(span=1000).mean() 225 | record_dfs.to_csv(joindir(TUNE_DIR, '{}-record-{}.csv'.format(algo_name, config['env_name']))) 226 | 227 | # Plot 228 | plt.figure(figsize=(12, 6)) 229 | plt.plot(record_dfs['steps'], record_dfs['reward_smooth'], label='reward') 230 | plt.fill_between(record_dfs['steps'], record_dfs['reward_smooth'] - record_dfs['reward_smooth_std'], 231 | record_dfs['reward_smooth'] + record_dfs['reward_smooth_std'], color='b', alpha=0.2) 232 | plt.legend() 233 | plt.xlabel('steps of env interaction (sample complexity)') 234 | plt.ylabel('average reward') 235 | plt.title('{} on {}'.format(algo_name, config['env_name'])) 236 | plt.savefig(joindir(TUNE_DIR, '{}-plot-{}.pdf'.format(algo_name, config['env_name']))) 237 | 238 | if __name__ == '__main__': 239 | 240 | logger = Logger(joindir(LOG_DIR, 'log_cem.txt')) 241 | 242 | trials = grid_search(run_cem, config_hopper) 243 | 244 | best_trial = sorted(trials, key=lambda x: x['score'])[-1] 245 | best_config = best_trial['config'] 246 | best_score = best_trial['score'] 247 | 248 | with open(joindir(TUNE_DIR, 'ARS-{}.json'.format(best_config['env_name'])), 'w') as f: 249 | json.dump(best_config, f, indent=4, sort_keys=True) 250 | 251 | logger.info('========best solution found========') 252 | logger.info('best score: {}'.format(best_score)) 253 | logger.info('best config: {}'.format(best_config)) 254 | 255 | run_single_and_plot(best_config) 256 | -------------------------------------------------------------------------------- /code/dqn.py: -------------------------------------------------------------------------------- 1 | from pathos.multiprocessing import ProcessingPool as Pool 2 | import matplotlib 3 | matplotlib.use('agg') 4 | import matplotlib.pyplot as plt 5 | import matplotlib.patches as mpatches 6 | from matplotlib.colors import ListedColormap 7 | 8 | from tqdm import trange 9 | import pandas as pd 10 | import gym 11 | 12 | import torch 13 | from torch import nn 14 | from torch import optim 15 | from torch import Tensor 16 | from torch.autograd import Variable 17 | from torch.nn import functional as F 18 | from collections import deque 19 | import numpy as np 20 | import pdb 21 | import os 22 | 23 | env = gym.make('MountainCar-v0') 24 | sample_func = env.action_space.sample 25 | 26 | class MLP(nn.Module): 27 | def __init__(self): 28 | super(MLP, self).__init__() 29 | 30 | self.state_space = env.observation_space.shape[0] 31 | self.action_space = env.action_space.n 32 | self.hidden = 200 33 | self.fc1 = nn.Linear(self.state_space, self.hidden, bias=False) 34 | self.fc2 = nn.Linear(self.hidden, self.action_space, bias=False) 35 | 36 | def forward(self, x): 37 | model = torch.nn.Sequential( 38 | self.fc1, 39 | self.fc2, 40 | ) 41 | return model(x) 42 | 43 | def act(self, x): 44 | x = Tensor(x).unsqueeze(0) 45 | return int(self.forward(x).argmax(1)[0]) 46 | 47 | def act_egreedy(self, x, e=0.7, sample=sample_func): 48 | return self.act(x) if np.random.rand() > e else sample() 49 | 50 | def dqn(loss_type, target_freq, epsilon_decay, lr_decay_freq, lr): 51 | 52 | DIRS = 'test' 53 | os.makedirs(DIRS, exist_ok=True) 54 | 55 | identifier = 
'{}/{}_{}_{}_{}_{}'.format(DIRS, loss_type, target_freq, epsilon_decay, lr_decay_freq, lr) 56 | record = [] 57 | evaluate = [] 58 | buffer = deque(maxlen=100000) 59 | agent = MLP() 60 | agent_target = MLP() 61 | agent_target.load_state_dict(agent.state_dict()) 62 | opt = optim.SGD(agent.parameters(), lr=lr) 63 | sch = optim.lr_scheduler.StepLR(opt, step_size=lr_decay_freq, gamma=0.998) 64 | mseloss = nn.MSELoss() 65 | env = gym.make('MountainCar-v0') 66 | s = env.reset() 67 | batch_size = 128 68 | learn_start = 1000 69 | gamma = 0.998 70 | epsilon = 0.7 71 | total_steps = 200 * 100000 72 | 73 | maxpos = - 4 74 | reward = 0 75 | eplen = 0 76 | success_sofar = 0 77 | loss = 0 78 | 79 | for i in trange(total_steps): 80 | 81 | # sample a transition and store it to the replay buffer 82 | if i < learn_start: 83 | a = env.action_space.sample() 84 | else: 85 | a = agent.act_egreedy(s, e=epsilon) 86 | ns, r, d, info = env.step(a) 87 | buffer.append([s, a, r, ns, d]) 88 | reward += r 89 | eplen += 1 90 | if ns[0] > maxpos: 91 | maxpos = ns[0] 92 | if d: 93 | if reward != -200: 94 | success_sofar += 1 95 | evaluate.append(dict(i=i, reward=reward, eplen=eplen, maxpos=maxpos, 96 | epsilon=epsilon, success_sofar=success_sofar, lr=opt.param_groups[0]['lr'], 97 | loss=float(loss))) 98 | reward = 0 99 | eplen = 0 100 | maxpos = -4 101 | epsilon = max(0.01, epsilon * epsilon_decay) 102 | s = env.reset() 103 | else: 104 | s = ns 105 | 106 | if i >= learn_start and i % 4 == 0: 107 | 108 | # sample a batch from the replay buffer 109 | inds = np.random.choice(len(buffer), batch_size, replace=False) 110 | bs, ba, br, bns, bd = [], [], [], [], [] 111 | for ind in inds: 112 | ss, aa, rr, nsns, dd = buffer[ind] 113 | bs.append(ss) 114 | ba.append(aa) 115 | br.append(rr) 116 | bns.append(nsns) 117 | bd.append(dd) 118 | bs = Tensor(np.array(bs)) 119 | ba = torch.tensor(np.array(ba), dtype=torch.long) 120 | br = Tensor(np.array(br)) 121 | bns = Tensor(np.array(bns)) 122 | masks = Tensor(1 - np.array(bd) * 1) 123 | 124 | nsaction = agent(bns).argmax(1) 125 | Qtarget = (br + masks * gamma * agent_target(bns)[range(batch_size), nsaction]).detach() 126 | Qvalue = agent(bs)[range(batch_size), ba] 127 | if loss_type == 'MSE': 128 | loss = mseloss(Qvalue, Qtarget) 129 | elif loss_type == 'SL1': 130 | loss = F.smooth_l1_loss(Qvalue, Qtarget) 131 | agent.zero_grad() 132 | loss.backward() 133 | for param in agent.parameters(): 134 | param.grad.data.clamp_(-1, 1) 135 | # print('Finish the {}-th iteration, the loss = {}'.format(i, float(loss))) 136 | opt.step() 137 | sch.step() 138 | 139 | if i % target_freq == 0: 140 | agent_target.load_state_dict(agent.state_dict()) 141 | 142 | record.append(dict(i=i, loss=float(loss))) 143 | 144 | record = pd.DataFrame(record) 145 | evaluate = pd.DataFrame(evaluate) 146 | evaluate.to_csv('{}_episode.csv'.format(identifier)) 147 | 148 | # Plot training process 149 | plt.figure(figsize=(15, 5)) 150 | plt.subplot(241) 151 | plt.plot(record['i'][::10000], record['loss'][::10000]) 152 | plt.title('loss') 153 | plt.subplot(242) 154 | plt.plot(evaluate['i'][::200], evaluate['reward'][::200]) 155 | plt.title('reward') 156 | plt.subplot(243) 157 | plt.plot(evaluate['i'][::200], evaluate['eplen'][::200]) 158 | plt.title('eplen') 159 | plt.subplot(244) 160 | plt.plot(evaluate['i'][::200], evaluate['maxpos'][::200]) 161 | plt.title('maxpos') 162 | plt.subplot(245) 163 | plt.plot(evaluate['i'][::200], evaluate['epsilon'][::200]) 164 | plt.title('epsilon') 165 | plt.subplot(246) 166 | 
plt.plot(evaluate['i'][::200], evaluate['success_sofar'][::200]) 167 | plt.title('success_sofar') 168 | plt.subplot(247) 169 | plt.plot(evaluate['i'][::200], evaluate['lr'][::200]) 170 | plt.title('lr') 171 | plt.subplot(248) 172 | plt.plot(evaluate['i'][::200], evaluate['loss'][::200]) 173 | plt.title('loss') 174 | plt.savefig('{}_fig1.png'.format(identifier)) 175 | 176 | # Plot policy 177 | X = np.random.uniform(-1.2, 0.6, 10000) 178 | Y = np.random.uniform(-0.07, 0.07, 10000) 179 | Z = [] 180 | for i in range(len(X)): 181 | _, temp = torch.max( 182 | agent(Variable(torch.from_numpy(np.array([X[i],Y[i]]))).type(torch.FloatTensor)), dim =-1) 183 | z = temp.item() 184 | Z.append(z) 185 | Z = pd.Series(Z) 186 | colors = {0:'blue',1:'lime',2:'red'} 187 | colors = Z.apply(lambda x:colors[x]) 188 | labels = ['Left','Right','Nothing'] 189 | 190 | fig = plt.figure(3, figsize=[7,7]) 191 | ax = fig.gca() 192 | plt.set_cmap('brg') 193 | surf = ax.scatter(X,Y, c=Z) 194 | ax.set_xlabel('Position') 195 | ax.set_ylabel('Velocity') 196 | ax.set_title('Policy') 197 | recs = [] 198 | for i in range(0,3): 199 | recs.append(mpatches.Rectangle((0,0),1,1,fc=sorted(colors.unique())[i])) 200 | plt.legend(recs,labels,loc=4,ncol=3) 201 | plt.savefig('{}_fig2.png'.format(identifier)) 202 | 203 | if __name__ == '__main__': 204 | 205 | dqn(loss_type='MSE', target_freq=2000, epsilon_decay=0.998, lr_decay_freq=2000, lr=1e-4) 206 | -------------------------------------------------------------------------------- /code/ppo.py: -------------------------------------------------------------------------------- 1 | """ 2 | Implementation of PPO 3 | ref: Schulman, John, et al. "Proximal policy optimization algorithms." arXiv preprint arXiv:1707.06347 (2017). 4 | ref: https://github.com/Jiankai-Sun/Proximal-Policy-Optimization-in-Pytorch/blob/master/ppo.py 5 | ref: https://github.com/openai/baselines/tree/master/baselines/ppo2 6 | 7 | NOTICE: 8 | `Tensor2` means 2D-Tensor (num_samples, num_dims) 9 | """ 10 | 11 | import gym 12 | import torch 13 | import torch.nn as nn 14 | import torch.optim as opt 15 | from torch import Tensor 16 | from torch.autograd import Variable 17 | from collections import namedtuple 18 | from itertools import count 19 | import matplotlib 20 | matplotlib.use('agg') 21 | import matplotlib.pyplot as plt 22 | from os.path import join as joindir 23 | from os import makedirs as mkdir 24 | import pandas as pd 25 | import numpy as np 26 | import argparse 27 | import datetime 28 | import math 29 | 30 | 31 | Transition = namedtuple('Transition', ('state', 'value', 'action', 'logproba', 'mask', 'next_state', 'reward')) 32 | EPS = 1e-10 33 | RESULT_DIR = joindir('../result', '.'.join(__file__.split('.')[:-1])) 34 | mkdir(RESULT_DIR, exist_ok=True) 35 | 36 | 37 | class args(object): 38 | env_name = 'Hopper-v2' 39 | seed = 1234 40 | num_episode = 2000 41 | batch_size = 2048 42 | max_step_per_round = 2000 43 | gamma = 0.995 44 | lamda = 0.97 45 | log_num_episode = 1 46 | num_epoch = 10 47 | minibatch_size = 256 48 | clip = 0.2 49 | loss_coeff_value = 0.5 50 | loss_coeff_entropy = 0.01 51 | lr = 3e-4 52 | num_parallel_run = 5 53 | # tricks 54 | schedule_adam = 'linear' 55 | schedule_clip = 'linear' 56 | layer_norm = True 57 | state_norm = True 58 | advantage_norm = True 59 | lossvalue_norm = True 60 | 61 | 62 | class RunningStat(object): 63 | def __init__(self, shape): 64 | self._n = 0 65 | self._M = np.zeros(shape) 66 | self._S = np.zeros(shape) 67 | 68 | def push(self, x): 69 | x = np.asarray(x) 70 | assert x.shape 
== self._M.shape 71 | self._n += 1 72 | if self._n == 1: 73 | self._M[...] = x 74 | else: 75 | oldM = self._M.copy() 76 | self._M[...] = oldM + (x - oldM) / self._n 77 | self._S[...] = self._S + (x - oldM) * (x - self._M) 78 | 79 | @property 80 | def n(self): 81 | return self._n 82 | 83 | @property 84 | def mean(self): 85 | return self._M 86 | 87 | @property 88 | def var(self): 89 | return self._S / (self._n - 1) if self._n > 1 else np.square(self._M) 90 | 91 | @property 92 | def std(self): 93 | return np.sqrt(self.var) 94 | 95 | @property 96 | def shape(self): 97 | return self._M.shape 98 | 99 | 100 | class ZFilter: 101 | """ 102 | y = (x-mean)/std 103 | using running estimates of mean,std 104 | """ 105 | 106 | def __init__(self, shape, demean=True, destd=True, clip=10.0): 107 | self.demean = demean 108 | self.destd = destd 109 | self.clip = clip 110 | 111 | self.rs = RunningStat(shape) 112 | 113 | def __call__(self, x, update=True): 114 | if update: self.rs.push(x) 115 | if self.demean: 116 | x = x - self.rs.mean 117 | if self.destd: 118 | x = x / (self.rs.std + 1e-8) 119 | if self.clip: 120 | x = np.clip(x, -self.clip, self.clip) 121 | return x 122 | 123 | def output_shape(self, input_space): 124 | return input_space.shape 125 | 126 | 127 | class ActorCritic(nn.Module): 128 | def __init__(self, num_inputs, num_outputs, layer_norm=True): 129 | super(ActorCritic, self).__init__() 130 | 131 | self.actor_fc1 = nn.Linear(num_inputs, 64) 132 | self.actor_fc2 = nn.Linear(64, 64) 133 | self.actor_fc3 = nn.Linear(64, num_outputs) 134 | self.actor_logstd = nn.Parameter(torch.zeros(1, num_outputs)) 135 | 136 | self.critic_fc1 = nn.Linear(num_inputs, 64) 137 | self.critic_fc2 = nn.Linear(64, 64) 138 | self.critic_fc3 = nn.Linear(64, 1) 139 | 140 | if layer_norm: 141 | self.layer_norm(self.actor_fc1, std=1.0) 142 | self.layer_norm(self.actor_fc2, std=1.0) 143 | self.layer_norm(self.actor_fc3, std=0.01) 144 | 145 | self.layer_norm(self.critic_fc1, std=1.0) 146 | self.layer_norm(self.critic_fc2, std=1.0) 147 | self.layer_norm(self.critic_fc3, std=1.0) 148 | 149 | @staticmethod 150 | def layer_norm(layer, std=1.0, bias_const=0.0): 151 | torch.nn.init.orthogonal_(layer.weight, std) 152 | torch.nn.init.constant_(layer.bias, bias_const) 153 | 154 | def forward(self, states): 155 | """ 156 | run policy network (actor) as well as value network (critic) 157 | :param states: a Tensor2 represents states 158 | :return: 3 Tensor2 159 | """ 160 | action_mean, action_logstd = self._forward_actor(states) 161 | critic_value = self._forward_critic(states) 162 | return action_mean, action_logstd, critic_value 163 | 164 | def _forward_actor(self, states): 165 | x = torch.tanh(self.actor_fc1(states)) 166 | x = torch.tanh(self.actor_fc2(x)) 167 | action_mean = self.actor_fc3(x) 168 | action_logstd = self.actor_logstd.expand_as(action_mean) 169 | return action_mean, action_logstd 170 | 171 | def _forward_critic(self, states): 172 | x = torch.tanh(self.critic_fc1(states)) 173 | x = torch.tanh(self.critic_fc2(x)) 174 | critic_value = self.critic_fc3(x) 175 | return critic_value 176 | 177 | def select_action(self, action_mean, action_logstd, return_logproba=True): 178 | """ 179 | given mean and std, sample an action from normal(mean, std) 180 | also returns probability of the given chosen 181 | """ 182 | action_std = torch.exp(action_logstd) 183 | action = torch.normal(action_mean, action_std) 184 | if return_logproba: 185 | logproba = self._normal_logproba(action, action_mean, action_logstd, action_std) 186 | return action, 
logproba 187 | 188 | @staticmethod 189 | def _normal_logproba(x, mean, logstd, std=None): 190 | if std is None: 191 | std = torch.exp(logstd) 192 | 193 | std_sq = std.pow(2) 194 | logproba = - 0.5 * math.log(2 * math.pi) - logstd - (x - mean).pow(2) / (2 * std_sq) 195 | return logproba.sum(1) 196 | 197 | def get_logproba(self, states, actions): 198 | """ 199 | return probability of chosen the given actions under corresponding states of current network 200 | :param states: Tensor 201 | :param actions: Tensor 202 | """ 203 | action_mean, action_logstd = self._forward_actor(states) 204 | logproba = self._normal_logproba(actions, action_mean, action_logstd) 205 | return logproba 206 | 207 | 208 | class Memory(object): 209 | def __init__(self): 210 | self.memory = [] 211 | 212 | def push(self, *args): 213 | self.memory.append(Transition(*args)) 214 | 215 | def sample(self): 216 | return Transition(*zip(*self.memory)) 217 | 218 | def __len__(self): 219 | return len(self.memory) 220 | 221 | def ppo(args): 222 | env = gym.make(args.env_name) 223 | num_inputs = env.observation_space.shape[0] 224 | num_actions = env.action_space.shape[0] 225 | 226 | env.seed(args.seed) 227 | torch.manual_seed(args.seed) 228 | 229 | network = ActorCritic(num_inputs, num_actions, layer_norm=args.layer_norm) 230 | optimizer = opt.Adam(network.parameters(), lr=args.lr) 231 | 232 | running_state = ZFilter((num_inputs,), clip=5.0) 233 | 234 | # record average 1-round cumulative reward in every episode 235 | reward_record = [] 236 | global_steps = 0 237 | 238 | lr_now = args.lr 239 | clip_now = args.clip 240 | 241 | for i_episode in range(args.num_episode): 242 | # step1: perform current policy to collect trajectories 243 | # this is an on-policy method! 244 | memory = Memory() 245 | num_steps = 0 246 | reward_list = [] 247 | len_list = [] 248 | while num_steps < args.batch_size: 249 | state = env.reset() 250 | if args.state_norm: 251 | state = running_state(state) 252 | reward_sum = 0 253 | for t in range(args.max_step_per_round): 254 | action_mean, action_logstd, value = network(Tensor(state).unsqueeze(0)) 255 | action, logproba = network.select_action(action_mean, action_logstd) 256 | action = action.data.numpy()[0] 257 | logproba = logproba.data.numpy()[0] 258 | next_state, reward, done, _ = env.step(action) 259 | reward_sum += reward 260 | if args.state_norm: 261 | next_state = running_state(next_state) 262 | mask = 0 if done else 1 263 | 264 | memory.push(state, value, action, logproba, mask, next_state, reward) 265 | 266 | if done: 267 | break 268 | 269 | state = next_state 270 | 271 | num_steps += (t + 1) 272 | global_steps += (t + 1) 273 | reward_list.append(reward_sum) 274 | len_list.append(t + 1) 275 | reward_record.append({ 276 | 'episode': i_episode, 277 | 'steps': global_steps, 278 | 'meanepreward': np.mean(reward_list), 279 | 'meaneplen': np.mean(len_list)}) 280 | 281 | batch = memory.sample() 282 | batch_size = len(memory) 283 | 284 | # step2: extract variables from trajectories 285 | rewards = Tensor(batch.reward) 286 | values = Tensor(batch.value) 287 | masks = Tensor(batch.mask) 288 | actions = Tensor(batch.action) 289 | states = Tensor(batch.state) 290 | oldlogproba = Tensor(batch.logproba) 291 | 292 | returns = Tensor(batch_size) 293 | deltas = Tensor(batch_size) 294 | advantages = Tensor(batch_size) 295 | 296 | prev_return = 0 297 | prev_value = 0 298 | prev_advantage = 0 299 | for i in reversed(range(batch_size)): 300 | returns[i] = rewards[i] + args.gamma * prev_return * masks[i] 301 | deltas[i] = 
rewards[i] + args.gamma * prev_value * masks[i] - values[i] 302 | # ref: https://arxiv.org/pdf/1506.02438.pdf (generalization advantage estimate) 303 | advantages[i] = deltas[i] + args.gamma * args.lamda * prev_advantage * masks[i] 304 | 305 | prev_return = returns[i] 306 | prev_value = values[i] 307 | prev_advantage = advantages[i] 308 | if args.advantage_norm: 309 | advantages = (advantages - advantages.mean()) / (advantages.std() + EPS) 310 | 311 | for i_epoch in range(int(args.num_epoch * batch_size / args.minibatch_size)): 312 | # sample from current batch 313 | minibatch_ind = np.random.choice(batch_size, args.minibatch_size, replace=False) 314 | minibatch_states = states[minibatch_ind] 315 | minibatch_actions = actions[minibatch_ind] 316 | minibatch_oldlogproba = oldlogproba[minibatch_ind] 317 | minibatch_newlogproba = network.get_logproba(minibatch_states, minibatch_actions) 318 | minibatch_advantages = advantages[minibatch_ind] 319 | minibatch_returns = returns[minibatch_ind] 320 | minibatch_newvalues = network._forward_critic(minibatch_states).flatten() 321 | 322 | ratio = torch.exp(minibatch_newlogproba - minibatch_oldlogproba) 323 | surr1 = ratio * minibatch_advantages 324 | surr2 = ratio.clamp(1 - clip_now, 1 + clip_now) * minibatch_advantages 325 | loss_surr = - torch.mean(torch.min(surr1, surr2)) 326 | 327 | # not sure the value loss should be clipped as well 328 | # clip example: https://github.com/Jiankai-Sun/Proximal-Policy-Optimization-in-Pytorch/blob/master/ppo.py 329 | # however, it does not make sense to clip score-like value by a dimensionless clipping parameter 330 | # moreover, original paper does not mention clipped value 331 | if args.lossvalue_norm: 332 | minibatch_return_6std = 6 * minibatch_returns.std() 333 | loss_value = torch.mean((minibatch_newvalues - minibatch_returns).pow(2)) / minibatch_return_6std 334 | else: 335 | loss_value = torch.mean((minibatch_newvalues - minibatch_returns).pow(2)) 336 | 337 | loss_entropy = torch.mean(torch.exp(minibatch_newlogproba) * minibatch_newlogproba) 338 | 339 | total_loss = loss_surr + args.loss_coeff_value * loss_value + args.loss_coeff_entropy * loss_entropy 340 | optimizer.zero_grad() 341 | total_loss.backward() 342 | optimizer.step() 343 | 344 | if args.schedule_clip == 'linear': 345 | ep_ratio = 1 - (i_episode / args.num_episode) 346 | clip_now = args.clip * ep_ratio 347 | 348 | if args.schedule_adam == 'linear': 349 | ep_ratio = 1 - (i_episode / args.num_episode) 350 | lr_now = args.lr * ep_ratio 351 | # set learning rate 352 | # ref: https://stackoverflow.com/questions/48324152/ 353 | for g in optimizer.param_groups: 354 | g['lr'] = lr_now 355 | 356 | if i_episode % args.log_num_episode == 0: 357 | print('Finished episode: {} Reward: {:.4f} total_loss = {:.4f} = {:.4f} + {} * {:.4f} + {} * {:.4f}' \ 358 | .format(i_episode, reward_record[-1]['meanepreward'], total_loss.data, loss_surr.data, args.loss_coeff_value, 359 | loss_value.data, args.loss_coeff_entropy, loss_entropy.data)) 360 | print('-----------------') 361 | 362 | return reward_record 363 | 364 | def test(args): 365 | record_dfs = [] 366 | for i in range(args.num_parallel_run): 367 | args.seed += 1 368 | reward_record = pd.DataFrame(ppo(args)) 369 | reward_record['#parallel_run'] = i 370 | record_dfs.append(reward_record) 371 | record_dfs = pd.concat(record_dfs, axis=0) 372 | record_dfs.to_csv(joindir(RESULT_DIR, 'ppo-record-{}.csv'.format(args.env_name))) 373 | 374 | if __name__ == '__main__': 375 | 376 | for env in ['Walker2d-v2', 'Swimmer-v2', 
'Hopper-v2', 'Humanoid-v2', 'HalfCheetah-v2', 'Reacher-v2']:
377 |         args.env_name = env
378 |         test(args)
379 | 
380 | 
--------------------------------------------------------------------------------
/code/vpg.py:
--------------------------------------------------------------------------------
1 | """
2 | Implementation of Vanilla Policy Gradient
3 | 
4 | This is a policy gradient with a state value function baseline.
5 | Each iteration, trajectories are sampled and the returns are calculated.
6 | The state value function approximator is fitted to these returns, and
7 | the policy gradient is taken with this baseline subtracted.
8 | 
9 | The actor outputs a mean and std. To maintain exploration, we add an
10 | entropy loss to the actor.
11 | 
12 | ref: http://rail.eecs.berkeley.edu/deeprlcourse/static/homeworks/hw2.pdf
13 | 
14 | NOTICE:
15 |     `Tensor2` means 2D-Tensor (num_samples, num_dims)
16 | """
17 | 
18 | import gym
19 | import torch
20 | import torch.nn as nn
21 | import torch.optim as opt
22 | from torch import Tensor
23 | from torch.autograd import Variable
24 | from collections import deque, namedtuple
25 | from itertools import count
26 | import scipy.optimize as sciopt
27 | import matplotlib
28 | matplotlib.use('agg')
29 | import matplotlib.pyplot as plt
30 | from os.path import join as joindir
31 | import pandas as pd
32 | import numpy as np
33 | import argparse
34 | import datetime
35 | import math
36 | 
37 | 
38 | Transition = namedtuple('Transition', ('state', 'action', 'action_mean', 'action_logstd', 'mask', 'next_state', 'reward'))
39 | EPS = 1e-10
40 | RESULT_DIR = '../result'
41 | 
42 | 
43 | class args(object):
44 |     env_name = 'Hopper-v2'
45 |     seed = 1234
46 |     num_episode = 100
47 |     max_step_per_round = 200
48 |     batch_size = 5000
49 |     gamma = 0.995
50 |     log_num_episode = 1
51 |     loss_coeff_entropy = 1e-3
52 |     lr = 1e-4
53 |     hidden_size = 32
54 |     initial_policy_logstd = -1.20397
55 |     num_opt_value_each_episode = 100
56 |     num_opt_actor_each_episode = 10
57 |     num_parallel_run = 5
58 | 
59 | 
60 | def add_arguments():
61 |     parser = argparse.ArgumentParser()
62 |     parser.add_argument('--env_name', type=str, default='Hopper-v2')
63 |     parser.add_argument('--seed', type=int, default=1234)
64 |     parser.add_argument('--num_episode', type=int, default=1000)
65 |     parser.add_argument('--max_step_per_round', type=int, default=200)
66 |     parser.add_argument('--batch_size', type=int, default=5000)
67 |     parser.add_argument('--gamma', type=float, default=0.995)
68 |     parser.add_argument('--log_num_episode', type=int, default=1)
69 |     parser.add_argument('--loss_coeff_entropy', type=float, default=1e-3)
70 |     parser.add_argument('--lr', type=float, default=1e-4)
71 |     parser.add_argument('--hidden_size', type=int, default=32)
72 |     parser.add_argument('--initial_policy_logstd', type=float, default=-1.20397)
73 |     parser.add_argument('--num_opt_value_each_episode', type=int, default=100)
74 |     parser.add_argument('--num_opt_actor_each_episode', type=int, default=10)
75 |     parser.add_argument('--num_parallel_run', type=int, default=5)
76 | 
77 |     args = parser.parse_args()
78 |     return args
79 | 
80 | class RunningStat(object):
81 |     def __init__(self, shape):
82 |         self._n = 0
83 |         self._M = np.zeros(shape)
84 |         self._S = np.zeros(shape)
85 | 
86 |     def push(self, x):
87 |         x = np.asarray(x)
88 |         assert x.shape == self._M.shape
89 |         self._n += 1
90 |         if self._n == 1:
91 |             self._M[...] = x
92 |         else:
93 |             oldM = self._M.copy()
94 |             self._M[...] = oldM + (x - oldM) / self._n
95 |             self._S[...] 
= self._S + (x - oldM) * (x - self._M) 96 | 97 | @property 98 | def n(self): 99 | return self._n 100 | 101 | @property 102 | def mean(self): 103 | return self._M 104 | 105 | @property 106 | def var(self): 107 | return self._S / (self._n - 1) if self._n > 1 else np.square(self._M) 108 | 109 | @property 110 | def std(self): 111 | return np.sqrt(self.var) 112 | 113 | @property 114 | def shape(self): 115 | return self._M.shape 116 | 117 | 118 | class ZFilter: 119 | """ 120 | y = (x-mean)/std 121 | using running estimates of mean,std 122 | """ 123 | 124 | def __init__(self, shape, demean=True, destd=True, clip=10.0): 125 | self.demean = demean 126 | self.destd = destd 127 | self.clip = clip 128 | 129 | self.rs = RunningStat(shape) 130 | 131 | def __call__(self, x, update=True): 132 | if update: self.rs.push(x) 133 | if self.demean: 134 | x = x - self.rs.mean 135 | if self.destd: 136 | x = x / (self.rs.std + 1e-8) 137 | if self.clip: 138 | x = np.clip(x, -self.clip, self.clip) 139 | return x 140 | 141 | def output_shape(self, input_space): 142 | return input_space.shape 143 | 144 | 145 | class Memory(object): 146 | def __init__(self): 147 | self.memory = [] 148 | 149 | def push(self, transition): 150 | self.memory.append(transition) 151 | 152 | def sample(self, do_reverse=True): 153 | if do_reverse: 154 | return Transition(*zip(*reversed(self.memory))) 155 | else: 156 | return Transition(*zip(*self.memory)) 157 | 158 | def __len__(self): 159 | return len(self.memory) 160 | 161 | 162 | class Actor(nn.Module): 163 | def __init__(self, dim_states, dim_actions): 164 | super(Actor, self).__init__() 165 | 166 | self.fc1 = nn.Linear(dim_states, args.hidden_size) 167 | self.fc2 = nn.Linear(args.hidden_size, args.hidden_size) 168 | self.fc_mean = nn.Linear(args.hidden_size, dim_actions) 169 | self.fc_logstd = nn.Parameter(args.initial_policy_logstd * torch.ones(1, dim_actions), requires_grad=False) 170 | 171 | def forward(self, states): 172 | """ 173 | given a states returns the action distribution (gaussian) with mean and logstd 174 | :param states: a Tensor2 represents states 175 | :return: Tensor2 action mean and logstd 176 | """ 177 | x = torch.relu(self.fc1(states)) 178 | x = torch.relu(self.fc2(x)) 179 | action_mean = self.fc_mean(x) 180 | action_logstd = self.fc_logstd.expand_as(action_mean) 181 | return action_mean, action_logstd 182 | 183 | @ staticmethod 184 | def select_action(action_mean, action_logstd): 185 | """ 186 | given mean and std, sample an action from normal(mean, std) 187 | also returns probability of the given chosen 188 | :param action_mean: Tensor2 189 | :param action_logstd: Tensor2 190 | :return: Tensor2 action 191 | """ 192 | action_std = torch.exp(action_logstd) 193 | action = torch.normal(action_mean, action_std) 194 | return action 195 | 196 | @staticmethod 197 | def normal_logproba(x, mean, logstd, std=None): 198 | if std is None: 199 | std = torch.exp(logstd) 200 | 201 | std_sq = std.pow(2) 202 | logproba = - 0.5 * math.log(2 * math.pi) - logstd - (x - mean).pow(2) / (2 * std_sq) 203 | return logproba.sum(1).view(-1, 1) 204 | 205 | class Baseline(nn.Module): 206 | def __init__(self, dim_states): 207 | super(Baseline, self).__init__() 208 | 209 | self.fc1 = nn.Linear(dim_states, args.hidden_size) 210 | self.fc2 = nn.Linear(args.hidden_size, args.hidden_size) 211 | self.fc3 = nn.Linear(args.hidden_size, 1) 212 | 213 | def forward(self, states): 214 | """ 215 | given states returns its approximated state value function 216 | :param states: a Tensor2 represents states 217 | 
:return: Tensor2 state value function
218 |         """
219 |         x = torch.relu(self.fc1(states))
220 |         x = torch.relu(self.fc2(x))
221 |         values = torch.relu(self.fc3(x))
222 |         return values
223 | 
224 | 
225 | def vpg():
226 |     env = gym.make(args.env_name)
227 |     dim_states = env.observation_space.shape[0]
228 |     dim_actions = env.action_space.shape[0]
229 | 
230 |     env.seed(args.seed)
231 |     torch.manual_seed(args.seed)
232 | 
233 |     actor = Actor(dim_states, dim_actions)
234 |     baseline = Baseline(dim_states)
235 |     optimizer_a = opt.Adam(actor.parameters(), lr=args.lr)
236 |     optimizer_b = opt.Adam(baseline.parameters(), lr=args.lr)
237 |     running_state = ZFilter((dim_states,), clip=5)
238 | 
239 |     reward_record = []
240 |     global_steps = 0
241 | 
242 |     for i_episode in range(args.num_episode):
243 | 
244 |         memory = Memory()
245 |         num_steps = 0
246 |         reward_sum_list = []
247 |         while num_steps < args.batch_size:
248 |             state = env.reset()
249 |             state = running_state(state)
250 |             reward_sum = 0
251 |             for t in range(args.max_step_per_round):
252 |                 action_mean, action_logstd = actor(Tensor(state).unsqueeze(0))
253 |                 action = actor.select_action(action_mean, action_logstd)
254 |                 action = action.data.numpy()[0]
255 |                 next_state, reward, done, info = env.step(action)
256 |                 reward_sum += reward
257 |                 next_state = running_state(next_state)
258 |                 mask = 0 if done else 1
259 | 
260 |                 memory.push(Transition(
261 |                     state=state, action=action, action_mean=action_mean, action_logstd=action_logstd,
262 |                     mask=mask, next_state=next_state, reward=reward
263 |                 ))
264 | 
265 |                 if done:
266 |                     break
267 | 
268 |                 state = next_state
269 | 
270 |             reward_sum_list.append(reward_sum)
271 | 
272 |             num_steps += (t + 1)
273 |             global_steps += (t + 1)
274 |             reward_record.append({'steps': global_steps, 'reward': reward_sum})
275 | 
276 |         batch = memory.sample()
277 |         batch_size = len(memory)
278 | 
279 |         states = Tensor(batch.state)
280 |         actions = Tensor(batch.action)
281 |         action_means = torch.cat(batch.action_mean)
282 |         action_logstds = torch.cat(batch.action_logstd)
283 |         masks = Tensor(batch.mask).view(-1, 1)
284 |         next_states = Tensor(batch.next_state)
285 |         rewards = Tensor(batch.reward).view(-1, 1)
286 | 
287 |         returns = torch.zeros(batch_size, 1)
288 |         returns[0] = rewards[0]
289 |         # notice the trajectory is already reversed
290 |         for i in range(1, batch_size):
291 |             returns[i] = rewards[i] + args.gamma * returns[i - 1] * masks[i]
292 | 
293 |         for i in range(args.num_opt_value_each_episode):
294 |             optimizer_b.zero_grad()
295 |             values = baseline(Variable(states))
296 |             loss_value = (Variable(returns) - values).pow(2).mean()
297 |             loss_value.backward()
298 |             optimizer_b.step()
299 | 
300 |         for i in range(args.num_opt_actor_each_episode):
301 |             optimizer_a.zero_grad()
302 |             action_means, action_logstds = actor(Variable(states))
303 |             logprobas = actor.normal_logproba(Variable(actions), action_means, action_logstds)
304 |             loss_policy = - ((returns - values).detach() * logprobas).mean()
305 |             loss_policy.backward()
306 |             optimizer_a.step()
307 | 
308 |         if i_episode % args.log_num_episode == 0:
309 |             print('Finished episode: {} steps: {} AvgReward: {:.4f} loss = value({:.4f}) + policy({:.4f})' \
310 |                 .format(i_episode, reward_record[-1]['steps'], np.mean(reward_sum_list), loss_value.data, loss_policy.data))
311 |             print('-----------------')
312 | 
313 |     return reward_record
314 | 
315 | if __name__ == '__main__':
316 |     datestr = datetime.datetime.now().strftime('%Y-%m-%d')
317 |     args = add_arguments()
318 | 
319 |     record_dfs = pd.DataFrame(columns=['steps', 
'reward']) 320 | reward_cols = [] 321 | for i in range(args.num_parallel_run): 322 | args.seed += 1 323 | reward_record = pd.DataFrame(vpg()) 324 | record_dfs = record_dfs.merge(reward_record, how='outer', on='steps', suffixes=('', '_{}'.format(i))) 325 | reward_cols.append('reward_{}'.format(i)) 326 | 327 | record_dfs = record_dfs.drop(columns='reward').sort_values(by='steps', ascending=True).ffill().bfill() 328 | record_dfs['reward_mean'] = record_dfs[reward_cols].mean(axis=1) 329 | record_dfs['reward_std'] = record_dfs[reward_cols].std(axis=1) 330 | record_dfs['reward_smooth'] = record_dfs['reward_mean'].ewm(span=1000).mean() 331 | record_dfs['reward_smooth_std'] = record_dfs['reward_std'].ewm(span=1000).mean() 332 | record_dfs.to_csv(joindir(RESULT_DIR, 'vpg-record-{}-{}.csv'.format(args.env_name, datestr))) 333 | 334 | # Plot 335 | plt.figure(figsize=(12, 6)) 336 | plt.plot(record_dfs['steps'], record_dfs['reward_smooth'], label='reward') 337 | plt.fill_between(record_dfs['steps'], record_dfs['reward_smooth'] - record_dfs['reward_smooth_std'], 338 | record_dfs['reward_smooth'] + record_dfs['reward_smooth_std'], color='b', alpha=0.2) 339 | plt.legend() 340 | plt.xlabel('steps of env interaction (sample complexity)') 341 | plt.ylabel('average reward') 342 | plt.title('VPG on {}'.format(args.env_name)) 343 | plt.savefig(joindir(RESULT_DIR, 'vpg-{}-{}.pdf'.format(args.env_name, datestr))) 344 | -------------------------------------------------------------------------------- /docs/ppo_experiments.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/zhangchuheng123/Reinforcement-Implementation/c04e0df10ec29ef775ea31395a8ad4b917302d24/docs/ppo_experiments.png -------------------------------------------------------------------------------- /docs/rainbow.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/zhangchuheng123/Reinforcement-Implementation/c04e0df10ec29ef775ea31395a8ad4b917302d24/docs/rainbow.png --------------------------------------------------------------------------------
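
A few illustrative sketches follow; none of them are files in this repository. The first exercises the advantage computation and clipped surrogate used in code/ppo.py (the reversed GAE recursion and the ratio clamp) in isolation on synthetic data. The helper names `compute_gae` and `ppo_clip_loss` are hypothetical, and the gamma/lamda/clip values simply mirror the ppo.py defaults.

```python
# Minimal, self-contained sketch of the GAE recursion and PPO clipped
# surrogate as used in code/ppo.py. Helper names are hypothetical and
# not part of this repository; the data below is synthetic.
import torch

def compute_gae(rewards, values, masks, gamma=0.995, lam=0.97):
    # Reversed recursion over one batch of transitions, mirroring ppo.py:
    #   delta_t = r_t + gamma * V(s_{t+1}) * mask_t - V(s_t)
    #   A_t     = delta_t + gamma * lam * mask_t * A_{t+1}
    T = rewards.shape[0]
    advantages = torch.zeros(T)
    returns = torch.zeros(T)
    prev_return, prev_value, prev_adv = 0.0, 0.0, 0.0
    for t in reversed(range(T)):
        returns[t] = rewards[t] + gamma * prev_return * masks[t]
        delta = rewards[t] + gamma * prev_value * masks[t] - values[t]
        advantages[t] = delta + gamma * lam * prev_adv * masks[t]
        prev_return, prev_value, prev_adv = returns[t], values[t], advantages[t]
    return advantages, returns

def ppo_clip_loss(new_logproba, old_logproba, advantages, clip=0.2):
    # Clipped surrogate objective: take the minimum of the unclipped and
    # clipped probability-ratio terms, then negate to obtain a loss.
    ratio = torch.exp(new_logproba - old_logproba)
    surr1 = ratio * advantages
    surr2 = ratio.clamp(1.0 - clip, 1.0 + clip) * advantages
    return -torch.min(surr1, surr2).mean()

if __name__ == '__main__':
    torch.manual_seed(0)
    T = 8
    rewards = torch.randn(T)
    values = torch.randn(T)
    masks = torch.ones(T)                            # no episode boundary in this toy batch
    adv, ret = compute_gae(rewards, values, masks)
    adv = (adv - adv.mean()) / (adv.std() + 1e-10)   # advantage normalization, as in ppo.py
    old_lp = torch.randn(T)
    new_lp = old_lp + 0.05 * torch.randn(T)
    print('loss_surr =', float(ppo_clip_loss(new_lp, old_lp, adv)))
```

Taking the minimum of the unclipped and clipped surrogates gives a pessimistic estimate of the policy improvement, which is what removes the incentive for large probability-ratio updates.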
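
Similarly, the target construction in code/dqn.py selects next-state actions with the online network and evaluates them with the target network (double Q-learning). The sketch below is a toy illustration, not repository code: `double_dqn_target` is a hypothetical helper, and the linear networks and shapes are placeholders.

```python
# Toy sketch of the double-DQN target used in code/dqn.py: the online
# network picks argmax actions for the next states, the target network
# evaluates them. Names and shapes here are illustrative only.
import torch
import torch.nn as nn

def double_dqn_target(online, target, br, bns, masks, gamma=0.998):
    with torch.no_grad():
        idx = torch.arange(bns.shape[0])
        next_actions = online(bns).argmax(dim=1)   # action selection: online net
        next_q = target(bns)[idx, next_actions]    # action evaluation: target net
        return br + masks * gamma * next_q

if __name__ == '__main__':
    torch.manual_seed(0)
    batch, state_dim, n_actions = 4, 2, 3           # MountainCar-like dimensions
    online = nn.Linear(state_dim, n_actions)
    target = nn.Linear(state_dim, n_actions)
    target.load_state_dict(online.state_dict())     # start with a synced target, as dqn.py does
    bns = torch.randn(batch, state_dim)
    br = torch.randn(batch)
    masks = torch.ones(batch)
    q_target = double_dqn_target(online, target, br, bns, masks)
    print(q_target.shape)                           # torch.Size([4])
```

Decoupling action selection from evaluation is what reduces the overestimation bias of the vanilla DQN target.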
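
Finally, both ppo.py and vpg.py normalize observations with the ZFilter/RunningStat pair, which maintains a running mean and variance via a Welford-style update. The snippet below is an illustrative re-implementation checked against numpy batch statistics; `RunningMeanVar` is not a class from this repository.

```python
# Minimal check of the Welford-style running mean/variance update used by
# RunningStat in code/ppo.py and code/vpg.py, compared against numpy.
# This is an illustrative re-implementation, not an import from the repo.
import numpy as np

class RunningMeanVar:
    def __init__(self, shape):
        self.n = 0
        self.mean = np.zeros(shape)
        self.m2 = np.zeros(shape)                # running sum of squared deviations

    def push(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n              # update the mean incrementally
        self.m2 += delta * (x - self.mean)       # uses old and new mean, as in RunningStat.push

    @property
    def var(self):
        return self.m2 / (self.n - 1) if self.n > 1 else np.square(self.mean)

if __name__ == '__main__':
    rng = np.random.default_rng(0)
    data = rng.normal(size=(1000, 3))
    rs = RunningMeanVar(3)
    for row in data:
        rs.push(row)
    assert np.allclose(rs.mean, data.mean(axis=0))
    assert np.allclose(rs.var, data.var(axis=0, ddof=1))
    print('running mean/var matches batch statistics')
```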