├── README.md ├── code ├── .DS_Store ├── actor_critic_advantage.py ├── ddpg_update.py ├── deep_deterministic_policy_gradient.py ├── policy_gradient.py ├── proximal_policy_optimization.py └── tensrolayer-implemented │ ├── a3c.py │ ├── ac.py │ ├── ddpg.py │ ├── dqn.py │ ├── dqn_variants.py │ ├── pg.py │ ├── ppo.py │ ├── qlearning.py │ └── tutorial_wrappers.py ├── notes ├── .DS_Store ├── 1 Introduction.md ├── 2 Policy Gradient.md ├── 3 Q - Learning.md ├── 4 Actor Critic.md ├── 5 Sparse Reward.md └── 6 Imitation Learning.md └── slides ├── AC.pdf ├── IRL (v2).pdf ├── PPO (v3).pdf ├── QLearning (v2).pdf └── Reward (v3).pdf /README.md: -------------------------------------------------------------------------------- 1 | # 李宏毅深度强化学习 笔记 2 | 3 | ### 课程主页:[NTU-MLDS18](http://speech.ee.ntu.edu.tw/~tlkagk/courses_MLDS18.html) 4 | 5 | ### 视频: 6 | - [youtube](https://www.youtube.com/playlist?list=PLJV_el3uVTsODxQFgzMzPLa16h6B8kWM_) 7 | - [B站](https://www.bilibili.com/video/av24724071/?spm_id_from=333.788.videocard.4) 8 | 9 | 10 | ![1](http://oss.hackslog.cn/imgs/075034.png) 11 | 12 | 这门课的学习路线如上,强化学习是作为单独一个模块介绍。李宏毅老师讲这门课不是从MDP开始讲起,而是从如何获得最大化奖励出发,直接引出Policy Gradient(以及PPO),再讲Q-learning(原始Q-learning,DQN,各种DQN的升级),然后是A2C(以及A3C, DDPG),紧接着介绍了一些Reward Shaping的方法(主要是Curiosity,Curriculum Learning ,Hierarchical Learning),最后介绍Imitation Learning (Inverse RL)。比较全面的展现了深度强化学习的核心内容,也比较直观。跟伯克利学派的课类似,与UCL上来就讲MDP,解各种value iteration的思路有较大区别。 13 | 文档中的notes以对slides的批注为主,方便在阅读slides时理解,code以纯tensorflow实现,主要参考[莫凡RL教学](https://morvanzhou.github.io/tutorials/machine-learning/reinforcement-learning/),修正部分代码以保持前后一致性,已经加入便于理解的注释。 14 | ### 参考资料: 15 | [作业代码参考](https://github.com/JasonYao81000/MLDS2018SPRING/tree/master/hw4) [纯numpy实现非Deep的RL算法](https://github.com/ddbourgin/numpy-ml/tree/master/numpy_ml/rl_models) [OpenAI tutorial](https://github.com/openai/spinningup/tree/master/docs) [莫凡RL教学](https://morvanzhou.github.io/tutorials/machine-learning/reinforcement-learning/) 16 | - code中的tensorlayer实现来自于[Tensorlayer-RL](https://github.com/tensorlayer/tensorlayer/tree/master/examples/reinforcement_learning),比起原生tensorflow更加简洁 17 | -------------------------------------------------------------------------------- /code/.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/morningsky/NTU-ReinforcementLearning-Notes/d4a9dbf584ae24d974d9b7839f34cee3f18b79dd/code/.DS_Store -------------------------------------------------------------------------------- /code/actor_critic_advantage.py: -------------------------------------------------------------------------------- 1 | import tensorflow as tf 2 | import numpy as np 3 | import gym 4 | 5 | ''' 6 | 比较PG算法 7 | PG loss = log_prob * v估计 (来自贝尔曼公式) 8 | A2C loss = log_prob * TD-error(来自critic网络 表达当前动作的价值比平均动作的价值好多少) 9 | DDPG : critic不仅能影响actor actor也能影响critic 相当于critic不仅告诉actor的行为好不好,还告诉他应该怎么改进才能更好(传一个梯度 dq/da) 10 | PPO: 对PG的更新加了限制,提高训练稳定性 相比于A2C 只是actor网络更加复杂 11 | ''' 12 | class Actor(object): #本质还是policy gradient 不过A2C是单步更新 13 | def __init__(self, 14 | sess, #两个网络需要共用一个session 所以外部初始化 15 | n_actions, 16 | n_features, 17 | lr=0.01, ): 18 | #self.ep_obs, self.ep_as, self.ep_rs =[],[],[] #由于是单步更新 所以不需要存储每个episode的数据 19 | self.sess = sess 20 | 21 | self.s = tf.placeholder(tf.float32, [1, n_features], "state") 22 | self.a = tf.placeholder(tf.int32, None, "act") # 23 | self.td_error = tf.placeholder(tf.float32, None, "td_error") # TD_error更新的幅度 td 的理解应该是 Q(s, a) - V(s), 某个动作价值减去平均动作价值 24 | 25 | with tf.variable_scope('Actor'): 
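            # Note: structurally this Actor is the same softmax policy network as in policy_gradient.py;
            # the practical difference from vanilla PG is only the weight applied to log-prob in the loss —
            # PG weights it by the whole-episode discounted return, while A2C weights it by the per-step
            # TD error coming from the Critic, which is what allows the single-step update mentioned above.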
#将原来的name_scope换成variable_scope ,可以在一个scope里面共享变量 26 | l1 = tf.layers.dense( 27 | inputs=self.s, 28 | units=20, # number of hidden units 29 | activation=tf.nn.relu, 30 | kernel_initializer=tf.random_normal_initializer(0., .1), # weights 31 | bias_initializer=tf.constant_initializer(0.1), # biases 32 | name='l1' 33 | ) 34 | 35 | self.acts_prob = tf.layers.dense( 36 | inputs=l1, 37 | units=n_actions, # output units 38 | activation=tf.nn.softmax, # get action probabilities 39 | kernel_initializer=tf.random_normal_initializer(0., .1), # weights 40 | bias_initializer=tf.constant_initializer(0.1), # biases 41 | name='acts_prob' 42 | ) 43 | 44 | #with tf.name_scope('loss'): 45 | # 最大化 总体 reward (log_p * R) 就是在最小化 -(log_p * R), 而 tf 的功能里只有最小化 loss 46 | #neg_log_prob = tf.reduce_sum(-tf.log(self.all_act_prob)*tf.one_hot(self.tf_acts, self.n_actions), axis=1) #加- 变为梯度下降 47 | #loss = tf.reduce_mean(neg_log_prob * self.tf_vt) 48 | with tf.variable_scope('loss'): 49 | log_prob = tf.log(self.acts_prob[0,self.a]) #[[0.1,0.2,0.3]] -> 0.1, if a=0 50 | self.loss = log_prob * self.td_error # advantage (TD_error) guided loss 51 | 52 | with tf.name_scope('train'): 53 | self.train_op = tf.train.AdamOptimizer(lr).minimize(-self.loss) 54 | 55 | def choose_action(self, s): #选择行为 56 | s = s[np.newaxis, :] 57 | probs = self.sess.run(self.acts_prob, {self.s: s}) # get probabilities for all actions 58 | action = np.random.choice(np.arange(probs.shape[1]), p=probs.ravel()) 59 | return action # return a int 60 | 61 | 62 | def learn(self, s, a, td): 63 | s = s[np.newaxis, :] 64 | feed_dict = {self.s: s, self.a: a, self.td_error: td} 65 | _, loss = self.sess.run([self.train_op, self.loss], feed_dict) 66 | return loss 67 | 68 | 69 | class Critic(object): 70 | def __init__(self, sess, n_features, lr=0.01, gamma=0.9): 71 | self.sess = sess 72 | 73 | self.s = tf.placeholder(tf.float32, [1, n_features], "state") 74 | self.v_ = tf.placeholder(tf.float32, [1, 1], "v_next") 75 | self.r = tf.placeholder(tf.float32, None, 'r') 76 | 77 | with tf.variable_scope('Critic'): 78 | l1 = tf.layers.dense( 79 | inputs=self.s, 80 | units=20, # number of hidden units 81 | activation=tf.nn.relu, # None 82 | # have to be linear to make sure the convergence of actor. 83 | # But linear approximator seems hardly learns the correct Q. 
84 | kernel_initializer=tf.random_normal_initializer(0., .1), # weights 85 | bias_initializer=tf.constant_initializer(0.1), # biases 86 | name='l1' 87 | ) 88 | 89 | self.v = tf.layers.dense( 90 | inputs=l1, 91 | units=1, # output units 92 | activation=None, 93 | kernel_initializer=tf.random_normal_initializer(0., .1), # weights 94 | bias_initializer=tf.constant_initializer(0.1), # biases 95 | name='V' 96 | ) 97 | 98 | with tf.variable_scope('squared_TD_error'): 99 | self.td_error = self.r + gamma * self.v_ - self.v 100 | self.loss = tf.square(self.td_error) # TD_error = (r+gamma*V_next) - V_eval 101 | with tf.variable_scope('train'): 102 | self.train_op = tf.train.AdamOptimizer(lr).minimize(self.loss) 103 | 104 | def learn(self, s, r, s_): 105 | s, s_ = s[np.newaxis, :], s_[np.newaxis, :] 106 | 107 | v_ = self.sess.run(self.v, {self.s: s_}) 108 | td_error, _ = self.sess.run([self.td_error, self.train_op], 109 | {self.s: s, self.v_: v_, self.r: r}) 110 | return td_error 111 | 112 | np.random.seed(2) 113 | tf.set_random_seed(2) # reproducible 114 | 115 | # Superparameters 116 | OUTPUT_GRAPH = False 117 | MAX_EPISODE = 100#3000 118 | DISPLAY_REWARD_THRESHOLD = 200 # renders environment if total episode reward is greater then this threshold 119 | MAX_EP_STEPS = 1000 # maximum time step in one episode 120 | RENDER = False # rendering wastes time 121 | GAMMA = 0.9 # reward discount in TD error 122 | LR_A = 0.01 # learning rate for actor 123 | LR_C = 0.05 # learning rate for critic 124 | 125 | env = gym.make('CartPole-v0') 126 | env.seed(1) # reproducible 127 | env = env.unwrapped 128 | 129 | N_F = env.observation_space.shape[0] 130 | N_A = env.action_space.n 131 | 132 | from gym import Space 133 | 134 | sess = tf.Session() #两个网络共用一个session 135 | 136 | actor = Actor(sess, n_features=N_F, n_actions=N_A, lr=LR_A) 137 | critic = Critic(sess, n_features=N_F, lr=LR_C) # we need a good teacher, so the teacher should learn faster than the actor 138 | 139 | sess.run(tf.global_variables_initializer()) 140 | 141 | if OUTPUT_GRAPH: 142 | tf.summary.FileWriter("logs/", sess.graph) 143 | 144 | for i_episode in range(MAX_EPISODE): 145 | state = env.reset() 146 | t = 0 147 | r_list = [] 148 | 149 | while True: 150 | if RENDER: 151 | env.render() 152 | action = actor.choose_action(state) 153 | state_, reward, done, info = env.step(action) 154 | if done: 155 | reward=-20 #最后一步的奖励 一个trick 156 | r_list.append(reward) 157 | td_error = critic.learn(state, reward, state_) 158 | actor.learn(state, action, td_error) 159 | state = state_ 160 | 161 | if done or t>= MAX_EP_STEPS: 162 | ep_rs_sum = sum(r_list) 163 | if 'running_reward' not in globals(): 164 | running_reward = ep_rs_sum 165 | else: 166 | running_reward = running_reward * 0.95 + ep_rs_sum * 0.05 167 | if running_reward > DISPLAY_REWARD_THRESHOLD: RENDER = False # rendering 168 | print("episode:", i_episode, " reward:", int(running_reward)) 169 | break 170 | 171 | 172 | 173 | 174 | 175 | 176 | -------------------------------------------------------------------------------- /code/ddpg_update.py: -------------------------------------------------------------------------------- 1 | import tensorflow as tf 2 | import numpy as np 3 | import gym 4 | import time 5 | 6 | ##################### hyper parameters #################### 7 | 8 | MAX_EPISODES = 200 9 | MAX_EP_STEPS = 200 10 | LR_A = 0.01 # learning rate for actor 11 | LR_C = 0.02 # learning rate for critic 12 | GAMMA = 0.9 # reward discount 13 | TAU = 0.01 # soft replacement 14 | MEMORY_CAPACITY = 10000 15 | 
BATCH_SIZE = 32 16 | 17 | RENDER = False 18 | ENV_NAME = 'Pendulum-v0' 19 | 20 | #pendulum 动作与状态都是连续空间 21 | #动作空间:只有一维力矩 长度为1 虽然是连续值,但是有bound【-2,2】 22 | #状态空间:一维速度,长度为3 23 | 24 | ############################### DDPG #################################### 25 | #离线训练 单步更新 按batch更新 引入replay buffer机制 26 | class DDPG(object): 27 | def __init__(self, a_dim, s_dim, a_bound,): #初始化2个网络图 注意无论是critic还是actor网络都有target-network机制 target-network不训练 28 | self.memory = np.zeros((MEMORY_CAPACITY, s_dim * 2 + a_dim +1), dtype=np.float32) #借鉴replay buff机制 s*2 : s, s_ 29 | self.pointer = 0 30 | self.sess = tf.Session() 31 | 32 | self.a_dim, self.s_dim, self.a_bound = a_dim, s_dim, a_bound 33 | self.S = tf.placeholder(tf.float32, [None, s_dim], 's') #前面的None用来给batch size占位 34 | self.S_ = tf.placeholder(tf.float32, [None, s_dim], 's_') 35 | self.R = tf.placeholder(tf.float32, [None,1], 'r') 36 | 37 | with tf.variable_scope('Actor'): 38 | self.a = self._build_a(self.S, scope='eval', trainable=True) #要训练的pi网络,也负责收集数据 # input s, output a 39 | a_ = self._build_a(self.S, scope='target', trainable=False) #target网络不训练,只负责输出动作给critic # input s_, output a, get a_ for critic 40 | with tf.variable_scope('Critic'): 41 | q = self._build_c(self.S, self.a, scope='eval', trainable=True) #要训练的Q, 与target输出的q算mse(td-error) 注意这个a来自于memory 42 | q_ = self._build_c(self.S_, a_, scope='target', trainable=False) #这个网络不训练, 用于给出 Actor 更新参数时的 Gradient ascent 强度 即dq/da 注意这个a来自于actor要更新参数时候的a 43 | 44 | # networks parameters 45 | self.ae_params = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope='Actor/eval') 46 | self.at_params = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope='Actor/target') 47 | self.ce_params = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope='Critic/eval') 48 | self.ct_params = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope='Critic/target') 49 | 50 | #taget 网络更新 即从eval网络中复制参数 51 | self.soft_replace = [tf.assign(t, (1-TAU)*t + TAU *e) for t, e in zip(self.at_params+self.ct_params,self.ae_params+self.ce_params)] 52 | 53 | #训练critic网络(eval) 54 | q_target = self.R + GAMMA * q_ #贝尔曼公式(里面的q_来自于Q-target网络输入(s_,a_)的输出) 得出q的”真实值“ 与预测值求mse 55 | td_error = tf.losses.mean_squared_error(labels=q_target, predictions=q) #预测值q 来自于q-eval网络输入当前时刻的(s,a)的输出 56 | self.ctrain = tf.train.AdamOptimizer(LR_C).minimize(td_error, var_list = self.ce_params) #要train的是q-eval网络的参数 最小化mse 57 | 58 | #训练actor网络(eval) 59 | a_loss = -tf.reduce_mean(q) #maximize q 60 | self.atrain = tf.train.AdamOptimizer(LR_A).minimize(a_loss, var_list = self.ae_params) # 61 | 62 | self.sess.run(tf.global_variables_initializer()) 63 | 64 | 65 | def choose_action(self, s): 66 | s = s[np.newaxis, :] 67 | return self.sess.run(self.a, feed_dict={self.S: s})[0] # single action 68 | 69 | 70 | def learn(self): 71 | #每次学习都是先更新target网络参数 72 | self.sess.run(self.soft_replace) 73 | indices = np.random.choice(MEMORY_CAPACITY, size=BATCH_SIZE) 74 | bt = self.memory[indices, : ] #从memory中取一个batch的数据来训练 75 | bs = bt[:, :self.s_dim] #a batch of state 76 | ba = bt[:, self.s_dim: self.s_dim + self.a_dim] #a batch of action 77 | br = bt[:, -self.s_dim - 1: -self.s_dim] #a batch of reward 78 | bs_ = bt[:, -self.s_dim:] 79 | 80 | #一次训练一个batch 这一个batch的训练过程中target网络相当于固定不动 81 | self.sess.run(self.atrain, {self.S: bs}) 82 | self.sess.run(self.ctrain, {self.S: bs, self.a: ba, self.R: br, self.S_: bs_}) 83 | 84 | 85 | def store_transition(self, s, a, r, s_): #离线训练算法标准操作 86 | transition = np.hstack((s, a, [r], s_)) 87 | index = self.pointer % MEMORY_CAPACITY # replace the 
old memory with new memory 88 | self.memory[index, :] = transition 89 | self.pointer += 1 90 | 91 | def _build_a(self, s, scope, trainable): #actor网络结构 直接输出动作确定a 92 | with tf.variable_scope(scope): 93 | net = tf.layers.dense(s, 30, activation=tf.nn.relu, name='l1', trainable=trainable) 94 | a = tf.layers.dense(net, self.a_dim, activation=tf.nn.tanh, name='a', trainable=trainable) #a经过了tanh 数值缩放到了【-1,1】 95 | return tf.multiply(a, self.a_bound, name='scaled_a') #输出的每个a值都乘边界[max,] 可以保证输出范围在【-max,max】 如果最小 最大值不是相反数 得用clip正则化 96 | 97 | def _build_c(self, s, a, scope, trainable): #critic网络结构 输出Q(s,a) 98 | with tf.variable_scope(scope): 99 | n_l1 = 30 100 | w1_s = tf.get_variable('w1_s', [self.s_dim, n_l1], trainable=trainable) 101 | w1_a = tf.get_variable('w1_a', [self.a_dim, n_l1], trainable=trainable) 102 | b1 = tf.get_variable('b1', [1, n_l1], trainable=trainable) 103 | net = tf.nn.relu(tf.matmul(s, w1_s) + tf.matmul(a, w1_a) + b1) 104 | return tf.layers.dense(net, 1, trainable=trainable) # Q(s,a) 105 | 106 | env = gym.make(ENV_NAME) 107 | env = env.unwrapped 108 | env.seed(1) 109 | s_dim = env.observation_space.shape[0] 110 | a_dim = env.action_space.shape[0] 111 | a_bound = env.action_space.high 112 | ddpg = DDPG(a_dim, s_dim, a_bound) 113 | 114 | var = 3 # control exploration 115 | t1 = time.time() 116 | for i in range(MAX_EPISODES): 117 | s = env.reset() 118 | ep_reward = 0 119 | for j in range(MAX_EP_STEPS): #没有明确停止条件的游戏都需要这么一个 120 | if RENDER: 121 | env.render() 122 | a = ddpg.choose_action(s) 123 | a = np.clip(np.random.normal(a, var),-2,2) #增加exploration noise 以actor输出的a为均值,var为方差进行选择a 同时保证a的值在【-2,2】 124 | s_, r, done, info = env.step(a) 125 | 126 | ddpg.store_transition(s, a, r/10, s_) 127 | if ddpg.pointer > MEMORY_CAPACITY: #存储的数据满了开始训练各个网络 128 | var *= 0.9995 #降低动作选择的随机性 129 | ddpg.learn() #超过10000才开始训练,每次从经验库中抽取一个batch,每走一步都会执行一次训练 单步更新 130 | s = s_ 131 | ep_reward += r 132 | if j == MAX_EP_STEPS-1: 133 | print('Episode:', i, ' Reward: %i' % int(ep_reward), 'Explore: %.2f' % var, ) 134 | # if ep_reward > -300:RENDER = True 135 | break 136 | print('Running time: ', time.time() - t1) -------------------------------------------------------------------------------- /code/deep_deterministic_policy_gradient.py: -------------------------------------------------------------------------------- 1 | import tensorflow as tf 2 | import numpy as np 3 | import gym 4 | import time 5 | 6 | 7 | np.random.seed(1) 8 | tf.set_random_seed(1) 9 | 10 | MAX_EPISODES = 200 11 | MAX_EP_STEPS = 200 12 | LR_A = 0.001 # learning rate for actor 13 | LR_C = 0.001 # learning rate for critic 14 | GAMMA = 0.9 # reward discount 15 | REPLACEMENT = [ 16 | dict(name='soft', tau=0.01), 17 | dict(name='hard', rep_iter_a=600, rep_iter_c=500) 18 | ][0] # you can try different target replacement strategies 19 | MEMORY_CAPACITY = 10000 20 | BATCH_SIZE = 32 21 | 22 | RENDER = False 23 | OUTPUT_GRAPH = True 24 | ENV_NAME = 'Pendulum-v0' 25 | 26 | class Actor(object): 27 | def __init__(self, sess, action_dim, action_bound, learning_rate, replacement): 28 | self.sess = sess 29 | self.a_dim = action_dim 30 | self.action_bound = action_bound 31 | self.lr = learning_rate 32 | self.replacement = replacement 33 | self.t_replace_counter = 0 34 | 35 | with tf.variable_scope('Actor'): 36 | # 这个网络用于及时更新参数 37 | self.a = self._build_net(S, scope='eval_net', trainable=True) #由target网络给出确定的action 38 | #对比ppo 网络的输出是一个概率分布 39 | #pi, pi_params = self._build_anet('pi', trainable=True) 40 | #self.sample_op = tf.squeeze(pi.sample(1), axis=0) 
#按概率分布pi选择一个action 41 | 42 | # 这个网络不及时更新参数, 用于预测 Critic 的 Q_target 中的 action 43 | self.a_ = self._build_net(S_, scope='target_net', trainable=False) 44 | 45 | self.e_params = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope='Actor/eval_net') 46 | self.t_params = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope='Actor/target_net') 47 | 48 | if self.replacement['name'] == 'hard': 49 | self.t_replace_counter = 0 50 | self.hard_replace = [tf.assign(t, e) for t, e in zip(self.t_params, self.e_params)] 51 | else: 52 | self.soft_replace = [tf.assign(t, (1 - self.replacement['tau']) * t + self.replacement['tau'] * e) 53 | for t, e in zip(self.t_params, self.e_params)] 54 | 55 | def _build_net(self, s, scope, trainable): 56 | with tf.variable_scope(scope): 57 | init_w = tf.random_normal_initializer(0., 0.3) 58 | init_b = tf.constant_initializer(0.1) 59 | net = tf.layers.dense(s, 30, activation=tf.nn.relu, 60 | kernel_initializer=init_w, bias_initializer=init_b, name='l1', 61 | trainable=trainable) 62 | with tf.variable_scope('a'): 63 | actions = tf.layers.dense(net, self.a_dim, activation=tf.nn.tanh, kernel_initializer=init_w, 64 | bias_initializer=init_b, name='a', trainable=trainable) 65 | scaled_a = tf.multiply(actions, self.action_bound, name='scaled_a') # Scale output to -action_bound to action_bound 66 | return scaled_a 67 | 68 | def learn(self, s): # batch update 69 | self.sess.run(self.train_op, feed_dict={S: s}) 70 | 71 | if self.replacement['name'] == 'soft': 72 | self.sess.run(self.soft_replace) 73 | else: 74 | if self.t_replace_counter % self.replacement['rep_iter_a'] == 0: 75 | self.sess.run(self.hard_replace) 76 | self.t_replace_counter += 1 77 | 78 | def choose_action(self, s): 79 | s = s[np.newaxis, :] # single state 80 | return self.sess.run(self.a, feed_dict={S: s})[0] # single action 81 | #对比ppo a = self.sess.run(self.sample_op, {self.tfs:s})[0] 82 | 83 | ## 将 critic 产出的 dQ/da 加入到 Actor 的 Graph 中去 84 | def add_grad_to_graph(self, a_grads): 85 | with tf.variable_scope('policy_grads'): 86 | # ys = policy; 87 | # xs = policy's parameters; 88 | # a_grads = the gradients of the policy to get more Q 89 | # tf.gradients will calculate dys/dxs with a initial gradients for ys, so this is dq/da * da/dparams 90 | self.policy_grads = tf.gradients(ys=self.a, xs=self.e_params, grad_ys=a_grads) ##grad_ys 这是从 Critic 来的 dQ/da 91 | 92 | with tf.variable_scope('A_train'): 93 | opt = tf.train.AdamOptimizer(-self.lr) # (- learning rate) for ascent policy 94 | self.train_op = opt.apply_gradients(zip(self.policy_grads, self.e_params)) 95 | 96 | class Critic(object): 97 | def __init__(self, sess, state_dim, action_dim, learning_rate, gamma, replacement, a, a_): 98 | self.sess = sess 99 | self.s_dim = state_dim 100 | self.a_dim = action_dim 101 | self.lr = learning_rate 102 | self.gamma = gamma 103 | self.replacement = replacement 104 | 105 | with tf.variable_scope('Critic'): 106 | # Input (s, a), output q 107 | self.a = tf.stop_gradient(a) # stop critic update flows to actor 108 | self.q = self._build_net(S, self.a, 'eval_net', trainable=True) 109 | 110 | # Input (s_, a_), output q_ for q_target 111 | self.q_ = self._build_net(S_, a_, 'target_net', trainable=False) # target_q is based on a_ from Actor's target_net 112 | 113 | self.e_params = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope='Critic/eval_net') 114 | self.t_params = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope='Critic/target_net') 115 | 116 | with tf.variable_scope('target_q'): 117 | self.target_q = R + 
self.gamma * self.q_ ## self.q_ 根据 Actor 的 target_net 来 118 | 119 | with tf.variable_scope('TD_error'): 120 | self.loss = tf.reduce_mean(tf.squared_difference(self.target_q, self.q)) # self.q 又基于 Actor 的 target_net 121 | 122 | with tf.variable_scope('C_train'): 123 | self.train_op = tf.train.AdamOptimizer(self.lr).minimize(self.loss) 124 | 125 | with tf.variable_scope('a_grad'): 126 | self.a_grads = tf.gradients(self.q, a)[0] # tensor of gradients of each sample (None, a_dim) 127 | 128 | if self.replacement['name'] == 'hard': 129 | self.t_replace_counter = 0 130 | self.hard_replacement = [tf.assign(t, e) for t, e in zip(self.t_params, self.e_params)] 131 | else: 132 | self.soft_replacement = [tf.assign(t, (1 - self.replacement['tau']) * t + self.replacement['tau'] * e) 133 | for t, e in zip(self.t_params, self.e_params)] 134 | 135 | def _build_net(self, s, a, scope, trainable): 136 | with tf.variable_scope(scope): 137 | init_w = tf.random_normal_initializer(0., 0.1) 138 | init_b = tf.constant_initializer(0.1) 139 | 140 | with tf.variable_scope('l1'): 141 | n_l1 = 30 142 | w1_s = tf.get_variable('w1_s', [self.s_dim, n_l1], initializer=init_w, trainable=trainable) 143 | w1_a = tf.get_variable('w1_a', [self.a_dim, n_l1], initializer=init_w, trainable=trainable) 144 | b1 = tf.get_variable('b1', [1, n_l1], initializer=init_b, trainable=trainable) 145 | net = tf.nn.relu(tf.matmul(s, w1_s) + tf.matmul(a, w1_a) + b1) 146 | 147 | with tf.variable_scope('q'): 148 | q = tf.layers.dense(net, 1, kernel_initializer=init_w, bias_initializer=init_b, trainable=trainable) # Q(s,a) 149 | return q 150 | 151 | def learn(self, s, a, r, s_): 152 | self.sess.run(self.train_op, feed_dict={S: s, self.a: a, R: r, S_: s_}) 153 | if self.replacement['name'] == 'soft': 154 | self.sess.run(self.soft_replacement) 155 | else: 156 | if self.t_replace_counter % self.replacement['rep_iter_c'] == 0: 157 | self.sess.run(self.hard_replacement) 158 | self.t_replace_counter += 1 159 | 160 | 161 | class Memory(object): 162 | def __init__(self, capacity, dims): 163 | self.capacity = capacity 164 | self.data = np.zeros((capacity, dims)) 165 | self.pointer = 0 166 | 167 | def store_transition(self, s, a, r, s_): 168 | transition = np.hstack((s, a, [r], s_)) 169 | index = self.pointer % self.capacity # replace the old memory with new memory 170 | self.data[index, :] = transition 171 | self.pointer += 1 172 | 173 | def sample(self, n): 174 | assert self.pointer >= self.capacity, 'Memory has not been fulfilled' 175 | indices = np.random.choice(self.capacity, size=n) 176 | return self.data[indices, :] 177 | 178 | 179 | env = gym.make(ENV_NAME) 180 | env = env.unwrapped 181 | env.seed(1) 182 | 183 | state_dim = env.observation_space.shape[0] 184 | action_dim = env.action_space.shape[0] 185 | action_bound = env.action_space.high 186 | 187 | # all placeholder for tf 188 | with tf.name_scope('S'): 189 | S = tf.placeholder(tf.float32, shape=[None, state_dim], name='s') 190 | with tf.name_scope('R'): 191 | R = tf.placeholder(tf.float32, [None, 1], name='r') 192 | with tf.name_scope('S_'): 193 | S_ = tf.placeholder(tf.float32, shape=[None, state_dim], name='s_') 194 | 195 | 196 | sess = tf.Session() 197 | 198 | # Create actor and critic. 
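# Note on the wiring below: the Critic is built from the Actor's a / a_ tensors, and the Actor's train op
# is then completed with the Critic's dQ/da via add_grad_to_graph, i.e. the deterministic policy gradient
# dJ/dtheta ~ mean_batch( dQ/da * da/dtheta ). A minimal sketch of the equivalent one-loss formulation
# used in ddpg_update.py (there the gradient simply flows through the critic automatically):
#
#     a_loss = -tf.reduce_mean(q)   # q = Q(s, mu(s)) built from the eval actor's action
#     atrain = tf.train.AdamOptimizer(LR_A).minimize(a_loss, var_list=actor_eval_params)
#
# Up to the 1/batch-size factor implied by reduce_mean versus tf.gradients' summed grad_ys, both
# formulations yield the same parameter update.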
199 | # They are actually connected to each other, details can be seen in tensorboard or in this picture: 200 | actor = Actor(sess, action_dim, action_bound, LR_A, REPLACEMENT) 201 | critic = Critic(sess, state_dim, action_dim, LR_C, GAMMA, REPLACEMENT, actor.a, actor.a_) # 将 actor 同它的 eval_net/target_net 产生的 a/a_ 传给 Critic 202 | actor.add_grad_to_graph(critic.a_grads) # 将 critic 产出的 dQ/da 加入到 Actor 的 Graph 中去 203 | 204 | sess.run(tf.global_variables_initializer()) 205 | 206 | M = Memory(MEMORY_CAPACITY, dims=2 * state_dim + action_dim + 1) 207 | 208 | if OUTPUT_GRAPH: 209 | tf.summary.FileWriter("logs/", sess.graph) 210 | 211 | var = 3 # control exploration 212 | 213 | t1 = time.time() 214 | for i in range(MAX_EPISODES): 215 | s = env.reset() 216 | ep_reward = 0 217 | 218 | for j in range(MAX_EP_STEPS): 219 | 220 | if RENDER: 221 | env.render() 222 | 223 | # Add exploration noise 224 | a = actor.choose_action(s) 225 | a = np.clip(np.random.normal(a, var), -2, 2) # add randomness to action selection for exploration 226 | s_, r, done, info = env.step(a) 227 | 228 | M.store_transition(s, a, r / 10, s_) 229 | 230 | if M.pointer > MEMORY_CAPACITY: 231 | var *= .9995 # decay the action randomness 232 | b_M = M.sample(BATCH_SIZE) 233 | b_s = b_M[:, :state_dim] 234 | b_a = b_M[:, state_dim: state_dim + action_dim] 235 | b_r = b_M[:, -state_dim - 1: -state_dim] 236 | b_s_ = b_M[:, -state_dim:] 237 | 238 | critic.learn(b_s, b_a, b_r, b_s_) 239 | actor.learn(b_s) 240 | 241 | s = s_ 242 | ep_reward += r 243 | 244 | if j == MAX_EP_STEPS-1: 245 | print('Episode:', i, ' Reward: %i' % int(ep_reward), 'Explore: %.2f' % var, ) 246 | if ep_reward > -300: 247 | RENDER = True 248 | break 249 | 250 | print('Running time: ', time.time()-t1) -------------------------------------------------------------------------------- /code/policy_gradient.py: -------------------------------------------------------------------------------- 1 | import tensorflow as tf 2 | import numpy as np 3 | import gym 4 | 5 | class PolicyGradient: 6 | def __init__(self, 7 | n_actions, 8 | n_features, 9 | learning_rate=0.01, 10 | reward_decay=0.95, 11 | output_graph=False 12 | ): 13 | self.n_actions = n_actions 14 | self.n_features = n_features 15 | self.lr = learning_rate 16 | self.gamma = reward_decay 17 | self.ep_obs, self.ep_as, self.ep_rs =[],[],[] #states,actions,rewards 18 | self.__build_net() 19 | self.sess = tf.Session() 20 | 21 | if output_graph: 22 | # $ tensorboard --logdir=logs 23 | # http://0.0.0.0:6006/ 24 | # tf.train.SummaryWriter soon be deprecated, use following 25 | tf.summary.FileWriter("logs/", self.sess.graph) 26 | 27 | self.sess.run(tf.global_variables_initializer()) 28 | 29 | def __build_net(self): #PG网络 30 | with tf.name_scope('inputs'): 31 | self.tf_obs = tf.placeholder(tf.float32, [None, self.n_features],name="observations") 32 | self.tf_acts = tf.placeholder(tf.int32, [None,], name="actions_num") 33 | self.tf_vt = tf.placeholder(tf.float32, [None,], name="actions_value") #V(s,a) 34 | 35 | layer = tf.layers.dense( 36 | inputs = self.tf_obs, 37 | units = 10, 38 | activation = tf.nn.tanh, 39 | kernel_initializer=tf.random_normal_initializer(mean=0, stddev=0.3), 40 | bias_initializer=tf.constant_initializer(0.1), 41 | name = 'fc1' 42 | ) 43 | all_act = tf.layers.dense( 44 | inputs=layer, 45 | units=self.n_actions, # 输出个数 46 | activation=None, # 之后再加 Softmax 47 | kernel_initializer=tf.random_normal_initializer(mean=0, stddev=0.3), 48 | bias_initializer=tf.constant_initializer(0.1), 49 | name='fc2' 50 | ) 51 | 
self.all_act_prob = tf.nn.softmax(all_act, name='act_prob') 52 | with tf.name_scope('loss'): 53 | # 最大化 总体 reward (log_p * R) 就是在最小化 -(log_p * R), 而 tf 的功能里只有最小化 loss 54 | neg_log_prob = tf.reduce_sum(-tf.log(self.all_act_prob)*tf.one_hot(self.tf_acts, self.n_actions), axis=1) #加- 变为梯度下降 55 | loss = tf.reduce_mean(neg_log_prob * self.tf_vt) 56 | 57 | with tf.name_scope('train'): 58 | self.train_op = tf.train.AdamOptimizer(self.lr).minimize(loss) 59 | 60 | def choose_action(self, observation): #选择行为 61 | prob_weights = self.sess.run(self.all_act_prob, feed_dict = {self.tf_obs: observation[np.newaxis, :]}) #[0,1,2]->[[0,1,2]] 所有action的概率 矩阵形式 62 | action = np.random.choice(range(prob_weights.shape[1]), p=prob_weights.ravel()) # 根据概率来选 action range(prob_weights.shape[1]用0,1,2,表示动作 63 | return action 64 | 65 | def store_transition(self, s, a, r):#存储一个回合的经验 66 | self.ep_obs.append(s) 67 | self.ep_as.append(a) 68 | self.ep_rs.append(r) 69 | 70 | def learn(self): 71 | discounted_ep_rs_norm = self._discount_and_norm_rewards() # 衰减, 并标准化这回合的 reward 72 | self.sess.run(self.train_op, feed_dict={ 73 | self.tf_obs: np.vstack(self.ep_obs), # shape=[None, n_obs] 74 | self.tf_acts: np.array(self.ep_as), # shape=[None, ] 75 | self.tf_vt: discounted_ep_rs_norm, # shape=[None, ] 76 | }) 77 | self.ep_obs, self.ep_as, self.ep_rs = [],[],[] #清空回合数据 78 | return discounted_ep_rs_norm # 返回这一回合的 state-action value 79 | 80 | 81 | def _discount_and_norm_rewards(self): #用bellman公式计算出vt(s,a) 82 | #discount 83 | discounted_ep_rs = np.zeros_like(self.ep_rs) 84 | running_add = 0 85 | for t in reversed(range(0, len(self.ep_rs))): #倒数遍历这个episode中的reward 86 | running_add = running_add * self.gamma + self.ep_rs[t] 87 | discounted_ep_rs[t] = running_add 88 | # r1,r2,r3 -> r1+r2*gamma+r3*gamma^2, r2+r3*gamma, r3 89 | 90 | #normalize 91 | discounted_ep_rs -= np.mean(discounted_ep_rs) 92 | discounted_ep_rs /= np.std(discounted_ep_rs) 93 | return discounted_ep_rs 94 | 95 | 96 | 97 | #将算法应用起来吧! 
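# A small worked example of _discount_and_norm_rewards above (hypothetical numbers, not from this script):
# with gamma = 0.9 and episode rewards [1, 1, 1], the backward pass gives
#
#     discounted = [1 + 0.9*(1 + 0.9*1), 1 + 0.9*1, 1] = [2.71, 1.9, 1.0]
#
# which is then standardized (mean subtracted, divided by std) before being fed in as tf_vt, so earlier
# actions get credit for later rewards while the scale of the weights stays stable across episodes.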
98 | RENDER = False # 在屏幕上显示模拟窗口会拖慢运行速度, 我们等计算机学得差不多了再显示模拟 99 | DISPLAY_REWARD_THRESHOLD = 1000 # 当 回合总 reward 大于 400 时显示模拟窗口 100 | 101 | #env = gym.make('CartPole-v0') # CartPole 2个动作 向左 向右 102 | env = gym.make('MountainCar-v0') #3个动作 左侧加速、不加速、右侧加速 103 | 104 | env = env.unwrapped # 取消限制 105 | env.seed(1) # 普通的 Policy gradient 方法, 使得回合的 variance 比较大, 所以我们选了一个好点的随机种子 106 | 107 | print(env.action_space) # 显示可用 action 108 | print(env.observation_space) # 显示可用 state 的 observation 109 | print(env.observation_space.high) # 显示 observation 最高值 110 | print(env.observation_space.low) # 显示 observation 最低值 111 | 112 | # 定义 113 | RL = PolicyGradient( 114 | n_actions=env.action_space.n, 115 | n_features=env.observation_space.shape[0], 116 | learning_rate=0.02, 117 | reward_decay=0.99, # gamma 118 | # output_graph=True, # 输出 tensorboard 文件 119 | ) 120 | 121 | for i_episode in range(100): 122 | observation = env.reset() 123 | while True: 124 | if RENDER: 125 | env.render() 126 | action = RL.choose_action(observation) 127 | observation_, reward, done, info = env.step(action) 128 | RL.store_transition(observation, action, reward) 129 | 130 | if done: 131 | ep_rs_sum = sum(RL.ep_rs) 132 | if 'running_reward' not in globals(): 133 | running_reward = ep_rs_sum 134 | else: 135 | running_reward = running_reward *0.99 + ep_rs_sum *0.01 #不是简单的求和展示当下rewad 比较科学 136 | print("episode:", i_episode, "reward:", int(running_reward)) 137 | vt = RL.learn() #学习 输出vt 138 | break 139 | 140 | observation = observation_ 141 | 142 | 143 | -------------------------------------------------------------------------------- /code/proximal_policy_optimization.py: -------------------------------------------------------------------------------- 1 | #pendulum 2 | #动作空间:只有一维力矩 长度为1 3 | #状态空间:一维速度,长度为3 4 | 5 | ''' 6 | Critic网络直接给出V(s) 7 | Actor网络由2部分组成 oldpi pi 8 | PPO升级于A2C(critic按batch更新,离线训练,有2个pi),升级于PG(加入critic网络,利用advantage引导pg优化) 9 | ''' 10 | import tensorflow as tf 11 | import numpy as np 12 | import matplotlib.pyplot as plt 13 | import gym 14 | 15 | EP_MAX = 1000 16 | EP_LEN = 200 17 | GAMMA = 0.9 18 | A_LR = 0.0001 19 | C_LR = 0.0002 20 | BATCH = 32 21 | A_UPDATE_STEPS = 10 22 | C_UPDATE_STEPS = 10 23 | S_DIM, A_DIM = 3, 1 #pendulum游戏 24 | METHOD = [ 25 | dict(name='kl_pen', kl_target=0.01, lam=0.5), # KL penalty 26 | dict(name='clip', epsilon=0.2), # Clipped surrogate objective, find this is better 27 | ][1] # choose the method for optimization 28 | 29 | 30 | 31 | class PPO(object): 32 | def __init__(self): 33 | self.sess = tf.Session() 34 | self.tfs = tf.placeholder(tf.float32, [None, S_DIM], 'state') #[N_each_batch,DIM] 35 | 36 | #搭建AC网络 critic 37 | with tf.variable_scope('critic'): 38 | layer1 = tf.layers.dense(self.tfs, 100, tf.nn.relu) 39 | self.v = tf.layers.dense(layer1, 1) 40 | self.tfdc_r = tf.placeholder(tf.float32, [None, 1], 'discounted_r') 41 | self.advantage = self.tfdc_r - self.v # discounted reward - Critic 出来的 state value 42 | self.closs = tf.reduce_mean(tf.square(self.advantage)) # mse loss of critic 43 | self.ctrain_op = tf.train.AdamOptimizer(C_LR).minimize(self.closs) 44 | 45 | pi, pi_params = self._build_anet('pi', trainable=True) 46 | oldpi, oldpi_params = self._build_anet('oldpi', trainable=False) #每个pi的本质是一个概率分布 47 | with tf.variable_scope('sample_action'): 48 | self.sample_op = tf.squeeze(pi.sample(1), axis=0) #按概率分布pi选择一个action 49 | with tf.variable_scope('update_oldpi'): 50 | self.update_oldpi_op = [oldp.assign(p) for p, oldp in zip(pi_params, oldpi_params)] #将pi的参数复制给oldpi 51 | 52 | self.tfa = 
tf.placeholder(tf.float32, [None, A_DIM], 'action') 53 | self.tfadv = tf.placeholder(tf.float32, [None, 1], 'advantage') 54 | with tf.variable_scope('loss'): 55 | with tf.variable_scope('surrogate'): 56 | ratio = pi.prob(self.tfa) / oldpi.prob(self.tfa) #(New Policy/Old Policy) 的比例 57 | surr = ratio * self.tfadv #surrogate objective 58 | if METHOD['name'] == 'kl_pen': # 如果用 KL penatily 59 | self.tflam = tf.placeholder(tf.float32, None, 'lambda') 60 | kl = tf.distributions.kl_divergence(oldpi, pi) 61 | self.kl_mean = tf.reduce_mean(kl) 62 | self.aloss = tf.reduce_mean(surr - self.tflam * kl) #actor 最终的loss function 63 | else: # 如果用 clipping 的方式 64 | self.aloss = tf.reduce_mean(tf.minimum(surr, tf.clip_by_value(ratio, 1-METHOD['epsilon'], 1+METHOD['epsilon'])*self.tfadv)) 65 | 66 | with tf.variable_scope('atrain'): 67 | self.atrain_op = tf.train.AdamOptimizer(A_LR).minimize(-self.aloss) 68 | 69 | self.sess.run(tf.global_variables_initializer()) 70 | 71 | 72 | def update(self, s, a, r): #update ppo 73 | # 先要将 oldpi 里的参数更新 pi 中的 74 | self.sess.run(self.update_oldpi_op) 75 | adv = self.sess.run(self.advantage, {self.tfs:s, self.tfdc_r:r}) 76 | # adv = (adv - adv.mean())/(adv.std()+1e-6) # sometimes helpful 77 | # update actor 78 | # 更新 Actor 时, kl penalty 和 clipping 方式是不同的 79 | if METHOD['name'] == 'kl_pen': 80 | for _ in range(A_UPDATE_STEPS): #actor 一次训练更新10次 81 | _, kl = self.sess.run( 82 | [self.atrain_op, self.kl_mean], 83 | {self.tfs: s, self.tfa: a, self.tfadv: adv, self.tflam: METHOD['lam']}) 84 | if kl > 4*METHOD['kl_target']: # this in in google's paper 85 | break 86 | if kl < METHOD['kl_target'] / 1.5: # adaptive lambda, this is in OpenAI's paper 87 | METHOD['lam'] /= 2 88 | elif kl > METHOD['kl_target'] * 1.5: 89 | METHOD['lam'] *= 2 90 | METHOD['lam'] = np.clip(METHOD['lam'], 1e-4, 10) # sometimes explode, this clipping is my solution 91 | else: # clipping method, find this is better (OpenAI's paper) 92 | [self.sess.run(self.atrain_op, {self.tfs: s, self.tfa: a, self.tfadv: adv}) for _ in range(A_UPDATE_STEPS)] #actor 一次训练更新10次 93 | # 更新 Critic 的时候, 他们是一样的 critic一次训练更新10次 94 | [self.sess.run(self.ctrain_op, {self.tfs: s, self.tfdc_r: r}) for _ in range(C_UPDATE_STEPS)] 95 | 96 | def choose_action(self, s): 97 | s = s[np.newaxis, :] 98 | a = self.sess.run(self.sample_op, {self.tfs:s})[0] 99 | return np.clip(a,-2,2) #动作不要超出【-2,2】的范围 因为是按概率分布取动作 所以加上这一步很有必要! 
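    # A sketch of what the 'clip' branch of the actor loss above does, with hypothetical numbers
    # (epsilon = 0.2): if the new policy makes an action 1.5x more likely than oldpi (ratio = 1.5) and the
    # advantage is positive, the surrogate is evaluated with the ratio clipped to 1.2, so the update gains
    # nothing from pushing the policy more than 20% away from oldpi within one round of A_UPDATE_STEPS.
    # In numpy terms:
    #
    #     clipped = np.clip(ratio, 1 - 0.2, 1 + 0.2)
    #     loss    = -np.mean(np.minimum(ratio * adv, clipped * adv))
    #
    # which corresponds to -self.aloss, the quantity the atrain op minimizes.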
100 | 101 | def get_v(self, s): #V(s)状态值 由critic网络给出 102 | if s.ndim < 2: 103 | s = s[np.newaxis, :] 104 | return self.sess.run(self.v, {self.tfs:s})[0,0] 105 | 106 | def _build_anet(self, name, trainable): #critic网络输出动作的概率分布 包含参数均值u与方差sigma 107 | with tf.variable_scope(name): 108 | l1 = tf.layers.dense(self.tfs, 100, tf.nn.relu, trainable=trainable) 109 | mu = 2 * tf.layers.dense(l1, A_DIM, tf.nn.tanh, trainable=trainable) 110 | sigma = tf.layers.dense(l1, A_DIM, tf.nn.softplus, trainable=trainable) 111 | norm_dist = tf.distributions.Normal(loc=mu, scale=sigma) 112 | params = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope=name) 113 | return norm_dist, params 114 | 115 | env = gym.make('Pendulum-v0').unwrapped 116 | 117 | ppo = PPO() 118 | all_ep_r = [] 119 | 120 | 121 | for ep in range(EP_MAX): 122 | s = env.reset() 123 | buffer_s, buffer_a, buffer_r = [],[],[] 124 | ep_r = 0 125 | for t in range(EP_LEN): 126 | env.render() 127 | a = ppo.choose_action(s) 128 | s_, r, done, info = env.step(a) 129 | buffer_s.append(s) 130 | buffer_a.append(a) 131 | buffer_r.append((r+8)/8) # normalize reward, 发现有帮助 132 | s = s_ 133 | ep_r += r #一个episode的reward之和 134 | 135 | # 如果 buffer 收集一个 batch 了或者 episode 完了 136 | #则更新ppo 137 | if (t+1) % BATCH == 0 or t == EP_LEN -1: 138 | #计算折扣奖励 139 | v_s_ = ppo.get_v(s_) 140 | discounted_r = [] 141 | for r in buffer_r[::-1]: 142 | v_s_ = r + GAMMA * v_s_ 143 | discounted_r.append(v_s_) 144 | discounted_r.reverse() 145 | 146 | bs, ba, br = np.vstack(buffer_s), np.vstack(buffer_a), np.array(discounted_r)[:, np.newaxis] #存入一个Batch 147 | #清空buffer 148 | buffer_s, buffer_a, buffer_r = [],[],[] 149 | ppo.update(bs, ba, br) #训练PPO 150 | if ep == 0: 151 | all_ep_r.append(ep_r) 152 | else: 153 | all_ep_r.append(all_ep_r[-1]*0.9 + ep_r*0.1) 154 | 155 | print('Ep: %i' % ep,"|Ep_r: %i" % ep_r,("|Lam: %.4f" % METHOD['lam']) if METHOD['name'] == 'kl_pen' else '',) 156 | 157 | 158 | #plt.plot(np.arange(len(all_ep_r)), all_ep_r) 159 | #plt.xlabel('Episode') 160 | #plt.ylabel('Moving averaged episode reward') 161 | #plt.show() 162 | 163 | 164 | 165 | 166 | 167 | -------------------------------------------------------------------------------- /code/tensrolayer-implemented/a3c.py: -------------------------------------------------------------------------------- 1 | """ 2 | Asynchronous Advantage Actor Critic (A3C) with Continuous Action Space. 3 | Actor Critic History 4 | ---------------------- 5 | A3C > DDPG (for continuous action space) > AC 6 | Advantage 7 | ---------- 8 | Train faster and more stable than AC. 9 | Disadvantage 10 | ------------- 11 | Have bias. 12 | Reference 13 | ---------- 14 | Original Paper: https://arxiv.org/pdf/1602.01783.pdf 15 | MorvanZhou's tutorial: https://morvanzhou.github.io/tutorials/ 16 | MorvanZhou's code: https://github.com/MorvanZhou/Reinforcement-learning-with-tensorflow/blob/master/experiments/Solve_BipedalWalker/A3C.py 17 | Environment 18 | ----------- 19 | BipedalWalker-v2 : https://gym.openai.com/envs/BipedalWalker-v2 20 | Reward is given for moving forward, total 300+ points up to the far end. 21 | If the robot falls, it gets -100. Applying motor torque costs a small amount of 22 | points, more optimal agent will get better score. State consists of hull angle 23 | speed, angular velocity, horizontal speed, vertical speed, position of joints 24 | and joints angular speed, legs contact with ground, and 10 lidar rangefinder 25 | measurements. There's no coordinates in the state vector. 
26 | Prerequisites 27 | -------------- 28 | tensorflow 2.0.0a0 29 | tensorflow-probability 0.6.0 30 | tensorlayer 2.0.0 31 | && 32 | pip install box2d box2d-kengz --user 33 | To run 34 | ------ 35 | python tutorial_A3C.py --train/test 36 | """ 37 | 38 | import argparse 39 | import multiprocessing 40 | import threading 41 | import time 42 | 43 | import numpy as np 44 | 45 | import gym 46 | import tensorflow as tf 47 | import tensorflow_probability as tfp 48 | import tensorlayer as tl 49 | from tensorlayer.layers import DenseLayer, InputLayer 50 | 51 | tfd = tfp.distributions 52 | 53 | tl.logging.set_verbosity(tl.logging.DEBUG) 54 | 55 | np.random.seed(2) 56 | tf.random.set_seed(2) # reproducible 57 | 58 | # add arguments in command --train/test 59 | parser = argparse.ArgumentParser(description='Train or test neural net motor controller.') 60 | parser.add_argument('--train', dest='train', action='store_true', default=False) 61 | parser.add_argument('--test', dest='test', action='store_true', default=True) 62 | args = parser.parse_args() 63 | 64 | ##################### hyper parameters #################### 65 | 66 | GAME = 'BipedalWalker-v2' # BipedalWalkerHardcore-v2 BipedalWalker-v2 LunarLanderContinuous-v2 67 | LOG_DIR = './log' # the log file 68 | N_WORKERS = multiprocessing.cpu_count() # number of workers accroding to number of cores in cpu 69 | # N_WORKERS = 2 # manually set number of workers 70 | MAX_GLOBAL_EP = 8 # number of training episodes 71 | GLOBAL_NET_SCOPE = 'Global_Net' 72 | UPDATE_GLOBAL_ITER = 10 # update global policy after several episodes 73 | GAMMA = 0.99 # reward discount factor 74 | ENTROPY_BETA = 0.005 # factor for entropy boosted exploration 75 | LR_A = 0.00005 # learning rate for actor 76 | LR_C = 0.0001 # learning rate for critic 77 | GLOBAL_RUNNING_R = [] 78 | GLOBAL_EP = 0 # will increase during training, stop training when it >= MAX_GLOBAL_EP 79 | 80 | ################### Asynchronous Advantage Actor Critic (A3C) #################################### 81 | 82 | 83 | class ACNet(object): 84 | 85 | def __init__(self, scope, globalAC=None): 86 | self.scope = scope 87 | self.save_path = './model' 88 | 89 | w_init = tf.keras.initializers.glorot_normal(seed=None) # initializer, glorot=xavier 90 | 91 | def get_actor(input_shape): # policy network 92 | with tf.name_scope(self.scope): 93 | ni = tl.layers.Input(input_shape, name='in') 94 | nn = tl.layers.Dense(n_units=500, act=tf.nn.relu6, W_init=w_init, name='la')(ni) 95 | nn = tl.layers.Dense(n_units=300, act=tf.nn.relu6, W_init=w_init, name='la2')(nn) 96 | mu = tl.layers.Dense(n_units=N_A, act=tf.nn.tanh, W_init=w_init, name='mu')(nn) 97 | sigma = tl.layers.Dense(n_units=N_A, act=tf.nn.softplus, W_init=w_init, name='sigma')(nn) 98 | return tl.models.Model(inputs=ni, outputs=[mu, sigma], name=scope + '/Actor') 99 | 100 | self.actor = get_actor([None, N_S]) 101 | self.actor.train() # train mode for Dropout, BatchNorm 102 | 103 | def get_critic(input_shape): # we use Value-function here, but not Q-function. 
104 | with tf.name_scope(self.scope): 105 | ni = tl.layers.Input(input_shape, name='in') 106 | nn = tl.layers.Dense(n_units=500, act=tf.nn.relu6, W_init=w_init, name='lc')(ni) 107 | nn = tl.layers.Dense(n_units=300, act=tf.nn.relu6, W_init=w_init, name='lc2')(nn) 108 | v = tl.layers.Dense(n_units=1, W_init=w_init, name='v')(nn) 109 | return tl.models.Model(inputs=ni, outputs=v, name=scope + '/Critic') 110 | 111 | self.critic = get_critic([None, N_S]) 112 | self.critic.train() # train mode for Dropout, BatchNorm 113 | 114 | @tf.function # convert numpy functions to tf.Operations in the TFgraph, return tensor 115 | def update_global( 116 | self, buffer_s, buffer_a, buffer_v_target, globalAC 117 | ): # refer to the global Actor-Crtic network for updating it with samples 118 | ''' update the global critic ''' 119 | with tf.GradientTape() as tape: 120 | self.v = self.critic(buffer_s) 121 | self.v_target = buffer_v_target 122 | td = tf.subtract(self.v_target, self.v, name='TD_error') 123 | self.c_loss = tf.reduce_mean(tf.square(td)) 124 | self.c_grads = tape.gradient(self.c_loss, self.critic.trainable_weights) 125 | OPT_C.apply_gradients(zip(self.c_grads, globalAC.critic.trainable_weights)) # local grads applies to global net 126 | # del tape # Drop the reference to the tape 127 | ''' update the global actor ''' 128 | with tf.GradientTape() as tape: 129 | self.mu, self.sigma = self.actor(buffer_s) 130 | self.test = self.sigma[0] 131 | self.mu, self.sigma = self.mu * A_BOUND[1], self.sigma + 1e-5 132 | 133 | normal_dist = tfd.Normal(self.mu, self.sigma) # no tf.contrib for tf2.0 134 | self.a_his = buffer_a # float32 135 | log_prob = normal_dist.log_prob(self.a_his) 136 | exp_v = log_prob * td # td is from the critic part, no gradients for it 137 | entropy = normal_dist.entropy() # encourage exploration 138 | self.exp_v = ENTROPY_BETA * entropy + exp_v 139 | self.a_loss = tf.reduce_mean(-self.exp_v) 140 | self.a_grads = tape.gradient(self.a_loss, self.actor.trainable_weights) 141 | OPT_A.apply_gradients(zip(self.a_grads, globalAC.actor.trainable_weights)) # local grads applies to global net 142 | return self.test # for test purpose 143 | 144 | @tf.function 145 | def pull_global(self, globalAC): # run by a local, pull weights from the global nets 146 | for l_p, g_p in zip(self.actor.trainable_weights, globalAC.actor.trainable_weights): 147 | l_p.assign(g_p) 148 | for l_p, g_p in zip(self.critic.trainable_weights, globalAC.critic.trainable_weights): 149 | l_p.assign(g_p) 150 | 151 | def choose_action(self, s): # run by a local 152 | s = s[np.newaxis, :] 153 | self.mu, self.sigma = self.actor(s) 154 | 155 | with tf.name_scope('wrap_a_out'): 156 | self.mu, self.sigma = self.mu * A_BOUND[1], self.sigma + 1e-5 157 | normal_dist = tfd.Normal(self.mu, self.sigma) # for continuous action space 158 | self.A = tf.clip_by_value(tf.squeeze(normal_dist.sample(1), axis=0), *A_BOUND) 159 | return self.A.numpy()[0] 160 | 161 | def save_ckpt(self): # save trained weights 162 | tl.files.save_npz(self.actor.trainable_weights, name='model_actor.npz') 163 | tl.files.save_npz(self.critic.trainable_weights, name='model_critic.npz') 164 | 165 | def load_ckpt(self): # load trained weights 166 | tl.files.load_and_assign_npz(name='model_actor.npz', network=self.actor) 167 | tl.files.load_and_assign_npz(name='model_critic.npz', network=self.critic) 168 | 169 | 170 | class Worker(object): 171 | 172 | def __init__(self, name, globalAC): 173 | self.env = gym.make(GAME) 174 | self.name = name 175 | self.AC = ACNet(name, globalAC) 
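        # The update in work() below: every UPDATE_GLOBAL_ITER steps (or at episode end) the worker turns
        # its reward buffer into n-step bootstrapped targets by walking it backwards — a condensed sketch
        # of that loop:
        #
        #     v_s_ = 0 if done else critic(s_)     # bootstrap from the critic at the cut point
        #     targets = []
        #     for r in buffer_r[::-1]:
        #         v_s_ = r + GAMMA * v_s_
        #         targets.append(v_s_)
        #     targets.reverse()
        #
        # It then pushes gradients computed on (buffer_s, buffer_a, targets) to the global net and pulls
        # the global weights back down.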
176 | 177 | # def work(self): 178 | def work(self, globalAC): 179 | global GLOBAL_RUNNING_R, GLOBAL_EP 180 | total_step = 1 181 | buffer_s, buffer_a, buffer_r = [], [], [] 182 | while not COORD.should_stop() and GLOBAL_EP < MAX_GLOBAL_EP: 183 | s = self.env.reset() 184 | ep_r = 0 185 | while True: 186 | # visualize Worker_0 during training 187 | if self.name == 'Worker_0' and total_step % 30 == 0: 188 | self.env.render() 189 | s = s.astype('float32') # double to float 190 | a = self.AC.choose_action(s) 191 | s_, r, done, _info = self.env.step(a) 192 | 193 | s_ = s_.astype('float32') # double to float 194 | # set robot falls reward to -2 instead of -100 195 | if r == -100: r = -2 196 | 197 | ep_r += r 198 | buffer_s.append(s) 199 | buffer_a.append(a) 200 | buffer_r.append(r) 201 | 202 | if total_step % UPDATE_GLOBAL_ITER == 0 or done: # update global and assign to local net 203 | 204 | if done: 205 | v_s_ = 0 # terminal 206 | else: 207 | v_s_ = self.AC.critic(s_[np.newaxis, :])[0, 0] # reduce dim from 2 to 0 208 | 209 | buffer_v_target = [] 210 | 211 | for r in buffer_r[::-1]: # reverse buffer r 212 | v_s_ = r + GAMMA * v_s_ 213 | buffer_v_target.append(v_s_) 214 | 215 | buffer_v_target.reverse() 216 | 217 | buffer_s, buffer_a, buffer_v_target = ( 218 | np.vstack(buffer_s), np.vstack(buffer_a), np.vstack(buffer_v_target) 219 | ) 220 | # update gradients on global network 221 | self.AC.update_global(buffer_s, buffer_a, buffer_v_target.astype('float32'), globalAC) 222 | buffer_s, buffer_a, buffer_r = [], [], [] 223 | 224 | # update local network from global network 225 | self.AC.pull_global(globalAC) 226 | 227 | s = s_ 228 | total_step += 1 229 | if done: 230 | if len(GLOBAL_RUNNING_R) == 0: # record running episode reward 231 | GLOBAL_RUNNING_R.append(ep_r) 232 | else: # moving average 233 | GLOBAL_RUNNING_R.append(0.95 * GLOBAL_RUNNING_R[-1] + 0.05 * ep_r) 234 | # print( 235 | # self.name, 236 | # "Episode: ", 237 | # GLOBAL_EP, 238 | # # "| pos: %i" % self.env.unwrapped.hull.position[0], # number of move 239 | # '| reward: %.1f' % ep_r, 240 | # "| running_reward: %.1f" % GLOBAL_RUNNING_R[-1], 241 | # # '| sigma:', test, # debug 242 | # # 'WIN ' * 5 if self.env.unwrapped.hull.position[0] >= 88 else '', 243 | # ) 244 | print('{}, Episode: {}/{} | Episode Reward: {:.4f} | Running Time: {:.4f}'\ 245 | .format(self.name, GLOBAL_EP, MAX_GLOBAL_EP, ep_r, time.time()-t0 )) 246 | GLOBAL_EP += 1 247 | break 248 | 249 | 250 | if __name__ == "__main__": 251 | 252 | env = gym.make(GAME) 253 | 254 | N_S = env.observation_space.shape[0] 255 | N_A = env.action_space.shape[0] 256 | 257 | A_BOUND = [env.action_space.low, env.action_space.high] 258 | A_BOUND[0] = A_BOUND[0].reshape(1, N_A) 259 | A_BOUND[1] = A_BOUND[1].reshape(1, N_A) 260 | # print(A_BOUND) 261 | if args.train: 262 | # ============================= TRAINING =============================== 263 | t0 = time.time() 264 | with tf.device("/cpu:0"): 265 | 266 | OPT_A = tf.optimizers.RMSprop(LR_A, name='RMSPropA') 267 | OPT_C = tf.optimizers.RMSprop(LR_C, name='RMSPropC') 268 | 269 | GLOBAL_AC = ACNet(GLOBAL_NET_SCOPE) # we only need its params 270 | workers = [] 271 | # Create worker 272 | for i in range(N_WORKERS): 273 | i_name = 'Worker_%i' % i # worker name 274 | workers.append(Worker(i_name, GLOBAL_AC)) 275 | 276 | COORD = tf.train.Coordinator() 277 | 278 | # start TF threading 279 | worker_threads = [] 280 | for worker in workers: 281 | # t = threading.Thread(target=worker.work) 282 | job = lambda: worker.work(GLOBAL_AC) 283 | t = 
threading.Thread(target=job) 284 | t.start() 285 | worker_threads.append(t) 286 | COORD.join(worker_threads) 287 | import matplotlib.pyplot as plt 288 | plt.plot(GLOBAL_RUNNING_R) 289 | plt.xlabel('episode') 290 | plt.ylabel('global running reward') 291 | plt.savefig('a3c.png') 292 | plt.show() 293 | 294 | GLOBAL_AC.save_ckpt() 295 | 296 | if args.test: 297 | # ============================= EVALUATION ============================= 298 | # env = gym.make(GAME) 299 | # GLOBAL_AC = ACNet(GLOBAL_NET_SCOPE) 300 | GLOBAL_AC.load_ckpt() 301 | while True: 302 | s = env.reset() 303 | rall = 0 304 | while True: 305 | env.render() 306 | s = s.astype('float32') # double to float 307 | a = GLOBAL_AC.choose_action(s) 308 | s, r, d, _ = env.step(a) 309 | rall += r 310 | if d: 311 | print("reward", rall) 312 | break -------------------------------------------------------------------------------- /code/tensrolayer-implemented/ac.py: -------------------------------------------------------------------------------- 1 | """ 2 | Actor-Critic 3 | ------------- 4 | It uses TD-error as the Advantage. 5 | Actor Critic History 6 | ---------------------- 7 | A3C > DDPG > AC 8 | Advantage 9 | ---------- 10 | AC converge faster than Policy Gradient. 11 | Disadvantage (IMPORTANT) 12 | ------------------------ 13 | The Policy is oscillated (difficult to converge), DDPG can solve 14 | this problem using advantage of DQN. 15 | Reference 16 | ---------- 17 | paper: https://papers.nips.cc/paper/1786-actor-critic-algorithms.pdf 18 | View more on MorvanZhou's tutorial page: https://morvanzhou.github.io/tutorials/ 19 | Environment 20 | ------------ 21 | CartPole-v0: https://gym.openai.com/envs/CartPole-v0 22 | A pole is attached by an un-actuated joint to a cart, which moves along a 23 | frictionless track. The system is controlled by applying a force of +1 or -1 24 | to the cart. The pendulum starts upright, and the goal is to prevent it from 25 | falling over. 26 | A reward of +1 is provided for every timestep that the pole remains upright. 27 | The episode ends when the pole is more than 15 degrees from vertical, or the 28 | cart moves more than 2.4 units from the center. 
29 | Prerequisites 30 | -------------- 31 | tensorflow >=2.0.0a0 32 | tensorlayer >=2.0.0 33 | To run 34 | ------ 35 | python tutorial_AC.py --train/test 36 | """ 37 | import argparse 38 | import time 39 | 40 | import numpy as np 41 | 42 | import gym 43 | import tensorflow as tf 44 | import tensorlayer as tl 45 | 46 | tl.logging.set_verbosity(tl.logging.DEBUG) 47 | 48 | np.random.seed(2) 49 | tf.random.set_seed(2) # reproducible 50 | 51 | # add arguments in command --train/test 52 | parser = argparse.ArgumentParser(description='Train or test neural net motor controller.') 53 | parser.add_argument('--train', dest='train', action='store_true', default=False) 54 | parser.add_argument('--test', dest='test', action='store_true', default=True) 55 | args = parser.parse_args() 56 | 57 | ##################### hyper parameters #################### 58 | 59 | OUTPUT_GRAPH = False 60 | MAX_EPISODE = 3000 # number of overall episodes for training 61 | DISPLAY_REWARD_THRESHOLD = 100 # renders environment if running reward is greater then this threshold 62 | MAX_EP_STEPS = 1000 # maximum time step in one episode 63 | RENDER = False # rendering wastes time 64 | LAMBDA = 0.9 # reward discount in TD error 65 | LR_A = 0.001 # learning rate for actor 66 | LR_C = 0.01 # learning rate for critic 67 | 68 | ############################### Actor-Critic #################################### 69 | 70 | 71 | class Actor(object): 72 | 73 | def __init__(self, n_features, n_actions, lr=0.001): 74 | 75 | def get_model(inputs_shape): 76 | ni = tl.layers.Input(inputs_shape, name='state') 77 | nn = tl.layers.Dense( 78 | n_units=30, act=tf.nn.relu6, W_init=tf.random_uniform_initializer(0, 0.01), name='hidden' 79 | )(ni) 80 | nn = tl.layers.Dense( 81 | n_units=10, act=tf.nn.relu6, W_init=tf.random_uniform_initializer(0, 0.01), name='hidden2' 82 | )(nn) 83 | nn = tl.layers.Dense(n_units=n_actions, name='actions')(nn) 84 | return tl.models.Model(inputs=ni, outputs=nn, name="Actor") 85 | 86 | self.model = get_model([None, n_features]) 87 | self.model.train() 88 | self.optimizer = tf.optimizers.Adam(lr) 89 | 90 | def learn(self, s, a, td): 91 | with tf.GradientTape() as tape: 92 | _logits = self.model(np.array([s])) 93 | ## cross-entropy loss weighted by td-error (advantage), 94 | # the cross-entropy mearsures the difference of two probability distributions: the predicted logits and sampled action distribution, 95 | # then weighted by the td-error: small difference of real and predict actions for large td-error (advantage); and vice versa. 
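            # This is the same objective as the hand-written PG loss elsewhere in this repo
            # ( -log pi(a|s) * weight ): the cross-entropy between _logits and the taken action is scaled
            # by the TD error, so minimizing it raises the taken action's probability when td > 0 and
            # lowers it when td < 0.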
96 | _exp_v = tl.rein.cross_entropy_reward_loss(logits=_logits, actions=[a], rewards=td[0]) 97 | grad = tape.gradient(_exp_v, self.model.trainable_weights) 98 | self.optimizer.apply_gradients(zip(grad, self.model.trainable_weights)) 99 | return _exp_v 100 | 101 | def choose_action(self, s): 102 | _logits = self.model(np.array([s])) 103 | _probs = tf.nn.softmax(_logits).numpy() 104 | return tl.rein.choice_action_by_probs(_probs.ravel()) # sample according to probability distribution 105 | 106 | def choose_action_greedy(self, s): 107 | _logits = self.model(np.array([s])) # logits: probability distribution of actions 108 | _probs = tf.nn.softmax(_logits).numpy() 109 | return np.argmax(_probs.ravel()) 110 | 111 | def save_ckpt(self): # save trained weights 112 | tl.files.save_npz(self.model.trainable_weights, name='model_actor.npz') 113 | 114 | def load_ckpt(self): # load trained weights 115 | tl.files.load_and_assign_npz(name='model_actor.npz', network=self.model) 116 | 117 | 118 | class Critic(object): 119 | 120 | def __init__(self, n_features, lr=0.01): 121 | 122 | def get_model(inputs_shape): 123 | ni = tl.layers.Input(inputs_shape, name='state') 124 | nn = tl.layers.Dense( 125 | n_units=30, act=tf.nn.relu6, W_init=tf.random_uniform_initializer(0, 0.01), name='hidden' 126 | )(ni) 127 | nn = tl.layers.Dense( 128 | n_units=5, act=tf.nn.relu, W_init=tf.random_uniform_initializer(0, 0.01), name='hidden2' 129 | )(nn) 130 | nn = tl.layers.Dense(n_units=1, act=None, name='value')(nn) 131 | return tl.models.Model(inputs=ni, outputs=nn, name="Critic") 132 | 133 | self.model = get_model([1, n_features]) 134 | self.model.train() 135 | 136 | self.optimizer = tf.optimizers.Adam(lr) 137 | 138 | def learn(self, s, r, s_): 139 | v_ = self.model(np.array([s_])) 140 | with tf.GradientTape() as tape: 141 | v = self.model(np.array([s])) 142 | ## TD_error = r + lambd * V(newS) - V(S) 143 | td_error = r + LAMBDA * v_ - v 144 | loss = tf.square(td_error) 145 | grad = tape.gradient(loss, self.model.trainable_weights) 146 | self.optimizer.apply_gradients(zip(grad, self.model.trainable_weights)) 147 | 148 | return td_error 149 | 150 | def save_ckpt(self): # save trained weights 151 | tl.files.save_npz(self.model.trainable_weights, name='model_critic.npz') 152 | 153 | def load_ckpt(self): # load trained weights 154 | tl.files.load_and_assign_npz(name='model_critic.npz', network=self.model) 155 | 156 | 157 | if __name__ == '__main__': 158 | ''' 159 | choose environment 160 | 1. Openai gym: 161 | env = gym.make() 162 | 2. 
DeepMind Control Suite: 163 | env = dm_control2gym.make() 164 | ''' 165 | env = gym.make('CartPole-v0') 166 | # dm_control2gym.create_render_mode('example mode', show=True, return_pixel=False, height=240, width=320, camera_id=-1, overlays=(), 167 | # depth=False, scene_option=None) 168 | # env = dm_control2gym.make(domain_name="cartpole", task_name="balance") 169 | env.seed(2) # reproducible 170 | # env = env.unwrapped 171 | N_F = env.observation_space.shape[0] 172 | # N_A = env.action_space.shape[0] 173 | N_A = env.action_space.n 174 | 175 | print("observation dimension: %d" % N_F) # 4 176 | print("observation high: %s" % env.observation_space.high) # [ 2.4 , inf , 0.41887902 , inf] 177 | print("observation low : %s" % env.observation_space.low) # [-2.4 , -inf , -0.41887902 , -inf] 178 | print("num of actions: %d" % N_A) # 2 : left or right 179 | 180 | actor = Actor(n_features=N_F, n_actions=N_A, lr=LR_A) 181 | # we need a good teacher, so the teacher should learn faster than the actor 182 | critic = Critic(n_features=N_F, lr=LR_C) 183 | 184 | if args.train: 185 | t0 = time.time() 186 | for i_episode in range(MAX_EPISODE): 187 | # episode_time = time.time() 188 | s = env.reset().astype(np.float32) 189 | t = 0 # number of step in this episode 190 | all_r = [] # rewards of all steps 191 | while True: 192 | 193 | if RENDER: env.render() 194 | 195 | a = actor.choose_action(s) 196 | 197 | s_new, r, done, info = env.step(a) 198 | s_new = s_new.astype(np.float32) 199 | 200 | if done: r = -20 201 | # these may helpful in some tasks 202 | # if abs(s_new[0]) >= env.observation_space.high[0]: 203 | # # cart moves more than 2.4 units from the center 204 | # r = -20 205 | # reward for the distance between cart to the center 206 | # r -= abs(s_new[0]) * .1 207 | 208 | all_r.append(r) 209 | 210 | td_error = critic.learn( 211 | s, r, s_new 212 | ) # learn Value-function : gradient = grad[r + lambda * V(s_new) - V(s)] 213 | try: 214 | actor.learn(s, a, td_error) # learn Policy : true_gradient = grad[logPi(s, a) * td_error] 215 | except KeyboardInterrupt: # if Ctrl+C at running actor.learn(), then save model, or exit if not at actor.learn() 216 | actor.save_ckpt() 217 | critic.save_ckpt() 218 | # logging 219 | 220 | s = s_new 221 | t += 1 222 | 223 | if done or t >= MAX_EP_STEPS: 224 | ep_rs_sum = sum(all_r) 225 | 226 | if 'running_reward' not in globals(): 227 | running_reward = ep_rs_sum 228 | else: 229 | running_reward = running_reward * 0.95 + ep_rs_sum * 0.05 230 | # start rending if running_reward greater than a threshold 231 | # if running_reward > DISPLAY_REWARD_THRESHOLD: RENDER = True 232 | # print("Episode: %d reward: %f running_reward %f took: %.5f" % \ 233 | # (i_episode, ep_rs_sum, running_reward, time.time() - episode_time)) 234 | print('Episode: {}/{} | Episode Reward: {:.4f} | Running Time: {:.4f}'\ 235 | .format(i_episode, MAX_EPISODE, ep_rs_sum, time.time()-t0 )) 236 | 237 | # Early Stopping for quick check 238 | if t >= MAX_EP_STEPS: 239 | print("Early Stopping") 240 | s = env.reset().astype(np.float32) 241 | rall = 0 242 | while True: 243 | env.render() 244 | # a = actor.choose_action(s) 245 | a = actor.choose_action_greedy(s) # Hao Dong: it is important for this task 246 | s_new, r, done, info = env.step(a) 247 | s_new = np.concatenate((s_new[0:N_F], s[N_F:]), axis=0).astype(np.float32) 248 | rall += r 249 | s = s_new 250 | if done: 251 | print("reward", rall) 252 | s = env.reset().astype(np.float32) 253 | rall = 0 254 | break 255 | actor.save_ckpt() 256 | critic.save_ckpt() 257 | 258 
| if args.test: 259 | actor.load_ckpt() 260 | critic.load_ckpt() 261 | t0 = time.time() 262 | 263 | for i_episode in range(MAX_EPISODE): 264 | episode_time = time.time() 265 | s = env.reset().astype(np.float32) 266 | t = 0 # number of step in this episode 267 | all_r = [] # rewards of all steps 268 | while True: 269 | if RENDER: env.render() 270 | a = actor.choose_action(s) 271 | s_new, r, done, info = env.step(a) 272 | s_new = s_new.astype(np.float32) 273 | if done: r = -20 274 | # these may helpful in some tasks 275 | # if abs(s_new[0]) >= env.observation_space.high[0]: 276 | # # cart moves more than 2.4 units from the center 277 | # r = -20 278 | # reward for the distance between cart to the center 279 | # r -= abs(s_new[0]) * .1 280 | 281 | all_r.append(r) 282 | s = s_new 283 | t += 1 284 | 285 | if done or t >= MAX_EP_STEPS: 286 | ep_rs_sum = sum(all_r) 287 | 288 | if 'running_reward' not in globals(): 289 | running_reward = ep_rs_sum 290 | else: 291 | running_reward = running_reward * 0.95 + ep_rs_sum * 0.05 292 | # start rending if running_reward greater than a threshold 293 | # if running_reward > DISPLAY_REWARD_THRESHOLD: RENDER = True 294 | # print("Episode: %d reward: %f running_reward %f took: %.5f" % \ 295 | # (i_episode, ep_rs_sum, running_reward, time.time() - episode_time)) 296 | print('Episode: {}/{} | Episode Reward: {:.4f} | Running Time: {:.4f}'\ 297 | .format(i_episode, MAX_EPISODE, ep_rs_sum, time.time()-t0 )) 298 | 299 | # Early Stopping for quick check 300 | if t >= MAX_EP_STEPS: 301 | print("Early Stopping") 302 | s = env.reset().astype(np.float32) 303 | rall = 0 304 | while True: 305 | env.render() 306 | # a = actor.choose_action(s) 307 | a = actor.choose_action_greedy(s) # Hao Dong: it is important for this task 308 | s_new, r, done, info = env.step(a) 309 | s_new = np.concatenate((s_new[0:N_F], s[N_F:]), axis=0).astype(np.float32) 310 | rall += r 311 | s = s_new 312 | if done: 313 | print("reward", rall) 314 | s = env.reset().astype(np.float32) 315 | rall = 0 316 | break -------------------------------------------------------------------------------- /code/tensrolayer-implemented/ddpg.py: -------------------------------------------------------------------------------- 1 | """ 2 | Deep Deterministic Policy Gradient (DDPG) 3 | ----------------------------------------- 4 | An algorithm concurrently learns a Q-function and a policy. 5 | It uses off-policy data and the Bellman equation to learn the Q-function, 6 | and uses the Q-function to learn the policy. 7 | Reference 8 | --------- 9 | Deterministic Policy Gradient Algorithms, Silver et al. 2014 10 | Continuous Control With Deep Reinforcement Learning, Lillicrap et al. 
2016 11 | MorvanZhou's tutorial page: https://morvanzhou.github.io/tutorials/ 12 | Environment 13 | ----------- 14 | Openai Gym Pendulum-v0, continual action space 15 | Prerequisites 16 | ------------- 17 | tensorflow >=2.0.0a0 18 | tensorflow-probability 0.6.0 19 | tensorlayer >=2.0.0 20 | To run 21 | ------ 22 | python tutorial_DDPG.py --train/test 23 | """ 24 | 25 | import argparse 26 | import os 27 | import time 28 | 29 | import matplotlib.pyplot as plt 30 | import numpy as np 31 | 32 | import gym 33 | import tensorflow as tf 34 | import tensorlayer as tl 35 | 36 | parser = argparse.ArgumentParser(description='Train or test neural net motor controller.') 37 | parser.add_argument('--train', dest='train', action='store_true', default=True) 38 | parser.add_argument('--test', dest='train', action='store_false') 39 | args = parser.parse_args() 40 | 41 | ##################### hyper parameters #################### 42 | 43 | ENV_NAME = 'Pendulum-v0' # environment name 44 | RANDOMSEED = 1 # random seed 45 | 46 | LR_A = 0.001 # learning rate for actor 47 | LR_C = 0.002 # learning rate for critic 48 | GAMMA = 0.9 # reward discount 49 | TAU = 0.01 # soft replacement 50 | MEMORY_CAPACITY = 10000 # size of replay buffer 51 | BATCH_SIZE = 32 # update batchsize 52 | 53 | MAX_EPISODES = 200 # total number of episodes for training 54 | MAX_EP_STEPS = 200 # total number of steps for each episode 55 | TEST_PER_EPISODES = 10 # test the model per episodes 56 | VAR = 3 # control exploration 57 | 58 | ############################### DDPG #################################### 59 | 60 | 61 | class DDPG(object): 62 | """ 63 | DDPG class 64 | """ 65 | 66 | def __init__(self, a_dim, s_dim, a_bound): 67 | self.memory = np.zeros((MEMORY_CAPACITY, s_dim * 2 + a_dim + 1), dtype=np.float32) 68 | self.pointer = 0 69 | self.a_dim, self.s_dim, self.a_bound = a_dim, s_dim, a_bound 70 | 71 | W_init = tf.random_normal_initializer(mean=0, stddev=0.3) 72 | b_init = tf.constant_initializer(0.1) 73 | 74 | def get_actor(input_state_shape, name=''): 75 | """ 76 | Build actor network 77 | :param input_state_shape: state 78 | :param name: name 79 | :return: act 80 | """ 81 | inputs = tl.layers.Input(input_state_shape, name='A_input') 82 | x = tl.layers.Dense(n_units=30, act=tf.nn.relu, W_init=W_init, b_init=b_init, name='A_l1')(inputs) 83 | x = tl.layers.Dense(n_units=a_dim, act=tf.nn.tanh, W_init=W_init, b_init=b_init, name='A_a')(x) 84 | x = tl.layers.Lambda(lambda x: np.array(a_bound) * x)(x) 85 | return tl.models.Model(inputs=inputs, outputs=x, name='Actor' + name) 86 | 87 | def get_critic(input_state_shape, input_action_shape, name=''): 88 | """ 89 | Build critic network 90 | :param input_state_shape: state 91 | :param input_action_shape: act 92 | :param name: name 93 | :return: Q value Q(s,a) 94 | """ 95 | s = tl.layers.Input(input_state_shape, name='C_s_input') 96 | a = tl.layers.Input(input_action_shape, name='C_a_input') 97 | x = tl.layers.Concat(1)([s, a]) 98 | x = tl.layers.Dense(n_units=60, act=tf.nn.relu, W_init=W_init, b_init=b_init, name='C_l1')(x) 99 | x = tl.layers.Dense(n_units=1, W_init=W_init, b_init=b_init, name='C_out')(x) 100 | return tl.models.Model(inputs=[s, a], outputs=x, name='Critic' + name) 101 | 102 | self.actor = get_actor([None, s_dim]) 103 | self.critic = get_critic([None, s_dim], [None, a_dim]) 104 | self.actor.train() 105 | self.critic.train() 106 | 107 | def copy_para(from_model, to_model): 108 | """ 109 | Copy parameters for soft updating 110 | :param from_model: latest model 111 | :param 
to_model: target model 112 | :return: None 113 | """ 114 | for i, j in zip(from_model.trainable_weights, to_model.trainable_weights): 115 | j.assign(i) 116 | 117 | self.actor_target = get_actor([None, s_dim], name='_target') 118 | copy_para(self.actor, self.actor_target) 119 | self.actor_target.eval() 120 | 121 | self.critic_target = get_critic([None, s_dim], [None, a_dim], name='_target') 122 | copy_para(self.critic, self.critic_target) 123 | self.critic_target.eval() 124 | 125 | self.R = tl.layers.Input([None, 1], tf.float32, 'r') 126 | 127 | self.ema = tf.train.ExponentialMovingAverage(decay=1 - TAU) # soft replacement 128 | 129 | self.actor_opt = tf.optimizers.Adam(LR_A) 130 | self.critic_opt = tf.optimizers.Adam(LR_C) 131 | 132 | def ema_update(self): 133 | """ 134 | Soft updating by exponential smoothing 135 | :return: None 136 | """ 137 | paras = self.actor.trainable_weights + self.critic.trainable_weights 138 | self.ema.apply(paras) 139 | for i, j in zip(self.actor_target.trainable_weights + self.critic_target.trainable_weights, paras): 140 | i.assign(self.ema.average(j)) 141 | 142 | def choose_action(self, s): 143 | """ 144 | Choose action 145 | :param s: state 146 | :return: act 147 | """ 148 | return self.actor(np.array([s], dtype=np.float32))[0] 149 | 150 | def learn(self): 151 | """ 152 | Update parameters 153 | :return: None 154 | """ 155 | indices = np.random.choice(MEMORY_CAPACITY, size=BATCH_SIZE) 156 | bt = self.memory[indices, :] 157 | bs = bt[:, :self.s_dim] 158 | ba = bt[:, self.s_dim:self.s_dim + self.a_dim] 159 | br = bt[:, -self.s_dim - 1:-self.s_dim] 160 | bs_ = bt[:, -self.s_dim:] 161 | 162 | with tf.GradientTape() as tape: 163 | a_ = self.actor_target(bs_) 164 | q_ = self.critic_target([bs_, a_]) 165 | y = br + GAMMA * q_ 166 | q = self.critic([bs, ba]) 167 | td_error = tf.losses.mean_squared_error(y, q) 168 | c_grads = tape.gradient(td_error, self.critic.trainable_weights) 169 | self.critic_opt.apply_gradients(zip(c_grads, self.critic.trainable_weights)) 170 | 171 | with tf.GradientTape() as tape: 172 | a = self.actor(bs) 173 | q = self.critic([bs, a]) 174 | a_loss = - tf.reduce_mean(q) # maximize the q 175 | a_grads = tape.gradient(a_loss, self.actor.trainable_weights) 176 | self.actor_opt.apply_gradients(zip(a_grads, self.actor.trainable_weights)) 177 | 178 | self.ema_update() 179 | 180 | def store_transition(self, s, a, r, s_): 181 | """ 182 | Store data in data buffer 183 | :param s: state 184 | :param a: act 185 | :param r: reward 186 | :param s_: next state 187 | :return: None 188 | """ 189 | s = s.astype(np.float32) 190 | s_ = s_.astype(np.float32) 191 | transition = np.hstack((s, a, [r], s_)) 192 | index = self.pointer % MEMORY_CAPACITY # replace the old memory with new memory 193 | self.memory[index, :] = transition 194 | self.pointer += 1 195 | 196 | def save_ckpt(self): 197 | """ 198 | save trained weights 199 | :return: None 200 | """ 201 | if not os.path.exists('model'): 202 | os.makedirs('model') 203 | 204 | tl.files.save_weights_to_hdf5('model/ddpg_actor.hdf5', self.actor) 205 | tl.files.save_weights_to_hdf5('model/ddpg_actor_target.hdf5', self.actor_target) 206 | tl.files.save_weights_to_hdf5('model/ddpg_critic.hdf5', self.critic) 207 | tl.files.save_weights_to_hdf5('model/ddpg_critic_target.hdf5', self.critic_target) 208 | 209 | def load_ckpt(self): 210 | """ 211 | load trained weights 212 | :return: None 213 | """ 214 | tl.files.load_hdf5_to_weights_in_order('model/ddpg_actor.hdf5', self.actor) 215 | 
tl.files.load_hdf5_to_weights_in_order('model/ddpg_actor_target.hdf5', self.actor_target) 216 | tl.files.load_hdf5_to_weights_in_order('model/ddpg_critic.hdf5', self.critic) 217 | tl.files.load_hdf5_to_weights_in_order('model/ddpg_critic_target.hdf5', self.critic_target) 218 | 219 | 220 | if __name__ == '__main__': 221 | 222 | env = gym.make(ENV_NAME) 223 | env = env.unwrapped 224 | 225 | # reproducible 226 | env.seed(RANDOMSEED) 227 | np.random.seed(RANDOMSEED) 228 | tf.random.set_seed(RANDOMSEED) 229 | 230 | s_dim = env.observation_space.shape[0] 231 | a_dim = env.action_space.shape[0] 232 | a_bound = env.action_space.high 233 | 234 | ddpg = DDPG(a_dim, s_dim, a_bound) 235 | 236 | if args.train: # train 237 | 238 | reward_buffer = [] 239 | t0 = time.time() 240 | for i in range(MAX_EPISODES): 241 | t1 = time.time() 242 | s = env.reset() 243 | ep_reward = 0 244 | for j in range(MAX_EP_STEPS): 245 | # Add exploration noise 246 | a = ddpg.choose_action(s) 247 | a = np.clip(np.random.normal(a, VAR), -2, 2) # add randomness to action selection for exploration 248 | s_, r, done, info = env.step(a) 249 | 250 | ddpg.store_transition(s, a, r / 10, s_) 251 | 252 | if ddpg.pointer > MEMORY_CAPACITY: 253 | ddpg.learn() 254 | 255 | s = s_ 256 | ep_reward += r 257 | if j == MAX_EP_STEPS - 1: 258 | print( 259 | '\rEpisode: {}/{} | Episode Reward: {:.4f} | Running Time: {:.4f}'.format( 260 | i, MAX_EPISODES, ep_reward, 261 | time.time() - t1 262 | ), end='' 263 | ) 264 | plt.show() 265 | # test 266 | if i and not i % TEST_PER_EPISODES: 267 | t1 = time.time() 268 | s = env.reset() 269 | ep_reward = 0 270 | for j in range(MAX_EP_STEPS): 271 | 272 | a = ddpg.choose_action(s) # without exploration noise 273 | s_, r, done, info = env.step(a) 274 | 275 | s = s_ 276 | ep_reward += r 277 | if j == MAX_EP_STEPS - 1: 278 | print( 279 | '\rEpisode: {}/{} | Episode Reward: {:.4f} | Running Time: {:.4f}'.format( 280 | i, MAX_EPISODES, ep_reward, 281 | time.time() - t1 282 | ) 283 | ) 284 | 285 | reward_buffer.append(ep_reward) 286 | 287 | if reward_buffer: 288 | plt.ion() 289 | plt.cla() 290 | plt.title('DDPG') 291 | plt.plot(np.array(range(len(reward_buffer))) * TEST_PER_EPISODES, reward_buffer) # plot the episode vt 292 | plt.xlabel('episode steps') 293 | plt.ylabel('normalized state-action value') 294 | plt.ylim(-2000, 0) 295 | plt.show() 296 | plt.pause(0.1) 297 | plt.ioff() 298 | plt.show() 299 | print('\nRunning time: ', time.time() - t0) 300 | ddpg.save_ckpt() 301 | 302 | # test 303 | ddpg.load_ckpt() 304 | while True: 305 | s = env.reset() 306 | for i in range(MAX_EP_STEPS): 307 | env.render() 308 | s, r, done, info = env.step(ddpg.choose_action(s)) 309 | if done: 310 | break -------------------------------------------------------------------------------- /code/tensrolayer-implemented/dqn.py: -------------------------------------------------------------------------------- 1 | """ 2 | Deep Q-Network Q(a, s) 3 | ----------------------- 4 | TD Learning, Off-Policy, e-Greedy Exploration (GLIE). 5 | Q(S, A) <- Q(S, A) + alpha * (R + lambda * Q(newS, newA) - Q(S, A)) 6 | delta_w = R + lambda * Q(newS, newA) 7 | See David Silver RL Tutorial Lecture 5 - Q-Learning for more details. 
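(Illustrative note, not part of the original text: with a Q-network the same
update becomes a regression of Q(S, A) towards the target
    R + lambda * max_a Q(newS, a)
i.e. minimizing |target - Q(S, A)|^2, which is what the training loop below
does when it builds `targetQ` and the mean-squared-error loss.)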
8 | Reference 9 | ---------- 10 | original paper: https://storage.googleapis.com/deepmind-media/dqn/DQNNaturePaper.pdf 11 | EN: https://medium.com/emergent-future/simple-reinforcement-learning-with-tensorflow-part-0-q-learning-with-tables-and-neural-networks-d195264329d0#.5m3361vlw 12 | CN: https://zhuanlan.zhihu.com/p/25710327 13 | Note: Policy Network has been proved to be better than Q-Learning, see tutorial_atari_pong.py 14 | Environment 15 | ----------- 16 | # The FrozenLake v0 environment 17 | https://gym.openai.com/envs/FrozenLake-v0 18 | The agent controls the movement of a character in a grid world. Some tiles of 19 | the grid are walkable, and others lead to the agent falling into the water. 20 | Additionally, the movement direction of the agent is uncertain and only partially 21 | depends on the chosen direction. The agent is rewarded for finding a walkable 22 | path to a goal tile. 23 | SFFF (S: starting point, safe) 24 | FHFH (F: frozen surface, safe) 25 | FFFH (H: hole, fall to your doom) 26 | HFFG (G: goal, where the frisbee is located) 27 | The episode ends when you reach the goal or fall in a hole. You receive a reward 28 | of 1 if you reach the goal, and zero otherwise. 29 | Prerequisites 30 | -------------- 31 | tensorflow>=2.0.0a0 32 | tensorlayer>=2.0.0 33 | To run 34 | ------- 35 | python tutorial_DQN.py --train/test 36 | """ 37 | import argparse 38 | import time 39 | 40 | import numpy as np 41 | 42 | import gym 43 | import tensorflow as tf 44 | import tensorlayer as tl 45 | 46 | # add arguments in command --train/test 47 | parser = argparse.ArgumentParser(description='Train or test neural net motor controller.') 48 | parser.add_argument('--train', dest='train', action='store_true', default=False) 49 | parser.add_argument('--test', dest='test', action='store_true', default=True) 50 | args = parser.parse_args() 51 | 52 | tl.logging.set_verbosity(tl.logging.DEBUG) 53 | 54 | ##################### hyper parameters #################### 55 | lambd = .99 # decay factor 56 | e = 0.1 # e-Greedy Exploration, the larger the more random 57 | num_episodes = 10000 58 | render = False # display the game environment 59 | running_reward = None 60 | 61 | ##################### DQN ########################## 62 | 63 | 64 | def to_one_hot(i, n_classes=None): 65 | a = np.zeros(n_classes, 'uint8') 66 | a[i] = 1 67 | return a 68 | 69 | 70 | ## Define Q-network q(a,s) that ouput the rewards of 4 actions by given state, i.e. Action-Value Function. 71 | # encoding for state: 4x4 grid can be represented by one-hot vector with 16 integers. 
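# Illustrative example (not in the original file): what the encoding above looks
# like for state 3 of the 4x4 FrozenLake grid,
#   to_one_hot(3, 16) -> array([0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], dtype=uint8)
# so the Q-network below maps a length-16 one-hot vector to 4 action values.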
72 | def get_model(inputs_shape): 73 | ni = tl.layers.Input(inputs_shape, name='observation') 74 | nn = tl.layers.Dense(4, act=None, W_init=tf.random_uniform_initializer(0, 0.01), b_init=None, name='q_a_s')(ni) 75 | return tl.models.Model(inputs=ni, outputs=nn, name="Q-Network") 76 | 77 | 78 | def save_ckpt(model): # save trained weights 79 | tl.files.save_npz(model.trainable_weights, name='dqn_model.npz') 80 | 81 | 82 | def load_ckpt(model): # load trained weights 83 | tl.files.load_and_assign_npz(name='dqn_model.npz', network=model) 84 | 85 | 86 | if __name__ == '__main__': 87 | 88 | qnetwork = get_model([None, 16]) 89 | qnetwork.train() 90 | train_weights = qnetwork.trainable_weights 91 | 92 | optimizer = tf.optimizers.SGD(learning_rate=0.1) 93 | env = gym.make('FrozenLake-v0') 94 | 95 | if args.train: 96 | t0 = time.time() 97 | for i in range(num_episodes): 98 | ## Reset environment and get first new observation 99 | # episode_time = time.time() 100 | s = env.reset() # observation is state, integer 0 ~ 15 101 | rAll = 0 102 | for j in range(99): # step index, maximum step is 99 103 | if render: env.render() 104 | ## Choose an action by greedily (with e chance of random action) from the Q-network 105 | allQ = qnetwork(np.asarray([to_one_hot(s, 16)], dtype=np.float32)).numpy() 106 | a = np.argmax(allQ, 1) 107 | 108 | ## e-Greedy Exploration !!! sample random action 109 | if np.random.rand(1) < e: 110 | a[0] = env.action_space.sample() 111 | ## Get new state and reward from environment 112 | s1, r, d, _ = env.step(a[0]) 113 | ## Obtain the Q' values by feeding the new state through our network 114 | Q1 = qnetwork(np.asarray([to_one_hot(s1, 16)], dtype=np.float32)).numpy() 115 | 116 | ## Obtain maxQ' and set our target value for chosen action. 117 | maxQ1 = np.max(Q1) # in Q-Learning, policy is greedy, so we use "max" to select the next action. 118 | targetQ = allQ 119 | targetQ[0, a[0]] = r + lambd * maxQ1 120 | ## Train network using target and predicted Q values 121 | # it is not real target Q value, it is just an estimation, 122 | # but check the Q-Learning update formula: 123 | # Q'(s,a) <- Q(s,a) + alpha(r + lambd * maxQ(s',a') - Q(s, a)) 124 | # minimizing |r + lambd * maxQ(s',a') - Q(s, a)|^2 equals to force Q'(s,a) ≈ Q(s,a) 125 | with tf.GradientTape() as tape: 126 | _qvalues = qnetwork(np.asarray([to_one_hot(s, 16)], dtype=np.float32)) 127 | _loss = tl.cost.mean_squared_error(targetQ, _qvalues, is_mean=False) 128 | grad = tape.gradient(_loss, train_weights) 129 | optimizer.apply_gradients(zip(grad, train_weights)) 130 | 131 | rAll += r 132 | s = s1 133 | ## Reduce chance of random action if an episode is done. 134 | if d ==True: 135 | e = 1. 
/ ((i / 50) + 10) # reduce e, GLIE: Greey in the limit with infinite Exploration 136 | break 137 | 138 | ## Note that, the rewards here with random action 139 | running_reward = rAll if running_reward is None else running_reward * 0.99 + rAll * 0.01 140 | # print("Episode [%d/%d] sum reward: %f running reward: %f took: %.5fs " % \ 141 | # (i, num_episodes, rAll, running_reward, time.time() - episode_time)) 142 | print('Episode: {}/{} | Episode Reward: {:.4f} | Running Average Reward: {:.4f} | Running Time: {:.4f}'\ 143 | .format(i, num_episodes, rAll, running_reward, time.time()-t0 )) 144 | save_ckpt(qnetwork) # save model 145 | 146 | if args.test: 147 | t0 = time.time() 148 | load_ckpt(qnetwork) # load model 149 | for i in range(num_episodes): 150 | ## Reset environment and get first new observation 151 | episode_time = time.time() 152 | s = env.reset() # observation is state, integer 0 ~ 15 153 | rAll = 0 154 | for j in range(99): # step index, maximum step is 99 155 | if render: env.render() 156 | ## Choose an action by greedily (with e chance of random action) from the Q-network 157 | allQ = qnetwork(np.asarray([to_one_hot(s, 16)], dtype=np.float32)).numpy() 158 | a = np.argmax(allQ, 1) # no epsilon, only greedy for testing 159 | 160 | ## Get new state and reward from environment 161 | s1, r, d, _ = env.step(a[0]) 162 | rAll += r 163 | s = s1 164 | ## Reduce chance of random action if an episode is done. 165 | if d ==True: 166 | e = 1. / ((i / 50) + 10) # reduce e, GLIE: Greey in the limit with infinite Exploration 167 | break 168 | 169 | ## Note that, the rewards here with random action 170 | running_reward = rAll if running_reward is None else running_reward * 0.99 + rAll * 0.01 171 | # print("Episode [%d/%d] sum reward: %f running reward: %f took: %.5fs " % \ 172 | # (i, num_episodes, rAll, running_reward, time.time() - episode_time)) 173 | print('Episode: {}/{} | Episode Reward: {:.4f} | Running Average Reward: {:.4f} | Running Time: {:.4f}'\ 174 | .format(i, num_episodes, rAll, running_reward, time.time()-t0 )) -------------------------------------------------------------------------------- /code/tensrolayer-implemented/dqn_variants.py: -------------------------------------------------------------------------------- 1 | """ 2 | DQN and its variants 3 | ------------------------ 4 | We implement Double DQN, Dueling DQN and Noisy DQN here. 5 | The max operator in standard DQN uses the same values both to select and to 6 | evaluate an action by 7 | Q(s_t, a_t) = R_{t+1} + \gamma * max_{a}Q_{tar}(s_{t+1}, a). 8 | Double DQN propose to use following evaluation to address overestimation problem 9 | of max operator: 10 | Q(s_t, a_t) = R_{t+1} + \gamma * Q_{tar}(s_{t+1}, max_{a}Q(s_{t+1}, a)). 11 | Dueling DQN uses dueling architecture where the value of state and the advantage 12 | of each action is estimated separately. 13 | Noisy DQN propose to explore by adding parameter noises. 14 | Reference: 15 | ------------------------ 16 | 1. Double DQN 17 | Van Hasselt H, Guez A, Silver D. Deep reinforcement learning with double 18 | q-learning[C]//Thirtieth AAAI Conference on Artificial Intelligence. 2016. 19 | 2. Dueling DQN 20 | Wang Z, Schaul T, Hessel M, et al. Dueling network architectures for deep 21 | reinforcement learning[J]. arXiv preprint arXiv:1511.06581, 2015. 22 | 3. Noisy DQN 23 | Plappert M, Houthooft R, Dhariwal P, et al. Parameter space noise for 24 | exploration[J]. arXiv preprint arXiv:1706.01905, 2017. 
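Illustrative note (not part of the original text): in the training loop below,
the Double DQN target is computed in two steps,
    a* = argmax_a Q(s_{t+1}, a)                   selected by the online network
    y  = R_{t+1} + \gamma * Q_{tar}(s_{t+1}, a*)  evaluated by the target network
which is the "double q estimation" block in __main__; the dueling architecture
appears in MLP.forward / CNN.forward as svalue + qvalue - mean(qvalue).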
25 | Environment: 26 | ------------------------ 27 | Cartpole and Pong in OpenAI Gym 28 | Requirements: 29 | ------------------------ 30 | tensorflow>=2.0.0a0 31 | tensorlayer>=2.0.0 32 | To run: 33 | ------------------------ 34 | python tutorial_DQN_variantes.py --mode=train 35 | python tutorial_DQN_variantes.py --mode=test --save_path=dqn_variants/8000.npz 36 | """ 37 | import argparse 38 | import os 39 | import random 40 | import time 41 | 42 | import numpy as np 43 | 44 | import tensorflow as tf 45 | import tensorlayer as tl 46 | from tutorial_wrappers import build_env 47 | 48 | parser = argparse.ArgumentParser() 49 | parser.add_argument('--mode', help='train or test', default='train') 50 | parser.add_argument( 51 | '--save_path', default='dqn_variants', help='folder to save if mode == train else model path,' 52 | 'qnet will be saved once target net update' 53 | ) 54 | parser.add_argument('--seed', help='random seed', type=int, default=0) 55 | parser.add_argument('--env_id', default='CartPole-v0', help='CartPole-v0 or PongNoFrameskip-v4') 56 | args = parser.parse_args() 57 | 58 | if args.mode == 'train': 59 | os.makedirs(args.save_path, exist_ok=True) 60 | random.seed(args.seed) 61 | np.random.seed(args.seed) 62 | tf.random.set_seed(args.seed) # reproducible 63 | env_id = args.env_id 64 | env = build_env(env_id, seed=args.seed) 65 | 66 | # #################### hyper parameters #################### 67 | if env_id == 'CartPole-v0': 68 | qnet_type = 'MLP' 69 | number_timesteps = 10000 # total number of time steps to train on 70 | explore_timesteps = 100 71 | # epsilon-greedy schedule, final exploit prob is 0.99 72 | epsilon = lambda i_iter: 1 - 0.99 * min(1, i_iter / explore_timesteps) 73 | lr = 5e-3 # learning rate 74 | buffer_size = 1000 # replay buffer size 75 | target_q_update_freq = 50 # how frequency target q net update 76 | ob_scale = 1.0 # scale observations 77 | else: 78 | # reward will increase obviously after 1e5 time steps 79 | qnet_type = 'CNN' 80 | number_timesteps = int(1e6) # total number of time steps to train on 81 | explore_timesteps = 1e5 82 | # epsilon-greedy schedule, final exploit prob is 0.99 83 | epsilon = lambda i_iter: 1 - 0.99 * min(1, i_iter / explore_timesteps) 84 | lr = 1e-4 # learning rate 85 | buffer_size = 10000 # replay buffer size 86 | target_q_update_freq = 200 # how frequency target q net update 87 | ob_scale = 1.0 / 255 # scale observations 88 | 89 | in_dim = env.observation_space.shape 90 | out_dim = env.action_space.n 91 | reward_gamma = 0.99 # reward discount 92 | batch_size = 32 # batch size for sampling from replay buffer 93 | warm_start = buffer_size / 10 # sample times befor learning 94 | noise_update_freq = 50 # how frequency param noise net update 95 | 96 | 97 | # ############################## DQN #################################### 98 | class MLP(tl.models.Model): 99 | 100 | def __init__(self, name): 101 | super(MLP, self).__init__(name=name) 102 | self.h1 = tl.layers.Dense(64, tf.nn.tanh, in_channels=in_dim[0]) 103 | self.qvalue = tl.layers.Dense(out_dim, in_channels=64, name='q', W_init=tf.initializers.GlorotUniform()) 104 | self.svalue = tl.layers.Dense(1, in_channels=64, name='s', W_init=tf.initializers.GlorotUniform()) 105 | self.noise_scale = 0 106 | 107 | def forward(self, ni): 108 | feature = self.h1(ni) 109 | 110 | # apply noise to all linear layer 111 | if self.noise_scale != 0: 112 | noises = [] 113 | for layer in [self.qvalue, self.svalue]: 114 | for var in layer.trainable_weights: 115 | noise = tf.random.normal(tf.shape(var), 
0, self.noise_scale) 116 | noises.append(noise) 117 | var.assign_add(noise) 118 | 119 | qvalue = self.qvalue(feature) 120 | svalue = self.svalue(feature) 121 | 122 | if self.noise_scale != 0: 123 | idx = 0 124 | for layer in [self.qvalue, self.svalue]: 125 | for var in layer.trainable_weights: 126 | var.assign_sub(noises[idx]) 127 | idx += 1 128 | 129 | # dueling network 130 | out = svalue + qvalue - tf.reduce_mean(qvalue, 1, keepdims=True) 131 | return out 132 | 133 | 134 | class CNN(tl.models.Model): 135 | 136 | def __init__(self, name): 137 | super(CNN, self).__init__(name=name) 138 | h, w, in_channels = in_dim 139 | dense_in_channels = 64 * ((h - 28) // 8) * ((w - 28) // 8) 140 | self.conv1 = tl.layers.Conv2d( 141 | 32, (8, 8), (4, 4), tf.nn.relu, 'VALID', in_channels=in_channels, name='conv2d_1', 142 | W_init=tf.initializers.GlorotUniform() 143 | ) 144 | self.conv2 = tl.layers.Conv2d( 145 | 64, (4, 4), (2, 2), tf.nn.relu, 'VALID', in_channels=32, name='conv2d_2', 146 | W_init=tf.initializers.GlorotUniform() 147 | ) 148 | self.conv3 = tl.layers.Conv2d( 149 | 64, (3, 3), (1, 1), tf.nn.relu, 'VALID', in_channels=64, name='conv2d_3', 150 | W_init=tf.initializers.GlorotUniform() 151 | ) 152 | self.flatten = tl.layers.Flatten(name='flatten') 153 | self.preq = tl.layers.Dense( 154 | 256, tf.nn.relu, in_channels=dense_in_channels, name='pre_q', W_init=tf.initializers.GlorotUniform() 155 | ) 156 | self.qvalue = tl.layers.Dense(out_dim, in_channels=256, name='q', W_init=tf.initializers.GlorotUniform()) 157 | self.pres = tl.layers.Dense( 158 | 256, tf.nn.relu, in_channels=dense_in_channels, name='pre_s', W_init=tf.initializers.GlorotUniform() 159 | ) 160 | self.svalue = tl.layers.Dense(1, in_channels=256, name='state', W_init=tf.initializers.GlorotUniform()) 161 | self.noise_scale = 0 162 | 163 | def forward(self, ni): 164 | feature = self.flatten(self.conv3(self.conv2(self.conv1(ni)))) 165 | 166 | # apply noise to all linear layer 167 | if self.noise_scale != 0: 168 | noises = [] 169 | for layer in [self.preq, self.qvalue, self.pres, self.svalue]: 170 | for var in layer.trainable_weights: 171 | noise = tf.random.normal(tf.shape(var), 0, self.noise_scale) 172 | noises.append(noise) 173 | var.assign_add(noise) 174 | 175 | qvalue = self.qvalue(self.preq(feature)) 176 | svalue = self.svalue(self.pres(feature)) 177 | 178 | if self.noise_scale != 0: 179 | idx = 0 180 | for layer in [self.preq, self.qvalue, self.pres, self.svalue]: 181 | for var in layer.trainable_weights: 182 | var.assign_sub(noises[idx]) 183 | idx += 1 184 | 185 | # dueling network 186 | return svalue + qvalue - tf.reduce_mean(qvalue, 1, keepdims=True) 187 | 188 | 189 | class ReplayBuffer(object): 190 | 191 | def __init__(self, size): 192 | self._storage = [] 193 | self._maxsize = size 194 | self._next_idx = 0 195 | 196 | def __len__(self): 197 | return len(self._storage) 198 | 199 | def add(self, *args): 200 | if self._next_idx >= len(self._storage): 201 | self._storage.append(args) 202 | else: 203 | self._storage[self._next_idx] = args 204 | self._next_idx = (self._next_idx + 1) % self._maxsize 205 | 206 | def _encode_sample(self, idxes): 207 | b_o, b_a, b_r, b_o_, b_d = [], [], [], [], [] 208 | for i in idxes: 209 | o, a, r, o_, d = self._storage[i] 210 | b_o.append(o) 211 | b_a.append(a) 212 | b_r.append(r) 213 | b_o_.append(o_) 214 | b_d.append(d) 215 | return ( 216 | np.stack(b_o).astype('float32') * ob_scale, 217 | np.stack(b_a).astype('int32'), 218 | np.stack(b_r).astype('float32'), 219 | np.stack(b_o_).astype('float32') * 
ob_scale, 220 | np.stack(b_d).astype('float32'), 221 | ) 222 | 223 | def sample(self, batch_size): 224 | indexes = range(len(self._storage)) 225 | idxes = [random.choice(indexes) for _ in range(batch_size)] 226 | return self._encode_sample(idxes) 227 | 228 | 229 | def huber_loss(x): 230 | """Loss function for value""" 231 | return tf.where(tf.abs(x) < 1, tf.square(x) * 0.5, tf.abs(x) - 0.5) 232 | 233 | 234 | def sync(net, net_tar): 235 | """Copy q network to target q network""" 236 | for var, var_tar in zip(net.trainable_weights, net_tar.trainable_weights): 237 | var_tar.assign(var) 238 | 239 | 240 | def log_softmax(x, dim): 241 | temp = x - np.max(x, dim, keepdims=True) 242 | return temp - np.log(np.exp(temp).sum(dim, keepdims=True)) 243 | 244 | 245 | def softmax(x, dim): 246 | temp = np.exp(x - np.max(x, dim, keepdims=True)) 247 | return temp / temp.sum(dim, keepdims=True) 248 | 249 | 250 | if __name__ == '__main__': 251 | if args.mode == 'train': 252 | qnet = MLP('q') if qnet_type == 'MLP' else CNN('q') 253 | qnet.train() 254 | trainabel_weights = qnet.trainable_weights 255 | targetqnet = MLP('targetq') if qnet_type == 'MLP' else CNN('targetq') 256 | targetqnet.infer() 257 | sync(qnet, targetqnet) 258 | optimizer = tf.optimizers.Adam(learning_rate=lr) 259 | buffer = ReplayBuffer(buffer_size) 260 | 261 | o = env.reset() 262 | nepisode = 0 263 | t = time.time() 264 | noise_scale = 1e-2 265 | for i in range(1, number_timesteps + 1): 266 | eps = epsilon(i) 267 | 268 | # select action 269 | if random.random() < eps: 270 | a = int(random.random() * out_dim) 271 | else: 272 | # noise schedule is based on KL divergence between perturbed and 273 | # non-perturbed policy, see https://arxiv.org/pdf/1706.01905.pdf 274 | obv = np.expand_dims(o, 0).astype('float32') * ob_scale 275 | if i < explore_timesteps: 276 | qnet.noise_scale = noise_scale 277 | q_ptb = qnet(obv).numpy() 278 | qnet.noise_scale = 0 279 | if i % noise_update_freq == 0: 280 | q = qnet(obv).numpy() 281 | kl_ptb = (log_softmax(q, 1) - log_softmax(q_ptb, 1)) 282 | kl_ptb = np.sum(kl_ptb * softmax(q, 1), 1).mean() 283 | kl_explore = -np.log(1 - eps + eps / out_dim) 284 | if kl_ptb < kl_explore: 285 | noise_scale *= 1.01 286 | else: 287 | noise_scale /= 1.01 288 | a = q_ptb.argmax(1)[0] 289 | else: 290 | a = qnet(obv).numpy().argmax(1)[0] 291 | 292 | # execute action and feed to replay buffer 293 | # note that `_` tail in var name means next 294 | o_, r, done, info = env.step(a) 295 | buffer.add(o, a, r, o_, done) 296 | 297 | if i >= warm_start: 298 | # sync q net and target q net 299 | if i % target_q_update_freq == 0: 300 | sync(qnet, targetqnet) 301 | path = os.path.join(args.save_path, '{}.npz'.format(i)) 302 | tl.files.save_npz(qnet.trainable_weights, name=path) 303 | 304 | # sample from replay buffer 305 | b_o, b_a, b_r, b_o_, b_d = buffer.sample(batch_size) 306 | 307 | # double q estimation 308 | b_a_ = tf.one_hot(tf.argmax(qnet(b_o_), 1), out_dim) 309 | b_q_ = (1 - b_d) * tf.reduce_sum(targetqnet(b_o_) * b_a_, 1) 310 | 311 | # calculate loss 312 | with tf.GradientTape() as q_tape: 313 | b_q = tf.reduce_sum(qnet(b_o) * tf.one_hot(b_a, out_dim), 1) 314 | loss = tf.reduce_mean(huber_loss(b_q - (b_r + reward_gamma * b_q_))) 315 | 316 | # backward gradients 317 | q_grad = q_tape.gradient(loss, trainabel_weights) 318 | optimizer.apply_gradients(zip(q_grad, trainabel_weights)) 319 | 320 | if done: 321 | o = env.reset() 322 | else: 323 | o = o_ 324 | 325 | # episode in info is real (unwrapped) message 326 | if info.get('episode'): 327 | 
nepisode += 1 328 | reward, length = info['episode']['r'], info['episode']['l'] 329 | fps = int(length / (time.time() - t)) 330 | print( 331 | 'Time steps so far: {}, episode so far: {}, ' 332 | 'episode reward: {:.4f}, episode length: {}, FPS: {}'.format(i, nepisode, reward, length, fps) 333 | ) 334 | t = time.time() 335 | else: 336 | qnet = MLP('q') if qnet_type == 'MLP' else CNN('q') 337 | tl.files.load_and_assign_npz(name=args.save_path, network=qnet) 338 | qnet.eval() 339 | 340 | nepisode = 0 341 | o = env.reset() 342 | for i in range(1, number_timesteps + 1): 343 | obv = np.expand_dims(o, 0).astype('float32') * ob_scale 344 | a = qnet(obv).numpy().argmax(1)[0] 345 | 346 | # execute action 347 | # note that `_` tail in var name means next 348 | o_, r, done, info = env.step(a) 349 | 350 | if done: 351 | o = env.reset() 352 | else: 353 | o = o_ 354 | 355 | # episode in info is real (unwrapped) message 356 | if info.get('episode'): 357 | nepisode += 1 358 | reward, length = info['episode']['r'], info['episode']['l'] 359 | print( 360 | 'Time steps so far: {}, episode so far: {}, ' 361 | 'episode reward: {:.4f}, episode length: {}'.format(i, nepisode, reward, length) 362 | ) -------------------------------------------------------------------------------- /code/tensrolayer-implemented/pg.py: -------------------------------------------------------------------------------- 1 | """ 2 | Vanilla Policy Gradient(VPG or REINFORCE) 3 | ----------------------------------------- 4 | The policy gradient algorithm works by updating policy parameters via stochastic gradient ascent on policy performance. 5 | It's an on-policy algorithm can be used for environments with either discrete or continuous action spaces. 6 | Here is an example on discrete action space game CartPole-v0. 7 | To apply it on continuous action space, you need to change the last softmax layer and the choose_action function. 8 | Reference 9 | --------- 10 | Cookbook: Barto A G, Sutton R S. Reinforcement Learning: An Introduction[J]. 1998. 
11 | MorvanZhou's tutorial page: https://morvanzhou.github.io/tutorials/ 12 | Environment 13 | ----------- 14 | Openai Gym CartPole-v0, discrete action space 15 | Prerequisites 16 | -------------- 17 | tensorflow >=2.0.0a0 18 | tensorflow-probability 0.6.0 19 | tensorlayer >=2.0.0 20 | To run 21 | ------ 22 | python tutorial_PG.py --train/test 23 | """ 24 | 25 | import argparse 26 | import os 27 | import time 28 | 29 | import matplotlib.pyplot as plt 30 | import numpy as np 31 | 32 | import gym 33 | import tensorflow as tf 34 | import tensorlayer as tl 35 | 36 | parser = argparse.ArgumentParser(description='Train or test neural net motor controller.') 37 | parser.add_argument('--train', dest='train', action='store_true', default=True) 38 | parser.add_argument('--test', dest='train', action='store_false') 39 | args = parser.parse_args() 40 | 41 | ##################### hyper parameters #################### 42 | 43 | ENV_NAME = 'CartPole-v0' # environment name 44 | RANDOMSEED = 1 # random seed 45 | 46 | DISPLAY_REWARD_THRESHOLD = 400 # renders environment if total episode reward is greater then this threshold 47 | RENDER = False # rendering wastes time 48 | num_episodes = 300 49 | 50 | ############################### PG #################################### 51 | 52 | 53 | class PolicyGradient: 54 | """ 55 | PG class 56 | """ 57 | 58 | def __init__(self, n_features, n_actions, learning_rate=0.01, reward_decay=0.95): 59 | self.n_actions = n_actions 60 | self.n_features = n_features 61 | self.lr = learning_rate 62 | self.gamma = reward_decay 63 | 64 | self.ep_obs, self.ep_as, self.ep_rs = [], [], [] 65 | 66 | def get_model(inputs_shape): 67 | """ 68 | Build a neural network model. 69 | :param inputs_shape: state_shape 70 | :return: act 71 | """ 72 | with tf.name_scope('inputs'): 73 | self.tf_obs = tl.layers.Input(inputs_shape, tf.float32, name="observations") 74 | self.tf_acts = tl.layers.Input([ 75 | None, 76 | ], tf.int32, name="actions_num") 77 | self.tf_vt = tl.layers.Input([ 78 | None, 79 | ], tf.float32, name="actions_value") 80 | # fc1 81 | layer = tl.layers.Dense( 82 | n_units=30, act=tf.nn.tanh, W_init=tf.random_normal_initializer(mean=0, stddev=0.3), 83 | b_init=tf.constant_initializer(0.1), name='fc1' 84 | )(self.tf_obs) 85 | # fc2 86 | all_act = tl.layers.Dense( 87 | n_units=self.n_actions, act=None, W_init=tf.random_normal_initializer(mean=0, stddev=0.3), 88 | b_init=tf.constant_initializer(0.1), name='all_act' 89 | )(layer) 90 | return tl.models.Model(inputs=self.tf_obs, outputs=all_act, name='PG model') 91 | 92 | self.model = get_model([None, n_features]) 93 | self.model.train() 94 | self.optimizer = tf.optimizers.Adam(self.lr) 95 | 96 | def choose_action(self, s): 97 | """ 98 | choose action with probabilities. 
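        (Illustrative: with softmax output [0.7, 0.3] this samples action 0 with
        probability 0.7; compare choose_action_greedy below, which always takes the argmax.)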
99 | :param s: state 100 | :return: act 101 | """ 102 | _logits = self.model(np.array([s], np.float32)) 103 | _probs = tf.nn.softmax(_logits).numpy() 104 | return tl.rein.choice_action_by_probs(_probs.ravel()) 105 | 106 | def choose_action_greedy(self, s): 107 | """ 108 | choose action with greedy policy 109 | :param s: state 110 | :return: act 111 | """ 112 | _probs = tf.nn.softmax(self.model(np.array([s], np.float32))).numpy() 113 | return np.argmax(_probs.ravel()) 114 | 115 | def store_transition(self, s, a, r): 116 | """ 117 | store data in memory buffer 118 | :param s: state 119 | :param a: act 120 | :param r: reward 121 | :return: 122 | """ 123 | self.ep_obs.append(np.array([s], np.float32)) 124 | self.ep_as.append(a) 125 | self.ep_rs.append(r) 126 | 127 | def learn(self): 128 | """ 129 | update policy parameters via stochastic gradient ascent 130 | :return: None 131 | """ 132 | # discount and normalize episode reward 133 | discounted_ep_rs_norm = self._discount_and_norm_rewards() 134 | 135 | with tf.GradientTape() as tape: 136 | 137 | _logits = self.model(np.vstack(self.ep_obs)) 138 | # to maximize total reward (log_p * R) is to minimize -(log_p * R), and the tf only have minimize(loss) 139 | neg_log_prob = tf.nn.sparse_softmax_cross_entropy_with_logits(logits=_logits, labels=np.array(self.ep_as)) 140 | # this is negative log of chosen action 141 | 142 | # or in this way: 143 | # neg_log_prob = tf.reduce_sum(-tf.log(self.all_act_prob)*tf.one_hot(self.tf_acts, self.n_actions), axis=1) 144 | 145 | loss = tf.reduce_mean(neg_log_prob * discounted_ep_rs_norm) # reward guided loss 146 | 147 | grad = tape.gradient(loss, self.model.trainable_weights) 148 | self.optimizer.apply_gradients(zip(grad, self.model.trainable_weights)) 149 | 150 | self.ep_obs, self.ep_as, self.ep_rs = [], [], [] # empty episode data 151 | return discounted_ep_rs_norm 152 | 153 | def _discount_and_norm_rewards(self): 154 | """ 155 | compute discount_and_norm_rewards 156 | :return: discount_and_norm_rewards 157 | """ 158 | # discount episode rewards 159 | discounted_ep_rs = np.zeros_like(self.ep_rs) 160 | running_add = 0 161 | for t in reversed(range(0, len(self.ep_rs))): 162 | running_add = running_add * self.gamma + self.ep_rs[t] 163 | discounted_ep_rs[t] = running_add 164 | 165 | # normalize episode rewards 166 | discounted_ep_rs -= np.mean(discounted_ep_rs) 167 | discounted_ep_rs /= np.std(discounted_ep_rs) 168 | return discounted_ep_rs 169 | 170 | def save_ckpt(self): 171 | """ 172 | save trained weights 173 | :return: None 174 | """ 175 | if not os.path.exists('model'): 176 | os.makedirs('model') 177 | tl.files.save_weights_to_hdf5('model/pg_policy.hdf5', self.model) 178 | 179 | def load_ckpt(self): 180 | """ 181 | load trained weights 182 | :return: None 183 | """ 184 | tl.files.load_hdf5_to_weights_in_order('model/pg_policy.hdf5', self.model) 185 | 186 | 187 | if __name__ == '__main__': 188 | 189 | # reproducible 190 | np.random.seed(RANDOMSEED) 191 | tf.random.set_seed(RANDOMSEED) 192 | 193 | tl.logging.set_verbosity(tl.logging.DEBUG) 194 | 195 | env = gym.make(ENV_NAME) 196 | env.seed(RANDOMSEED) # reproducible, general Policy gradient has high variance 197 | env = env.unwrapped 198 | 199 | print(env.action_space) 200 | print(env.observation_space) 201 | print(env.observation_space.high) 202 | print(env.observation_space.low) 203 | 204 | RL = PolicyGradient( 205 | n_actions=env.action_space.n, 206 | n_features=env.observation_space.shape[0], 207 | learning_rate=0.02, 208 | reward_decay=0.99, 209 | # 
output_graph=True, 210 | ) 211 | 212 | if args.train: 213 | reward_buffer = [] 214 | 215 | for i_episode in range(num_episodes): 216 | 217 | episode_time = time.time() 218 | observation = env.reset() 219 | 220 | while True: 221 | if RENDER: 222 | env.render() 223 | 224 | action = RL.choose_action(observation) 225 | 226 | observation_, reward, done, info = env.step(action) 227 | 228 | RL.store_transition(observation, action, reward) 229 | 230 | if done: 231 | ep_rs_sum = sum(RL.ep_rs) 232 | 233 | if 'running_reward' not in globals(): 234 | running_reward = ep_rs_sum 235 | else: 236 | running_reward = running_reward * 0.99 + ep_rs_sum * 0.01 237 | #打开渲染 可视化游戏界面 238 | #if running_reward > DISPLAY_REWARD_THRESHOLD: 239 | # RENDER = True # rendering 240 | 241 | # print("episode:", i_episode, " reward:", int(running_reward)) 242 | 243 | print( 244 | "Episode [%d/%d] \tsum reward: %d \trunning reward: %f \ttook: %.5fs " % 245 | (i_episode, num_episodes, ep_rs_sum, running_reward, time.time() - episode_time) 246 | ) 247 | reward_buffer.append(running_reward) 248 | 249 | vt = RL.learn() 250 | break 251 | ''' 252 | plt.ion() 253 | plt.cla() 254 | plt.title('PG') 255 | plt.plot(reward_buffer, ) # plot the episode vt 256 | plt.xlabel('episode steps') 257 | plt.ylabel('normalized state-action value') 258 | plt.show() 259 | plt.pause(0.1) 260 | ''' 261 | 262 | 263 | observation = observation_ 264 | RL.save_ckpt() 265 | #plt.ioff() 266 | #plt.show() 267 | 268 | # test 269 | RL.load_ckpt() 270 | observation = env.reset() 271 | while True: 272 | env.render() 273 | action = RL.choose_action(observation) 274 | observation, reward, done, info = env.step(action) 275 | if done: 276 | observation = env.reset() -------------------------------------------------------------------------------- /code/tensrolayer-implemented/ppo.py: -------------------------------------------------------------------------------- 1 | """ 2 | Proximal Policy Optimization (PPO) 3 | ---------------------------- 4 | A simple version of Proximal Policy Optimization (PPO) using single thread. 5 | PPO is a family of first-order methods that use a few other tricks to keep new policies close to old. 6 | PPO methods are significantly simpler to implement, and empirically seem to perform at least as well as TRPO. 7 | Reference 8 | --------- 9 | Proximal Policy Optimization Algorithms, Schulman et al. 2017 10 | High Dimensional Continuous Control Using Generalized Advantage Estimation, Schulman et al. 2016 11 | Emergence of Locomotion Behaviours in Rich Environments, Heess et al. 
2017 12 | MorvanZhou's tutorial page: https://morvanzhou.github.io/tutorials 13 | Environment 14 | ----------- 15 | Openai Gym Pendulum-v0, continual action space 16 | Prerequisites 17 | -------------- 18 | tensorflow >=2.0.0a0 19 | tensorflow-probability 0.6.0 20 | tensorlayer >=2.0.0 21 | To run 22 | ------ 23 | python tutorial_PPO.py --train/test 24 | """ 25 | import argparse 26 | import os 27 | import time 28 | 29 | import matplotlib.pyplot as plt 30 | import numpy as np 31 | 32 | import gym 33 | import tensorflow as tf 34 | import tensorflow_probability as tfp 35 | import tensorlayer as tl 36 | 37 | parser = argparse.ArgumentParser(description='Train or test neural net motor controller.') 38 | parser.add_argument('--train', dest='train', action='store_true', default=True) 39 | parser.add_argument('--test', dest='train', action='store_false') 40 | args = parser.parse_args() 41 | 42 | ##################### hyper parameters #################### 43 | 44 | ENV_NAME = 'Pendulum-v0' # environment name 45 | RANDOMSEED = 1 # random seed 46 | 47 | EP_MAX = 1000 # total number of episodes for training 48 | EP_LEN = 200 # total number of steps for each episode 49 | GAMMA = 0.9 # reward discount 50 | A_LR = 0.0001 # learning rate for actor 51 | C_LR = 0.0002 # learning rate for critic 52 | BATCH = 32 # update batchsize 53 | A_UPDATE_STEPS = 10 # actor update steps 54 | C_UPDATE_STEPS = 10 # critic update steps 55 | S_DIM, A_DIM = 3, 1 # state dimension, action dimension 56 | EPS = 1e-8 # epsilon 57 | METHOD = [ 58 | dict(name='kl_pen', kl_target=0.01, lam=0.5), # KL penalty 59 | dict(name='clip', epsilon=0.2), # Clipped surrogate objective, find this is better 60 | ][1] # choose the method for optimization 61 | 62 | ############################### PPO #################################### 63 | 64 | 65 | class PPO(object): 66 | ''' 67 | PPO class 68 | ''' 69 | 70 | def __init__(self): 71 | 72 | # critic 输出v值 73 | tfs = tl.layers.Input([None, S_DIM], tf.float32, 'state') 74 | l1 = tl.layers.Dense(100, tf.nn.relu)(tfs) 75 | v = tl.layers.Dense(1)(l1) 76 | self.critic = tl.models.Model(tfs, v) 77 | self.critic.train() 78 | 79 | # actor 2个pi网络,固定一个,训练另一个 80 | self.actor = self._build_anet('pi', trainable=True) 81 | self.actor_old = self._build_anet('oldpi', trainable=False) 82 | self.actor_opt = tf.optimizers.Adam(A_LR) 83 | self.critic_opt = tf.optimizers.Adam(C_LR) 84 | 85 | def a_train(self, tfs, tfa, tfadv): 86 | ''' 87 | Update policy network 88 | :param tfs: state 89 | :param tfa: act 90 | :param tfadv: advantage 91 | :return: 92 | ''' 93 | tfs = np.array(tfs, np.float32) 94 | tfa = np.array(tfa, np.float32) 95 | tfadv = np.array(tfadv, np.float32) #优势函数 -值函数的变形 96 | with tf.GradientTape() as tape: 97 | mu, sigma = self.actor(tfs) 98 | pi = tfp.distributions.Normal(mu, sigma) #初始化策略 正态分布 99 | 100 | mu_old, sigma_old = self.actor_old(tfs) 101 | oldpi = tfp.distributions.Normal(mu_old, sigma_old) 102 | 103 | # ratio = tf.exp(pi.log_prob(self.tfa) - oldpi.log_prob(self.tfa)) 104 | ratio = pi.prob(tfa) / (oldpi.prob(tfa) + EPS) 105 | surr = ratio * tfadv 106 | if METHOD['name'] == 'kl_pen': #使用KL作为惩罚项的loss 107 | tflam = METHOD['lam'] 108 | kl = tfp.distributions.kl_divergence(oldpi, pi) 109 | kl_mean = tf.reduce_mean(kl) 110 | aloss = -(tf.reduce_mean(surr - tflam * kl)) 111 | else: # clipping method, find this is better 使用clip作为loss 112 | aloss = -tf.reduce_mean( 113 | tf.minimum(surr, 114 | tf.clip_by_value(ratio, 1. - METHOD['epsilon'], 1. 
+ METHOD['epsilon']) * tfadv) 115 | ) 116 | a_gard = tape.gradient(aloss, self.actor.trainable_weights) 117 | 118 | self.actor_opt.apply_gradients(zip(a_gard, self.actor.trainable_weights)) 119 | 120 | if METHOD['name'] == 'kl_pen': 121 | return kl_mean 122 | 123 | def update_old_pi(self): 124 | ''' 125 | Update old policy parameter 126 | :return: None 127 | ''' 128 | for p, oldp in zip(self.actor.trainable_weights, self.actor_old.trainable_weights): 129 | oldp.assign(p) #将新的pi的参数直接赋值给旧的pi 130 | 131 | def c_train(self, tfdc_r, s): #训练critic网络,mse优化 132 | ''' 133 | Update actor network 134 | :param tfdc_r: cumulative reward 135 | :param s: state 136 | :return: None 137 | ''' 138 | tfdc_r = np.array(tfdc_r, dtype=np.float32) 139 | with tf.GradientTape() as tape: 140 | v = self.critic(s) 141 | advantage = tfdc_r - v 142 | closs = tf.reduce_mean(tf.square(advantage)) 143 | # print('tfdc_r value', tfdc_r) 144 | grad = tape.gradient(closs, self.critic.trainable_weights) 145 | self.critic_opt.apply_gradients(zip(grad, self.critic.trainable_weights)) 146 | 147 | def cal_adv(self, tfs, tfdc_r): 148 | ''' 149 | Calculate advantage 150 | :param tfs: state 151 | :param tfdc_r: cumulative reward 152 | :return: advantage 153 | ''' 154 | tfdc_r = np.array(tfdc_r, dtype=np.float32) 155 | advantage = tfdc_r - self.critic(tfs) 156 | return advantage.numpy() 157 | 158 | def update(self, s, a, r): 159 | ''' 160 | Update parameter with the constraint of KL divergent 161 | :param s: state 162 | :param a: act 163 | :param r: reward 164 | :return: None 165 | ''' 166 | s, a, r = s.astype(np.float32), a.astype(np.float32), r.astype(np.float32) 167 | 168 | self.update_old_pi() 169 | adv = self.cal_adv(s, r) 170 | # adv = (adv - adv.mean())/(adv.std()+1e-6) # sometimes helpful 171 | 172 | # update actor 173 | if METHOD['name'] == 'kl_pen': 174 | for _ in range(A_UPDATE_STEPS): 175 | kl = self.a_train(s, a, adv) 176 | if kl > 4 * METHOD['kl_target']: # this in in google's paper 177 | break 178 | if kl < METHOD['kl_target'] / 1.5: # adaptive lambda, this is in OpenAI's paper 179 | METHOD['lam'] /= 2 180 | elif kl > METHOD['kl_target'] * 1.5: 181 | METHOD['lam'] *= 2 182 | METHOD['lam'] = np.clip( 183 | METHOD['lam'], 1e-4, 10 184 | ) # sometimes explode, this clipping is MorvanZhou's solution 185 | else: # clipping method, find this is better (OpenAI's paper) 186 | for _ in range(A_UPDATE_STEPS): 187 | self.a_train(s, a, adv) 188 | 189 | # update critic 190 | for _ in range(C_UPDATE_STEPS): 191 | self.c_train(r, s) #Critic的训练一次更新十步 192 | 193 | def _build_anet(self, name, trainable): 194 | ''' 195 | Build policy network 196 | :param name: name 197 | :param trainable: trainable flag 198 | :return: policy network 199 | ''' 200 | tfs = tl.layers.Input([None, S_DIM], tf.float32, name + '_state') 201 | l1 = tl.layers.Dense(100, tf.nn.relu, name=name + '_l1')(tfs) 202 | a = tl.layers.Dense(A_DIM, tf.nn.tanh, name=name + '_a')(l1) 203 | mu = tl.layers.Lambda(lambda x: x * 2, name=name + '_lambda')(a) 204 | sigma = tl.layers.Dense(A_DIM, tf.nn.softplus, name=name + '_sigma')(l1) 205 | model = tl.models.Model(tfs, [mu, sigma], name) 206 | 207 | if trainable: 208 | model.train() 209 | else: 210 | model.eval() 211 | return model 212 | 213 | def choose_action(self, s): 214 | ''' 215 | Choose action 216 | :param s: state 217 | :return: clipped act 218 | ''' 219 | s = s[np.newaxis, :].astype(np.float32) 220 | mu, sigma = self.actor(s) 221 | pi = tfp.distributions.Normal(mu, sigma) 222 | a = tf.squeeze(pi.sample(1), axis=0)[0] # 
choosing action 223 | return np.clip(a, -2, 2) 224 | 225 | def get_v(self, s): 226 | ''' 227 | Compute value 228 | :param s: state 229 | :return: value 230 | ''' 231 | s = s.astype(np.float32) 232 | if s.ndim < 2: s = s[np.newaxis, :] 233 | return self.critic(s)[0, 0] 234 | 235 | def save_ckpt(self): 236 | """ 237 | save trained weights 238 | :return: None 239 | """ 240 | if not os.path.exists('model'): 241 | os.makedirs('model') 242 | tl.files.save_weights_to_hdf5('model/ppo_actor.hdf5', self.actor) 243 | tl.files.save_weights_to_hdf5('model/ppo_actor_old.hdf5', self.actor_old) 244 | tl.files.save_weights_to_hdf5('model/ppo_critic.hdf5', self.critic) 245 | 246 | def load_ckpt(self): 247 | """ 248 | load trained weights 249 | :return: None 250 | """ 251 | tl.files.load_hdf5_to_weights_in_order('model/ppo_actor.hdf5', self.actor) 252 | tl.files.load_hdf5_to_weights_in_order('model/ppo_actor_old.hdf5', self.actor_old) 253 | tl.files.load_hdf5_to_weights_in_order('model/ppo_critic.hdf5', self.critic) 254 | 255 | 256 | if __name__ == '__main__': 257 | 258 | env = gym.make(ENV_NAME).unwrapped 259 | 260 | # reproducible 261 | env.seed(RANDOMSEED) 262 | np.random.seed(RANDOMSEED) 263 | tf.random.set_seed(RANDOMSEED) 264 | 265 | ppo = PPO() 266 | 267 | if args.train: 268 | all_ep_r = [] 269 | for ep in range(EP_MAX): 270 | s = env.reset() 271 | buffer_s, buffer_a, buffer_r = [], [], [] 272 | ep_r = 0 273 | t0 = time.time() 274 | for t in range(EP_LEN): # in one episode 275 | # env.render() 276 | a = ppo.choose_action(s) 277 | s_, r, done, _ = env.step(a) 278 | buffer_s.append(s) 279 | buffer_a.append(a) 280 | buffer_r.append((r + 8) / 8) # normalize reward, find to be useful 281 | s = s_ 282 | ep_r += r 283 | 284 | # update ppo 285 | if (t + 1) % BATCH == 0 or t == EP_LEN - 1: #采集了一个Batch的数据或者走到最后一步了才进行更新 286 | v_s_ = ppo.get_v(s_) 287 | discounted_r = [] 288 | for r in buffer_r[::-1]: 289 | v_s_ = r + GAMMA * v_s_ 290 | discounted_r.append(v_s_) 291 | discounted_r.reverse() 292 | 293 | bs, ba, br = np.vstack(buffer_s), np.vstack(buffer_a), np.array(discounted_r)[:, np.newaxis] 294 | buffer_s, buffer_a, buffer_r = [], [], [] 295 | ppo.update(bs, ba, br) 296 | if ep == 0: 297 | all_ep_r.append(ep_r) 298 | else: 299 | all_ep_r.append(all_ep_r[-1] * 0.9 + ep_r * 0.1) 300 | print( 301 | 'Episode: {}/{} | Episode Reward: {:.4f} | Running Time: {:.4f}'.format( 302 | ep, EP_MAX, ep_r, 303 | time.time() - t0 304 | ) 305 | ) 306 | 307 | plt.ion() 308 | plt.cla() 309 | plt.title('PPO') 310 | plt.plot(np.arange(len(all_ep_r)), all_ep_r) 311 | plt.ylim(-2000, 0) 312 | plt.xlabel('Episode') 313 | plt.ylabel('Moving averaged episode reward') 314 | plt.show() 315 | plt.pause(0.1) 316 | ppo.save_ckpt() 317 | plt.ioff() 318 | plt.show() 319 | 320 | # test 321 | ppo.load_ckpt() 322 | while True: 323 | s = env.reset() 324 | for i in range(EP_LEN): 325 | env.render() 326 | s, r, done, _ = env.step(ppo.choose_action(s)) 327 | if done: 328 | break -------------------------------------------------------------------------------- /code/tensrolayer-implemented/qlearning.py: -------------------------------------------------------------------------------- 1 | """Q-Table learning algorithm. 2 | Non deep learning - TD Learning, Off-Policy, e-Greedy Exploration 3 | Q(S, A) <- Q(S, A) + alpha * (R + lambda * Q(newS, newA) - Q(S, A)) 4 | See David Silver RL Tutorial Lecture 5 - Q-Learning for more details. 
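Illustrative worked step (not part of the original text): with alpha = 0.85 and
lambda = 0.99 as set below, reaching the goal from a state where Q(S, A) = 0.3
gives R = 1 and max_a Q(newS, a) = 0 (terminal state), so
    Q(S, A) <- 0.3 + 0.85 * (1 + 0.99 * 0 - 0.3) = 0.895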
5 | For Q-Network, see tutorial_frozenlake_q_network.py 6 | EN: https://medium.com/emergent-future/simple-reinforcement-learning-with-tensorflow-part-0-q-learning-with-tables-and-neural-networks-d195264329d0#.5m3361vlw 7 | CN: https://zhuanlan.zhihu.com/p/25710327 8 | tensorflow==2.0.0a0 9 | tensorlayer==2.0.0 10 | """ 11 | 12 | import time 13 | 14 | import numpy as np 15 | 16 | import gym 17 | 18 | ## Load the environment 19 | env = gym.make('FrozenLake-v0') 20 | render = False # display the game environment 21 | running_reward = None 22 | 23 | ##================= Implement Q-Table learning algorithm =====================## 24 | ## Initialize table with all zeros 25 | Q = np.zeros([env.observation_space.n, env.action_space.n]) 26 | ## Set learning parameters 27 | lr = .85 # alpha, if use value function approximation, we can ignore it 28 | lambd = .99 # decay factor 29 | num_episodes = 10000 30 | rList = [] # rewards for each episode 31 | for i in range(num_episodes): 32 | ## Reset environment and get first new observation 33 | episode_time = time.time() 34 | s = env.reset() 35 | rAll = 0 36 | ## The Q-Table learning algorithm 37 | for j in range(99): 38 | if render: env.render() 39 | ## Choose an action by greedily (with noise) picking from Q table 40 | a = np.argmax(Q[s, :] + np.random.randn(1, env.action_space.n) * (1. / (i + 1))) 41 | ## Get new state and reward from environment 42 | s1, r, d, _ = env.step(a) 43 | ## Update Q-Table with new knowledge 44 | Q[s, a] = Q[s, a] + lr * (r + lambd * np.max(Q[s1, :]) - Q[s, a]) 45 | rAll += r 46 | s = s1 47 | if d ==True: 48 | break 49 | rList.append(rAll) 50 | running_reward = r if running_reward is None else running_reward * 0.99 + r * 0.01 51 | print("Episode [%d/%d] sum reward: %f running reward: %f took: %.5fs " % \ 52 | (i, num_episodes, rAll, running_reward, time.time() - episode_time)) 53 | 54 | print("Final Q-Table Values:/n %s" % Q) -------------------------------------------------------------------------------- /code/tensrolayer-implemented/tutorial_wrappers.py: -------------------------------------------------------------------------------- 1 | """Env wrappers 2 | Note that this file is adapted from `https://pypi.org/project/gym-vec-env` and 3 | `https://github.com/openai/baselines/blob/master/baselines/common/*wrappers.py` 4 | """ 5 | from collections import deque 6 | from functools import partial 7 | from multiprocessing import Pipe, Process, cpu_count 8 | from sys import platform 9 | 10 | import numpy as np 11 | 12 | import cv2 13 | import gym 14 | from gym import spaces 15 | 16 | __all__ = ( 17 | 'build_env', # build env 18 | 'TimeLimit', # Time limit wrapper 19 | 'NoopResetEnv', # Run random number of no-ops on reset 20 | 'FireResetEnv', # Reset wrapper for envs with fire action 21 | 'EpisodicLifeEnv', # end-of-life == end-of-episode wrapper 22 | 'MaxAndSkipEnv', # skip frame wrapper 23 | 'ClipRewardEnv', # clip reward wrapper 24 | 'WarpFrame', # warp observation wrapper 25 | 'FrameStack', # stack frame wrapper 26 | 'LazyFrames', # lazy store wrapper 27 | 'RewardScaler', # reward scale 28 | 'SubprocVecEnv', # vectorized env wrapper 29 | 'VecFrameStack', # stack frames in vectorized env 30 | 'Monitor', # Episode reward and length monitor 31 | ) 32 | cv2.ocl.setUseOpenCL(False) 33 | # env_id -> env_type 34 | id2type = dict() 35 | for _env in gym.envs.registry.all(): 36 | id2type[_env.id] = _env.entry_point.split(':')[0].rsplit('.', 1)[1] 37 | 38 | 39 | def build_env(env_id, vectorized=False, seed=0, reward_scale=1.0, nenv=0): 40 
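    # Illustrative usage (not part of the original file), as called from dqn_variants.py:
    #   env = build_env('CartPole-v0', seed=0)          # classic_control -> Monitor-wrapped env
    #   env = build_env('PongNoFrameskip-v4', seed=0)   # atari -> noop/skip/warp/clip + 4-frame stack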
| """Build env based on options""" 41 | env_type = id2type[env_id] 42 | nenv = nenv or cpu_count() // (1 + (platform == 'darwin')) 43 | stack = env_type == 'atari' 44 | if not vectorized: 45 | env = _make_env(env_id, env_type, seed, reward_scale, stack) 46 | else: 47 | env = _make_vec_env(env_id, env_type, nenv, seed, reward_scale, stack) 48 | 49 | return env 50 | 51 | 52 | def _make_env(env_id, env_type, seed, reward_scale, frame_stack=True): 53 | """Make single env""" 54 | if env_type == 'atari': 55 | env = gym.make(env_id) 56 | assert 'NoFrameskip' in env.spec.id 57 | env = NoopResetEnv(env, noop_max=30) 58 | env = MaxAndSkipEnv(env, skip=4) 59 | env = Monitor(env) 60 | # deepmind wrap 61 | env = EpisodicLifeEnv(env) 62 | if 'FIRE' in env.unwrapped.get_action_meanings(): 63 | env = FireResetEnv(env) 64 | env = WarpFrame(env) 65 | env = ClipRewardEnv(env) 66 | if frame_stack: 67 | env = FrameStack(env, 4) 68 | elif env_type == 'classic_control': 69 | env = Monitor(gym.make(env_id)) 70 | else: 71 | raise NotImplementedError 72 | if reward_scale != 1: 73 | env = RewardScaler(env, reward_scale) 74 | env.seed(seed) 75 | return env 76 | 77 | 78 | def _make_vec_env(env_id, env_type, nenv, seed, reward_scale, frame_stack=True): 79 | """Make vectorized env""" 80 | env = SubprocVecEnv([partial(_make_env, env_id, env_type, seed + i, reward_scale, False) for i in range(nenv)]) 81 | if frame_stack: 82 | env = VecFrameStack(env, 4) 83 | return env 84 | 85 | 86 | class TimeLimit(gym.Wrapper): 87 | 88 | def __init__(self, env, max_episode_steps=None): 89 | super(TimeLimit, self).__init__(env) 90 | self._max_episode_steps = max_episode_steps 91 | self._elapsed_steps = 0 92 | 93 | def step(self, ac): 94 | observation, reward, done, info = self.env.step(ac) 95 | self._elapsed_steps += 1 96 | if self._elapsed_steps >= self._max_episode_steps: 97 | done = True 98 | info['TimeLimit.truncated'] = True 99 | return observation, reward, done, info 100 | 101 | def reset(self, **kwargs): 102 | self._elapsed_steps = 0 103 | return self.env.reset(**kwargs) 104 | 105 | 106 | class NoopResetEnv(gym.Wrapper): 107 | 108 | def __init__(self, env, noop_max=30): 109 | """Sample initial states by taking random number of no-ops on reset. 110 | No-op is assumed to be action 0. 
111 | """ 112 | super(NoopResetEnv, self).__init__(env) 113 | self.noop_max = noop_max 114 | self.override_num_noops = None 115 | self.noop_action = 0 116 | assert env.unwrapped.get_action_meanings()[0] == 'NOOP' 117 | 118 | def reset(self, **kwargs): 119 | """ Do no-op action for a number of steps in [1, noop_max].""" 120 | self.env.reset(**kwargs) 121 | if self.override_num_noops is not None: 122 | noops = self.override_num_noops 123 | else: 124 | noops = self.unwrapped.np_random.randint(1, self.noop_max + 1) 125 | assert noops > 0 126 | obs = None 127 | for _ in range(noops): 128 | obs, _, done, _ = self.env.step(self.noop_action) 129 | if done: 130 | obs = self.env.reset(**kwargs) 131 | return obs 132 | 133 | def step(self, ac): 134 | return self.env.step(ac) 135 | 136 | 137 | class FireResetEnv(gym.Wrapper): 138 | 139 | def __init__(self, env): 140 | """Take action on reset for environments that are fixed until firing.""" 141 | super(FireResetEnv, self).__init__(env) 142 | assert env.unwrapped.get_action_meanings()[1] == 'FIRE' 143 | assert len(env.unwrapped.get_action_meanings()) >= 3 144 | 145 | def reset(self, **kwargs): 146 | self.env.reset(**kwargs) 147 | obs, _, done, _ = self.env.step(1) 148 | if done: 149 | self.env.reset(**kwargs) 150 | obs, _, done, _ = self.env.step(2) 151 | if done: 152 | self.env.reset(**kwargs) 153 | return obs 154 | 155 | def step(self, ac): 156 | return self.env.step(ac) 157 | 158 | 159 | class EpisodicLifeEnv(gym.Wrapper): 160 | 161 | def __init__(self, env): 162 | """Make end-of-life == end-of-episode, but only reset on true game over. 163 | Done by DeepMind for the DQN and co. since it helps value estimation. 164 | """ 165 | super(EpisodicLifeEnv, self).__init__(env) 166 | self.lives = 0 167 | self.was_real_done = True 168 | 169 | def step(self, action): 170 | obs, reward, done, info = self.env.step(action) 171 | self.was_real_done = done 172 | # check current lives, make loss of life terminal, 173 | # then update lives to handle bonus lives 174 | lives = self.env.unwrapped.ale.lives() 175 | if 0 < lives < self.lives: 176 | # for Qbert sometimes we stay in lives == 0 condition for a few 177 | # frames so it's important to keep lives > 0, so that we only reset 178 | # once the environment advertises done. 179 | done = True 180 | self.lives = lives 181 | return obs, reward, done, info 182 | 183 | def reset(self, **kwargs): 184 | """Reset only when lives are exhausted. 185 | This way all states are still reachable even though lives are episodic, 186 | and the learner need not know about any of this behind-the-scenes. 
187 | """ 188 | if self.was_real_done: 189 | obs = self.env.reset(**kwargs) 190 | else: 191 | # no-op step to advance from terminal/lost life state 192 | obs, _, _, _ = self.env.step(0) 193 | self.lives = self.env.unwrapped.ale.lives() 194 | return obs 195 | 196 | 197 | class MaxAndSkipEnv(gym.Wrapper): 198 | 199 | def __init__(self, env, skip=4): 200 | """Return only every `skip`-th frame""" 201 | super(MaxAndSkipEnv, self).__init__(env) 202 | # most recent raw observations (for max pooling across time steps) 203 | shape = (2, ) + env.observation_space.shape 204 | self._obs_buffer = np.zeros(shape, dtype=np.uint8) 205 | self._skip = skip 206 | 207 | def step(self, action): 208 | """Repeat action, sum reward, and max over last observations.""" 209 | total_reward = 0.0 210 | done = info = None 211 | for i in range(self._skip): 212 | obs, reward, done, info = self.env.step(action) 213 | if i == self._skip - 2: 214 | self._obs_buffer[0] = obs 215 | if i == self._skip - 1: 216 | self._obs_buffer[1] = obs 217 | total_reward += reward 218 | if done: 219 | break 220 | # Note that the observation on the done=True frame doesn't matter 221 | max_frame = self._obs_buffer.max(axis=0) 222 | 223 | return max_frame, total_reward, done, info 224 | 225 | def reset(self, **kwargs): 226 | return self.env.reset(**kwargs) 227 | 228 | 229 | class ClipRewardEnv(gym.RewardWrapper): 230 | 231 | def __init__(self, env): 232 | super(ClipRewardEnv, self).__init__(env) 233 | 234 | def reward(self, reward): 235 | """Bin reward to {+1, 0, -1} by its sign.""" 236 | return np.sign(reward) 237 | 238 | 239 | class WarpFrame(gym.ObservationWrapper): 240 | 241 | def __init__(self, env, width=84, height=84, grayscale=True): 242 | """Warp frames to 84x84 as done in the Nature paper and later work.""" 243 | super(WarpFrame, self).__init__(env) 244 | self.width = width 245 | self.height = height 246 | self.grayscale = grayscale 247 | shape = (self.height, self.width, 1 if self.grayscale else 3) 248 | self.observation_space = spaces.Box(low=0, high=255, shape=shape, dtype=np.uint8) 249 | 250 | def observation(self, frame): 251 | if self.grayscale: 252 | frame = cv2.cvtColor(frame, cv2.COLOR_RGB2GRAY) 253 | size = (self.width, self.height) 254 | frame = cv2.resize(frame, size, interpolation=cv2.INTER_AREA) 255 | if self.grayscale: 256 | frame = np.expand_dims(frame, -1) 257 | return frame 258 | 259 | 260 | class FrameStack(gym.Wrapper): 261 | 262 | def __init__(self, env, k): 263 | """Stack k last frames. 264 | Returns lazy array, which is much more memory efficient. 265 | See Also `LazyFrames` 266 | """ 267 | super(FrameStack, self).__init__(env) 268 | self.k = k 269 | self.frames = deque([], maxlen=k) 270 | shp = env.observation_space.shape 271 | shape = shp[:-1] + (shp[-1] * k, ) 272 | self.observation_space = spaces.Box(low=0, high=255, shape=shape, dtype=env.observation_space.dtype) 273 | 274 | def reset(self): 275 | ob = self.env.reset() 276 | for _ in range(self.k): 277 | self.frames.append(ob) 278 | return np.asarray(self._get_ob()) 279 | 280 | def step(self, action): 281 | ob, reward, done, info = self.env.step(action) 282 | self.frames.append(ob) 283 | return np.asarray(self._get_ob()), reward, done, info 284 | 285 | def _get_ob(self): 286 | assert len(self.frames) == self.k 287 | return LazyFrames(list(self.frames)) 288 | 289 | 290 | class LazyFrames(object): 291 | 292 | def __init__(self, frames): 293 | """This object ensures that common frames between the observations are 294 | only stored once. 
It exists purely to optimize memory usage which can be 295 | huge for DQN's 1M frames replay buffers. 296 | This object should only be converted to numpy array before being passed 297 | to the model. You'd not believe how complex the previous solution was. 298 | """ 299 | self._frames = frames 300 | self._out = None 301 | 302 | def _force(self): 303 | if self._out is None: 304 | self._out = np.concatenate(self._frames, axis=-1) 305 | self._frames = None 306 | return self._out 307 | 308 | def __array__(self, dtype=None): 309 | out = self._force() 310 | if dtype is not None: 311 | out = out.astype(dtype) 312 | return out 313 | 314 | def __len__(self): 315 | return len(self._force()) 316 | 317 | def __getitem__(self, i): 318 | return self._force()[i] 319 | 320 | 321 | class RewardScaler(gym.RewardWrapper): 322 | """Bring rewards to a reasonable scale for PPO. 323 | This is incredibly important and effects performance drastically. 324 | """ 325 | 326 | def __init__(self, env, scale=0.01): 327 | super(RewardScaler, self).__init__(env) 328 | self.scale = scale 329 | 330 | def reward(self, reward): 331 | return reward * self.scale 332 | 333 | 334 | class VecFrameStack(object): 335 | 336 | def __init__(self, env, k): 337 | self.env = env 338 | self.k = k 339 | self.action_space = env.action_space 340 | self.frames = deque([], maxlen=k) 341 | shp = env.observation_space.shape 342 | shape = shp[:-1] + (shp[-1] * k, ) 343 | self.observation_space = spaces.Box(low=0, high=255, shape=shape, dtype=env.observation_space.dtype) 344 | 345 | def reset(self): 346 | ob = self.env.reset() 347 | for _ in range(self.k): 348 | self.frames.append(ob) 349 | return np.asarray(self._get_ob()) 350 | 351 | def step(self, action): 352 | ob, reward, done, info = self.env.step(action) 353 | self.frames.append(ob) 354 | return np.asarray(self._get_ob()), reward, done, info 355 | 356 | def _get_ob(self): 357 | assert len(self.frames) == self.k 358 | return LazyFrames(list(self.frames)) 359 | 360 | 361 | def _worker(remote, parent_remote, env_fn_wrapper): 362 | parent_remote.close() 363 | env = env_fn_wrapper.x() 364 | while True: 365 | cmd, data = remote.recv() 366 | if cmd == 'step': 367 | ob, reward, done, info = env.step(data) 368 | if done: 369 | ob = env.reset() 370 | remote.send((ob, reward, done, info)) 371 | elif cmd == 'reset': 372 | ob = env.reset() 373 | remote.send(ob) 374 | elif cmd == 'reset_task': 375 | ob = env._reset_task() 376 | remote.send(ob) 377 | elif cmd == 'close': 378 | remote.close() 379 | break 380 | elif cmd == 'get_spaces': 381 | remote.send((env.observation_space, env.action_space)) 382 | else: 383 | raise NotImplementedError 384 | 385 | 386 | class CloudpickleWrapper(object): 387 | """ 388 | Uses cloudpickle to serialize contents 389 | """ 390 | 391 | def __init__(self, x): 392 | self.x = x 393 | 394 | def __getstate__(self): 395 | import cloudpickle 396 | return cloudpickle.dumps(self.x) 397 | 398 | def __setstate__(self, ob): 399 | import pickle 400 | self.x = pickle.loads(ob) 401 | 402 | 403 | class SubprocVecEnv(object): 404 | 405 | def __init__(self, env_fns): 406 | """ 407 | envs: list of gym environments to run in subprocesses 408 | """ 409 | self.num_envs = len(env_fns) 410 | 411 | self.waiting = False 412 | self.closed = False 413 | nenvs = len(env_fns) 414 | self.nenvs = nenvs 415 | self.remotes, self.work_remotes = zip(*[Pipe() for _ in range(nenvs)]) 416 | zipped_args = zip(self.work_remotes, self.remotes, env_fns) 417 | self.ps = [ 418 | Process(target=_worker, args=(work_remote, 
remote, CloudpickleWrapper(env_fn))) 419 | for (work_remote, remote, env_fn) in zipped_args 420 | ] 421 | 422 | for p in self.ps: 423 | # if the main process crashes, we should not cause things to hang 424 | p.daemon = True 425 | p.start() 426 | for remote in self.work_remotes: 427 | remote.close() 428 | 429 | self.remotes[0].send(('get_spaces', None)) 430 | observation_space, action_space = self.remotes[0].recv() 431 | self.observation_space = observation_space 432 | self.action_space = action_space 433 | 434 | def _step_async(self, actions): 435 | """ 436 | Tell all the environments to start taking a step 437 | with the given actions. 438 | Call step_wait() to get the results of the step. 439 | You should not call this if a step_async run is 440 | already pending. 441 | """ 442 | for remote, action in zip(self.remotes, actions): 443 | remote.send(('step', action)) 444 | self.waiting = True 445 | 446 | def _step_wait(self): 447 | """ 448 | Wait for the step taken with step_async(). 449 | Returns (obs, rews, dones, infos): 450 | - obs: an array of observations, or a tuple of 451 | arrays of observations. 452 | - rews: an array of rewards 453 | - dones: an array of "episode done" booleans 454 | - infos: a sequence of info objects 455 | """ 456 | results = [remote.recv() for remote in self.remotes] 457 | self.waiting = False 458 | obs, rews, dones, infos = zip(*results) 459 | return np.stack(obs), np.stack(rews), np.stack(dones), infos 460 | 461 | def reset(self): 462 | """ 463 | Reset all the environments and return an array of 464 | observations, or a tuple of observation arrays. 465 | If step_async is still doing work, that work will 466 | be cancelled and step_wait() should not be called 467 | until step_async() is invoked again. 468 | """ 469 | for remote in self.remotes: 470 | remote.send(('reset', None)) 471 | return np.stack([remote.recv() for remote in self.remotes]) 472 | 473 | def _reset_task(self): 474 | for remote in self.remotes: 475 | remote.send(('reset_task', None)) 476 | return np.stack([remote.recv() for remote in self.remotes]) 477 | 478 | def close(self): 479 | if self.closed: 480 | return 481 | if self.waiting: 482 | for remote in self.remotes: 483 | remote.recv() 484 | for remote in self.remotes: 485 | remote.send(('close', None)) 486 | for p in self.ps: 487 | p.join() 488 | self.closed = True 489 | 490 | def __len__(self): 491 | return self.nenvs 492 | 493 | def step(self, actions): 494 | self._step_async(actions) 495 | return self._step_wait() 496 | 497 | 498 | class Monitor(gym.Wrapper): 499 | 500 | def __init__(self, env): 501 | super(Monitor, self).__init__(env) 502 | self._monitor_rewards = None 503 | 504 | def reset(self, **kwargs): 505 | self._monitor_rewards = [] 506 | return self.env.reset(**kwargs) 507 | 508 | def step(self, action): 509 | o_, r, done, info = self.env.step(action) 510 | self._monitor_rewards.append(r) 511 | if done: 512 | info['episode'] = {'r': sum(self._monitor_rewards), 'l': len(self._monitor_rewards)} 513 | return o_, r, done, info 514 | 515 | 516 | class NormalizedActions(gym.ActionWrapper): 517 | 518 | def _action(self, action): 519 | low = self.action_space.low 520 | high = self.action_space.high 521 | 522 | action = low + (action + 1.0) * 0.5 * (high - low) 523 | action = np.clip(action, low, high) 524 | 525 | return action 526 | 527 | def _reverse_action(self, action): 528 | low = self.action_space.low 529 | high = self.action_space.high 530 | 531 | action = 2 * (action - low) / (high - low) - 1 532 | action = np.clip(action, low, 
high) 533 | 534 | return action 535 | 536 | 537 | def unit_test(): 538 | env_id = 'CartPole-v0' 539 | unwrapped_env = gym.make(env_id) 540 | wrapped_env = build_env(env_id, False) 541 | o = wrapped_env.reset() 542 | print('Reset {} observation shape {}'.format(env_id, o.shape)) 543 | done = False 544 | while not done: 545 | a = unwrapped_env.action_space.sample() 546 | o_, r, done, info = wrapped_env.step(a) 547 | print('Take action {} get reward {} info {}'.format(a, r, info)) 548 | 549 | env_id = 'PongNoFrameskip-v4' 550 | nenv = 2 551 | unwrapped_env = gym.make(env_id) 552 | wrapped_env = build_env(env_id, True, nenv=nenv) 553 | o = wrapped_env.reset() 554 | print('Reset {} observation shape {}'.format(env_id, o.shape)) 555 | for _ in range(1000): 556 | a = [unwrapped_env.action_space.sample() for _ in range(nenv)] 557 | a = np.asarray(a, 'int64') 558 | o_, r, done, info = wrapped_env.step(a) 559 | print('Take action {} get reward {} info {}'.format(a, r, info)) 560 | 561 | 562 | if __name__ == '__main__': 563 | unit_test() -------------------------------------------------------------------------------- /notes/.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/morningsky/NTU-ReinforcementLearning-Notes/d4a9dbf584ae24d974d9b7839f34cee3f18b79dd/notes/.DS_Store -------------------------------------------------------------------------------- /notes/1 Introduction.md: -------------------------------------------------------------------------------- 1 | # 李宏毅深度强化学习 笔记 2 | 3 | 课程主页:[NTU-MLDS18](http://speech.ee.ntu.edu.tw/~tlkagk/courses_MLDS18.html) 4 | 5 | 视频:[youtube](https://www.youtube.com/playlist?list=PLJV_el3uVTsODxQFgzMzPLa16h6B8kWM_) [B站](https://www.bilibili.com/video/av24724071/?spm_id_from=333.788.videocard.4) 6 | 7 | 参考资料: [作业代码参考](https://github.com/JasonYao81000/MLDS2018SPRING/tree/master/hw4) [纯numpy实现非Deep的RL算法](https://github.com/ddbourgin/numpy-ml/tree/master/numpy_ml/rl_models) [OpenAI tutorial](https://github.com/openai/spinningup/tree/master/docs) 8 | 9 | # 1. 
Introduction 10 | 11 | ![1](http://oss.hackslog.cn/imgs/075034.png) 12 | 13 | 14 | 15 | 这门课的学习路线如上,强化学习是作为单独一个模块介绍。李宏毅老师讲这门课不是从MDP开始讲起,而是从如何获得最大化奖励出发,直接引出Policy Gradient(以及PPO),再讲Q-learning(原始Q-learning,DQN,各种DQN的升级),然后是A2C(以及A3C, DDPG),紧接着介绍了一些Reward Shaping的方法(主要是Curiosity,Curriculum Learning ,Hierarchical RL),最后介绍Imitation Learning (Inverse RL)。比较全面的展现了深度强化学习的核心内容,也比较直观。 16 | 17 | ![image-20191029211249836](http://oss.hackslog.cn/imgs/075024.png) 18 | 19 | 首先强化学习是一种解决序列决策问题的方法,他是通过与环境交互进行学习。首先会有一个Env,给agent一个state,agent根据得到的state执行一个action,这个action会改变Env,使自己跳转到下一个state,同时Env会反馈给agent一个reward,agent学习的目标就是通过采取action,使得reward的期望最大化。 20 | 21 | ![image-20191029211454593](http://oss.hackslog.cn/imgs/075050.png) 22 | 23 | 24 | 25 | 在alpha go的例子中,state(又称observation)为所看到的棋盘,action就是落子,reward通过围棋的规则给出,如果最终赢了,得1,输了,得-1。 26 | 27 | 下面从2个例子中看强化学习与有监督学习的区别。RL不需要给定标签,但需要有reward。 28 | 29 | ![image-20191029211749897](http://oss.hackslog.cn/imgs/075057.png) 30 | 31 | 实际上alphgo是从提前收集的数据上进行有监督学习,效果不错后,再去做强化学习,提高水平。 32 | 33 | ![image-20191029211848429](http://oss.hackslog.cn/imgs/075104.png) 34 | 35 | 36 | 37 | 人没有告诉机器人具体哪里说错了,机器需要根据最终的评价自己总结,一般需要对话好多次。所以通常训练对话模型会训练2个agent互相对话 38 | 39 | ![image-20191029212015623](http://oss.hackslog.cn/imgs/075108.png) 40 | 41 | ![image-20191029212144625](http://oss.hackslog.cn/imgs/075112.png) 42 | 43 | 44 | 45 | 一个难点是怎么判断对话的效果,一般会设置一些预先定义的规则。 46 | 47 | ![image-20191029212313069](http://oss.hackslog.cn/imgs/075117.png) 48 | 49 | 强化学习还有很多成功的应用,凡是序列决策问题,大多数可以用RL解决。 50 | 51 | -------------------------------------------------------------------------------- /notes/2 Policy Gradient.md: -------------------------------------------------------------------------------- 1 | # 2. Policy Gradient 2 | 3 | ## 2.1 Origin Policy Gradient 4 | 5 | ![](http://oss.hackslog.cn/imgs/075626.jpg) 6 | 7 | 在alpha go场景中,actor决定下哪个位置,env就是你的对手,reward是围棋的规则。强化学习三大基本组件里面,env和reward是事先给定的,我们唯一能做的就是通过调整actor,使得到的累积reward最大化。 8 | 9 | 10 | 11 | ![](http://oss.hackslog.cn/imgs/075657.jpg) 12 | 13 | 14 | 15 | 一般把actor的策略定义成Policy,数学符号为$\pi$,参数是$\theta$,本质是一个NN(神经网络)。 16 | 17 | 18 | 19 | 那么针对Atari游戏:输入游戏的画面,Policy $\pi$输出各个动作的概率,agent根据这个概率分布采取行动。通过调整$\theta$, 我们就可以调整策略的输出。 20 | 21 | 22 | 23 | ![](http://oss.hackslog.cn/imgs/075712.jpg)_page-0007) 24 | 25 | 26 | 27 | 每次采取一个行动会有一个reward 28 | 29 | 30 | 31 | ![](http://oss.hackslog.cn/imgs/075809.jpg) 32 | 33 | 玩一场游戏叫做一个episode,actor存在的目的就是最大化所能得到的return,这个return指的是每一个时间步得到的reward之和。注意我们期望最大化的是return,不是一个时刻的reward。 34 | 35 | 36 | 37 | 如果max的目标是当下时刻的reward,那么在Atari游戏中如果agent在某个s下执行开火,得到了较大的reward,那么可能agent就会一直选择开火。并不代表,最终能够取得游戏的胜利。 38 | 39 | 40 | 41 | 那么,怎么得到这个actor呢? 42 | 43 | 44 | 45 | ![](http://oss.hackslog.cn/imgs/075903.jpg) 46 | 47 | 48 | 49 | 先定义玩一次游戏,即一个episode的游戏记录为trajectory $\tau$,内容如图所示,是s-a组成的序列对。 50 | 51 | 52 | 53 | 假设actor的参数$\theta$已经给定,则可以得到每个$\tau$出现的概率。这个概率取决于两部分,$p\left(s_{t+1} | s_{t}, a_{t}\right)$部分由env的机制决定,actor没法控制,我们能控制的是$p_{\theta}\left(a_{t} | s_{t}\right)$ 由$\pi$的参数$\theta$决定。 54 | 55 | 56 | 57 | ![](http://oss.hackslog.cn/imgs/075918.jpg) 58 | 59 | 60 | 61 | 定义$R(\tau)$ 为一个episode的总的reward,即每个时间步下的即时reward相加,我习惯表述为return。 62 | 63 | 64 | 65 | 定义$\bar{R}_{\theta}$ 为$R(\tau)$的期望,等价于将每一个轨迹$\tau$出现的概率乘与其return,再求和。 66 | 67 | 68 | 69 | 由于$R(\tau)$是一个随机变量,因为actor本身在给定同样的state下会采取什么行为具有随机性,env在给定行为下输出什么state,也是随机的,所以只能算$R(\tau)$的期望。 70 | 71 | 72 | 73 | 我们的目标就变成了最大化Expected Reward,那么如何最大化? 
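这里的期望在实现中就是用采样平均来近似的:采样N条轨迹,对每条轨迹的$R(\tau)$求和再取平均。下面给出一个极简的numpy示意(轨迹和reward都是随手构造的假设数据,并非课程或repo里的代码):

```python
import numpy as np

# 假设采样到了3条轨迹(episode),每个子列表是该轨迹中每个时间步的即时reward r_t
sampled_episodes = [
    [0.0, 0.0, 1.0],   # R(tau_1) = 1
    [0.0, 1.0, 1.0],   # R(tau_2) = 2
    [0.0, 0.0, 0.0],   # R(tau_3) = 0
]

returns = [np.sum(r) for r in sampled_episodes]  # 每条轨迹的 R(tau)
R_bar = np.mean(returns)                         # 用N条轨迹的平均近似期望回报
print(R_bar)                                     # 1.0
```

接下来的问题就是:怎么调整$\theta$让这个平均回报变大,也就是下文的梯度推导。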
74 | 75 | 76 | 77 | ![](http://oss.hackslog.cn/imgs/075954.jpg) 78 | 79 | 80 | 81 | 优化算法是梯度更新,首先我们先计算出$\bar{R}_{\theta}$ 对$\theta$的梯度。 82 | 83 | 84 | 85 | 从公式中可以看出$R(\tau)$可以是不可微的,因为与参数无关,不需要求导。 86 | 87 | 88 | 89 | 第一个改写(红色部分):将加权求和写成期望的形式。 90 | 91 | 92 | 93 | 第二个近似:实际上没有办法把所有可能的轨迹(游戏记录)都求出来,所以一般是采样N个轨迹 94 | 95 | 96 | 97 | 第三个改写:将$p_{\theta}\left(\tau^{n}\right)$的表达式展开(前2页slide),去掉跟$\theta$无关的项(不需要求导),则可达到最终的简化结果。具体如下:首先用actor采集一个游戏记录 98 | 99 | 100 | 101 | ![image-20191029215615001](http://oss.hackslog.cn/imgs/2019-11-06-080254.png) 102 | 103 | 104 | 105 | ![image-20191029220147651](http://oss.hackslog.cn/imgs/2019-11-06-080301.png) 106 | 107 | 108 | 109 | 最终得到的公式相当的直觉,在s下采取了a导致最终结果赢了,那么return就是正的,也就是会增加相应的s-a出现的概率P。 110 | 111 | 112 | 113 | 上面的公式推导中可能会有疑问,为什么要引入log?再乘一个概率除一个概率?原因非常的直觉,如下:如果动作b本来出现的次数就多,那么在加权平均所有的episode后,参数会偏好执行动作b,而实际上动作b得到的return比a低,所以除掉自身出现的概率,以降低其对训练的影响。 114 | 115 | 116 | 117 | ![image-20191029220546039](http://oss.hackslog.cn/imgs/2019-11-06-080313.png) 118 | 119 | 120 | 121 | 那么,到底是怎么更新参数的呢? 122 | 123 | 124 | 125 | ![](http://oss.hackslog.cn/imgs/080148.jpg) 126 | 127 | 首先会拿agent跟环境互动,收集大量游戏记录,然后把每一个游戏记录拿到右边,计算一个参数theta的更新值,更新参数后,再拿新的actor去收集游戏记录,不断循环。 128 | 129 | 130 | 131 | 注意:一般采样的数据只会用一次,用完就丢弃 132 | 133 | 134 | 135 | ![](http://oss.hackslog.cn/imgs/2019-11-06-080339.jpg) 136 | 137 | 138 | 139 | 具体实现:可当成一个分类任务,只是分类的结果不是识别object,是给出actor要执行的动作。 140 | 141 | 142 | 143 | 如何构建训练集? 采样得到的a,作为ground truth。然后去最小化loss function。 144 | 145 | 146 | 147 | 一般的分类问题loss function是交叉熵,在强化学习里面,只需要在前面乘一个weight,即交叉熵乘一个return。 148 | 149 | 150 | 151 | 实现的过程中还有一些tips可以提高效果: 152 | 153 | 154 | 155 | ![](http://oss.hackslog.cn/imgs/2019-11-06-080618.jpg) 156 | 157 | 158 | 159 | 如果 reward都是正的,那么理想的情况下:reward从大到小 b>a>c, 出现次数 b>a>c, 经过训练以后,reward值高的a,c会提高出现的概率,b会降低。但如果a没有采样到,则a出现的概率最终可能会下降,尽管a的reward高。 160 | 161 | 162 | 163 | 解决方法:增加一个baseline,用r-b作为新的reward,让其有正有负。最简单的做法是b取所有轨迹的平均回报。 164 | 165 | 一般r-b叫做优势函数Advantage Functions。我们不需要描述一个行动的绝对好坏,而只需要知道它相对于平均水平的优势。 166 | 167 | 168 | 169 | ![](http://oss.hackslog.cn/imgs/2019-11-06-080716.jpg) 170 | 171 | 172 | 173 | 在这个公式里面,对于一个轨迹,每一个s-a的pair都会乘同一个weight,显然不公平,因为一局游戏里面往往有好的pair,有对结果不好的pair。所以我们希望给每一个pair乘不同的weight。整场游戏结果是好的,不代表每一个pair都是好的。如果sample次数够多,则不存在这个问题。 174 | 175 | 176 | 177 | 解决思路:在执行当下action之前的事情跟其没有关系,无论得到多少功劳都跟它没有关系,只考虑在当下执行pair之后的reward,这才是它真正的贡献。把原来的总的return,换成未来的return。 178 | 179 | 180 | 181 | 如图:对于第一组数据,在($s_a$,$a_1$)时候总的return是+3,那么如果对每一个pair都乘3,则($s_b$,$a_2$)会认为是有效的,但如果使用改进的思路,将其乘之后的return,即-2,则能有效抑制该pair对结果的贡献。 182 | 183 | 184 | 185 | 再改进:加一个折扣系数,如果时间拖得越长,对于越之后的reward,当下action的功劳越小。 186 | 187 | 188 | 189 | ![](http://oss.hackslog.cn/imgs/2019-11-06-080806.jpg) 190 | 191 | 192 | 193 | 我们将R-b 记为 A,意义就是评价当前s执行动作a,相比于采取其他的a,它有多好。之后我们会用一个critic网络来估计这个评价值。 194 | 195 | ## 2.2 PPO 196 | 197 | PPO算法是PG算法的变形,目的是把在线的学习变成离线的学习。 198 | 199 | 核心的idea是对每一条经验(又称轨迹,即一个episode的游戏记录)不止使用一次。 200 | 201 | 简单理解:在线学习就是一边玩一边学,离线学习就是先看着别人玩进行学习,之后再自己玩 202 | 203 | ![](http://oss.hackslog.cn/imgs/2019-11-06-080842.jpg) 204 | 205 | 206 | 207 | ![](http://oss.hackslog.cn/imgs/2019-11-06-080853.jpg) 208 | 209 | 210 | 211 | Motivation:每次用$\pi_\theta$去采样数据之后,$\pi_\theta$都会更新,接下来又要采样新的数据。以至于PG算法大部分时间都在采样数据。那么能不能将这些数据保存下来,由另一个$\pi_{\theta'}$去更新参数?那么策略$\pi_\theta$采样的数据就能被$\pi_{\theta'}$多次利用。引入统计学中的经典方法: 212 | 213 | 214 | 215 | 重要性采样:如果想求一个函数的期望,但无法积分,则可以通过采样求平均的方法来近似,但是如果p分布不知道(无法采样),我们知道q分布,则如上图通过一个重要性权重,用q分布来替代p分布进行采样。这个重要性权重的作用就是修正两个分布的差异。 216 | 217 | 218 | 219 | ![](http://oss.hackslog.cn/imgs/2019-11-06-080922.jpg) 220 | 221 | 222 | 223 | 
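重要性采样可以先脱离RL单独理解。下面是一个极简的numpy数值示意(目标分布p、采样分布q和函数f都是随手选的假设),只为说明$p(x)/q(x)$这个权重如何修正两个分布的差异:

```python
import numpy as np

np.random.seed(0)
f = lambda x: x ** 2                                # 想求 E_{x~p}[f(x)],理论值为1

p = lambda x: np.exp(-0.5 * x ** 2) / np.sqrt(2 * np.pi)          # 目标分布 p: N(0, 1)
q = lambda x: np.exp(-0.5 * (x - 0.5) ** 2) / np.sqrt(2 * np.pi)  # 实际采样分布 q: N(0.5, 1)

x = np.random.normal(0.5, 1.0, size=100000)         # 只能从 q 采样
w = p(x) / q(x)                                     # 重要性权重,修正 p 与 q 的差异
print(np.mean(f(x)))                                # 不加权:约1.25,偏离真实值
print(np.mean(f(x) * w))                            # 加权后:约1.0,近似 E_{x~p}[f(x)]
```

下一段讨论的正是:当p和q差异较大、采样次数又有限时,这个修正会出什么问题。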
存在的问题:如果p跟q的差异比较大,则方差会很大 224 | 225 | 226 | 227 | ![](http://oss.hackslog.cn/imgs/2019-11-06-080951.jpg) 228 | 229 | 230 | 231 | 如果sample的次数不够多,比如按原分布p进行采样,最终f的期望值是负数(大部分概率都在左侧,左侧f是负值),如果按q分布进行sample,只sample到右边,则f就一直是正的,严重偏离原来的分布。当然采样次数够多的时候,q也sample到了左边,则p/q这个负weight非常大,会平衡掉右边的正值,会导致最终计算出的期望值仍然是负值。但实际上采样的次数总是有限的,出现这种问题的概率也很大。 232 | 233 | 234 | 235 | 先忽略这个问题,加入重要性采样之后,训练变成了离线的 236 | 237 | 238 | 239 | ![](http://oss.hackslog.cn/imgs/2019-11-06-081110.jpg) 240 | 241 | 242 | 243 | 离线训练的实现:用另一个policy2与环境做互动,采集数据,然后在这个数据上训练policy1。尽管2个采集的数据分布不一样,但加入一个重要性的weights,可以修正其差异。等policy1训练的差不多以后,policy2再去采集数据,不断循环。 244 | 245 | 246 | 247 | ![](http://oss.hackslog.cn/imgs/2019-11-06-081122.jpg) 248 | 249 | 250 | 251 | 由于我们得到的$A^{\theta}\left(s_{t}, a_{t}\right)$(执行当下action后得到reward-baseline)是由policy2采集的数据观察得到的,所以 $A^{\theta}\left(s_{t}, a_{t}\right)$的参数得修正为$\theta'$ 252 | 253 | 254 | 255 | 根据$\nabla f(x)=f(x) \nabla \log f(x)$反推目标函数$J$,注意要优化的参数是$\theta$,$\theta'$只负责采集数据。 256 | 257 | 258 | 259 | 利用 $\theta'$采集的数据来训练$\theta$,会不会有问题?(虽然有修正,但毕竟还是不同) 答案是我们需要保证它们的差异尽可能的小,那么在刚刚的公式里再加入一些限制保证其差异足够小,则诞生了 PPO算法。 260 | 261 | 262 | 263 | ![](http://oss.hackslog.cn/imgs/2019-11-06-081210.jpg) 264 | 265 | 266 | 267 | 引入函数KL,KL衡量两个分布的距离。注意:不是参数上的距离,是2个$\pi$给同样的state之后基于各自参数输出的action space的距离 268 | 269 | 270 | 271 | 加入KL的公式直觉的理解:如果我们学习出来的$\theta$跟$\theta'$越像,则KL越小,J越大。我们的学习目标还是跟原先的PG算法一样,用梯度上升训练,最大化J。这个操作有点像正则化,用来解决重要性采样存在的问题。 272 | 273 | 274 | 275 | TRPO是PPO的前身,把KL这个限制条件放在优化的目标函数外面。对于梯度上升的优化过程,这种限制比较难处理,使优化变得复杂,一般不用。 276 | 277 | 278 | 279 | ![](http://oss.hackslog.cn/imgs/2019-11-06-081228.jpg) 280 | 281 | 282 | 283 | 实现过程:初始化policy参数,在每一个迭代里面,用$\theta^k$采集很多数据,同时计算出奖励A值,接着用这些数据训练,更新$\theta$以优化J。由于是离线训练,可以多次更新后,再去采集新的数据。 284 | 285 | 286 | 287 | 有一个trick是KL的权重beta也可以调整,使其更加的adaptive。 288 | 289 | 290 | 291 | ![](http://oss.hackslog.cn/imgs/2019-11-06-081245.jpg) 292 | 293 | 294 | 295 | ![](http://oss.hackslog.cn/imgs/2019-11-06-081315.jpg) 296 | 297 | 298 | 299 | 实际上KL也是比较难计算的,所以有了PPO2算法,不计算KL,通过clip达到同样效果。 300 | 301 | 302 | 303 | clip(a, b, c):若 a < b,输出 b;若 a > c,输出 c;若 b ≤ a ≤ c,输出 a 本身。也就是把第一个参数裁剪到区间 [b, c] 内。 304 | 305 | 306 | 307 | 看图:绿色是min里面的第一项,蓝色是min里面的第二项,红色是min最终的输出。 308 | 309 | 310 | 311 | 这个公式的直觉理解:希望$\theta$与$\theta^k$在优化之后不要差距太大。如果A>0,即这个state-action是好的,所以需要增加这个pair出现的几率,所以在max J的过程中会增大$\frac{p_{\theta}\left(a_{t} | s_{t}\right)}{p_{\theta^{k}}\left(a_{t} | s_{t}\right)}$,但最大不要超过1+epsilon;如果A<0,则会减小这个比值,但最小不低于1-epsilon,始终不会相差太大。 312 | 313 | 314 | 315 | ![](http://oss.hackslog.cn/imgs/2019-11-06-081323.jpg) 316 | 317 | PG算法效果非常不稳定,自从有了PPO,PG的算法可以在很多任务上work。 318 | 319 | -------------------------------------------------------------------------------- /notes/3 Q - Learning.md: -------------------------------------------------------------------------------- 1 | # 3. Q - Learning 2 | 3 | ## 3.1 Q-learning 4 | 5 | 在之前的policy-based算法里,我们的目标是learn 一个actor,value-based的强化学习算法目标是learn一个critic。 6 | 7 | 定义一个Critic,也就是状态值函数$V^{\pi}(s)$,它的值是:当使用策略$\pi$进行游戏时,在观察到一个state s之后,环境输出的累积的reward值的期望。注意取决于两个值,一个是state s,一个是actor$\pi$。 8 | 9 | ![](http://oss.hackslog.cn/imgs/2019-11-06-081510.jpg) 10 | 11 | ![0004](http://oss.hackslog.cn/imgs/2019-11-06-081524.jpg) 12 | 13 | 如果是不同的actor,在同样的state下,critic给出的值也是不同的。那么怎么估计出这个函数V呢?
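下文会详细比较MC与TD两种估计方法,这里先给一个TD(0)单步更新$V^{\pi}$的表格版极简示意(状态数、学习率等均为假设,真正实现里表格会换成神经网络),方便对照后面的公式:

```python
import numpy as np

n_states = 16                       # 假设一个16个状态的小环境(例如FrozenLake)
V = np.zeros(n_states)              # 表格型的 V,深度RL里换成网络的输出

def td0_update(V, s, r, s_next, done, alpha=0.1, gamma=0.99):
    """对一笔经验 (s, r, s_next) 做一次TD(0)更新:让 V(s) 逼近 r + gamma * V(s_next)"""
    target = r if done else r + gamma * V[s_next]
    V[s] += alpha * (target - V[s])
    return V

V = td0_update(V, s=3, r=0.0, s_next=7, done=False)
```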
14 | 15 | 主要有MC和TD的方法,实际上还有DP的方法,但是用DP求解需要整个环境都是已知的。而在强化学习的大部分任务里,都是model-free的,需要agent自己去探索环境。 16 | 17 | ![0005](http://oss.hackslog.cn/imgs/2019-11-06-081539.jpg) 18 | 19 | MC:直接让agent与环境互动,统计计算出在$S_a$之后直到一个episode结束的累积reward作为$G_a$。 20 | 21 | 训练的目标就是让$V^{\pi}(s)$的输出尽可能的接近$G_a$。 22 | 23 | ![0006](http://oss.hackslog.cn/imgs/2019-11-06-081554.jpg) 24 | 25 | MC每次必须把游戏玩到结束,TD不需要把游戏玩到底,只需要玩了一次游戏,有一个状态的变化。 26 | 27 | 那么训练的目标就是让$V^{\pi}(s_t)$ 和$V^{\pi}(s_t+1)$的差接近$r_t$ 28 | 29 | ![0007](http://oss.hackslog.cn/imgs/2019-11-06-081607.jpg) 30 | 31 | MC方差大,因为$r$是一个随机变量,MC方法中的$G$是$r$之和,而TD方法只有$r$是随机变量,r的方差比G小。但TD方法的$V^{\pi}$有可能估计的不准。 32 | 33 | ![0008](http://oss.hackslog.cn/imgs/2019-11-06-081621.jpg) 34 | 35 | 用MC和TD估计的结果不一样 36 | 37 | ![0009](http://oss.hackslog.cn/imgs/2019-11-06-081633.jpg) 38 | 39 | 定义另一种Critic,状态-动作值函数$Q^{\pi}(s,a)$,有的地方叫做Q-function,输入是一个pair $(s,a)$,意思是用$\pi$玩游戏时,在s状态下强制执行动作a(策略$\pi$在s下不一定会执行a),所得到的累积reward。 40 | 41 | 有两种写法,输入pair,输出Q,此时的Q是一个标量。 42 | 43 | 另一种是输入s,输出所有可能的action的Q值,此时Q是一个向量。 44 | 45 | 那么Critic到底怎么用呢? 46 | 47 | ![](http://oss.hackslog.cn/imgs/2019-11-06-081645.jpg) 48 | 49 | Q-learning的过程: 50 | 51 | 初始化一个actor $\pi$去收集数据,然后learn一个基于$ \pi$的Q-function,接着寻找一个新的比原来的$\pi$要好actor , 找到后更新$\pi$,再去寻找新的Q-function,不断循环,得到更好的policy。 52 | 53 | 可见Q-learning的核心思想是先找到最优的Q-function,再通过这个Q-function得出最优策略。而Policy-based的算法是直接去学习策略。这是本质区别。 54 | 55 | 那么,怎么样才算比原来的好? 56 | 57 | ![0012](http://oss.hackslog.cn/imgs/2019-11-06-081701.jpg) 58 | 59 | 定义好的策略:对所有可能的s而言,$V_\pi(s)$一定小于$V_\pi'(s)$,则$V_\pi'(s)$就是更好的策略。 60 | 61 | $\pi'(s)$的本质:假设已经学习到了一个actor $\pi$的Q-function,给一个state,把所有可能的action都代入Q,执行那个可以让Q最大的action。 62 | 63 | 注意:实际上,给定一个s,$ \pi$不一定会执行a,现在的做法是强制执行a,计算执行之后玩下去得到的reward进行比较。 64 | 65 | 在实现的时候$\pi'$没有额外的参数,依赖于Q。并且当动作是连续值的时候,无法进行argmax。 66 | 67 | 那么, 为什么actor $\pi’$能被找到? 68 | 69 | ![0013](http://oss.hackslog.cn/imgs/2019-11-06-081720.jpg) 70 | 71 | 上面是为了证明:只要你估计出了一个actor的Q-function,则一定可以找到一个更好的actor。 72 | 73 | 核心思想:在一个episode中某一步把$\pi$换成了$ \pi'$比完全follow $ \pi$,得到的奖励期望值会更大。 74 | 75 | 注意$r_{t+1}$指的是在执行当下$a_t$得到的奖励,有的文献也会写成$r_t$ 76 | 77 | 训练的时候有一些Tips可以提高效率: 78 | 79 | ![0014](http://oss.hackslog.cn/imgs/2019-11-06-081740.jpg) 80 | 81 | Tips 1 引入target网络 82 | 83 | 训练的时候,每次需要两个Q function(两个的输入不同)一起更新,不稳定。 一般会固定一个Q作为Target,产生回归任务的label,在训练N次之后,再更新target的参数。回归任务的目标,让$Q^\pi(s_t,a_t)$与$\mathrm{Q}^{\pi}\left(s_{t+1}, \pi\left(s_{t+1}\right)\right))+r$越来越接近,即降低mse。最终希望训练得到的$Q^\pi$能直接估计出这个$(s_t,a_t)$未来的一个累积奖励。 84 | 85 | 注意:target网络的参数不需要训练,直接每隔N次复制Q的参数。训练的目标只有一个 Q。 86 | 87 | ![0015](http://oss.hackslog.cn/imgs/2019-11-06-081754.jpg) 88 | 89 | Tips2 改进探索机制 90 | 91 | PG算法,每次都会sample新的action,随机性比较大,大概率会尽可能的覆盖所有的动作。而之前的Q-learning,策略的本质是绝对贪婪策略,那么如果有的action没有被sample到,则可能之后再也不会选择这样的action。这种探索的机制(收集数据的方法)不好,所以改进贪心算法,让actor每次会$\varepsilon$的概率执行随机动作。 92 | 93 | ![0016](http://oss.hackslog.cn/imgs/2019-11-06-081820.jpg) 94 | 95 | ![0017](http://oss.hackslog.cn/imgs/2019-11-06-081833.jpg) 96 | 97 | Tips 3 引入记忆池机制 98 | 99 | 将采集到的一些数据收集起来,放入replay buffer。好处: 100 | 101 | 1.可重复使用过去的policy采集的数据,降低agent与环境互动的次数,加快训练效率 。 102 | 103 | 2.replay buffer里面包含了不同actor采集的数据,这样每次随机抽取一个batch进行训练的时候,每个batch内的数据会有较大的差异(数据更加diverse),有助于训练。 104 | 105 | 那么,当我们训练的目标是$ \pi$的Q-function,训练数据混杂了$\pi’$,$\pi’'$,$\pi’''$采集的数据 有没有问题呢?没有,不是因为这些$ \pi$很像,主要原因是我们采样的不是一个轨迹,只是采样了一笔experience($s_t,a_t,r_t,s_{t+1}$)。这个理论上证明是没有问题的,很难解释... 
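把上面三个技巧放在一起,大致是下面这个极简版本(为了突出结构,用表格型Q代替神经网络,状态/动作数、batch大小等均为假设,并非课程或repo里的实现):

```python
import random
from collections import deque

import numpy as np

n_states, n_actions = 16, 4
Q        = np.zeros((n_states, n_actions))   # 要训练的 Q(这里用表格代替神经网络)
Q_target = Q.copy()                          # Tip 1: 固定不动的 target 网络
buffer   = deque(maxlen=10000)               # Tip 3: replay buffer,存 (s, a, r, s_next, done)

def choose_action(s, eps=0.1):
    """Tip 2: epsilon-greedy 探索"""
    if random.random() < eps:
        return random.randrange(n_actions)
    return int(np.argmax(Q[s]))

def train_step(batch_size=32, lr=0.1, gamma=0.99):
    """从buffer里随机抽一个batch,让 Q(s,a) 回归到 target 网络给出的 label"""
    batch = random.sample(buffer, min(batch_size, len(buffer)))
    for s, a, r, s_next, done in batch:
        y = r if done else r + gamma * np.max(Q_target[s_next])   # label 由固定的 target 给出
        Q[s, a] += lr * (y - Q[s, a])

def sync_target():
    """每隔N步把 Q 的参数复制给 target"""
    global Q_target
    Q_target = Q.copy()
```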
106 | 107 | ![0018](http://oss.hackslog.cn/imgs/2019-11-06-081844.jpg) 108 | 109 | 采用了3个Tips的Q-learning训练过程如图: 110 | 111 | 注意图中省略了一个循环,即存储了很多笔experience之后才会进行sample。相比于原始的Q-learning,每次sample是从replay buff里面随机抽一个batch,然后计算用绝对贪心策略得到Q-target的值作为label,接着在回归任务中更新Q的参数。每训练多步后,更新Q-target的参数。 112 | 113 | ## 3.2 Tips of Q-learning 114 | 115 | ![](http://oss.hackslog.cn/imgs/2019-11-06-081854.jpg) 116 | 117 | DQN估计出的值一般都高于实际的值,double DQN估计出的值与实际值比较接近。 118 | 119 | ![0021](http://oss.hackslog.cn/imgs/2019-11-06-081906.jpg) 120 | 121 | Q是一个估计值,被高估的越多,越容易被选择。 122 | 123 | ![0022](http://oss.hackslog.cn/imgs/2019-11-06-081948.jpg) 124 | 125 | Double的思想有点像行政跟立法分权。 126 | 127 | 用要训练的Q-network去选择动作,用固定不动的target-network去做估计,相比于DQN,只需要改一行代码! 128 | 129 | ![0023](http://oss.hackslog.cn/imgs/2019-11-06-081958.jpg) 130 | 131 | 改了network架构,其他没动。每个网络结构的输出是一个标量+一个向量 132 | 133 | ![0024](http://oss.hackslog.cn/imgs/2019-11-06-082009.jpg) 134 | 135 | 比如下一时刻,我们需要把3->4, 0->-1,那么Dueling结构里会倾向于不修改A,只调整V来达到目的,这样只需要把V中 0->1, 如果Q中的第三行-2没有被sample到,也进行了更新,提高效率,减少训练次数。 136 | 137 | ![0025](http://oss.hackslog.cn/imgs/2019-11-06-082019.jpg) 138 | 139 | 实际实现的时候,通过添加了限制条件,也就是把A normalize,使得其和为0,这样只会更新V。 140 | 141 | 这种结构让DQN也能处理连续的动作空间。 142 | 143 | ![0028](http://oss.hackslog.cn/imgs/2019-11-06-082253.jpg) 144 | 145 | 加入权重的replay buffer 146 | 147 | motivation:TD error大的数据应该更可能被采样到 148 | 149 | 注意论文原文实现的细节里,也修改了参数更新的方法 150 | 151 | ![0029](http://oss.hackslog.cn/imgs/2019-11-06-082300.jpg) 152 | 153 | 原来收集一条experience是执行一个step,现在变成执行N个step。相比TD的好处:之前只sample一个$(s_t,a_t)$pair,现在sample多个才估测Q值,估计的误差会更小。坏处,与MC一样,reward的项数比较多,相加的方差更大。 调N就是一个trade-off的过程。 154 | 155 | ![0030](http://oss.hackslog.cn/imgs/2019-11-06-082309.jpg) 156 | 157 | 在Q-function的参数空间上+noise 158 | 159 | 比较有意思的是,OpenAI DeepMind几乎在同一个时间发布了Noisy Net思想的论文。 160 | 161 | ![0031](http://oss.hackslog.cn/imgs/2019-11-06-082316.jpg) 162 | 163 | 在同一个episode里面,在动作空间上加噪声,会导致相同state下执行的action不一样。而在参数空间加噪声,则在相同或者相似的state下,会采取同一个action。 注意加噪声只是为了在不同的episode的里面,train Q的时候不会针对特定的一个state永远只执行一个特定的action。 164 | 165 | ![0033](http://oss.hackslog.cn/imgs/2019-11-06-082325.jpg) 166 | 167 | 带分布的Q-function 168 | 169 | Motivation:原来计算Q-function的值是通过累积reward的期望,也就是均值,但实际上累积的reward可能在不同的分布下会得到相同的Q值。 170 | 171 | 注意:每个Q-function的本质都是一个概率分布。 172 | 173 | ![0034](http://oss.hackslog.cn/imgs/2019-11-06-082332.jpg) 174 | 175 | 让$Q^ \pi$直接输出每一个Q-function的分布,但实际上选择action的时候还是会根据mean值大的选。不过拥有了这个分布,可以计算方差,这样如果有的任务需要在追求回报最大的同时降低风险,则可以利用这个分布。 176 | 177 | ![0036](http://oss.hackslog.cn/imgs/2019-11-06-082339.jpg) 178 | 179 | ![0037](http://oss.hackslog.cn/imgs/2019-11-06-082345.jpg) 180 | 181 | Rainbow:集成了7种升级技术的DQN 182 | 183 | 上图是一个一个改进拿掉之后的效果,看紫色似乎double 没啥用,实际上是因为有Q-function的分布存在,一般不会过高估计Q值,所以double 意义不大。 184 | 185 | 直觉的理解:使用分布DQN,即时Q值被高估很多,由于最终只会映射到对应的分布区间,所以最终的输出值也不会过大。 186 | 187 | ## 3.3 Q-learning in continuous actions 188 | 189 | 在出现PPO之前, PG的算法非常不稳定。DQN 比较稳定,也容易train,因为DQN是只要估计出Q-function,就能得到好的policy,而估计Q-function就是一个回归问题,回归问题比较容易判断learn的效果,看mse。问题是Q-learning不太容易处理连续动作空间。比如汽车的速度,是一个连续变量。 190 | 191 | ![0039](http://oss.hackslog.cn/imgs/2019-11-06-082354.jpg) 192 | 193 | 当动作值是连续时,怎么解argmax: 194 | 195 | 1. 通过映射,强行离散化 196 | 197 | 2. 使用梯度上升解这个公式,这相当于每次train完Q后,在选择action的时候又要train一次网络,比较耗时间。 198 | 199 | ![0040](http://oss.hackslog.cn/imgs/2019-11-06-082404.jpg) 200 | 201 | 3. 设计特定的网络,使得输出还是一个标量。 202 | 203 | ![0042](http://oss.hackslog.cn/imgs/2019-11-06-082433.jpg) 204 | 205 | 最有效的解决方法是,针对连续动作空间,不要使用Q-learning。使用AC算法! 
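补充一个3.2节Double DQN的示意:它与原始DQN的差别只体现在target值的计算上,即"选动作"与"估值"分别交给两个网络。下面用表格型Q写出这一行差别(非课程代码,仅作对照):

```python
import numpy as np

# Q、Q_target 均为 [n_states, n_actions] 的表格(或换成网络对所有动作的输出)

def dqn_target(Q_target, r, s_next, gamma=0.99):
    # 原始DQN:选动作和估值都用 target 网络,容易高估 Q 值
    return r + gamma * np.max(Q_target[s_next])

def double_dqn_target(Q, Q_target, r, s_next, gamma=0.99):
    # Double DQN:用正在训练的 Q 选动作,用固定的 target 网络来估值
    a_star = int(np.argmax(Q[s_next]))
    return r + gamma * Q_target[s_next, a_star]
```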
-------------------------------------------------------------------------------- /notes/4 Actor Critic.md: -------------------------------------------------------------------------------- 1 | # 4. Actor Critic 2 | 3 | ## 4.1 Advantage Actor-Critic (A2C) 4 | 5 | ![](http://oss.hackslog.cn/imgs/2019-11-06-082605.jpg) 6 | 7 | 由于每次在执行PG算法之前,一般只能采样少量的数据,导致对于同一个$(s_t,a_t)$,得到的$G$的值方差很大,不稳定。那么能不能直接估计出期望值,来替代采样的结果? 8 | 9 | ![AC-4](http://oss.hackslog.cn/imgs/2019-11-06-082530.jpg) 10 | 11 | 回顾下Q-learning中的定义,我们发现: 12 | 13 | ![AC-5](http://oss.hackslog.cn/imgs/2019-11-06-082602.jpg) 14 | 15 | PG算法中G的期望的定义恰好也是Q-learning算法中$Q^\pi(s,a)$的定义: 假设现在的policy是$ \pi$的情况下,在某一个s,采取某一个a以后得到的累积reward的期望值。 16 | 17 | 因此在这里将Q-learning引入到预估reward中,也即policy gradient和q-learning的结合,叫做Actor-Critic。 18 | 19 | 把原来的reward和baseline分别替换,PG算法中的减法就变成了$Q^{\pi_{\theta}}\left(s_{t}^{n}, a_{t}^{n}\right)-V^{\pi_{\theta}}\left(s_{t}^{n}\right)$。似乎我们需要训练2个网络? 20 | 21 | ![AC-6](http://oss.hackslog.cn/imgs/2019-11-06-082629.jpg) 22 | 23 | 实际上Q与V可以互相转化,我们只需要训练V。转化公式中为什么要加期望?在s下执行a得到的$ r_t$和$s_{t+1}$是随机的。 24 | 25 | 实际将Q变成V的操作中,我们会去掉期望,使得只需要训练(估计)状态值函数$V^\pi$,这样会导致一点偏差,但比同时估计两个function导致的偏差要好。(A3C原始paper通过实验验证了这一点)。 26 | 27 | ![AC-7](http://oss.hackslog.cn/imgs/2019-11-06-082641.jpg) 28 | 29 | A2C的训练流程:收集数据,估计出状态值函数$V^\pi(s)$,套用公式更新策略$\pi$,再利用新的$\pi$与环境互动收集新的数据,不断循环。 30 | 31 | ![AC-8](http://oss.hackslog.cn/imgs/2019-11-06-082652.jpg) 32 | 33 | 训练过程中的2个Tips: 34 | 35 | 1. Actor与Critic的前几层一般会共用参数,因为输入都是state 36 | 2. 正则化:让采用不同action的概率尽量平均,希望有更大的entropy,这样能够探索更多情况。 37 | 38 | ## 4.2 Asynchronous Advantage Actor-Critic (A3C) 39 | 40 | ![AC-9](http://oss.hackslog.cn/imgs/2019-11-06-082709.jpg) 41 | 42 | A3C算法的motivation:开分身学习~ 43 | 44 | ![AC-10](http://oss.hackslog.cn/imgs/2019-11-06-082718.jpg) 45 | 46 | 训练过程:每个agent复制一份全局参数,然后各自采样数据,计算梯度,更新这份全局参数,然后将结果传回,复制一份新的参数。 47 | 48 | 注意: 49 | 50 | 1. 初始条件会尽量的保证多样性(Diverse),让每个agent探索的情况更加不一样。 51 | 52 | 2. 
所有的actor都是平行跑的,每个worker把各自的参数传回去然后复制一份新的全局参数。此时可能这份全局参数已经发生了改变,没有关系。 53 | 54 | ## 4.3 Pathwise Derivative Policy Gradient (PDPG) 55 | 56 | 在之前Actor-Critic框架里,Critic的作用是评估agent所执行的action好不好?那么Critic能不能不止给出评价,还给出指导意见呢?即告诉actor要怎样做才能更好?于是有了DPG算法: 57 | 58 | ![AC-12](http://oss.hackslog.cn/imgs/2019-11-06-082731.jpg) 59 | 60 | ![AC-13](http://oss.hackslog.cn/imgs/2019-11-06-082746.jpg) 61 | 62 | 在上面介绍A2C算法的motivation,主要是从改进PG算法引入。那么从Q-learning的角度来看,PDPG相当于learn一个actor,来解决argmax这个优化问题,以处理连续动作空间,直接根据输入的状态输出动作。 63 | 64 | ![AC-14](http://oss.hackslog.cn/imgs/2019-11-06-082759.jpg) 65 | 66 | Actor+Critic连成一个大的网络,训练过程中也会采取TD-target的技巧,固定住Critic $\pi'$,使用梯度上升优化Actor 67 | 68 | ![AC-15](http://oss.hackslog.cn/imgs/2019-11-06-082809.jpg) 69 | 70 | 训练过程:Actor会学到策略$\pi$,使基于策略$\pi$,输入s可以获得能够最大化Q的action,天然地能够处理continuous的情况。当actor生成的$Q^\pi$效果比较好时,重新采样生成新的Q。有点像GAN中的判别器与生成器。 71 | 72 | 注意:从算法的流程可知,Actor 网络和 Critic 网络是分开训练的,但是两者的输入输出存在联系,Actor 网络输出的 action 是 Critic 网络的输入,同时 Critic 网络的输出会被用到 Actor 网路进行反向传播。 73 | 74 | 由于Critic模块是基于Q-learning算法,所以Q learning的技巧,探索机制,回忆缓冲都可以用上。 75 | 76 | ![AC-16](http://oss.hackslog.cn/imgs/2019-11-06-082820.jpg) 77 | 78 | ![AC-17](http://oss.hackslog.cn/imgs/2019-11-06-082830.jpg) 79 | 80 | 与Q-learning相比的改进: 81 | 82 | - 不通过Q-function输出动作,直接用learn一个actor网络输出动作(Policy-based的算法的通用特性)。 83 | - 对于连续变量,不好解argmax的优化问题,转化成了直接选择$\pi-target$ 输出的动作,再基于Q-target得出y。 84 | - 引入$\pi-target$,也使得actor网络不会频繁更新,会通过采样一批数据训练好后再更新,提高训练效率。 85 | 86 | ![AC-18](http://oss.hackslog.cn/imgs/2019-11-06-082845.jpg) 87 | 88 | 总结下:最基础的 Policy Gradient 是回合更新的,通过引入 Critic 后变成了单步更新,而这种结合了 policy 和 value 的方法也叫 Actor-Critic,Critic 有多种可选的方法。A3C在A2C的基础上引入了多个 agent 对网络进行异步更新。对于输出动作为连续值的情形,原始的输出动作概率分布的PG算法不能解决,同时Q-learning算法也不能处理这类问题,因此提出了 DPG 。 -------------------------------------------------------------------------------- /notes/5 Sparse Reward.md: -------------------------------------------------------------------------------- 1 | # 5. Sparse Reward 2 | 3 | 大多数RL的任务中,是没法得到reward,reward=0,导致reward空间非常的sparse。 4 | 5 | 比如我们需要赢得一局游戏才能知道胜负得到reward,那么玩这句游戏的很长一段时间内,我们得不到reward。比如如果机器人要将东西放入杯子才能得到一个reward,尝试了很多动作很有可能都是0。 6 | 7 | 但是人可以在非常sprse的环境下进行学习,所以这一章节提出的很多算法与人的一些学习机制比较类似。 8 | 9 | ## 5.1 Reward Shaping 10 | 11 | 手动设计新的reward,让agent做的更好。但有些比较复杂的任务,需要domain knowledge去设计新的reward。 12 | 13 | ![](http://oss.hackslog.cn/imgs/2019-11-06-082912.jpg) 14 | 15 | ![](http://oss.hackslog.cn/imgs/2019-11-06-082928.jpg) 16 | 17 | ## 5.2 Curiosity 18 | 19 | 好奇心机制非常的直觉,也非常的强大。有个案例:[Happy Bird](https://github.com/pathak22/noreward-rl) 20 | 21 | ![](http://oss.hackslog.cn/imgs/2019-11-06-082959.jpg) 22 | 23 | 好奇心也是reward shaping的一种,引入一个新的reward :ICM,同时优化2个reward。如何设计一个ICM模块,使agent拥有好奇心? 
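一个最朴素的思路是:训练一个前向模型去预测下一个state,预测误差越大,给的内在奖励越高。下面是一个线性前向模型的极简示意(模型结构、奖励权重均为随手假设,并不是ICM论文里的网络);这种朴素做法的问题以及正式的ICM设计见下文:

```python
import numpy as np

class ForwardModel:
    """极简线性前向模型:用 (s_t, a_t) 预测 s_{t+1},预测误差作为好奇心奖励"""

    def __init__(self, s_dim, a_dim, lr=0.01):
        self.W = np.zeros((s_dim + a_dim, s_dim))
        self.lr = lr

    def intrinsic_reward(self, s, a_onehot, s_next):
        x = np.concatenate([s, a_onehot])
        err = s_next - x @ self.W                 # 预测误差
        self.W += self.lr * np.outer(x, err)      # 顺便更新前向模型
        return float(np.sum(err ** 2))            # 误差越大,内在奖励 r^i 越大

icm = ForwardModel(s_dim=4, a_dim=2)
r_i = icm.intrinsic_reward(np.zeros(4), np.array([1.0, 0.0]), np.ones(4))
# 总奖励 = 环境奖励 + beta * r_i,其中 beta 是假设的权重超参数
```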
24 | 25 | ![](http://oss.hackslog.cn/imgs/2019-11-06-083015.jpg) 26 | 27 | 单独训练一个状态估计的模型,如果在某个state下采取某个action得到的下一个state难以预测,则鼓励agent进行尝试这个action。 不过有的state很难预测,但不重要。比如说某个游戏里面背景是树叶飘动,很难预测,接下来agent一直不动看着树叶飘动,没有意义。 28 | 29 | ![](http://oss.hackslog.cn/imgs/2019-11-06-083031.jpg) 30 | 31 | 再设计一个moudle,判断环境中state的重要性:learn一个feature ext的网络,去掉环境中与action关系不大的state。 32 | 33 | 原理:输入两个处理过的state,预测action,使逼近真实的action。这样使得处理之后的state都是跟agent要采取的action相关的。 34 | 35 | ## 5.3 Curriculum Learning 36 | 37 | 课程学习:为learning做规划,通常由易到难。 38 | 39 | ![](http://oss.hackslog.cn/imgs/2019-11-06-092656.jpg) 40 | 41 | 设计不同难度的课程,一开始直接把板子放入柱子,则agent只要把板子压下去就能获得reward,接着把板子的初始位置提高一些,agent有可能把板子抽出则无法获得reward,接着更general的情况,把板子放倒柱子外面,再让agent去学习。 42 | 43 | 生成课程的方法通常如下:从目标反推,越靠近目标的state越简单,不断生成难度更高的state。 44 | 45 | ![](http://oss.hackslog.cn/imgs/2019-11-06-092702.jpg) 46 | 47 | ![](http://oss.hackslog.cn/imgs/2019-11-06-092658.jpg) 48 | 49 | ## 5.4 Hierarchical RL 50 | 51 | 分层学习:把大的任务拆解成小任务 52 | 53 | ![](http://oss.hackslog.cn/imgs/2019-11-06-092745.jpg) 54 | 55 | 上层的agent给下层的agent提供一个愿景,如果下层的达不到目标,会获得惩罚。如果下层的agent得到的错误的目标,那么它会假设最初的目标也是错的。 -------------------------------------------------------------------------------- /notes/6 Imitation Learning.md: -------------------------------------------------------------------------------- 1 | # 6. Imitation Learning 2 | 3 | 模仿学习,又叫学徒学习,反向强化学习 4 | 5 | 之前介绍的强化学习都有一个reward function,但生活中大多数任务无法定义reward,或者难以定义。但是这些任务中如果收集很厉害的范例(专家经验)比较简单,则可以用模仿学习解决。 6 | 7 | ![](http://oss.hackslog.cn/imgs/2019-11-06-092747.jpg) 8 | 9 | ## 6.1 Behavior Cloning 10 | 11 | 本质是有监督学习 12 | 13 | ![0004](http://oss.hackslog.cn/imgs/2019-11-06-092759.jpg) 14 | 15 | ![0005](http://oss.hackslog.cn/imgs/2019-11-06-092817.jpg) 16 | 17 | 存在问题:training data里面没有撞墙的case,则agent遇到这种情况不知如何决策 18 | 19 | ![0006](http://oss.hackslog.cn/imgs/2019-11-06-092821.jpg) 20 | 21 | 一个直觉的解决方法是数据增强:每次通过牺牲一个专家,学会了一种新的case,策略$\pi$得到了增强。 22 | 23 | ![](http://oss.hackslog.cn/imgs/2019-11-06-092831.jpg) 24 | 25 | 行为克隆还存在一个关键问题:agent不知道哪些行为对结局重要,哪些不重要。由于是采样学习,有可能只记住了多余的无用的行为。 26 | 27 | ![0009](http://oss.hackslog.cn/imgs/2019-11-06-092847.jpg) 28 | 29 | 同时也由于RL的训练数据不是独立同分布,当下的action会影响之后的state,所以不能直接套用监督学习的框架。 30 | 31 | 为了解决这些问题,就有了反向强化学习,现在一般说模仿学习指的就是反向强化学习。 32 | 33 | ## 6.2 Inverse RL 34 | 35 | ![0011](http://oss.hackslog.cn/imgs/2019-11-06-092859.jpg) 36 | 37 | 之前的强化学习是reard和env通过RL 学到一个最优的actor。 38 | 39 | 反向强化学习是,假设有一批expert的数据,通过env和IRL推导expert因为什么样子的reward function才会采取这样的行为。 40 | 41 | 好处:也许expert的行为复杂但reward function很简单。拿到这个reward function后我们就可以训练出好的agent。 42 | 43 | ![0012](http://oss.hackslog.cn/imgs/2019-11-06-092907.jpg) 44 | 45 | IRL的框架:先射箭 再画靶。 46 | 47 | 具体过程: 48 | 49 | Expert先跟环境互动,玩N场游戏,存储记录,我们的actor $ \pi$也去互动,生成N场游戏记录。接下来定义一个reward function $R$,保证expert的$R$比我们的actor的$R$大就行。再根据定义的的$R$用RL的方法去学习一个新的actor ,这个过程也会采集新的游戏记录,等训练好这个actor,也就是当这个actor可以基于$R$获得高分的时候,重新定义一个新的reward function$R'$,让expert的$R'$大于agent,不断循环。 50 | 51 | ![0013](http://oss.hackslog.cn/imgs/2019-11-06-092917.jpg) 52 | 53 | IRL与GAN的框架是一样的,学习 一个 reward function相当于学习一个判别器,这个判别器给expert高分,给我们的actor低分。 54 | 55 | 一个有趣的事实是给不同的expert,我们的agent最终也会学会不同的策略风格。如下蓝色是expert的行为,红色是学习到的actor的行为。 56 | 57 | ![](http://oss.hackslog.cn/imgs/2019-11-06-092932.jpg) 58 | 59 | 针对训练robot的任务: 60 | 61 | IRL有个好处是不需要定义规则让robot执行动作,人给robot示范一下动作即可。但robot学习时候的视野跟它执行该动作时候的视野不一致,怎么把它在第三人称视野学到的策略泛化到第一人称视野呢? 
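顺着上面"IRL与GAN框架一样"的类比:reward function扮演的就是判别器,给expert的数据打高分、给我们actor的数据打低分。下面用线性reward加logistic损失写一个极简示意(特征、数据全是随手构造的假设,并不是GAIL等具体算法的实现):

```python
import numpy as np

np.random.seed(0)
expert_feats = np.random.randn(200, 4) + 1.0        # 专家轨迹里各个(s,a)的特征(假设)
agent_feats  = np.random.randn(200, 4)              # 我们的actor产生的特征(假设)

w = np.zeros(4)                                      # 线性reward: r(x) = w^T x
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

for _ in range(500):                                 # 训练"判别器":专家标1,actor标0
    g_e = expert_feats.T @ (1.0 - sigmoid(expert_feats @ w))
    g_a = agent_feats.T  @ (0.0 - sigmoid(agent_feats @ w))
    w += 0.01 * (g_e + g_a) / 400                    # logistic回归的梯度上升

print(np.mean(expert_feats @ w) > np.mean(agent_feats @ w))   # True:专家得分更高
```

学到这个reward之后,再用任意RL算法(比如PPO)基于它更新actor,然后重新采集actor的数据、再更新reward,就是前面"先射箭再画靶"的循环。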
62 | 63 | ![](http://oss.hackslog.cn/imgs/2019-11-06-092943.jpg) 64 | 65 | ![0019](http://oss.hackslog.cn/imgs/2019-11-06-092953.jpg) 66 | 67 | 解决思路跟好奇心机制类似,抽出视野中不重要的因素,让第一人称和第三人称视野中的state都是有用的,与action强相关的。 -------------------------------------------------------------------------------- /slides/AC.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/morningsky/NTU-ReinforcementLearning-Notes/d4a9dbf584ae24d974d9b7839f34cee3f18b79dd/slides/AC.pdf -------------------------------------------------------------------------------- /slides/IRL (v2).pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/morningsky/NTU-ReinforcementLearning-Notes/d4a9dbf584ae24d974d9b7839f34cee3f18b79dd/slides/IRL (v2).pdf -------------------------------------------------------------------------------- /slides/PPO (v3).pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/morningsky/NTU-ReinforcementLearning-Notes/d4a9dbf584ae24d974d9b7839f34cee3f18b79dd/slides/PPO (v3).pdf -------------------------------------------------------------------------------- /slides/QLearning (v2).pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/morningsky/NTU-ReinforcementLearning-Notes/d4a9dbf584ae24d974d9b7839f34cee3f18b79dd/slides/QLearning (v2).pdf -------------------------------------------------------------------------------- /slides/Reward (v3).pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/morningsky/NTU-ReinforcementLearning-Notes/d4a9dbf584ae24d974d9b7839f34cee3f18b79dd/slides/Reward (v3).pdf --------------------------------------------------------------------------------