├── README.md ├── code ├── .DS_Store ├── actor_critic_advantage.py ├── ddpg_update.py ├── deep_deterministic_policy_gradient.py ├── policy_gradient.py ├── proximal_policy_optimization.py └── tensrolayer-implemented │ ├── a3c.py │ ├── ac.py │ ├── ddpg.py │ ├── dqn.py │ ├── dqn_variants.py │ ├── pg.py │ ├── ppo.py │ ├── qlearning.py │ └── tutorial_wrappers.py ├── notes ├── .DS_Store ├── 1 Introduction.md ├── 2 Policy Gradient.md ├── 3 Q - Learning.md ├── 4 Actor Critic.md ├── 5 Sparse Reward.md └── 6 Imitation Learning.md └── slides ├── AC.pdf ├── IRL (v2).pdf ├── PPO (v3).pdf ├── QLearning (v2).pdf └── Reward (v3).pdf /README.md: -------------------------------------------------------------------------------- 1 | # 李宏毅深度强化学习 笔记 2 | 3 | ### 课程主页:[NTU-MLDS18](http://speech.ee.ntu.edu.tw/~tlkagk/courses_MLDS18.html) 4 | 5 | ### 视频: 6 | - [youtube](https://www.youtube.com/playlist?list=PLJV_el3uVTsODxQFgzMzPLa16h6B8kWM_) 7 | - [B站](https://www.bilibili.com/video/av24724071/?spm_id_from=333.788.videocard.4) 8 | 9 | 10 | ![1](http://oss.hackslog.cn/imgs/075034.png) 11 | 12 | 这门课的学习路线如上,强化学习是作为单独一个模块介绍。李宏毅老师讲这门课不是从MDP开始讲起,而是从如何获得最大化奖励出发,直接引出Policy Gradient(以及PPO),再讲Q-learning(原始Q-learning,DQN,各种DQN的升级),然后是A2C(以及A3C, DDPG),紧接着介绍了一些Reward Shaping的方法(主要是Curiosity,Curriculum Learning ,Hierarchical Learning),最后介绍Imitation Learning (Inverse RL)。比较全面的展现了深度强化学习的核心内容,也比较直观。跟伯克利学派的课类似,与UCL上来就讲MDP,解各种value iteration的思路有较大区别。 13 | 文档中的notes以对slides的批注为主,方便在阅读slides时理解,code以纯tensorflow实现,主要参考[莫凡RL教学](https://morvanzhou.github.io/tutorials/machine-learning/reinforcement-learning/),修正部分代码以保持前后一致性,已经加入便于理解的注释。 14 | ### 参考资料: 15 | [作业代码参考](https://github.com/JasonYao81000/MLDS2018SPRING/tree/master/hw4) [纯numpy实现非Deep的RL算法](https://github.com/ddbourgin/numpy-ml/tree/master/numpy_ml/rl_models) [OpenAI tutorial](https://github.com/openai/spinningup/tree/master/docs) [莫凡RL教学](https://morvanzhou.github.io/tutorials/machine-learning/reinforcement-learning/) 16 | - code中的tensorlayer实现来自于[Tensorlayer-RL](https://github.com/tensorlayer/tensorlayer/tree/master/examples/reinforcement_learning),比起原生tensorflow更加简洁 17 | -------------------------------------------------------------------------------- /code/.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/morningsky/NTU-ReinforcementLearning-Notes/d4a9dbf584ae24d974d9b7839f34cee3f18b79dd/code/.DS_Store -------------------------------------------------------------------------------- /code/actor_critic_advantage.py: -------------------------------------------------------------------------------- 1 | import tensorflow as tf 2 | import numpy as np 3 | import gym 4 | 5 | ''' 6 | 比较PG算法 7 | PG loss = log_prob * v估计 (来自贝尔曼公式) 8 | A2C loss = log_prob * TD-error(来自critic网络 表达当前动作的价值比平均动作的价值好多少) 9 | DDPG : critic不仅能影响actor actor也能影响critic 相当于critic不仅告诉actor的行为好不好,还告诉他应该怎么改进才能更好(传一个梯度 dq/da) 10 | PPO: 对PG的更新加了限制,提高训练稳定性 相比于A2C 只是actor网络更加复杂 11 | ''' 12 | class Actor(object): #本质还是policy gradient 不过A2C是单步更新 13 | def __init__(self, 14 | sess, #两个网络需要共用一个session 所以外部初始化 15 | n_actions, 16 | n_features, 17 | lr=0.01, ): 18 | #self.ep_obs, self.ep_as, self.ep_rs =[],[],[] #由于是单步更新 所以不需要存储每个episode的数据 19 | self.sess = sess 20 | 21 | self.s = tf.placeholder(tf.float32, [1, n_features], "state") 22 | self.a = tf.placeholder(tf.int32, None, "act") # 23 | self.td_error = tf.placeholder(tf.float32, None, "td_error") # TD_error更新的幅度 td 的理解应该是 Q(s, a) - V(s), 某个动作价值减去平均动作价值 24 | 25 | with tf.variable_scope('Actor'): 
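            # Note: structurally this Actor is the same softmax policy network as in policy_gradient.py;
            # the practical difference from vanilla PG is only the weight applied to log-prob in the loss —
            # PG weights it by the whole-episode discounted return, while A2C weights it by the per-step
            # TD error coming from the Critic, which is what allows the single-step update mentioned above.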
#将原来的name_scope换成variable_scope ,可以在一个scope里面共享变量 26 | l1 = tf.layers.dense( 27 | inputs=self.s, 28 | units=20, # number of hidden units 29 | activation=tf.nn.relu, 30 | kernel_initializer=tf.random_normal_initializer(0., .1), # weights 31 | bias_initializer=tf.constant_initializer(0.1), # biases 32 | name='l1' 33 | ) 34 | 35 | self.acts_prob = tf.layers.dense( 36 | inputs=l1, 37 | units=n_actions, # output units 38 | activation=tf.nn.softmax, # get action probabilities 39 | kernel_initializer=tf.random_normal_initializer(0., .1), # weights 40 | bias_initializer=tf.constant_initializer(0.1), # biases 41 | name='acts_prob' 42 | ) 43 | 44 | #with tf.name_scope('loss'): 45 | # 最大化 总体 reward (log_p * R) 就是在最小化 -(log_p * R), 而 tf 的功能里只有最小化 loss 46 | #neg_log_prob = tf.reduce_sum(-tf.log(self.all_act_prob)*tf.one_hot(self.tf_acts, self.n_actions), axis=1) #加- 变为梯度下降 47 | #loss = tf.reduce_mean(neg_log_prob * self.tf_vt) 48 | with tf.variable_scope('loss'): 49 | log_prob = tf.log(self.acts_prob[0,self.a]) #[[0.1,0.2,0.3]] -> 0.1, if a=0 50 | self.loss = log_prob * self.td_error # advantage (TD_error) guided loss 51 | 52 | with tf.name_scope('train'): 53 | self.train_op = tf.train.AdamOptimizer(lr).minimize(-self.loss) 54 | 55 | def choose_action(self, s): #选择行为 56 | s = s[np.newaxis, :] 57 | probs = self.sess.run(self.acts_prob, {self.s: s}) # get probabilities for all actions 58 | action = np.random.choice(np.arange(probs.shape[1]), p=probs.ravel()) 59 | return action # return a int 60 | 61 | 62 | def learn(self, s, a, td): 63 | s = s[np.newaxis, :] 64 | feed_dict = {self.s: s, self.a: a, self.td_error: td} 65 | _, loss = self.sess.run([self.train_op, self.loss], feed_dict) 66 | return loss 67 | 68 | 69 | class Critic(object): 70 | def __init__(self, sess, n_features, lr=0.01, gamma=0.9): 71 | self.sess = sess 72 | 73 | self.s = tf.placeholder(tf.float32, [1, n_features], "state") 74 | self.v_ = tf.placeholder(tf.float32, [1, 1], "v_next") 75 | self.r = tf.placeholder(tf.float32, None, 'r') 76 | 77 | with tf.variable_scope('Critic'): 78 | l1 = tf.layers.dense( 79 | inputs=self.s, 80 | units=20, # number of hidden units 81 | activation=tf.nn.relu, # None 82 | # have to be linear to make sure the convergence of actor. 83 | # But linear approximator seems hardly learns the correct Q. 
84 | kernel_initializer=tf.random_normal_initializer(0., .1), # weights 85 | bias_initializer=tf.constant_initializer(0.1), # biases 86 | name='l1' 87 | ) 88 | 89 | self.v = tf.layers.dense( 90 | inputs=l1, 91 | units=1, # output units 92 | activation=None, 93 | kernel_initializer=tf.random_normal_initializer(0., .1), # weights 94 | bias_initializer=tf.constant_initializer(0.1), # biases 95 | name='V' 96 | ) 97 | 98 | with tf.variable_scope('squared_TD_error'): 99 | self.td_error = self.r + gamma * self.v_ - self.v 100 | self.loss = tf.square(self.td_error) # TD_error = (r+gamma*V_next) - V_eval 101 | with tf.variable_scope('train'): 102 | self.train_op = tf.train.AdamOptimizer(lr).minimize(self.loss) 103 | 104 | def learn(self, s, r, s_): 105 | s, s_ = s[np.newaxis, :], s_[np.newaxis, :] 106 | 107 | v_ = self.sess.run(self.v, {self.s: s_}) 108 | td_error, _ = self.sess.run([self.td_error, self.train_op], 109 | {self.s: s, self.v_: v_, self.r: r}) 110 | return td_error 111 | 112 | np.random.seed(2) 113 | tf.set_random_seed(2) # reproducible 114 | 115 | # Superparameters 116 | OUTPUT_GRAPH = False 117 | MAX_EPISODE = 100#3000 118 | DISPLAY_REWARD_THRESHOLD = 200 # renders environment if total episode reward is greater then this threshold 119 | MAX_EP_STEPS = 1000 # maximum time step in one episode 120 | RENDER = False # rendering wastes time 121 | GAMMA = 0.9 # reward discount in TD error 122 | LR_A = 0.01 # learning rate for actor 123 | LR_C = 0.05 # learning rate for critic 124 | 125 | env = gym.make('CartPole-v0') 126 | env.seed(1) # reproducible 127 | env = env.unwrapped 128 | 129 | N_F = env.observation_space.shape[0] 130 | N_A = env.action_space.n 131 | 132 | from gym import Space 133 | 134 | sess = tf.Session() #两个网络共用一个session 135 | 136 | actor = Actor(sess, n_features=N_F, n_actions=N_A, lr=LR_A) 137 | critic = Critic(sess, n_features=N_F, lr=LR_C) # we need a good teacher, so the teacher should learn faster than the actor 138 | 139 | sess.run(tf.global_variables_initializer()) 140 | 141 | if OUTPUT_GRAPH: 142 | tf.summary.FileWriter("logs/", sess.graph) 143 | 144 | for i_episode in range(MAX_EPISODE): 145 | state = env.reset() 146 | t = 0 147 | r_list = [] 148 | 149 | while True: 150 | if RENDER: 151 | env.render() 152 | action = actor.choose_action(state) 153 | state_, reward, done, info = env.step(action) 154 | if done: 155 | reward=-20 #最后一步的奖励 一个trick 156 | r_list.append(reward) 157 | td_error = critic.learn(state, reward, state_) 158 | actor.learn(state, action, td_error) 159 | state = state_ 160 | 161 | if done or t>= MAX_EP_STEPS: 162 | ep_rs_sum = sum(r_list) 163 | if 'running_reward' not in globals(): 164 | running_reward = ep_rs_sum 165 | else: 166 | running_reward = running_reward * 0.95 + ep_rs_sum * 0.05 167 | if running_reward > DISPLAY_REWARD_THRESHOLD: RENDER = False # rendering 168 | print("episode:", i_episode, " reward:", int(running_reward)) 169 | break 170 | 171 | 172 | 173 | 174 | 175 | 176 | -------------------------------------------------------------------------------- /code/ddpg_update.py: -------------------------------------------------------------------------------- 1 | import tensorflow as tf 2 | import numpy as np 3 | import gym 4 | import time 5 | 6 | ##################### hyper parameters #################### 7 | 8 | MAX_EPISODES = 200 9 | MAX_EP_STEPS = 200 10 | LR_A = 0.01 # learning rate for actor 11 | LR_C = 0.02 # learning rate for critic 12 | GAMMA = 0.9 # reward discount 13 | TAU = 0.01 # soft replacement 14 | MEMORY_CAPACITY = 10000 15 | 
BATCH_SIZE = 32 16 | 17 | RENDER = False 18 | ENV_NAME = 'Pendulum-v0' 19 | 20 | #pendulum 动作与状态都是连续空间 21 | #动作空间:只有一维力矩 长度为1 虽然是连续值,但是有bound【-2,2】 22 | #状态空间:一维速度,长度为3 23 | 24 | ############################### DDPG #################################### 25 | #离线训练 单步更新 按batch更新 引入replay buffer机制 26 | class DDPG(object): 27 | def __init__(self, a_dim, s_dim, a_bound,): #初始化2个网络图 注意无论是critic还是actor网络都有target-network机制 target-network不训练 28 | self.memory = np.zeros((MEMORY_CAPACITY, s_dim * 2 + a_dim +1), dtype=np.float32) #借鉴replay buff机制 s*2 : s, s_ 29 | self.pointer = 0 30 | self.sess = tf.Session() 31 | 32 | self.a_dim, self.s_dim, self.a_bound = a_dim, s_dim, a_bound 33 | self.S = tf.placeholder(tf.float32, [None, s_dim], 's') #前面的None用来给batch size占位 34 | self.S_ = tf.placeholder(tf.float32, [None, s_dim], 's_') 35 | self.R = tf.placeholder(tf.float32, [None,1], 'r') 36 | 37 | with tf.variable_scope('Actor'): 38 | self.a = self._build_a(self.S, scope='eval', trainable=True) #要训练的pi网络,也负责收集数据 # input s, output a 39 | a_ = self._build_a(self.S, scope='target', trainable=False) #target网络不训练,只负责输出动作给critic # input s_, output a, get a_ for critic 40 | with tf.variable_scope('Critic'): 41 | q = self._build_c(self.S, self.a, scope='eval', trainable=True) #要训练的Q, 与target输出的q算mse(td-error) 注意这个a来自于memory 42 | q_ = self._build_c(self.S_, a_, scope='target', trainable=False) #这个网络不训练, 用于给出 Actor 更新参数时的 Gradient ascent 强度 即dq/da 注意这个a来自于actor要更新参数时候的a 43 | 44 | # networks parameters 45 | self.ae_params = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope='Actor/eval') 46 | self.at_params = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope='Actor/target') 47 | self.ce_params = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope='Critic/eval') 48 | self.ct_params = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope='Critic/target') 49 | 50 | #taget 网络更新 即从eval网络中复制参数 51 | self.soft_replace = [tf.assign(t, (1-TAU)*t + TAU *e) for t, e in zip(self.at_params+self.ct_params,self.ae_params+self.ce_params)] 52 | 53 | #训练critic网络(eval) 54 | q_target = self.R + GAMMA * q_ #贝尔曼公式(里面的q_来自于Q-target网络输入(s_,a_)的输出) 得出q的”真实值“ 与预测值求mse 55 | td_error = tf.losses.mean_squared_error(labels=q_target, predictions=q) #预测值q 来自于q-eval网络输入当前时刻的(s,a)的输出 56 | self.ctrain = tf.train.AdamOptimizer(LR_C).minimize(td_error, var_list = self.ce_params) #要train的是q-eval网络的参数 最小化mse 57 | 58 | #训练actor网络(eval) 59 | a_loss = -tf.reduce_mean(q) #maximize q 60 | self.atrain = tf.train.AdamOptimizer(LR_A).minimize(a_loss, var_list = self.ae_params) # 61 | 62 | self.sess.run(tf.global_variables_initializer()) 63 | 64 | 65 | def choose_action(self, s): 66 | s = s[np.newaxis, :] 67 | return self.sess.run(self.a, feed_dict={self.S: s})[0] # single action 68 | 69 | 70 | def learn(self): 71 | #每次学习都是先更新target网络参数 72 | self.sess.run(self.soft_replace) 73 | indices = np.random.choice(MEMORY_CAPACITY, size=BATCH_SIZE) 74 | bt = self.memory[indices, : ] #从memory中取一个batch的数据来训练 75 | bs = bt[:, :self.s_dim] #a batch of state 76 | ba = bt[:, self.s_dim: self.s_dim + self.a_dim] #a batch of action 77 | br = bt[:, -self.s_dim - 1: -self.s_dim] #a batch of reward 78 | bs_ = bt[:, -self.s_dim:] 79 | 80 | #一次训练一个batch 这一个batch的训练过程中target网络相当于固定不动 81 | self.sess.run(self.atrain, {self.S: bs}) 82 | self.sess.run(self.ctrain, {self.S: bs, self.a: ba, self.R: br, self.S_: bs_}) 83 | 84 | 85 | def store_transition(self, s, a, r, s_): #离线训练算法标准操作 86 | transition = np.hstack((s, a, [r], s_)) 87 | index = self.pointer % MEMORY_CAPACITY # replace the 
old memory with new memory 88 | self.memory[index, :] = transition 89 | self.pointer += 1 90 | 91 | def _build_a(self, s, scope, trainable): #actor网络结构 直接输出动作确定a 92 | with tf.variable_scope(scope): 93 | net = tf.layers.dense(s, 30, activation=tf.nn.relu, name='l1', trainable=trainable) 94 | a = tf.layers.dense(net, self.a_dim, activation=tf.nn.tanh, name='a', trainable=trainable) #a经过了tanh 数值缩放到了【-1,1】 95 | return tf.multiply(a, self.a_bound, name='scaled_a') #输出的每个a值都乘边界[max,] 可以保证输出范围在【-max,max】 如果最小 最大值不是相反数 得用clip正则化 96 | 97 | def _build_c(self, s, a, scope, trainable): #critic网络结构 输出Q(s,a) 98 | with tf.variable_scope(scope): 99 | n_l1 = 30 100 | w1_s = tf.get_variable('w1_s', [self.s_dim, n_l1], trainable=trainable) 101 | w1_a = tf.get_variable('w1_a', [self.a_dim, n_l1], trainable=trainable) 102 | b1 = tf.get_variable('b1', [1, n_l1], trainable=trainable) 103 | net = tf.nn.relu(tf.matmul(s, w1_s) + tf.matmul(a, w1_a) + b1) 104 | return tf.layers.dense(net, 1, trainable=trainable) # Q(s,a) 105 | 106 | env = gym.make(ENV_NAME) 107 | env = env.unwrapped 108 | env.seed(1) 109 | s_dim = env.observation_space.shape[0] 110 | a_dim = env.action_space.shape[0] 111 | a_bound = env.action_space.high 112 | ddpg = DDPG(a_dim, s_dim, a_bound) 113 | 114 | var = 3 # control exploration 115 | t1 = time.time() 116 | for i in range(MAX_EPISODES): 117 | s = env.reset() 118 | ep_reward = 0 119 | for j in range(MAX_EP_STEPS): #没有明确停止条件的游戏都需要这么一个 120 | if RENDER: 121 | env.render() 122 | a = ddpg.choose_action(s) 123 | a = np.clip(np.random.normal(a, var),-2,2) #增加exploration noise 以actor输出的a为均值,var为方差进行选择a 同时保证a的值在【-2,2】 124 | s_, r, done, info = env.step(a) 125 | 126 | ddpg.store_transition(s, a, r/10, s_) 127 | if ddpg.pointer > MEMORY_CAPACITY: #存储的数据满了开始训练各个网络 128 | var *= 0.9995 #降低动作选择的随机性 129 | ddpg.learn() #超过10000才开始训练,每次从经验库中抽取一个batch,每走一步都会执行一次训练 单步更新 130 | s = s_ 131 | ep_reward += r 132 | if j == MAX_EP_STEPS-1: 133 | print('Episode:', i, ' Reward: %i' % int(ep_reward), 'Explore: %.2f' % var, ) 134 | # if ep_reward > -300:RENDER = True 135 | break 136 | print('Running time: ', time.time() - t1) -------------------------------------------------------------------------------- /code/deep_deterministic_policy_gradient.py: -------------------------------------------------------------------------------- 1 | import tensorflow as tf 2 | import numpy as np 3 | import gym 4 | import time 5 | 6 | 7 | np.random.seed(1) 8 | tf.set_random_seed(1) 9 | 10 | MAX_EPISODES = 200 11 | MAX_EP_STEPS = 200 12 | LR_A = 0.001 # learning rate for actor 13 | LR_C = 0.001 # learning rate for critic 14 | GAMMA = 0.9 # reward discount 15 | REPLACEMENT = [ 16 | dict(name='soft', tau=0.01), 17 | dict(name='hard', rep_iter_a=600, rep_iter_c=500) 18 | ][0] # you can try different target replacement strategies 19 | MEMORY_CAPACITY = 10000 20 | BATCH_SIZE = 32 21 | 22 | RENDER = False 23 | OUTPUT_GRAPH = True 24 | ENV_NAME = 'Pendulum-v0' 25 | 26 | class Actor(object): 27 | def __init__(self, sess, action_dim, action_bound, learning_rate, replacement): 28 | self.sess = sess 29 | self.a_dim = action_dim 30 | self.action_bound = action_bound 31 | self.lr = learning_rate 32 | self.replacement = replacement 33 | self.t_replace_counter = 0 34 | 35 | with tf.variable_scope('Actor'): 36 | # 这个网络用于及时更新参数 37 | self.a = self._build_net(S, scope='eval_net', trainable=True) #由target网络给出确定的action 38 | #对比ppo 网络的输出是一个概率分布 39 | #pi, pi_params = self._build_anet('pi', trainable=True) 40 | #self.sample_op = tf.squeeze(pi.sample(1), axis=0) 
#按概率分布pi选择一个action 41 | 42 | # 这个网络不及时更新参数, 用于预测 Critic 的 Q_target 中的 action 43 | self.a_ = self._build_net(S_, scope='target_net', trainable=False) 44 | 45 | self.e_params = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope='Actor/eval_net') 46 | self.t_params = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope='Actor/target_net') 47 | 48 | if self.replacement['name'] == 'hard': 49 | self.t_replace_counter = 0 50 | self.hard_replace = [tf.assign(t, e) for t, e in zip(self.t_params, self.e_params)] 51 | else: 52 | self.soft_replace = [tf.assign(t, (1 - self.replacement['tau']) * t + self.replacement['tau'] * e) 53 | for t, e in zip(self.t_params, self.e_params)] 54 | 55 | def _build_net(self, s, scope, trainable): 56 | with tf.variable_scope(scope): 57 | init_w = tf.random_normal_initializer(0., 0.3) 58 | init_b = tf.constant_initializer(0.1) 59 | net = tf.layers.dense(s, 30, activation=tf.nn.relu, 60 | kernel_initializer=init_w, bias_initializer=init_b, name='l1', 61 | trainable=trainable) 62 | with tf.variable_scope('a'): 63 | actions = tf.layers.dense(net, self.a_dim, activation=tf.nn.tanh, kernel_initializer=init_w, 64 | bias_initializer=init_b, name='a', trainable=trainable) 65 | scaled_a = tf.multiply(actions, self.action_bound, name='scaled_a') # Scale output to -action_bound to action_bound 66 | return scaled_a 67 | 68 | def learn(self, s): # batch update 69 | self.sess.run(self.train_op, feed_dict={S: s}) 70 | 71 | if self.replacement['name'] == 'soft': 72 | self.sess.run(self.soft_replace) 73 | else: 74 | if self.t_replace_counter % self.replacement['rep_iter_a'] == 0: 75 | self.sess.run(self.hard_replace) 76 | self.t_replace_counter += 1 77 | 78 | def choose_action(self, s): 79 | s = s[np.newaxis, :] # single state 80 | return self.sess.run(self.a, feed_dict={S: s})[0] # single action 81 | #对比ppo a = self.sess.run(self.sample_op, {self.tfs:s})[0] 82 | 83 | ## 将 critic 产出的 dQ/da 加入到 Actor 的 Graph 中去 84 | def add_grad_to_graph(self, a_grads): 85 | with tf.variable_scope('policy_grads'): 86 | # ys = policy; 87 | # xs = policy's parameters; 88 | # a_grads = the gradients of the policy to get more Q 89 | # tf.gradients will calculate dys/dxs with a initial gradients for ys, so this is dq/da * da/dparams 90 | self.policy_grads = tf.gradients(ys=self.a, xs=self.e_params, grad_ys=a_grads) ##grad_ys 这是从 Critic 来的 dQ/da 91 | 92 | with tf.variable_scope('A_train'): 93 | opt = tf.train.AdamOptimizer(-self.lr) # (- learning rate) for ascent policy 94 | self.train_op = opt.apply_gradients(zip(self.policy_grads, self.e_params)) 95 | 96 | class Critic(object): 97 | def __init__(self, sess, state_dim, action_dim, learning_rate, gamma, replacement, a, a_): 98 | self.sess = sess 99 | self.s_dim = state_dim 100 | self.a_dim = action_dim 101 | self.lr = learning_rate 102 | self.gamma = gamma 103 | self.replacement = replacement 104 | 105 | with tf.variable_scope('Critic'): 106 | # Input (s, a), output q 107 | self.a = tf.stop_gradient(a) # stop critic update flows to actor 108 | self.q = self._build_net(S, self.a, 'eval_net', trainable=True) 109 | 110 | # Input (s_, a_), output q_ for q_target 111 | self.q_ = self._build_net(S_, a_, 'target_net', trainable=False) # target_q is based on a_ from Actor's target_net 112 | 113 | self.e_params = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope='Critic/eval_net') 114 | self.t_params = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope='Critic/target_net') 115 | 116 | with tf.variable_scope('target_q'): 117 | self.target_q = R + 
self.gamma * self.q_ ## self.q_ 根据 Actor 的 target_net 来 118 | 119 | with tf.variable_scope('TD_error'): 120 | self.loss = tf.reduce_mean(tf.squared_difference(self.target_q, self.q)) # self.q 又基于 Actor 的 target_net 121 | 122 | with tf.variable_scope('C_train'): 123 | self.train_op = tf.train.AdamOptimizer(self.lr).minimize(self.loss) 124 | 125 | with tf.variable_scope('a_grad'): 126 | self.a_grads = tf.gradients(self.q, a)[0] # tensor of gradients of each sample (None, a_dim) 127 | 128 | if self.replacement['name'] == 'hard': 129 | self.t_replace_counter = 0 130 | self.hard_replacement = [tf.assign(t, e) for t, e in zip(self.t_params, self.e_params)] 131 | else: 132 | self.soft_replacement = [tf.assign(t, (1 - self.replacement['tau']) * t + self.replacement['tau'] * e) 133 | for t, e in zip(self.t_params, self.e_params)] 134 | 135 | def _build_net(self, s, a, scope, trainable): 136 | with tf.variable_scope(scope): 137 | init_w = tf.random_normal_initializer(0., 0.1) 138 | init_b = tf.constant_initializer(0.1) 139 | 140 | with tf.variable_scope('l1'): 141 | n_l1 = 30 142 | w1_s = tf.get_variable('w1_s', [self.s_dim, n_l1], initializer=init_w, trainable=trainable) 143 | w1_a = tf.get_variable('w1_a', [self.a_dim, n_l1], initializer=init_w, trainable=trainable) 144 | b1 = tf.get_variable('b1', [1, n_l1], initializer=init_b, trainable=trainable) 145 | net = tf.nn.relu(tf.matmul(s, w1_s) + tf.matmul(a, w1_a) + b1) 146 | 147 | with tf.variable_scope('q'): 148 | q = tf.layers.dense(net, 1, kernel_initializer=init_w, bias_initializer=init_b, trainable=trainable) # Q(s,a) 149 | return q 150 | 151 | def learn(self, s, a, r, s_): 152 | self.sess.run(self.train_op, feed_dict={S: s, self.a: a, R: r, S_: s_}) 153 | if self.replacement['name'] == 'soft': 154 | self.sess.run(self.soft_replacement) 155 | else: 156 | if self.t_replace_counter % self.replacement['rep_iter_c'] == 0: 157 | self.sess.run(self.hard_replacement) 158 | self.t_replace_counter += 1 159 | 160 | 161 | class Memory(object): 162 | def __init__(self, capacity, dims): 163 | self.capacity = capacity 164 | self.data = np.zeros((capacity, dims)) 165 | self.pointer = 0 166 | 167 | def store_transition(self, s, a, r, s_): 168 | transition = np.hstack((s, a, [r], s_)) 169 | index = self.pointer % self.capacity # replace the old memory with new memory 170 | self.data[index, :] = transition 171 | self.pointer += 1 172 | 173 | def sample(self, n): 174 | assert self.pointer >= self.capacity, 'Memory has not been fulfilled' 175 | indices = np.random.choice(self.capacity, size=n) 176 | return self.data[indices, :] 177 | 178 | 179 | env = gym.make(ENV_NAME) 180 | env = env.unwrapped 181 | env.seed(1) 182 | 183 | state_dim = env.observation_space.shape[0] 184 | action_dim = env.action_space.shape[0] 185 | action_bound = env.action_space.high 186 | 187 | # all placeholder for tf 188 | with tf.name_scope('S'): 189 | S = tf.placeholder(tf.float32, shape=[None, state_dim], name='s') 190 | with tf.name_scope('R'): 191 | R = tf.placeholder(tf.float32, [None, 1], name='r') 192 | with tf.name_scope('S_'): 193 | S_ = tf.placeholder(tf.float32, shape=[None, state_dim], name='s_') 194 | 195 | 196 | sess = tf.Session() 197 | 198 | # Create actor and critic. 
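# Note on the wiring below: the Critic is built from the Actor's a / a_ tensors, and the Actor's train op
# is then completed with the Critic's dQ/da via add_grad_to_graph, i.e. the deterministic policy gradient
# dJ/dtheta ~ mean_batch( dQ/da * da/dtheta ). A minimal sketch of the equivalent one-loss formulation
# used in ddpg_update.py (there the gradient simply flows through the critic automatically):
#
#     a_loss = -tf.reduce_mean(q)   # q = Q(s, mu(s)) built from the eval actor's action
#     atrain = tf.train.AdamOptimizer(LR_A).minimize(a_loss, var_list=actor_eval_params)
#
# Up to the 1/batch-size factor implied by reduce_mean versus tf.gradients' summed grad_ys, both
# formulations yield the same parameter update.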
199 | # They are actually connected to each other, details can be seen in tensorboard or in this picture: 200 | actor = Actor(sess, action_dim, action_bound, LR_A, REPLACEMENT) 201 | critic = Critic(sess, state_dim, action_dim, LR_C, GAMMA, REPLACEMENT, actor.a, actor.a_) # 将 actor 同它的 eval_net/target_net 产生的 a/a_ 传给 Critic 202 | actor.add_grad_to_graph(critic.a_grads) # 将 critic 产出的 dQ/da 加入到 Actor 的 Graph 中去 203 | 204 | sess.run(tf.global_variables_initializer()) 205 | 206 | M = Memory(MEMORY_CAPACITY, dims=2 * state_dim + action_dim + 1) 207 | 208 | if OUTPUT_GRAPH: 209 | tf.summary.FileWriter("logs/", sess.graph) 210 | 211 | var = 3 # control exploration 212 | 213 | t1 = time.time() 214 | for i in range(MAX_EPISODES): 215 | s = env.reset() 216 | ep_reward = 0 217 | 218 | for j in range(MAX_EP_STEPS): 219 | 220 | if RENDER: 221 | env.render() 222 | 223 | # Add exploration noise 224 | a = actor.choose_action(s) 225 | a = np.clip(np.random.normal(a, var), -2, 2) # add randomness to action selection for exploration 226 | s_, r, done, info = env.step(a) 227 | 228 | M.store_transition(s, a, r / 10, s_) 229 | 230 | if M.pointer > MEMORY_CAPACITY: 231 | var *= .9995 # decay the action randomness 232 | b_M = M.sample(BATCH_SIZE) 233 | b_s = b_M[:, :state_dim] 234 | b_a = b_M[:, state_dim: state_dim + action_dim] 235 | b_r = b_M[:, -state_dim - 1: -state_dim] 236 | b_s_ = b_M[:, -state_dim:] 237 | 238 | critic.learn(b_s, b_a, b_r, b_s_) 239 | actor.learn(b_s) 240 | 241 | s = s_ 242 | ep_reward += r 243 | 244 | if j == MAX_EP_STEPS-1: 245 | print('Episode:', i, ' Reward: %i' % int(ep_reward), 'Explore: %.2f' % var, ) 246 | if ep_reward > -300: 247 | RENDER = True 248 | break 249 | 250 | print('Running time: ', time.time()-t1) -------------------------------------------------------------------------------- /code/policy_gradient.py: -------------------------------------------------------------------------------- 1 | import tensorflow as tf 2 | import numpy as np 3 | import gym 4 | 5 | class PolicyGradient: 6 | def __init__(self, 7 | n_actions, 8 | n_features, 9 | learning_rate=0.01, 10 | reward_decay=0.95, 11 | output_graph=False 12 | ): 13 | self.n_actions = n_actions 14 | self.n_features = n_features 15 | self.lr = learning_rate 16 | self.gamma = reward_decay 17 | self.ep_obs, self.ep_as, self.ep_rs =[],[],[] #states,actions,rewards 18 | self.__build_net() 19 | self.sess = tf.Session() 20 | 21 | if output_graph: 22 | # $ tensorboard --logdir=logs 23 | # http://0.0.0.0:6006/ 24 | # tf.train.SummaryWriter soon be deprecated, use following 25 | tf.summary.FileWriter("logs/", self.sess.graph) 26 | 27 | self.sess.run(tf.global_variables_initializer()) 28 | 29 | def __build_net(self): #PG网络 30 | with tf.name_scope('inputs'): 31 | self.tf_obs = tf.placeholder(tf.float32, [None, self.n_features],name="observations") 32 | self.tf_acts = tf.placeholder(tf.int32, [None,], name="actions_num") 33 | self.tf_vt = tf.placeholder(tf.float32, [None,], name="actions_value") #V(s,a) 34 | 35 | layer = tf.layers.dense( 36 | inputs = self.tf_obs, 37 | units = 10, 38 | activation = tf.nn.tanh, 39 | kernel_initializer=tf.random_normal_initializer(mean=0, stddev=0.3), 40 | bias_initializer=tf.constant_initializer(0.1), 41 | name = 'fc1' 42 | ) 43 | all_act = tf.layers.dense( 44 | inputs=layer, 45 | units=self.n_actions, # 输出个数 46 | activation=None, # 之后再加 Softmax 47 | kernel_initializer=tf.random_normal_initializer(mean=0, stddev=0.3), 48 | bias_initializer=tf.constant_initializer(0.1), 49 | name='fc2' 50 | ) 51 | 
self.all_act_prob = tf.nn.softmax(all_act, name='act_prob') 52 | with tf.name_scope('loss'): 53 | # 最大化 总体 reward (log_p * R) 就是在最小化 -(log_p * R), 而 tf 的功能里只有最小化 loss 54 | neg_log_prob = tf.reduce_sum(-tf.log(self.all_act_prob)*tf.one_hot(self.tf_acts, self.n_actions), axis=1) #加- 变为梯度下降 55 | loss = tf.reduce_mean(neg_log_prob * self.tf_vt) 56 | 57 | with tf.name_scope('train'): 58 | self.train_op = tf.train.AdamOptimizer(self.lr).minimize(loss) 59 | 60 | def choose_action(self, observation): #选择行为 61 | prob_weights = self.sess.run(self.all_act_prob, feed_dict = {self.tf_obs: observation[np.newaxis, :]}) #[0,1,2]->[[0,1,2]] 所有action的概率 矩阵形式 62 | action = np.random.choice(range(prob_weights.shape[1]), p=prob_weights.ravel()) # 根据概率来选 action range(prob_weights.shape[1]用0,1,2,表示动作 63 | return action 64 | 65 | def store_transition(self, s, a, r):#存储一个回合的经验 66 | self.ep_obs.append(s) 67 | self.ep_as.append(a) 68 | self.ep_rs.append(r) 69 | 70 | def learn(self): 71 | discounted_ep_rs_norm = self._discount_and_norm_rewards() # 衰减, 并标准化这回合的 reward 72 | self.sess.run(self.train_op, feed_dict={ 73 | self.tf_obs: np.vstack(self.ep_obs), # shape=[None, n_obs] 74 | self.tf_acts: np.array(self.ep_as), # shape=[None, ] 75 | self.tf_vt: discounted_ep_rs_norm, # shape=[None, ] 76 | }) 77 | self.ep_obs, self.ep_as, self.ep_rs = [],[],[] #清空回合数据 78 | return discounted_ep_rs_norm # 返回这一回合的 state-action value 79 | 80 | 81 | def _discount_and_norm_rewards(self): #用bellman公式计算出vt(s,a) 82 | #discount 83 | discounted_ep_rs = np.zeros_like(self.ep_rs) 84 | running_add = 0 85 | for t in reversed(range(0, len(self.ep_rs))): #倒数遍历这个episode中的reward 86 | running_add = running_add * self.gamma + self.ep_rs[t] 87 | discounted_ep_rs[t] = running_add 88 | # r1,r2,r3 -> r1+r2*gamma+r3*gamma^2, r2+r3*gamma, r3 89 | 90 | #normalize 91 | discounted_ep_rs -= np.mean(discounted_ep_rs) 92 | discounted_ep_rs /= np.std(discounted_ep_rs) 93 | return discounted_ep_rs 94 | 95 | 96 | 97 | #将算法应用起来吧! 
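# A small worked example of _discount_and_norm_rewards above (hypothetical numbers, not from this script):
# with gamma = 0.9 and episode rewards [1, 1, 1], the backward pass gives
#
#     discounted = [1 + 0.9*(1 + 0.9*1), 1 + 0.9*1, 1] = [2.71, 1.9, 1.0]
#
# which is then standardized (mean subtracted, divided by std) before being fed in as tf_vt, so earlier
# actions get credit for later rewards while the scale of the weights stays stable across episodes.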
98 | RENDER = False # 在屏幕上显示模拟窗口会拖慢运行速度, 我们等计算机学得差不多了再显示模拟 99 | DISPLAY_REWARD_THRESHOLD = 1000 # 当 回合总 reward 大于 400 时显示模拟窗口 100 | 101 | #env = gym.make('CartPole-v0') # CartPole 2个动作 向左 向右 102 | env = gym.make('MountainCar-v0') #3个动作 左侧加速、不加速、右侧加速 103 | 104 | env = env.unwrapped # 取消限制 105 | env.seed(1) # 普通的 Policy gradient 方法, 使得回合的 variance 比较大, 所以我们选了一个好点的随机种子 106 | 107 | print(env.action_space) # 显示可用 action 108 | print(env.observation_space) # 显示可用 state 的 observation 109 | print(env.observation_space.high) # 显示 observation 最高值 110 | print(env.observation_space.low) # 显示 observation 最低值 111 | 112 | # 定义 113 | RL = PolicyGradient( 114 | n_actions=env.action_space.n, 115 | n_features=env.observation_space.shape[0], 116 | learning_rate=0.02, 117 | reward_decay=0.99, # gamma 118 | # output_graph=True, # 输出 tensorboard 文件 119 | ) 120 | 121 | for i_episode in range(100): 122 | observation = env.reset() 123 | while True: 124 | if RENDER: 125 | env.render() 126 | action = RL.choose_action(observation) 127 | observation_, reward, done, info = env.step(action) 128 | RL.store_transition(observation, action, reward) 129 | 130 | if done: 131 | ep_rs_sum = sum(RL.ep_rs) 132 | if 'running_reward' not in globals(): 133 | running_reward = ep_rs_sum 134 | else: 135 | running_reward = running_reward *0.99 + ep_rs_sum *0.01 #不是简单的求和展示当下rewad 比较科学 136 | print("episode:", i_episode, "reward:", int(running_reward)) 137 | vt = RL.learn() #学习 输出vt 138 | break 139 | 140 | observation = observation_ 141 | 142 | 143 | -------------------------------------------------------------------------------- /code/proximal_policy_optimization.py: -------------------------------------------------------------------------------- 1 | #pendulum 2 | #动作空间:只有一维力矩 长度为1 3 | #状态空间:一维速度,长度为3 4 | 5 | ''' 6 | Critic网络直接给出V(s) 7 | Actor网络由2部分组成 oldpi pi 8 | PPO升级于A2C(critic按batch更新,离线训练,有2个pi),升级于PG(加入critic网络,利用advantage引导pg优化) 9 | ''' 10 | import tensorflow as tf 11 | import numpy as np 12 | import matplotlib.pyplot as plt 13 | import gym 14 | 15 | EP_MAX = 1000 16 | EP_LEN = 200 17 | GAMMA = 0.9 18 | A_LR = 0.0001 19 | C_LR = 0.0002 20 | BATCH = 32 21 | A_UPDATE_STEPS = 10 22 | C_UPDATE_STEPS = 10 23 | S_DIM, A_DIM = 3, 1 #pendulum游戏 24 | METHOD = [ 25 | dict(name='kl_pen', kl_target=0.01, lam=0.5), # KL penalty 26 | dict(name='clip', epsilon=0.2), # Clipped surrogate objective, find this is better 27 | ][1] # choose the method for optimization 28 | 29 | 30 | 31 | class PPO(object): 32 | def __init__(self): 33 | self.sess = tf.Session() 34 | self.tfs = tf.placeholder(tf.float32, [None, S_DIM], 'state') #[N_each_batch,DIM] 35 | 36 | #搭建AC网络 critic 37 | with tf.variable_scope('critic'): 38 | layer1 = tf.layers.dense(self.tfs, 100, tf.nn.relu) 39 | self.v = tf.layers.dense(layer1, 1) 40 | self.tfdc_r = tf.placeholder(tf.float32, [None, 1], 'discounted_r') 41 | self.advantage = self.tfdc_r - self.v # discounted reward - Critic 出来的 state value 42 | self.closs = tf.reduce_mean(tf.square(self.advantage)) # mse loss of critic 43 | self.ctrain_op = tf.train.AdamOptimizer(C_LR).minimize(self.closs) 44 | 45 | pi, pi_params = self._build_anet('pi', trainable=True) 46 | oldpi, oldpi_params = self._build_anet('oldpi', trainable=False) #每个pi的本质是一个概率分布 47 | with tf.variable_scope('sample_action'): 48 | self.sample_op = tf.squeeze(pi.sample(1), axis=0) #按概率分布pi选择一个action 49 | with tf.variable_scope('update_oldpi'): 50 | self.update_oldpi_op = [oldp.assign(p) for p, oldp in zip(pi_params, oldpi_params)] #将pi的参数复制给oldpi 51 | 52 | self.tfa = 
tf.placeholder(tf.float32, [None, A_DIM], 'action') 53 | self.tfadv = tf.placeholder(tf.float32, [None, 1], 'advantage') 54 | with tf.variable_scope('loss'): 55 | with tf.variable_scope('surrogate'): 56 | ratio = pi.prob(self.tfa) / oldpi.prob(self.tfa) #(New Policy/Old Policy) 的比例 57 | surr = ratio * self.tfadv #surrogate objective 58 | if METHOD['name'] == 'kl_pen': # 如果用 KL penatily 59 | self.tflam = tf.placeholder(tf.float32, None, 'lambda') 60 | kl = tf.distributions.kl_divergence(oldpi, pi) 61 | self.kl_mean = tf.reduce_mean(kl) 62 | self.aloss = tf.reduce_mean(surr - self.tflam * kl) #actor 最终的loss function 63 | else: # 如果用 clipping 的方式 64 | self.aloss = tf.reduce_mean(tf.minimum(surr, tf.clip_by_value(ratio, 1-METHOD['epsilon'], 1+METHOD['epsilon'])*self.tfadv)) 65 | 66 | with tf.variable_scope('atrain'): 67 | self.atrain_op = tf.train.AdamOptimizer(A_LR).minimize(-self.aloss) 68 | 69 | self.sess.run(tf.global_variables_initializer()) 70 | 71 | 72 | def update(self, s, a, r): #update ppo 73 | # 先要将 oldpi 里的参数更新 pi 中的 74 | self.sess.run(self.update_oldpi_op) 75 | adv = self.sess.run(self.advantage, {self.tfs:s, self.tfdc_r:r}) 76 | # adv = (adv - adv.mean())/(adv.std()+1e-6) # sometimes helpful 77 | # update actor 78 | # 更新 Actor 时, kl penalty 和 clipping 方式是不同的 79 | if METHOD['name'] == 'kl_pen': 80 | for _ in range(A_UPDATE_STEPS): #actor 一次训练更新10次 81 | _, kl = self.sess.run( 82 | [self.atrain_op, self.kl_mean], 83 | {self.tfs: s, self.tfa: a, self.tfadv: adv, self.tflam: METHOD['lam']}) 84 | if kl > 4*METHOD['kl_target']: # this in in google's paper 85 | break 86 | if kl < METHOD['kl_target'] / 1.5: # adaptive lambda, this is in OpenAI's paper 87 | METHOD['lam'] /= 2 88 | elif kl > METHOD['kl_target'] * 1.5: 89 | METHOD['lam'] *= 2 90 | METHOD['lam'] = np.clip(METHOD['lam'], 1e-4, 10) # sometimes explode, this clipping is my solution 91 | else: # clipping method, find this is better (OpenAI's paper) 92 | [self.sess.run(self.atrain_op, {self.tfs: s, self.tfa: a, self.tfadv: adv}) for _ in range(A_UPDATE_STEPS)] #actor 一次训练更新10次 93 | # 更新 Critic 的时候, 他们是一样的 critic一次训练更新10次 94 | [self.sess.run(self.ctrain_op, {self.tfs: s, self.tfdc_r: r}) for _ in range(C_UPDATE_STEPS)] 95 | 96 | def choose_action(self, s): 97 | s = s[np.newaxis, :] 98 | a = self.sess.run(self.sample_op, {self.tfs:s})[0] 99 | return np.clip(a,-2,2) #动作不要超出【-2,2】的范围 因为是按概率分布取动作 所以加上这一步很有必要! 
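    # A sketch of what the 'clip' branch of the actor loss above does, with hypothetical numbers
    # (epsilon = 0.2): if the new policy makes an action 1.5x more likely than oldpi (ratio = 1.5) and the
    # advantage is positive, the surrogate is evaluated with the ratio clipped to 1.2, so the update gains
    # nothing from pushing the policy more than 20% away from oldpi within one round of A_UPDATE_STEPS.
    # In numpy terms:
    #
    #     clipped = np.clip(ratio, 1 - 0.2, 1 + 0.2)
    #     loss    = -np.mean(np.minimum(ratio * adv, clipped * adv))
    #
    # which corresponds to -self.aloss, the quantity the atrain op minimizes.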
100 | 101 | def get_v(self, s): #V(s)状态值 由critic网络给出 102 | if s.ndim < 2: 103 | s = s[np.newaxis, :] 104 | return self.sess.run(self.v, {self.tfs:s})[0,0] 105 | 106 | def _build_anet(self, name, trainable): #critic网络输出动作的概率分布 包含参数均值u与方差sigma 107 | with tf.variable_scope(name): 108 | l1 = tf.layers.dense(self.tfs, 100, tf.nn.relu, trainable=trainable) 109 | mu = 2 * tf.layers.dense(l1, A_DIM, tf.nn.tanh, trainable=trainable) 110 | sigma = tf.layers.dense(l1, A_DIM, tf.nn.softplus, trainable=trainable) 111 | norm_dist = tf.distributions.Normal(loc=mu, scale=sigma) 112 | params = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope=name) 113 | return norm_dist, params 114 | 115 | env = gym.make('Pendulum-v0').unwrapped 116 | 117 | ppo = PPO() 118 | all_ep_r = [] 119 | 120 | 121 | for ep in range(EP_MAX): 122 | s = env.reset() 123 | buffer_s, buffer_a, buffer_r = [],[],[] 124 | ep_r = 0 125 | for t in range(EP_LEN): 126 | env.render() 127 | a = ppo.choose_action(s) 128 | s_, r, done, info = env.step(a) 129 | buffer_s.append(s) 130 | buffer_a.append(a) 131 | buffer_r.append((r+8)/8) # normalize reward, 发现有帮助 132 | s = s_ 133 | ep_r += r #一个episode的reward之和 134 | 135 | # 如果 buffer 收集一个 batch 了或者 episode 完了 136 | #则更新ppo 137 | if (t+1) % BATCH == 0 or t == EP_LEN -1: 138 | #计算折扣奖励 139 | v_s_ = ppo.get_v(s_) 140 | discounted_r = [] 141 | for r in buffer_r[::-1]: 142 | v_s_ = r + GAMMA * v_s_ 143 | discounted_r.append(v_s_) 144 | discounted_r.reverse() 145 | 146 | bs, ba, br = np.vstack(buffer_s), np.vstack(buffer_a), np.array(discounted_r)[:, np.newaxis] #存入一个Batch 147 | #清空buffer 148 | buffer_s, buffer_a, buffer_r = [],[],[] 149 | ppo.update(bs, ba, br) #训练PPO 150 | if ep == 0: 151 | all_ep_r.append(ep_r) 152 | else: 153 | all_ep_r.append(all_ep_r[-1]*0.9 + ep_r*0.1) 154 | 155 | print('Ep: %i' % ep,"|Ep_r: %i" % ep_r,("|Lam: %.4f" % METHOD['lam']) if METHOD['name'] == 'kl_pen' else '',) 156 | 157 | 158 | #plt.plot(np.arange(len(all_ep_r)), all_ep_r) 159 | #plt.xlabel('Episode') 160 | #plt.ylabel('Moving averaged episode reward') 161 | #plt.show() 162 | 163 | 164 | 165 | 166 | 167 | -------------------------------------------------------------------------------- /code/tensrolayer-implemented/a3c.py: -------------------------------------------------------------------------------- 1 | """ 2 | Asynchronous Advantage Actor Critic (A3C) with Continuous Action Space. 3 | Actor Critic History 4 | ---------------------- 5 | A3C > DDPG (for continuous action space) > AC 6 | Advantage 7 | ---------- 8 | Train faster and more stable than AC. 9 | Disadvantage 10 | ------------- 11 | Have bias. 12 | Reference 13 | ---------- 14 | Original Paper: https://arxiv.org/pdf/1602.01783.pdf 15 | MorvanZhou's tutorial: https://morvanzhou.github.io/tutorials/ 16 | MorvanZhou's code: https://github.com/MorvanZhou/Reinforcement-learning-with-tensorflow/blob/master/experiments/Solve_BipedalWalker/A3C.py 17 | Environment 18 | ----------- 19 | BipedalWalker-v2 : https://gym.openai.com/envs/BipedalWalker-v2 20 | Reward is given for moving forward, total 300+ points up to the far end. 21 | If the robot falls, it gets -100. Applying motor torque costs a small amount of 22 | points, more optimal agent will get better score. State consists of hull angle 23 | speed, angular velocity, horizontal speed, vertical speed, position of joints 24 | and joints angular speed, legs contact with ground, and 10 lidar rangefinder 25 | measurements. There's no coordinates in the state vector. 
26 | Prerequisites 27 | -------------- 28 | tensorflow 2.0.0a0 29 | tensorflow-probability 0.6.0 30 | tensorlayer 2.0.0 31 | && 32 | pip install box2d box2d-kengz --user 33 | To run 34 | ------ 35 | python tutorial_A3C.py --train/test 36 | """ 37 | 38 | import argparse 39 | import multiprocessing 40 | import threading 41 | import time 42 | 43 | import numpy as np 44 | 45 | import gym 46 | import tensorflow as tf 47 | import tensorflow_probability as tfp 48 | import tensorlayer as tl 49 | from tensorlayer.layers import DenseLayer, InputLayer 50 | 51 | tfd = tfp.distributions 52 | 53 | tl.logging.set_verbosity(tl.logging.DEBUG) 54 | 55 | np.random.seed(2) 56 | tf.random.set_seed(2) # reproducible 57 | 58 | # add arguments in command --train/test 59 | parser = argparse.ArgumentParser(description='Train or test neural net motor controller.') 60 | parser.add_argument('--train', dest='train', action='store_true', default=False) 61 | parser.add_argument('--test', dest='test', action='store_true', default=True) 62 | args = parser.parse_args() 63 | 64 | ##################### hyper parameters #################### 65 | 66 | GAME = 'BipedalWalker-v2' # BipedalWalkerHardcore-v2 BipedalWalker-v2 LunarLanderContinuous-v2 67 | LOG_DIR = './log' # the log file 68 | N_WORKERS = multiprocessing.cpu_count() # number of workers accroding to number of cores in cpu 69 | # N_WORKERS = 2 # manually set number of workers 70 | MAX_GLOBAL_EP = 8 # number of training episodes 71 | GLOBAL_NET_SCOPE = 'Global_Net' 72 | UPDATE_GLOBAL_ITER = 10 # update global policy after several episodes 73 | GAMMA = 0.99 # reward discount factor 74 | ENTROPY_BETA = 0.005 # factor for entropy boosted exploration 75 | LR_A = 0.00005 # learning rate for actor 76 | LR_C = 0.0001 # learning rate for critic 77 | GLOBAL_RUNNING_R = [] 78 | GLOBAL_EP = 0 # will increase during training, stop training when it >= MAX_GLOBAL_EP 79 | 80 | ################### Asynchronous Advantage Actor Critic (A3C) #################################### 81 | 82 | 83 | class ACNet(object): 84 | 85 | def __init__(self, scope, globalAC=None): 86 | self.scope = scope 87 | self.save_path = './model' 88 | 89 | w_init = tf.keras.initializers.glorot_normal(seed=None) # initializer, glorot=xavier 90 | 91 | def get_actor(input_shape): # policy network 92 | with tf.name_scope(self.scope): 93 | ni = tl.layers.Input(input_shape, name='in') 94 | nn = tl.layers.Dense(n_units=500, act=tf.nn.relu6, W_init=w_init, name='la')(ni) 95 | nn = tl.layers.Dense(n_units=300, act=tf.nn.relu6, W_init=w_init, name='la2')(nn) 96 | mu = tl.layers.Dense(n_units=N_A, act=tf.nn.tanh, W_init=w_init, name='mu')(nn) 97 | sigma = tl.layers.Dense(n_units=N_A, act=tf.nn.softplus, W_init=w_init, name='sigma')(nn) 98 | return tl.models.Model(inputs=ni, outputs=[mu, sigma], name=scope + '/Actor') 99 | 100 | self.actor = get_actor([None, N_S]) 101 | self.actor.train() # train mode for Dropout, BatchNorm 102 | 103 | def get_critic(input_shape): # we use Value-function here, but not Q-function. 
104 | with tf.name_scope(self.scope): 105 | ni = tl.layers.Input(input_shape, name='in') 106 | nn = tl.layers.Dense(n_units=500, act=tf.nn.relu6, W_init=w_init, name='lc')(ni) 107 | nn = tl.layers.Dense(n_units=300, act=tf.nn.relu6, W_init=w_init, name='lc2')(nn) 108 | v = tl.layers.Dense(n_units=1, W_init=w_init, name='v')(nn) 109 | return tl.models.Model(inputs=ni, outputs=v, name=scope + '/Critic') 110 | 111 | self.critic = get_critic([None, N_S]) 112 | self.critic.train() # train mode for Dropout, BatchNorm 113 | 114 | @tf.function # convert numpy functions to tf.Operations in the TFgraph, return tensor 115 | def update_global( 116 | self, buffer_s, buffer_a, buffer_v_target, globalAC 117 | ): # refer to the global Actor-Crtic network for updating it with samples 118 | ''' update the global critic ''' 119 | with tf.GradientTape() as tape: 120 | self.v = self.critic(buffer_s) 121 | self.v_target = buffer_v_target 122 | td = tf.subtract(self.v_target, self.v, name='TD_error') 123 | self.c_loss = tf.reduce_mean(tf.square(td)) 124 | self.c_grads = tape.gradient(self.c_loss, self.critic.trainable_weights) 125 | OPT_C.apply_gradients(zip(self.c_grads, globalAC.critic.trainable_weights)) # local grads applies to global net 126 | # del tape # Drop the reference to the tape 127 | ''' update the global actor ''' 128 | with tf.GradientTape() as tape: 129 | self.mu, self.sigma = self.actor(buffer_s) 130 | self.test = self.sigma[0] 131 | self.mu, self.sigma = self.mu * A_BOUND[1], self.sigma + 1e-5 132 | 133 | normal_dist = tfd.Normal(self.mu, self.sigma) # no tf.contrib for tf2.0 134 | self.a_his = buffer_a # float32 135 | log_prob = normal_dist.log_prob(self.a_his) 136 | exp_v = log_prob * td # td is from the critic part, no gradients for it 137 | entropy = normal_dist.entropy() # encourage exploration 138 | self.exp_v = ENTROPY_BETA * entropy + exp_v 139 | self.a_loss = tf.reduce_mean(-self.exp_v) 140 | self.a_grads = tape.gradient(self.a_loss, self.actor.trainable_weights) 141 | OPT_A.apply_gradients(zip(self.a_grads, globalAC.actor.trainable_weights)) # local grads applies to global net 142 | return self.test # for test purpose 143 | 144 | @tf.function 145 | def pull_global(self, globalAC): # run by a local, pull weights from the global nets 146 | for l_p, g_p in zip(self.actor.trainable_weights, globalAC.actor.trainable_weights): 147 | l_p.assign(g_p) 148 | for l_p, g_p in zip(self.critic.trainable_weights, globalAC.critic.trainable_weights): 149 | l_p.assign(g_p) 150 | 151 | def choose_action(self, s): # run by a local 152 | s = s[np.newaxis, :] 153 | self.mu, self.sigma = self.actor(s) 154 | 155 | with tf.name_scope('wrap_a_out'): 156 | self.mu, self.sigma = self.mu * A_BOUND[1], self.sigma + 1e-5 157 | normal_dist = tfd.Normal(self.mu, self.sigma) # for continuous action space 158 | self.A = tf.clip_by_value(tf.squeeze(normal_dist.sample(1), axis=0), *A_BOUND) 159 | return self.A.numpy()[0] 160 | 161 | def save_ckpt(self): # save trained weights 162 | tl.files.save_npz(self.actor.trainable_weights, name='model_actor.npz') 163 | tl.files.save_npz(self.critic.trainable_weights, name='model_critic.npz') 164 | 165 | def load_ckpt(self): # load trained weights 166 | tl.files.load_and_assign_npz(name='model_actor.npz', network=self.actor) 167 | tl.files.load_and_assign_npz(name='model_critic.npz', network=self.critic) 168 | 169 | 170 | class Worker(object): 171 | 172 | def __init__(self, name, globalAC): 173 | self.env = gym.make(GAME) 174 | self.name = name 175 | self.AC = ACNet(name, globalAC) 
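        # The update in work() below: every UPDATE_GLOBAL_ITER steps (or at episode end) the worker turns
        # its reward buffer into n-step bootstrapped targets by walking it backwards — a condensed sketch
        # of that loop:
        #
        #     v_s_ = 0 if done else critic(s_)     # bootstrap from the critic at the cut point
        #     targets = []
        #     for r in buffer_r[::-1]:
        #         v_s_ = r + GAMMA * v_s_
        #         targets.append(v_s_)
        #     targets.reverse()
        #
        # It then pushes gradients computed on (buffer_s, buffer_a, targets) to the global net and pulls
        # the global weights back down.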
176 | 177 | # def work(self): 178 | def work(self, globalAC): 179 | global GLOBAL_RUNNING_R, GLOBAL_EP 180 | total_step = 1 181 | buffer_s, buffer_a, buffer_r = [], [], [] 182 | while not COORD.should_stop() and GLOBAL_EP < MAX_GLOBAL_EP: 183 | s = self.env.reset() 184 | ep_r = 0 185 | while True: 186 | # visualize Worker_0 during training 187 | if self.name == 'Worker_0' and total_step % 30 == 0: 188 | self.env.render() 189 | s = s.astype('float32') # double to float 190 | a = self.AC.choose_action(s) 191 | s_, r, done, _info = self.env.step(a) 192 | 193 | s_ = s_.astype('float32') # double to float 194 | # set robot falls reward to -2 instead of -100 195 | if r == -100: r = -2 196 | 197 | ep_r += r 198 | buffer_s.append(s) 199 | buffer_a.append(a) 200 | buffer_r.append(r) 201 | 202 | if total_step % UPDATE_GLOBAL_ITER == 0 or done: # update global and assign to local net 203 | 204 | if done: 205 | v_s_ = 0 # terminal 206 | else: 207 | v_s_ = self.AC.critic(s_[np.newaxis, :])[0, 0] # reduce dim from 2 to 0 208 | 209 | buffer_v_target = [] 210 | 211 | for r in buffer_r[::-1]: # reverse buffer r 212 | v_s_ = r + GAMMA * v_s_ 213 | buffer_v_target.append(v_s_) 214 | 215 | buffer_v_target.reverse() 216 | 217 | buffer_s, buffer_a, buffer_v_target = ( 218 | np.vstack(buffer_s), np.vstack(buffer_a), np.vstack(buffer_v_target) 219 | ) 220 | # update gradients on global network 221 | self.AC.update_global(buffer_s, buffer_a, buffer_v_target.astype('float32'), globalAC) 222 | buffer_s, buffer_a, buffer_r = [], [], [] 223 | 224 | # update local network from global network 225 | self.AC.pull_global(globalAC) 226 | 227 | s = s_ 228 | total_step += 1 229 | if done: 230 | if len(GLOBAL_RUNNING_R) == 0: # record running episode reward 231 | GLOBAL_RUNNING_R.append(ep_r) 232 | else: # moving average 233 | GLOBAL_RUNNING_R.append(0.95 * GLOBAL_RUNNING_R[-1] + 0.05 * ep_r) 234 | # print( 235 | # self.name, 236 | # "Episode: ", 237 | # GLOBAL_EP, 238 | # # "| pos: %i" % self.env.unwrapped.hull.position[0], # number of move 239 | # '| reward: %.1f' % ep_r, 240 | # "| running_reward: %.1f" % GLOBAL_RUNNING_R[-1], 241 | # # '| sigma:', test, # debug 242 | # # 'WIN ' * 5 if self.env.unwrapped.hull.position[0] >= 88 else '', 243 | # ) 244 | print('{}, Episode: {}/{} | Episode Reward: {:.4f} | Running Time: {:.4f}'\ 245 | .format(self.name, GLOBAL_EP, MAX_GLOBAL_EP, ep_r, time.time()-t0 )) 246 | GLOBAL_EP += 1 247 | break 248 | 249 | 250 | if __name__ == "__main__": 251 | 252 | env = gym.make(GAME) 253 | 254 | N_S = env.observation_space.shape[0] 255 | N_A = env.action_space.shape[0] 256 | 257 | A_BOUND = [env.action_space.low, env.action_space.high] 258 | A_BOUND[0] = A_BOUND[0].reshape(1, N_A) 259 | A_BOUND[1] = A_BOUND[1].reshape(1, N_A) 260 | # print(A_BOUND) 261 | if args.train: 262 | # ============================= TRAINING =============================== 263 | t0 = time.time() 264 | with tf.device("/cpu:0"): 265 | 266 | OPT_A = tf.optimizers.RMSprop(LR_A, name='RMSPropA') 267 | OPT_C = tf.optimizers.RMSprop(LR_C, name='RMSPropC') 268 | 269 | GLOBAL_AC = ACNet(GLOBAL_NET_SCOPE) # we only need its params 270 | workers = [] 271 | # Create worker 272 | for i in range(N_WORKERS): 273 | i_name = 'Worker_%i' % i # worker name 274 | workers.append(Worker(i_name, GLOBAL_AC)) 275 | 276 | COORD = tf.train.Coordinator() 277 | 278 | # start TF threading 279 | worker_threads = [] 280 | for worker in workers: 281 | # t = threading.Thread(target=worker.work) 282 | job = lambda: worker.work(GLOBAL_AC) 283 | t = 
threading.Thread(target=job) 284 | t.start() 285 | worker_threads.append(t) 286 | COORD.join(worker_threads) 287 | import matplotlib.pyplot as plt 288 | plt.plot(GLOBAL_RUNNING_R) 289 | plt.xlabel('episode') 290 | plt.ylabel('global running reward') 291 | plt.savefig('a3c.png') 292 | plt.show() 293 | 294 | GLOBAL_AC.save_ckpt() 295 | 296 | if args.test: 297 | # ============================= EVALUATION ============================= 298 | # env = gym.make(GAME) 299 | # GLOBAL_AC = ACNet(GLOBAL_NET_SCOPE) 300 | GLOBAL_AC.load_ckpt() 301 | while True: 302 | s = env.reset() 303 | rall = 0 304 | while True: 305 | env.render() 306 | s = s.astype('float32') # double to float 307 | a = GLOBAL_AC.choose_action(s) 308 | s, r, d, _ = env.step(a) 309 | rall += r 310 | if d: 311 | print("reward", rall) 312 | break -------------------------------------------------------------------------------- /code/tensrolayer-implemented/ac.py: -------------------------------------------------------------------------------- 1 | """ 2 | Actor-Critic 3 | ------------- 4 | It uses TD-error as the Advantage. 5 | Actor Critic History 6 | ---------------------- 7 | A3C > DDPG > AC 8 | Advantage 9 | ---------- 10 | AC converge faster than Policy Gradient. 11 | Disadvantage (IMPORTANT) 12 | ------------------------ 13 | The Policy is oscillated (difficult to converge), DDPG can solve 14 | this problem using advantage of DQN. 15 | Reference 16 | ---------- 17 | paper: https://papers.nips.cc/paper/1786-actor-critic-algorithms.pdf 18 | View more on MorvanZhou's tutorial page: https://morvanzhou.github.io/tutorials/ 19 | Environment 20 | ------------ 21 | CartPole-v0: https://gym.openai.com/envs/CartPole-v0 22 | A pole is attached by an un-actuated joint to a cart, which moves along a 23 | frictionless track. The system is controlled by applying a force of +1 or -1 24 | to the cart. The pendulum starts upright, and the goal is to prevent it from 25 | falling over. 26 | A reward of +1 is provided for every timestep that the pole remains upright. 27 | The episode ends when the pole is more than 15 degrees from vertical, or the 28 | cart moves more than 2.4 units from the center. 
29 | Prerequisites 30 | -------------- 31 | tensorflow >=2.0.0a0 32 | tensorlayer >=2.0.0 33 | To run 34 | ------ 35 | python tutorial_AC.py --train/test 36 | """ 37 | import argparse 38 | import time 39 | 40 | import numpy as np 41 | 42 | import gym 43 | import tensorflow as tf 44 | import tensorlayer as tl 45 | 46 | tl.logging.set_verbosity(tl.logging.DEBUG) 47 | 48 | np.random.seed(2) 49 | tf.random.set_seed(2) # reproducible 50 | 51 | # add arguments in command --train/test 52 | parser = argparse.ArgumentParser(description='Train or test neural net motor controller.') 53 | parser.add_argument('--train', dest='train', action='store_true', default=False) 54 | parser.add_argument('--test', dest='test', action='store_true', default=True) 55 | args = parser.parse_args() 56 | 57 | ##################### hyper parameters #################### 58 | 59 | OUTPUT_GRAPH = False 60 | MAX_EPISODE = 3000 # number of overall episodes for training 61 | DISPLAY_REWARD_THRESHOLD = 100 # renders environment if running reward is greater then this threshold 62 | MAX_EP_STEPS = 1000 # maximum time step in one episode 63 | RENDER = False # rendering wastes time 64 | LAMBDA = 0.9 # reward discount in TD error 65 | LR_A = 0.001 # learning rate for actor 66 | LR_C = 0.01 # learning rate for critic 67 | 68 | ############################### Actor-Critic #################################### 69 | 70 | 71 | class Actor(object): 72 | 73 | def __init__(self, n_features, n_actions, lr=0.001): 74 | 75 | def get_model(inputs_shape): 76 | ni = tl.layers.Input(inputs_shape, name='state') 77 | nn = tl.layers.Dense( 78 | n_units=30, act=tf.nn.relu6, W_init=tf.random_uniform_initializer(0, 0.01), name='hidden' 79 | )(ni) 80 | nn = tl.layers.Dense( 81 | n_units=10, act=tf.nn.relu6, W_init=tf.random_uniform_initializer(0, 0.01), name='hidden2' 82 | )(nn) 83 | nn = tl.layers.Dense(n_units=n_actions, name='actions')(nn) 84 | return tl.models.Model(inputs=ni, outputs=nn, name="Actor") 85 | 86 | self.model = get_model([None, n_features]) 87 | self.model.train() 88 | self.optimizer = tf.optimizers.Adam(lr) 89 | 90 | def learn(self, s, a, td): 91 | with tf.GradientTape() as tape: 92 | _logits = self.model(np.array([s])) 93 | ## cross-entropy loss weighted by td-error (advantage), 94 | # the cross-entropy mearsures the difference of two probability distributions: the predicted logits and sampled action distribution, 95 | # then weighted by the td-error: small difference of real and predict actions for large td-error (advantage); and vice versa. 
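            # This is the same objective as the hand-written PG loss elsewhere in this repo
            # ( -log pi(a|s) * weight ): the cross-entropy between _logits and the taken action is scaled
            # by the TD error, so minimizing it raises the taken action's probability when td > 0 and
            # lowers it when td < 0.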
96 | _exp_v = tl.rein.cross_entropy_reward_loss(logits=_logits, actions=[a], rewards=td[0]) 97 | grad = tape.gradient(_exp_v, self.model.trainable_weights) 98 | self.optimizer.apply_gradients(zip(grad, self.model.trainable_weights)) 99 | return _exp_v 100 | 101 | def choose_action(self, s): 102 | _logits = self.model(np.array([s])) 103 | _probs = tf.nn.softmax(_logits).numpy() 104 | return tl.rein.choice_action_by_probs(_probs.ravel()) # sample according to probability distribution 105 | 106 | def choose_action_greedy(self, s): 107 | _logits = self.model(np.array([s])) # logits: probability distribution of actions 108 | _probs = tf.nn.softmax(_logits).numpy() 109 | return np.argmax(_probs.ravel()) 110 | 111 | def save_ckpt(self): # save trained weights 112 | tl.files.save_npz(self.model.trainable_weights, name='model_actor.npz') 113 | 114 | def load_ckpt(self): # load trained weights 115 | tl.files.load_and_assign_npz(name='model_actor.npz', network=self.model) 116 | 117 | 118 | class Critic(object): 119 | 120 | def __init__(self, n_features, lr=0.01): 121 | 122 | def get_model(inputs_shape): 123 | ni = tl.layers.Input(inputs_shape, name='state') 124 | nn = tl.layers.Dense( 125 | n_units=30, act=tf.nn.relu6, W_init=tf.random_uniform_initializer(0, 0.01), name='hidden' 126 | )(ni) 127 | nn = tl.layers.Dense( 128 | n_units=5, act=tf.nn.relu, W_init=tf.random_uniform_initializer(0, 0.01), name='hidden2' 129 | )(nn) 130 | nn = tl.layers.Dense(n_units=1, act=None, name='value')(nn) 131 | return tl.models.Model(inputs=ni, outputs=nn, name="Critic") 132 | 133 | self.model = get_model([1, n_features]) 134 | self.model.train() 135 | 136 | self.optimizer = tf.optimizers.Adam(lr) 137 | 138 | def learn(self, s, r, s_): 139 | v_ = self.model(np.array([s_])) 140 | with tf.GradientTape() as tape: 141 | v = self.model(np.array([s])) 142 | ## TD_error = r + lambd * V(newS) - V(S) 143 | td_error = r + LAMBDA * v_ - v 144 | loss = tf.square(td_error) 145 | grad = tape.gradient(loss, self.model.trainable_weights) 146 | self.optimizer.apply_gradients(zip(grad, self.model.trainable_weights)) 147 | 148 | return td_error 149 | 150 | def save_ckpt(self): # save trained weights 151 | tl.files.save_npz(self.model.trainable_weights, name='model_critic.npz') 152 | 153 | def load_ckpt(self): # load trained weights 154 | tl.files.load_and_assign_npz(name='model_critic.npz', network=self.model) 155 | 156 | 157 | if __name__ == '__main__': 158 | ''' 159 | choose environment 160 | 1. Openai gym: 161 | env = gym.make() 162 | 2. 
DeepMind Control Suite: 163 | env = dm_control2gym.make() 164 | ''' 165 | env = gym.make('CartPole-v0') 166 | # dm_control2gym.create_render_mode('example mode', show=True, return_pixel=False, height=240, width=320, camera_id=-1, overlays=(), 167 | # depth=False, scene_option=None) 168 | # env = dm_control2gym.make(domain_name="cartpole", task_name="balance") 169 | env.seed(2) # reproducible 170 | # env = env.unwrapped 171 | N_F = env.observation_space.shape[0] 172 | # N_A = env.action_space.shape[0] 173 | N_A = env.action_space.n 174 | 175 | print("observation dimension: %d" % N_F) # 4 176 | print("observation high: %s" % env.observation_space.high) # [ 2.4 , inf , 0.41887902 , inf] 177 | print("observation low : %s" % env.observation_space.low) # [-2.4 , -inf , -0.41887902 , -inf] 178 | print("num of actions: %d" % N_A) # 2 : left or right 179 | 180 | actor = Actor(n_features=N_F, n_actions=N_A, lr=LR_A) 181 | # we need a good teacher, so the teacher should learn faster than the actor 182 | critic = Critic(n_features=N_F, lr=LR_C) 183 | 184 | if args.train: 185 | t0 = time.time() 186 | for i_episode in range(MAX_EPISODE): 187 | # episode_time = time.time() 188 | s = env.reset().astype(np.float32) 189 | t = 0 # number of step in this episode 190 | all_r = [] # rewards of all steps 191 | while True: 192 | 193 | if RENDER: env.render() 194 | 195 | a = actor.choose_action(s) 196 | 197 | s_new, r, done, info = env.step(a) 198 | s_new = s_new.astype(np.float32) 199 | 200 | if done: r = -20 201 | # these may helpful in some tasks 202 | # if abs(s_new[0]) >= env.observation_space.high[0]: 203 | # # cart moves more than 2.4 units from the center 204 | # r = -20 205 | # reward for the distance between cart to the center 206 | # r -= abs(s_new[0]) * .1 207 | 208 | all_r.append(r) 209 | 210 | td_error = critic.learn( 211 | s, r, s_new 212 | ) # learn Value-function : gradient = grad[r + lambda * V(s_new) - V(s)] 213 | try: 214 | actor.learn(s, a, td_error) # learn Policy : true_gradient = grad[logPi(s, a) * td_error] 215 | except KeyboardInterrupt: # if Ctrl+C at running actor.learn(), then save model, or exit if not at actor.learn() 216 | actor.save_ckpt() 217 | critic.save_ckpt() 218 | # logging 219 | 220 | s = s_new 221 | t += 1 222 | 223 | if done or t >= MAX_EP_STEPS: 224 | ep_rs_sum = sum(all_r) 225 | 226 | if 'running_reward' not in globals(): 227 | running_reward = ep_rs_sum 228 | else: 229 | running_reward = running_reward * 0.95 + ep_rs_sum * 0.05 230 | # start rending if running_reward greater than a threshold 231 | # if running_reward > DISPLAY_REWARD_THRESHOLD: RENDER = True 232 | # print("Episode: %d reward: %f running_reward %f took: %.5f" % \ 233 | # (i_episode, ep_rs_sum, running_reward, time.time() - episode_time)) 234 | print('Episode: {}/{} | Episode Reward: {:.4f} | Running Time: {:.4f}'\ 235 | .format(i_episode, MAX_EPISODE, ep_rs_sum, time.time()-t0 )) 236 | 237 | # Early Stopping for quick check 238 | if t >= MAX_EP_STEPS: 239 | print("Early Stopping") 240 | s = env.reset().astype(np.float32) 241 | rall = 0 242 | while True: 243 | env.render() 244 | # a = actor.choose_action(s) 245 | a = actor.choose_action_greedy(s) # Hao Dong: it is important for this task 246 | s_new, r, done, info = env.step(a) 247 | s_new = np.concatenate((s_new[0:N_F], s[N_F:]), axis=0).astype(np.float32) 248 | rall += r 249 | s = s_new 250 | if done: 251 | print("reward", rall) 252 | s = env.reset().astype(np.float32) 253 | rall = 0 254 | break 255 | actor.save_ckpt() 256 | critic.save_ckpt() 257 | 258 
| if args.test: 259 | actor.load_ckpt() 260 | critic.load_ckpt() 261 | t0 = time.time() 262 | 263 | for i_episode in range(MAX_EPISODE): 264 | episode_time = time.time() 265 | s = env.reset().astype(np.float32) 266 | t = 0 # number of step in this episode 267 | all_r = [] # rewards of all steps 268 | while True: 269 | if RENDER: env.render() 270 | a = actor.choose_action(s) 271 | s_new, r, done, info = env.step(a) 272 | s_new = s_new.astype(np.float32) 273 | if done: r = -20 274 | # these may helpful in some tasks 275 | # if abs(s_new[0]) >= env.observation_space.high[0]: 276 | # # cart moves more than 2.4 units from the center 277 | # r = -20 278 | # reward for the distance between cart to the center 279 | # r -= abs(s_new[0]) * .1 280 | 281 | all_r.append(r) 282 | s = s_new 283 | t += 1 284 | 285 | if done or t >= MAX_EP_STEPS: 286 | ep_rs_sum = sum(all_r) 287 | 288 | if 'running_reward' not in globals(): 289 | running_reward = ep_rs_sum 290 | else: 291 | running_reward = running_reward * 0.95 + ep_rs_sum * 0.05 292 | # start rending if running_reward greater than a threshold 293 | # if running_reward > DISPLAY_REWARD_THRESHOLD: RENDER = True 294 | # print("Episode: %d reward: %f running_reward %f took: %.5f" % \ 295 | # (i_episode, ep_rs_sum, running_reward, time.time() - episode_time)) 296 | print('Episode: {}/{} | Episode Reward: {:.4f} | Running Time: {:.4f}'\ 297 | .format(i_episode, MAX_EPISODE, ep_rs_sum, time.time()-t0 )) 298 | 299 | # Early Stopping for quick check 300 | if t >= MAX_EP_STEPS: 301 | print("Early Stopping") 302 | s = env.reset().astype(np.float32) 303 | rall = 0 304 | while True: 305 | env.render() 306 | # a = actor.choose_action(s) 307 | a = actor.choose_action_greedy(s) # Hao Dong: it is important for this task 308 | s_new, r, done, info = env.step(a) 309 | s_new = np.concatenate((s_new[0:N_F], s[N_F:]), axis=0).astype(np.float32) 310 | rall += r 311 | s = s_new 312 | if done: 313 | print("reward", rall) 314 | s = env.reset().astype(np.float32) 315 | rall = 0 316 | break -------------------------------------------------------------------------------- /code/tensrolayer-implemented/ddpg.py: -------------------------------------------------------------------------------- 1 | """ 2 | Deep Deterministic Policy Gradient (DDPG) 3 | ----------------------------------------- 4 | An algorithm concurrently learns a Q-function and a policy. 5 | It uses off-policy data and the Bellman equation to learn the Q-function, 6 | and uses the Q-function to learn the policy. 7 | Reference 8 | --------- 9 | Deterministic Policy Gradient Algorithms, Silver et al. 2014 10 | Continuous Control With Deep Reinforcement Learning, Lillicrap et al. 
2016 11 | MorvanZhou's tutorial page: https://morvanzhou.github.io/tutorials/ 12 | Environment 13 | ----------- 14 | Openai Gym Pendulum-v0, continual action space 15 | Prerequisites 16 | ------------- 17 | tensorflow >=2.0.0a0 18 | tensorflow-probability 0.6.0 19 | tensorlayer >=2.0.0 20 | To run 21 | ------ 22 | python tutorial_DDPG.py --train/test 23 | """ 24 | 25 | import argparse 26 | import os 27 | import time 28 | 29 | import matplotlib.pyplot as plt 30 | import numpy as np 31 | 32 | import gym 33 | import tensorflow as tf 34 | import tensorlayer as tl 35 | 36 | parser = argparse.ArgumentParser(description='Train or test neural net motor controller.') 37 | parser.add_argument('--train', dest='train', action='store_true', default=True) 38 | parser.add_argument('--test', dest='train', action='store_false') 39 | args = parser.parse_args() 40 | 41 | ##################### hyper parameters #################### 42 | 43 | ENV_NAME = 'Pendulum-v0' # environment name 44 | RANDOMSEED = 1 # random seed 45 | 46 | LR_A = 0.001 # learning rate for actor 47 | LR_C = 0.002 # learning rate for critic 48 | GAMMA = 0.9 # reward discount 49 | TAU = 0.01 # soft replacement 50 | MEMORY_CAPACITY = 10000 # size of replay buffer 51 | BATCH_SIZE = 32 # update batchsize 52 | 53 | MAX_EPISODES = 200 # total number of episodes for training 54 | MAX_EP_STEPS = 200 # total number of steps for each episode 55 | TEST_PER_EPISODES = 10 # test the model per episodes 56 | VAR = 3 # control exploration 57 | 58 | ############################### DDPG #################################### 59 | 60 | 61 | class DDPG(object): 62 | """ 63 | DDPG class 64 | """ 65 | 66 | def __init__(self, a_dim, s_dim, a_bound): 67 | self.memory = np.zeros((MEMORY_CAPACITY, s_dim * 2 + a_dim + 1), dtype=np.float32) 68 | self.pointer = 0 69 | self.a_dim, self.s_dim, self.a_bound = a_dim, s_dim, a_bound 70 | 71 | W_init = tf.random_normal_initializer(mean=0, stddev=0.3) 72 | b_init = tf.constant_initializer(0.1) 73 | 74 | def get_actor(input_state_shape, name=''): 75 | """ 76 | Build actor network 77 | :param input_state_shape: state 78 | :param name: name 79 | :return: act 80 | """ 81 | inputs = tl.layers.Input(input_state_shape, name='A_input') 82 | x = tl.layers.Dense(n_units=30, act=tf.nn.relu, W_init=W_init, b_init=b_init, name='A_l1')(inputs) 83 | x = tl.layers.Dense(n_units=a_dim, act=tf.nn.tanh, W_init=W_init, b_init=b_init, name='A_a')(x) 84 | x = tl.layers.Lambda(lambda x: np.array(a_bound) * x)(x) 85 | return tl.models.Model(inputs=inputs, outputs=x, name='Actor' + name) 86 | 87 | def get_critic(input_state_shape, input_action_shape, name=''): 88 | """ 89 | Build critic network 90 | :param input_state_shape: state 91 | :param input_action_shape: act 92 | :param name: name 93 | :return: Q value Q(s,a) 94 | """ 95 | s = tl.layers.Input(input_state_shape, name='C_s_input') 96 | a = tl.layers.Input(input_action_shape, name='C_a_input') 97 | x = tl.layers.Concat(1)([s, a]) 98 | x = tl.layers.Dense(n_units=60, act=tf.nn.relu, W_init=W_init, b_init=b_init, name='C_l1')(x) 99 | x = tl.layers.Dense(n_units=1, W_init=W_init, b_init=b_init, name='C_out')(x) 100 | return tl.models.Model(inputs=[s, a], outputs=x, name='Critic' + name) 101 | 102 | self.actor = get_actor([None, s_dim]) 103 | self.critic = get_critic([None, s_dim], [None, a_dim]) 104 | self.actor.train() 105 | self.critic.train() 106 | 107 | def copy_para(from_model, to_model): 108 | """ 109 | Copy parameters for soft updating 110 | :param from_model: latest model 111 | :param 
to_model: target model 112 | :return: None 113 | """ 114 | for i, j in zip(from_model.trainable_weights, to_model.trainable_weights): 115 | j.assign(i) 116 | 117 | self.actor_target = get_actor([None, s_dim], name='_target') 118 | copy_para(self.actor, self.actor_target) 119 | self.actor_target.eval() 120 | 121 | self.critic_target = get_critic([None, s_dim], [None, a_dim], name='_target') 122 | copy_para(self.critic, self.critic_target) 123 | self.critic_target.eval() 124 | 125 | self.R = tl.layers.Input([None, 1], tf.float32, 'r') 126 | 127 | self.ema = tf.train.ExponentialMovingAverage(decay=1 - TAU) # soft replacement 128 | 129 | self.actor_opt = tf.optimizers.Adam(LR_A) 130 | self.critic_opt = tf.optimizers.Adam(LR_C) 131 | 132 | def ema_update(self): 133 | """ 134 | Soft updating by exponential smoothing 135 | :return: None 136 | """ 137 | paras = self.actor.trainable_weights + self.critic.trainable_weights 138 | self.ema.apply(paras) 139 | for i, j in zip(self.actor_target.trainable_weights + self.critic_target.trainable_weights, paras): 140 | i.assign(self.ema.average(j)) 141 | 142 | def choose_action(self, s): 143 | """ 144 | Choose action 145 | :param s: state 146 | :return: act 147 | """ 148 | return self.actor(np.array([s], dtype=np.float32))[0] 149 | 150 | def learn(self): 151 | """ 152 | Update parameters 153 | :return: None 154 | """ 155 | indices = np.random.choice(MEMORY_CAPACITY, size=BATCH_SIZE) 156 | bt = self.memory[indices, :] 157 | bs = bt[:, :self.s_dim] 158 | ba = bt[:, self.s_dim:self.s_dim + self.a_dim] 159 | br = bt[:, -self.s_dim - 1:-self.s_dim] 160 | bs_ = bt[:, -self.s_dim:] 161 | 162 | with tf.GradientTape() as tape: 163 | a_ = self.actor_target(bs_) 164 | q_ = self.critic_target([bs_, a_]) 165 | y = br + GAMMA * q_ 166 | q = self.critic([bs, ba]) 167 | td_error = tf.losses.mean_squared_error(y, q) 168 | c_grads = tape.gradient(td_error, self.critic.trainable_weights) 169 | self.critic_opt.apply_gradients(zip(c_grads, self.critic.trainable_weights)) 170 | 171 | with tf.GradientTape() as tape: 172 | a = self.actor(bs) 173 | q = self.critic([bs, a]) 174 | a_loss = - tf.reduce_mean(q) # maximize the q 175 | a_grads = tape.gradient(a_loss, self.actor.trainable_weights) 176 | self.actor_opt.apply_gradients(zip(a_grads, self.actor.trainable_weights)) 177 | 178 | self.ema_update() 179 | 180 | def store_transition(self, s, a, r, s_): 181 | """ 182 | Store data in data buffer 183 | :param s: state 184 | :param a: act 185 | :param r: reward 186 | :param s_: next state 187 | :return: None 188 | """ 189 | s = s.astype(np.float32) 190 | s_ = s_.astype(np.float32) 191 | transition = np.hstack((s, a, [r], s_)) 192 | index = self.pointer % MEMORY_CAPACITY # replace the old memory with new memory 193 | self.memory[index, :] = transition 194 | self.pointer += 1 195 | 196 | def save_ckpt(self): 197 | """ 198 | save trained weights 199 | :return: None 200 | """ 201 | if not os.path.exists('model'): 202 | os.makedirs('model') 203 | 204 | tl.files.save_weights_to_hdf5('model/ddpg_actor.hdf5', self.actor) 205 | tl.files.save_weights_to_hdf5('model/ddpg_actor_target.hdf5', self.actor_target) 206 | tl.files.save_weights_to_hdf5('model/ddpg_critic.hdf5', self.critic) 207 | tl.files.save_weights_to_hdf5('model/ddpg_critic_target.hdf5', self.critic_target) 208 | 209 | def load_ckpt(self): 210 | """ 211 | load trained weights 212 | :return: None 213 | """ 214 | tl.files.load_hdf5_to_weights_in_order('model/ddpg_actor.hdf5', self.actor) 215 | 
tl.files.load_hdf5_to_weights_in_order('model/ddpg_actor_target.hdf5', self.actor_target) 216 | tl.files.load_hdf5_to_weights_in_order('model/ddpg_critic.hdf5', self.critic) 217 | tl.files.load_hdf5_to_weights_in_order('model/ddpg_critic_target.hdf5', self.critic_target) 218 | 219 | 220 | if __name__ == '__main__': 221 | 222 | env = gym.make(ENV_NAME) 223 | env = env.unwrapped 224 | 225 | # reproducible 226 | env.seed(RANDOMSEED) 227 | np.random.seed(RANDOMSEED) 228 | tf.random.set_seed(RANDOMSEED) 229 | 230 | s_dim = env.observation_space.shape[0] 231 | a_dim = env.action_space.shape[0] 232 | a_bound = env.action_space.high 233 | 234 | ddpg = DDPG(a_dim, s_dim, a_bound) 235 | 236 | if args.train: # train 237 | 238 | reward_buffer = [] 239 | t0 = time.time() 240 | for i in range(MAX_EPISODES): 241 | t1 = time.time() 242 | s = env.reset() 243 | ep_reward = 0 244 | for j in range(MAX_EP_STEPS): 245 | # Add exploration noise 246 | a = ddpg.choose_action(s) 247 | a = np.clip(np.random.normal(a, VAR), -2, 2) # add randomness to action selection for exploration 248 | s_, r, done, info = env.step(a) 249 | 250 | ddpg.store_transition(s, a, r / 10, s_) 251 | 252 | if ddpg.pointer > MEMORY_CAPACITY: 253 | ddpg.learn() 254 | 255 | s = s_ 256 | ep_reward += r 257 | if j == MAX_EP_STEPS - 1: 258 | print( 259 | '\rEpisode: {}/{} | Episode Reward: {:.4f} | Running Time: {:.4f}'.format( 260 | i, MAX_EPISODES, ep_reward, 261 | time.time() - t1 262 | ), end='' 263 | ) 264 | plt.show() 265 | # test 266 | if i and not i % TEST_PER_EPISODES: 267 | t1 = time.time() 268 | s = env.reset() 269 | ep_reward = 0 270 | for j in range(MAX_EP_STEPS): 271 | 272 | a = ddpg.choose_action(s) # without exploration noise 273 | s_, r, done, info = env.step(a) 274 | 275 | s = s_ 276 | ep_reward += r 277 | if j == MAX_EP_STEPS - 1: 278 | print( 279 | '\rEpisode: {}/{} | Episode Reward: {:.4f} | Running Time: {:.4f}'.format( 280 | i, MAX_EPISODES, ep_reward, 281 | time.time() - t1 282 | ) 283 | ) 284 | 285 | reward_buffer.append(ep_reward) 286 | 287 | if reward_buffer: 288 | plt.ion() 289 | plt.cla() 290 | plt.title('DDPG') 291 | plt.plot(np.array(range(len(reward_buffer))) * TEST_PER_EPISODES, reward_buffer) # plot the episode vt 292 | plt.xlabel('episode steps') 293 | plt.ylabel('normalized state-action value') 294 | plt.ylim(-2000, 0) 295 | plt.show() 296 | plt.pause(0.1) 297 | plt.ioff() 298 | plt.show() 299 | print('\nRunning time: ', time.time() - t0) 300 | ddpg.save_ckpt() 301 | 302 | # test 303 | ddpg.load_ckpt() 304 | while True: 305 | s = env.reset() 306 | for i in range(MAX_EP_STEPS): 307 | env.render() 308 | s, r, done, info = env.step(ddpg.choose_action(s)) 309 | if done: 310 | break -------------------------------------------------------------------------------- /code/tensrolayer-implemented/dqn.py: -------------------------------------------------------------------------------- 1 | """ 2 | Deep Q-Network Q(a, s) 3 | ----------------------- 4 | TD Learning, Off-Policy, e-Greedy Exploration (GLIE). 5 | Q(S, A) <- Q(S, A) + alpha * (R + lambda * Q(newS, newA) - Q(S, A)) 6 | delta_w = R + lambda * Q(newS, newA) 7 | See David Silver RL Tutorial Lecture 5 - Q-Learning for more details. 
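(Illustrative note, not part of the original text: with a Q-network the same
update becomes a regression of Q(S, A) towards the target
    R + lambda * max_a Q(newS, a)
i.e. minimizing |target - Q(S, A)|^2, which is what the training loop below
does when it builds `targetQ` and the mean-squared-error loss.)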
8 | Reference 9 | ---------- 10 | original paper: https://storage.googleapis.com/deepmind-media/dqn/DQNNaturePaper.pdf 11 | EN: https://medium.com/emergent-future/simple-reinforcement-learning-with-tensorflow-part-0-q-learning-with-tables-and-neural-networks-d195264329d0#.5m3361vlw 12 | CN: https://zhuanlan.zhihu.com/p/25710327 13 | Note: Policy Network has been proved to be better than Q-Learning, see tutorial_atari_pong.py 14 | Environment 15 | ----------- 16 | # The FrozenLake v0 environment 17 | https://gym.openai.com/envs/FrozenLake-v0 18 | The agent controls the movement of a character in a grid world. Some tiles of 19 | the grid are walkable, and others lead to the agent falling into the water. 20 | Additionally, the movement direction of the agent is uncertain and only partially 21 | depends on the chosen direction. The agent is rewarded for finding a walkable 22 | path to a goal tile. 23 | SFFF (S: starting point, safe) 24 | FHFH (F: frozen surface, safe) 25 | FFFH (H: hole, fall to your doom) 26 | HFFG (G: goal, where the frisbee is located) 27 | The episode ends when you reach the goal or fall in a hole. You receive a reward 28 | of 1 if you reach the goal, and zero otherwise. 29 | Prerequisites 30 | -------------- 31 | tensorflow>=2.0.0a0 32 | tensorlayer>=2.0.0 33 | To run 34 | ------- 35 | python tutorial_DQN.py --train/test 36 | """ 37 | import argparse 38 | import time 39 | 40 | import numpy as np 41 | 42 | import gym 43 | import tensorflow as tf 44 | import tensorlayer as tl 45 | 46 | # add arguments in command --train/test 47 | parser = argparse.ArgumentParser(description='Train or test neural net motor controller.') 48 | parser.add_argument('--train', dest='train', action='store_true', default=False) 49 | parser.add_argument('--test', dest='test', action='store_true', default=True) 50 | args = parser.parse_args() 51 | 52 | tl.logging.set_verbosity(tl.logging.DEBUG) 53 | 54 | ##################### hyper parameters #################### 55 | lambd = .99 # decay factor 56 | e = 0.1 # e-Greedy Exploration, the larger the more random 57 | num_episodes = 10000 58 | render = False # display the game environment 59 | running_reward = None 60 | 61 | ##################### DQN ########################## 62 | 63 | 64 | def to_one_hot(i, n_classes=None): 65 | a = np.zeros(n_classes, 'uint8') 66 | a[i] = 1 67 | return a 68 | 69 | 70 | ## Define Q-network q(a,s) that ouput the rewards of 4 actions by given state, i.e. Action-Value Function. 71 | # encoding for state: 4x4 grid can be represented by one-hot vector with 16 integers. 
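# Illustrative example (not in the original file): what the encoding above looks
# like for state 3 of the 4x4 FrozenLake grid,
#   to_one_hot(3, 16) -> array([0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], dtype=uint8)
# so the Q-network below maps a length-16 one-hot vector to 4 action values.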
72 | def get_model(inputs_shape): 73 | ni = tl.layers.Input(inputs_shape, name='observation') 74 | nn = tl.layers.Dense(4, act=None, W_init=tf.random_uniform_initializer(0, 0.01), b_init=None, name='q_a_s')(ni) 75 | return tl.models.Model(inputs=ni, outputs=nn, name="Q-Network") 76 | 77 | 78 | def save_ckpt(model): # save trained weights 79 | tl.files.save_npz(model.trainable_weights, name='dqn_model.npz') 80 | 81 | 82 | def load_ckpt(model): # load trained weights 83 | tl.files.load_and_assign_npz(name='dqn_model.npz', network=model) 84 | 85 | 86 | if __name__ == '__main__': 87 | 88 | qnetwork = get_model([None, 16]) 89 | qnetwork.train() 90 | train_weights = qnetwork.trainable_weights 91 | 92 | optimizer = tf.optimizers.SGD(learning_rate=0.1) 93 | env = gym.make('FrozenLake-v0') 94 | 95 | if args.train: 96 | t0 = time.time() 97 | for i in range(num_episodes): 98 | ## Reset environment and get first new observation 99 | # episode_time = time.time() 100 | s = env.reset() # observation is state, integer 0 ~ 15 101 | rAll = 0 102 | for j in range(99): # step index, maximum step is 99 103 | if render: env.render() 104 | ## Choose an action by greedily (with e chance of random action) from the Q-network 105 | allQ = qnetwork(np.asarray([to_one_hot(s, 16)], dtype=np.float32)).numpy() 106 | a = np.argmax(allQ, 1) 107 | 108 | ## e-Greedy Exploration !!! sample random action 109 | if np.random.rand(1) < e: 110 | a[0] = env.action_space.sample() 111 | ## Get new state and reward from environment 112 | s1, r, d, _ = env.step(a[0]) 113 | ## Obtain the Q' values by feeding the new state through our network 114 | Q1 = qnetwork(np.asarray([to_one_hot(s1, 16)], dtype=np.float32)).numpy() 115 | 116 | ## Obtain maxQ' and set our target value for chosen action. 117 | maxQ1 = np.max(Q1) # in Q-Learning, policy is greedy, so we use "max" to select the next action. 118 | targetQ = allQ 119 | targetQ[0, a[0]] = r + lambd * maxQ1 120 | ## Train network using target and predicted Q values 121 | # it is not real target Q value, it is just an estimation, 122 | # but check the Q-Learning update formula: 123 | # Q'(s,a) <- Q(s,a) + alpha(r + lambd * maxQ(s',a') - Q(s, a)) 124 | # minimizing |r + lambd * maxQ(s',a') - Q(s, a)|^2 equals to force Q'(s,a) ≈ Q(s,a) 125 | with tf.GradientTape() as tape: 126 | _qvalues = qnetwork(np.asarray([to_one_hot(s, 16)], dtype=np.float32)) 127 | _loss = tl.cost.mean_squared_error(targetQ, _qvalues, is_mean=False) 128 | grad = tape.gradient(_loss, train_weights) 129 | optimizer.apply_gradients(zip(grad, train_weights)) 130 | 131 | rAll += r 132 | s = s1 133 | ## Reduce chance of random action if an episode is done. 134 | if d ==True: 135 | e = 1. 
/ ((i / 50) + 10) # reduce e, GLIE: Greey in the limit with infinite Exploration 136 | break 137 | 138 | ## Note that, the rewards here with random action 139 | running_reward = rAll if running_reward is None else running_reward * 0.99 + rAll * 0.01 140 | # print("Episode [%d/%d] sum reward: %f running reward: %f took: %.5fs " % \ 141 | # (i, num_episodes, rAll, running_reward, time.time() - episode_time)) 142 | print('Episode: {}/{} | Episode Reward: {:.4f} | Running Average Reward: {:.4f} | Running Time: {:.4f}'\ 143 | .format(i, num_episodes, rAll, running_reward, time.time()-t0 )) 144 | save_ckpt(qnetwork) # save model 145 | 146 | if args.test: 147 | t0 = time.time() 148 | load_ckpt(qnetwork) # load model 149 | for i in range(num_episodes): 150 | ## Reset environment and get first new observation 151 | episode_time = time.time() 152 | s = env.reset() # observation is state, integer 0 ~ 15 153 | rAll = 0 154 | for j in range(99): # step index, maximum step is 99 155 | if render: env.render() 156 | ## Choose an action by greedily (with e chance of random action) from the Q-network 157 | allQ = qnetwork(np.asarray([to_one_hot(s, 16)], dtype=np.float32)).numpy() 158 | a = np.argmax(allQ, 1) # no epsilon, only greedy for testing 159 | 160 | ## Get new state and reward from environment 161 | s1, r, d, _ = env.step(a[0]) 162 | rAll += r 163 | s = s1 164 | ## Reduce chance of random action if an episode is done. 165 | if d ==True: 166 | e = 1. / ((i / 50) + 10) # reduce e, GLIE: Greey in the limit with infinite Exploration 167 | break 168 | 169 | ## Note that, the rewards here with random action 170 | running_reward = rAll if running_reward is None else running_reward * 0.99 + rAll * 0.01 171 | # print("Episode [%d/%d] sum reward: %f running reward: %f took: %.5fs " % \ 172 | # (i, num_episodes, rAll, running_reward, time.time() - episode_time)) 173 | print('Episode: {}/{} | Episode Reward: {:.4f} | Running Average Reward: {:.4f} | Running Time: {:.4f}'\ 174 | .format(i, num_episodes, rAll, running_reward, time.time()-t0 )) -------------------------------------------------------------------------------- /code/tensrolayer-implemented/dqn_variants.py: -------------------------------------------------------------------------------- 1 | """ 2 | DQN and its variants 3 | ------------------------ 4 | We implement Double DQN, Dueling DQN and Noisy DQN here. 5 | The max operator in standard DQN uses the same values both to select and to 6 | evaluate an action by 7 | Q(s_t, a_t) = R_{t+1} + \gamma * max_{a}Q_{tar}(s_{t+1}, a). 8 | Double DQN propose to use following evaluation to address overestimation problem 9 | of max operator: 10 | Q(s_t, a_t) = R_{t+1} + \gamma * Q_{tar}(s_{t+1}, max_{a}Q(s_{t+1}, a)). 11 | Dueling DQN uses dueling architecture where the value of state and the advantage 12 | of each action is estimated separately. 13 | Noisy DQN propose to explore by adding parameter noises. 14 | Reference: 15 | ------------------------ 16 | 1. Double DQN 17 | Van Hasselt H, Guez A, Silver D. Deep reinforcement learning with double 18 | q-learning[C]//Thirtieth AAAI Conference on Artificial Intelligence. 2016. 19 | 2. Dueling DQN 20 | Wang Z, Schaul T, Hessel M, et al. Dueling network architectures for deep 21 | reinforcement learning[J]. arXiv preprint arXiv:1511.06581, 2015. 22 | 3. Noisy DQN 23 | Plappert M, Houthooft R, Dhariwal P, et al. Parameter space noise for 24 | exploration[J]. arXiv preprint arXiv:1706.01905, 2017. 
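Illustrative note (not part of the original text): in the training loop below,
the Double DQN target is computed in two steps,
    a* = argmax_a Q(s_{t+1}, a)                   selected by the online network
    y  = R_{t+1} + \gamma * Q_{tar}(s_{t+1}, a*)  evaluated by the target network
which is the "double q estimation" block in __main__; the dueling architecture
appears in MLP.forward / CNN.forward as svalue + qvalue - mean(qvalue).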
25 | Environment: 26 | ------------------------ 27 | Cartpole and Pong in OpenAI Gym 28 | Requirements: 29 | ------------------------ 30 | tensorflow>=2.0.0a0 31 | tensorlayer>=2.0.0 32 | To run: 33 | ------------------------ 34 | python tutorial_DQN_variantes.py --mode=train 35 | python tutorial_DQN_variantes.py --mode=test --save_path=dqn_variants/8000.npz 36 | """ 37 | import argparse 38 | import os 39 | import random 40 | import time 41 | 42 | import numpy as np 43 | 44 | import tensorflow as tf 45 | import tensorlayer as tl 46 | from tutorial_wrappers import build_env 47 | 48 | parser = argparse.ArgumentParser() 49 | parser.add_argument('--mode', help='train or test', default='train') 50 | parser.add_argument( 51 | '--save_path', default='dqn_variants', help='folder to save if mode == train else model path,' 52 | 'qnet will be saved once target net update' 53 | ) 54 | parser.add_argument('--seed', help='random seed', type=int, default=0) 55 | parser.add_argument('--env_id', default='CartPole-v0', help='CartPole-v0 or PongNoFrameskip-v4') 56 | args = parser.parse_args() 57 | 58 | if args.mode == 'train': 59 | os.makedirs(args.save_path, exist_ok=True) 60 | random.seed(args.seed) 61 | np.random.seed(args.seed) 62 | tf.random.set_seed(args.seed) # reproducible 63 | env_id = args.env_id 64 | env = build_env(env_id, seed=args.seed) 65 | 66 | # #################### hyper parameters #################### 67 | if env_id == 'CartPole-v0': 68 | qnet_type = 'MLP' 69 | number_timesteps = 10000 # total number of time steps to train on 70 | explore_timesteps = 100 71 | # epsilon-greedy schedule, final exploit prob is 0.99 72 | epsilon = lambda i_iter: 1 - 0.99 * min(1, i_iter / explore_timesteps) 73 | lr = 5e-3 # learning rate 74 | buffer_size = 1000 # replay buffer size 75 | target_q_update_freq = 50 # how frequency target q net update 76 | ob_scale = 1.0 # scale observations 77 | else: 78 | # reward will increase obviously after 1e5 time steps 79 | qnet_type = 'CNN' 80 | number_timesteps = int(1e6) # total number of time steps to train on 81 | explore_timesteps = 1e5 82 | # epsilon-greedy schedule, final exploit prob is 0.99 83 | epsilon = lambda i_iter: 1 - 0.99 * min(1, i_iter / explore_timesteps) 84 | lr = 1e-4 # learning rate 85 | buffer_size = 10000 # replay buffer size 86 | target_q_update_freq = 200 # how frequency target q net update 87 | ob_scale = 1.0 / 255 # scale observations 88 | 89 | in_dim = env.observation_space.shape 90 | out_dim = env.action_space.n 91 | reward_gamma = 0.99 # reward discount 92 | batch_size = 32 # batch size for sampling from replay buffer 93 | warm_start = buffer_size / 10 # sample times befor learning 94 | noise_update_freq = 50 # how frequency param noise net update 95 | 96 | 97 | # ############################## DQN #################################### 98 | class MLP(tl.models.Model): 99 | 100 | def __init__(self, name): 101 | super(MLP, self).__init__(name=name) 102 | self.h1 = tl.layers.Dense(64, tf.nn.tanh, in_channels=in_dim[0]) 103 | self.qvalue = tl.layers.Dense(out_dim, in_channels=64, name='q', W_init=tf.initializers.GlorotUniform()) 104 | self.svalue = tl.layers.Dense(1, in_channels=64, name='s', W_init=tf.initializers.GlorotUniform()) 105 | self.noise_scale = 0 106 | 107 | def forward(self, ni): 108 | feature = self.h1(ni) 109 | 110 | # apply noise to all linear layer 111 | if self.noise_scale != 0: 112 | noises = [] 113 | for layer in [self.qvalue, self.svalue]: 114 | for var in layer.trainable_weights: 115 | noise = tf.random.normal(tf.shape(var), 
0, self.noise_scale) 116 | noises.append(noise) 117 | var.assign_add(noise) 118 | 119 | qvalue = self.qvalue(feature) 120 | svalue = self.svalue(feature) 121 | 122 | if self.noise_scale != 0: 123 | idx = 0 124 | for layer in [self.qvalue, self.svalue]: 125 | for var in layer.trainable_weights: 126 | var.assign_sub(noises[idx]) 127 | idx += 1 128 | 129 | # dueling network 130 | out = svalue + qvalue - tf.reduce_mean(qvalue, 1, keepdims=True) 131 | return out 132 | 133 | 134 | class CNN(tl.models.Model): 135 | 136 | def __init__(self, name): 137 | super(CNN, self).__init__(name=name) 138 | h, w, in_channels = in_dim 139 | dense_in_channels = 64 * ((h - 28) // 8) * ((w - 28) // 8) 140 | self.conv1 = tl.layers.Conv2d( 141 | 32, (8, 8), (4, 4), tf.nn.relu, 'VALID', in_channels=in_channels, name='conv2d_1', 142 | W_init=tf.initializers.GlorotUniform() 143 | ) 144 | self.conv2 = tl.layers.Conv2d( 145 | 64, (4, 4), (2, 2), tf.nn.relu, 'VALID', in_channels=32, name='conv2d_2', 146 | W_init=tf.initializers.GlorotUniform() 147 | ) 148 | self.conv3 = tl.layers.Conv2d( 149 | 64, (3, 3), (1, 1), tf.nn.relu, 'VALID', in_channels=64, name='conv2d_3', 150 | W_init=tf.initializers.GlorotUniform() 151 | ) 152 | self.flatten = tl.layers.Flatten(name='flatten') 153 | self.preq = tl.layers.Dense( 154 | 256, tf.nn.relu, in_channels=dense_in_channels, name='pre_q', W_init=tf.initializers.GlorotUniform() 155 | ) 156 | self.qvalue = tl.layers.Dense(out_dim, in_channels=256, name='q', W_init=tf.initializers.GlorotUniform()) 157 | self.pres = tl.layers.Dense( 158 | 256, tf.nn.relu, in_channels=dense_in_channels, name='pre_s', W_init=tf.initializers.GlorotUniform() 159 | ) 160 | self.svalue = tl.layers.Dense(1, in_channels=256, name='state', W_init=tf.initializers.GlorotUniform()) 161 | self.noise_scale = 0 162 | 163 | def forward(self, ni): 164 | feature = self.flatten(self.conv3(self.conv2(self.conv1(ni)))) 165 | 166 | # apply noise to all linear layer 167 | if self.noise_scale != 0: 168 | noises = [] 169 | for layer in [self.preq, self.qvalue, self.pres, self.svalue]: 170 | for var in layer.trainable_weights: 171 | noise = tf.random.normal(tf.shape(var), 0, self.noise_scale) 172 | noises.append(noise) 173 | var.assign_add(noise) 174 | 175 | qvalue = self.qvalue(self.preq(feature)) 176 | svalue = self.svalue(self.pres(feature)) 177 | 178 | if self.noise_scale != 0: 179 | idx = 0 180 | for layer in [self.preq, self.qvalue, self.pres, self.svalue]: 181 | for var in layer.trainable_weights: 182 | var.assign_sub(noises[idx]) 183 | idx += 1 184 | 185 | # dueling network 186 | return svalue + qvalue - tf.reduce_mean(qvalue, 1, keepdims=True) 187 | 188 | 189 | class ReplayBuffer(object): 190 | 191 | def __init__(self, size): 192 | self._storage = [] 193 | self._maxsize = size 194 | self._next_idx = 0 195 | 196 | def __len__(self): 197 | return len(self._storage) 198 | 199 | def add(self, *args): 200 | if self._next_idx >= len(self._storage): 201 | self._storage.append(args) 202 | else: 203 | self._storage[self._next_idx] = args 204 | self._next_idx = (self._next_idx + 1) % self._maxsize 205 | 206 | def _encode_sample(self, idxes): 207 | b_o, b_a, b_r, b_o_, b_d = [], [], [], [], [] 208 | for i in idxes: 209 | o, a, r, o_, d = self._storage[i] 210 | b_o.append(o) 211 | b_a.append(a) 212 | b_r.append(r) 213 | b_o_.append(o_) 214 | b_d.append(d) 215 | return ( 216 | np.stack(b_o).astype('float32') * ob_scale, 217 | np.stack(b_a).astype('int32'), 218 | np.stack(b_r).astype('float32'), 219 | np.stack(b_o_).astype('float32') * 
ob_scale, 220 | np.stack(b_d).astype('float32'), 221 | ) 222 | 223 | def sample(self, batch_size): 224 | indexes = range(len(self._storage)) 225 | idxes = [random.choice(indexes) for _ in range(batch_size)] 226 | return self._encode_sample(idxes) 227 | 228 | 229 | def huber_loss(x): 230 | """Loss function for value""" 231 | return tf.where(tf.abs(x) < 1, tf.square(x) * 0.5, tf.abs(x) - 0.5) 232 | 233 | 234 | def sync(net, net_tar): 235 | """Copy q network to target q network""" 236 | for var, var_tar in zip(net.trainable_weights, net_tar.trainable_weights): 237 | var_tar.assign(var) 238 | 239 | 240 | def log_softmax(x, dim): 241 | temp = x - np.max(x, dim, keepdims=True) 242 | return temp - np.log(np.exp(temp).sum(dim, keepdims=True)) 243 | 244 | 245 | def softmax(x, dim): 246 | temp = np.exp(x - np.max(x, dim, keepdims=True)) 247 | return temp / temp.sum(dim, keepdims=True) 248 | 249 | 250 | if __name__ == '__main__': 251 | if args.mode == 'train': 252 | qnet = MLP('q') if qnet_type == 'MLP' else CNN('q') 253 | qnet.train() 254 | trainabel_weights = qnet.trainable_weights 255 | targetqnet = MLP('targetq') if qnet_type == 'MLP' else CNN('targetq') 256 | targetqnet.infer() 257 | sync(qnet, targetqnet) 258 | optimizer = tf.optimizers.Adam(learning_rate=lr) 259 | buffer = ReplayBuffer(buffer_size) 260 | 261 | o = env.reset() 262 | nepisode = 0 263 | t = time.time() 264 | noise_scale = 1e-2 265 | for i in range(1, number_timesteps + 1): 266 | eps = epsilon(i) 267 | 268 | # select action 269 | if random.random() < eps: 270 | a = int(random.random() * out_dim) 271 | else: 272 | # noise schedule is based on KL divergence between perturbed and 273 | # non-perturbed policy, see https://arxiv.org/pdf/1706.01905.pdf 274 | obv = np.expand_dims(o, 0).astype('float32') * ob_scale 275 | if i < explore_timesteps: 276 | qnet.noise_scale = noise_scale 277 | q_ptb = qnet(obv).numpy() 278 | qnet.noise_scale = 0 279 | if i % noise_update_freq == 0: 280 | q = qnet(obv).numpy() 281 | kl_ptb = (log_softmax(q, 1) - log_softmax(q_ptb, 1)) 282 | kl_ptb = np.sum(kl_ptb * softmax(q, 1), 1).mean() 283 | kl_explore = -np.log(1 - eps + eps / out_dim) 284 | if kl_ptb < kl_explore: 285 | noise_scale *= 1.01 286 | else: 287 | noise_scale /= 1.01 288 | a = q_ptb.argmax(1)[0] 289 | else: 290 | a = qnet(obv).numpy().argmax(1)[0] 291 | 292 | # execute action and feed to replay buffer 293 | # note that `_` tail in var name means next 294 | o_, r, done, info = env.step(a) 295 | buffer.add(o, a, r, o_, done) 296 | 297 | if i >= warm_start: 298 | # sync q net and target q net 299 | if i % target_q_update_freq == 0: 300 | sync(qnet, targetqnet) 301 | path = os.path.join(args.save_path, '{}.npz'.format(i)) 302 | tl.files.save_npz(qnet.trainable_weights, name=path) 303 | 304 | # sample from replay buffer 305 | b_o, b_a, b_r, b_o_, b_d = buffer.sample(batch_size) 306 | 307 | # double q estimation 308 | b_a_ = tf.one_hot(tf.argmax(qnet(b_o_), 1), out_dim) 309 | b_q_ = (1 - b_d) * tf.reduce_sum(targetqnet(b_o_) * b_a_, 1) 310 | 311 | # calculate loss 312 | with tf.GradientTape() as q_tape: 313 | b_q = tf.reduce_sum(qnet(b_o) * tf.one_hot(b_a, out_dim), 1) 314 | loss = tf.reduce_mean(huber_loss(b_q - (b_r + reward_gamma * b_q_))) 315 | 316 | # backward gradients 317 | q_grad = q_tape.gradient(loss, trainabel_weights) 318 | optimizer.apply_gradients(zip(q_grad, trainabel_weights)) 319 | 320 | if done: 321 | o = env.reset() 322 | else: 323 | o = o_ 324 | 325 | # episode in info is real (unwrapped) message 326 | if info.get('episode'): 327 | 
nepisode += 1 328 | reward, length = info['episode']['r'], info['episode']['l'] 329 | fps = int(length / (time.time() - t)) 330 | print( 331 | 'Time steps so far: {}, episode so far: {}, ' 332 | 'episode reward: {:.4f}, episode length: {}, FPS: {}'.format(i, nepisode, reward, length, fps) 333 | ) 334 | t = time.time() 335 | else: 336 | qnet = MLP('q') if qnet_type == 'MLP' else CNN('q') 337 | tl.files.load_and_assign_npz(name=args.save_path, network=qnet) 338 | qnet.eval() 339 | 340 | nepisode = 0 341 | o = env.reset() 342 | for i in range(1, number_timesteps + 1): 343 | obv = np.expand_dims(o, 0).astype('float32') * ob_scale 344 | a = qnet(obv).numpy().argmax(1)[0] 345 | 346 | # execute action 347 | # note that `_` tail in var name means next 348 | o_, r, done, info = env.step(a) 349 | 350 | if done: 351 | o = env.reset() 352 | else: 353 | o = o_ 354 | 355 | # episode in info is real (unwrapped) message 356 | if info.get('episode'): 357 | nepisode += 1 358 | reward, length = info['episode']['r'], info['episode']['l'] 359 | print( 360 | 'Time steps so far: {}, episode so far: {}, ' 361 | 'episode reward: {:.4f}, episode length: {}'.format(i, nepisode, reward, length) 362 | ) -------------------------------------------------------------------------------- /code/tensrolayer-implemented/pg.py: -------------------------------------------------------------------------------- 1 | """ 2 | Vanilla Policy Gradient(VPG or REINFORCE) 3 | ----------------------------------------- 4 | The policy gradient algorithm works by updating policy parameters via stochastic gradient ascent on policy performance. 5 | It's an on-policy algorithm can be used for environments with either discrete or continuous action spaces. 6 | Here is an example on discrete action space game CartPole-v0. 7 | To apply it on continuous action space, you need to change the last softmax layer and the choose_action function. 8 | Reference 9 | --------- 10 | Cookbook: Barto A G, Sutton R S. Reinforcement Learning: An Introduction[J]. 1998. 
11 | MorvanZhou's tutorial page: https://morvanzhou.github.io/tutorials/ 12 | Environment 13 | ----------- 14 | Openai Gym CartPole-v0, discrete action space 15 | Prerequisites 16 | -------------- 17 | tensorflow >=2.0.0a0 18 | tensorflow-probability 0.6.0 19 | tensorlayer >=2.0.0 20 | To run 21 | ------ 22 | python tutorial_PG.py --train/test 23 | """ 24 | 25 | import argparse 26 | import os 27 | import time 28 | 29 | import matplotlib.pyplot as plt 30 | import numpy as np 31 | 32 | import gym 33 | import tensorflow as tf 34 | import tensorlayer as tl 35 | 36 | parser = argparse.ArgumentParser(description='Train or test neural net motor controller.') 37 | parser.add_argument('--train', dest='train', action='store_true', default=True) 38 | parser.add_argument('--test', dest='train', action='store_false') 39 | args = parser.parse_args() 40 | 41 | ##################### hyper parameters #################### 42 | 43 | ENV_NAME = 'CartPole-v0' # environment name 44 | RANDOMSEED = 1 # random seed 45 | 46 | DISPLAY_REWARD_THRESHOLD = 400 # renders environment if total episode reward is greater then this threshold 47 | RENDER = False # rendering wastes time 48 | num_episodes = 300 49 | 50 | ############################### PG #################################### 51 | 52 | 53 | class PolicyGradient: 54 | """ 55 | PG class 56 | """ 57 | 58 | def __init__(self, n_features, n_actions, learning_rate=0.01, reward_decay=0.95): 59 | self.n_actions = n_actions 60 | self.n_features = n_features 61 | self.lr = learning_rate 62 | self.gamma = reward_decay 63 | 64 | self.ep_obs, self.ep_as, self.ep_rs = [], [], [] 65 | 66 | def get_model(inputs_shape): 67 | """ 68 | Build a neural network model. 69 | :param inputs_shape: state_shape 70 | :return: act 71 | """ 72 | with tf.name_scope('inputs'): 73 | self.tf_obs = tl.layers.Input(inputs_shape, tf.float32, name="observations") 74 | self.tf_acts = tl.layers.Input([ 75 | None, 76 | ], tf.int32, name="actions_num") 77 | self.tf_vt = tl.layers.Input([ 78 | None, 79 | ], tf.float32, name="actions_value") 80 | # fc1 81 | layer = tl.layers.Dense( 82 | n_units=30, act=tf.nn.tanh, W_init=tf.random_normal_initializer(mean=0, stddev=0.3), 83 | b_init=tf.constant_initializer(0.1), name='fc1' 84 | )(self.tf_obs) 85 | # fc2 86 | all_act = tl.layers.Dense( 87 | n_units=self.n_actions, act=None, W_init=tf.random_normal_initializer(mean=0, stddev=0.3), 88 | b_init=tf.constant_initializer(0.1), name='all_act' 89 | )(layer) 90 | return tl.models.Model(inputs=self.tf_obs, outputs=all_act, name='PG model') 91 | 92 | self.model = get_model([None, n_features]) 93 | self.model.train() 94 | self.optimizer = tf.optimizers.Adam(self.lr) 95 | 96 | def choose_action(self, s): 97 | """ 98 | choose action with probabilities. 
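        (Illustrative: with softmax output [0.7, 0.3] this samples action 0 with
        probability 0.7; compare choose_action_greedy below, which always takes the argmax.)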
99 | :param s: state 100 | :return: act 101 | """ 102 | _logits = self.model(np.array([s], np.float32)) 103 | _probs = tf.nn.softmax(_logits).numpy() 104 | return tl.rein.choice_action_by_probs(_probs.ravel()) 105 | 106 | def choose_action_greedy(self, s): 107 | """ 108 | choose action with greedy policy 109 | :param s: state 110 | :return: act 111 | """ 112 | _probs = tf.nn.softmax(self.model(np.array([s], np.float32))).numpy() 113 | return np.argmax(_probs.ravel()) 114 | 115 | def store_transition(self, s, a, r): 116 | """ 117 | store data in memory buffer 118 | :param s: state 119 | :param a: act 120 | :param r: reward 121 | :return: 122 | """ 123 | self.ep_obs.append(np.array([s], np.float32)) 124 | self.ep_as.append(a) 125 | self.ep_rs.append(r) 126 | 127 | def learn(self): 128 | """ 129 | update policy parameters via stochastic gradient ascent 130 | :return: None 131 | """ 132 | # discount and normalize episode reward 133 | discounted_ep_rs_norm = self._discount_and_norm_rewards() 134 | 135 | with tf.GradientTape() as tape: 136 | 137 | _logits = self.model(np.vstack(self.ep_obs)) 138 | # to maximize total reward (log_p * R) is to minimize -(log_p * R), and the tf only have minimize(loss) 139 | neg_log_prob = tf.nn.sparse_softmax_cross_entropy_with_logits(logits=_logits, labels=np.array(self.ep_as)) 140 | # this is negative log of chosen action 141 | 142 | # or in this way: 143 | # neg_log_prob = tf.reduce_sum(-tf.log(self.all_act_prob)*tf.one_hot(self.tf_acts, self.n_actions), axis=1) 144 | 145 | loss = tf.reduce_mean(neg_log_prob * discounted_ep_rs_norm) # reward guided loss 146 | 147 | grad = tape.gradient(loss, self.model.trainable_weights) 148 | self.optimizer.apply_gradients(zip(grad, self.model.trainable_weights)) 149 | 150 | self.ep_obs, self.ep_as, self.ep_rs = [], [], [] # empty episode data 151 | return discounted_ep_rs_norm 152 | 153 | def _discount_and_norm_rewards(self): 154 | """ 155 | compute discount_and_norm_rewards 156 | :return: discount_and_norm_rewards 157 | """ 158 | # discount episode rewards 159 | discounted_ep_rs = np.zeros_like(self.ep_rs) 160 | running_add = 0 161 | for t in reversed(range(0, len(self.ep_rs))): 162 | running_add = running_add * self.gamma + self.ep_rs[t] 163 | discounted_ep_rs[t] = running_add 164 | 165 | # normalize episode rewards 166 | discounted_ep_rs -= np.mean(discounted_ep_rs) 167 | discounted_ep_rs /= np.std(discounted_ep_rs) 168 | return discounted_ep_rs 169 | 170 | def save_ckpt(self): 171 | """ 172 | save trained weights 173 | :return: None 174 | """ 175 | if not os.path.exists('model'): 176 | os.makedirs('model') 177 | tl.files.save_weights_to_hdf5('model/pg_policy.hdf5', self.model) 178 | 179 | def load_ckpt(self): 180 | """ 181 | load trained weights 182 | :return: None 183 | """ 184 | tl.files.load_hdf5_to_weights_in_order('model/pg_policy.hdf5', self.model) 185 | 186 | 187 | if __name__ == '__main__': 188 | 189 | # reproducible 190 | np.random.seed(RANDOMSEED) 191 | tf.random.set_seed(RANDOMSEED) 192 | 193 | tl.logging.set_verbosity(tl.logging.DEBUG) 194 | 195 | env = gym.make(ENV_NAME) 196 | env.seed(RANDOMSEED) # reproducible, general Policy gradient has high variance 197 | env = env.unwrapped 198 | 199 | print(env.action_space) 200 | print(env.observation_space) 201 | print(env.observation_space.high) 202 | print(env.observation_space.low) 203 | 204 | RL = PolicyGradient( 205 | n_actions=env.action_space.n, 206 | n_features=env.observation_space.shape[0], 207 | learning_rate=0.02, 208 | reward_decay=0.99, 209 | # 
output_graph=True, 210 | ) 211 | 212 | if args.train: 213 | reward_buffer = [] 214 | 215 | for i_episode in range(num_episodes): 216 | 217 | episode_time = time.time() 218 | observation = env.reset() 219 | 220 | while True: 221 | if RENDER: 222 | env.render() 223 | 224 | action = RL.choose_action(observation) 225 | 226 | observation_, reward, done, info = env.step(action) 227 | 228 | RL.store_transition(observation, action, reward) 229 | 230 | if done: 231 | ep_rs_sum = sum(RL.ep_rs) 232 | 233 | if 'running_reward' not in globals(): 234 | running_reward = ep_rs_sum 235 | else: 236 | running_reward = running_reward * 0.99 + ep_rs_sum * 0.01 237 | #打开渲染 可视化游戏界面 238 | #if running_reward > DISPLAY_REWARD_THRESHOLD: 239 | # RENDER = True # rendering 240 | 241 | # print("episode:", i_episode, " reward:", int(running_reward)) 242 | 243 | print( 244 | "Episode [%d/%d] \tsum reward: %d \trunning reward: %f \ttook: %.5fs " % 245 | (i_episode, num_episodes, ep_rs_sum, running_reward, time.time() - episode_time) 246 | ) 247 | reward_buffer.append(running_reward) 248 | 249 | vt = RL.learn() 250 | break 251 | ''' 252 | plt.ion() 253 | plt.cla() 254 | plt.title('PG') 255 | plt.plot(reward_buffer, ) # plot the episode vt 256 | plt.xlabel('episode steps') 257 | plt.ylabel('normalized state-action value') 258 | plt.show() 259 | plt.pause(0.1) 260 | ''' 261 | 262 | 263 | observation = observation_ 264 | RL.save_ckpt() 265 | #plt.ioff() 266 | #plt.show() 267 | 268 | # test 269 | RL.load_ckpt() 270 | observation = env.reset() 271 | while True: 272 | env.render() 273 | action = RL.choose_action(observation) 274 | observation, reward, done, info = env.step(action) 275 | if done: 276 | observation = env.reset() -------------------------------------------------------------------------------- /code/tensrolayer-implemented/ppo.py: -------------------------------------------------------------------------------- 1 | """ 2 | Proximal Policy Optimization (PPO) 3 | ---------------------------- 4 | A simple version of Proximal Policy Optimization (PPO) using single thread. 5 | PPO is a family of first-order methods that use a few other tricks to keep new policies close to old. 6 | PPO methods are significantly simpler to implement, and empirically seem to perform at least as well as TRPO. 7 | Reference 8 | --------- 9 | Proximal Policy Optimization Algorithms, Schulman et al. 2017 10 | High Dimensional Continuous Control Using Generalized Advantage Estimation, Schulman et al. 2016 11 | Emergence of Locomotion Behaviours in Rich Environments, Heess et al. 
2017 12 | MorvanZhou's tutorial page: https://morvanzhou.github.io/tutorials 13 | Environment 14 | ----------- 15 | Openai Gym Pendulum-v0, continual action space 16 | Prerequisites 17 | -------------- 18 | tensorflow >=2.0.0a0 19 | tensorflow-probability 0.6.0 20 | tensorlayer >=2.0.0 21 | To run 22 | ------ 23 | python tutorial_PPO.py --train/test 24 | """ 25 | import argparse 26 | import os 27 | import time 28 | 29 | import matplotlib.pyplot as plt 30 | import numpy as np 31 | 32 | import gym 33 | import tensorflow as tf 34 | import tensorflow_probability as tfp 35 | import tensorlayer as tl 36 | 37 | parser = argparse.ArgumentParser(description='Train or test neural net motor controller.') 38 | parser.add_argument('--train', dest='train', action='store_true', default=True) 39 | parser.add_argument('--test', dest='train', action='store_false') 40 | args = parser.parse_args() 41 | 42 | ##################### hyper parameters #################### 43 | 44 | ENV_NAME = 'Pendulum-v0' # environment name 45 | RANDOMSEED = 1 # random seed 46 | 47 | EP_MAX = 1000 # total number of episodes for training 48 | EP_LEN = 200 # total number of steps for each episode 49 | GAMMA = 0.9 # reward discount 50 | A_LR = 0.0001 # learning rate for actor 51 | C_LR = 0.0002 # learning rate for critic 52 | BATCH = 32 # update batchsize 53 | A_UPDATE_STEPS = 10 # actor update steps 54 | C_UPDATE_STEPS = 10 # critic update steps 55 | S_DIM, A_DIM = 3, 1 # state dimension, action dimension 56 | EPS = 1e-8 # epsilon 57 | METHOD = [ 58 | dict(name='kl_pen', kl_target=0.01, lam=0.5), # KL penalty 59 | dict(name='clip', epsilon=0.2), # Clipped surrogate objective, find this is better 60 | ][1] # choose the method for optimization 61 | 62 | ############################### PPO #################################### 63 | 64 | 65 | class PPO(object): 66 | ''' 67 | PPO class 68 | ''' 69 | 70 | def __init__(self): 71 | 72 | # critic 输出v值 73 | tfs = tl.layers.Input([None, S_DIM], tf.float32, 'state') 74 | l1 = tl.layers.Dense(100, tf.nn.relu)(tfs) 75 | v = tl.layers.Dense(1)(l1) 76 | self.critic = tl.models.Model(tfs, v) 77 | self.critic.train() 78 | 79 | # actor 2个pi网络,固定一个,训练另一个 80 | self.actor = self._build_anet('pi', trainable=True) 81 | self.actor_old = self._build_anet('oldpi', trainable=False) 82 | self.actor_opt = tf.optimizers.Adam(A_LR) 83 | self.critic_opt = tf.optimizers.Adam(C_LR) 84 | 85 | def a_train(self, tfs, tfa, tfadv): 86 | ''' 87 | Update policy network 88 | :param tfs: state 89 | :param tfa: act 90 | :param tfadv: advantage 91 | :return: 92 | ''' 93 | tfs = np.array(tfs, np.float32) 94 | tfa = np.array(tfa, np.float32) 95 | tfadv = np.array(tfadv, np.float32) #优势函数 -值函数的变形 96 | with tf.GradientTape() as tape: 97 | mu, sigma = self.actor(tfs) 98 | pi = tfp.distributions.Normal(mu, sigma) #初始化策略 正态分布 99 | 100 | mu_old, sigma_old = self.actor_old(tfs) 101 | oldpi = tfp.distributions.Normal(mu_old, sigma_old) 102 | 103 | # ratio = tf.exp(pi.log_prob(self.tfa) - oldpi.log_prob(self.tfa)) 104 | ratio = pi.prob(tfa) / (oldpi.prob(tfa) + EPS) 105 | surr = ratio * tfadv 106 | if METHOD['name'] == 'kl_pen': #使用KL作为惩罚项的loss 107 | tflam = METHOD['lam'] 108 | kl = tfp.distributions.kl_divergence(oldpi, pi) 109 | kl_mean = tf.reduce_mean(kl) 110 | aloss = -(tf.reduce_mean(surr - tflam * kl)) 111 | else: # clipping method, find this is better 使用clip作为loss 112 | aloss = -tf.reduce_mean( 113 | tf.minimum(surr, 114 | tf.clip_by_value(ratio, 1. - METHOD['epsilon'], 1. 
+ METHOD['epsilon']) * tfadv) 115 | ) 116 | a_gard = tape.gradient(aloss, self.actor.trainable_weights) 117 | 118 | self.actor_opt.apply_gradients(zip(a_gard, self.actor.trainable_weights)) 119 | 120 | if METHOD['name'] == 'kl_pen': 121 | return kl_mean 122 | 123 | def update_old_pi(self): 124 | ''' 125 | Update old policy parameter 126 | :return: None 127 | ''' 128 | for p, oldp in zip(self.actor.trainable_weights, self.actor_old.trainable_weights): 129 | oldp.assign(p) #将新的pi的参数直接赋值给旧的pi 130 | 131 | def c_train(self, tfdc_r, s): #训练critic网络,mse优化 132 | ''' 133 | Update actor network 134 | :param tfdc_r: cumulative reward 135 | :param s: state 136 | :return: None 137 | ''' 138 | tfdc_r = np.array(tfdc_r, dtype=np.float32) 139 | with tf.GradientTape() as tape: 140 | v = self.critic(s) 141 | advantage = tfdc_r - v 142 | closs = tf.reduce_mean(tf.square(advantage)) 143 | # print('tfdc_r value', tfdc_r) 144 | grad = tape.gradient(closs, self.critic.trainable_weights) 145 | self.critic_opt.apply_gradients(zip(grad, self.critic.trainable_weights)) 146 | 147 | def cal_adv(self, tfs, tfdc_r): 148 | ''' 149 | Calculate advantage 150 | :param tfs: state 151 | :param tfdc_r: cumulative reward 152 | :return: advantage 153 | ''' 154 | tfdc_r = np.array(tfdc_r, dtype=np.float32) 155 | advantage = tfdc_r - self.critic(tfs) 156 | return advantage.numpy() 157 | 158 | def update(self, s, a, r): 159 | ''' 160 | Update parameter with the constraint of KL divergent 161 | :param s: state 162 | :param a: act 163 | :param r: reward 164 | :return: None 165 | ''' 166 | s, a, r = s.astype(np.float32), a.astype(np.float32), r.astype(np.float32) 167 | 168 | self.update_old_pi() 169 | adv = self.cal_adv(s, r) 170 | # adv = (adv - adv.mean())/(adv.std()+1e-6) # sometimes helpful 171 | 172 | # update actor 173 | if METHOD['name'] == 'kl_pen': 174 | for _ in range(A_UPDATE_STEPS): 175 | kl = self.a_train(s, a, adv) 176 | if kl > 4 * METHOD['kl_target']: # this in in google's paper 177 | break 178 | if kl < METHOD['kl_target'] / 1.5: # adaptive lambda, this is in OpenAI's paper 179 | METHOD['lam'] /= 2 180 | elif kl > METHOD['kl_target'] * 1.5: 181 | METHOD['lam'] *= 2 182 | METHOD['lam'] = np.clip( 183 | METHOD['lam'], 1e-4, 10 184 | ) # sometimes explode, this clipping is MorvanZhou's solution 185 | else: # clipping method, find this is better (OpenAI's paper) 186 | for _ in range(A_UPDATE_STEPS): 187 | self.a_train(s, a, adv) 188 | 189 | # update critic 190 | for _ in range(C_UPDATE_STEPS): 191 | self.c_train(r, s) #Critic的训练一次更新十步 192 | 193 | def _build_anet(self, name, trainable): 194 | ''' 195 | Build policy network 196 | :param name: name 197 | :param trainable: trainable flag 198 | :return: policy network 199 | ''' 200 | tfs = tl.layers.Input([None, S_DIM], tf.float32, name + '_state') 201 | l1 = tl.layers.Dense(100, tf.nn.relu, name=name + '_l1')(tfs) 202 | a = tl.layers.Dense(A_DIM, tf.nn.tanh, name=name + '_a')(l1) 203 | mu = tl.layers.Lambda(lambda x: x * 2, name=name + '_lambda')(a) 204 | sigma = tl.layers.Dense(A_DIM, tf.nn.softplus, name=name + '_sigma')(l1) 205 | model = tl.models.Model(tfs, [mu, sigma], name) 206 | 207 | if trainable: 208 | model.train() 209 | else: 210 | model.eval() 211 | return model 212 | 213 | def choose_action(self, s): 214 | ''' 215 | Choose action 216 | :param s: state 217 | :return: clipped act 218 | ''' 219 | s = s[np.newaxis, :].astype(np.float32) 220 | mu, sigma = self.actor(s) 221 | pi = tfp.distributions.Normal(mu, sigma) 222 | a = tf.squeeze(pi.sample(1), axis=0)[0] # 
choosing action 223 | return np.clip(a, -2, 2) 224 | 225 | def get_v(self, s): 226 | ''' 227 | Compute value 228 | :param s: state 229 | :return: value 230 | ''' 231 | s = s.astype(np.float32) 232 | if s.ndim < 2: s = s[np.newaxis, :] 233 | return self.critic(s)[0, 0] 234 | 235 | def save_ckpt(self): 236 | """ 237 | save trained weights 238 | :return: None 239 | """ 240 | if not os.path.exists('model'): 241 | os.makedirs('model') 242 | tl.files.save_weights_to_hdf5('model/ppo_actor.hdf5', self.actor) 243 | tl.files.save_weights_to_hdf5('model/ppo_actor_old.hdf5', self.actor_old) 244 | tl.files.save_weights_to_hdf5('model/ppo_critic.hdf5', self.critic) 245 | 246 | def load_ckpt(self): 247 | """ 248 | load trained weights 249 | :return: None 250 | """ 251 | tl.files.load_hdf5_to_weights_in_order('model/ppo_actor.hdf5', self.actor) 252 | tl.files.load_hdf5_to_weights_in_order('model/ppo_actor_old.hdf5', self.actor_old) 253 | tl.files.load_hdf5_to_weights_in_order('model/ppo_critic.hdf5', self.critic) 254 | 255 | 256 | if __name__ == '__main__': 257 | 258 | env = gym.make(ENV_NAME).unwrapped 259 | 260 | # reproducible 261 | env.seed(RANDOMSEED) 262 | np.random.seed(RANDOMSEED) 263 | tf.random.set_seed(RANDOMSEED) 264 | 265 | ppo = PPO() 266 | 267 | if args.train: 268 | all_ep_r = [] 269 | for ep in range(EP_MAX): 270 | s = env.reset() 271 | buffer_s, buffer_a, buffer_r = [], [], [] 272 | ep_r = 0 273 | t0 = time.time() 274 | for t in range(EP_LEN): # in one episode 275 | # env.render() 276 | a = ppo.choose_action(s) 277 | s_, r, done, _ = env.step(a) 278 | buffer_s.append(s) 279 | buffer_a.append(a) 280 | buffer_r.append((r + 8) / 8) # normalize reward, find to be useful 281 | s = s_ 282 | ep_r += r 283 | 284 | # update ppo 285 | if (t + 1) % BATCH == 0 or t == EP_LEN - 1: #采集了一个Batch的数据或者走到最后一步了才进行更新 286 | v_s_ = ppo.get_v(s_) 287 | discounted_r = [] 288 | for r in buffer_r[::-1]: 289 | v_s_ = r + GAMMA * v_s_ 290 | discounted_r.append(v_s_) 291 | discounted_r.reverse() 292 | 293 | bs, ba, br = np.vstack(buffer_s), np.vstack(buffer_a), np.array(discounted_r)[:, np.newaxis] 294 | buffer_s, buffer_a, buffer_r = [], [], [] 295 | ppo.update(bs, ba, br) 296 | if ep == 0: 297 | all_ep_r.append(ep_r) 298 | else: 299 | all_ep_r.append(all_ep_r[-1] * 0.9 + ep_r * 0.1) 300 | print( 301 | 'Episode: {}/{} | Episode Reward: {:.4f} | Running Time: {:.4f}'.format( 302 | ep, EP_MAX, ep_r, 303 | time.time() - t0 304 | ) 305 | ) 306 | 307 | plt.ion() 308 | plt.cla() 309 | plt.title('PPO') 310 | plt.plot(np.arange(len(all_ep_r)), all_ep_r) 311 | plt.ylim(-2000, 0) 312 | plt.xlabel('Episode') 313 | plt.ylabel('Moving averaged episode reward') 314 | plt.show() 315 | plt.pause(0.1) 316 | ppo.save_ckpt() 317 | plt.ioff() 318 | plt.show() 319 | 320 | # test 321 | ppo.load_ckpt() 322 | while True: 323 | s = env.reset() 324 | for i in range(EP_LEN): 325 | env.render() 326 | s, r, done, _ = env.step(ppo.choose_action(s)) 327 | if done: 328 | break -------------------------------------------------------------------------------- /code/tensrolayer-implemented/qlearning.py: -------------------------------------------------------------------------------- 1 | """Q-Table learning algorithm. 2 | Non deep learning - TD Learning, Off-Policy, e-Greedy Exploration 3 | Q(S, A) <- Q(S, A) + alpha * (R + lambda * Q(newS, newA) - Q(S, A)) 4 | See David Silver RL Tutorial Lecture 5 - Q-Learning for more details. 
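Illustrative worked step (not part of the original text): with alpha = 0.85 and
lambda = 0.99 as set below, reaching the goal from a state where Q(S, A) = 0.3
gives R = 1 and max_a Q(newS, a) = 0 (terminal state), so
    Q(S, A) <- 0.3 + 0.85 * (1 + 0.99 * 0 - 0.3) = 0.895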
5 | For Q-Network, see tutorial_frozenlake_q_network.py 6 | EN: https://medium.com/emergent-future/simple-reinforcement-learning-with-tensorflow-part-0-q-learning-with-tables-and-neural-networks-d195264329d0#.5m3361vlw 7 | CN: https://zhuanlan.zhihu.com/p/25710327 8 | tensorflow==2.0.0a0 9 | tensorlayer==2.0.0 10 | """ 11 | 12 | import time 13 | 14 | import numpy as np 15 | 16 | import gym 17 | 18 | ## Load the environment 19 | env = gym.make('FrozenLake-v0') 20 | render = False # display the game environment 21 | running_reward = None 22 | 23 | ##================= Implement Q-Table learning algorithm =====================## 24 | ## Initialize table with all zeros 25 | Q = np.zeros([env.observation_space.n, env.action_space.n]) 26 | ## Set learning parameters 27 | lr = .85 # alpha, if use value function approximation, we can ignore it 28 | lambd = .99 # decay factor 29 | num_episodes = 10000 30 | rList = [] # rewards for each episode 31 | for i in range(num_episodes): 32 | ## Reset environment and get first new observation 33 | episode_time = time.time() 34 | s = env.reset() 35 | rAll = 0 36 | ## The Q-Table learning algorithm 37 | for j in range(99): 38 | if render: env.render() 39 | ## Choose an action by greedily (with noise) picking from Q table 40 | a = np.argmax(Q[s, :] + np.random.randn(1, env.action_space.n) * (1. / (i + 1))) 41 | ## Get new state and reward from environment 42 | s1, r, d, _ = env.step(a) 43 | ## Update Q-Table with new knowledge 44 | Q[s, a] = Q[s, a] + lr * (r + lambd * np.max(Q[s1, :]) - Q[s, a]) 45 | rAll += r 46 | s = s1 47 | if d ==True: 48 | break 49 | rList.append(rAll) 50 | running_reward = r if running_reward is None else running_reward * 0.99 + r * 0.01 51 | print("Episode [%d/%d] sum reward: %f running reward: %f took: %.5fs " % \ 52 | (i, num_episodes, rAll, running_reward, time.time() - episode_time)) 53 | 54 | print("Final Q-Table Values:/n %s" % Q) -------------------------------------------------------------------------------- /code/tensrolayer-implemented/tutorial_wrappers.py: -------------------------------------------------------------------------------- 1 | """Env wrappers 2 | Note that this file is adapted from `https://pypi.org/project/gym-vec-env` and 3 | `https://github.com/openai/baselines/blob/master/baselines/common/*wrappers.py` 4 | """ 5 | from collections import deque 6 | from functools import partial 7 | from multiprocessing import Pipe, Process, cpu_count 8 | from sys import platform 9 | 10 | import numpy as np 11 | 12 | import cv2 13 | import gym 14 | from gym import spaces 15 | 16 | __all__ = ( 17 | 'build_env', # build env 18 | 'TimeLimit', # Time limit wrapper 19 | 'NoopResetEnv', # Run random number of no-ops on reset 20 | 'FireResetEnv', # Reset wrapper for envs with fire action 21 | 'EpisodicLifeEnv', # end-of-life == end-of-episode wrapper 22 | 'MaxAndSkipEnv', # skip frame wrapper 23 | 'ClipRewardEnv', # clip reward wrapper 24 | 'WarpFrame', # warp observation wrapper 25 | 'FrameStack', # stack frame wrapper 26 | 'LazyFrames', # lazy store wrapper 27 | 'RewardScaler', # reward scale 28 | 'SubprocVecEnv', # vectorized env wrapper 29 | 'VecFrameStack', # stack frames in vectorized env 30 | 'Monitor', # Episode reward and length monitor 31 | ) 32 | cv2.ocl.setUseOpenCL(False) 33 | # env_id -> env_type 34 | id2type = dict() 35 | for _env in gym.envs.registry.all(): 36 | id2type[_env.id] = _env.entry_point.split(':')[0].rsplit('.', 1)[1] 37 | 38 | 39 | def build_env(env_id, vectorized=False, seed=0, reward_scale=1.0, nenv=0): 40 
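    # Illustrative usage (not part of the original file), as called from dqn_variants.py:
    #   env = build_env('CartPole-v0', seed=0)          # classic_control -> Monitor-wrapped env
    #   env = build_env('PongNoFrameskip-v4', seed=0)   # atari -> noop/skip/warp/clip + 4-frame stack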
| """Build env based on options""" 41 | env_type = id2type[env_id] 42 | nenv = nenv or cpu_count() // (1 + (platform == 'darwin')) 43 | stack = env_type == 'atari' 44 | if not vectorized: 45 | env = _make_env(env_id, env_type, seed, reward_scale, stack) 46 | else: 47 | env = _make_vec_env(env_id, env_type, nenv, seed, reward_scale, stack) 48 | 49 | return env 50 | 51 | 52 | def _make_env(env_id, env_type, seed, reward_scale, frame_stack=True): 53 | """Make single env""" 54 | if env_type == 'atari': 55 | env = gym.make(env_id) 56 | assert 'NoFrameskip' in env.spec.id 57 | env = NoopResetEnv(env, noop_max=30) 58 | env = MaxAndSkipEnv(env, skip=4) 59 | env = Monitor(env) 60 | # deepmind wrap 61 | env = EpisodicLifeEnv(env) 62 | if 'FIRE' in env.unwrapped.get_action_meanings(): 63 | env = FireResetEnv(env) 64 | env = WarpFrame(env) 65 | env = ClipRewardEnv(env) 66 | if frame_stack: 67 | env = FrameStack(env, 4) 68 | elif env_type == 'classic_control': 69 | env = Monitor(gym.make(env_id)) 70 | else: 71 | raise NotImplementedError 72 | if reward_scale != 1: 73 | env = RewardScaler(env, reward_scale) 74 | env.seed(seed) 75 | return env 76 | 77 | 78 | def _make_vec_env(env_id, env_type, nenv, seed, reward_scale, frame_stack=True): 79 | """Make vectorized env""" 80 | env = SubprocVecEnv([partial(_make_env, env_id, env_type, seed + i, reward_scale, False) for i in range(nenv)]) 81 | if frame_stack: 82 | env = VecFrameStack(env, 4) 83 | return env 84 | 85 | 86 | class TimeLimit(gym.Wrapper): 87 | 88 | def __init__(self, env, max_episode_steps=None): 89 | super(TimeLimit, self).__init__(env) 90 | self._max_episode_steps = max_episode_steps 91 | self._elapsed_steps = 0 92 | 93 | def step(self, ac): 94 | observation, reward, done, info = self.env.step(ac) 95 | self._elapsed_steps += 1 96 | if self._elapsed_steps >= self._max_episode_steps: 97 | done = True 98 | info['TimeLimit.truncated'] = True 99 | return observation, reward, done, info 100 | 101 | def reset(self, **kwargs): 102 | self._elapsed_steps = 0 103 | return self.env.reset(**kwargs) 104 | 105 | 106 | class NoopResetEnv(gym.Wrapper): 107 | 108 | def __init__(self, env, noop_max=30): 109 | """Sample initial states by taking random number of no-ops on reset. 110 | No-op is assumed to be action 0. 
111 | """ 112 | super(NoopResetEnv, self).__init__(env) 113 | self.noop_max = noop_max 114 | self.override_num_noops = None 115 | self.noop_action = 0 116 | assert env.unwrapped.get_action_meanings()[0] == 'NOOP' 117 | 118 | def reset(self, **kwargs): 119 | """ Do no-op action for a number of steps in [1, noop_max].""" 120 | self.env.reset(**kwargs) 121 | if self.override_num_noops is not None: 122 | noops = self.override_num_noops 123 | else: 124 | noops = self.unwrapped.np_random.randint(1, self.noop_max + 1) 125 | assert noops > 0 126 | obs = None 127 | for _ in range(noops): 128 | obs, _, done, _ = self.env.step(self.noop_action) 129 | if done: 130 | obs = self.env.reset(**kwargs) 131 | return obs 132 | 133 | def step(self, ac): 134 | return self.env.step(ac) 135 | 136 | 137 | class FireResetEnv(gym.Wrapper): 138 | 139 | def __init__(self, env): 140 | """Take action on reset for environments that are fixed until firing.""" 141 | super(FireResetEnv, self).__init__(env) 142 | assert env.unwrapped.get_action_meanings()[1] == 'FIRE' 143 | assert len(env.unwrapped.get_action_meanings()) >= 3 144 | 145 | def reset(self, **kwargs): 146 | self.env.reset(**kwargs) 147 | obs, _, done, _ = self.env.step(1) 148 | if done: 149 | self.env.reset(**kwargs) 150 | obs, _, done, _ = self.env.step(2) 151 | if done: 152 | self.env.reset(**kwargs) 153 | return obs 154 | 155 | def step(self, ac): 156 | return self.env.step(ac) 157 | 158 | 159 | class EpisodicLifeEnv(gym.Wrapper): 160 | 161 | def __init__(self, env): 162 | """Make end-of-life == end-of-episode, but only reset on true game over. 163 | Done by DeepMind for the DQN and co. since it helps value estimation. 164 | """ 165 | super(EpisodicLifeEnv, self).__init__(env) 166 | self.lives = 0 167 | self.was_real_done = True 168 | 169 | def step(self, action): 170 | obs, reward, done, info = self.env.step(action) 171 | self.was_real_done = done 172 | # check current lives, make loss of life terminal, 173 | # then update lives to handle bonus lives 174 | lives = self.env.unwrapped.ale.lives() 175 | if 0 < lives < self.lives: 176 | # for Qbert sometimes we stay in lives == 0 condition for a few 177 | # frames so it's important to keep lives > 0, so that we only reset 178 | # once the environment advertises done. 179 | done = True 180 | self.lives = lives 181 | return obs, reward, done, info 182 | 183 | def reset(self, **kwargs): 184 | """Reset only when lives are exhausted. 185 | This way all states are still reachable even though lives are episodic, 186 | and the learner need not know about any of this behind-the-scenes. 
187 | """ 188 | if self.was_real_done: 189 | obs = self.env.reset(**kwargs) 190 | else: 191 | # no-op step to advance from terminal/lost life state 192 | obs, _, _, _ = self.env.step(0) 193 | self.lives = self.env.unwrapped.ale.lives() 194 | return obs 195 | 196 | 197 | class MaxAndSkipEnv(gym.Wrapper): 198 | 199 | def __init__(self, env, skip=4): 200 | """Return only every `skip`-th frame""" 201 | super(MaxAndSkipEnv, self).__init__(env) 202 | # most recent raw observations (for max pooling across time steps) 203 | shape = (2, ) + env.observation_space.shape 204 | self._obs_buffer = np.zeros(shape, dtype=np.uint8) 205 | self._skip = skip 206 | 207 | def step(self, action): 208 | """Repeat action, sum reward, and max over last observations.""" 209 | total_reward = 0.0 210 | done = info = None 211 | for i in range(self._skip): 212 | obs, reward, done, info = self.env.step(action) 213 | if i == self._skip - 2: 214 | self._obs_buffer[0] = obs 215 | if i == self._skip - 1: 216 | self._obs_buffer[1] = obs 217 | total_reward += reward 218 | if done: 219 | break 220 | # Note that the observation on the done=True frame doesn't matter 221 | max_frame = self._obs_buffer.max(axis=0) 222 | 223 | return max_frame, total_reward, done, info 224 | 225 | def reset(self, **kwargs): 226 | return self.env.reset(**kwargs) 227 | 228 | 229 | class ClipRewardEnv(gym.RewardWrapper): 230 | 231 | def __init__(self, env): 232 | super(ClipRewardEnv, self).__init__(env) 233 | 234 | def reward(self, reward): 235 | """Bin reward to {+1, 0, -1} by its sign.""" 236 | return np.sign(reward) 237 | 238 | 239 | class WarpFrame(gym.ObservationWrapper): 240 | 241 | def __init__(self, env, width=84, height=84, grayscale=True): 242 | """Warp frames to 84x84 as done in the Nature paper and later work.""" 243 | super(WarpFrame, self).__init__(env) 244 | self.width = width 245 | self.height = height 246 | self.grayscale = grayscale 247 | shape = (self.height, self.width, 1 if self.grayscale else 3) 248 | self.observation_space = spaces.Box(low=0, high=255, shape=shape, dtype=np.uint8) 249 | 250 | def observation(self, frame): 251 | if self.grayscale: 252 | frame = cv2.cvtColor(frame, cv2.COLOR_RGB2GRAY) 253 | size = (self.width, self.height) 254 | frame = cv2.resize(frame, size, interpolation=cv2.INTER_AREA) 255 | if self.grayscale: 256 | frame = np.expand_dims(frame, -1) 257 | return frame 258 | 259 | 260 | class FrameStack(gym.Wrapper): 261 | 262 | def __init__(self, env, k): 263 | """Stack k last frames. 264 | Returns lazy array, which is much more memory efficient. 265 | See Also `LazyFrames` 266 | """ 267 | super(FrameStack, self).__init__(env) 268 | self.k = k 269 | self.frames = deque([], maxlen=k) 270 | shp = env.observation_space.shape 271 | shape = shp[:-1] + (shp[-1] * k, ) 272 | self.observation_space = spaces.Box(low=0, high=255, shape=shape, dtype=env.observation_space.dtype) 273 | 274 | def reset(self): 275 | ob = self.env.reset() 276 | for _ in range(self.k): 277 | self.frames.append(ob) 278 | return np.asarray(self._get_ob()) 279 | 280 | def step(self, action): 281 | ob, reward, done, info = self.env.step(action) 282 | self.frames.append(ob) 283 | return np.asarray(self._get_ob()), reward, done, info 284 | 285 | def _get_ob(self): 286 | assert len(self.frames) == self.k 287 | return LazyFrames(list(self.frames)) 288 | 289 | 290 | class LazyFrames(object): 291 | 292 | def __init__(self, frames): 293 | """This object ensures that common frames between the observations are 294 | only stored once. 
It exists purely to optimize memory usage which can be 295 | huge for DQN's 1M frames replay buffers. 296 | This object should only be converted to numpy array before being passed 297 | to the model. You'd not believe how complex the previous solution was. 298 | """ 299 | self._frames = frames 300 | self._out = None 301 | 302 | def _force(self): 303 | if self._out is None: 304 | self._out = np.concatenate(self._frames, axis=-1) 305 | self._frames = None 306 | return self._out 307 | 308 | def __array__(self, dtype=None): 309 | out = self._force() 310 | if dtype is not None: 311 | out = out.astype(dtype) 312 | return out 313 | 314 | def __len__(self): 315 | return len(self._force()) 316 | 317 | def __getitem__(self, i): 318 | return self._force()[i] 319 | 320 | 321 | class RewardScaler(gym.RewardWrapper): 322 | """Bring rewards to a reasonable scale for PPO. 323 | This is incredibly important and effects performance drastically. 324 | """ 325 | 326 | def __init__(self, env, scale=0.01): 327 | super(RewardScaler, self).__init__(env) 328 | self.scale = scale 329 | 330 | def reward(self, reward): 331 | return reward * self.scale 332 | 333 | 334 | class VecFrameStack(object): 335 | 336 | def __init__(self, env, k): 337 | self.env = env 338 | self.k = k 339 | self.action_space = env.action_space 340 | self.frames = deque([], maxlen=k) 341 | shp = env.observation_space.shape 342 | shape = shp[:-1] + (shp[-1] * k, ) 343 | self.observation_space = spaces.Box(low=0, high=255, shape=shape, dtype=env.observation_space.dtype) 344 | 345 | def reset(self): 346 | ob = self.env.reset() 347 | for _ in range(self.k): 348 | self.frames.append(ob) 349 | return np.asarray(self._get_ob()) 350 | 351 | def step(self, action): 352 | ob, reward, done, info = self.env.step(action) 353 | self.frames.append(ob) 354 | return np.asarray(self._get_ob()), reward, done, info 355 | 356 | def _get_ob(self): 357 | assert len(self.frames) == self.k 358 | return LazyFrames(list(self.frames)) 359 | 360 | 361 | def _worker(remote, parent_remote, env_fn_wrapper): 362 | parent_remote.close() 363 | env = env_fn_wrapper.x() 364 | while True: 365 | cmd, data = remote.recv() 366 | if cmd == 'step': 367 | ob, reward, done, info = env.step(data) 368 | if done: 369 | ob = env.reset() 370 | remote.send((ob, reward, done, info)) 371 | elif cmd == 'reset': 372 | ob = env.reset() 373 | remote.send(ob) 374 | elif cmd == 'reset_task': 375 | ob = env._reset_task() 376 | remote.send(ob) 377 | elif cmd == 'close': 378 | remote.close() 379 | break 380 | elif cmd == 'get_spaces': 381 | remote.send((env.observation_space, env.action_space)) 382 | else: 383 | raise NotImplementedError 384 | 385 | 386 | class CloudpickleWrapper(object): 387 | """ 388 | Uses cloudpickle to serialize contents 389 | """ 390 | 391 | def __init__(self, x): 392 | self.x = x 393 | 394 | def __getstate__(self): 395 | import cloudpickle 396 | return cloudpickle.dumps(self.x) 397 | 398 | def __setstate__(self, ob): 399 | import pickle 400 | self.x = pickle.loads(ob) 401 | 402 | 403 | class SubprocVecEnv(object): 404 | 405 | def __init__(self, env_fns): 406 | """ 407 | envs: list of gym environments to run in subprocesses 408 | """ 409 | self.num_envs = len(env_fns) 410 | 411 | self.waiting = False 412 | self.closed = False 413 | nenvs = len(env_fns) 414 | self.nenvs = nenvs 415 | self.remotes, self.work_remotes = zip(*[Pipe() for _ in range(nenvs)]) 416 | zipped_args = zip(self.work_remotes, self.remotes, env_fns) 417 | self.ps = [ 418 | Process(target=_worker, args=(work_remote, 
remote, CloudpickleWrapper(env_fn))) 419 | for (work_remote, remote, env_fn) in zipped_args 420 | ] 421 | 422 | for p in self.ps: 423 | # if the main process crashes, we should not cause things to hang 424 | p.daemon = True 425 | p.start() 426 | for remote in self.work_remotes: 427 | remote.close() 428 | 429 | self.remotes[0].send(('get_spaces', None)) 430 | observation_space, action_space = self.remotes[0].recv() 431 | self.observation_space = observation_space 432 | self.action_space = action_space 433 | 434 | def _step_async(self, actions): 435 | """ 436 | Tell all the environments to start taking a step 437 | with the given actions. 438 | Call step_wait() to get the results of the step. 439 | You should not call this if a step_async run is 440 | already pending. 441 | """ 442 | for remote, action in zip(self.remotes, actions): 443 | remote.send(('step', action)) 444 | self.waiting = True 445 | 446 | def _step_wait(self): 447 | """ 448 | Wait for the step taken with step_async(). 449 | Returns (obs, rews, dones, infos): 450 | - obs: an array of observations, or a tuple of 451 | arrays of observations. 452 | - rews: an array of rewards 453 | - dones: an array of "episode done" booleans 454 | - infos: a sequence of info objects 455 | """ 456 | results = [remote.recv() for remote in self.remotes] 457 | self.waiting = False 458 | obs, rews, dones, infos = zip(*results) 459 | return np.stack(obs), np.stack(rews), np.stack(dones), infos 460 | 461 | def reset(self): 462 | """ 463 | Reset all the environments and return an array of 464 | observations, or a tuple of observation arrays. 465 | If step_async is still doing work, that work will 466 | be cancelled and step_wait() should not be called 467 | until step_async() is invoked again. 468 | """ 469 | for remote in self.remotes: 470 | remote.send(('reset', None)) 471 | return np.stack([remote.recv() for remote in self.remotes]) 472 | 473 | def _reset_task(self): 474 | for remote in self.remotes: 475 | remote.send(('reset_task', None)) 476 | return np.stack([remote.recv() for remote in self.remotes]) 477 | 478 | def close(self): 479 | if self.closed: 480 | return 481 | if self.waiting: 482 | for remote in self.remotes: 483 | remote.recv() 484 | for remote in self.remotes: 485 | remote.send(('close', None)) 486 | for p in self.ps: 487 | p.join() 488 | self.closed = True 489 | 490 | def __len__(self): 491 | return self.nenvs 492 | 493 | def step(self, actions): 494 | self._step_async(actions) 495 | return self._step_wait() 496 | 497 | 498 | class Monitor(gym.Wrapper): 499 | 500 | def __init__(self, env): 501 | super(Monitor, self).__init__(env) 502 | self._monitor_rewards = None 503 | 504 | def reset(self, **kwargs): 505 | self._monitor_rewards = [] 506 | return self.env.reset(**kwargs) 507 | 508 | def step(self, action): 509 | o_, r, done, info = self.env.step(action) 510 | self._monitor_rewards.append(r) 511 | if done: 512 | info['episode'] = {'r': sum(self._monitor_rewards), 'l': len(self._monitor_rewards)} 513 | return o_, r, done, info 514 | 515 | 516 | class NormalizedActions(gym.ActionWrapper): 517 | 518 | def _action(self, action): 519 | low = self.action_space.low 520 | high = self.action_space.high 521 | 522 | action = low + (action + 1.0) * 0.5 * (high - low) 523 | action = np.clip(action, low, high) 524 | 525 | return action 526 | 527 | def _reverse_action(self, action): 528 | low = self.action_space.low 529 | high = self.action_space.high 530 | 531 | action = 2 * (action - low) / (high - low) - 1 532 | action = np.clip(action, low, 
high) 533 | 534 | return action 535 | 536 | 537 | def unit_test(): 538 | env_id = 'CartPole-v0' 539 | unwrapped_env = gym.make(env_id) 540 | wrapped_env = build_env(env_id, False) 541 | o = wrapped_env.reset() 542 | print('Reset {} observation shape {}'.format(env_id, o.shape)) 543 | done = False 544 | while not done: 545 | a = unwrapped_env.action_space.sample() 546 | o_, r, done, info = wrapped_env.step(a) 547 | print('Take action {} get reward {} info {}'.format(a, r, info)) 548 | 549 | env_id = 'PongNoFrameskip-v4' 550 | nenv = 2 551 | unwrapped_env = gym.make(env_id) 552 | wrapped_env = build_env(env_id, True, nenv=nenv) 553 | o = wrapped_env.reset() 554 | print('Reset {} observation shape {}'.format(env_id, o.shape)) 555 | for _ in range(1000): 556 | a = [unwrapped_env.action_space.sample() for _ in range(nenv)] 557 | a = np.asarray(a, 'int64') 558 | o_, r, done, info = wrapped_env.step(a) 559 | print('Take action {} get reward {} info {}'.format(a, r, info)) 560 | 561 | 562 | if __name__ == '__main__': 563 | unit_test() -------------------------------------------------------------------------------- /notes/.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/morningsky/NTU-ReinforcementLearning-Notes/d4a9dbf584ae24d974d9b7839f34cee3f18b79dd/notes/.DS_Store -------------------------------------------------------------------------------- /notes/1 Introduction.md: -------------------------------------------------------------------------------- 1 | # 李宏毅深度强化学习 笔记 2 | 3 | 课程主页:[NTU-MLDS18](http://speech.ee.ntu.edu.tw/~tlkagk/courses_MLDS18.html) 4 | 5 | 视频:[youtube](https://www.youtube.com/playlist?list=PLJV_el3uVTsODxQFgzMzPLa16h6B8kWM_) [B站](https://www.bilibili.com/video/av24724071/?spm_id_from=333.788.videocard.4) 6 | 7 | 参考资料: [作业代码参考](https://github.com/JasonYao81000/MLDS2018SPRING/tree/master/hw4) [纯numpy实现非Deep的RL算法](https://github.com/ddbourgin/numpy-ml/tree/master/numpy_ml/rl_models) [OpenAI tutorial](https://github.com/openai/spinningup/tree/master/docs) 8 | 9 | # 1. 
Introduction 10 | 11 | ![1](http://oss.hackslog.cn/imgs/075034.png) 12 | 13 | 14 | 15 | 这门课的学习路线如上,强化学习是作为单独一个模块介绍。李宏毅老师讲这门课不是从MDP开始讲起,而是从如何获得最大化奖励出发,直接引出Policy Gradient(以及PPO),再讲Q-learning(原始Q-learning,DQN,各种DQN的升级),然后是A2C(以及A3C, DDPG),紧接着介绍了一些Reward Shaping的方法(主要是Curiosity,Curriculum Learning ,Hierarchical RL),最后介绍Imitation Learning (Inverse RL)。比较全面的展现了深度强化学习的核心内容,也比较直观。 16 | 17 | ![image-20191029211249836](http://oss.hackslog.cn/imgs/075024.png) 18 | 19 | 首先强化学习是一种解决序列决策问题的方法,他是通过与环境交互进行学习。首先会有一个Env,给agent一个state,agent根据得到的state执行一个action,这个action会改变Env,使自己跳转到下一个state,同时Env会反馈给agent一个reward,agent学习的目标就是通过采取action,使得reward的期望最大化。 20 | 21 | ![image-20191029211454593](http://oss.hackslog.cn/imgs/075050.png) 22 | 23 | 24 | 25 | 在alpha go的例子中,state(又称observation)为所看到的棋盘,action就是落子,reward通过围棋的规则给出,如果最终赢了,得1,输了,得-1。 26 | 27 | 下面从2个例子中看强化学习与有监督学习的区别。RL不需要给定标签,但需要有reward。 28 | 29 | ![image-20191029211749897](http://oss.hackslog.cn/imgs/075057.png) 30 | 31 | 实际上alphgo是从提前收集的数据上进行有监督学习,效果不错后,再去做强化学习,提高水平。 32 | 33 | ![image-20191029211848429](http://oss.hackslog.cn/imgs/075104.png) 34 | 35 | 36 | 37 | 人没有告诉机器人具体哪里说错了,机器需要根据最终的评价自己总结,一般需要对话好多次。所以通常训练对话模型会训练2个agent互相对话 38 | 39 | ![image-20191029212015623](http://oss.hackslog.cn/imgs/075108.png) 40 | 41 | ![image-20191029212144625](http://oss.hackslog.cn/imgs/075112.png) 42 | 43 | 44 | 45 | 一个难点是怎么判断对话的效果,一般会设置一些预先定义的规则。 46 | 47 | ![image-20191029212313069](http://oss.hackslog.cn/imgs/075117.png) 48 | 49 | 强化学习还有很多成功的应用,凡是序列决策问题,大多数可以用RL解决。 50 | 51 | -------------------------------------------------------------------------------- /notes/2 Policy Gradient.md: -------------------------------------------------------------------------------- 1 | # 2. Policy Gradient 2 | 3 | ## 2.1 Origin Policy Gradient 4 | 5 | ![](http://oss.hackslog.cn/imgs/075626.jpg) 6 | 7 | 在alpha go场景中,actor决定下哪个位置,env就是你的对手,reward是围棋的规则。强化学习三大基本组件里面,env和reward是事先给定的,我们唯一能做的就是通过调整actor,使得到的累积reward最大化。 8 | 9 | 10 | 11 | ![](http://oss.hackslog.cn/imgs/075657.jpg) 12 | 13 | 14 | 15 | 一般把actor的策略定义成Policy,数学符号为$\pi$,参数是$\theta$,本质是一个NN(神经网络)。 16 | 17 | 18 | 19 | 那么针对Atari游戏:输入游戏的画面,Policy $\pi$输出各个动作的概率,agent根据这个概率分布采取行动。通过调整$\theta$, 我们就可以调整策略的输出。 20 | 21 | 22 | 23 | ![](http://oss.hackslog.cn/imgs/075712.jpg)_page-0007) 24 | 25 | 26 | 27 | 每次采取一个行动会有一个reward 28 | 29 | 30 | 31 | ![](http://oss.hackslog.cn/imgs/075809.jpg) 32 | 33 | 玩一场游戏叫做一个episode,actor存在的目的就是最大化所能得到的return,这个return指的是每一个时间步得到的reward之和。注意我们期望最大化的是return,不是一个时刻的reward。 34 | 35 | 36 | 37 | 如果max的目标是当下时刻的reward,那么在Atari游戏中如果agent在某个s下执行开火,得到了较大的reward,那么可能agent就会一直选择开火。并不代表,最终能够取得游戏的胜利。 38 | 39 | 40 | 41 | 那么,怎么得到这个actor呢? 42 | 43 | 44 | 45 | ![](http://oss.hackslog.cn/imgs/075903.jpg) 46 | 47 | 48 | 49 | 先定义玩一次游戏,即一个episode的游戏记录为trajectory $\tau$,内容如图所示,是s-a组成的序列对。 50 | 51 | 52 | 53 | 假设actor的参数$\theta$已经给定,则可以得到每个$\tau$出现的概率。这个概率取决于两部分,$p\left(s_{t+1} | s_{t}, a_{t}\right)$部分由env的机制决定,actor没法控制,我们能控制的是$p_{\theta}\left(a_{t} | s_{t}\right)$ 由$\pi$的参数$\theta$决定。 54 | 55 | 56 | 57 | ![](http://oss.hackslog.cn/imgs/075918.jpg) 58 | 59 | 60 | 61 | 定义$R(\tau)$ 为一个episode的总的reward,即每个时间步下的即时reward相加,我习惯表述为return。 62 | 63 | 64 | 65 | 定义$\bar{R}_{\theta}$ 为$R(\tau)$的期望,等价于将每一个轨迹$\tau$出现的概率乘与其return,再求和。 66 | 67 | 68 | 69 | 由于$R(\tau)$是一个随机变量,因为actor本身在给定同样的state下会采取什么行为具有随机性,env在给定行为下输出什么state,也是随机的,所以只能算$R(\tau)$的期望。 70 | 71 | 72 | 73 | 我们的目标就变成了最大化Expected Reward,那么如何最大化? 
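这里的期望在实现中就是用采样平均来近似的:采样N条轨迹,对每条轨迹的$R(\tau)$求和再取平均。下面给出一个极简的numpy示意(轨迹和reward都是随手构造的假设数据,并非课程或repo里的代码):

```python
import numpy as np

# 假设采样到了3条轨迹(episode),每个子列表是该轨迹中每个时间步的即时reward r_t
sampled_episodes = [
    [0.0, 0.0, 1.0],   # R(tau_1) = 1
    [0.0, 1.0, 1.0],   # R(tau_2) = 2
    [0.0, 0.0, 0.0],   # R(tau_3) = 0
]

returns = [np.sum(r) for r in sampled_episodes]  # 每条轨迹的 R(tau)
R_bar = np.mean(returns)                         # 用N条轨迹的平均近似期望回报
print(R_bar)                                     # 1.0
```

接下来的问题就是:怎么调整$\theta$让这个平均回报变大,也就是下文的梯度推导。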
74 | 75 | 76 | 77 | ![](http://oss.hackslog.cn/imgs/075954.jpg) 78 | 79 | 80 | 81 | 优化算法是梯度更新,首先我们先计算出$\bar{R}_{\theta}$ 对$\theta$的梯度。 82 | 83 | 84 | 85 | 从公式中可以看出$R(\tau)$可以是不可微的,因为与参数无关,不需要求导。 86 | 87 | 88 | 89 | 第一个改写(红色部分):将加权求和写成期望的形式。 90 | 91 | 92 | 93 | 第二个近似:实际上没有办法把所有可能的轨迹(游戏记录)都求出来,所以一般是采样N个轨迹 94 | 95 | 96 | 97 | 第三个改写:将$p_{\theta}\left(\tau^{n}\right)$的表达式展开(前2页slide),去掉跟$\theta$无关的项(不需要求导),则可达到最终的简化结果。具体如下:首先用actor采集一个游戏记录 98 | 99 | 100 | 101 | ![image-20191029215615001](http://oss.hackslog.cn/imgs/2019-11-06-080254.png) 102 | 103 | 104 | 105 | ![image-20191029220147651](http://oss.hackslog.cn/imgs/2019-11-06-080301.png) 106 | 107 | 108 | 109 | 最终得到的公式相当的直觉,在s下采取了a导致最终结果赢了,那么return就是正的,也就是会增加相应的s-a出现的概率P。 110 | 111 | 112 | 113 | 上面的公式推导中可能会有疑问,为什么要引入log?再乘一个概率除一个概率?原因非常的直觉,如下:如果动作b本来出现的次数就多,那么在加权平均所有的episode后,参数会偏好执行动作b,而实际上动作b得到的return比a低,所以除掉自身出现的概率,以降低其对训练的影响。 114 | 115 | 116 | 117 | ![image-20191029220546039](http://oss.hackslog.cn/imgs/2019-11-06-080313.png) 118 | 119 | 120 | 121 | 那么,到底是怎么更新参数的呢? 122 | 123 | 124 | 125 | ![](http://oss.hackslog.cn/imgs/080148.jpg) 126 | 127 | 首先会拿agent跟环境互动,收集大量游戏记录,然后把每一个游戏记录拿到右边,计算一个参数theta的更新值,更新参数后,再拿新的actor去收集游戏记录,不断循环。 128 | 129 | 130 | 131 | 注意:一般采样的数据只会用一次,用完就丢弃 132 | 133 | 134 | 135 | ![](http://oss.hackslog.cn/imgs/2019-11-06-080339.jpg) 136 | 137 | 138 | 139 | 具体实现:可当成一个分类任务,只是分类的结果不是识别object,是给出actor要执行的动作。 140 | 141 | 142 | 143 | 如何构建训练集? 采样得到的a,作为ground truth。然后去最小化loss function。 144 | 145 | 146 | 147 | 一般的分类问题loss function是交叉熵,在强化学习里面,只需要在前面乘一个weight,即交叉熵乘一个return。 148 | 149 | 150 | 151 | 实现的过程中还有一些tips可以提高效果: 152 | 153 | 154 | 155 | ![](http://oss.hackslog.cn/imgs/2019-11-06-080618.jpg) 156 | 157 | 158 | 159 | 如果 reward都是正的,那么理想的情况下:reward从大到小 b>a>c, 出现次数 b>a>c, 经过训练以后,reward值高的a,c会提高出现的概率,b会降低。但如果a没有采样到,则a出现的概率最终可能会下降,尽管a的reward高。 160 | 161 | 162 | 163 | 解决方法:增加一个baseline,用r-b作为新的reward,让其有正有负。最简单的做法是b取所有轨迹的平均回报。 164 | 165 | 一般r-b叫做优势函数Advantage Functions。我们不需要描述一个行动的绝对好坏,而只需要知道它相对于平均水平的优势。 166 | 167 | 168 | 169 | ![](http://oss.hackslog.cn/imgs/2019-11-06-080716.jpg) 170 | 171 | 172 | 173 | 在这个公式里面,对于一个轨迹,每一个s-a的pair都会乘同一个weight,显然不公平,因为一局游戏里面往往有好的pair,有对结果不好的pair。所以我们希望给每一个pair乘不同的weight。整场游戏结果是好的,不代表每一个pair都是好的。如果sample次数够多,则不存在这个问题。 174 | 175 | 176 | 177 | 解决思路:在执行当下action之前的事情跟其没有关系,无论得到多少功劳都跟它没有关系,只考虑在当下执行pair之后的reward,这才是它真正的贡献。把原来的总的return,换成未来的return。 178 | 179 | 180 | 181 | 如图:对于第一组数据,在($s_a$,$a_1$)时候总的return是+3,那么如果对每一个pair都乘3,则($s_b$,$a_2$)会认为是有效的,但如果使用改进的思路,将其乘之后的return,即-2,则能有效抑制该pair对结果的贡献。 182 | 183 | 184 | 185 | 再改进:加一个折扣系数,如果时间拖得越长,对于越之后的reward,当下action的功劳越小。 186 | 187 | 188 | 189 | ![](http://oss.hackslog.cn/imgs/2019-11-06-080806.jpg) 190 | 191 | 192 | 193 | 我们将R-b 记为 A,意义就是评价当前s执行动作a,相比于采取其他的a,它有多好。之后我们会用一个critic网络来估计这个评价值。 194 | 195 | ## 2.2 PPO 196 | 197 | PPO算法是PG算法的变形,目的是把在线的学习变成离线的学习。 198 | 199 | 核心的idea是对每一条经验(又称轨迹,即一个episode的游戏记录)不止使用一次。 200 | 201 | 简单理解:在线学习就是一边玩一边学,离线学习就是先看着别人玩进行学习,之后再自己玩 202 | 203 | ![](http://oss.hackslog.cn/imgs/2019-11-06-080842.jpg) 204 | 205 | 206 | 207 | ![](http://oss.hackslog.cn/imgs/2019-11-06-080853.jpg) 208 | 209 | 210 | 211 | Motivation:每次用$\pi_\theta$去采样数据之后,$\pi_\theta$都会更新,接下来又要采样新的数据。以至于PG算法大部分时间都在采样数据。那么能不能将这些数据保存下来,由另一个$\pi_{\theta'}$去更新参数?那么策略$\pi_\theta$采样的数据就能被$\pi_{\theta'}$多次利用。引入统计学中的经典方法: 212 | 213 | 214 | 215 | 重要性采样:如果想求一个函数的期望,但无法积分,则可以通过采样求平均的方法来近似,但是如果p分布不知道(无法采样),我们知道q分布,则如上图通过一个重要性权重,用q分布来替代p分布进行采样。这个重要性权重的作用就是修正两个分布的差异。 216 | 217 | 218 | 219 | ![](http://oss.hackslog.cn/imgs/2019-11-06-080922.jpg) 220 | 221 | 222 | 223 | 
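重要性采样可以先脱离RL单独理解。下面是一个极简的numpy数值示意(目标分布p、采样分布q和函数f都是随手选的假设),只为说明$p(x)/q(x)$这个权重如何修正两个分布的差异:

```python
import numpy as np

np.random.seed(0)
f = lambda x: x ** 2                                # 想求 E_{x~p}[f(x)],理论值为1

p = lambda x: np.exp(-0.5 * x ** 2) / np.sqrt(2 * np.pi)          # 目标分布 p: N(0, 1)
q = lambda x: np.exp(-0.5 * (x - 0.5) ** 2) / np.sqrt(2 * np.pi)  # 实际采样分布 q: N(0.5, 1)

x = np.random.normal(0.5, 1.0, size=100000)         # 只能从 q 采样
w = p(x) / q(x)                                     # 重要性权重,修正 p 与 q 的差异
print(np.mean(f(x)))                                # 不加权:约1.25,偏离真实值
print(np.mean(f(x) * w))                            # 加权后:约1.0,近似 E_{x~p}[f(x)]
```

下一段讨论的正是:当p和q差异较大、采样次数又有限时,这个修正会出什么问题。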
存在的问题:如果p跟q的差异比较大,则方差会很大 224 | 225 | 226 | 227 | ![](http://oss.hackslog.cn/imgs/2019-11-06-080951.jpg) 228 | 229 | 230 | 231 | 如果sample的次数不够多,比如按原分布p进行采样,最终f的期望值是负数(大部分概率都在左侧,左侧f是负值),如果按q分布进行sample,只sample到右边,则f就一直是正的,严重偏离原来的分布。当然采样次数够多的时候,q也sample到了左边,则p/q这个负weight非常大,会平衡掉右边的正值,会导致最终计算出的期望值仍然是负值。但实际上采样的次数总是有限的,出现这种问题的概率也很大。 232 | 233 | 234 | 235 | 先忽略这个问题,加入重要性采样之后,训练变成了离线的 236 | 237 | 238 | 239 | ![](http://oss.hackslog.cn/imgs/2019-11-06-081110.jpg) 240 | 241 | 242 | 243 | 离线训练的实现:用另一个policy2与环境做互动,采集数据,然后在这个数据上训练policy1。尽管2个采集的数据分布不一样,但加入一个重要性的weights,可以修正其差异。等policy1训练的差不多以后,policy2再去采集数据,不断循环。 244 | 245 | 246 | 247 | ![](http://oss.hackslog.cn/imgs/2019-11-06-081122.jpg) 248 | 249 | 250 | 251 | 由于我们得到的$A^{\theta}\left(s_{t}, a_{t}\right)$(执行当下action后得到reward-baseline)是由policy2采集的数据观察得到的,所以 $A^{\theta}\left(s_{t}, a_{t}\right)$的参数得修正为$\theta'$ 252 | 253 | 254 | 255 | 根据$\nabla f(x)=f(x) \nabla \log f(x)$反推目标函数$J$,注意要优化的参数是$\theta$,$\theta'$只负责采集数据。 256 | 257 | 258 | 259 | 利用 $\theta'$采集的数据来训练$\theta$,会不会有问题?(虽然有修正,但毕竟还是不同) 答案是我们需要保证它们的差异尽可能的小,那么在刚刚的公式里再加入一些限制保证其差异足够小,则诞生了 PPO算法。 260 | 261 | 262 | 263 | ![](http://oss.hackslog.cn/imgs/2019-11-06-081210.jpg) 264 | 265 | 266 | 267 | 引入函数KL,KL衡量两个分布的距离。注意:不是参数上的距离,是2个$\pi$给同样的state之后基于各自参数输出的action space的距离 268 | 269 | 270 | 271 | 加入KL的公式直觉的理解:如果我们学习出来的$\theta$跟$\theta'$越像,则KL越小,J越大。我们的学习目标还是跟原先的PG算法一样,用梯度上升训练,最大化J。这个操作有点像正则化,用来解决重要性采样存在的问题。 272 | 273 | 274 | 275 | TRPO是PPO的前身,把KL这个限制条件放在优化的目标函数外面。对于梯度上升的优化过程,这种限制比较难处理,使优化变得复杂,一般不用。 276 | 277 | 278 | 279 | ![](http://oss.hackslog.cn/imgs/2019-11-06-081228.jpg) 280 | 281 | 282 | 283 | 实现过程:初始化policy参数,在每一个迭代里面,用$\theta^k$采集很多数据,同时计算出奖励A值,接着用这些数据训练,更新$\theta$以优化J。由于是离线训练,可以多次更新后,再去采集新的数据。 284 | 285 | 286 | 287 | 有一个trick是KL的权重beta也可以调整,使其更加的adaptive。 288 | 289 | 290 | 291 | ![](http://oss.hackslog.cn/imgs/2019-11-06-081245.jpg) 292 | 293 | 294 | 295 | ![](http://oss.hackslog.cn/imgs/2019-11-06-081315.jpg) 296 | 297 | 298 | 299 | 实际上KL也是比较难计算的,所以有了PPO2算法,不计算KL,通过clip达到同样效果。 300 | 301 | 302 | 303 | clip(a, b, c):若 a < b,输出 b;若 a > c,输出 c;若 b ≤ a ≤ c,输出 a 本身。也就是把第一个参数裁剪到区间 [b, c] 内。 304 | 305 | 306 | 307 | 看图:绿色是min里面的第一项,蓝色是min里面的第二项,红色是min最终的输出。 308 | 309 | 310 | 311 | 这个公式的直觉理解:希望$\theta$与$\theta^k$在优化之后不要差距太大。如果A>0,即这个state-action是好的,所以需要增加这个pair出现的几率,所以在max J的过程中会增大$\frac{p_{\theta}\left(a_{t} | s_{t}\right)}{p_{\theta^{k}}\left(a_{t} | s_{t}\right)}$,但最大不要超过1+epsilon;如果A<0,则会减小这个比值,但最小不低于1-epsilon,始终不会相差太大。 312 | 313 | 314 | 315 | ![](http://oss.hackslog.cn/imgs/2019-11-06-081323.jpg) 316 | 317 | PG算法效果非常不稳定,自从有了PPO,PG的算法可以在很多任务上work。 318 | 319 | -------------------------------------------------------------------------------- /notes/3 Q - Learning.md: -------------------------------------------------------------------------------- 1 | # 3. Q - Learning 2 | 3 | ## 3.1 Q-learning 4 | 5 | 在之前的policy-based算法里,我们的目标是learn 一个actor,value-based的强化学习算法目标是learn一个critic。 6 | 7 | 定义一个Critic,也就是状态值函数$V^{\pi}(s)$,它的值是:当使用策略$\pi$进行游戏时,在观察到一个state s之后,环境输出的累积的reward值的期望。注意取决于两个值,一个是state s,一个是actor$\pi$。 8 | 9 | ![](http://oss.hackslog.cn/imgs/2019-11-06-081510.jpg) 10 | 11 | ![0004](http://oss.hackslog.cn/imgs/2019-11-06-081524.jpg) 12 | 13 | 如果是不同的actor,在同样的state下,critic给出的值也是不同的。那么怎么估计出这个函数V呢?
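下文会详细比较MC与TD两种估计方法,这里先给一个TD(0)单步更新$V^{\pi}$的表格版极简示意(状态数、学习率等均为假设,真正实现里表格会换成神经网络),方便对照后面的公式:

```python
import numpy as np

n_states = 16                       # 假设一个16个状态的小环境(例如FrozenLake)
V = np.zeros(n_states)              # 表格型的 V,深度RL里换成网络的输出

def td0_update(V, s, r, s_next, done, alpha=0.1, gamma=0.99):
    """对一笔经验 (s, r, s_next) 做一次TD(0)更新:让 V(s) 逼近 r + gamma * V(s_next)"""
    target = r if done else r + gamma * V[s_next]
    V[s] += alpha * (target - V[s])
    return V

V = td0_update(V, s=3, r=0.0, s_next=7, done=False)
```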
14 | 15 | 主要有MC和TD的方法,实际上还有DP的方法,但是用DP求解需要整个环境都是已知的。而在强化学习的大部分任务里,都是model-free的,需要agent自己去探索环境。 16 | 17 | ![0005](http://oss.hackslog.cn/imgs/2019-11-06-081539.jpg) 18 | 19 | MC:直接让agent与环境互动,统计计算出在$S_a$之后直到一个episode结束的累积reward作为$G_a$。 20 | 21 | 训练的目标就是让$V^{\pi}(s)$的输出尽可能的接近$G_a$。 22 | 23 | ![0006](http://oss.hackslog.cn/imgs/2019-11-06-081554.jpg) 24 | 25 | MC每次必须把游戏玩到结束,TD不需要把游戏玩到底,只需要玩了一次游戏,有一个状态的变化。 26 | 27 | 那么训练的目标就是让$V^{\pi}(s_t)$ 和$V^{\pi}(s_t+1)$的差接近$r_t$ 28 | 29 | ![0007](http://oss.hackslog.cn/imgs/2019-11-06-081607.jpg) 30 | 31 | MC方差大,因为$r$是一个随机变量,MC方法中的$G$是$r$之和,而TD方法只有$r$是随机变量,r的方差比G小。但TD方法的$V^{\pi}$有可能估计的不准。 32 | 33 | ![0008](http://oss.hackslog.cn/imgs/2019-11-06-081621.jpg) 34 | 35 | 用MC和TD估计的结果不一样 36 | 37 | ![0009](http://oss.hackslog.cn/imgs/2019-11-06-081633.jpg) 38 | 39 | 定义另一种Critic,状态-动作值函数$Q^{\pi}(s,a)$,有的地方叫做Q-function,输入是一个pair $(s,a)$,意思是用$\pi$玩游戏时,在s状态下强制执行动作a(策略$\pi$在s下不一定会执行a),所得到的累积reward。 40 | 41 | 有两种写法,输入pair,输出Q,此时的Q是一个标量。 42 | 43 | 另一种是输入s,输出所有可能的action的Q值,此时Q是一个向量。 44 | 45 | 那么Critic到底怎么用呢? 46 | 47 | ![](http://oss.hackslog.cn/imgs/2019-11-06-081645.jpg) 48 | 49 | Q-learning的过程: 50 | 51 | 初始化一个actor $\pi$去收集数据,然后learn一个基于$ \pi$的Q-function,接着寻找一个新的比原来的$\pi$要好actor , 找到后更新$\pi$,再去寻找新的Q-function,不断循环,得到更好的policy。 52 | 53 | 可见Q-learning的核心思想是先找到最优的Q-function,再通过这个Q-function得出最优策略。而Policy-based的算法是直接去学习策略。这是本质区别。 54 | 55 | 那么,怎么样才算比原来的好? 56 | 57 | ![0012](http://oss.hackslog.cn/imgs/2019-11-06-081701.jpg) 58 | 59 | 定义好的策略:对所有可能的s而言,$V_\pi(s)$一定小于$V_\pi'(s)$,则$V_\pi'(s)$就是更好的策略。 60 | 61 | $\pi'(s)$的本质:假设已经学习到了一个actor $\pi$的Q-function,给一个state,把所有可能的action都代入Q,执行那个可以让Q最大的action。 62 | 63 | 注意:实际上,给定一个s,$ \pi$不一定会执行a,现在的做法是强制执行a,计算执行之后玩下去得到的reward进行比较。 64 | 65 | 在实现的时候$\pi'$没有额外的参数,依赖于Q。并且当动作是连续值的时候,无法进行argmax。 66 | 67 | 那么, 为什么actor $\pi’$能被找到? 68 | 69 | ![0013](http://oss.hackslog.cn/imgs/2019-11-06-081720.jpg) 70 | 71 | 上面是为了证明:只要你估计出了一个actor的Q-function,则一定可以找到一个更好的actor。 72 | 73 | 核心思想:在一个episode中某一步把$\pi$换成了$ \pi'$比完全follow $ \pi$,得到的奖励期望值会更大。 74 | 75 | 注意$r_{t+1}$指的是在执行当下$a_t$得到的奖励,有的文献也会写成$r_t$ 76 | 77 | 训练的时候有一些Tips可以提高效率: 78 | 79 | ![0014](http://oss.hackslog.cn/imgs/2019-11-06-081740.jpg) 80 | 81 | Tips 1 引入target网络 82 | 83 | 训练的时候,每次需要两个Q function(两个的输入不同)一起更新,不稳定。 一般会固定一个Q作为Target,产生回归任务的label,在训练N次之后,再更新target的参数。回归任务的目标,让$Q^\pi(s_t,a_t)$与$\mathrm{Q}^{\pi}\left(s_{t+1}, \pi\left(s_{t+1}\right)\right))+r$越来越接近,即降低mse。最终希望训练得到的$Q^\pi$能直接估计出这个$(s_t,a_t)$未来的一个累积奖励。 84 | 85 | 注意:target网络的参数不需要训练,直接每隔N次复制Q的参数。训练的目标只有一个 Q。 86 | 87 | ![0015](http://oss.hackslog.cn/imgs/2019-11-06-081754.jpg) 88 | 89 | Tips2 改进探索机制 90 | 91 | PG算法,每次都会sample新的action,随机性比较大,大概率会尽可能的覆盖所有的动作。而之前的Q-learning,策略的本质是绝对贪婪策略,那么如果有的action没有被sample到,则可能之后再也不会选择这样的action。这种探索的机制(收集数据的方法)不好,所以改进贪心算法,让actor每次会$\varepsilon$的概率执行随机动作。 92 | 93 | ![0016](http://oss.hackslog.cn/imgs/2019-11-06-081820.jpg) 94 | 95 | ![0017](http://oss.hackslog.cn/imgs/2019-11-06-081833.jpg) 96 | 97 | Tips 3 引入记忆池机制 98 | 99 | 将采集到的一些数据收集起来,放入replay buffer。好处: 100 | 101 | 1.可重复使用过去的policy采集的数据,降低agent与环境互动的次数,加快训练效率 。 102 | 103 | 2.replay buffer里面包含了不同actor采集的数据,这样每次随机抽取一个batch进行训练的时候,每个batch内的数据会有较大的差异(数据更加diverse),有助于训练。 104 | 105 | 那么,当我们训练的目标是$ \pi$的Q-function,训练数据混杂了$\pi’$,$\pi’'$,$\pi’''$采集的数据 有没有问题呢?没有,不是因为这些$ \pi$很像,主要原因是我们采样的不是一个轨迹,只是采样了一笔experience($s_t,a_t,r_t,s_{t+1}$)。这个理论上证明是没有问题的,很难解释... 
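把上面三个技巧放在一起,大致是下面这个极简版本(为了突出结构,用表格型Q代替神经网络,状态/动作数、batch大小等均为假设,并非课程或repo里的实现):

```python
import random
from collections import deque

import numpy as np

n_states, n_actions = 16, 4
Q        = np.zeros((n_states, n_actions))   # 要训练的 Q(这里用表格代替神经网络)
Q_target = Q.copy()                          # Tip 1: 固定不动的 target 网络
buffer   = deque(maxlen=10000)               # Tip 3: replay buffer,存 (s, a, r, s_next, done)

def choose_action(s, eps=0.1):
    """Tip 2: epsilon-greedy 探索"""
    if random.random() < eps:
        return random.randrange(n_actions)
    return int(np.argmax(Q[s]))

def train_step(batch_size=32, lr=0.1, gamma=0.99):
    """从buffer里随机抽一个batch,让 Q(s,a) 回归到 target 网络给出的 label"""
    batch = random.sample(buffer, min(batch_size, len(buffer)))
    for s, a, r, s_next, done in batch:
        y = r if done else r + gamma * np.max(Q_target[s_next])   # label 由固定的 target 给出
        Q[s, a] += lr * (y - Q[s, a])

def sync_target():
    """每隔N步把 Q 的参数复制给 target"""
    global Q_target
    Q_target = Q.copy()
```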
106 | 107 | ![0018](http://oss.hackslog.cn/imgs/2019-11-06-081844.jpg) 108 | 109 | 采用了3个Tips的Q-learning训练过程如图: 110 | 111 | 注意图中省略了一个循环,即存储了很多笔experience之后才会进行sample。相比于原始的Q-learning,每次sample是从replay buff里面随机抽一个batch,然后计算用绝对贪心策略得到Q-target的值作为label,接着在回归任务中更新Q的参数。每训练多步后,更新Q-target的参数。 112 | 113 | ## 3.2 Tips of Q-learning 114 | 115 | ![](http://oss.hackslog.cn/imgs/2019-11-06-081854.jpg) 116 | 117 | DQN估计出的值一般都高于实际的值,double DQN估计出的值与实际值比较接近。 118 | 119 | ![0021](http://oss.hackslog.cn/imgs/2019-11-06-081906.jpg) 120 | 121 | Q是一个估计值,被高估的越多,越容易被选择。 122 | 123 | ![0022](http://oss.hackslog.cn/imgs/2019-11-06-081948.jpg) 124 | 125 | Double的思想有点像行政跟立法分权。 126 | 127 | 用要训练的Q-network去选择动作,用固定不动的target-network去做估计,相比于DQN,只需要改一行代码! 128 | 129 | ![0023](http://oss.hackslog.cn/imgs/2019-11-06-081958.jpg) 130 | 131 | 改了network架构,其他没动。每个网络结构的输出是一个标量+一个向量 132 | 133 | ![0024](http://oss.hackslog.cn/imgs/2019-11-06-082009.jpg) 134 | 135 | 比如下一时刻,我们需要把3->4, 0->-1,那么Dueling结构里会倾向于不修改A,只调整V来达到目的,这样只需要把V中 0->1, 如果Q中的第三行-2没有被sample到,也进行了更新,提高效率,减少训练次数。 136 | 137 | ![0025](http://oss.hackslog.cn/imgs/2019-11-06-082019.jpg) 138 | 139 | 实际实现的时候,通过添加了限制条件,也就是把A normalize,使得其和为0,这样只会更新V。 140 | 141 | 这种结构让DQN也能处理连续的动作空间。 142 | 143 | ![0028](http://oss.hackslog.cn/imgs/2019-11-06-082253.jpg) 144 | 145 | 加入权重的replay buffer 146 | 147 | motivation:TD error大的数据应该更可能被采样到 148 | 149 | 注意论文原文实现的细节里,也修改了参数更新的方法 150 | 151 | ![0029](http://oss.hackslog.cn/imgs/2019-11-06-082300.jpg) 152 | 153 | 原来收集一条experience是执行一个step,现在变成执行N个step。相比TD的好处:之前只sample一个$(s_t,a_t)$pair,现在sample多个才估测Q值,估计的误差会更小。坏处,与MC一样,reward的项数比较多,相加的方差更大。 调N就是一个trade-off的过程。 154 | 155 | ![0030](http://oss.hackslog.cn/imgs/2019-11-06-082309.jpg) 156 | 157 | 在Q-function的参数空间上+noise 158 | 159 | 比较有意思的是,OpenAI DeepMind几乎在同一个时间发布了Noisy Net思想的论文。 160 | 161 | ![0031](http://oss.hackslog.cn/imgs/2019-11-06-082316.jpg) 162 | 163 | 在同一个episode里面,在动作空间上加噪声,会导致相同state下执行的action不一样。而在参数空间加噪声,则在相同或者相似的state下,会采取同一个action。 注意加噪声只是为了在不同的episode的里面,train Q的时候不会针对特定的一个state永远只执行一个特定的action。 164 | 165 | ![0033](http://oss.hackslog.cn/imgs/2019-11-06-082325.jpg) 166 | 167 | 带分布的Q-function 168 | 169 | Motivation:原来计算Q-function的值是通过累积reward的期望,也就是均值,但实际上累积的reward可能在不同的分布下会得到相同的Q值。 170 | 171 | 注意:每个Q-function的本质都是一个概率分布。 172 | 173 | ![0034](http://oss.hackslog.cn/imgs/2019-11-06-082332.jpg) 174 | 175 | 让$Q^ \pi$直接输出每一个Q-function的分布,但实际上选择action的时候还是会根据mean值大的选。不过拥有了这个分布,可以计算方差,这样如果有的任务需要在追求回报最大的同时降低风险,则可以利用这个分布。 176 | 177 | ![0036](http://oss.hackslog.cn/imgs/2019-11-06-082339.jpg) 178 | 179 | ![0037](http://oss.hackslog.cn/imgs/2019-11-06-082345.jpg) 180 | 181 | Rainbow:集成了7种升级技术的DQN 182 | 183 | 上图是一个一个改进拿掉之后的效果,看紫色似乎double 没啥用,实际上是因为有Q-function的分布存在,一般不会过高估计Q值,所以double 意义不大。 184 | 185 | 直觉的理解:使用分布DQN,即时Q值被高估很多,由于最终只会映射到对应的分布区间,所以最终的输出值也不会过大。 186 | 187 | ## 3.3 Q-learning in continuous actions 188 | 189 | 在出现PPO之前, PG的算法非常不稳定。DQN 比较稳定,也容易train,因为DQN是只要估计出Q-function,就能得到好的policy,而估计Q-function就是一个回归问题,回归问题比较容易判断learn的效果,看mse。问题是Q-learning不太容易处理连续动作空间。比如汽车的速度,是一个连续变量。 190 | 191 | ![0039](http://oss.hackslog.cn/imgs/2019-11-06-082354.jpg) 192 | 193 | 当动作值是连续时,怎么解argmax: 194 | 195 | 1. 通过映射,强行离散化 196 | 197 | 2. 使用梯度上升解这个公式,这相当于每次train完Q后,在选择action的时候又要train一次网络,比较耗时间。 198 | 199 | ![0040](http://oss.hackslog.cn/imgs/2019-11-06-082404.jpg) 200 | 201 | 3. 设计特定的网络,使得输出还是一个标量。 202 | 203 | ![0042](http://oss.hackslog.cn/imgs/2019-11-06-082433.jpg) 204 | 205 | 最有效的解决方法是,针对连续动作空间,不要使用Q-learning。使用AC算法! 
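补充一个3.2节Double DQN的示意:它与原始DQN的差别只体现在target值的计算上,即"选动作"与"估值"分别交给两个网络。下面用表格型Q写出这一行差别(非课程代码,仅作对照):

```python
import numpy as np

# Q、Q_target 均为 [n_states, n_actions] 的表格(或换成网络对所有动作的输出)

def dqn_target(Q_target, r, s_next, gamma=0.99):
    # 原始DQN:选动作和估值都用 target 网络,容易高估 Q 值
    return r + gamma * np.max(Q_target[s_next])

def double_dqn_target(Q, Q_target, r, s_next, gamma=0.99):
    # Double DQN:用正在训练的 Q 选动作,用固定的 target 网络来估值
    a_star = int(np.argmax(Q[s_next]))
    return r + gamma * Q_target[s_next, a_star]
```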
-------------------------------------------------------------------------------- /notes/4 Actor Critic.md: -------------------------------------------------------------------------------- 1 | # 4. Actor Critic 2 | 3 | ## 4.1 Advantage Actor-Critic (A2C) 4 | 5 | ![](http://oss.hackslog.cn/imgs/2019-11-06-082605.jpg) 6 | 7 | 由于每次在执行PG算法之前,一般只能采样少量的数据,导致对于同一个$(s_t,a_t)$,得到的$G$的值方差很大,不稳定。那么能不能直接估计出期望值,来替代采样的结果? 8 | 9 | ![AC-4](http://oss.hackslog.cn/imgs/2019-11-06-082530.jpg) 10 | 11 | 回顾下Q-learning中的定义,我们发现: 12 | 13 | ![AC-5](http://oss.hackslog.cn/imgs/2019-11-06-082602.jpg) 14 | 15 | PG算法中G的期望的定义恰好也是Q-learning算法中$Q^\pi(s,a)$的定义: 假设现在的policy是$ \pi$的情况下,在某一个s,采取某一个a以后得到的累积reward的期望值。 16 | 17 | 因此在这里将Q-learning引入到预估reward中,也即policy gradient和q-learning的结合,叫做Actor-Critic。 18 | 19 | 把原来的reward和baseline分别替换,PG算法中的减法就变成了$Q^{\pi_{\theta}}\left(s_{t}^{n}, a_{t}^{n}\right)-V^{\pi_{\theta}}\left(s_{t}^{n}\right)$。似乎我们需要训练2个网络? 20 | 21 | ![AC-6](http://oss.hackslog.cn/imgs/2019-11-06-082629.jpg) 22 | 23 | 实际上Q与V可以互相转化,我们只需要训练V。转化公式中为什么要加期望?在s下执行a得到的$ r_t$和$s_{t+1}$是随机的。 24 | 25 | 实际将Q变成V的操作中,我们会去掉期望,使得只需要训练(估计)状态值函数$V^\pi$,这样会导致一点偏差,但比同时估计两个function导致的偏差要好。(A3C原始paper通过实验验证了这一点)。 26 | 27 | ![AC-7](http://oss.hackslog.cn/imgs/2019-11-06-082641.jpg) 28 | 29 | A2C的训练流程:收集数据,估计出状态值函数$V^\pi(s)$,套用公式更新策略$\pi$,再利用新的$\pi$与环境互动收集新的数据,不断循环。 30 | 31 | ![AC-8](http://oss.hackslog.cn/imgs/2019-11-06-082652.jpg) 32 | 33 | 训练过程中的2个Tips: 34 | 35 | 1. Actor与Critic的前几层一般会共用参数,因为输入都是state 36 | 2. 正则化:让采用不同action的概率尽量平均,希望有更大的entropy,这样能够探索更多情况。 37 | 38 | ## 4.2 Asynchronous Advantage Actor-Critic (A3C) 39 | 40 | ![AC-9](http://oss.hackslog.cn/imgs/2019-11-06-082709.jpg) 41 | 42 | A3C算法的motivation:开分身学习~ 43 | 44 | ![AC-10](http://oss.hackslog.cn/imgs/2019-11-06-082718.jpg) 45 | 46 | 训练过程:每个agent复制一份全局参数,然后各自采样数据,计算梯度,更新这份全局参数,然后将结果传回,复制一份新的参数。 47 | 48 | 注意: 49 | 50 | 1. 初始条件会尽量的保证多样性(Diverse),让每个agent探索的情况更加不一样。 51 | 52 | 2. 
所有的actor都是平行跑的,每个worker把各自的参数传回去然后复制一份新的全局参数。此时可能这份全局参数已经发生了改变,没有关系。 53 | 54 | ## 4.3 Pathwise Derivative Policy Gradient (PDPG) 55 | 56 | 在之前Actor-Critic框架里,Critic的作用是评估agent所执行的action好不好?那么Critic能不能不止给出评价,还给出指导意见呢?即告诉actor要怎样做才能更好?于是有了DPG算法: 57 | 58 | ![AC-12](http://oss.hackslog.cn/imgs/2019-11-06-082731.jpg) 59 | 60 | ![AC-13](http://oss.hackslog.cn/imgs/2019-11-06-082746.jpg) 61 | 62 | 在上面介绍A2C算法的motivation,主要是从改进PG算法引入。那么从Q-learning的角度来看,PDPG相当于learn一个actor,来解决argmax这个优化问题,以处理连续动作空间,直接根据输入的状态输出动作。 63 | 64 | ![AC-14](http://oss.hackslog.cn/imgs/2019-11-06-082759.jpg) 65 | 66 | Actor+Critic连成一个大的网络,训练过程中也会采取TD-target的技巧,固定住Critic $\pi'$,使用梯度上升优化Actor 67 | 68 | ![AC-15](http://oss.hackslog.cn/imgs/2019-11-06-082809.jpg) 69 | 70 | 训练过程:Actor会学到策略$\pi$,使基于策略$\pi$,输入s可以获得能够最大化Q的action,天然地能够处理continuous的情况。当actor生成的$Q^\pi$效果比较好时,重新采样生成新的Q。有点像GAN中的判别器与生成器。 71 | 72 | 注意:从算法的流程可知,Actor 网络和 Critic 网络是分开训练的,但是两者的输入输出存在联系,Actor 网络输出的 action 是 Critic 网络的输入,同时 Critic 网络的输出会被用到 Actor 网路进行反向传播。 73 | 74 | 由于Critic模块是基于Q-learning算法,所以Q learning的技巧,探索机制,回忆缓冲都可以用上。 75 | 76 | ![AC-16](http://oss.hackslog.cn/imgs/2019-11-06-082820.jpg) 77 | 78 | ![AC-17](http://oss.hackslog.cn/imgs/2019-11-06-082830.jpg) 79 | 80 | 与Q-learning相比的改进: 81 | 82 | - 不通过Q-function输出动作,直接用learn一个actor网络输出动作(Policy-based的算法的通用特性)。 83 | - 对于连续变量,不好解argmax的优化问题,转化成了直接选择$\pi-target$ 输出的动作,再基于Q-target得出y。 84 | - 引入$\pi-target$,也使得actor网络不会频繁更新,会通过采样一批数据训练好后再更新,提高训练效率。 85 | 86 | ![AC-18](http://oss.hackslog.cn/imgs/2019-11-06-082845.jpg) 87 | 88 | 总结下:最基础的 Policy Gradient 是回合更新的,通过引入 Critic 后变成了单步更新,而这种结合了 policy 和 value 的方法也叫 Actor-Critic,Critic 有多种可选的方法。A3C在A2C的基础上引入了多个 agent 对网络进行异步更新。对于输出动作为连续值的情形,原始的输出动作概率分布的PG算法不能解决,同时Q-learning算法也不能处理这类问题,因此提出了 DPG 。 -------------------------------------------------------------------------------- /notes/5 Sparse Reward.md: -------------------------------------------------------------------------------- 1 | # 5. Sparse Reward 2 | 3 | 大多数RL的任务中,是没法得到reward,reward=0,导致reward空间非常的sparse。 4 | 5 | 比如我们需要赢得一局游戏才能知道胜负得到reward,那么玩这句游戏的很长一段时间内,我们得不到reward。比如如果机器人要将东西放入杯子才能得到一个reward,尝试了很多动作很有可能都是0。 6 | 7 | 但是人可以在非常sprse的环境下进行学习,所以这一章节提出的很多算法与人的一些学习机制比较类似。 8 | 9 | ## 5.1 Reward Shaping 10 | 11 | 手动设计新的reward,让agent做的更好。但有些比较复杂的任务,需要domain knowledge去设计新的reward。 12 | 13 | ![](http://oss.hackslog.cn/imgs/2019-11-06-082912.jpg) 14 | 15 | ![](http://oss.hackslog.cn/imgs/2019-11-06-082928.jpg) 16 | 17 | ## 5.2 Curiosity 18 | 19 | 好奇心机制非常的直觉,也非常的强大。有个案例:[Happy Bird](https://github.com/pathak22/noreward-rl) 20 | 21 | ![](http://oss.hackslog.cn/imgs/2019-11-06-082959.jpg) 22 | 23 | 好奇心也是reward shaping的一种,引入一个新的reward :ICM,同时优化2个reward。如何设计一个ICM模块,使agent拥有好奇心? 
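一个最朴素的思路是:训练一个前向模型去预测下一个state,预测误差越大,给的内在奖励越高。下面是一个线性前向模型的极简示意(模型结构、奖励权重均为随手假设,并不是ICM论文里的网络);这种朴素做法的问题以及正式的ICM设计见下文:

```python
import numpy as np

class ForwardModel:
    """极简线性前向模型:用 (s_t, a_t) 预测 s_{t+1},预测误差作为好奇心奖励"""

    def __init__(self, s_dim, a_dim, lr=0.01):
        self.W = np.zeros((s_dim + a_dim, s_dim))
        self.lr = lr

    def intrinsic_reward(self, s, a_onehot, s_next):
        x = np.concatenate([s, a_onehot])
        err = s_next - x @ self.W                 # 预测误差
        self.W += self.lr * np.outer(x, err)      # 顺便更新前向模型
        return float(np.sum(err ** 2))            # 误差越大,内在奖励 r^i 越大

icm = ForwardModel(s_dim=4, a_dim=2)
r_i = icm.intrinsic_reward(np.zeros(4), np.array([1.0, 0.0]), np.ones(4))
# 总奖励 = 环境奖励 + beta * r_i,其中 beta 是假设的权重超参数
```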
24 | 25 | ![](http://oss.hackslog.cn/imgs/2019-11-06-083015.jpg) 26 | 27 | 单独训练一个状态估计的模型,如果在某个state下采取某个action得到的下一个state难以预测,则鼓励agent进行尝试这个action。 不过有的state很难预测,但不重要。比如说某个游戏里面背景是树叶飘动,很难预测,接下来agent一直不动看着树叶飘动,没有意义。 28 | 29 | ![](http://oss.hackslog.cn/imgs/2019-11-06-083031.jpg) 30 | 31 | 再设计一个moudle,判断环境中state的重要性:learn一个feature ext的网络,去掉环境中与action关系不大的state。 32 | 33 | 原理:输入两个处理过的state,预测action,使逼近真实的action。这样使得处理之后的state都是跟agent要采取的action相关的。 34 | 35 | ## 5.3 Curriculum Learning 36 | 37 | 课程学习:为learning做规划,通常由易到难。 38 | 39 | ![](http://oss.hackslog.cn/imgs/2019-11-06-092656.jpg) 40 | 41 | 设计不同难度的课程,一开始直接把板子放入柱子,则agent只要把板子压下去就能获得reward,接着把板子的初始位置提高一些,agent有可能把板子抽出则无法获得reward,接着更general的情况,把板子放倒柱子外面,再让agent去学习。 42 | 43 | 生成课程的方法通常如下:从目标反推,越靠近目标的state越简单,不断生成难度更高的state。 44 | 45 | ![](http://oss.hackslog.cn/imgs/2019-11-06-092702.jpg) 46 | 47 | ![](http://oss.hackslog.cn/imgs/2019-11-06-092658.jpg) 48 | 49 | ## 5.4 Hierarchical RL 50 | 51 | 分层学习:把大的任务拆解成小任务 52 | 53 | ![](http://oss.hackslog.cn/imgs/2019-11-06-092745.jpg) 54 | 55 | 上层的agent给下层的agent提供一个愿景,如果下层的达不到目标,会获得惩罚。如果下层的agent得到的错误的目标,那么它会假设最初的目标也是错的。 -------------------------------------------------------------------------------- /notes/6 Imitation Learning.md: -------------------------------------------------------------------------------- 1 | # 6. Imitation Learning 2 | 3 | 模仿学习,又叫学徒学习,反向强化学习 4 | 5 | 之前介绍的强化学习都有一个reward function,但生活中大多数任务无法定义reward,或者难以定义。但是这些任务中如果收集很厉害的范例(专家经验)比较简单,则可以用模仿学习解决。 6 | 7 | ![](http://oss.hackslog.cn/imgs/2019-11-06-092747.jpg) 8 | 9 | ## 6.1 Behavior Cloning 10 | 11 | 本质是有监督学习 12 | 13 | ![0004](http://oss.hackslog.cn/imgs/2019-11-06-092759.jpg) 14 | 15 | ![0005](http://oss.hackslog.cn/imgs/2019-11-06-092817.jpg) 16 | 17 | 存在问题:training data里面没有撞墙的case,则agent遇到这种情况不知如何决策 18 | 19 | ![0006](http://oss.hackslog.cn/imgs/2019-11-06-092821.jpg) 20 | 21 | 一个直觉的解决方法是数据增强:每次通过牺牲一个专家,学会了一种新的case,策略$\pi$得到了增强。 22 | 23 | ![](http://oss.hackslog.cn/imgs/2019-11-06-092831.jpg) 24 | 25 | 行为克隆还存在一个关键问题:agent不知道哪些行为对结局重要,哪些不重要。由于是采样学习,有可能只记住了多余的无用的行为。 26 | 27 | ![0009](http://oss.hackslog.cn/imgs/2019-11-06-092847.jpg) 28 | 29 | 同时也由于RL的训练数据不是独立同分布,当下的action会影响之后的state,所以不能直接套用监督学习的框架。 30 | 31 | 为了解决这些问题,就有了反向强化学习,现在一般说模仿学习指的就是反向强化学习。 32 | 33 | ## 6.2 Inverse RL 34 | 35 | ![0011](http://oss.hackslog.cn/imgs/2019-11-06-092859.jpg) 36 | 37 | 之前的强化学习是reard和env通过RL 学到一个最优的actor。 38 | 39 | 反向强化学习是,假设有一批expert的数据,通过env和IRL推导expert因为什么样子的reward function才会采取这样的行为。 40 | 41 | 好处:也许expert的行为复杂但reward function很简单。拿到这个reward function后我们就可以训练出好的agent。 42 | 43 | ![0012](http://oss.hackslog.cn/imgs/2019-11-06-092907.jpg) 44 | 45 | IRL的框架:先射箭 再画靶。 46 | 47 | 具体过程: 48 | 49 | Expert先跟环境互动,玩N场游戏,存储记录,我们的actor $ \pi$也去互动,生成N场游戏记录。接下来定义一个reward function $R$,保证expert的$R$比我们的actor的$R$大就行。再根据定义的的$R$用RL的方法去学习一个新的actor ,这个过程也会采集新的游戏记录,等训练好这个actor,也就是当这个actor可以基于$R$获得高分的时候,重新定义一个新的reward function$R'$,让expert的$R'$大于agent,不断循环。 50 | 51 | ![0013](http://oss.hackslog.cn/imgs/2019-11-06-092917.jpg) 52 | 53 | IRL与GAN的框架是一样的,学习 一个 reward function相当于学习一个判别器,这个判别器给expert高分,给我们的actor低分。 54 | 55 | 一个有趣的事实是给不同的expert,我们的agent最终也会学会不同的策略风格。如下蓝色是expert的行为,红色是学习到的actor的行为。 56 | 57 | ![](http://oss.hackslog.cn/imgs/2019-11-06-092932.jpg) 58 | 59 | 针对训练robot的任务: 60 | 61 | IRL有个好处是不需要定义规则让robot执行动作,人给robot示范一下动作即可。但robot学习时候的视野跟它执行该动作时候的视野不一致,怎么把它在第三人称视野学到的策略泛化到第一人称视野呢? 
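顺着上面"IRL与GAN框架一样"的类比:reward function扮演的就是判别器,给expert的数据打高分、给我们actor的数据打低分。下面用线性reward加logistic损失写一个极简示意(特征、数据全是随手构造的假设,并不是GAIL等具体算法的实现):

```python
import numpy as np

np.random.seed(0)
expert_feats = np.random.randn(200, 4) + 1.0        # 专家轨迹里各个(s,a)的特征(假设)
agent_feats  = np.random.randn(200, 4)              # 我们的actor产生的特征(假设)

w = np.zeros(4)                                      # 线性reward: r(x) = w^T x
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

for _ in range(500):                                 # 训练"判别器":专家标1,actor标0
    g_e = expert_feats.T @ (1.0 - sigmoid(expert_feats @ w))
    g_a = agent_feats.T  @ (0.0 - sigmoid(agent_feats @ w))
    w += 0.01 * (g_e + g_a) / 400                    # logistic回归的梯度上升

print(np.mean(expert_feats @ w) > np.mean(agent_feats @ w))   # True:专家得分更高
```

学到这个reward之后,再用任意RL算法(比如PPO)基于它更新actor,然后重新采集actor的数据、再更新reward,就是前面"先射箭再画靶"的循环。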
62 | 63 | ![](http://oss.hackslog.cn/imgs/2019-11-06-092943.jpg) 64 | 65 | ![0019](http://oss.hackslog.cn/imgs/2019-11-06-092953.jpg) 66 | 67 | 解决思路跟好奇心机制类似,抽出视野中不重要的因素,让第一人称和第三人称视野中的state都是有用的,与action强相关的。 -------------------------------------------------------------------------------- /slides/AC.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/morningsky/NTU-ReinforcementLearning-Notes/d4a9dbf584ae24d974d9b7839f34cee3f18b79dd/slides/AC.pdf -------------------------------------------------------------------------------- /slides/IRL (v2).pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/morningsky/NTU-ReinforcementLearning-Notes/d4a9dbf584ae24d974d9b7839f34cee3f18b79dd/slides/IRL (v2).pdf -------------------------------------------------------------------------------- /slides/PPO (v3).pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/morningsky/NTU-ReinforcementLearning-Notes/d4a9dbf584ae24d974d9b7839f34cee3f18b79dd/slides/PPO (v3).pdf -------------------------------------------------------------------------------- /slides/QLearning (v2).pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/morningsky/NTU-ReinforcementLearning-Notes/d4a9dbf584ae24d974d9b7839f34cee3f18b79dd/slides/QLearning (v2).pdf -------------------------------------------------------------------------------- /slides/Reward (v3).pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/morningsky/NTU-ReinforcementLearning-Notes/d4a9dbf584ae24d974d9b7839f34cee3f18b79dd/slides/Reward (v3).pdf --------------------------------------------------------------------------------