├── .gitignore ├── A-Guide-Resource-For-DeepRL ├── .DS_Store ├── README.md └── assets │ ├── 2019-11-09-10-52-14.png │ ├── 2019-11-09-10-56-44.png │ ├── 2019-11-09-11-04-36.png │ ├── 2019-11-09-11-59-58.png │ ├── 2019-11-09-12-01-14.png │ ├── 2019-11-09-12-04-34.png │ ├── 2019-11-09-12-05-05.png │ ├── 2019-11-09-12-23-09.png │ ├── 2019-11-09-12-23-34.png │ └── 2019-11-10-09-13-28.png ├── AI-Basic-Resource ├── .DS_Store └── README.md ├── DRL-Algorithm ├── .DS_Store ├── AC-serial │ ├── A2C │ │ └── README.md │ ├── A3C │ │ └── README.md │ ├── ACER │ │ └── README.md │ ├── GA3C │ │ └── README.md │ └── README.md ├── DQN-serial │ ├── .DS_Store │ ├── Double Q-Learning │ │ ├── .DS_Store │ │ └── README.md │ ├── DoubleDQN │ │ ├── .DS_Store │ │ └── README.md │ ├── Dueling DQN │ │ ├── .DS_Store │ │ └── ReadME.md │ ├── Prioritized Experience Replay(PER) │ │ ├── .DS_Store │ │ └── ReadMe.md │ ├── Q-Learning │ │ ├── .DS_Store │ │ └── README.md │ └── README.md ├── PG-serial │ ├── DD4G │ │ └── README.md │ ├── DDPG │ │ └── README.md │ └── README.md ├── PPO-serial │ ├── .DS_Store │ ├── README.md │ └── TRPO │ │ ├── README.md │ │ ├── TRPO整理笔记.md │ │ └── note_img │ │ ├── MM.png │ │ ├── cga.png │ │ ├── cliff.png │ │ ├── mountain.png │ │ ├── pg_formula.png │ │ ├── taylor.png │ │ └── trpo.jpg ├── README.md └── SAC │ └── README.md ├── DRL-Application └── README.md ├── DRL-Books ├── .DS_Store └── README.md ├── DRL-Competition ├── .DS_Store └── README.md ├── DRL-ConferencePaper ├── .DS_Store ├── AAAI │ ├── 2020 │ │ └── ReadMe.md │ └── .DS_Store ├── ACL │ ├── 2018 │ │ └── README.md │ ├── 2019 │ │ └── README.md │ └── .DS_Store ├── ICLR │ ├── 2020 │ │ └── README.md │ └── .DS_Store ├── ICML │ ├── .DS_Store │ └── README.md ├── IJCAI │ ├── 2018 │ │ └── README.md │ ├── 2019 │ │ └── README.md │ ├── .DS_Store │ └── README.md ├── Level.md ├── NIPS │ ├── 2019 │ │ ├── .DS_Store │ │ └── ReadMe.md │ └── .DS_Store └── README.md ├── DRL-Course ├── .DS_Store ├── DavidSliver强化学习课程 │ └── README.md ├── DeepMind深度强化学习课程 │ └── README.md ├── README.md └── UC Berkeley CS 294深度强化学习课程 │ └── README.md ├── DRL-Interviews ├── .DS_Store ├── drl-interview.assets │ ├── 20171108090350229.jfif │ ├── eea4714c.png │ ├── equation-1584541764589.svg │ └── equation.svg └── drl-interview.md ├── DRL-Multi-Agent ├── .DS_Store ├── Nick_Sun ├── QMIX.assets │ ├── v2-79fe8838e84d6def61e3db6cf7332428_hd.jpg │ └── v2-98cea01bf7d7d2239d4d50460a57e6cf_hd.jpg ├── QMIX.md └── README.md ├── DRL-News ├── .DS_Store └── README.md ├── DRL-OpenSource ├── .DS_Store ├── Baidu-PARL │ └── README.md ├── Google-Dopamine(多巴胺) │ └── README.md ├── Google-TensorForce │ └── README.md ├── Intel-Coach │ └── README.md ├── OpenAI-baselines │ └── README.md ├── README.md ├── RLlab │ └── README.md ├── Ray │ ├── README.assets │ │ └── rllib-components.svg │ └── README.md └── TensorLayer │ └── tensorlayer.md ├── DRL-PaperDaily ├── .DS_Store └── README.md ├── DRL-PaperReadCodingPlan ├── .DS_Store └── README.md ├── DRL-TopicResearch ├── .DS_Store └── 奖励函数研究 │ ├── .DS_Store │ ├── README.md │ └── environment │ └── .DS_Store ├── DRL-WorkingCompany ├── .DS_Store ├── 163.md ├── HuaWei.md ├── Inspir-AI.md ├── KuaiShou.md ├── MeiTuan.md ├── ReadMe.md ├── Tencent.md ├── Testin.md ├── VIVO.md └── assets │ └── .DS_Store ├── LICENSE └── README.md /.gitignore: -------------------------------------------------------------------------------- 1 | .idea/* 2 | -------------------------------------------------------------------------------- /A-Guide-Resource-For-DeepRL/.DS_Store: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/NeuronDance/DeepRL/8342ea71be1dffe26581d964f8011bcb973f85d6/A-Guide-Resource-For-DeepRL/.DS_Store -------------------------------------------------------------------------------- /A-Guide-Resource-For-DeepRL/assets/2019-11-09-10-52-14.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NeuronDance/DeepRL/8342ea71be1dffe26581d964f8011bcb973f85d6/A-Guide-Resource-For-DeepRL/assets/2019-11-09-10-52-14.png -------------------------------------------------------------------------------- /A-Guide-Resource-For-DeepRL/assets/2019-11-09-10-56-44.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NeuronDance/DeepRL/8342ea71be1dffe26581d964f8011bcb973f85d6/A-Guide-Resource-For-DeepRL/assets/2019-11-09-10-56-44.png -------------------------------------------------------------------------------- /A-Guide-Resource-For-DeepRL/assets/2019-11-09-11-04-36.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NeuronDance/DeepRL/8342ea71be1dffe26581d964f8011bcb973f85d6/A-Guide-Resource-For-DeepRL/assets/2019-11-09-11-04-36.png -------------------------------------------------------------------------------- /A-Guide-Resource-For-DeepRL/assets/2019-11-09-11-59-58.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NeuronDance/DeepRL/8342ea71be1dffe26581d964f8011bcb973f85d6/A-Guide-Resource-For-DeepRL/assets/2019-11-09-11-59-58.png -------------------------------------------------------------------------------- /A-Guide-Resource-For-DeepRL/assets/2019-11-09-12-01-14.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NeuronDance/DeepRL/8342ea71be1dffe26581d964f8011bcb973f85d6/A-Guide-Resource-For-DeepRL/assets/2019-11-09-12-01-14.png -------------------------------------------------------------------------------- /A-Guide-Resource-For-DeepRL/assets/2019-11-09-12-04-34.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NeuronDance/DeepRL/8342ea71be1dffe26581d964f8011bcb973f85d6/A-Guide-Resource-For-DeepRL/assets/2019-11-09-12-04-34.png -------------------------------------------------------------------------------- /A-Guide-Resource-For-DeepRL/assets/2019-11-09-12-05-05.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NeuronDance/DeepRL/8342ea71be1dffe26581d964f8011bcb973f85d6/A-Guide-Resource-For-DeepRL/assets/2019-11-09-12-05-05.png -------------------------------------------------------------------------------- /A-Guide-Resource-For-DeepRL/assets/2019-11-09-12-23-09.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NeuronDance/DeepRL/8342ea71be1dffe26581d964f8011bcb973f85d6/A-Guide-Resource-For-DeepRL/assets/2019-11-09-12-23-09.png -------------------------------------------------------------------------------- /A-Guide-Resource-For-DeepRL/assets/2019-11-09-12-23-34.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/NeuronDance/DeepRL/8342ea71be1dffe26581d964f8011bcb973f85d6/A-Guide-Resource-For-DeepRL/assets/2019-11-09-12-23-34.png -------------------------------------------------------------------------------- /A-Guide-Resource-For-DeepRL/assets/2019-11-10-09-13-28.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NeuronDance/DeepRL/8342ea71be1dffe26581d964f8011bcb973f85d6/A-Guide-Resource-For-DeepRL/assets/2019-11-10-09-13-28.png -------------------------------------------------------------------------------- /AI-Basic-Resource/.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NeuronDance/DeepRL/8342ea71be1dffe26581d964f8011bcb973f85d6/AI-Basic-Resource/.DS_Store -------------------------------------------------------------------------------- /AI-Basic-Resource/README.md: -------------------------------------------------------------------------------- 1 | 本栏目主要汇总人工智能领域基础书籍与资料 2 | 3 | 4 | 目录如下: 5 | 程序员的数学:[阅读程序员的数学]() 6 | -------------------------------------------------------------------------------- /DRL-Algorithm/.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NeuronDance/DeepRL/8342ea71be1dffe26581d964f8011bcb973f85d6/DRL-Algorithm/.DS_Store -------------------------------------------------------------------------------- /DRL-Algorithm/AC-serial/A2C/README.md: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NeuronDance/DeepRL/8342ea71be1dffe26581d964f8011bcb973f85d6/DRL-Algorithm/AC-serial/A2C/README.md -------------------------------------------------------------------------------- /DRL-Algorithm/AC-serial/A3C/README.md: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NeuronDance/DeepRL/8342ea71be1dffe26581d964f8011bcb973f85d6/DRL-Algorithm/AC-serial/A3C/README.md -------------------------------------------------------------------------------- /DRL-Algorithm/AC-serial/ACER/README.md: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NeuronDance/DeepRL/8342ea71be1dffe26581d964f8011bcb973f85d6/DRL-Algorithm/AC-serial/ACER/README.md -------------------------------------------------------------------------------- /DRL-Algorithm/AC-serial/GA3C/README.md: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NeuronDance/DeepRL/8342ea71be1dffe26581d964f8011bcb973f85d6/DRL-Algorithm/AC-serial/GA3C/README.md -------------------------------------------------------------------------------- /DRL-Algorithm/AC-serial/README.md: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NeuronDance/DeepRL/8342ea71be1dffe26581d964f8011bcb973f85d6/DRL-Algorithm/AC-serial/README.md -------------------------------------------------------------------------------- /DRL-Algorithm/DQN-serial/.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NeuronDance/DeepRL/8342ea71be1dffe26581d964f8011bcb973f85d6/DRL-Algorithm/DQN-serial/.DS_Store -------------------------------------------------------------------------------- /DRL-Algorithm/DQN-serial/Double Q-Learning/.DS_Store: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/NeuronDance/DeepRL/8342ea71be1dffe26581d964f8011bcb973f85d6/DRL-Algorithm/DQN-serial/Double Q-Learning/.DS_Store -------------------------------------------------------------------------------- /DRL-Algorithm/DQN-serial/Double Q-Learning/README.md: -------------------------------------------------------------------------------- 1 | ![](assets/2019-12-04-20-03-09.png) 2 | 3 | 论文地址: https://papers.nips.cc/paper/3964-double-q-learning.pdf 4 | 5 | 6 | 本论文是由DeepMind发表于2015年NIPS的一篇论文,作者Hasselt。 7 | 8 | >前言: Q-Learning算法由于受到大规模的动作值过估计(overestimation)而出现不稳定和效果不佳等现象的存在,而导致overestimation的主要原因来自于最大化值函数(max)逼近,该过程目标是为了最大的累计期望奖励,而在这个过程中产生了正向偏差。而本文章作者巧妙的是使用了两个估计器(double estimator)去计算Q-learning的值函数,作者将这种方法定义了一个名字叫“Double Q-learning”(本质上一个off-policy算法),并对其收敛过程进行了证明(缺点:当然double Q-learning算法有时会低估动作值,但不会像Q学习那样遭受过高估计) 9 | 10 | 11 | ## 1. 问题及原因 12 | > "**过估计**" (overestimate) 13 | > 14 | > 过估计是指*对一系列数先求最大值再求平均,通常比先求平均再求最大值要大(或相等*,数学表达为:$E(\max (X1, X2, ...)) \geq \max (E(X1), E(X2), ...)$ 15 | 16 | 17 | 一般来说Q-learning方法导致overestimation的原因归结于其更新过程,其表达为: 18 | $$ 19 | Q_{t+1}\left(s_{t}, a_{t}\right)=Q_{t}\left(s_{t}, a_{t}\right)+\alpha_{t}\left(s_{t}, a_{t}\right)\left(r_{t}+\gamma \max _{a} Q_{t}\left(s_{t+1}, a\right)-Q_{t}\left(s_{t}, a_{t}\right)\right) 20 | $$ 21 | 22 | 其中的 $\max\limits_{a}$ 表示为最大化action-value, 而更新最优化过程如下: 23 | $$ 24 | \forall s, a: Q^{*}(s, a)=\sum_{s^{\prime}} P_{s a}^{s^{\prime}}\left(R_{s a}^{s^{\prime}}+\gamma \max _{a} Q^{*}\left(s^{\prime}, a\right)\right) 25 | $$ 26 | 27 | 对于任意的$s, a$ 来说,最优值函数 $Q^{*}$ 的更新依赖于 $\max \limits_{a} Q^{*}(s,...)$, 从公式中可以看出,我们把N个Q值先通过取max操作之后,然后求平均(期望),会比我们先算出N个Q值取了期望之后再max要大。这就是overestimate的原因。 28 | 29 | > 注: 一般用于加速Q-learning算法的方法有:Delayed Q-learning, Phased Q-learning, Fitted Q-iteration等 30 | 31 | 32 | 33 | ## 2. Estimator原理与思想 34 | 通常情况下对于一个集合中的变量 $X= \left\{ X_{1},X_{2},...,X_{M} \right\}$来说,奖励的最大化累计期望表示为: 35 | $$ 36 | \max \limits_{a} E \left\{ X_{i} \right\} 37 | $$ 38 | 39 | 那么在实际的过程中,对于每个 $X_{i}$,我们定义 $ S=\bigcup_{i=1}^{M} S_{i}$为采样,其中的 $S_{i}$ 表示为对于所有$X_{i}$采样的一个子集,假设 $S_{i}$ 满足独立同分布情况, 那么期望值的“无偏估计”可以通过计算每个变量的样本平均值来获得,其计算方法如下: 40 | $$ 41 | E\left\{X_{i}\right\}=E\left\{\mu_{i}\right\} \approx \mu_{i}(S) \stackrel{\text { def }}{=} \frac{1}{\left|S_{i}\right|} \sum_{S \in S_{i}} s 42 | $$ 43 | 44 | >注: $\mu_{i}$ 是 $X_{i}$ 的估计器 45 | 46 | 这个过程是一个无偏估计,因为每一个采样 $s\in S_{i}$是一个对 $X_{i}$ 的无偏估计,因此,近似中的误差仅由估计中的 **“方差“** 组成,当我们获得更多样本时会减小。 47 | 48 | 为了后面方便理解,这里我们定义两个函数:“**概率密度函数”**(Probability Density Function, PDF)和“**累积分布函数**”(Cumulative Distribution Function, CDF),概率密度函数$f_{i}$表示$i^{th}$个 $X_{i}$,则累积分布函数表示为: $F_{i}(x)= \int_{-\infty}^{x} f_{i}(x)dx$,同样的道理,对于PDF和CDF来说估计器分别表示为$f_{i}^{\mu}$和$F_{i}^{\mu}$。 49 | >**补充** 50 | > **1. 概率密度函数**, 51 | > 其实就是给定一个值, 判断这个值在该正态分布中所在的位置后, 获得其他数据高于该值或低于该值的比例,其中的曲线就是概率密度函数(PDF),通常情况下pdf的曲线下面积(AUC)总和为1,且曲线末端不会接触到x轴(换句话说, 我们不可能100%的确定某件事)。 52 | > ![](assets/2019-12-05-19-57-42.png) 53 | > 54 | >**2. 
累积分布函数** 55 | > 累积分布函数 (CDF) 计算给定 x 值的累积概率。可使用 CDF 确定取自总体的随机观测值将小于或等于特定值的概率。还可以使用此信息来确定观测值将大于特定值或介于两个值之间的概率。 56 | >![](assets/2019-12-05-19-58-31.png) 57 | >例如,罐装苏打水的填充重量服从正态分布,且均值为 12 盎司,标准差为 0.25 盎司。概率密度函数 (PDF) 描述了填充重量的可能值的可能性。CDF 提供每个 x 值的累积概率。 58 | >*此处参考[PDF-CDF指导](https://support.minitab.com/zh-cn/minitab/18/help-and-how-to/probability-distributions-and-random-data/supporting-topics/basics/using-the-probability-density-function-pdf/)* 59 | 60 | #### (1)单估计器方法(Single Estimator) 61 | >所谓的单估计就是使用一组估计量的最大值作为近似值, 62 | 63 | 即近似$\max \limits_{a} E \left\{ X_{i} \right\}$的最好的方式就是最大化估计器,表示为: 64 | $$ 65 | \max _{i} E\left\{X_{i}\right\}=\max _{i} E\left\{\mu_{i}\right\} \approx \max _{i} \mu_{i}(S) 66 | $$ 67 | 68 | $\mu$表示为估计器,而此处对于最大的估计器$f_{max}^{\mu}$来说,它是依赖于 $f_{i}^{\mu}$ 的,若要求取PDF,首先需要考虑CDF,但它的概率分布中最大的估计器小于等于$x$,这等同于所有的估计均小于等于$x$,数学表示为: 69 | $$ 70 | x: F_{\max }^{\mu}(x) \stackrel{\text { def }}{=} P\left(\max _{i} \mu_{i} \leq x\right)=\prod_{i=1}^{M} P\left(\mu_{i} \leq x\right) \stackrel{\text { def }}{=} \prod_{i=1}^{M} F_{i}^{\mu}(x) 71 | $$ 72 | 73 | 那么$\max_{i}\mu_{i}(S)$是对$E\left\{\max _{j} \mu_{j}\right\}=\int_{-\infty}^{\infty} x f_{\max }^{\mu}(x)dx$的无偏估计,详细表示为: 74 | $$ 75 | E\left\{\max _{j} \mu_{j}\right\}=\int_{-\infty}^{\infty} x \frac{d}{d x} \prod_{i=1}^{M} F_{i}^{\mu}(x) d x=\sum_{j}^{M} \int_{-\infty}^{\infty} x f_{j}^{\mu}(s) \prod_{i \neq j}^{M} F_{i}^{\mu}(x) d x 76 | $$ 77 | 78 | #### (2)双估计器方法(Double Estimator) 79 | >对每个变量使用两个估计器,并将估计器的选择与其值解耦。 80 | 81 | **问题**:单一估计器方法导致过高估计可能会对使用此方法的算法(例如Q学习)产生很大的负面影响。为了解决这个问题,double estimator方法用来解决过高估计。 82 | 83 | 那么对于原来的 $\max_{i} E \left\{ X_{i} \right\}$来说,此处我们需要定义两个估计器:$\mu^{A}$和$\mu^{B}$,他们分别表示为:$\mu^{A}=\left\{\mu_{1}^{A}, \ldots, \mu_{M}^{A}\right\}$,$\mu^{B}=\left\{\mu_{1}^{B}, \ldots, \mu_{M}^{B}\right\}$,然后两个估计器都使用采样的样本子集来更新,其规则表示为: 84 | $$ \left\{ 85 | \begin{aligned} 86 | S=S^{A} \cup S^{B} \\ 87 | S^{A} \cap S^{B}=\emptyset \\ 88 | \mu_{i}^{A}(S)=\frac{1}{\left|S_{i}^{A}\right|} \sum_{s \in S_{i}^{A}} s \\ 89 | \mu_{i}^{B}(S)=\frac{1}{\left|S_{i}^{B}\right|} \sum_{s \in S_{i}^{B}} s 90 | \end{aligned} 91 | \right. 92 | $$ 93 | 94 | 那么像单估计器$\mu_{i}$一样,如果我们假设样本以适当的方式(例如随机地)分布在两组估计器上,则$\mu_{i}^{A}$和$\mu_{i}^{B}$也都是无偏的。设 $\operatorname{Max}^{A}(S) \stackrel{\text { def }}{=}\left\{j | \mu_{j}^{A}(S)=\max _{i} \mu_{i}^{A}(S)\right\}$ 为$\mu^{A}(S)$ 中最大估计值集合,由于$\mu^{B}$是一个独立的无偏估计值,那么 $E\left\{\mu_{j}^{B}\right\}=E\left\{X_{j}\right\}$对于任何$j$都成立,包括 $j \in Max^{A}$ 95 | > 此处有疑问,为什么包括$j \in Max^{A}$?? 96 | 97 | 设$a^{*}$ 是最大化 $\mu^{A}$ 的估计器,表示为:$\mu^{A}: \mu_{a^{*}}^{A}(S) \stackrel{\text { def }}{=} \max _{i} \mu_{i}^{A}(S)$,如果存在多个最大化的$\mu^{A}$ 是最大化估计量,我们可以例如随机选择一个,然后我们可以将$\mu_{a^{*}}^{A}$用作$\max _{i} E\left\{\mu_{i}^{B}\right\}$的估计值,那么对于$\max_{i}E\left\{X_{i}\right\}$可以近似为: 98 | $$ 99 | \max _{i} E\left\{X_{i}\right\}=\max _{i} E\left\{\mu_{i}^{B}\right\} \approx \mu_{a^{*}}^{B} 100 | $$ 101 | 102 | 随着我们获得更多的样本,估计量的方差减小,在极限情况下,$\mu_{i}^{A}(S)=\mu_{i}^{B}(S)=E\left\{X_{i}\right\}$ 103 | 104 | 105 | > 具体证明过程部分如下: 106 | ![](assets/2019-12-05-20-49-21.png) 107 | 108 | ## 3. 
Double Q-learning算法 109 | 我们可以解释为 Q-learning学习其实使用单估计器(single estimate)去估计下一个状态:那么$\max _{a} Q_{t}\left(s_{t+1}, a\right)$是 $E\left\{\max _{a} Q_{t}\left(s_{t+1}, a\right)\right\}$的一个估计,一般的,将期望理解为对同一实验的所有可能运行的平均,而不是(通常在强化学习环境中使用)对下一个状态的期望,根据原理部分,Double Q-learning将使用两个函数 $Q^{A}$和$Q^{B}$(对应两个估计器),并且每个$Q$函数都会使用另一个$Q$函数的值更新下一个状态。两个$Q$函数都必须从不同的经验集中学习,这一点很重要,但是要选择要执行的动作可以同时使用两个值函数。 因此,该算法的数据效率不低于Q学习。 在实验中作者为每个动作计算了两个Q值的平均值,然后对所得的平均Q值进行了贪婪探索。**算法伪代码如下**: 110 | ![](assets/![](assets/2019-12-04-20-07-30.png).png) 111 | 112 | 为了区分Double Q-learning算法和Q-learning的区别,本文同样Q-learning算法伪代码贴出来了。 113 | 114 | ![](assets/2019-12-04-20-30-18.png) 115 | 116 | > 对比:此处对于Q-learning算法和double Q-learning 算法来说,double使用了B网络来更新A网络,同样的道理对于B网络则使用A网络的值来更新。 117 | 118 | 119 | ## 4. 实验过程于结果 120 | 121 | ![](assets/2019-12-05-21-11-37.png) 122 | 123 | ## 5. 附录:收敛性证明过程 124 | 对于Double Q-learning收敛性的证明过程如下: 125 | ![](assets/2019-12-05-21-40-39.png) 126 | ![](assets/2019-12-05-21-41-39.png) -------------------------------------------------------------------------------- /DRL-Algorithm/DQN-serial/DoubleDQN/.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NeuronDance/DeepRL/8342ea71be1dffe26581d964f8011bcb973f85d6/DRL-Algorithm/DQN-serial/DoubleDQN/.DS_Store -------------------------------------------------------------------------------- /DRL-Algorithm/DQN-serial/DoubleDQN/README.md: -------------------------------------------------------------------------------- 1 | ![](assets/2019-12-17-09-18-14.png) 2 | 3 | 论文地址: https://arxiv.org/pdf/1509.06461.pdf 4 | > 本文是Google DeepMind于2015年12月提出的一篇解决Q值"过估计(overestimate)"的文章,作者Hado van Hasselt在其2010年发表的[Double Q-learning](https://blog.csdn.net/gsww404/article/details/103413124)算法工作的基础上结合了DQN的思想,提出了本文的state-of-the-art的Double DQN算法。给出了过估计的通用原因解释和解决方法的数学证明,最后在Atari游戏上有超高的分数实验表现。 5 | 6 | 正常论文的阅读方式,先看摘要和结论: 7 | ![](assets/2019-12-17-09-40-05.png) 8 | 通常情况下,在Q-learning学习中“过估计”是经常发生的,并且影响实验的性能,作者提出了一种可以回答这个问题,并在Double Q-learning算法的基础上进行function approximation的方法,结果表明不仅可以减少观察值的过估计,而且在许多游戏上还有更好的性能表现。而结论部分如下: 9 | ![](assets/2019-12-17-09-43-35.png) 10 | 11 | 作者将整个文章的贡献总结了五点:前三点基本上说了过估计问题的存在,重要性和Double Q-learning算法能解决这个问题,**本文重点是第四**,作者提出了一种在Double Q-learning基础上**利用“DQN”算法网络结构**的方法“Double DQN”,并在第五点获得state-of-the-art的效果,下面详细介绍。 12 | 13 | ### 1. 
问题阐述 14 | 15 | #### 1.1 过估计问题现象 16 | [Q-learning](https://blog.csdn.net/gsww404/article/details/103566859)算法在低维状态下的成功以及DQN和target DQN的效果已经很好了,但是人们发现了一个问题就是之前的Q-learning、DQN算法都会过高估计(overestimate)Q值。开始大家都将其原因归结于函数逼近和噪音。 17 | + **Q-learning**拿到状态对应的所有动作Q值之后是直接选取Q值最大的那个动作,这会导致更加倾向于估计的值比真实的值要高。为了能够使得标准的Q-learning学习去大规模的问题,将其参数化Q值函数表示为: 18 | $$ 19 | \theta_{t+1}=\boldsymbol{\theta}_{t}+\alpha\left(Y_{t}^{\mathrm{Q}}-Q\left(S_{t}, A_{t} ; \boldsymbol{\theta}_{t}\right)\right) \nabla_{\boldsymbol{\theta}_{t}} Q\left(S_{t}, A_{t} ; \boldsymbol{\theta}_{t}\right) 20 | $$ 21 | 其中 $Y_{t}^{\mathrm{Q}}$表示为: 22 | $$ 23 | Y_{t}^{\mathrm{Q}} \equiv R_{t+1}+\gamma \max _{a} Q\left(S_{t+1}, a ; \boldsymbol{\theta}_{t}\right) 24 | $$ 25 | 其实我们发现这个更新过程和梯度下降大同小异,此处均以更新参数 $\theta$ 进行学习。 26 | + **DQN**算法非常重要的两个元素是“经验回放”和“目标网络”,通常情况下,DQN算法更新是利用目标网络的参数 $\theta^{-}$,它每个$\tau$ 步更新一次,其数学表示为: 27 | $$ 28 | Y_{t}^{\mathrm{DQN}} \equiv R_{t+1}+\gamma \max _{a} Q\left(S_{t+1}, a ; \boldsymbol{\theta}_{t}^{-}\right) 29 | $$ 30 | 上述的标准的Q-learning学习和DQN中均使用了 $\max$ 操作,使得选择和评估一个动作值都会过高估计,为了解决这个问题,Double Q-learning率先使用了两个值函数进行解耦,其互相随机的更新两个值函数,并利用彼此的经验去更新网络权重$\theta$和$\theta^{-}$, 为了能够明显的对比, 31 | 32 | + **Double Q-learning**,2010年本文作者Hasselt就针对过高估计Q值的问题提出了[*Double Q-learning*](https://blog.csdn.net/gsww404/article/details/103413124),他就是尝试通过将选择动作和评估动作分割开来避免过高估计的问题。在原始的Double Q-Learning算法里面,有两个价值函数(value function),一个用来选择动作(当前状态的策略),一个用来评估当前状态的价值。这两个价值函数的参数分别记做 $\theta$ 和 $\theta^{'}$ 。算法的思路如下: 33 | 34 | $$ 35 | Y_{t}^{\mathrm{Q}}=R_{t+1}+\gamma Q\left(S_{t+1}, \underset{a}{\operatorname{argmax}} Q\left(S_{t+1}, a ; \boldsymbol{\theta}_{t}\right) ; \boldsymbol{\theta}_{t}\right) 36 | $$ 37 | 通过对原始的Q-learning算法的改进,Double Q-learning的误差表示为: 38 | $$ 39 | Y_{t}^{\text {DoubleQ }} \equiv R_{t+1}+\gamma Q\left(S_{t+1}, \underset{a}{\operatorname{argmax}} Q\left(S_{t+1}, a ; \boldsymbol{\theta}_{t}\right) ; \boldsymbol{\theta}_{t}^{\prime}\right) 40 | $$ 41 | 此处意味着我们仍然使用贪心策略去学习估计Q值,而使用第二组权重参数$\theta^{'}$去评估其策略。 42 | #### 1.2 估计误差: “过估计” 43 | ##### 1.2.1 上界估计 44 | Thrun等人在1993年的时候就给出如果动作值包含在区间 $[-\epsilon, \epsilon]$ 之间的标准分布下的随机的误差,那么上限估计为: $\gamma \epsilon \frac{m-1}{m+1}$ (m表示动作的数量) 45 | ##### 1.2.2 下界估计 46 | ![](assets/2019-12-17-15-30-16.png) 47 | 作者给出了一个定理1: 48 | > 在一个状态下如果动作 $m>2$且 $C=\frac{1}{m} \sum_{a}\left(Q_{t}(s, a)-V_{*}(s)\right)^{2}>0$,则 49 | 【1】$\max _{a} Q_{t}(s, a) \geq V_{*}(s)+\sqrt{\frac{C}{m-1}}$ 50 | 【2】Double Q-learning的下界绝对误差为0 51 | 52 | 根据定理1我们得到下界估计的值随着 $m$ 的增大而减小,通过实验,下面结果表明 $m$对估计的影响,图中明显表明,Q-learning的随m的增大越来越大,而Double Q-learning是**无偏估计**,并未随着m增大而过度变化,基本上在0附近。 53 | ![](assets/2019-12-17-11-43-03.png) 54 | 55 | >附录:定理1证明过程 56 | ![](assets/2019-12-17-15-56-20.png) 57 | 58 | **此处作者还得出一个定理结论** 59 | ![](assets/2019-12-17-16-44-27.png) 60 | > 证明如下: 61 | ![](assets/2019-12-17-16-45-28.png) 62 | 63 | 64 | **为了进一步说明Q-learning, Double Q-learning估值偏差的区别,作者给出了一个有真实$Q$值的环境:假设$Q$值为 $Q_(s, a) = sin(s)以及Q_(s, a) = 2 exp(-s^2)$ ,然后尝试用6阶和9阶多项式拟合这两条曲线,一共进行了三组实验,参见下面表格** 65 | ![](assets/2019-12-17-16-00-20.png) 66 | 67 | 68 | 这个试验中设定有10个action(分别记做 a1,a2,…,a10 ),并且Q值只与state有关。所以对于每个state,每个action都应该有相同的true value,他们的值可以通过目标Q值那一栏的公式计算出来。此外这个实作还有一个人为的设定是每个action都有两个相邻的state不采样,比如说 a1 不采样-5和-4(这里把-4和-5看作是state的编号), a2 不采样-4和-3等。这样我们可以整理出一张参与采样的action与对应state的表格: 69 | ![](assets/2019-12-17-16-08-30.png) 70 | 浅蓝色代表对应的格子有学习得到的估值,灰色代表这部分不采样,也没有对应的估值(类似于监督学习这部分没有对应的标记,所以无法学习到东西) 71 | 72 | 这样实验过后得到的结果用下图展示: 73 | 74 | ![](assets/2019-12-17-11-43-40.png) 75 | 从这里面可以看出很多事情: 76 | 77 | + 最左边三幅图(对应 action2 
那一列学到的估值)中紫色的线代表真实值(也就是目标Q值,通过s不同取值计算得出),绿色的线是通过Q-learning学习后得到的估值,其中绿点标记的是采样点,也就是说是通过这几个点的真实值进行学习的。结果显示前面两组的估值不准确,原因是我们有十一个值( s∈−6,−5,−2,−1,0,1,2,3,4,5,6 ),用6阶多项式没有办法完美拟合这些点。对于第三组实验,虽然能看出在采样的这十一个点中,我们的多项式拟合已经足够准确了,但是对于其他没有采样的点我们的误差甚至比六阶多项式对应的点还要大。 78 | + 中间的三张图画出了这十个动作学到的估值曲线(对应图中绿色的线条),并且用黑色虚线标记了这十根线中每个位置的最大值。结果可以发现这根黑色虚线几乎在所有的位置都比真实值要高。 79 | + 右边的三幅图显示了中间图中黑色虚线和左边图中紫线的差值,并且将Double Q-Learning实作的结果用同样的方式进行比较,结果发现Double Q-Learning的方式实作的结果更加接近0。这证明了Double Q-learnign确实能降低Q-Learning中过高估计的问题。 80 | + 前面提到过有人认为过高估计的一个原因是不够灵活的value function,但是从这个实验结果中可以看出,虽然说**在采样的点上,value function越灵活,Q值越接近于真实值,但是对于没有采样的点,灵活的value function会导致更差的结果**,在RL领域中,大家经常使用的是比较灵活的value function,所以这一点的影响比较严重。 81 | + 虽然有人认为对于一个state,如果这个state对应的action的估值都均匀的升高了,还是不影响我们的决策啊,反正估值最高的那个动作还是最高,我们选择的动作依然是正确的。但是这个实验也证明了:不同状态,不同动作,相应的估值过高估计的程度也是不一样的,因此上面这种说法也并不正确。 82 | 83 | 84 | ### 2. Double DQN 算法原理及过程 85 | 86 | >通过以上的证明和拟合曲线实验表明,过高估计不仅真实存在,而且对实验的结果有很大的影响,为了解决问这个问题,在Double的基础上作者提出了本文的“**Double DQN**”算法 87 | 88 | 下面我们提出Double DQN算法的更新过程: 89 | $$ 90 | Y_{t}^{\text {DoubleDQN }} \equiv R_{t+1}+\gamma Q\left(S_{t+1}, \underset{a}{\operatorname{argmax}} Q\left(S_{t+1}, a ; \boldsymbol{\theta}_{t}\right), \boldsymbol{\theta}_{t}^{-}\right) 91 | $$ 92 | 93 | 该过程和前面的Double Q-learning算法更新公式基本一样,唯一的区别在于 $\theta^{'}$和$\theta^{-}$,两者的区别在于Double Q-learning算法是利用交换来不断的更新,Double DQN则使用了DQN的思想,直接利用目标网络($\theta^{-}$)进行更新。 94 | 95 | 在实验中,作者基本上 96 | ![](assets/2019-12-17-16-29-39.png) 97 | 98 | 实验结果如下: 99 | ![](assets/2019-12-17-11-44-07.png) 100 | 101 | 102 | > 对于Atari游戏来讲,我们很难说某个状态的Q值等于多少,一般情况是将训练好的策略去运行游戏,然后根据游戏中积累reward,就能得到平均的reward作为true value了。 103 | + 从实验第一行结果我们明显可以看出在集中游戏中,值函数相对于Double DQN都明显的比较高(如果没有过高估计的话,收敛之后我们的估值应该跟真实值相同的),此处说明过高估计确实不容易避免。 104 | + Wizard of Wor和Asterix这两个游戏可以看出,DQN的结果比较不稳定。也表明过高估计会影响到学习的性能的稳定性。**因此不稳定的问题的本质原因还是对Q值的过高估计**。 105 | + 对于鲁棒性的测试 106 | 107 | 108 | 此外作者为了对游戏有一个统计学意义上的总结,对分数进行了正则化,表示为: 109 | $$ 110 | \text { score }_{\text{normalized}}=\frac{\text {score}_{\text{agent}}-\text { score}_{\text{random}}}{\text { score}_{\text{human}}-\text { score}_{\text{random}}} 111 | $$ 112 | 实验结果如下: 113 | ![](assets/2019-12-17-16-34-24.png) 114 | 115 | 以上基本上是本论文的内容,下面我们借助实验进行code的Double DQN算法。其实本部分的复现只是将更新的DQN的目标函数换一下。对于论文中的多项式拟合并不做复现。 116 | 117 | ### 3. 
代码复现 118 | 119 | 此处采用Morvan的代码,实验环境是:Tensorflow=1.0&gym=0.8.0,先coding一个智能体Agent 120 | 121 | ```python 122 | # file name Agent.py 123 | 124 | import numpy as np 125 | import tensorflow as tf 126 | 127 | np.random.seed(1) 128 | tf.set_random_seed(1) 129 | 130 | # Double DQN 131 | class DoubleDQN: 132 | def __init__( 133 | self, 134 | n_actions, 135 | n_features, 136 | learning_rate=0.005, 137 | reward_decay=0.9, 138 | e_greedy=0.9, 139 | replace_target_iter=200, 140 | memory_size=3000, 141 | batch_size=32, 142 | e_greedy_increment=None, 143 | output_graph=False, 144 | double_q=True, 145 | sess=None, 146 | ): 147 | self.n_actions = n_actions 148 | self.n_features = n_features 149 | self.lr = learning_rate 150 | self.gamma = reward_decay 151 | self.epsilon_max = e_greedy 152 | self.replace_target_iter = replace_target_iter 153 | self.memory_size = memory_size 154 | self.batch_size = batch_size 155 | self.epsilon_increment = e_greedy_increment 156 | self.epsilon = 0 if e_greedy_increment is not None else self.epsilon_max 157 | 158 | self.double_q = double_q # decide to use double q or not 159 | 160 | self.learn_step_counter = 0 161 | self.memory = np.zeros((self.memory_size, n_features*2+2)) 162 | self._build_net() 163 | t_params = tf.get_collection('target_net_params') 164 | e_params = tf.get_collection('eval_net_params') 165 | self.replace_target_op = [tf.assign(t, e) for t, e in zip(t_params, e_params)] 166 | 167 | if sess is None: 168 | self.sess = tf.Session() 169 | self.sess.run(tf.global_variables_initializer()) 170 | else: 171 | self.sess = sess 172 | if output_graph: 173 | tf.summary.FileWriter("logs/", self.sess.graph) 174 | self.cost_his = [] 175 | 176 | def _build_net(self): 177 | def build_layers(s, c_names, n_l1, w_initializer, b_initializer): 178 | with tf.variable_scope('l1'): 179 | w1 = tf.get_variable('w1', [self.n_features, n_l1], initializer=w_initializer, collections=c_names) 180 | b1 = tf.get_variable('b1', [1, n_l1], initializer=b_initializer, collections=c_names) 181 | l1 = tf.nn.relu(tf.matmul(s, w1) + b1) 182 | 183 | with tf.variable_scope('l2'): 184 | w2 = tf.get_variable('w2', [n_l1, self.n_actions], initializer=w_initializer, collections=c_names) 185 | b2 = tf.get_variable('b2', [1, self.n_actions], initializer=b_initializer, collections=c_names) 186 | out = tf.matmul(l1, w2) + b2 187 | return out 188 | # ------------------ build evaluate_net ------------------ 189 | self.s = tf.placeholder(tf.float32, [None, self.n_features], name='s') # input 190 | self.q_target = tf.placeholder(tf.float32, [None, self.n_actions], name='Q_target') # for calculating loss 191 | 192 | with tf.variable_scope('eval_net'): 193 | c_names, n_l1, w_initializer, b_initializer = \ 194 | ['eval_net_params', tf.GraphKeys.GLOBAL_VARIABLES], 20, \ 195 | tf.random_normal_initializer(0., 0.3), tf.constant_initializer(0.1) # config of layers 196 | 197 | self.q_eval = build_layers(self.s, c_names, n_l1, w_initializer, b_initializer) 198 | 199 | with tf.variable_scope('loss'): 200 | self.loss = tf.reduce_mean(tf.squared_difference(self.q_target, self.q_eval)) 201 | with tf.variable_scope('train'): 202 | self._train_op = tf.train.RMSPropOptimizer(self.lr).minimize(self.loss) 203 | 204 | # ------------------ build target_net ------------------ 205 | self.s_ = tf.placeholder(tf.float32, [None, self.n_features], name='s_') # input 206 | with tf.variable_scope('target_net'): 207 | c_names = ['target_net_params', tf.GraphKeys.GLOBAL_VARIABLES] 208 | 209 | self.q_next = build_layers(self.s_, c_names, n_l1, 
w_initializer, b_initializer) 210 | 211 | def store_transition(self, s, a, r, s_): 212 | if not hasattr(self, 'memory_counter'): 213 | self.memory_counter = 0 214 | transition = np.hstack((s, [a, r], s_)) 215 | index = self.memory_counter % self.memory_size 216 | self.memory[index, :] = transition 217 | self.memory_counter += 1 218 | 219 | def choose_action(self, observation): 220 | observation = observation[np.newaxis, :] 221 | actions_value = self.sess.run(self.q_eval, feed_dict={self.s: observation}) 222 | action = np.argmax(actions_value) 223 | 224 | if not hasattr(self, 'q'): # record action value it gets 225 | self.q = [] 226 | self.running_q = 0 227 | self.running_q = self.running_q*0.99 + 0.01 * np.max(actions_value) 228 | self.q.append(self.running_q) 229 | 230 | if np.random.uniform() > self.epsilon: # choosing action 231 | action = np.random.randint(0, self.n_actions) 232 | return action 233 | 234 | def learn(self): 235 | if self.learn_step_counter % self.replace_target_iter == 0: 236 | self.sess.run(self.replace_target_op) 237 | print('\ntarget_params_replaced\n') 238 | 239 | if self.memory_counter > self.memory_size: 240 | sample_index = np.random.choice(self.memory_size, size=self.batch_size) 241 | else: 242 | sample_index = np.random.choice(self.memory_counter, size=self.batch_size) 243 | batch_memory = self.memory[sample_index, :] 244 | 245 | q_next, q_eval4next = self.sess.run( 246 | [self.q_next, self.q_eval], 247 | feed_dict={self.s_: batch_memory[:, -self.n_features:], # next observation 248 | self.s: batch_memory[:, -self.n_features:]}) # next observation 249 | q_eval = self.sess.run(self.q_eval, {self.s: batch_memory[:, :self.n_features]}) 250 | 251 | q_target = q_eval.copy() 252 | 253 | batch_index = np.arange(self.batch_size, dtype=np.int32) 254 | eval_act_index = batch_memory[:, self.n_features].astype(int) 255 | reward = batch_memory[:, self.n_features + 1] 256 | # Double DQN算法和DQN算法的区别。 257 | if self.double_q: 258 | max_act4next = np.argmax(q_eval4next, axis=1) # the action that brings the highest value is evaluated by q_eval 259 | selected_q_next = q_next[batch_index, max_act4next] # Double DQN, select q_next depending on above actions 260 | else: 261 | selected_q_next = np.max(q_next, axis=1) # the natural DQN 262 | 263 | q_target[batch_index, eval_act_index] = reward + self.gamma * selected_q_next 264 | 265 | _, self.cost = self.sess.run([self._train_op, self.loss], 266 | feed_dict={self.s: batch_memory[:, :self.n_features], 267 | self.q_target: q_target}) 268 | self.cost_his.append(self.cost) 269 | 270 | self.epsilon = self.epsilon + self.epsilon_increment if self.epsilon < self.epsilon_max else self.epsilon_max 271 | self.learn_step_counter += 1 272 | 273 | ``` 274 | 主函数入口: 275 | ```python 276 | import gym 277 | from Agent import DoubleDQN 278 | import numpy as np 279 | import matplotlib.pyplot as plt 280 | import tensorflow as tf 281 | 282 | 283 | env = gym.make('Pendulum-v0') 284 | env = env.unwrapped 285 | env.seed(1) 286 | MEMORY_SIZE = 3000 287 | ACTION_SPACE = 11 288 | 289 | sess = tf.Session() 290 | with tf.variable_scope('Natural_DQN'): 291 | natural_DQN = DoubleDQN( 292 | n_actions=ACTION_SPACE, n_features=3, memory_size=MEMORY_SIZE, 293 | e_greedy_increment=0.001, double_q=False, sess=sess 294 | ) 295 | 296 | with tf.variable_scope('Double_DQN'): 297 | double_DQN = DoubleDQN( 298 | n_actions=ACTION_SPACE, n_features=3, memory_size=MEMORY_SIZE, 299 | e_greedy_increment=0.001, double_q=True, sess=sess, output_graph=True) 300 | 301 | 
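# Note: both the Natural_DQN and Double_DQN agents above were built in the same
# tf.Session (sess), so the single global-variable initialization below covers
# the variables of both graphs at once.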
sess.run(tf.global_variables_initializer()) 302 | 303 | 304 | def train(RL): 305 | total_steps = 0 306 | observation = env.reset() 307 | while True: 308 | # if total_steps - MEMORY_SIZE > 8000: env.render() 309 | 310 | action = RL.choose_action(observation) 311 | 312 | f_action = (action-(ACTION_SPACE-1)/2)/((ACTION_SPACE-1)/4) # convert to [-2 ~ 2] float actions 313 | observation_, reward, done, info = env.step(np.array([f_action])) 314 | 315 | reward /= 10 # normalize to a range of (-1, 0). r = 0 when get upright 316 | # the Q target at upright state will be 0, because Q_target = r + gamma * Qmax(s', a') = 0 + gamma * 0 317 | # so when Q at this state is greater than 0, the agent overestimates the Q. Please refer to the final result. 318 | 319 | RL.store_transition(observation, action, reward, observation_) 320 | 321 | if total_steps > MEMORY_SIZE: # learning 322 | RL.learn() 323 | 324 | if total_steps - MEMORY_SIZE > 20000: # stop game 325 | break 326 | 327 | observation = observation_ 328 | total_steps += 1 329 | return RL.q 330 | 331 | q_natural = train(natural_DQN) 332 | q_double = train(double_DQN) 333 | 334 | plt.plot(np.array(q_natural), c='r', label='natural') 335 | plt.plot(np.array(q_double), c='b', label='double') 336 | plt.legend(loc='best') 337 | plt.ylabel('Q eval') 338 | plt.xlabel('training steps') 339 | plt.grid() 340 | plt.show() 341 | ``` 342 | 343 | 参考文献: 344 | [1]. [Deep Reinforcement Learning with Double Q-learning](https://arxiv.org/pdf/1509.06461.pdf) by Hado van Hasselt and Arthur Guez and David Silver,DeepMind 345 | [2].[JUNMO的博客: junmo1215.github.io](https://junmo1215.github.io/paper/2017/12/08/Note-Deep-Reinforcement-Learning-with-Double-Q-learning.html) 346 | [3]. [Morvanzhou的Github](https://github.com/MorvanZhou/Reinforcement-learning-with-tensorflow/tree/master/contents/5.1_Double_DQN) -------------------------------------------------------------------------------- /DRL-Algorithm/DQN-serial/Dueling DQN/.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NeuronDance/DeepRL/8342ea71be1dffe26581d964f8011bcb973f85d6/DRL-Algorithm/DQN-serial/Dueling DQN/.DS_Store -------------------------------------------------------------------------------- /DRL-Algorithm/DQN-serial/Dueling DQN/ReadME.md: -------------------------------------------------------------------------------- 1 | ![](assets/2019-12-24-15-32-40.png) 2 | 3 | 本文是DeepMind发表于ICML2016顶会的文章(获得Best Paper奖),第一作者Ziyu Wang(第四作Hado Van Hasselt就是前几篇文章[#Double Q-learning#](https://blog.csdn.net/gsww404/article/details/103413124),[Double DQN](https://blog.csdn.net/gsww404/article/details/103583784)的作者),可以说他们开创了DQN,后续还有几个DeepMind的文章。 4 | 5 | ![](assets/2019-12-24-15-34-02.png) 6 | 7 | ![](assets/2019-12-24-15-34-48.png) 8 | 9 | ### 1. 问题阐述 10 | 在前面已经学习了 11 | + [Q-learning](https://blog.csdn.net/gsww404/article/details/103566859) 12 | + [Double Q-learning](https://blog.csdn.net/gsww404/article/details/103413124): 解决值函数**过估计问题** 13 | + [DQN](https://blog.csdn.net/gsww404/article/details/79763051): 解决**大状态空间、动作空间**问题 14 | + [Double DQN](https://blog.csdn.net/gsww404/article/details/103583784): 解决值函数**过估计问题** 15 | + [PER-DQN](https://blog.csdn.net/gsww404/article/details/103673852): 解决经验回放的**采样问题** 16 | 17 | 18 | 19 | ### 2. 算法原理和过程 20 | 21 | ![](assets/2019-12-24-15-38-01.png) 22 | 23 | 24 | #### 2.n 算法伪代码 25 | ![](assets/2019-12-24-15-39-11.png) 26 | 27 | #### 2.5 实现结果 28 | ![](assets/2019-12-24-15-40-18.png) 29 | ### 3. 
代码复现 30 | 31 | -------------------------------------------------------------------------------- /DRL-Algorithm/DQN-serial/Prioritized Experience Replay(PER)/.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NeuronDance/DeepRL/8342ea71be1dffe26581d964f8011bcb973f85d6/DRL-Algorithm/DQN-serial/Prioritized Experience Replay(PER)/.DS_Store -------------------------------------------------------------------------------- /DRL-Algorithm/DQN-serial/Q-Learning/.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NeuronDance/DeepRL/8342ea71be1dffe26581d964f8011bcb973f85d6/DRL-Algorithm/DQN-serial/Q-Learning/.DS_Store -------------------------------------------------------------------------------- /DRL-Algorithm/DQN-serial/Q-Learning/README.md: -------------------------------------------------------------------------------- 1 | 2 | ![](assets/2019-12-08-17-08-25.png) 3 | 4 | 论文地址: http://www.gatsby.ucl.ac.uk/~dayan/papers/cjch.pdf 5 | > Q-Learning是发表于1989年的一种value-based,且model-free的特别经典的off-policy算法,近几年的DQN等算法均是在此基础上通过神经网络进行展开的。 6 | ### 1. 相关简介 7 | 强化学习学习过程中,通常是将学习的序列数据存储在表格中,下次进行学习的时候通过获取表中的数据,并进行更新学习。 8 | 9 | ### 2. 原理及推导 10 | 11 | Q-Learning就是在某一个时刻的状态(state)下,采取动作a能够获得收益的期望,环境会根据agent的动作反馈相应的reward奖赏,**核心就是将state和action构建成一张Q_table表来存储Q值,然后根据Q值来选取能够获得最大收益的动作**,如表所示: 12 | 13 | | Q-Table | $a_{1}$ | $a_{2}$ | 14 | | ---- | ---- |---- | 15 | | $s_{1}$ | $Q(s_{1},a_{1})$ |$Q(s_{1},a_{2})$| 16 | | $s_{2}$ | $Q(s_{2},a_{1})$ |$Q(s_{2},a_{2})$| 17 | | $s_{3}$ | $Q(s_{3},a_{1})$ |$Q(s_{3},a_{2})$| 18 | 19 | Q-learning的主要优势就是使用了时间差分法TD(融合了蒙特卡洛和动态规划)能够进行离线(off-policy)学习, 使用bellman方程可以对马尔科夫过程求解最优策略。 20 | 21 | 算法伪代码 22 | ![](assets/2019-12-08-17-42-05.png) 23 | 24 | 从伪代码中可以看出,在每个episode中的更新方式采用了贪婪greedy(进行探索)进行最优动作的选取,并通过更新 $Q$值(这里的 $\max \limits_{a}$ 操作是非常关键的一部分)来达到学习目的。代码的复现过程中也是严格按照伪代码的顺序进行完成。 25 | 26 | ### 3. 
代码复现 27 | 本文参考莫烦的代码,利用Q-learning算法实现一个走迷宫的实现,具体为红色块(机器人)通过上下左右移动,最后找到黄色圈(宝藏),黑色块为障碍物。 28 | ![](assets/2019-12-16-10-30-45.png) 29 | 30 | > 分析:对于机器人来说,选取的动作choose_action有四个状态,上下左右,也就是下文中的self.action(本质可以用一个list进行表示) 31 | ##### 第一步:构建Q值表、动作值选取和Q值更新 32 | ```python 33 | 34 | import numpy as np 35 | import pandas as pd 36 | 37 | 38 | class QLearningTable: 39 | def __init__(self, actions, learning_rate=0.01, reward_decay=0.9, e_greedy=0.9): 40 | self.actions = actions # a list 41 | self.lr = learning_rate 42 | self.gamma = reward_decay 43 | self.epsilon = e_greedy 44 | self.q_table = pd.DataFrame(columns=self.actions, dtype=np.float64) 45 | # 创建一个列为self.action的表结构 46 | 47 | # 定义选取动作值 48 | def choose_action(self, observation): 49 | self.check_state_exist(observation) 50 | # 动作选择,从均匀分布中采样(np.random.uniform) 51 | if np.random.uniform() < self.epsilon: 52 | # 选择最好的动作,此处通过loc函数直接对元素赋值 53 | state_action = self.q_table.loc[observation, :] 54 | # some actions may have the same value, randomly choose on in these actions 55 | action = np.random.choice(state_action[state_action == np.max(state_action)].index) 56 | else: 57 | # choose random action 58 | action = np.random.choice(self.actions) 59 | return action 60 | 61 | def learn(self, s, a, r, s_): 62 | self.check_state_exist(s_) 63 | q_predict = self.q_table.loc[s, a] 64 | if s_ != 'terminal': 65 | q_target = r + self.gamma * self.q_table.loc[s_, :].max() # next state is not terminal 66 | else: 67 | q_target = r # next state is terminal 68 | self.q_table.loc[s, a] += self.lr * (q_target - q_predict) # update 69 | 70 | def check_state_exist(self, state): 71 | if state not in self.q_table.index: 72 | # append new state to q table 73 | self.q_table = self.q_table.append( 74 | pd.Series( 75 | [0]*len(self.actions), 76 | index=self.q_table.columns, 77 | name=state, 78 | ) 79 | ) 80 | ``` 81 | 82 | ##### 第二步: 写episode循环中的内容 83 | ```python 84 | def update(): 85 | for episode in range(100): 86 | # initial observation 87 | observation = env.reset() 88 | # 每个Episode 89 | while True: 90 | # fresh env 91 | env.render() 92 | 93 | # RL choose action based on observation 94 | action = RL.choose_action(str(observation)) 95 | 96 | # RL take action and get next observation and reward 97 | observation_, reward, done = env.step(action) 98 | 99 | # RL learn from this transition 100 | RL.learn(str(observation), action, reward, str(observation_)) 101 | 102 | # swap observation 103 | observation = observation_ 104 | 105 | # break while loop when end of this episode 106 | if done: 107 | break 108 | 109 | # end of game 110 | print('game over') 111 | env.destroy() 112 | ``` 113 | 114 | ##### 第三步:写主函数入口 115 | ```python 116 | 117 | if __name__ == "__main__": 118 | env = Maze() 119 | RL = QLearningTable(actions=list(range(env.n_actions))) 120 | env.after(100, update) 121 | env.mainloop() 122 | 123 | ``` 124 | 注:这里对环境maze函数的代码略去,大多数实验中,我们直接使用gym环境或者其他的现有的环境即可,此处环境见参考文献完整代码 125 | 126 | 127 | 参考文献: 128 | 1. MorvanZhou.github. 
[(2017,点击查看完整源代码)](https://github.com/MorvanZhou/Reinforcement-learning-with-tensorflow/tree/master/contents/2_Q_Learning_maze) 129 | -------------------------------------------------------------------------------- /DRL-Algorithm/DQN-serial/README.md: -------------------------------------------------------------------------------- 1 | # DRL 前沿算法 2 | 3 | ## DRL开山之作 4 | Deep Mind 使用 DQN 让机器人玩 Atari 游戏并达到(超越)人类顶级选手水平: 5 | 6 | + [Playing Atari with Deep Reinforcement Learning(2013)](https://arxiv.org/pdf/1312.5602.pdf) 7 | 8 | + [Human-level control through deep reinforcement learning(2015)](https://daiwk.github.io/assets/dqn.pdf) 9 | 10 | ## AlphaGo 及后续 11 | 12 | + AlphaGo: [Mastering the game of Go with deep neural networks and tree search(2016)](https://storage.googleapis.com/deepmind-media/alphago/AlphaGoNaturePaper.pdf) 13 | 14 | + AlphaGo Zero: [Mastering the Game of Go without Human Knowledge(2017)](http://discovery.ucl.ac.uk/10045895/1/agz_unformatted_nature.pdf) 15 | 16 | + AlphaZero: [Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm(2017)](https://arxiv.org/pdf/1712.01815.pdf) 17 | 18 | ## DQN 改进 19 | + Dueling DQN: [【Code】](https://github.com/NeuronDance/DeepRL/tree/master/DRL%E5%89%8D%E6%B2%BF%E7%AE%97%E6%B3%95/Dueling-DQN),[Dueling Network Architectures for Deep Reinforcement Learning(2015)](https://arxiv.org/pdf/1511.06581.pdf) 20 | 21 | + Double DQN: [【Code】](https://github.com/NeuronDance/DeepRL/tree/master/DRL%E5%89%8D%E6%B2%BF%E7%AE%97%E6%B3%95/Dueling-DQN) ,[Deep Reinforcement Learning with Double Q-learning(2015)](https://arxiv.org/pdf/1509.06461.pdf) 22 | 23 | + NAF: [Continuous Deep Q-Learning with Model-based Acceleration(2016)](https://arxiv.org/pdf/1603.00748.pdf) 24 | [PRIORITIZED EXPERIENCE REPLAY(2015)](https://arxiv.org/pdf/1511.05952.pdf) 25 | 26 | ## 基于策略梯度(Policy Gradient) 27 | + DPG: [Deterministic Policy Gradient Algorithms(2014)](https://hal.inria.fr/file/index/docid/938992/filename/dpg-icml2014.pdf) 28 | 29 | + DDPG: [CONTINUOUS CONTROL WITH DEEP REINFORCEMENT LEARNING(2015)](https://arxiv.org/pdf/1509.02971.pdf) 30 | 31 | + D4PG: [Distributed Distributional Deterministic Policy Gradients(2018)](https://openreview.net/pdf?id=SyZipzbCb) 32 | 33 | + TRPO: [Trust Region Policy Optimization(2015)](https://arxiv.org/pdf/1502.05477.pdf) 34 | 35 | + PPO: [Proximal Policy Optimization Algorithms(2017)](https://arxiv.org/pdf/1707.06347.pdf) 36 | 37 | + ACER: [SAMPLE EFFICIENT ACTOR-CRITIC WITH EXPERIENCE REPLAY(2017)](https://arxiv.org/pdf/1611.01224.pdf) 38 | 39 | + ACTKR: [Scalable trust-region method for deep reinforcement learning using Kronecker-factored approximation(2017)](https://arxiv.org/pdf/1708.05144.pdf) 40 | 41 | + SAC: [Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor(2018)](https://arxiv.org/pdf/1801.01290.pdf) 42 | 43 | ## 异步强化学习 44 | + A3C: [Asynchronous Methods for Deep Reinforcement Learning(2016)](http://arxiv.org/abs/1602.01783) 45 | 46 | + GA3C: [Reinforcement Learning through Asynchronous Advantage Actor-Critic on a GPU](https://openreview.net/pdf?id=r1VGvBcxl) 47 | 48 | ## 分层 DRL 49 | + [Deep Successor Reinforcement Learning(2016)](https://arxiv.org/pdf/1606.02396.pdf) 50 | 51 | + [Hierarchical Deep Reinforcement Learning: Integrating Temporal Abstraction and Intrinsic Motivation(2016)](https://arxiv.org/pdf/1604.06057.pdf) 52 | 53 | + [Hierarchical Reinforcement Learning using Spatio-Temporal Abstractions and Deep Neural 
Networks(2016)](https://arxiv.org/pdf/1605.05359v1.pdf) 54 | 55 | + [STOCHASTIC NEURAL NETWORKS FOR HIERARCHICAL REINFORCEMENT LEARNING(2017)](https://openreview.net/pdf?id=B1oK8aoxe) 56 | 57 | ## 逆强化学习 58 | + [Guided Cost Learning: Deep Inverse Optimal Control via Policy Optimization(2016)](https://arxiv.org/pdf/1603.00448v3.pdf) 59 | 60 | + [Maximum Entropy Deep Inverse Reinforcement Learning(2015)](https://arxiv.org/pdf/1507.04888v3.pdf) 61 | + [GENERALIZING SKILLS WITH SEMI-SUPERVISED REINFORCEMENT LEARNING(2017)](https://arxiv.org/pdf/1612.00429.pdf) 62 | 63 | ## 参考链接 64 | + [深度增强学习方向论文整理](https://zhuanlan.zhihu.com/p/23600620) 65 | 66 | + [Deep Reinforcement Learning Papers](https://github.com/muupan/deep-reinforcement-learning-papers) 67 | 68 | + [Policy Gradient Algorithms](https://lilianweng.github.io/lil-log/2018/04/08/policy-gradient-algorithms.html) 69 | 70 | + [Policy Gradient Algorithms](https://lilianweng.github.io/lil-log/2018/04/08/policy-gradient-algorithms.html) 71 | 72 | 73 | 74 | ## 致谢 75 | @J.Q.Wang 76 | -------------------------------------------------------------------------------- /DRL-Algorithm/PG-serial/DD4G/README.md: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NeuronDance/DeepRL/8342ea71be1dffe26581d964f8011bcb973f85d6/DRL-Algorithm/PG-serial/DD4G/README.md -------------------------------------------------------------------------------- /DRL-Algorithm/PG-serial/DDPG/README.md: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NeuronDance/DeepRL/8342ea71be1dffe26581d964f8011bcb973f85d6/DRL-Algorithm/PG-serial/DDPG/README.md -------------------------------------------------------------------------------- /DRL-Algorithm/PG-serial/README.md: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NeuronDance/DeepRL/8342ea71be1dffe26581d964f8011bcb973f85d6/DRL-Algorithm/PG-serial/README.md -------------------------------------------------------------------------------- /DRL-Algorithm/PPO-serial/.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NeuronDance/DeepRL/8342ea71be1dffe26581d964f8011bcb973f85d6/DRL-Algorithm/PPO-serial/.DS_Store -------------------------------------------------------------------------------- /DRL-Algorithm/PPO-serial/README.md: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NeuronDance/DeepRL/8342ea71be1dffe26581d964f8011bcb973f85d6/DRL-Algorithm/PPO-serial/README.md -------------------------------------------------------------------------------- /DRL-Algorithm/PPO-serial/TRPO/README.md: -------------------------------------------------------------------------------- 1 | ## Trust Region Policy Optimization 2 | 3 | ### Add-on 4 | [TRPO学习笔记1](https://github.com/NeuronDance/DeepRL/blob/master/DRL%E5%89%8D%E6%B2%BF%E7%AE%97%E6%B3%95(16%E7%A7%8D)/TRPO/TRPO%E6%95%B4%E7%90%86%E7%AC%94%E8%AE%B0.md) 贡献者:@[zanghyu](https://github.com/zanghyu) 5 | -------------------------------------------------------------------------------- /DRL-Algorithm/PPO-serial/TRPO/TRPO整理笔记.md: -------------------------------------------------------------------------------- 1 | # Trust Region Policy Optimization 2 | 3 | > 注意:为了友好显示latex公式,请大家自行加载google浏览器latex数学显示插件:[MathJax Plugin for 
Github点击进入下载页面](https://github.com/mathjax/MathJax/releases) 4 | 5 | 这篇文章是policy gradient方面很经典的一篇文章,作者是John Schulman, Sergey Levine, Philipp Moritz, Michael Jordan, Pieter Abbeel。 6 | 7 | 稍微多提一句,Michael Jordan是吴恩达的老师,是统计学派的领军人物。而Pieter Abbeel是吴恩达的徒弟。Sergey Levine是Abbeel的博后,John Schulman也是师从Abbeel。这里Abbeel的学生都是来自UCB的,他们大多和OpenAI有所交集。一作Schulman现在在领导OpenAI团队的Games Team。二作Levine目前在教UCB的cs294课程,质量很高。 8 | 9 | 从这里我们就可以看出,文章应该是一篇比较具有创新性并且foundational的工作,肯定会有不少理论推导和证明。 10 | 11 | 题外话说完了,接下来我来理一理这篇文章的思路由来以及发展。 12 | 13 | ## 背景 14 | 15 | ### 关于policy optimization(策略优化) 16 | 17 | Policy optimization主要分为两个方面: 18 | 19 | - Policy gradient methods (策略梯度方法) 20 | 21 | - Derivative-free optimization methods (无梯度方法) 22 | 23 | ​ \- Cross Entropy Method(交叉熵方法)and Covariance Matrix Adaptation ( 自适应协方差矩阵) 24 | 25 | 26 | 其中我们在这里重点考虑的是策略梯度的方法,本文将要介绍的TRPO就是属于Policy Gradient的子算法。 27 | 28 | ### 关于Policy gradient 29 | 30 | #### PG的优缺点 31 | 32 | **优点**: 33 | 34 | 1. PG的方法可以应用于高维连续动作空间 35 | 2. 样本复杂性能够得到保证 36 | 3. 从理论上来讲policy-based比value-based可解释性更强一点 37 | 38 | **缺点**: 39 | 40 | 1. 收敛到局部最优而不是全局最优(如果有bad samples 可能就会收敛到不好的地方) 41 | 2. 样本利用率低 42 | 3. 策略评估效率低且方差大 43 | 44 | 因此我们要想办法克服其缺点,提高其表现能力。 45 | 46 | #### 回顾Policy gradient 47 | 48 | ![](note_img/pg_formula.png) 49 | 50 | 策略梯度如图所示,在该式子中,整体的策略梯度是找到最大reward对应的最陡峭的方向(即梯度最大方向),而后面的期望中告诉我们这个梯度是要sample一条trajectory得出来的。(个人理解:这里的trajectory是指从游戏开始到游戏结束的整条轨迹,而有时候提到的rollout是指trajectory中的某一小段)。 51 | 52 | 从这个公式中我们也能看出来policy gradient的一些问题: 53 | 54 | 1. 我们从策略中sample了一整条的轨迹只为了更新一次梯度,而不是在每个时间点上去更新梯度。如果我们假设一个轨迹有几百步这么多,当这些都用来更新一个策略的时候,稍微有一个地方出现偏差那么整个更新过程就会变得十分不稳定,这对于训练来说是十分不利的。 55 | 56 | 2. PG的方法假设其更新是平坦的,如果是在比较陡峭的地方可能就会产生比较差的结果。当一步更新过大时可能会导致灾难,更新过小又会导致学习过慢(如下图)。 57 | 58 | ​ ![](note_img/mountain.png) 59 | 60 | ​ ![](note_img/cliff.png) 61 | 62 | 如图所示,假设我们最开始处于黄点的位置,在这里地形比较平坦,因而我们要设置较大的学习率,以获得比较好的学习速度。但是假设有一个不太好的动作,我们就会从黄点跌落到红点。由于红点处的梯度很高,比较大的学习率可能会让策略向更低差的地方跌下去。因此实际上学习率对梯度的不敏感导致policy gradient的方法受到收敛问题的干扰。 63 | 64 | ### 记号 65 | 66 | $Q_{\pi}\left(s_{t}, a_{t}\right)=\mathbb{E}_{s_{t+1}, a_{t+1}, \ldots}\left[\sum_{l=0}^{\infty} \gamma^{\prime} r\left(s_{t+l}\right)\right]$ 67 | 68 | $V_{\pi}\left(s_{t}\right)=\mathbb{E}_{a_{t}, s_{t+1}, \cdots}\left[\sum_{l=0}^{\infty} \gamma^{\prime} r\left(s_{t+l}\right)\right]$ 69 | 70 | $\begin{aligned} A_{\pi}(s, a)=& Q_{\pi}(s, a)-V_{\pi}(s), \text { where } a_{t} \sim \pi\left(a_{t} | s_{t}\right) \\ s_{t+1} & \sim P\left(s_{t+1} | s_{t}, a_{t}\right) \text { for } t \geq 0 \end{aligned}$ 71 | 72 | $\begin{array}{l}{\eta(\pi)=\mathbb{E}_{s_{0}, a_{0}, \ldots}\left[\sum_{t=0}^{\infty} \gamma^{t} r\left(s_{t}\right)\right], \text { where }} \\ {s_{0} \sim \rho_{0}\left(s_{0}\right), a_{t} \sim \pi\left(a_{t} | s_{t}\right), s_{t+1} \sim P\left(s_{t+1} | s_{t}, a_{t}\right)}\end{array}$ 73 | 74 | $\rho_{\pi}(s)=P\left(s_{0}=s\right)+\gamma P\left(s_{1}=s\right)+\gamma^{2} P\left(s_{2}=s\right)+\dots$ 75 | 76 | 其中,$A$ 是优势函数,即该动作相对于平均而言有多少额外reward。$\eta_\pi$ 是期望的折扣reward。$\rho_\pi$ 是加了折扣的state访问频率。 77 | 78 | ### 从记号出发 79 | 80 | 首先根据上面的记号,我们可以得出: 81 | 82 | ​ $\eta\left(\pi^{\prime}\right)=\eta(\pi)+E_{\pi^{\prime}}\left[\sum_{t=0}^{\infty} \gamma^{t} A_{\pi}\left(s_{t}, a_{t}\right)\right]​$ 83 | 84 | 为什么这个式子是成立的呢? 85 | 86 | 1. 这里的$A_\pi$ 告诉我们做这个动作会给我们多少额外的reward 87 | 2. 将这个$A$ 按照时间加起来,告诉了我们这条轨迹上总的额外reward 88 | 3. 
将这个轨迹的结果按照新的policy求期望,得到的就是平均额外的reward 89 | 90 | 需要注意的一点是这个公式告诉了我们不同策略下我们能够得到多少reward加成。而这个reward加成是按照轨迹来分析的,因为期望里面是按照时间顺序从0开始到正无穷的。由于我们之前分析了按照轨迹进行PG的样本利用率很低,因此我们需要将这个式子进行一定程度的变换。 91 | 92 | ### 展开分析 93 | 94 | 我们可以将上式改写为: 95 | 96 | ​ $ \begin{aligned} \eta\left(\pi^{\prime}\right) &=\eta(\pi)+\sum_{t=0} \sum_{s} P\left(s_{t}=s | \pi^{\prime}\right) \sum_{a} \pi^{\prime}(a | s) \gamma^{t} A_{\pi}(s, a) \\ &=\eta(\pi)+\sum_{s} \sum_{s}^{\infty} \gamma^{t} P\left(s_{t}=s | \pi^{\prime}\right) \sum_{a} \pi^{\prime}(a | s) A_{\pi}(s, a) \\ &=\eta(\pi)+\sum_{s} \rho_{\pi'}(s) \sum_{a} \pi^{\prime}(a | s) A_{\pi}(s, a) \end{aligned}$ 97 | 98 | 这里的第一行式子是将上面的式子的期望进行了展开,最后这里得到的结果是按照state来分析的,这明显就要比上面的按照轨迹分析要好一点。 99 | 100 | 那么我们现在可以分析一下这个式子,这个式子告诉了我们什么事情呢? 101 | 102 | 由于$\rho_{\pi'}$ 是代表访问频率的,这个是不可能小于0的,所以我们只要保证$ \sum_{a} \pi^{\prime}(a) A_{\pi}(s, a) \geq 0$ ,那么就可以得到$\sum_{s} \rho_{\pi^{\prime}}(s) \sum_{a} \pi^{\prime}(a | s) A_{\pi}(s, a) \geq 0$ 。也就是说$\eta\left(\pi^{\prime}\right) \geq \eta(\pi) $ ,我们新的策略一定不会比旧的策略差。 103 | 104 | 但是这里也有一个缺点:在这个式子里,$\pi'$ 是我们的未知量,$\pi$ 是我们的已知量。而我们的右式中出现了$\rho_{\pi'}$ 和$\pi'$ 这两项。我们没有办法用未知的这两项去求我们的目标$\pi'$ ,因此这个式子实际上是无法应用的。 105 | 106 | 那么现在直接计算的路子已经堵死了,我们还能如何运用这个式子呢?没错,我们用近似的方法。 107 | 108 | 首先我们进行一定程度的假设,假设策略的改变不影响state的访问频率,即$\rho_{\pi^{\prime}}(s) \approx \rho_{\pi}(s)​$ 。 109 | 110 | 注意:**这里只是我们的一个假设,并不是他们真正相等** 111 | 112 | 那么我们就可以得到: 113 | 114 | ​ $\eta\left(\pi^{\prime}\right) \approx \eta(\pi)+\sum_{s} \rho_{\pi}(s) \sum_{a} \pi^{\prime}(a) A_{\pi}(s, a)$ 115 | 116 | 重写一下这个式子: 117 | 118 | ​ $\Delta_{\eta}=\eta\left(\pi^{\prime}\right)-\eta(\pi) \approx \sum_{s} \rho_{\pi}(s) \sum_{a} \pi(a) *\left[\frac{\pi^{\prime}(a)}{\pi(a)} A^{\pi}(s, a)\right]$ 119 | 120 | 我们会得到这样形式的一个表达式。为什么这个式子对我们有用,要比上面的好呢? 121 | 122 | 1. 这里的$\sum_{s} \rho_{\pi}(s) \sum_{a} \pi(a)$ 是在旧的policy下的样本 123 | 2. $A^{\pi}(s, a)$ 是旧的policy下的优势函数 124 | 3. $\frac{\pi^{\prime}(a)}{\pi(a)}$ 我们可以用important sampling的思想来做 125 | 126 | 因此这个式子就变成了可以使用的了。 127 | 128 | 记住,这个式子可用的前提是$\rho_{\pi^{\prime}}(s) \approx \rho_{\pi}(s)$ 。那么这个条件到底是否成立呢?我们需要验证一下: 129 | 130 | 首先让$\pi^{\prime}(a)=\pi(a)+\Delta \pi_{a}$ ,接下来我们可以得到: 131 | 132 | ​ $ \sum_{a} \pi^{\prime}(a) A_{\pi}(s, a)=\sum_{a} \pi(a) A_{\pi}(s, a)+\sum_{a} \Delta \pi_{a} A_{\pi}(s, a)$ , 133 | 134 | 在这里面,根据记号来说,$A_{\pi}(s, a)=Q_{\pi}(s, a)-V_{\pi}(s)$ ,因此$\sum_{a} \pi(a) A_{\pi}(s, a)=\sum_{a} \pi(a)\left(Q_{\pi}(s, a)-V_{\pi}(s)\right)=\sum_{a}\left(V_{\pi}(s)-V_{\pi}(s)\right)=0$ 135 | 136 | ​ 所以, $\sum_{a} \pi^{\prime}(a) A_{\pi}(s, a)=0+\sum_{a} \Delta \pi_{a} A_{\pi}(s, a)$ 137 | 138 | 再让$\rho_{\pi^{\prime}}(s)=\rho_{\pi}(s)+\Delta \rho$ ,我们就可以得到: 139 | 140 | ​ $\Delta \eta=\sum_{s}\left(\rho_{\pi}(s)+\Delta \rho\right)\left(\sum_{a} \Delta \pi_{a} A_{\pi}(s, a)\right)$ 141 | 142 | ​ $\Delta \eta=\sum_{s} \rho_{\pi}(s)\left(\sum_{a} \Delta \pi_{a} A_{\pi}(s, a)\right)+\sum_{s} \Delta \rho\left(\sum_{a} \Delta \pi_{a} A_{\pi}(s, a)\right)$ 143 | 144 | 这里的第一项就是我们想要的,第二项我们可以看出是一个二阶的项。那么我们能否忽略或者近似第二项呢? 
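Before moving on, here is a minimal NumPy sketch of how the first (usable) term above, i.e. the sample-based surrogate $\sum_{s} \rho_{\pi}(s) \sum_{a} \pi(a|s)\left[\frac{\pi^{\prime}(a|s)}{\pi(a|s)} A^{\pi}(s, a)\right]$, would be estimated from data collected under the old policy $\pi$ via importance sampling. This is not code from the paper; the function and array names are purely illustrative.

```python
import numpy as np

def surrogate_objective(old_probs, new_probs, advantages):
    """Monte-Carlo estimate of the surrogate L_pi(pi') from a batch of
    (s_t, a_t) pairs sampled under the old policy pi.

    old_probs:  pi(a_t | s_t)  for each sampled pair, shape (N,)
    new_probs:  pi'(a_t | s_t) for the same pairs,    shape (N,)
    advantages: A_pi(s_t, a_t) estimated under pi,    shape (N,)
    """
    ratio = new_probs / old_probs       # importance-sampling ratio pi'(a|s) / pi(a|s)
    return np.mean(ratio * advantages)  # averaging over samples replaces the sums over s and a

# toy usage with made-up numbers
np.random.seed(1)
old_probs = np.random.uniform(0.1, 0.9, size=128)
new_probs = np.clip(old_probs + np.random.normal(0, 0.05, size=128), 1e-3, 1.0)
advantages = np.random.normal(0, 1, size=128)
print(surrogate_objective(old_probs, new_probs, advantages))
```

The same ratio $\frac{\pi_{\theta}(a|s)}{\pi_{\theta_{old}}(a|s)}$ reappears later in the 重要性采样 section as the quantity being maximized subject to the KL constraint.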
145 | 146 | ### Conservative Greedy Algorithm 147 | 148 | 首先要说明的是,TRPO的整篇工作都是基于Kakade在2002年ICML上的论文《Approximately Optimal Approximate Reinforcement Learning》的工作进一步发展而来的,因此Kakade提出的这个方法是理解TRPO的关键。 149 | 150 | 我们刚才说到了$\Delta\eta​$ 中的第二项要想办法近似或者忽略,这篇论文中就提出了一个方案: 151 | 152 | ​ $\eta\left(\pi_{n e w}\right)-\eta(\pi) \geq A_{\pi}\left(\pi_{n e w}\right)-\frac{2 \epsilon \gamma}{(1-\gamma)} \alpha^{2}$ ,其中, 153 | 154 | ​ $\pi_{n e w}=(1-\alpha) \pi+\alpha \pi^{\prime}​$ 155 | 156 | ​ $\epsilon=\frac{1}{1-\gamma}\left(\max _{s} \sum_{a} \pi^{\prime}(s, a) A^{\pi}(s, a)\right)$ 157 | 158 | ​ $A_{\pi}\left(\pi^{n e w}\right) :=\sum_{s} \rho_{\pi}(s) \sum_{a} \pi^{n e w}(s, a) A^{\pi}(s, a)$ 159 | 160 | 注意这里关于策略有三个不同的量:$\pi​$,$\pi'​$ 和$\pi_{new}​$ 。其中$\pi_{new}​$是根据前两个计算而来。这里的$\alpha​$ 和$\pi'​$ 都是可以通过计算得到的值。 161 | 162 | 这个式子就告诉了我们当$\pi_{new}$ 的更新符合$\pi_{n e w}=(1-\alpha) \pi+\alpha \pi^{\prime}$ 的时候,我们策略的更新是有保障的提高的。整个论文的算法流程如下图: 163 | 164 | ![](note_img/cga.png) 165 | 166 | 但是该算法仍然有一些问题,比如这里$\alpha$ 计算起来比较困难,而且该算法只能保证$\pi_{new}$ 在其更新公式下更新才能有保证。也就是说每次我们在更新策略的时候都要先计算好对应的$\alpha$,然后才能得到保证单调变好的策略。这其实并不利于应用,我们想要更加通用的方法来解决policy的问题,用一种不需要$\alpha$ 的方式。 167 | 168 | ### TRPO 169 | 170 | 于是针对于Kakade提出来的公式,TRPO进行了一定程度的改进: 171 | 172 | ​ $\eta\left(\pi_{\text { new }}\right) \geq L_{\pi_{\text { ald }}}\left(\pi_{\text { new }}\right)-\frac{4 e \gamma}{(1-\gamma)^{2}} \alpha^{2}​$ ,其中, 173 | 174 | ​ $\epsilon=\max _{s, a}\left|A_{\pi}(s, a)\right|​$ 175 | 176 | ​ $L_{\pi}\left(\pi_{n e w}\right)=\eta(\pi)+\sum_{s} \rho_{\pi}(s) \sum_{a} \pi_{n e w}(a | s) A_{\pi}(s, a)​$ 177 | 178 | ​ $P\left(a \neq a_{\text { new }} | s\right) \leq \alpha​$ 179 | 180 | 对比一下两个公式就可以发现,在该式中,我们将$\pi'$ 去掉了,也就是说任意$\pi_{new}$ 都是可以保证满足这个不等式的。而这里的$\alpha$与上面的定义也有所不同。具体该式的证明过程可以在论文附录中找到,与Kakade2002中证明过程基本相同。 181 | 182 | 这个式子主要的贡献就是我们不需要再混合着$\alpha$ 才能求$\pi_{new}$ 了,在这种情况下我们一样得到了一个能够保证不断提高的策略。 183 | 184 | 接下来就是我们如何一步一步将这个式子进行近似,进行分析,将他能够利用神经网络来得到解的过程。 185 | 186 | ### TRPO的深入 187 | 188 | #### KL散度 189 | 190 | 在上面的式子中,$\alpha$ 参数的含义是$P\left(a \neq a_{\text { new }} | s\right) \leq \alpha$ ,这里实际上说明了$\alpha$ 代表的是total variation divergence,即总变化距离。在这里,我们用最大变化距离来代替总变化距离: 191 | 192 | ​ $D_{\mathrm{TV}}^{\max }(\pi, \overline{\pi})=\max _{s} D_{T V}(\pi(\cdot | s) \| \tilde{\pi}(\cdot | s))$ ,其中 $D_{T V}(p \| q)=\frac{1}{2} \sum_{i}\left|p_{i}-q_{i}\right|$ 193 | 194 | 在[Pollard,2000]中证明了$D_{KL} \geq D_{TV}^2$ ,我在TRPO的reference中找到的该文链接已经挂了,所以找到了额外的两个pdf中分别给予了不同的证明: [链接1](http://people.seas.harvard.edu/~madhusudan/courses/Spring2016/scribe/lect07.pdf) 中的第4页和[链接2](http://www.stat.yale.edu/~pollard/Courses/607.spring05/handouts/Totalvar.pdf)中的problem 8 。 195 | 196 | 由于我们用最大变化距离代替了总的变化距离,因此这里我们也应当用最大的KL散度: 197 | 198 | ​ $D_{\mathrm{KL}}^{\max }(\pi, \tilde{\pi})=\max _{s} D_{\mathrm{KL}}(\pi(\cdot | s) \| \tilde{\pi}(\cdot | s))$ 199 | 200 | 这样我们就可以将之前的式子改写成: 201 | 202 | ​ $\begin{array}{r}{\eta(\tilde{\pi}) \geq L_{\pi}(\tilde{\pi})-C D_{\mathrm{KL}}^{\max }(\pi, \tilde{\pi})}\end{array}$ ,其中 $C=\frac{4 \epsilon \gamma}{(1-\gamma)^{2}}$ 203 | 204 | #### MM算法 205 | 206 | 让$M_{i}(\pi)=L_{\pi_{i}}(\pi)-C D_{\mathrm{KL}}^{\max }\left(\pi_{i}, \pi\right)$ ,然后将这个式子再重写一下,我们得到: 207 | 208 | ​ $\eta\left(\pi_{i+1}\right)-\eta\left(\pi_{i}\right) \geq M_{i}\left(\pi_{i+1}\right)-M\left(\pi_{i}\right)$ 209 | 210 | 我们分析得到的式子可以发现,在$\pi_i$ 确定的情况下,$\eta(\pi_i)$ 和$M(\pi_i)$ 就确定了,那么只要我们增大$M_i(\pi_{i+1})$ ,就相当于优化了$\eta(\pi_{i+1})​$ 。这实际上就是利用了MM算法,我们在无法优化目标函数的情况下,转而优化其下界,他的下界提高的时候,目标函数也就得到了提高。 211 | 212 | ![](note_img/MM.png) 213 | 214 | #### 置信域 215 | 216 | 
现在我们有了明确的优化目标,我们想要得到: 217 | 218 | ​ $\max _{\pi_{i}}\left(\mathcal{L}_{\pi}\left(\pi_{i}\right)-C D_{K L}^{\max }\left(\pi_{i}, \pi\right)\right)​$ 219 | 220 | 在实际优化过程中,我们会发现这个式子会有个问题:当$\gamma$ (折扣因子)越接近于1的时候,这里的$C$ 会变的越大,相对应的梯度就会变得越小。一个比较好的方案是把这里的$C$ 当做超参数来处理,但是$C$ 又不能是一个固定不变的超参数,而应该是一个adaptive的值。这就导致调节这个超参数比较复杂。因此我们需要一个替代方案。 221 | 222 | 我们可以发现,原式是一个加$KL$散度做惩罚的优化问题,可以通过拉格朗日对偶转化为一个约束$KL$散度的优化问题。即转化为: 223 | 224 | ​ $\begin{array}{l}{\max _{\pi_{i}} \mathcal{L}_{\pi}\left(\pi_{i}\right)} , {\text { s.t. } D_{K L}^{\max }\left(\pi_{i}, \pi\right) \leq \delta}\end{array}$ 225 | 226 | 这里的$\delta$ 也是一个超参数,不过这个超参数就要比刚才的$C$ 好调节的多了。 227 | 228 | #### 平均$KL$ 散度 229 | 230 | 注意,上式中虽然问题得到了简化,但是约束条件中是要让最大的$KL$ 小于一个超参数,求最大$KL$ 的时候就肯定会把每一对都计算一遍,这个计算复杂度就要高很多。那么我们想用近似替代的方案,把最大$KL$ 用平均$KL$ 替代: 231 | 232 | ​ $\overline{D}_{\mathrm{KL}}^{\rho}\left(\theta_{1}, \theta_{2}\right) :=\mathbb{E}_{s \sim \rho}\left[D_{\mathrm{KL}}\left(\pi_{\theta_{1}}(\cdot | s) \| \pi_{\theta_{2}}(\cdot | s)\right)\right]$ 233 | 234 | 这样我们只要保证从$\rho$ 中sample出来的state的$KL$ 散度都是满足条件的,那么我们就认为最大$KL$ 也是满足条件的。 235 | 236 | #### 重要性采样 237 | 238 | 这里的重要性采样已经在上文提过了,就是将 239 | 240 | ​ $\begin{array}{l}{\text { maximize } L_{\theta_{\text { old }}(\theta)}}, {\text { subject to } \overline{D}_{\mathrm{KL}}^{\rho_{\theta_{\text { old }}}}\left(\theta_{\text { old }}, \theta\right) \leq \delta}\end{array}$ 241 | 242 | 转化为 243 | 244 | ​ $\begin{array}{l}{\text { maximize } \mathbb{E}_{s \sim \rho_{\text { old }}, a \sim \pi_{\text { old }}}\left[\frac{\pi_{\theta}(a | s)}{\pi_{\theta_{\text { old }}}(a | s)} Q_{\theta_{\text { old }}}(s, a)\right]} , {\text { subject to } \mathbb{E}_{s \sim \rho_{\text { old }}}\left[D_{K L}\left(\pi_{\theta_{\text { old }}}(\cdot | s) \| \pi_{\theta}(\cdot | s)\right)\right] \leq \delta}\end{array}$ 245 | 246 | 即将所有我们需要用到的地方都用sample的形式表达出来。 247 | 248 | #### 计算梯度 249 | 250 | 现在目标公式已经很清晰了,接下来就是我们应该如何去优化它,计算其梯度。 251 | 252 | 首先我们将上式进行泰勒展开,展开到二阶的形式: 253 | 254 | ![](note_img/taylor.png) 255 | 256 | 将等于0的项去掉,我们就可以得到: 257 | 258 | ​ $\mathcal{L}_{\theta_{k}}(\theta) \approx g^{T}\left(\theta-\theta_{k}\right)$ ,其中 $g \doteq \nabla_{\theta} \mathcal{L}_{\theta_{k}}\left.(\theta)\right|_{\theta_{k}}$ 259 | 260 | ​ $\overline{D}_{K L}(\theta \| \theta_{k}) \approx \frac{1}{2}\left(\theta-\theta_{k}\right)^{T} H\left(\theta-\theta_{k}\right)$ ,其中 $H \doteq \nabla_{\theta}^{2} \overline{D}_{K L}\left.(\theta \| \theta_{k})\right|_{\theta_{k}}$ 261 | 262 | 所以原式从: 263 | 264 | ​ $ \pi_{k+1}= \arg \max _{\pi^{\prime}} \mathcal{L}_{\pi_{k}}\left(\pi^{\prime}\right), \text { s.t. } \overline{D}_{K L}\left(\pi^{\prime} \| \pi_{k}\right) \leq \delta $ 265 | 266 | 转变为: 267 | 268 | ​ $\begin{array}{l}{\theta_{k+1}=\arg \max _{\theta} g^{T}\left(\theta-\theta_{k}\right)}, {\text { s.t. 
} \frac{1}{2}\left(\theta-\theta_{k}\right)^{T} H\left(\theta-\theta_{k}\right) \leq \delta}\end{array}$ 269 | 270 | 这里同时告诉了我们是如何从策略空间转移到参数空间的。 271 | 272 | 注意,这里的$H$ 就是两种$\theta$ 之间的$KL$ 的二阶导数。实际上这个$H$ 也被称为Fisher Information Matrix。举个例子,比如在欧式空间中,我们常用的有笛卡尔坐标系和极坐标系。距离分别是用 273 | 274 | ​ $d=\sqrt{\left(x_{2}-x_{1}\right)^{2}+\left(y_{2}-y_{1}\right)^{2}}$ 和 $d=\sqrt{r_{1}^{2}+r_{2}^{2}-2 r_{1} r_{2} \cos \left(\theta_{1}-\theta_{2}\right)}$ 275 | 276 | 来计算的。这是因为我们平时都假设空间是平坦的。但是如果空间非平坦,那么实际上距离的计算应该为: 277 | 278 | ​ $d=\Delta \mathrm{w}^{\mathrm{T}} \mathrm{G}_{\mathrm{w}} \Delta \mathrm{w}$ 279 | 280 | 来计算。而这里的$G$就相当于一个FIM。对应到笛卡尔坐标系和极坐标系,距离则分别为: 281 | 282 | ​ $d=\left( \begin{array}{ll}{\delta \mathbf{x}} & {\delta \mathbf{y}}\end{array}\right) \left( \begin{array}{ll}{1} & {0} \\ {0} & {1}\end{array}\right) \left( \begin{array}{l}{\delta \mathbf{x}} \\ {\delta \mathbf{y}}\end{array}\right)$ 和$d=\left( \begin{array}{ll}{\delta r} & {\delta \theta}\end{array}\right) \left( \begin{array}{ll}{1} & {0} \\ {0} & {r}\end{array}\right) \left( \begin{array}{l}{\delta r} \\ {\delta \theta}\end{array}\right)$ 283 | 284 | 这个公式也说明了距离是可以通过这个公式以任一坐标系下的对应计算得到,而不同坐标系也就对应了不同的$G$。从这里我们也可以发现,其实我们的约束条件$\frac{1}{2}\left(\theta-\theta_{k}\right)^{T} H\left(\theta-\theta_{k}\right) \leq \delta$ ,实际上就相当于约束了策略之间的距离要小于一个值。这里的$G$ 也就相当于衡量了策略对于模型参数$\theta$ 的敏感程度。客观上说明了策略梯度的算法具有模型不变性。 285 | 286 | #### 共轭梯度法 287 | 288 | 得到了我们的参数空间下的目标函数后,可以通过数学推导得出解析式: 289 | 290 | ​ $\theta_{k+1}=\theta_{k}+\sqrt{\frac{2 \delta}{g^{T} H^{-1} g}} {H^{-1} g}$ 291 | 292 | 虽然我们可以求得这里的矩阵$H$,但是在式中我们还要求$H^{-1}$ ,这个相对来说是一个比较难以求解。TRPO提供的思路是将$H^{-1}g$ 整体进行计算。令 293 | 294 | ​ $x_{k} \approx \hat{H}_{k}^{-1} \hat{g}_{k}$ ,那么 $\hat{H}_{k} x_{k} \approx \hat{g}$ 295 | 296 | 我们可以发现实际上要解的问题就变成了求$A x=b$ 中$x$ 的问题。这相当于 297 | 298 | ​ $f(x)=\frac{1}{2} x^{T} A x-b^{T} x$ ,因为其$f^{\prime}(x)=A x-b=0$ 299 | 300 | 这个问题往往用共轭梯度(Conjugate Gradient)的方法进行求解,具体的关于CG的相关内容可以见[链接](https://medium.com/@jonathan_hui/rl-conjugate-gradient-5a644459137a) 中。 301 | 302 | ### TRPO最终算法 303 | 304 | 最后我们得到了最终的整个算法流程: 305 | 306 | ![](note_img/trpo.jpg) 307 | 308 | ### TRPO的一些限制 309 | 310 | - 在TRPO算法中忽略了关于advantage函数的评价方法的error 311 | - 由于有矩阵$H$ 的存在,TRPO很难使用一阶的优化方法 312 | 313 | ### PPO 314 | 315 | 现在我们知道,整个TRPO算法中,最复杂的部分就是$\max _{\pi_{i}}\left(\mathcal{L}_{\pi}\left(\pi_{i}\right)-C D_{K L}^{\max }\left(\pi_{i}, \pi\right)\right)$ 的目标函数中$C$ 难以设置。而PPO的一个改进版本就是将$C$ 设置成了一个自适应的超参数,从而简化了目标函数的优化过程。 316 | 317 | 关于PPO的部分可以参照[链接](https://medium.com/@jonathan_hui/rl-proximal-policy-optimization-ppo-explained-77f014ec3f12) 318 | 319 | ### 引用 320 | 321 | 除了几个链接的引用之外,还用到了如下的一些资料: 322 | 323 | [1] RL — Trust Region Policy Optimization (TRPO) Explained. 324 | 325 | [2] Overview of the TRPO RL paper/algorithm. 326 | 327 | [3] Lecture 7: Approximately optimal approximate RL, TRPO. 328 | 329 | [4] Trust Region Policy Optimization. 330 | 331 | [5] Approximately Optimal Approximate Reinforcement Learning. 332 | 333 | [6] Lecture 14 Natural Policy Gradients, TRPO, PPO, ACKTR. 334 | 335 | [7] Natural Policy Gradients, TRPO, PPO. 
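### 附:共轭梯度与自然梯度更新的示意代码

对应上文"计算梯度"和"共轭梯度法"两节,这里补一段极简的NumPy草图(函数名与接口均为本笔记假设的记号,只演示用CG近似求解 $Hx=g$ ,以及 $\theta_{k+1}=\theta_{k}+\sqrt{2 \delta /\left(g^{T} H^{-1} g\right)} H^{-1} g$ 这一步,代码中用 $x \approx H^{-1}g$ 改写;实际TRPO实现中 $Hv$ 一般通过对平均KL做两次自动求导得到,更新后还会做回溯线搜索来保证约束满足和目标改进,这些细节此处省略):

```python
import numpy as np

def conjugate_gradient(hvp, g, iters=10, tol=1e-10):
    """用共轭梯度近似求解 H x = g,只需提供 Hessian-向量积 hvp(v) ≈ H v,无需显式构造或求逆 H。"""
    x = np.zeros_like(g)
    r = g.copy()              # 初始 x=0 时残差 r = g - Hx = g
    p = r.copy()
    r_dot = r @ r
    for _ in range(iters):
        Hp = hvp(p)
        alpha = r_dot / (p @ Hp + 1e-8)
        x += alpha * p        # 沿共轭方向更新解
        r -= alpha * Hp       # 更新残差
        new_r_dot = r @ r
        if new_r_dot < tol:
            break
        p = r + (new_r_dot / r_dot) * p
        r_dot = new_r_dot
    return x

def natural_gradient_step(theta, g, hvp, delta=0.01):
    """θ_{k+1} = θ_k + sqrt(2δ / xᵀHx) · x,其中 x ≈ H⁻¹g 由上面的 CG 给出(xᵀHx = gᵀH⁻¹g)。"""
    x = conjugate_gradient(hvp, g)
    step = np.sqrt(2.0 * delta / (x @ hvp(x) + 1e-8))
    return theta + step * x
```

这里把 $H^{-1}g$ 整体当作线性方程组的解来求,正是上文"TRPO提供的思路"所描述的做法:每次迭代只需要一次 $Hv$ 乘积,避免了显式求逆矩阵。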
336 | -------------------------------------------------------------------------------- /DRL-Algorithm/PPO-serial/TRPO/note_img/MM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NeuronDance/DeepRL/8342ea71be1dffe26581d964f8011bcb973f85d6/DRL-Algorithm/PPO-serial/TRPO/note_img/MM.png -------------------------------------------------------------------------------- /DRL-Algorithm/PPO-serial/TRPO/note_img/cga.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NeuronDance/DeepRL/8342ea71be1dffe26581d964f8011bcb973f85d6/DRL-Algorithm/PPO-serial/TRPO/note_img/cga.png -------------------------------------------------------------------------------- /DRL-Algorithm/PPO-serial/TRPO/note_img/cliff.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NeuronDance/DeepRL/8342ea71be1dffe26581d964f8011bcb973f85d6/DRL-Algorithm/PPO-serial/TRPO/note_img/cliff.png -------------------------------------------------------------------------------- /DRL-Algorithm/PPO-serial/TRPO/note_img/mountain.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NeuronDance/DeepRL/8342ea71be1dffe26581d964f8011bcb973f85d6/DRL-Algorithm/PPO-serial/TRPO/note_img/mountain.png -------------------------------------------------------------------------------- /DRL-Algorithm/PPO-serial/TRPO/note_img/pg_formula.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NeuronDance/DeepRL/8342ea71be1dffe26581d964f8011bcb973f85d6/DRL-Algorithm/PPO-serial/TRPO/note_img/pg_formula.png -------------------------------------------------------------------------------- /DRL-Algorithm/PPO-serial/TRPO/note_img/taylor.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NeuronDance/DeepRL/8342ea71be1dffe26581d964f8011bcb973f85d6/DRL-Algorithm/PPO-serial/TRPO/note_img/taylor.png -------------------------------------------------------------------------------- /DRL-Algorithm/PPO-serial/TRPO/note_img/trpo.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NeuronDance/DeepRL/8342ea71be1dffe26581d964f8011bcb973f85d6/DRL-Algorithm/PPO-serial/TRPO/note_img/trpo.jpg -------------------------------------------------------------------------------- /DRL-Algorithm/README.md: -------------------------------------------------------------------------------- 1 | # DRL 前沿算法 2 | 3 | ## DRL开山之作 4 | Deep Mind 使用 DQN 让机器人玩 Atari 游戏并达到(超越)人类顶级选手水平: 5 | 6 | + [Playing Atari with Deep Reinforcement Learning(2013)](https://arxiv.org/pdf/1312.5602.pdf) 7 | 8 | + [Human-level control through deep reinforcement learning(2015)](https://daiwk.github.io/assets/dqn.pdf) 9 | 10 | ## AlphaGo 及后续 11 | 12 | + AlphaGo: [Mastering the game of Go with deep neural networks and tree search(2016)](https://storage.googleapis.com/deepmind-media/alphago/AlphaGoNaturePaper.pdf) 13 | 14 | + AlphaGo Zero: [Mastering the Game of Go without Human Knowledge(2017)](http://discovery.ucl.ac.uk/10045895/1/agz_unformatted_nature.pdf) 15 | 16 | + AlphaZero: [Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm(2017)](https://arxiv.org/pdf/1712.01815.pdf) 17 | 18 | ## DQN 改进 19 | + Dueling DQN: 
[【Code】](https://github.com/NeuronDance/DeepRL/tree/master/DRL%E5%89%8D%E6%B2%BF%E7%AE%97%E6%B3%95/Dueling-DQN),[Dueling Network Architectures for Deep Reinforcement Learning(2015)](https://arxiv.org/pdf/1511.06581.pdf) 20 | 21 | + Double DQN: [【Code】](https://github.com/NeuronDance/DeepRL/tree/master/DRL%E5%89%8D%E6%B2%BF%E7%AE%97%E6%B3%95/Dueling-DQN) ,[Deep Reinforcement Learning with Double Q-learning(2015)](https://arxiv.org/pdf/1509.06461.pdf) 22 | 23 | + NAF: [Continuous Deep Q-Learning with Model-based Acceleration(2016)](https://arxiv.org/pdf/1603.00748.pdf) 24 | [PRIORITIZED EXPERIENCE REPLAY(2015)](https://arxiv.org/pdf/1511.05952.pdf) 25 | 26 | ## 基于策略梯度(Policy Gradient) 27 | + DPG: [Deterministic Policy Gradient Algorithms(2014)](https://hal.inria.fr/file/index/docid/938992/filename/dpg-icml2014.pdf) 28 | 29 | + DDPG: [CONTINUOUS CONTROL WITH DEEP REINFORCEMENT LEARNING(2015)](https://arxiv.org/pdf/1509.02971.pdf) 30 | 31 | + D4PG: [Distributed Distributional Deterministic Policy Gradients(2018)](https://openreview.net/pdf?id=SyZipzbCb) 32 | 33 | + TRPO: [Trust Region Policy Optimization(2015)](https://arxiv.org/pdf/1502.05477.pdf) 34 | 35 | + PPO: [Proximal Policy Optimization Algorithms(2017)](https://arxiv.org/pdf/1707.06347.pdf) 36 | 37 | + ACER: [SAMPLE EFFICIENT ACTOR-CRITIC WITH EXPERIENCE REPLAY(2017)](https://arxiv.org/pdf/1611.01224.pdf) 38 | 39 | + ACTKR: [Scalable trust-region method for deep reinforcement learning using Kronecker-factored approximation(2017)](https://arxiv.org/pdf/1708.05144.pdf) 40 | 41 | + SAC: [Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor(2018)](https://arxiv.org/pdf/1801.01290.pdf) 42 | 43 | ## 异步强化学习 44 | + A3C: [Asynchronous Methods for Deep Reinforcement Learning(2016)](http://arxiv.org/abs/1602.01783) 45 | 46 | + GA3C: [Reinforcement Learning through Asynchronous Advantage Actor-Critic on a GPU](https://openreview.net/pdf?id=r1VGvBcxl) 47 | 48 | ## 分层 DRL 49 | + [Deep Successor Reinforcement Learning(2016)](https://arxiv.org/pdf/1606.02396.pdf) 50 | 51 | + [Hierarchical Deep Reinforcement Learning: Integrating Temporal Abstraction and Intrinsic Motivation(2016)](https://arxiv.org/pdf/1604.06057.pdf) 52 | 53 | + [Hierarchical Reinforcement Learning using Spatio-Temporal Abstractions and Deep Neural Networks(2016)](https://arxiv.org/pdf/1605.05359v1.pdf) 54 | 55 | + [STOCHASTIC NEURAL NETWORKS FOR HIERARCHICAL REINFORCEMENT LEARNING(2017)](https://openreview.net/pdf?id=B1oK8aoxe) 56 | 57 | ## 逆强化学习 58 | + [Guided Cost Learning: Deep Inverse Optimal Control via Policy Optimization(2016)](https://arxiv.org/pdf/1603.00448v3.pdf) 59 | 60 | + [Maximum Entropy Deep Inverse Reinforcement Learning(2015)](https://arxiv.org/pdf/1507.04888v3.pdf) 61 | + [GENERALIZING SKILLS WITH SEMI-SUPERVISED REINFORCEMENT LEARNING(2017)](https://arxiv.org/pdf/1612.00429.pdf) 62 | 63 | ## 参考链接 64 | + [深度增强学习方向论文整理](https://zhuanlan.zhihu.com/p/23600620) 65 | 66 | + [Deep Reinforcement Learning Papers](https://github.com/muupan/deep-reinforcement-learning-papers) 67 | 68 | + [Policy Gradient Algorithms](https://lilianweng.github.io/lil-log/2018/04/08/policy-gradient-algorithms.html) 69 | 70 | + [Policy Gradient Algorithms](https://lilianweng.github.io/lil-log/2018/04/08/policy-gradient-algorithms.html) 71 | 72 | 73 | 74 | ## 致谢 75 | @J.Q.Wang 76 | -------------------------------------------------------------------------------- /DRL-Algorithm/SAC/README.md: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/NeuronDance/DeepRL/8342ea71be1dffe26581d964f8011bcb973f85d6/DRL-Algorithm/SAC/README.md -------------------------------------------------------------------------------- /DRL-Books/.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NeuronDance/DeepRL/8342ea71be1dffe26581d964f8011bcb973f85d6/DRL-Books/.DS_Store -------------------------------------------------------------------------------- /DRL-Books/README.md: -------------------------------------------------------------------------------- 1 | 说明:本栏目书籍仅供各位深度强化学习者、爱好者学习和交流使用,切勿商业使用! 2 | 3 | ### 书籍列表: 4 | 5 | 1、[强化学习圣经:Reinforcement Learning:An Introduction 2:edition Sutton](https://github.com/NeuronDance/DeepRL/blob/master/DRL%E4%B9%A6%E7%B1%8D/%E7%94%B5%E5%AD%90%E4%B9%A6%E5%8E%9F%E7%89%88/RL_sutton_2018.pdf)
6 | 2、[深度强化学习原理与入门](https://github.com/NeuronDance/DeepRL/blob/master/DRL%E4%B9%A6%E7%B1%8D/%E7%94%B5%E5%AD%90%E4%B9%A6%E5%8E%9F%E7%89%88/%E6%B7%B1%E5%85%A5%E6%B5%85%E5%87%BA%E5%BC%BA%E5%8C%96%E5%AD%A6%E4%B9%A0%EF%BC%9A%E5%8E%9F%E7%90%86%E5%85%A5%E9%97%A8.pdf)
7 | 3、[deep-reinforcement-learning-hands](https://github.com/NeuronDance/DeepRL/blob/master/DRL%E4%B9%A6%E7%B1%8D/%E7%94%B5%E5%AD%90%E4%B9%A6%E5%8E%9F%E7%89%88/deep-reinforcement-learning-hands.pdf)
8 | -------------------------------------------------------------------------------- /DRL-Competition/.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NeuronDance/DeepRL/8342ea71be1dffe26581d964f8011bcb973f85d6/DRL-Competition/.DS_Store -------------------------------------------------------------------------------- /DRL-Competition/README.md: -------------------------------------------------------------------------------- 1 | # 深度强化学习竞赛(按截止时间排列) 2 | ### 第一部分竞赛 3 | + ICRA2019 DJI Robomaster人工智能挑战赛 [直接访问](https://www.robomaster.com/zh-CN/robo/icra) 4 | + OpenAI首届强化学习竞赛Retro Contest [直接访问](https://blog.openai.com/retro-contest/) 5 | + Kaggle竞赛DataLabCup: Deep Reinforcement Learning [直接访问](https://www.kaggle.com/c/datalabcup-deep-reinforcement-learning) 6 | + DeepRacer竞赛,由AWS组织:[详情查看](https://aws.amazon.com/cn/deepracer/) 7 | 8 | 9 | ### 第二部分竞赛 10 | 11 | | Status | Began | Ends | Conference | Competition | 12 | | ------- | ---------- | ---------- | ---------- | ----------- | 13 | | Round 1 | 2019-06-03 | 2019-10-15 | NeurIPS 2019 | [Robot Open-Ended Autonomous Learning Challenge](https://www.aicrowd.com/challenges/neurips-2019-robot-open-ended-autonomous-learning) | 14 | | Round 1 | 2019-06-06 | 2019-11 | NeurIPS 2019 | [Learning to Move: Walk Around](https://www.aicrowd.com/challenges/neurips-2019-learning-to-move-walk-around) | 15 | | Round 1 | 2019-06-08 | 2019-10-25 | NeurIPS 2019 | [MineRL Competition 2019](https://www.aicrowd.com/challenges/neurips-2019-minerl-competition) | 16 | | Qualification Round | 2019-07-01 | 2019-11-21 | NeurIPS 2019 | [Game of Drones](https://www.microsoft.com/en-us/research/academic-program/game-of-drones-competition-at-neurips-2019/) | 17 | | Started | 2019-07-01 | 2019-12 | NeurIPS 2019 | [The Animal-AI Olympics](http://animalaiolympics.com) | 18 | | Round 1 | 2019-07-22 | 2019-12-01 | N/A | [Flatland Challenge](https://www.aicrowd.com/challenges/flatland-challenge) | 19 | | Test Tournament 1 | 2019-08-06 | 2019-12-23 | NeurIPS 2019 | [Reconnaissance Blind Chess](https://rbc.jhuapl.edu/) | 20 | | N/A | N/A | N/A | N/A | [AWS DeepRacer League](https://aws.amazon.com/deepracer/league/) | 21 | 22 | ## Upcoming Competitions 23 | 24 | | Status | Begins | Ends | Conference | Competition | 25 | | ------- | ---------- | ---------- | ---------- | ----------- | 26 | | N/A | - | 2019-11-08 | NeurIPS 2019 | [Pommerman Competition](https://www.pommerman.com/competitions) | 27 | 28 | ## Past Competitions 29 | 30 | | Status | Began | Ended | Conference | Competition | 31 | | ------- | ---------- | ---------- | ---------- | ----------- | 32 | | N/A | 2018-04-05 | 2018-06-05 | N/A | [OpenAI Retro Contest](https://openai.com/blog/retro-contest/) | 33 | | N/A | 2018-06-16 | 2018-11-06 | NeurIPS 2018 | [AI for Prosthetics Challenge](https://www.crowdai.org/challenges/nips-2018-ai-for-prosthetics-challenge) | 34 | | N/A | 2019-05-20 | 2019-05-28 | N/A | [C1 Terminal Season 3](https://terminal.c1games.com/) | 35 | | N/A | 2018-12-08 | 2019-06-30 | N/A | [First TextWorld Problems](https://competitions.codalab.org/competitions/20865) | 36 | | N/A | 2019-02-11 | 2019-08-01 | N/A | [Obstacle Tower Challenge](https://www.aicrowd.com/challenges/unity-obstacle-tower-challenge) | 37 | 38 | 39 | 注:第二部分比赛内容来源于仓库[seungjaeryanlee](https://github.com/seungjaeryanlee/awesome-rl-competitions/blob/master/README.md),在此表示感谢! 
40 | -------------------------------------------------------------------------------- /DRL-ConferencePaper/.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NeuronDance/DeepRL/8342ea71be1dffe26581d964f8011bcb973f85d6/DRL-ConferencePaper/.DS_Store -------------------------------------------------------------------------------- /DRL-ConferencePaper/AAAI/.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NeuronDance/DeepRL/8342ea71be1dffe26581d964f8011bcb973f85d6/DRL-ConferencePaper/AAAI/.DS_Store -------------------------------------------------------------------------------- /DRL-ConferencePaper/AAAI/2020/ReadMe.md: -------------------------------------------------------------------------------- 1 | 2 | 3 | ## [1]. Google Research Football: A Novel Reinforcement Learning Environment 4 | > Karol Kurach (Google Brain)*; Anton Raichuk (Google); Piotr Stańczyk (Google Brain); Michał Zając (Google Brain); 5 | Olivier Bachem (Google Brain); Lasse Espeholt (DeepMind); Carlos Riquelme (Google Brain); Damien Vincent 6 | (Google Brain); Marcin Michalski (Google); Olivier Bousquet (Google); Sylvain Gelly (Google Brain) 7 | 8 | 9 | ## [2]. Reinforcement Learning from Imperfect Demonstrations under Soft Expert Guidance 10 | >Xiaojian Ma (University of California, Los Angeles)*; Mingxuan Jing (Tsinghua University); Wenbing Huang 11 | (Tsinghua University); Chao Yang (Tsinghua University); Fuchun Sun (Tsinghua); Huaping Liu (Tsinghua University); 12 | Bin Fang (Tsinghua University) 13 | 14 | ## [3]. Proximal Distilled Evolutionary Reinforcement Learning 15 | >Cristian Bodnar (University of Cambridge)*; Ben Day (University of Cambridge); Pietro Lió (University of 16 | Cambridge) 17 | 18 | ## [4]. Tree-Structured Policy based Progressive Reinforcement Learning for Temporally Language Grounding in Video 19 | >Jie Wu (Sun Yat-sen University)*; Guanbin Li (Sun Yat-­sen University); si liu (Beihang University); Liang Lin 20 | (DarkMatter AI) 21 | 22 | ## [5]. RL-Duet: Online Music Accompaniment Generation Using Deep Reinforcement Learning 23 | >Nan Jiang (Tsinghua University)*; Sheng Jin (Tsinghua University); Zhiyao Duan (Unversity of Rochester); Changshui 24 | Zhang (Tsinghua University) 25 | 26 | ## [6]. Mastering Complex Control in MOBA Games with Deep Reinforcement Learning 27 | >Deheng Ye (Tencent)*; Zhao Liu (Tencent); Mingfei Sun (Tencent); Bei Shi (Tencent AI Lab); Peilin Zhao (Tencent AI 28 | Lab); Hao Wu (Tencent); Hongsheng Yu (Tencent); Shaojie Yang (Tencent); Xipeng Wu (Tencent); Qingwei Guo 29 | (Tsinghua University); Qiaobo Chen (Tencent); Yinyuting Yin (Tencent); Hao Zhang (Tencent); Tengfei Shi (Tencent); 30 | Liang Wang (Tencent); Qiang Fu (Tencent AI Lab); Wei Yang (Tencent AI Lab); Lanxiao Huang (Tencent) 31 | 32 | ## [7]. Partner Selection for the Emergence of Cooperation in Multi‐Agent Systems using Reinforcement Learning 33 | >Nicolas Anastassacos (The Alan Turing Institute)*; Steve Hailes (University College London); Mirco Musolesi (UCL) 34 | 35 | ## [8]. Uncertainty-Aware Action Advising for Deep Reinforcement Learning Agents 36 | > Felipe Leno da Silva (University of Sao Paulo)*; Pablo Hernandez-Leal (Borealis AI); Bilal Kartal (Borealis AI); 37 | Matthew Taylor (Borealis AI) 38 | 39 | ## [9]. 
MetaLight: Value-based Meta-reinforcement Learning for Traffic Signal Control 40 | > Xinshi Zang (Shanghai Jiao Tong University)*; Huaxiu Yao (Pennsylvania State University); Guanjie Zheng 41 | (Pennsylvania State University); Nan Xu (University of Southern California); Kai Xu (Shanghai Tianrang Intelligent 42 | Technology Co., Ltd); Zhenhui (Jessie) Li (Penn State University) 43 | 44 | ## [10]. Adaptive Quantitative Trading: an Imitative Deep Reinforcement Learning Approach 45 | > Yang Liu (University of Science and Technology of China)*; Qi Liu (" University of Science and Technology of China, 46 | China"); Hongke Zhao (Tianjin University); Zhen Pan (University of Science and Technology of China); Chuanren Liu 47 | (The University of Tennessee Knoxville) 48 | 49 | ## [11]. Neighborhood Cognition Consistent Multi‐Agent Reinforcement Learning 50 | > Hangyu Mao (Peking University)*; Wulong Liu (Huawei Noah's Ark Lab); Jianye Hao (Tianjin University); Jun Luo 51 | (Huawei Technologies Canada Co. Ltd.); Dong Li ( Huawei Noah's Ark Lab); Zhengchao Zhang (Peking University); 52 | Jun Wang (UCL); Zhen Xiao (Peking University) 53 | 54 | ## [12]. SMIX($\lambda$): Enhancing Centralized Value Functions for Cooperative Multi-Agent Reinforcement Learning 55 | > Chao Wen (Nanjing University of Aeronautics and Astronautics)*; Xinghu Yao (Nanjing University of Aeronautics 56 | and Astronautics); Yuhui Wang (Nanjing University of Aeronautics and Astronautics, China); Xiaoyang Tan (Nanjing 57 | University of Aeronautics and Astronautics, China) 58 | 59 | ## [13]. Unpaired Image Enhancement Featuring Reinforcement-­Learning-Controlled Image Editing Software 60 | > Satoshi Kosugi (The University of Tokyo)*; Toshihiko Yamasaki (The University of Tokyo) 61 | 62 | ## [14]. Crowdfunding Dynamics Tracking: A Reinforcement Learning Approach 63 | > Jun Wang (University of Science and Technology of China)*; Hefu Zhang (University of Science and Technology of 64 | China); Qi Liu (" University of Science and Technology of China, China"); Zhen Pan (University of Science and 65 | Technology of China); Hanqing Tao (University of Science and Technology of China (USTC)) 66 | 67 | ## [15]. Model and Reinforcement Learning for Markov Games with Risk Preferences 68 | >Wenjie Huang (Shenzhen Research Institute of Big Data)*; Hai Pham Viet (Department of Computer Science, 69 | School of Computing, National University of Singapore); William Benjamin Haskell (Supply Chain and Operations 70 | Management Area, Krannert School of Management, Purdue University) 71 | 72 | ## [16]. Finding Needles in a Moving Haystack: Prioritizing Alerts with Adversarial Reinforcement Learning 73 | >Liang Tong (Washington University in Saint Louis)*; Aron Laszka (University of Houston); Chao Yan (Vanderbilt 74 | UNIVERSITY); Ning Zhang (Washington University in St. Louis); Yevgeniy Vorobeychik (Washington University in St. 75 | Louis) 76 | 77 | ## [17]. Toward A Thousand Lights: Decentralized Deep Reinforcement Learning for Large‐Scale Traffic Signal Control 78 | >Chacha Chen (Pennsylvania State University)*; Hua Wei (Pennsylvania State University); Nan Xu (University of 79 | Southern California); Guanjie Zheng (Pennsylvania State University); Ming Yang (Shanghai Tianrang Intelligent 80 | Technology Co., Ltd); Yuanhao Xiong (Zhejiang University); Kai Xu (Shanghai Tianrang Intelligent Technology Co., 81 | Ltd); Zhenhui (Jessie) Li (Penn State University) 82 | 83 | ## [18]. 
Deep Reinforcement Learning for Active Human Pose Estimation 84 | > Erik Gärtner (Lund University)*; Aleksis Pirinen (Lund University); Cristian Sminchisescu (Lund University) 85 | 86 | ## [19]. Be Relevant, Non‐redundant, Timely: Deep Reinforcement Learning for Real‐time Event Summarization 87 | > Min Yang ( Chinese Academy of Sciences)*; Chengming Li (Chinese Academy of Sciences); Fei Sun (Alibaba Group); 88 | Zhou Zhao (Zhejiang University); Ying Shen (Peking University Shenzhen Graduate School); Chenglin Wu (fuzhi.ai) 89 | 90 | ## [20]. A Tale of Two‐Timescale Reinforcement Learning with the Tightest Finite‐Time Bound 91 | > Gal Dalal (Technion)*; Balazs Szorenyi (Yahoo Research); Gugan Thoppe (Duke University) 92 | 93 | ## [21]. Reinforcement Learning with Perturbed Rewards 94 | > Jingkang Wang (University of Toronto); Yang Liu (UCSC); Bo Li (University of Illinois at Urbana–Champaign)* 95 | 96 | ## [22]. Exploratory Combinatorial Optimization with Reinforcement Learning 97 | >Thomas Barrett (University of Oxford)*; William Clements (Unchartech); Jakob Foerster (Facebook AI Research); 98 | Alexander Lvovsky (Oxford University) 99 | 100 | ## [23]. Algorithmic Improvements for Deep Reinforcement Learning applied to Interactive Fiction 101 | > Vishal Jain (Mila, McGill University)*; Liam Fedus (Google); Hugo Larochelle (Google); Doina Precup (McGill 102 | University); Marc G. Bellemare (Google Brain) 103 | 104 | ## [24]. Spatiotemporally Constrained Action Space Attacks on Deep Reinforcement Learning Agents 105 | > Xian Yeow Lee (Iowa State University)*; Sambit Ghadai (Iowa State University); Kai Liang Tan (Iowa State 106 | University); Chinmay Hegde (New York University); Soumik Sarkar (Iowa State University) 107 | 108 | ## [25]. Modelling Sentence Pairs via Reinforcement Learning: An Actor‐Critic Approach to Learn the Irrelevant Words 109 | >MAHTAB AHMED (The University of Western Ontario)*; Robert Mercer (The University of Western Ontario) 110 | 111 | ## [26]. Transfer Reinforcement Learning using Output-­Gated Working Memory 112 | >Arthur Williams (Middle Tennessee State University)*; Joshua Phillips (Middle Tennessee State University) 113 | 114 | ## [27]. Reinforcement-­Learning based Portfolio Management with Augmented Asset Movement Prediction States 115 | >Yunan Ye (Zhejiang University)*; Hengzhi Pei (Fudan University); Boxin Wang (University of Illinois at Urbana-­ 116 | Champaign); Pin-­Yu Chen (IBM Research); Yada Zhu (IBM Research); Jun Xiao (Zhejiang University); Bo Li 117 | (University of Illinois at Urbana–Champaign) 118 | 119 | ## [28]. Deep Reinforcement Learning for General Game Playing 120 | > Adrian Goldwaser (University of New South Wales)*; Michael Thielscher (University of New South Wales) 121 | 122 | ## [29]. Stealthy and Efficient Adversarial Attacks against Deep Reinforcement Learning 123 | > Jianwen Sun (Nanyang Technological University)*; Tianwei Zhang ( Nanyang Technological University); Xiaofei Xie 124 | (Nanyang Technological University); Lei Ma (Kyushu University); Yan Zheng (Tianjin University); Kangjie Chen 125 | (Tianjin University); Yang Liu (Nanyang Technology University, Singapore) 126 | 127 | ## [30]. LeDeepChef: Deep Reinforcement Learning Agent for Families of Text-­Based Games 128 | >Leonard Adolphs (ETHZ)*; Thomas Hofmann (ETH Zurich) 129 | 130 | ## [31]. 
Induction of Subgoal Automata for Reinforcement Learning 131 | >Daniel Furelos-­Blanco (Imperial College London)*; Mark Law (Imperial College London); Alessandra Russo (Imperial 132 | College London); Krysia Broda (Imperial College London); Anders Jonsson (UPF) 133 | 134 | ## [32]. MRI Reconstruction with Interpretable Pixel-­Wise Operations Using Reinforcement Learning 135 | >wentian li (Tsinghua University)*; XIDONG FENG (department of Automation,Tsinghua University); Haotian An 136 | (Tsinghua University); Xiang Yao Ng (Tsinghua University); Yu-­Jin Zhang (Tsinghua University) 137 | 138 | ## [33]. Explainable Reinforcement Learning Through a Causal Lens 139 | > Prashan Madumal (University of Melbourne)*; Tim Miller (University of Melbourne); Liz Sonenberg (University of 140 | Melbourne); Frank Vetere (University of Melbourne) 141 | 142 | ## [34]. Reinforcement Learning based Metapath Discovery in Large-­scale Heterogeneous Information Networks 143 | > Guojia Wan (Wuhan University); Bo Du (School of Compuer Science, Wuhan University)*; Shirui Pan (Monash 144 | University); Reza Haffari (Monash University, Australia) 145 | 146 | ## [35]. Reinforcement Learning When All Actions are Not Always Available 147 | > Yash Chandak (University of Massachusetts Amherst)*; Georgios Theocharous ("Adobe Research, USA"); Blossom 148 | Metevier (University of Massachusetts, Amherst); Philip Thomas (University of Massachusetts Amherst) 149 | 150 | ## [36]. Reinforcement Mechanism Design: With Applications to Dynamic Pricing in Sponsored Search Auctions 151 | > Weiran Shen (Carnegie Mellon University)*; Binghui Peng (Columbia University); Hanpeng Liu (Tsinghua 152 | University); Michael Zhang (Chinese University of Hong Kong); Ruohan Qian (Baidu Inc.); Yan Hong (Baidu Inc.); Zhi 153 | Guo (Baidu Inc.); Zongyao Ding (Baidu Inc.); Pengjun Lu (Baidu Inc.); Pingzhong Tang (Tsinghua University) 154 | 155 | ## [37]. Metareasoning in Modular Software Systems: On-­the-­Fly Configuration Using Reinforcement Learning with 156 | > Rich Contextual Representations 157 | Aditya Modi (Univ. of Michigan Ann Arbor)*; Debadeepta Dey (Microsoft); Alekh Agarwal (Microsoft); Adith 158 | Swaminathan (Microsoft Research); Besmira Nushi (Microsoft Research); Sean Andrist (Microsoft Research); Eric 159 | Horvitz (MSR) 160 | 161 | ## [38]. Joint Entity and Relation Extraction with a Hybrid Transformer and Reinforcement Learning Based Model 162 | > Ya Xiao (Tongji University)*; Chengxiang Tan (Tongji University); Zhijie Fan (The Third Research Institute of the 163 | Ministry of Public Security); Qian Xu (Tongji University); Wenye Zhu (Tongji University) 164 | 165 | ## [39]. Reinforcement Learning of Risk-­Constrained Policies in Markov Decision Processes 166 | > Tomas Brazdil (Masaryk University); Krishnendu Chatterjee (IST Austria); Petr Novotný (Masaryk University)*; Jiří 167 | Vahala (Masaryk University) 168 | 169 | ## [40]. Deep Model-­Based Reinforcement Learning via Estimated Uncertainty and Conservative Policy Optimization 170 | > Qi Zhou (University of Science and Technology of China); Houqiang Li (University of Science and Technology of 171 | China); Jie Wang (University of Science and Technology of China)* 172 | 173 | ## [41]. Reinforcement Learning with Non-­Markovian Rewards 174 | > Maor Gaon (Ben-­Gurion University); Ronen Brafman (BGU)* 175 | 176 | ## [42]. 
Modular Robot Design Synthesis with Deep Reinforcement Learning 177 | > Julian Whitman (Carnegie Mellon University)*; Raunaq Bhirangi (Carnegie Mellon University); Matthew Travers 178 | (CMU); Howie Choset (Carnegie Melon University) 179 | 180 | ## [43]. BAR -­A Reinforcement Learning Agent for Bounding-­Box Automated Refinement 181 | > Morgane Ayle (American University of Beirut -­ AUB)*; Jimmy Tekli (BMW Group / Université de Franche-­Comté -­ 182 | UFC); Julia Zini (American University of Beirut -­ AUB); Boulos El Asmar (BMW Group / Karlsruher Institut für 183 | Technologie -­ KIT); Mariette Awad (American University of Beirut-­ AUB) 184 | 185 | ## [44]. Hierarchical Reinforcement Learning for Open-­Domain Dialog 186 | > Abdelrhman Saleh (Harvard University)*; Natasha Jaques (MIT); Asma Ghandeharioun (MIT); Judy Hanwen Shen(MIT); Rosalind Picard (MIT Media Lab) 187 | 188 | ## [45]. Copy or Rewrite: Hybrid Summarization with Hierarchical Reinforcement Learning 189 | > Liqiang Xiao (Artificial Intelligence Institute, SJTU)*; Lu Wang (Khoury College of Computer Science, Northeastern 190 | University); Hao He (Shanghai Jiao Tong University); Yaohui Jin (Artificial Intelligence Institute, SJTU) 191 | 192 | ## [46]. Generalizable Resource Allocation in Stream Processing via Deep Reinforcement Learning 193 | > Xiang Ni (IBM Research); Jing Li (NJIT); Wang Zhou (IBM Research); Mo Yu (IBM T. J. Watson)*; Kun-­Lung Wu (IBM 194 | Research) 195 | 196 | ## [47]. Actor Critic Deep Reinforcement Learning for Neural Malware Control 197 | > Yu Wang (Microsoft)*; Jack Stokes (Microsoft Research); Mady Marinescu (Microsoft Corporation) 198 | 199 | ## [48]. Fixed-­Horizon Temporal Difference Methods for Stable Reinforcement Learning 200 | > Kristopher De Asis (University of Alberta)*; Alan Chan (University of Alberta); Silviu Pitis (University of Toronto); 201 | Richard Sutton (University of Alberta); Daniel Graves (Huawei) 202 | 203 | ## [49]. Sequence Generation with Optimal-­Transport-­Enhanced Reinforcement Learning 204 | > Liqun Chen (Duke University)*; Ke Bai (Duke University); Chenyang Tao (Duke University); Yizhe Zhang (Microsoft Research); Guoyin Wang (Duke University); Wenlin Wang (Duke Univeristy); Ricardo Henao (Duke University); Lawrence Carin Duke (CS) 205 | 206 | ## [50]. Scaling All-­Goals Updates in Reinforcement Learning Using Convolutional Neural Networks 207 | > Fabio Pardo (Imperial College London)*; Vitaly Levdik (Imperial College London); Petar Kormushev (Imperial College London) 208 | 209 | ## [51]. Parameterized Indexed Value Function for Efficient Exploration in Reinforcement Learning 210 | > Tian Tan (Stanford University)*; Zhihan Xiong (Stanford University); Vikranth Dwaracherla (Stanford University) 211 | 212 | ## [52]. 
Solving Online Threat Screening Games using Constrained Action Space Reinforcement Learning 213 | > Sanket Shah (Singpore Management University)*; Arunesh Sinha (Singapore Management University); Pradeep Varakantham (Singapore Management University); Andrew Perrault (Harvard University); Milind Tambe (Harvard University) -------------------------------------------------------------------------------- /DRL-ConferencePaper/ACL/.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NeuronDance/DeepRL/8342ea71be1dffe26581d964f8011bcb973f85d6/DRL-ConferencePaper/ACL/.DS_Store -------------------------------------------------------------------------------- /DRL-ConferencePaper/ACL/2018/README.md: -------------------------------------------------------------------------------- 1 | ## ACL 2018 2 | 1. **Reliability and Learnability of Human Bandit Feedback for Sequence-to-Sequence Reinforcement Learning.** Julia Kreutzer, Joshua Uyheng, Stefan Riezler. 3 | 2. **Deep Reinforcement Learning for Chinese Zero Pronoun Resolution**. Qingyu Yin, Yu Zhang, Wei-Nan Zhang, Ting Liu, William Yang Wang. 4 | 3. **Robust Distant Supervision Relation Extraction via Deep Reinforcement Learning.** Pengda Qin, Weiran XU, William Yang Wang. 5 | 4. **Unpaired Sentiment-to-Sentiment Translation: A Cycled Reinforcement Learning Approach**.Jingjing Xu, Xu SUN, Qi Zeng, Xiaodong Zhang, Xuancheng Ren, Houfeng Wang, Wenjie Li. 6 | 5. **End-to-End Reinforcement Learning for Automatic Taxonomy Induction**.Yuning Mao, Xiang Ren, Jiaming Shen, Xiaotao Gu, Jiawei Han. 7 | 6. **Discourse Marker Augmented Network with Reinforcement Learning for Natural Language Inference.** Boyuan Pan, Yazheng Yang, Zhou Zhao, Yueting Zhuang, Deng Cai, Xiaofei He. 8 | 7. **Policy Gradient as a Proxy for Dynamic Oracles in Constituency Parsing**.Daniel Fried, Dan Klein. 9 | 8. **Deep Dyna-Q: Integrating Planning for Task-Completion Dialogue Policy Learning.** Baolin Peng, Xiujun Li, Jianfeng Gao, Jingjing Liu, Kam-Fai Wong. -------------------------------------------------------------------------------- /DRL-ConferencePaper/ACL/2019/README.md: -------------------------------------------------------------------------------- 1 | ## ACL 2019 2 | 1. **Neural Keyphrase Generation via Reinforcement Learning with Adaptive Rewards**,Hou Pong Chan, Wang Chen, Lu Wang and Irwin King 3 | 2. **Rewarding Smatch: Transition-Based AMR Parsing with Reinforcement Learning**,Tahira Naseem, Abhishek Shah, Hui Wan, Radu Florian, Salim Roukos and Miguel Ballesteros 4 | 3. **End-to-end Deep Reinforcement Learning Based Coreference Resolution**,Hongliang Fei, Xu Li, Dingcheng Li and Ping Li 5 | 4. **Using Semantic Similarity as Reward for Reinforcement Learning in Sentence Generation**,Go Yasui, Yoshimasa Tsuruoka and Masaaki Nagata 6 | 5. 
**Multi-Task Semantic Dependency Parsing with Policy Gradient for Learning Easy-First Strategies**Shuhei Kurita and Anders Søgaard 7 | -------------------------------------------------------------------------------- /DRL-ConferencePaper/ICLR/.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NeuronDance/DeepRL/8342ea71be1dffe26581d964f8011bcb973f85d6/DRL-ConferencePaper/ICLR/.DS_Store -------------------------------------------------------------------------------- /DRL-ConferencePaper/ICML/.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NeuronDance/DeepRL/8342ea71be1dffe26581d964f8011bcb973f85d6/DRL-ConferencePaper/ICML/.DS_Store -------------------------------------------------------------------------------- /DRL-ConferencePaper/ICML/README.md: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NeuronDance/DeepRL/8342ea71be1dffe26581d964f8011bcb973f85d6/DRL-ConferencePaper/ICML/README.md -------------------------------------------------------------------------------- /DRL-ConferencePaper/IJCAI/.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NeuronDance/DeepRL/8342ea71be1dffe26581d964f8011bcb973f85d6/DRL-ConferencePaper/IJCAI/.DS_Store -------------------------------------------------------------------------------- /DRL-ConferencePaper/IJCAI/2018/README.md: -------------------------------------------------------------------------------- 1 | ## IJCAI 2018 2 | 1. **Impression Allocation for Combating Fraud in E-commerce Via Deep Reinforcement Learning with Action Norm Penalty**, Mengchen Zhao, Zhao Li, BO AN, Haifeng Lu, Yifan Yang, Chen Chu 3 | 2. **Learning to Design Games: Strategic Environments in Reinforcement Learning**,Haifeng Zhang, Jun Wang, Zhiming Zhou, Weinan Zhang, Yin Wen, Yong Yu, Wenxin Li 4 | 3. **A Weakly Supervised Method for Topic Segmentation and Labeling in Goal-oriented Dialogues via Reinforcement Learning**,Ryuichi Takanobu, Minlie Huang, Zhongzhou Zhao, Fenglin Li, Haiqing Chen, Xiaoyan Zhu, Liqiang Nie 5 | 4. **Cross-modal Bidirectional Translation via Reinforcement Learning**, Jinwei Qi, Yuxin Peng 6 | 5. **StackDRL: Stacked Deep Reinforcement Learning for Fine-grained Visual Categorization**, Xiangteng He, Yuxin Peng, Junjie Zhao 7 | 6. **PEORL: Integrating Symbolic Planning and Hierarchical Reinforcement Learning for Robust Decision-Making**, Fangkai Yang, Daoming Lyu, Bo Liu, Steven Gustafson 8 | 7. **Extracting Action Sequences from Texts Based on Deep Reinforcement Learning**, Wenfeng Feng, Hankz Hankui Zhuo, Subbarao Kambhampati 9 | 8. **Deep Reinforcement Learning in Ice Hockey for Context-Aware Player Evaluation**, Guiliang Liu, Oliver Schulte 10 | 9. **Toward Diverse Text Generation with Inverse Reinforcement Learningg**, Zhan Shi, Xinchi Chen, Xipeng Qiu, Xuanjing Huan 11 | 10. **Multi-Level Policy and Reward Reinforcement Learning for Image Captioning**, Anan Liu, Ning Xu, Hanwang Zhang, Weizhi Nie, Yuting Su, Yongdong Zhang 12 | 11. **A Unified Approach for Multi-step Temporal-Difference Learning with Eligibility Traces in Reinforcement Learning**, Long Yang, Minhao Shi, Qian Zheng, Wenjia Meng, Gang Pan 13 | 12. **Algorithms or Actions? 
A Study in Large-Scale Reinforcement Learning**, Anderson Rocha Tavares, Sivasubramanian Anbalagan, Leandro Soriano Marcolino, Luiz Chaimowicz 14 | 13. **Master-Slave Curriculum Design for Reinforcement Learning**, Yuechen Wu, Wei Zhang, Ke Song 15 | 14. **Exploration by Distributional Reinforcement Learning**, Yunhao Tang, Shipra Agrawal 16 | 15. **Hashing over Predicted Future Frames for Informed Exploration of Deep Reinforcement Learning**, Haiyan Yin, Jianda Chen, Sinno Jialin Pan 17 | 16. **Keeping in Touch with Collaborative UAVs: A Deep Reinforcement Learning Approach**, Bo Yang, Min Liu 18 | 17. **Policy Optimization with Second-Order Advantage Information**, Jiajin Li, Baoxiang Wang, Shengyu Zhang 19 | 18. **Scheduled Policy Optimization for Natural Language Communication with Intelligent Agents**,Wenhan Xiong, Xiaoxiao Guo, Mo Yu, Shiyu Chang, Bowen Zhou, William Yang Wang 20 | 19. **Stochastic Fractional Hamiltonian Monte Carlo**, Nanyang Ye, Zhanxing Zhu 21 | 20. **Three-Head Neural Network Architecture for Monte Carlo Tree Search**, Chao Gao, Martin Mueller, Ryan Hayward 22 | 21. **Bidding in Periodic Double Auctions Using Heuristics and Dynamic Monte Carlo Tree Search, Moinul Morshed Porag Chowdhury**, Christopher Kiekintveld, Son Tran, William Yeoh 23 | 22. **On Q-learning Convergence for Non-Markov Decision Processes**, Sultan Javed Majeed, Marcus Hutter 24 | 23. **Minimax-Regret Querying on Side Effects for Safe Optimality in Factored Markov Decision Processes**, Shun Zhang, Edmund H. Durfee, Satinder Singh 25 | 24. **Multi-Agent Path Finding with Deadlines**, Hang Ma, Glenn Wagner, Ariel Felner, Jiaoyang Li, T. K. Satish Kumar, Sven Koenig 26 | 25. **Recurrent Deep Multiagent Q-Learning for Autonomous Brokers in Smart Grid**, Yaodong Yang, Jianye Hao, Mingyang Sun, Zan Wang, Changjie Fan, Goran Strbac 27 | 26. **Multi-agent Epistemic Planning with Common Knowledge**, Qiang Liu, Yongmei Liu 28 | 27. **Symbolic Synthesis of Fault-Tolerance Ratios in Parameterised Multi-Agent Systems**,Panagiotis Kouvaros, Alessio Lomuscio, Edoardo Pirovano 29 | 28. **Socially Motivated Partial Cooperation in Multi-agent Local Search**, Tal Ze’evi, Roie Zivan, Omer Lev 30 | 29. **Combining Opinion Pooling and Evidential Updating for Multi-Agent Consensus**, Chanelle Lee, Jonathan Lawry, Alan Winfield -------------------------------------------------------------------------------- /DRL-ConferencePaper/IJCAI/2019/README.md: -------------------------------------------------------------------------------- 1 | ## IJCAI 2019 2 | 1. **A Dual Reinforcement Learning Framework for Unsupervised Text Style Transfer**: Fuli Luo, Peng Li, Jie Zhou, Pengcheng Yang, Baobao Chang, Xu Sun, Zhifang Sui 3 | 2. **A Restart-based Rank-1 Evolution Strategy for Reinforcement Learning**: Zefeng Chen, Yuren Zhou, Xiao-yu He, Siyu Jiang 4 | 3. **An Actor-Critic-Attention Mechanism for Deep Reinforcement Learning in Multi-view Environments**:Elaheh Barati, Xuewen Chen 5 | 4. **An Atari Model Zoo for Analyzing, Visualizing, and Comparing Deep Reinforcement Learning Agents**: Felipe Such, Vashisht Madhavan, Rosanne Liu, Rui Wang, Pablo Samuel Castro, Yulun Li, Jiale Zhi, Ludwig Schubert, Marc G. Bellemare, Jeff Clune, Joel Lehman 6 | 5. **Automatic Successive Reinforcement Learning with Multiple Auxiliary Rewards**: Zhao-Yang Fu, De-Chuan Zhan, Xin-Chun Li, Yi-Xing Lu 7 | 6. 
**Autoregressive Policies for Continuous Control Deep Reinforcement Learning**:Dmytro Korenkevych, Ashique Rupam Mahmood, Gautham Vasan, James Bergstra 8 | 7. **Deep Multi-Agent Reinforcement Learning with Discrete-Continuous Hybrid Action Spaces** :Haotian Fu, Hongyao Tang, Jianye Hao, Zihan Lei, Yingfeng Chen, Changjie Fan 9 | 8. **Dynamic Electronic Toll Collection via Multi-Agent Deep Reinforcement Learning with Edge-Based Graph Convolutional Network Representation**:Wei Qiu, Haipeng Chen, Bo An 10 | 9. **Energy-Efficient Slithering Gait Exploration for a Snake-Like Robot Based on Reinforcement Learning**: Zhenshan Bing, Christian Lemke, Zhuangyi Jiang, Kai Huang, Alois Knoll 11 | 10. **Explaining Reinforcement Learning to Mere Mortals**: An Empirical Study: Andrew Anderson, Jonathan Dodge, Amrita Sadarangani, Zoe Juozapaitis, Evan Newman, Jed Irvine, Souti Chattopadhyay, Alan Fern, Margaret Burnett 12 | 11. **Hybrid Actor-Critic Reinforcement Learning in Parameterized Action Space**: Zhou Fan, Rui Su, Weinan Zhang, Yong Yu 13 | 12. **Incremental Learning of Planning Actions in Model-Based Reinforcement Learning**: Alvin Ng, Ron Petrick 14 | 13. **Interactive Reinforcement Learning with Dynamic Reuse of Prior Knowledge from Human/Agent's Demonstration**: Zhaodong Wang, Matt Taylor 15 | 14. **Interactive Teaching Algorithms for Inverse Reinforcement Learning**: Parameswaran Kamalaruban, Rati Devidze, Volkan Cevher, Adish Singla 16 | 15. **Large-Scale Home Energy Management Using Entropy-Based Collective Multiagent Deep Reinforcement Learning**: Yaodong Yang, Jianye Hao, Yan Zheng, Chao Yu 17 | 16. **Meta Reinforcement Learning with Task Embedding and Shared Policy**: Lin Lan, Zhenguo Li, Xiaohong Guan, Pinghui Wang 18 | 17. **Metatrace Actor-Critic: Online Step-Size Tuning by Meta-gradient Descent for Reinforcement Learning Control**: Kenny Young, Baoxiang Wang, Matthew E. Taylor 19 | 18. **Playing Card-Based RTS Games with Deep Reinforcement Learning**: Tianyu Liu, Zijie Zheng, Hongchang Li, Kaigui Bian, Lingyang Song 20 | 19. **Playing FPS Games With Environment-Aware Hierarchical Reinforcement Learning**: Shihong Song, Jiayi Weng, Hang Su, Dong Yan, Haosheng Zou, Jun Zhu 21 | 20. **Reinforcement Learning Experience Reuse with Policy Residual Representation**: WenJi Zhou, Yang Yu, Yingfeng Chen, Kai Guan, Tangjie Lv, Changjie Fan, Zhi-Hua Zhou 22 | 21. **Reward Learning for Efficient Reinforcement Learning in Extractive Document Summarisation**: Yang Gao, Christian Meyer, Mohsen Mesgar, Iryna Gurevych 23 | 22. **Sharing Experience in Multitask Reinforcement Learning**: Tung-Long Vuong, Do-Van Nguyen, Tai-Long Nguyen, Cong-Minh Bui, Hai-Dang Kieu, Viet-Cuong Ta, Quoc-Long Tran, Thanh-Ha Le 24 | 23. **SlateQ: A Tractable Decomposition for Reinforcement Learning with Recommendation Sets**: Eugene Ie, Vihan Jain, Jing Wang, Sanmit Narvekar, Ritesh Agarwal, Rui Wu, Heng-Tze Cheng, Tushar Chandra, Craig Boutilier 25 | 24. **Soft Policy Gradient Method for Maximum Entropy Deep Reinforcement Learning**: Wenjie Shi, Shiji Song, Cheng Wu 26 | 25. **Solving Continual Combinatorial Selection via Deep Reinforcement Learning**: HyungSeok Song, Hyeryung Jang, Hai H. Tran, Se-eun Yoon, Kyunghwan Son, Donggyu Yun, Hyoju Chung, Yung Yi 27 | 26. **Successor Options: An Option Discovery Framework for Reinforcement Learning**: Rahul Ramesh, Manan Tomar, Balaraman Ravindran 28 | 27. **Transfer of Temporal Logic Formulas in Reinforcement Learning**: Zhe Xu, Ufuk Topcu 29 | 28. 
**Using Natural Language for Reward Shaping in Reinforcement Learning**: Prasoon Goyal, Scott Niekum, Raymond Mooney 30 | 29. **Value Function Transfer for Deep Multi-Agent Reinforcement Learning Based on N-Step Returns**: Yong Liu, Yujing Hu, Yang Gao, Yingfeng Chen, Changjie Fan 31 | 30. **Failure-Scenario Maker for Rule-Based Agent using Multi-agent Adversarial Reinforcement Learning and its Application to Autonomous Driving**: Akifumi Wachi 32 | 31. **LTL and Beyond: Formal Languages for Reward Function Specification in Reinforcement Learning**: Alberto Camacho, Rodrigo Toro Icarte, Toryn Q. Klassen, Richard Valenzano, Sheila McIlraith 33 | 32. **A Survey of Reinforcement Learning Informed by Natural Language**: Jelena Luketina↵, Nantas Nardelli, Gregory Farquhar, Jakob Foerster, Jacob Andreas, Edward Grefenstett, Shimon Whiteson, Tim Rocktäschel 34 | 33. **Leveraging Human Guidance for Deep Reinforcement Learning Tasks**: Ruohan Zhang, Faraz Torabi, Lin Guan, Dana H. Ballard, Peter Stone 35 | 34. **CRSRL: Customer Routing System using Reinforcement Learning**: Chong Long, Zining Liu, Xiaolu Lu, Zehong Hu, Yafang Wang 36 | 35. **Deep Reinforcement Learning for Ride-sharing Dispatching and Repositioning**: Zhiwei (Tony) Qin, Xiaocheng Tang, Yan Jiao, Fan Zhang, Chenxi Wang 37 | 36. **Learning Deep Decentralized Policy Network by Collective Rewards for Real-Time Combat Game**: Peixi Peng, Junliang Xing, Lili Cao, Lisen Mu, Chang Huang 38 | 37. **Monte Carlo Tree Search for Policy Optimization**: Xiaobai Ma, Katherine Driggs-Campbell, Zongzhang Zhang, Mykel J. Kochenderfer 39 | 38. **On Principled Entropy Exploration in Policy Optimization**: Jincheng Mei, Chenjun Xiao, Ruitong Huang, Dale Schuurmans, Martin Müller 40 | 39. **Recurrent Existence Determination Through Policy Optimization**: Baoxiang Wang 41 | 40. **Diversity-Inducing Policy Gradient: Using Maximum Mean Discrepancy to Find a Set of Diverse Policies**: Muhammad Masood, Finale Doshi-Velez 42 | 41. **A probabilistic logic for resource-bounded multi-agent systems**: Hoang Nga Nguyen, Abdur Rakib 43 | 42. **A Value-based Trust Assessment Model for Multi-agent Systems**: Kinzang Chhogyal, Abhaya Nayak, Aditya Ghose, Hoa Khanh Dam 44 | 43. **Branch-and-Cut-and-Price for Multi-Agent Pathfinding**: Edward Lam, Pierre Le Bodic, Daniel Harabor, Peter J. Stuckey 45 | 44. **Decidability of Model Checking Multi-Agent Systems with Regular Expressions against Epistemic HS Specifications**: Jakub Michaliszyn, Piotr Witkowski 46 | 45. **Improved Heuristics for Multi-Agent Path Finding with Conflict-Based Search**: Jiaoyang Li, Eli Boyarski, Ariel Felner, Hang Ma, Sven Koenig 47 | 46. **Integrating Decision Sharing with Prediction in Decentralized Planning for Multi-Agent Coordination under Uncertainty**: Minglong Li, Wenjing Yang, Zhongxuan Cai, Shaowu Yang, Ji Wang 48 | 47. **Multi-agent Attentional Activity Recognition**: Kaixuan Chen, Lina Yao, Dalin Zhang, Bin Guo, Zhiwen Yu 49 | 48. **Multi-Agent Pathfinding with Continuous Time**: Anton Andreychuk, Konstantin Yakovlev, Dor Atzmon, Roni Stern 50 | 49. **Priority Inheritance with Backtracking for Iterative Multi-agent Path Finding**: Keisuke Okumura, Manao Machida, Xavier Défago, Yasumasa Tamura 51 | 50. **The Interplay of Emotions and Norms in Multiagent Systems**: Anup K. Kalia, Nirav Ajmeri, Kevin S. Chan, Jin-Hee Cho, Sibel Adali, Munindar Singh 52 | 51. 
**Unifying Search-based and Compilation-based Approaches to Multi-agent Path Finding through Satisfiability Modulo Theories**: Pavel Surynek 53 | 52. **Implicitly Coordinated Multi-Agent Path Finding under Destination Uncertainty: Success Guarantees and Computational Complexity (Extended Abstract)**: Bernhard Nebel, Thomas Bolander, Thorsten Engesser, Robert Mattmüller 54 | 53. **Embodied Conversational AI Agents in a Multi-modal Multi-agent Competitive Dialogue**: Rahul Divekar, Xiangyang Mou, Lisha Chen, Maíra Gatti de Bayser, Melina Alberio Guerra, Hui Su 55 | 54. **Multi-Agent Path Finding on Ozobots**: Roman Barták, Ivan Krasičenko, Jiří Švancara 56 | 55. **Multi-Agent Visualization for Explaining Federated Learning**: Xiguang Wei, Quan Li, Yang Liu, Han Yu, Tianjian Chen, Qiang Yang 57 | 56. **Automated Machine Learning with Monte-Carlo Tree Search**: Herilalaina Rakotoarison, Marc Schoenauer, Michele Sebag 58 | 57. **Influence of State-Variable Constraints on Partially Observable Monte Carlo Planning**: Alberto Castellini, Georgios Chalkiadakis, Alessandro Farinelli 59 | 58. **Multiple Policy Value Monte Carlo Tree Search**: Li-Cheng Lan, Wei Li, Ting han Wei, I-Chen Wu 60 | 59. **Subgoal-Based Temporal Abstraction in Monte-Carlo Tree Search**: Thomas Gabor, Jan Peter, Thomy Phan, Christian Meyer, Claudia Linnhoff-Popien 61 | 60. **A Convergence Analysis of Distributed SGD with Communication-Efficient Gradient Sparsification**: Shaohuai Shi, Kaiyong Zhao, Qiang Wang, Zhenheng Tang, Xiaowen Chu 62 | 61. **AsymDPOP: Complete Inference for Asymmetric Distributed Constraint Optimization Problems**: Yanchen Deng, Ziyu Chen, Dingding Chen, Wenxin Zhang, Xingqiong Jiang 63 | 62. **Distributed Collaborative Feature Selection Based on Intermediate Representation**: Xiucai Ye, Hongmin Li, Akira Imakura, Tetsuya Sakurai 64 | 63. **FABA: An Algorithm for Fast Aggregation against Byzantine Attacks in Distributed Neural Networks**: Qi Xia, Zeyi Tao, Zijiang Hao, Qun Li 65 | 64. **Faster Distributed Deep Net Training: Computation and Communication Decoupled Stochastic Gradient Descent**: Shuheng Shen, Linli Xu, Jingchang Liu, Xianfeng Liang, Yifei Cheng 66 | 65. **Fully Distributed Bayesian Optimization with Stochastic Policies**: Javier Garcia-Barcos, Ruben Martinez-Cantin -------------------------------------------------------------------------------- /DRL-ConferencePaper/IJCAI/README.md: -------------------------------------------------------------------------------- 1 | ## IJCAI会议 2 | ### 关于IJCAI 3 | **International Joint Conferences on Artificial Intelligence(IJCAI)** is a non-profit corporation founded in California, in 1969 for scientific and educational purposes, including dissemination of information on Artificial Intelligence at conferences in which cutting-edge scientific results are presented and through dissemination of materials presented at these meetings in form of Proceedings, books, video recordings, and other educational materials. IJCAI consists of two divisions: the Conference Division and the AI Journal Division. IJCAI conferences present premier international gatherings of AI researchers and practitioners and they were held biennially in odd-numbered years since 1969. 4 | 5 | Starting with 2016, IJCAI conferences are held annually. IJCAI-PRICAI-20 will be held in Yokohama, Japan, IJCAI-21 in Montreal, Canada, IJCAI-ECAI-22 in Bologna, Italy and IJCAI-23 in Cape Town, South Africa. 
6 | 7 | IJCAI is governed by the Board of Trustees, with IJCAI Secretariat in charge of its operations. 8 | 9 | IJCAI-19 was be held in Macao, P.R. China from August 10-16, 2019. The IJCAI Organization and Local Arrangements Committee thank you for participating. 10 | ### 官网:https://www.ijcai.org/ 11 | 12 | ### 赞助商 13 | ![](asserts/2020-02-12-13-00-22.png) 14 | -------------------------------------------------------------------------------- /DRL-ConferencePaper/Level.md: -------------------------------------------------------------------------------- 1 | ### About 2 | 3 | 本部分包含AAAI,ICML,NIPS,CVPR等顶级论文汇总,目前已归总AAAI-2020,NIPS2019等.... 4 | . 5 | . 6 | . 7 | . 8 | . 9 | . 10 | 11 | --- 12 | ##关于会议投稿及会议级别内容: 13 | 14 | 15 | ## 第一层次 16 | 17 | ### **IJCAI (1+): International Joint Conference on Artificial Intelligence** 18 | > **影响因子:1.82 (top 4.09 %)**
IJCAI是AI最好的综合性会议, 1969年开始, 每两年开一次, 奇数年开. 因为AI 实在太大, 所以虽然每届基本上能录100多篇(现在已经到200多篇了),但分到每个 领域就没几篇了,象achine learning、computer vision这么大的领域每次大概也 就10篇左右, 所以难度很大. 不过从录用率上来看倒不太低,基本上20%左右, 因为内 行人都会掂掂分量, 没希望的就别浪费reviewer的时间了. 最近中国大陆投往国际会 议的文章象潮水一样, 而且因为国内很少有能自己把关的研究组, 所以很多会议都在 complain说中国的低质量文章严重妨碍了PC的工作效率. 在这种情况下, 估计这几年 国际会议的录用率都会降下去. 另外, 以前的IJCAI是没有poster的, 03年开始, 为了 减少被误杀的好人, 增加了2页纸的poster.值得一提的是, IJCAI是由貌似一个公司 的”IJCAI Inc.”主办的(当然实际上并不是公司, 实际上是个基金会), 每次会议上要 发几个奖, 其中最重要的两个是IJCAI Research Excellence Award 和 Computer & Thoughts Award, 前者是终身成就奖, 每次一个人, 基本上是AI的最高奖(有趣的是, 以AI为主业拿图灵奖的6位中, 有2位还没得到这个奖), 后者是奖给35岁以下的 青年科学家, 每次一个人. 这两个奖的获奖演说是每次IJCAI的一个重头戏.另外, IJCAI 的 PC member 相当于其他会议的area chair, 权力很大, 因为是由PC member 去找 reviewer 来审, 而不象一般会议的PC member其实就是 reviewer. 为了制约 这种权力, IJCAI的审稿程序是每篇文章分配2位PC member, primary PC member去找3位reviewer, second PC member找一位. 19 | 20 | 21 | ### **AAAI(1)**: National Conference on Artificial Intelligence 22 | > **影响因子:1.49 (top 9.17%)**
AAAI是美国人工智能学会AAAI的年会. 是一个很好的会议, 但其档次不稳定, 可 以给到1+, 也可以给到1-或者2+, 总的来说我给它”1″. 这是因为它的开法完全受 IJCAI制约: 每年开, 但如果这一年的 IJCAI在北美举行, 那么就停开. 所以, 偶数年 里因为没有IJCAI, 它就是最好的AI综合性会议, 但因为号召力毕竟比IJCAI要小一些, 特别是欧洲人捧AAAI场的比IJCAI少得多(其实亚洲人也是), 所以比IJCAI还是要稍弱 一点, 基本上在1和1+之间; 在奇数年, 如果IJCAI不在北美, AAAI自然就变成了比 IJCAI低一级的会议(1-或2+), 例如2005年既有IJCAI又有AAAI, 两个会议就进行了协 调, 使得IJCAI的录用通知时间比AAAI的deadline早那么几天, 这样IJCAI落选的文章 可以投往AAAI.在审稿时IJCAI 的 PC chair也在一直催, 说大家一定要快, 因为AAAI 那边一直在担心IJCAI的录用通知出晚了AAAI就麻烦了. 23 | 24 | ### **COLT (1)**: Annual Conference on Computational Learning Theory 25 | > **影响因子:1.49 (top 9.25%)**
COLT是计算学习理论最好的会议, ACM主办, 每年举行. 计算学习理论基本上可以看成理论计算机科学和机器学习的交叉, 所以这个会被一些人看成是理论计算 机科学的会而不是AI的会. 我一个朋友用一句话对它进行了精彩的刻画: “一小群数 学家在开会”. 因为COLT的领域比较小, 所以每年会议基本上都是那些人. 这里顺便 提一件有趣的事, 因为最近国内搞的会议太多太滥, 而且很多会议都是LNCS/LNAI出 论文集, LNCS/LNAI基本上已经被搞臭了, 但很不幸的是, LNCS/LNAI中有一些很好的 会议 26 | 27 | 28 | ### **CVPR (1)**: IEEE International Conference on Computer Vision and Pattern Recognition 29 | > **影响因子:**
CVPR是计算机视觉和模式识别方面最好的会议之一, IEEE主办, 每年举行. 虽然题 目上有计算机视觉, 但个人认为它的模式识别味道更重一些. 事实上它应该是模式识 别最好的会议, 而在计算机视觉方面, 还有ICCV 与之相当. IEEE一直有个倾向, 要把 会办成”盛会”, 历史上已经有些会被它从quality很好的会办成”盛会”了. CVPR搞不好 也要走这条路. 这几年录的文章已经不少了. 最近负责CVPR会议的TC的chair发信 说, 对这个community来说, 让好人被误杀比被坏人漏网更糟糕, 所以我们是不是要减 少好人被误杀的机会啊? 所以我估计明年或者后年的CVPR就要扩招了. 30 | 31 | 32 | ### **ICCV (1)**: IEEE International Conference on Computer Vision 33 | > **影响因子:1.78 (top 4.75%)**
ICCV 的全称是 IEEE International Conference on Computer Vision,即国际计算机视觉大会,由IEEE主办,与计算机视觉模式识别会议(CVPR)和欧洲计算机视觉会议(ECCV)并称计算机视觉方向的三大顶级会议,被澳大利亚ICT学术会议排名和中国计算机学会等机构评为最高级别学术会议,在业内具有极高的评价。不同于在美国每年召开一次的CVPR和只在欧洲召开的ECCV,ICCV在世界范围内每两年召开一次。ICCV论文录用率非常低,是三大会议中公认级别最高的。ICCV会议时间通常在四到五天,相关领域的专家将会展示最新的研究成果。 34 | 35 | 36 | ### **ICML (1)**: International Conference on Machine Learning 37 | > **影响因子:2.12 (top 1.88%)**
ICML是International Conference on Machine Learning的缩写,即国际机器学习大会。ICML如今已发展为由国际机器学习学会(IMLS)主办的年度机器学习国际顶级会议。每年举办一次,和NIPS,CVPR不相上下。 38 | 39 | ### **NIPS (1)**: Annual Conference on Neural Information Processing Systems 40 | > **影响因子:1.06 (top 20.96%)**
NIPS是神经计算方面最好的会议之一, NIPS主办, 每年举行. 值得注意的是, 这个会 每年的举办地都是一样的, 以前是美国丹佛, 现在是加拿大温哥华; 而且它是年底开会, 会开完后第2年才出论文集, 也就是说, NIPS’05的论文集是06年出. 会议的名字 “Advances in Neural Information Processing Systems”, 所以, 与ICML/ECML这样 的”标准的”机器学习会议不同, NIPS里有相当一部分神经科学的内容, 和机器学习有 一定的距离. 但由于会议的主体内容是机器学习, 或者说与机器学习关系紧密, 所以 不少人把NIPS看成是机器学习方面最好的会议之一. 这个会议基本上控制在Michael Jordan的徒子徒孙手中, 所以对Jordan系的人来说, 发NIPS并不是难事, 一些未必很 强的工作也能发上去, 但对这个圈子之外的人来说, 想发一篇实在很难, 因为留给”外 人”的口子很小. 所以对Jordan系以外的人来说, 发NIPS的难度比ICML更大. 换句话说, ICML比较开放, 小圈子的影响不象NIPS那么大, 所以北美和欧洲人都认, 而NIPS则有 些人(特别是一些欧洲人, 包括一些大家)坚决不投稿. 这对会议本身当然并不是好事, 但因为Jordan系很强大, 所以它似乎也不太care. 最近IMLS(国际机器学习学会)改选 理事, 有资格提名的人包括近三年在ICML/ECML/COLT发过文章的人, NIPS则被排除在 外了. 无论如何, 这是一个非常好的会. 41 | 42 | 43 | ### **ACL (1-)**: Annual Meeting of the Association for Computational Linguistics 44 | > **影响因子:1.06 (top 20.96%)**
ACL是计算语言学/自然语言处理方面最好的会议, ACL (Association for Computational Linguistics) 主办, 每年开. 45 | 46 | 47 | ### **KR (1-)**: International Conference on Principles of Knowledge Representation and Reasoning 48 | > **影响因子:1.06 (top 20.96%)**
KR是知识表示和推理方面最好的会议之一, 实际上也是传统AI(即基于逻辑的AI) 最好的会议之一. KR Inc.主办, 现在是偶数年开. 49 | 50 | 51 | ### **SIGIR (1-)**: Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 52 | > **影响因子:1.06 (top 20.96%)**
SIGIR是信息检索方面最好的会议, ACM主办, 每年开. 这个会现在小圈子气越来 越重. 信息检索应该不算AI, 不过因为这里面用到机器学习越来越多, 最近几年甚至 有点机器学习应用会议的味道了, 所以把它也列进来. 53 | 54 | 55 | ### **SIGKDD (1-)**: ACM SIGKDD International Conference on Knowledge Discovery and Data Mining 56 | > **影响因子:**
SIGKDD是数据挖掘方面最好的会议, ACM主办, 每年开. 这个会议历史比较短, 毕竟, 与其他领域相比,数据挖掘还只是个小弟弟甚至小侄儿. 在几年前还很难把它列 在tier-1里面, 一方面是名声远不及其他的 top conference响亮, 另一方面是相对容易 被录用. 但现在它被列在tier-1应该是毫无疑问的事情了. 57 | 58 | 59 | ### **UAI (1-)**: International Conference on Uncertainty in Artificial Intelligence 60 | > **影响因子:**
UAI: 名字叫”人工智能中的不确定性”, 涉及表示/推理/学习等很多方面, AUAI (Association of UAI) 主办, 每年开. 61 | 62 | 63 | ## 第二层次 64 | 65 | ### **AAMAS (2+)**: International Joint Conference on Autonomous Agents and Multiagent Systems 66 | > **影响因子:**
AAMAS是agent方面最好的会议. 但是现在agent已经是一个一般性的概念, 几乎所有AI有关的会议上都有这方面的内容, 所以AAMAS下降的趋势非常明显. 67 | 68 | ### **ECCV (2+)**: European Conference on Computer Vision 69 | > **影响因子:1.58 (top 7.20 %)**
ECCV是计算机视觉方面仅次于ICCV的会议, 因为这个领域发展很快, 有可能升级到1-去. 70 | 71 | ### **ECML (2+)**: European Conference on Machine Learning 72 | > **影响因子:0.83 (top 30.63 %)**
ECML是机器学习方面仅次于ICML的会议, 欧洲人极力捧场, 一些人认为它已经是1-了. 我保守一点, 仍然把它放在2+. 因为机器学习发展很快, 这个会议的reputation上升非常明显. 73 | 74 | ### **ICDM (2+)**: IEEE International Conference on Data Mining 75 | > **影响因子:0.35 (top 59.86 %)**
ICDM是数据挖掘方面仅次于SIGKDD的会议, 目前和SDM相当. 这个会只有5年历史, 上升速度之快非常惊人. 几年前ICDM还比不上PAKDD, 现在已经拉开很大距离了. 76 | 77 | ### **SDM (2+)**: SIAM International Conference on Data Mining 78 | > **影响因子:**
SDM是数据挖掘方面仅次于SIGKDD的会议, 目前和ICDM相当. SIAM的底子很厚, 但在CS里面的影响比ACM和IEEE还是要小, SDM眼看着要被ICDM超过了, 但至少目前还是相当的. 79 | 80 | ### **ICAPS (2)**: International Conference on Automated Planning and Scheduling 81 | > **影响因子:**
ICAPS是人工智能规划方面最好的会议, 是由以前的国际和欧洲规划会议合并来的. 因为这个领域逐渐变冷清, 影响比以前已经小了. 82 | 83 | ### **ICCBR (2)**: International Conference on Case-Based Reasoning 84 | > **影响因子:0.35 (top 59.86 %)**
ICCBR是Case-Based Reasoning方面最好的会议. 因为领域不太大, 而且一直半冷不热, 所以总是停留在2上. 85 | 86 | 87 | ### **COLING (2)**: International Conference on Computational Linguistics 88 | > **影响因子:**
COLING是计算语言学/自然语言处理方面仅次于ACL的会, 但与ACL的差距比ICCV-ECCV和ICML-ECML大得多. 89 | 90 | 91 | ### **ECAI (2)**: European Conference on Artificial Intelligence 92 | > **影响因子:0.69 (top 38.49 %)**
ECAI是欧洲的人工智能综合型会议, 历史很久, 但因为有IJCAI/AAAI压着,很难往上升. 93 | 94 | 95 | ### **ALT (2-)**: International Conference on Algorithmic Learning Theory 96 | > **影响因子:0.63 (top 42.91 %)**
ALT有点象COLT的tier-2版, 但因为搞计算学习理论的人没多少, 做得好的数来数去就那么些group, 基本上到COLT去了, 所以ALT里面有不少并非计算学习理论的内容. 97 | 98 | 99 | ### **EMNLP (2-)**: Conference on Empirical Methods in Natural Language Processing 100 | > **影响因子:**
EMNLP是计算语言学/自然语言处理方面一个不错的会. 有些人认为与COLING相当, 但我觉得它还是要弱一点. 101 | 102 | 103 | ### **ILP (2-)**: International Conference on Inductive Logic Programming 104 | > **影响因子:0.63 (top 42.91 %)**
ILP是归纳逻辑程序设计方面最好的会议. 但因为很多其他会议里都有ILP方面的内容, 所以它只能保住2-的位置了. 105 | 106 | 107 | ### **PKDD (2-)**: European Conference on Principles and Practice of Knowledge Discovery in Databases 108 | > **影响因子:0.63 (top 42.91 %)**
PKDD是欧洲的数据挖掘会议, 目前在数据挖掘会议里面排第4. 欧洲人很想把它抬起来, 所以这些年一直和ECML一起捆绑着开, 希望能借ECML把它带起来.但因为ICDM和SDM, 这已经不太可能了. 所以今年的 PKDD和ECML虽然还是一起开, 但已经独立审稿了(以前是可以同时投两个会, 作者可以声明优先被哪个会考虑, 如果ECML中不了还可以被 PKDD接受). 109 | 110 | 111 | ## 第三层次 112 | ### **ACCV (3+)**: Asian Conference on Computer Vision 113 | > **影响因子:0.42 (top 55.61%)**
ACCV是亚洲的计算机视觉会议, 在亚太级别的会议里算很好的了. 114 | 115 | 116 | ### **ICTAI (3+)**: IEEE International Conference on Tools with Artificial Intelligence 117 | > **影响因子:0.42 (top 55.61%)**
ICTAI是IEEE最主要的人工智能会议, 偏应用, 是被IEEE办烂的一个典型. 以前的quality还是不错的, 但是办得越久声誉反倒越差了, 糟糕的是似乎还在继续下滑, 现在其实3+已经不太呆得住了. 118 | 119 | ### **PAKDD (3+)**: Pacific-Asia Conference on Knowledge Discovery and Data Mining 120 | > **影响因子:0.42 (top 55.61%)**
PAKDD是亚太数据挖掘会议, 目前在数据挖掘会议里排第5. 121 | 122 | 注:部分第三层次会议知名度并不高,所以不进行一一列举。 123 | 124 | **以上给出的评分或等级都是个人意见, 仅供参考. 特别要说明的是: 综合建议** 125 | ## 综合建议: 126 | 1. **第一层次conference上的文章并不一定比第三层次的好, 只能说前者的平均水准更高.** 127 | 2. **研究工作的好坏不是以它发表在哪儿来决定的, 发表在高档次的地方只是为了让工作更容易被同行注意到. 第三层次会议上发表1篇被引用10次的文章可能比在第一层次会议上发表10篇被引用0次的文章更有价值. 所以, 数top会议文章数并没有太大意义, 重要的是同行的评价和认可程度.** 128 | 3. **很多经典工作并不是发表在高档次的发表源上, 有不少经典工作甚至是发表在很低档的发表源上. 原因很多, 就不细说了.** 129 | 4. **会议毕竟是会议, 由于审稿时间紧, 错杀好人和漏过坏人的情况比比皆是, 更何况还要考虑到有不少刚开始做研究的学生在代老板审稿.** 130 | 5. **会议的reputation并不是一成不变的,新会议可能一开始没什么声誉,但过几年后就野鸡变凤凰,老会议可能原来声誉很好,但越来越往下滑.** 131 | 6. **只有计算机科学才重视会议论文, 其他学科并不把会议当回事. 但在计算机科学中也有不太重视会议的分支.** 132 | 7. **Politics无所不在. 你老板是谁, 你在哪个研究组, 你在哪个单位, 这些简单的因素都可能造成决定性的影响. 换言之, 不同环境的人发表的难度是不一样的. 了解到这一点后, 你可能会对high-level发表源上来自low-level单位名不见经传作者的文章特别注意(例如如果<计算机学报>上发表了平顶山铁 道电子信息科技学院的作者的文章,我一定会仔细读).** 133 | 8. **评价体系有巨大的影响. 不管是在哪儿谋生的学者, 都需要在一定程度上去迎合评价体系, 否则连生路都没有了, 还谈什么做研究. 以国内来说, 由于评价体系只重视journal, 有一些工作做得很出色的学者甚至从来不投会议. 另外, 经费也有巨大的制约作用. 国外很多好的研究组往往是重要会议都有文章. 但国内是不行的, 档次低一些的会议还可以投了只交注册费不开会, 档次高的会议不去做报告会有很大的负面影响, 所以只能投很少的会议. 这是在国内做CS研究最不利的地方. 我的一个猜想:人民币升值对国内CS研究会有不小的促进作用(当然, 人民币升值对整个中国来说利大于弊还是弊大于利很难说).** 134 | 135 | 参考如下:http://blog.sina.com.cn/s/blog_631a4cc40100xl7d.html 136 | 137 | -------------------------------------------------------------------------------- /DRL-ConferencePaper/NIPS/.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NeuronDance/DeepRL/8342ea71be1dffe26581d964f8011bcb973f85d6/DRL-ConferencePaper/NIPS/.DS_Store -------------------------------------------------------------------------------- /DRL-ConferencePaper/NIPS/2019/.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NeuronDance/DeepRL/8342ea71be1dffe26581d964f8011bcb973f85d6/DRL-ConferencePaper/NIPS/2019/.DS_Store -------------------------------------------------------------------------------- /DRL-ConferencePaper/README.md: -------------------------------------------------------------------------------- 1 | ### About 2 | 3 | 本部分包含IJCAI、AAAI,ICML,NIPS,ACL等顶级会议中关于深度强化学习的论文汇总,包括: 4 | 5 | ### IJCAI 6 | + 2019 7 | + 2018 8 | 9 | ### AAAI 10 | + 2019 11 | + 2018 12 | 13 | ### ICML 14 | + 2019 15 | + 2018 16 | 17 | ### NIPS 18 | + 2019 19 | + 2018 20 | 21 | ### ACL 22 | + 2019 23 | + 2018 -------------------------------------------------------------------------------- /DRL-Course/.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NeuronDance/DeepRL/8342ea71be1dffe26581d964f8011bcb973f85d6/DRL-Course/.DS_Store -------------------------------------------------------------------------------- /DRL-Course/DavidSliver强化学习课程/README.md: -------------------------------------------------------------------------------- 1 | # David Sliver强化学习课程 2 | 3 | David Silver主讲的一套强化学习视频公开课,较为系统、全面地介绍了强化学习的各种思想、实现算法。其一套公开课一共分为十讲,每讲平均为100分钟。 4 | 5 | 课程视频 6 | + [中文字幕版本](https://www.bilibili.com/video/av9831889/) 7 | + [英文原版](https://www.youtube.com/watch?v=2pWv7GOvuf0&list=PL7-jPKtc4r78-wCZcQn5IqyuWhBZ8fOxT) -------------------------------------------------------------------------------- /DRL-Course/DeepMind深度强化学习课程/README.md: -------------------------------------------------------------------------------- 1 | # DeepMind深度强化学习课程 2 | 这门课程是 DeepMind 与伦敦大学学院(UCL)的合作项目,由于 DeepMind 的研究人员去 UCL 授课,内容由两部分组成,一是深度学习(利用深度神经网络进行机器学习),二是强化学习(利用强化学习进行预测和控制),最后两条线结合在一起,也就成了 DeepMind 的拿手好戏——深度强化学习。 3 | 4 | 课程视频 5 | 
+ [youtube版本](https://www.youtube.com/playlist?list=PLqYmG7hTraZDNJre23vqCGIVpfZ_K2RZs) 6 | 7 | 课程内容 8 | 9 | + 深度学习 1:介绍基于机器学习的 AI 10 | 11 | + 深度学习 2:介绍 TensorFlow 12 | 13 | + 深度学习 3:神经网络基础 14 | 15 | + 强化学习 1:强化学习简介 16 | 17 | + 强化学习 2:开发和利用 18 | 19 | + 强化学习 3:马尔科夫决策过程和动态编程 20 | 21 | + 强化学习 4:无模型的预测和控制 22 | 23 | + 深度学习 4:图像识别、端到端学习和 Embeddings 之外 24 | 25 | + 强化学习 5:函数逼近和深度强化学习 26 | 27 | + 强化学习 6:策略梯度和 Actor Critics 28 | 29 | + 深度学习 5:机器学习的优化方法 30 | 31 | + 强化学习 7:规划和模型 32 | 33 | + 深度学习 6:NLP 的深度学习 34 | 35 | + 强化学习 8:深度强化学习中的高级话题 36 | 37 | + 深度学习 7:深度学习中的注意力和记忆 38 | 39 | + 强化学习 9:深度 RL 智能体简史 40 | 41 | + 深度学习 8:无监督学习和生成式模型 42 | 43 | + 强化学习 10:经典游戏的案例学习 -------------------------------------------------------------------------------- /DRL-Course/README.md: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NeuronDance/DeepRL/8342ea71be1dffe26581d964f8011bcb973f85d6/DRL-Course/README.md -------------------------------------------------------------------------------- /DRL-Course/UC Berkeley CS 294深度强化学习课程/README.md: -------------------------------------------------------------------------------- 1 | # UC Berkeley CS 294深度强化学习课程 2 | 本课程要求具有 CS 189 或同等学力。本课程将假定你已了解强化学习、数值优化和机器学习的相关背景知识。 3 | 4 | [课程主页](https://www.youtube.com/playlist?list=PLqYmG7hTraZDNJre23vqCGIVpfZ_K2RZs) 5 | 6 | 课程视频 7 | + [youtube版本](https://www.youtube.com/playlist?list=PLkFD6_40KJIznC9CDbVTjAF2oyt8_VAe3) 8 | -------------------------------------------------------------------------------- /DRL-Interviews/.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NeuronDance/DeepRL/8342ea71be1dffe26581d964f8011bcb973f85d6/DRL-Interviews/.DS_Store -------------------------------------------------------------------------------- /DRL-Interviews/drl-interview.assets/20171108090350229.jfif: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NeuronDance/DeepRL/8342ea71be1dffe26581d964f8011bcb973f85d6/DRL-Interviews/drl-interview.assets/20171108090350229.jfif -------------------------------------------------------------------------------- /DRL-Interviews/drl-interview.assets/eea4714c.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NeuronDance/DeepRL/8342ea71be1dffe26581d964f8011bcb973f85d6/DRL-Interviews/drl-interview.assets/eea4714c.png -------------------------------------------------------------------------------- /DRL-Interviews/drl-interview.assets/equation-1584541764589.svg: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 | 30 | 31 | 32 | 33 | 34 | 35 | 36 | 37 | 38 | 39 | 40 | 41 | 42 | 43 | 44 | 45 | 46 | 47 | 48 | 49 | 50 | 51 | 52 | 53 | 54 | 55 | 56 | 57 | 58 | 59 | 60 | 61 | 62 | 63 | 64 | 65 | 66 | 67 | 68 | 69 | 70 | 71 | 72 | 73 | 74 | 75 | 76 | 77 | 78 | 79 | 80 | 81 | 82 | 83 | 84 | 85 | 86 | 87 | 88 | 89 | 90 | 91 | 92 | 93 | 94 | 95 | 96 | 97 | 98 | 99 | 100 | -------------------------------------------------------------------------------- /DRL-Interviews/drl-interview.assets/equation.svg: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 | 26 | 27 
| 28 | 29 | 30 | 31 | 32 | 33 | 34 | 35 | 36 | 37 | 38 | 39 | 40 | 41 | 42 | 43 | 44 | 45 | 46 | 47 | 48 | 49 | 50 | 51 | 52 | 53 | 54 | 55 | 56 | 57 | 58 | 59 | 60 | 61 | 62 | 63 | 64 | 65 | 66 | 67 | 68 | 69 | 70 | 71 | 72 | 73 | 74 | 75 | 76 | 77 | 78 | 79 | 80 | 81 | 82 | 83 | 84 | 85 | 86 | 87 | 88 | 89 | 90 | 91 | 92 | 93 | 94 | 95 | 96 | 97 | 98 | 99 | 100 | 101 | 102 | 103 | 104 | 105 | 106 | 107 | 108 | 109 | 110 | 111 | 112 | 113 | 114 | 115 | 116 | 117 | 118 | 119 | 120 | 121 | 122 | 123 | 124 | 125 | 126 | 127 | 128 | 129 | 130 | 131 | 132 | 133 | 134 | 135 | 136 | 137 | 138 | 139 | 140 | 141 | -------------------------------------------------------------------------------- /DRL-Interviews/drl-interview.md: -------------------------------------------------------------------------------- 1 | ## 深度强化学习面试题目总结 2 | 1. 什么是强化学习? 3 | > 强化学习(Reinforcement Learning, RL),又称增强学习,是机器学习的范式和方法论之一,用于描述和解决智能体(agent)在与环境的交互过程中通过学习策略以达成回报最大化或实现特定目标的问题 4 | ![](assets/markdown-img-paste-20190923214243777.png) 5 | > 6 | > 7 | 2. 强化学习和监督学习、无监督学习的区别是什么? 8 | > 监督学习一般有标签信息,而且是单步决策问题,比如分类问题。监督学习的样本一般是独立 9 | >同分布的。无监督学习没有任何标签信息,一般对应的是聚类问题。强化学习介于监督和无监督学习之间,每一步决策之后会有一个标量的反馈信号,即回报。通过最大化回报以获得一个最优策略。因此强化学习一般是多步决策,并且样本之间有强的相关性。 10 | > 11 | > 12 | 3. 强化学习适合解决什么样子的问题? 13 | > 强化学习适合于解决模型未知,且当前决策会影响环境状态的(序列)决策问题。Bandit问题可以看成是一种特殊的强化学习问题,序列长度为1,单步决策后就完事了,所以动作不影响状态。当然也有影响的bandit问题,叫做contextual bandit问题。 14 | > 15 | > 16 | 4. 强化学习的损失函数(loss function)是什么?和深度学习的损失函数有何关系? 17 | > 累积回报。依赖于不同的问题设定,累积回报具有不同的形式。比如对于有限长度的MDP问题直接用回报和作为优化目标。对于无限长的问题,为了保证求和是有意义的,需要使用折扣累积回报或者平均回报。深度学习的损失函数一般是多个独立同分布样本预测值和标签值的误差,需要最小化。强化学习 的损失函数是轨迹上的累积和,需要最大化。 18 | > 19 | > 20 | 5. POMDP是什么?马尔科夫过程是什么?马尔科夫决策过程是什么?里面的“马尔科夫”体现了什么性质? 21 | > POMDP是部分可观测马尔科夫决策问题。 22 | >马尔科夫过程表示一个状态序列,每一个状态是一个随机变量,变量之间满足马尔科夫性,表示为 23 | >一个元组,S是状态,P表示转移概率。 24 | >MDP表示为一个五元组,S是状态集合,A是动作集合,P表示转移概率,即模型, 25 | >R是回报函数,$\gamma$表示折扣因子。 26 | >马尔科夫体现了无后效性,也就是说未来的决策之和当前的状态有关,和历史状态无关。 27 | > 28 | 29 | 6. 贝尔曼方程的具体数学表达式是什么? 30 | > 对于状态值函数的贝尔曼方程: 31 | > 32 | >![](assets/interview-6-1.png) 33 | > 34 | >对于动作值函数的贝尔曼方程: 35 | > 36 | >![](assets/interview-6-2.png) 37 | 38 | 39 | 7. 最优值函数和最优策略为什么等价? 40 | > 最优值函数唯一的确定了某个状态或这状态-动作对相对比他状态和状态-动作对的利好,我们可以 41 | >依赖这个值唯一的确定当前的动作,他们是对应的,所以等价。 42 | 43 | 8. 值迭代和策略迭代的区别? 44 | > 策略迭代。它有两个循环,一个是在策略估计的时候,为了求当前策略的值函数需要迭代很多次。 45 | >另外一个是外面的大循环,就是策略评估,策略提升这个循环。值迭代算法则是一步到位,直接估计 46 | >最优值函数,因此没有策略提升环节。 47 | >参考[值迭代算法](https://zhuanlan.zhihu.com/p/55217561) 48 | > 49 | > 50 | 9. 如果不满足马尔科夫性怎么办?当前时刻的状态和它之前很多很多个状态都有关之间关系? 51 | > 如果不满足马尔科夫性,强行只用当前的状态来决策,势必导致决策的片面性,得到不好的策略。 52 | >为了解决这个问题,可以利用RNN对历史信息建模,获得包含历史信息的状态表征。表征过程可以 53 | >使用注意力机制等手段。最后在表征状态空间求解MDP问题。 54 | 55 | 10. 求解马尔科夫决策过程都有哪些方法?有模型用什么方法?动态规划是怎么回事? 56 | > 方法有:动态规划,时间差分,蒙特卡洛。 57 | >有模型可以使用动态规划方法。 58 | >动态规划是指:将一个问题拆分成几个子问题,分别求解这些子问题,然后获得原问题的解。 59 | >贝尔曼方程中为了求解一个状态的值函数,利用了其他状态的值函数,就是这种思想(个人觉得)。 60 | > 61 | > 62 | 11. 简述动态规划(DP)算法? 63 | > 不知道说的是一般的DP算法,还是为了解决MDP问题的DP算法。这里假设指的是后者。总的来说DP方法 64 | >就是利用最优贝尔曼方程来更新值函数以求解策略的方法。最优贝尔曼方程如下: 65 | >![](assets/interview-11.png) 66 | > 67 | >参考[DP算法概述](https://zhuanlan.zhihu.com/p/54763496/edit) 68 | > 69 | > 70 | 12. 简述蒙特卡罗估计值函数(MC)算法。 71 | > 蒙特卡洛就死采样仿真,蒙特卡洛估计值函数就是根据采集的数据,利用值函数的定义来更新 72 | >值函数。这里假设是基于表格的问题,步骤如下: 73 | > 74 | >初始化,这里为每一个状态初始化了一个 Return(s) 的列表。可以想象列表中每一个元素就是一次累积回报。 75 | > 76 | >根据策略pi生成轨迹tau 77 | > 78 | >利用轨迹,统计每个状态对应的后续累积回报,并将这个值加入对应状态的Return(s)列表 79 | > 80 | >重复若干次,每次都会往对应出现的状态回报列表中加入新的值。最后根据定义,每个状态回报列表中数字的均值 81 | >就是他的值估计 82 | > 83 | >参考[MC值估计](https://zhuanlan.zhihu.com/p/55487868) 84 | 85 | 13. 
简述时间差分(TD)算法。 86 | > TD,MC和DP算法都使用广义策略迭代来求解策略,区别仅仅在于值函数估计的方法不同。DP使用的是贝尔曼 87 | >方程,MC使用的是采样法,而TD方法的核心是使用自举(bootstrapping),即值函数的更新为: 88 | >V(s_t) <-- r_t + \gamma V(s_next),使用了下一个状态的值函数来估计当前状态的值。 89 | 90 | 14. 简述动态规划、蒙特卡洛和时间差分的对比(共同点和不同点) 91 | > 此时必须祭出一张图: 92 | > 93 | >![](assets/interview-14.png) 94 | > 95 | >简单来说,共同点:都是用来估计值函数的一种手段。 96 | > 97 | >不同点: 98 | > 99 | >MC通过采样求均值的方法求解;DP由于已知模型,因此直接可以计算期望,不用采样,但是DP仍然 100 | >使用了自举;TD结合了采样和自举 101 | 102 | 15. MC和TD分别是无偏估计吗? 103 | > MC是无偏的,TD是有偏的 104 | 105 | 16. MC、TD谁的方差大,为什么? 106 | > MC的方差大,以为TD使用了自举,实现一种类似于平滑的效果,所以估计的值函数方差小。 107 | > 108 | 17. 简述on-policy和off-policy的区别 109 | > on-policy:行为策略和要优化的策略是一个策略,更新了策略后,就用该策略的最新版本采样数据。off-policy:使用任意的一个行为策略来收集收据,利用收集的数据更新目标策略。 110 | > 111 | 18. 简述Q-Learning,写出其Q(s,a)更新公式。它是on-policy还是off-policy,为什么? 112 | > Q学习是通过计算最优动作值函数来求策略的一种算法,更新公式为: 113 | > 114 | >Q(s_t, a_t) <--- Q(s_t, a_t) + \alpha [R_{t+1} + \gamma \max_a Q(s_{t+1}, a) - Q(s_t, a_t)] 115 | > 116 | >是离策略的,由于是值更新使用了下一个时刻的argmax_a Q,所以我们只关心哪个动作使得 Q(s_{t+1}, a) 取得最大值, 117 | >而实际到底采取了哪个动作(行为策略),并不关心。 118 | >这表明优化策略并没有用到行为策略的数据,所以说它是离策略(off-policy)的。 119 | > 120 | > 121 | 19. 写出用第n步的值函数更新当前值函数的公式(1-step,2-step,n-step的意思)。当n的取值变大时,期望和方差分别变大、变小? 122 | > n-step的更新目标: 123 | > 124 | > ![](assets/interview-19-1.png) 125 | > 126 | >利用更细目标更新当前值函数,趋近于目标: 127 | > 128 | >![](assets/interview-19-2.png) 129 | > 130 | >当n越大时,越接近于MC方法,因此方差越大,期望越接近于真实值,偏差越小。 131 | 132 | 133 | 20. TD(λ)方法:当λ=0时实际上与哪种方法等价,λ=1呢? 134 | > 当lambda = 0等价于TD(0);lambda = 1时等价于折扣形式的MC方法。参考[文章](https://zhuanlan.zhihu.com/p/72587762)。 135 | 136 | 21. 写出蒙特卡洛、TD和TD(λ)这三种方法更新值函数的公式? 137 | > MC更新公式参考问题14附图(左图公式),如果是单步的TD方法,更新公式参考问题14附图(中), 138 | >n步的更新参考问题19。TD(lambda)用lambda-return更新值函数,lambda-return是n-step 139 | >return的加权和,n-step回报的系数为\lmabda^{n-1},所以lambda-return等于: 140 | > 141 | >![](assets/interview-21-1.png) 142 | > 143 | >所以TD(lambda)的更新公式只需要把问题19的更新公式中的G_{t:t+n}换成上图的G_t^lambda就行。 144 | >当然这种更新叫做前向视角,是离线的,因为要计算G_t^n,为了在线更新,需要用到资格迹。定义 145 | >资格迹为: 146 | > 147 | >![](assets/interview-21-2.png) 148 | > 149 | >利用资格迹后,值函数可以在线更新为: 150 | > 151 | >![](assets/interview-21-3.png) 152 | > 153 | >更多的关于TD(lambda)细节参考[文章](https://amreis.github.io/ml/reinf-learn/2017/11/02/reinforcement-learning-eligibility-traces.html) 154 | 155 | 22. value-based和policy-based的区别是什么? 156 | > value-based通过求解最优值函数间接的求解最优策略;policy-based的方法直接将策略参数化, 157 | >通过策略搜索,策略梯度或者进化方法来更新策略的参数以最大化回报。基于值函数的方法不易扩展到 158 | >连续动作空间,并且当同时采用非线性近似、自举和离策略时会有收敛性问题。策略梯度具有良好的 159 | >收敛性证明。 160 | 23. DQN的两个关键trick分别是什么? 161 | > 使用目标网络(target network)来缓解训练不稳定的问题;经验回放 162 | 163 | 24. 阐述目标网络和experience replay的作用? 164 | > 在DQN中某个动作值函数的更新依赖于其他动作值函数。如果我们一直更新值网络的参数,会导致 165 | >更新目标不断变化,也就是我们在追逐一个不断变化的目标,这样势必会不太稳定。引入目标网络就是把 166 | >更新目标中不断变化的值先稳定一段时间,更新参数,然后再更新目标网络。这样在一定的阶段内 167 | >目标是固定的,训练也更稳定。 168 | > 169 | >经验回放应该是为了消除样本之间的相关性。 170 | 25. 手工推导策略梯度过程? 171 | > ![](assets/interview-25.png) 172 | 173 | 参考[文章](https://spinningup.openai.com/en/latest/spinningup/rl_intro3.html) 174 | 175 | 26. 描述随机策略和确定性策略的特点? 176 | > 随机策略表示为某个状态下动作取值的分布,确定性策略在每个状态只有一个确定的动作可以选。 177 | >从熵的角度来说,确定性策略的熵为0,没有任何随机性。随机策略有利于我们进行适度的探索,确定 178 | >性策略的探索问题更为严峻。 179 | 180 | 27. 不打破数据相关性,神经网络的训练效果为什么就不好? 181 | > 在神经网络中通常使用随机梯度下降法。随机的意思是我们随机选择一些样本来增量式的估计梯度,比如常用的 182 | > 采用batch训练。如果样本是相关的,那就意味着前后两个batch的很可能也是相关的,那么估计的梯度也会呈现 183 | > 出某种相关性。如果不幸的情况下,后面的梯度估计可能会抵消掉前面的梯度量。从而使得训练难以收敛。 184 | 185 | 186 | 187 | 28. 画出DQN玩Flappy Bird的流程图。在这个游戏中,状态是什么,状态是怎么转移的?奖赏函数如何设计,有没有奖赏延迟问题? 188 | 189 | 190 | 29. DQN都有哪些变种?引入状态奖励的是哪种? 191 | > Double DQN, 优先经验回放, Dueling-DQN 192 | 193 | 30. 简述double DQN原理? 
194 | > DQN由于总是选择当前值函数最大的动作值函数来更新当前的动作值函数,因此存在着过估计问题(估计的值函数大于真实的值函数)。 195 | > 为了解耦这两个过程,double DQN 使用了两个值网络,一个网络用来执行动作选择,然后用另一个值函数对一个的动作值更新当前 196 | > 网络。比如要更新Q1, 在下一个状态使得Q1取得最大值的动作为a*,那么Q1的更新为 r + gamma (Q2(s_, a*).二者交替训练。 197 | 198 | 199 | 31. 策略梯度方法中基线baseline如何确定? 200 | > 基线只要不是动作a的函数就可以,常用的选择可以是状态值函数v(s) 201 | 202 | 203 | 32. 画出DDPG框架结构图?![ddpg total arch](./drl-interview.assets/20171108090350229.jfif) 204 | 205 | 206 | 33. Actor-Critic两者的区别是什么? 207 | > Actor是策略模块,输出动作;critic是判别器,用来计算值函数。 208 | 209 | 34. actor-critic框架中的critic起了什么作用? 210 | > critic表示了对于当前决策好坏的衡量。结合策略模块,当critic判别某个动作的选择时有益的,策略就更新参数以增大该动作出现的概率,反之降低 211 | > 动作出现的概率。 212 | 213 | 35. DDPG是on-policy还是off-policy,为什么? 214 | > off-policy。因为在DDPG为了保证一定的探索,对于输出动作加了一定的噪音,也就是说行为策略不再是优化的策略。 215 | 216 | 36. 是否了解过D4PG算法?简述其过程 217 | 218 | > **Distributed Distributional DDPG (D4PG)** 分布的分布式DDPG, 主要改进: 219 | > 220 | > 1. **分布式 critic:** 不再只估计Q值的期望值,而是去估计期望Q值的分布, 即将期望Q值作为一个随机变量来进行估计。 221 | > 2. **N步累计回报:** 当计算TD误差时,D4PG计算的是N步的TD目标值而不仅仅只有一步,这样就可以考虑未来更多步骤的回报。 222 | > 3. **多个分布式并行演员:**D4PG使用K个独立的演员并行收集训练样本并存储到同一个回访缓冲中。 223 | > 4. **优先经验回放**(**Prioritized Experience Replay**,[**PER**](https://arxiv.org/abs/1511.05952)):最后一个改进是使用一个非均匀的概率 $\pi$ 从一个大小为 RR 的回放缓冲中进行采样。 224 | > 225 | > 详见:[Lil'Log](https://lilianweng.github.io/lil-log/2018/04/08/policy-gradient-algorithms.html#d4pg) 226 | 227 | 228 | 37. 简述A3C算法?A3C是on-policy还是off-policy,为什么? 229 | 230 | > **异步优势演员-评论家方法**: 评论家学习值函数,同时有多个演员并行训练并且不时与全局参数同步。A3C旨在用于并行训练。 231 | > 232 | > on-policy. 233 | 234 | 235 | 37. A3C算法是如何异步更新的?是否能够阐述GA3C和A3C的区别? 236 | 237 | > 下面是算法大纲: 238 | > 239 | > 1. 定义全局参数 $\theta$ 和 $w$ 以及特定线程参数 $θ′$ 和 $w′$。 240 | > 241 | > 2. 初始化时间步 $t=1$。 242 | > 243 | > 3. 当 $T<=T_{max}$: 244 | > 245 | > 1. 重置梯度:$dθ=0$ 并且 $dw=0$。 246 | > 247 | > 2. 将特定于线程的参数与全局参数同步:$θ′=θ$ 以及 $w′=w$。 248 | > 249 | > 3. 令 $t_{start} =t$ 并且随机采样一个初始状态 $s_t$。 250 | > 251 | > 4. 当 ($s_t!=$ 终止状态)并$t−t_{start}<=t_{max}$: 252 | > 253 | > 1. 根据当前线程的策略选择当前执行的动作 $a_t∼π_{θ′}(a_t|s_t)$,执行动作后接收回报$r_t$然后转移到下一个状态st+1。 254 | > 2. 更新 t 以及 T:t=t+1 并且 T=T+1。 255 | > 256 | > 5. 初始化保存累积回报估计值的变量:r={0Vw′(st) 如果 st 是终止状态 否则r={0 如果 st 是终止状态Vw′(st) 否则 257 | > 258 | > 6. 对于 i=t−1,…,tstarti=t−1,…,tstart: 259 | > 260 | > 1. r←γr+rir←γr+ri;这里 rr 是 GiGi 的蒙特卡洛估计。 261 | > 262 | > 2. 累积关于参数 θ′θ′ 的梯度:dθ←dθ+∇θ′logπθ′(ai|si)(r−Vw′(si))dθ←dθ+∇θ′log⁡πθ′(ai|si)(r−Vw′(si)); 263 | > 264 | > 累积关于参数 w′w′ 的梯度:dw←dw+2(r−Vw′(si))∇w′(r−Vw′(si))dw←dw+2(r−Vw′(si))∇w′(r−Vw′(si))。 265 | > 266 | > 7. 分别使用 dθdθ 以及 dwdw异步更新 θθ 以及 ww。 267 | 268 | 269 | 37. 简述A3C的优势函数? 270 | 271 | > A(s,a)=Q(s,a)-V(s) 272 | > 273 | > 是为了解决value-based方法具有高变异性。它代表着与该状态下采取的平均行动相比所取得的进步 274 | > 275 | > 如果 A(s,a)>0: 梯度被推向了该方向 276 | > 277 | > 如果 A(s,a)<0: (我们的action比该state下的平均值还差) 梯度被推向了反方向 278 | > 279 | > 但是这样就需要两套 value function 280 | > 281 | > 所以可以使用TD error 做估计:$A(s,a)=r+\gamma V(s')-V(s)$ 282 | 283 | 284 | 37. 什么是重要性采样? 285 | 286 | > 期望:$E|f|=\int_{x} p(x) f(x) d x=\frac{1}{N} \sum_{i=1}^{N} f\left(x_{i}\right)$ 按照 p(x)的分布来产生随机数进行采样 287 | > 288 | > 在采样分布未知的情况下,引入新的已知分布q(x),将期望修正为 289 | > 290 | > $E|f|=\int_{x} q(x)\left(\frac{p(x)}{q(x)} f(x)\right) d x$ 291 | > 292 | > 这样就可以针对q(x)来对 p(x)/q(x)*f(x)进行采样了 293 | > 294 | > $E|f|=\frac{1}{N} \sum_{i=1}^{N} \frac{p\left(x_{i}^{\prime}\right)}{q\left(x_{i}^{\prime}\right)} f\left(x_{i}^{\prime}\right)$ 295 | > 296 | > 即为重要性采样。 297 | 298 | 299 | 37. 为什么TRPO能保证新策略的回报函数单调不减? 300 | 301 | > 在每次迭代时对策略更新的幅度强制施加KL散度约束来避免更新一步就使得策略发生剧烈变化. 
302 | > 303 | > 将新策略所对应的回报函数分解成旧的策略所对应的回报函数+其他项 304 | > 305 | > $\eta(\tilde{\pi})=\eta(\pi)+E_{s_{0}, a_{0}, \cdots \tilde{\pi}}\left[\sum_{t=0}^{\infty} \gamma^{t} A_{\pi}\left(s_{t}, a_{t}\right)\right]$ 306 | > 307 | > 如果这个其他项大于等于0,那么就可以保证。 308 | 309 | 310 | 37. TRPO是如何通过优化方法使每个局部点找到让损失函数非增的最优步长来解决学习率的问题; 311 | 312 | 38. 如何理解利用平均KL散度代替最大KL散度? 313 | 314 | 39. 简述PPO算法?与TRPO算法有何关系? 315 | 316 | > 思想与TRPO相同,都是为了避免过大的策略更新, PPO把约束转换到loss函数中 317 | 318 | ![[公式]](./drl-interview.assets/equation.svg) 319 | 320 | ![[公式]](./drl-interview.assets/equation-1584541764589.svg) 321 | 322 | 323 | 37. 简述DPPO和PPO的关系? 324 | 325 | > Deepmind在OpenAI的PPO基础上做的多线程版 326 | 327 | 328 | 37. 强化学习如何用在推荐系统中? 329 | 330 | > 可以把用户过去的点击购买的商品作为 State, 把推荐的商品作为 Action 331 | 332 | 333 | 37. 推荐场景中奖赏函数如何设计? 334 | 335 | > 点击率和下单率 336 | > 337 | > ![img](./drl-interview.assets/eea4714c.png) 338 | 339 | 340 | 37. 场景中状态是什么,当前状态怎么转移到下一状态? 341 | 38. 自动驾驶和机器人的场景如何建模成强化学习问题?MDP各元素对应真实场景中的哪些变量? 342 | 343 | > 这就因项目而异了吧 344 | 345 | 346 | 37. 强化学习需要大量数据,如何生成或采集到这些数据? 347 | 348 | > Simulator是个好东西,mujoco对真实环境模拟 349 | 350 | 351 | 37. 是否用某种DRL算法玩过Torcs游戏?具体怎么解决? 352 | 353 | 354 | 355 | 356 | 37. 是否了解过奖励函数的设置(reward shaping)? 357 | 358 | > [强化学习奖励函数塑形简介(The reward shaping of RL)](https://zhuanlan.zhihu.com/p/56425081) 359 | 360 | ### 贡献致谢列表 361 | @[huiwenzhang](https://github.com/huiwenzhang) 362 | 363 | @[skylark0924](https://github.com/skylark0924) 364 | 365 | 366 | #### 参考及引用链接: 367 | 368 | [1]https://zhuanlan.zhihu.com/p/33133828
369 | [2]https://aemah.github.io/2018/11/07/RL_interview/ 370 | -------------------------------------------------------------------------------- /DRL-Multi-Agent/.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NeuronDance/DeepRL/8342ea71be1dffe26581d964f8011bcb973f85d6/DRL-Multi-Agent/.DS_Store -------------------------------------------------------------------------------- /DRL-Multi-Agent/Nick_Sun: -------------------------------------------------------------------------------- 1 | Book 2 | ===== 3 | Shoham, Yoav, and Kevin Leyton-Brown. Multiagent systems: Algorithmic, game-theoretic, and logical foundations. Cambridge University Press, 2008. http://www.masfoundations.org/download.html 4 | 5 | Overview 6 | ===== 7 | Buşoniu, Lucian, Robert Babuška, and Bart De Schutter. "Multi-agent reinforcement learning: An overview." Innovations in multi-agent systems and applications-1. Springer, Berlin, Heidelberg, 2010. 183-221. http://www.dcsc.tudelft.nl/~bdeschutter/pub/rep/10_003.pdf 8 | (Code: http://busoniu.net/repository.php) 9 | Dobrev, Dimiter. "The Definition of AI in Terms of Multi Agent Systems." arXiv preprint arXiv:1210.0887 (2012). https://arxiv.org/ftp/arxiv/papers/1210/1210.0887.pdf 10 | Multiagent-Reinforcement-Learning (ppt), 2013. : http://www.ecmlpkdd2013.org/wp-content/uploads/2013/09/Multiagent-Reinforcement-Learning.pdf 11 | Kapoor, Sanyam. "Multi-agent reinforcement learning: A report on challenges and approaches." arXiv preprint arXiv:1807.09427 (2018). https://arxiv.org/abs/1807.09427 12 | 13 | Algorithm 14 | ===== 15 | (JAL) Claus, Caroline, and Craig Boutilier. "The dynamics of reinforcement learning in cooperative multiagent systems." AAAI/IAAI 1998.746-752 (1998): 2. https://www.aaai.org/Papers/AAAI/1998/AAAI98-106.pdf 16 | (Distributed Q-learning) Lauer, Martin, and Martin Riedmiller. "An algorithm for distributed reinforcement learning in cooperative multi-agent systems." In Proceedings of the Seventeenth International Conference on Machine Learning. 2000. https://www.researchgate.net/publication/2641625_An_Algorithm_for_Distributed_Reinforcement_Learning_in_Cooperative_Multi-Agent_Systems 17 | (team Q-learning) Littman, Michael L. "Value-function reinforcement learning in Markov games." Cognitive Systems Research 2.1 (2001): 55-66. http://www.sts.rpi.edu/~rsun/si-mal/article3.pdf 18 | (FMQ) Kapetanakis, Spiros, and Daniel Kudenko. "Reinforcement learning of coordination in cooperative multi-agent systems." AAAI/IAAI 2002 (2002): 326-331. https://www.aaai.org/Papers/AAAI/2002/AAAI02-050.pdf 19 | (OAL) Wang, Xiaofeng, and Tuomas Sandholm. "Reinforcement learning to play an optimal Nash equilibrium in team Markov games." Advances in neural information processing systems. 2003. https://papers.nips.cc/paper/2171-reinforcement-learning-to-play-an-optimal-nash-equilibrium-in-team-markov-games.pdf 20 | Qi, Dehu, and Ron Sun. "A multi-agent system integrating reinforcement learning, bidding and genetic algorithms." Web Intelligence and Agent Systems: An International Journal 1.3, 4 (2003): 187-202. https://pdfs.semanticscholar.org/2cb8/885ea3d8d6bccde87153f18f8be7f23ff935.pdf 21 | (TEAM-Q) Wang, Ying, and Clarence W. De Silva. "Multi-robot box-pushing: Single-agent q-learning vs. team q-learning." 2006 IEEE/RSJ International Conference on Intelligent Robots and Systems. IEEE, 2006. 
https://ieeexplore.ieee.org/document/4058979 22 | (FMQ) Matignon, Laëtitia, Guillaume Laurent, and Nadine Le Fort-Piat. "A study of FMQ heuristic in cooperative multi-agent games." 2008. https://www.researchgate.net/publication/29616600_A_study_of_FMQ_heuristic_in_cooperative_multi-agent_games 23 | (MADDPG) Lowe, Ryan, et al. "Multi-agent actor-critic for mixed cooperative-competitive environments." Advances in Neural Information Processing Systems. 2017. https://arxiv.org/abs/1706.02275 24 | Foerster, Jakob N., et al. "Counterfactual multi-agent policy gradients." Thirty-Second AAAI Conference on Artificial Intelligence. 2018. https://arxiv.org/abs/1705.08926 25 | 26 | Cooperation 27 | ===== 28 | Kok, Jelle R., and Nikos Vlassis. "Sparse cooperative Q-learning." Proceedings of the twenty-first international conference on Machine learning. ACM, 2004. https://icml.cc/Conferences/2004/proceedings/papers/267.pdf 29 | Crandall, Jacob W., and Michael A. Goodrich. "Learning to compete, compromise, and cooperate in repeated general-sum games." Proceedings of the 22nd international conference on Machine learning. ACM, 2005. http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.448.8292&rep=rep1&type=pdf 30 | Panait, Liviu, and Sean Luke. "Cooperative multi-agent learning: The state of the art." Autonomous agents and multi-agent systems 11.3 (2005): 387-434. https://cs.gmu.edu/~eclab/papers/panait05cooperative.pdf 31 | Kok, Jelle R., and Nikos Vlassis. "Collaborative multiagent reinforcement learning by payoff propagation." Journal of Machine Learning Research 7.Sep (2006): 1789-1828. http://www.jmlr.org/papers/volume7/kok06a/kok06a.pdf 32 | De Cote, Enrique Munoz, Alessandro Lazaric, and Marcello Restelli. "Learning to cooperate in multi-agent social dilemmas." AAMAS. Vol. 6. 2006. http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.107.335&rep=rep1&type=pdf 33 | Ma, Jie, and Stephen Cameron. "Combining policy search with planning in multi-agent cooperation." Robot Soccer World Cup. Springer, Berlin, Heidelberg, 2008. https://www.researchgate.net/publication/220797588_Combining_Policy_Search_with_Planning_in_Multi-agent_Cooperation 34 | Tampuu, Ardi, et al. "Multiagent cooperation and competition with deep reinforcement learning." PloS one 12.4 (2017): e0172395. https://arxiv.org/abs/1511.08779 35 | 36 | Coordination 37 | ===== 38 | Kapetanakis, Spiros, and Daniel Kudenko. "Reinforcement learning of coordination in cooperative multi-agent systems." AAAI/IAAI 2002 (2002): 326-331. https://www.aaai.org/Papers/AAAI/2002/AAAI02-050.pdf 39 | Lau, Qiangfeng Peter, Mong Li Lee, and Wynne Hsu. "Coordination guided reinforcement learning." Proceedings of the 11th International Conference on Autonomous Agents and Multiagent Systems-Volume 1. International Foundation for Autonomous Agents and Multiagent Systems, 2012. http://www.ifaamas.org/Proceedings/aamas2012/papers/1B_1.pdf 40 | Zhang, Chongjie, and Victor Lesser. "Coordinating multi-agent reinforcement learning with limited communication." Proceedings of the 2013 international conference on Autonomous agents and multi-agent systems. International Foundation for Autonomous Agents and Multiagent Systems, 2013. https://pdfs.semanticscholar.org/5e7b/0822821575555e318845531b6d5b2d359b18.pdf 41 | Hao, Jianye, et al. "Reinforcement social learning of coordination in networked cooperative multiagent systems." Workshops at the Twenty-Eighth AAAI Conference on Artificial Intelligence. 2014. 
http://mipc.inf.ed.ac.uk/2014/papers/mipc2014_hao_etal.pdf 42 | Le, Hoang M., et al. "Coordinated multi-agent imitation learning." Proceedings of the 34th International Conference on Machine Learning-Volume 70. JMLR. org, 2017. https://arxiv.org/abs/1703.03121 43 | Khadka, Shauharda, Somdeb Majumdar, and Kagan Tumer. "Evolutionary Reinforcement Learning for Sample-Efficient Multiagent Coordination." arXiv preprint arXiv:1906.07315 (2019). https://arxiv.org/abs/1906.07315 44 | 45 | Communicate 46 | ===== 47 | Varshavskaya, Paulina, Leslie Pack Kaelbling, and Daniela Rus. "Efficient distributed reinforcement learning through agreement." Distributed Autonomous Robotic Systems 8. Springer, Berlin, Heidelberg, 2009. 367-378. https://www.researchgate.net/publication/241128592_Efficient_Distributed_Reinforcement_Learning_Through_Agreement 48 | Hausknecht, Matthew John. Cooperation and communication in multiagent deep reinforcement learning. Diss. 2016. http://www.cs.utexas.edu/~larg/hausknecht_thesis/slides/thesis.pdf 49 | Sukhbaatar, Sainbayar, and Rob Fergus. "Learning multiagent communication with backpropagation." Advances in Neural Information Processing Systems. 2016. https://arxiv.org/abs/1605.07736 50 | Foerster, Jakob, et al. "Learning to communicate with deep multi-agent reinforcement learning." Advances in Neural Information Processing Systems. 2016. https://arxiv.org/abs/1605.06676 51 | 52 | Application 53 | ===== 54 | Zheng, Lianmin, et al. "MAgent: A many-agent reinforcement learning platform for artificial collective intelligence." Thirty-Second AAAI Conference on Artificial Intelligence. 2018. https://arxiv.org/abs/1712.00600 55 | Shalev-Shwartz, Shai, Shaked Shammah, and Amnon Shashua. "Safe, multi-agent, reinforcement learning for autonomous driving." arXiv preprint arXiv:1610.03295 (2016). 
https://arxiv.org/abs/1610.03295 56 | -------------------------------------------------------------------------------- /DRL-Multi-Agent/QMIX.assets/v2-79fe8838e84d6def61e3db6cf7332428_hd.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NeuronDance/DeepRL/8342ea71be1dffe26581d964f8011bcb973f85d6/DRL-Multi-Agent/QMIX.assets/v2-79fe8838e84d6def61e3db6cf7332428_hd.jpg -------------------------------------------------------------------------------- /DRL-Multi-Agent/QMIX.assets/v2-98cea01bf7d7d2239d4d50460a57e6cf_hd.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NeuronDance/DeepRL/8342ea71be1dffe26581d964f8011bcb973f85d6/DRL-Multi-Agent/QMIX.assets/v2-98cea01bf7d7d2239d4d50460a57e6cf_hd.jpg -------------------------------------------------------------------------------- /DRL-Multi-Agent/QMIX.md: -------------------------------------------------------------------------------- 1 | 2 | 3 | # QMIX 4 | 5 | ## Theory 6 | 7 | 论文中提出了一种能以**中心化**的端到端的方式训练去中心化策略的基于价值的全新方法 QMIX。QMIX 能够将仅基于局部观察的每个智能体的价值以复杂的非线性方式组合起来,估计**联合的动作-价值**。 8 | 9 | ## Algorithm 10 | 11 | ## ![img](./QMIX.assets/v2-98cea01bf7d7d2239d4d50460a57e6cf_hd.jpg)![img](./QMIX.assets/v2-79fe8838e84d6def61e3db6cf7332428_hd.jpg) 12 | 13 | ## Paper 14 | 15 | **QMIX: Monotonic Value Function Factorisation for Deep Multi-Agent Reinforcement Learning** [ ICML 2018 ](https://arxiv.org/pdf/1803.11485.pdf ) 16 | 17 | ## Application 18 | 19 | 论文中假设环境为模拟器环境或实验室环境, 在这种环境下,**交流的限制被解除**,**全局信息**是可以获得的。 20 | 21 | 本文 在**starCraft**游戏上实验 22 | 23 | ## Code 24 | 25 | **QMIX by Ray framework**: https://github.com/ray-project/ray/tree/master/rllib/agents/qmix 26 | 27 | 28 | 29 | ## Cite 30 | 31 | 32 | 33 | 34 | 35 | -------------------------------------------------------------------------------- /DRL-Multi-Agent/README.md: -------------------------------------------------------------------------------- 1 | ### Book 2 | 3 | [1].[Shoham, Yoav, and Kevin Leyton-Brown. Multiagent systems: Algorithmic, game-theoretic, and logical foundations. Cambridge University Press, 2008.](http://www.masfoundations.org/download.html)
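To make the monotonic value-factorisation idea summarised in QMIX.md above more concrete, here is a minimal, illustrative PyTorch sketch of a QMIX-style mixing network. This is not the authors' reference implementation: the class name `QMixer`, the dimensions `n_agents`, `state_dim`, `embed_dim` and the single-layer hypernetworks are simplifications chosen only for this example.

```python
import torch
import torch.nn as nn


class QMixer(nn.Module):
    """Illustrative QMIX-style mixing network: combines per-agent Q-values
    into a joint Q_tot, with mixing weights produced from the global state
    and forced to be non-negative so that dQ_tot/dQ_i >= 0 (monotonicity)."""

    def __init__(self, n_agents, state_dim, embed_dim=32):
        super().__init__()
        self.n_agents = n_agents
        self.embed_dim = embed_dim
        # Hypernetworks: map the global state to mixing weights / biases.
        self.hyper_w1 = nn.Linear(state_dim, n_agents * embed_dim)
        self.hyper_b1 = nn.Linear(state_dim, embed_dim)
        self.hyper_w2 = nn.Linear(state_dim, embed_dim)
        self.hyper_b2 = nn.Sequential(
            nn.Linear(state_dim, embed_dim), nn.ReLU(), nn.Linear(embed_dim, 1))

    def forward(self, agent_qs, state):
        # agent_qs: (batch, n_agents), state: (batch, state_dim)
        bs = agent_qs.size(0)
        agent_qs = agent_qs.view(bs, 1, self.n_agents)
        # abs() keeps the mixing weights non-negative -> monotonic mixing.
        w1 = torch.abs(self.hyper_w1(state)).view(bs, self.n_agents, self.embed_dim)
        b1 = self.hyper_b1(state).view(bs, 1, self.embed_dim)
        hidden = torch.relu(torch.bmm(agent_qs, w1) + b1)   # (bs, 1, embed_dim)
        w2 = torch.abs(self.hyper_w2(state)).view(bs, self.embed_dim, 1)
        b2 = self.hyper_b2(state).view(bs, 1, 1)
        q_tot = torch.bmm(hidden, w2) + b2                  # (bs, 1, 1)
        return q_tot.view(bs, 1)


# Smoke test with random data: 2 agents, global state of dimension 8.
mixer = QMixer(n_agents=2, state_dim=8)
print(mixer(torch.randn(4, 2), torch.randn(4, 8)).shape)  # torch.Size([4, 1])
```

The only thing enforcing the monotonicity constraint here is the `torch.abs` on the hypernetwork outputs; the per-agent utility networks and the TD loss around this module are omitted. For a complete implementation, see the RLlib QMIX link in the Code section below.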
4 | 5 | ### Overview 6 | [1].[Buşoniu, Lucian, Robert Babuška, and Bart De Schutter. "Multi-agent reinforcement learning: An overview." Innovations in multi-agent systems and applications-1. Springer, Berlin, Heidelberg, 2010. 183-221.](http://www.dcsc.tudelft.nl/~bdeschutter/pub/rep/10_003.pdf) 代码请点击[Code](http://busoniu.net/repository.php)
7 | 8 | [2].[Dobrev, Dimiter. "The Definition of AI in Terms of Multi Agent Systems." arXiv preprint arXiv:1210.0887 (2012).](https://arxiv.org/ftp/arxiv/papers/1210/1210.0887.pdf) 9 | [3].[Multiagent-Reinforcement-Learning (ppt), 2013.](http://www.ecmlpkdd2013.org/wp-content/uploads/2013/09/Multiagent-Reinforcement-Learning.pdf)
10 | [4].[Kapoor, Sanyam. "Multi-agent reinforcement learning: A report on challenges and approaches." arXiv preprint arXiv:1807.09427 (2018).](https://arxiv.org/abs/1807.09427)
11 | 12 | ### Algorithm 13 | 14 | **VDN:** 15 | 16 | **Value-Decomposition Networks For Cooperative Multi-Agent Learning** [ AAMAS 2018 ]( https://arxiv.org/pdf/1706.05296.pdf ) 17 | 18 | **QMIX: ** 19 | 20 | **Monotonic Value Function Factorisation for Deep Multi-Agent Reinforcement Learning** [ ICML 2018 ](https://arxiv.org/pdf/1803.11485.pdf ) 21 | 22 | **JAL:** 23 | 24 | [(JAL) Claus, Caroline, and Craig Boutilier. "The dynamics of reinforcement learning in cooperative multiagent systems." AAAI/IAAI 1998.746-752 (1998): 2.]( https://www.aaai.org/Papers/AAAI/1998/AAAI98-106.pdf)
25 | 26 | **Distributed Q-learning:** 27 | 28 | [(Distributed Q-learning) Lauer, Martin, and Martin Riedmiller. "An algorithm for distributed reinforcement learning in cooperative multi-agent systems." In Proceedings of the Seventeenth International Conference on Machine Learning. 2000.]( https://www.researchgate.net/publication/2641625_An_Algorithm_for_Distributed_Reinforcement_Learning_in_Cooperative_Multi-Agent_Systems)
29 | 30 | **Team Q-learning:** 31 | 32 | [(team Q-learning) Littman, Michael L. "Value-function reinforcement learning in Markov games." Cognitive Systems Research 2.1 (2001): 55-66.]( http://www.sts.rpi.edu/~rsun/si-mal/article3.pdf)
33 | 34 | **FMQ:** 35 | 36 | [(FMQ) Kapetanakis, Spiros, and Daniel Kudenko. "Reinforcement learning of coordination in cooperative multi-agent systems." AAAI/IAAI 2002 (2002): 326-331.]( https://www.aaai.org/Papers/AAAI/2002/AAAI02-050.pdf)
37 | 38 | **OAL:** 39 | 40 | [(OAL) Wang, Xiaofeng, and Tuomas Sandholm. "Reinforcement learning to play an optimal Nash equilibrium in team Markov games." Advances in neural information processing systems. 2003.]( https://papers.nips.cc/paper/2171-reinforcement-learning-to-play-an-optimal-nash-equilibrium-in-team-markov-games.pdf)
41 | 42 | 43 | 44 | [Qi, Dehu, and Ron Sun. "A multi-agent system integrating reinforcement learning, bidding and genetic algorithms." Web Intelligence and Agent Systems: An International Journal 1.3, 4 (2003): 187-202.]( https://pdfs.semanticscholar.org/2cb8/885ea3d8d6bccde87153f18f8be7f23ff935.pdf)
45 | 46 | **TEAM_Q:** 47 | 48 | [(TEAM-Q) Wang, Ying, and Clarence W. De Silva. "Multi-robot box-pushing: Single-agent q-learning vs. team q-learning." 2006 IEEE/RSJ International Conference on Intelligent Robots and Systems. IEEE, 2006.]( https://ieeexplore.ieee.org/document/4058979)
**FMQ:** 49 | [(FMQ) Matignon, Laëtitia, Guillaume Laurent, and Nadine Le Fort-Piat. "A study of FMQ heuristic in cooperative multi-agent games." 2008. ](https://www.researchgate.net/publication/29616600_A_study_of_FMQ_heuristic_in_cooperative_multi-agent_games)
**MADDPG:** 50 | 51 | [(MADDPG) Lowe, Ryan, et al. "Multi-agent actor-critic for mixed cooperative-competitive environments." Advances in Neural Information Processing Systems. 2017.]( https://arxiv.org/abs/1706.02275)
52 | 53 | **MARWIL:** 54 | 55 | **Exponentially Weighted Imitation Learning for Batched Historical Data** [NIPS 2018]( http://papers.nips.cc/paper/7866-exponentially-weighted-imitation-learning-for-batched-historical-data ) 56 | 57 | 58 | 59 | [Foerster, Jakob N., et al. "Counterfactual multi-agent policy gradients." Thirty-Second AAAI Conference on Artificial Intelligence. 2018.]( https://arxiv.org/abs/1705.08926)
60 | 61 | ### Code 62 | 63 | **QMIX by Ray framework**: https://github.com/ray-project/ray/tree/master/rllib/agents/qmix (also VDN) 64 | 65 | **MADDPG by Ray framework**: https://github.com/ray-project/ray/blob/master/rllib/contrib/maddpg/maddpg.py 66 | 67 | **MARWIL by Ray framework**: https://github.com/ray-project/ray/blob/master/rllib/agents/marwil/marwil.py 68 | 69 | 70 | ### Cooperation 71 | [1]. [Kok, Jelle R., and Nikos Vlassis. "Sparse cooperative Q-learning." Proceedings of the twenty-first international conference on Machine learning. ACM, 2004.]( https://icml.cc/Conferences/2004/proceedings/papers/267.pdf)
72 | [2]. [Crandall, Jacob W., and Michael A. Goodrich. "Learning to compete, compromise, and cooperate in repeated general-sum games." Proceedings of the 22nd international conference on Machine learning. ACM, 2005.]( http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.448.8292&rep=rep1&type=pdf)
73 | [3]. [Panait, Liviu, and Sean Luke. "Cooperative multi-agent learning: The state of the art." Autonomous agents and multi-agent systems 11.3 (2005): 387-434.]( https://cs.gmu.edu/~eclab/papers/panait05cooperative.pdf)
74 | [4]. [Kok, Jelle R., and Nikos Vlassis. "Collaborative multiagent reinforcement learning by payoff propagation." Journal of Machine Learning Research 7.Sep (2006): 1789-1828.]( http://www.jmlr.org/papers/volume7/kok06a/kok06a.pdf)
75 | [5]. [De Cote, Enrique Munoz, Alessandro Lazaric, and Marcello Restelli. "Learning to cooperate in multi-agent social dilemmas." AAMAS. Vol. 6. 2006.]( http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.107.335&rep=rep1&type=pdf)
76 | [6]. [Ma, Jie, and Stephen Cameron. "Combining policy search with planning in multi-agent cooperation." Robot Soccer World Cup. Springer, Berlin, Heidelberg, 2008. ](https://www.researchgate.net/publication/220797588_Combining_Policy_Search_with_Planning_in_Multi-agent_Cooperation)
77 | [7]. [Tampuu, Ardi, et al. "Multiagent cooperation and competition with deep reinforcement learning." PloS one 12.4 (2017): e0172395. ](https://arxiv.org/abs/1511.08779)
78 | 79 | ### Coordination 80 | [1]. [Kapetanakis, Spiros, and Daniel Kudenko. "Reinforcement learning of coordination in cooperative multi-agent systems." AAAI/IAAI 2002 (2002): 326-331.]( https://www.aaai.org/Papers/AAAI/2002/AAAI02-050.pdf)
81 | [2]. [Lau, Qiangfeng Peter, Mong Li Lee, and Wynne Hsu. "Coordination guided reinforcement learning." Proceedings of the 11th International Conference on Autonomous Agents and Multiagent Systems-Volume 1. International Foundation for Autonomous Agents and Multiagent Systems, 2012.]( http://www.ifaamas.org/Proceedings/aamas2012/papers/1B_1.pdf)
82 | [3]. [Zhang, Chongjie, and Victor Lesser. "Coordinating multi-agent reinforcement learning with limited communication." Proceedings of the 2013 international conference on Autonomous agents and multi-agent systems. International Foundation for Autonomous Agents and Multiagent Systems, 2013.]( https://pdfs.semanticscholar.org/5e7b/0822821575555e318845531b6d5b2d359b18.pdf)
83 | [4]. [Hao, Jianye, et al. "Reinforcement social learning of coordination in networked cooperative multiagent systems." Workshops at the Twenty-Eighth AAAI Conference on Artificial Intelligence. 2014.]( http://mipc.inf.ed.ac.uk/2014/papers/mipc2014_hao_etal.pdf)
84 | [5]. [Le, Hoang M., et al. "Coordinated multi-agent imitation learning." Proceedings of the 34th International Conference on Machine Learning-Volume 70. JMLR. org, 2017. ](https://arxiv.org/abs/1703.03121)
85 | [6]. [Khadka, Shauharda, Somdeb Majumdar, and Kagan Tumer. "Evolutionary Reinforcement Learning for Sample-Efficient Multiagent Coordination." arXiv preprint arXiv:1906.07315 (2019). ](https://arxiv.org/abs/1906.07315)
86 | 87 | ### Communicate 88 | [1]. [Varshavskaya, Paulina, Leslie Pack Kaelbling, and Daniela Rus. "Efficient distributed reinforcement learning through agreement." Distributed Autonomous Robotic Systems 8. Springer, Berlin, Heidelberg, 2009. 367-378.]( https://www.researchgate.net/publication/241128592_Efficient_Distributed_Reinforcement_Learning_Through_Agreement)
89 | [2]. [Hausknecht, Matthew John. Cooperation and communication in multiagent deep reinforcement learning. Diss. 2016.]( http://www.cs.utexas.edu/~larg/hausknecht_thesis/slides/thesis.pdf)
90 | [3]. [Sukhbaatar, Sainbayar, and Rob Fergus. "Learning multiagent communication with backpropagation." Advances in Neural Information Processing Systems. 2016. ](https://arxiv.org/abs/1605.07736)
91 | [4]. [Foerster, Jakob, et al. "Learning to communicate with deep multi-agent reinforcement learning." Advances in Neural Information Processing Systems. 2016.]( https://arxiv.org/abs/1605.06676)
92 | 93 | ### Application 94 | [1]. [Zheng, Lianmin, et al. "MAgent: A many-agent reinforcement learning platform for artificial collective intelligence." Thirty-Second AAAI Conference on Artificial Intelligence. 2018.](https://arxiv.org/abs/1712.00600)
95 | [2].[Shalev-Shwartz, Shai, Shaked Shammah, and Amnon Shashua. "Safe, multi-agent, reinforcement learning for autonomous driving." arXiv preprint arXiv:1610.03295 (2016).](https://arxiv.org/abs/1610.03295)
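As a runnable point of reference for the independent-learner family catalogued in the Algorithm section above (Distributed Q-learning, FMQ and related work on cooperative matrix games), below is a minimal sketch of two independent Q-learners on a toy 2x2 cooperative matrix game. The payoff matrix, learning rate and exploration rate are arbitrary illustrative choices and are not taken from any of the cited papers.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 2x2 cooperative matrix game: both agents receive the same payoff.
payoff = np.array([[10.0, 0.0],
                   [0.0,  5.0]])

n_actions = 2
q1 = np.zeros(n_actions)  # agent 1 keeps its own Q-table (no joint actions)
q2 = np.zeros(n_actions)  # agent 2 likewise
alpha, eps = 0.1, 0.2     # learning rate and epsilon-greedy exploration


def act(q):
    """Epsilon-greedy selection over an agent's own action values."""
    if rng.random() < eps:
        return int(rng.integers(n_actions))
    return int(np.argmax(q))


for step in range(5000):
    a1, a2 = act(q1), act(q2)
    r = payoff[a1, a2]                 # shared team reward
    # Each agent updates as if it were alone in a stationary bandit:
    q1[a1] += alpha * (r - q1[a1])
    q2[a2] += alpha * (r - q2[a2])

print("agent 1 Q:", q1.round(2))
print("agent 2 Q:", q2.round(2))
print("greedy joint action:", (int(np.argmax(q1)), int(np.argmax(q2))))
```

Because each learner treats its teammate as part of the environment, this naive scheme can mis-coordinate on games with deceptive or penalised joint actions, which is exactly the failure mode that Distributed Q-learning, FMQ and the later deep methods listed above are designed to address.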
96 | -------------------------------------------------------------------------------- /DRL-News/.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NeuronDance/DeepRL/8342ea71be1dffe26581d964f8011bcb973f85d6/DRL-News/.DS_Store -------------------------------------------------------------------------------- /DRL-News/README.md: -------------------------------------------------------------------------------- 1 | 深度强化学习以及最近科技大事件实时展示![时间线从近到远] 2 | 3 | **最近即将发生的【最新追踪中-----敬请期待】** 4 | 5 | --- 6 | ### **2019.8.28** 7 | **DeepMind开源OpenSpiel平台**(https://github.com/deepmind/open_spiel) 8 | ![](assets/markdown-img-paste-2019082922530298.png) 9 | 10 | 什么是 OpenSpiel 11 | OpenSpiel 是一个综合性的强化学习游戏测试平台,包括了多种游戏环境和算法,用于强化学习研究或搜索策略的研究。可以帮助研究者解决很多强化学习研究中需要设置实验的问题,它支持: 12 | + 单人或多人博弈; 13 | + 完美信息或不完美信息博弈; 14 | + 带有随机性的博弈; 15 | + 普通的多玩家「一步」或二人玩家的多步博弈; 16 | + 交替行动(如下棋)或同时行动的游戏; 17 | + 零和博弈和非零和博弈(如需要合作的博弈等)。 18 | 19 | 目前,OpenSpiel平台支持多种编程语言: 20 | + C++11 21 | + Python 3 22 | 23 | 目前 OpenSpiel 已经在 Linux 系统上进行了测试(Debian 10 和 Ubuntu 19.04),但是没有在 MacOS 或 Windows 上测试过。但是因为后两个平台都可以自由使用代码,因此作者认为不太可能出现大的问题。 24 | 25 | **OpenSpiel 目前支持以下游戏,共 25 款,包括国际象棋、围棋、双陆棋、翻转棋等游戏**: 26 | 27 | 28 | 29 | OpenSpiel 怎么用 30 | 首先,我们先要明确,在 OpenSpiel 中 Game 对象包含了对某个游戏非常高层次的描述,例如游戏的方式、参与人数、最大分数等。而 State 对象描述了更加具体的游戏局,例如象棋中特定的棋子状态、扑克中特定的手牌组合。通过这两个对象,整个游戏都是通过树来表示的。 31 | 32 | OpenSpiel 首先需要加载游戏,配置游戏进行方式,然后就可以直接运行了。如下所示为在 Kuhn poker 中走完一条轨迹(trajectory)的 Python 代码: 33 | ```python 34 | import numpy as np 35 | import pyspiel 36 | 37 | game = pyspiel.load_game("kuhn_poker") 38 | state = game.new_initial_state() 39 | while not state.is_terminal(): 40 |     legal_actions = state.legal_actions() 41 |     if state.is_chance_node(): 42 |         # Sample a chance event outcome. 43 |         outcomes_with_probs = state.chance_outcomes() 44 |         action_list, prob_list = zip(*outcomes_with_probs) 45 |         action = np.random.choice(action_list, p=prob_list) 46 |     else: 47 |         action = legal_actions[0]  # an agent would choose from its observation / information state here 48 |     state.apply_action(action) 49 | ``` 50 | 51 | 52 | --- 53 | ### **2019.2.22** 54 | ![](assets/markdown-img-paste-20190223084737963.png) 55 | 56 | 英国的AI公司DeepMind开源了机器人足球模拟环境MuJoCo Soccer,实现了对2v2足球赛的模拟。虽然球员的样子比较简单(也是个球),但DeepMind让它们在强化学习中找到了团队精神。热爱足球游戏的网友仿佛嗅到了它前景:你们应该去找EA合作FIFA游戏!
57 | 58 | 让AI学会与队友配合 59 | 与AlphaGo类似,DeepMind也训练了许多“Player”。DeepMind从中选择10个双人足球团队,它们分别由不同训练计划制作而成的。 60 | 61 | 这10个团队每个都有250亿帧的学习经验,DeepMind收集了它们之间的100万场比赛。让我们分别从俯瞰视角来看一下其中一场2V2的足球比赛吧: 62 | ![](assets/markdown-img-paste-20190223084959410.png) 63 | 64 | 65 | DeepMind发现,随着学习量的增加,“球员”逐渐从“独行侠”变成了有团队协作精神的个体。一开始蓝色0号队员总是自己带球,无论队友的站位如何。在经历800亿画面的训练后,它已经学会积极寻找传球配合的机会,这种配合还会受到队友站位的影响。其中一场比赛中,我们甚至能看到到队友之间两次连续的传球,也就是在人类足球比赛中经常出现的2过1传球配合。 66 | 67 | 球队相生相克 68 | 除了个体技能外,DeepMind的实验结果还得到了足球世界中的战术相克。实验中选出的10个智能体中,B是最强的,Elo评分为1084.27;其次是C,Elo评分为1068.85;A的评分1016.48在其中仅排第五。 69 | ![](assets/markdown-img-paste-20190223085029541.png) 70 | 71 | 为何选择足球游戏 72 | 去年DeepMind开源了强化学习套件DeepMind Control Suite,让它模拟机器人、机械臂,实现对物理世界的操控。而足球是一个很好的训练多智能体的强化学习环境,比如传球、拦截、进球都可以作为奖励机制。同时对足球世界的模拟也需要物理引擎的帮助。DeepMind希望研究人员通过在这种多智能体环境中进行模拟物理实验, 在团队合作游戏领域内取得进一步进展。于是他们很自然地把2v2足球比赛引入了DeepMind Control Suite,让智能体的行为从自发随机到简单的追球,最后学会与队友之间进行团队配合。 73 | 74 | **DIY试玩方法** 75 | 现在你也可以自己去模拟这个足球游戏。首先安装MuJoCo Pro 2.00和dm_control,还需要在运行程序中导入soccer文件,然后就可以开始尝试了。 76 | 77 | ```python 78 | 79 | from dm_control.locomotion import soccer as dm_soccer 80 | 81 | # Load the 2-vs-2 soccer environment with episodes of 10 seconds: 82 | env = dm_soccer.load(team_size=2, time_limit=10.) 83 | 84 | # Retrieves action_specs for all 4 players. 85 | action_specs = env.action_spec() 86 | 87 | # Step through the environment for one episode with random actions. 88 | time_step = env.reset() 89 | while not time_step.last(): 90 | actions = [] 91 | for action_spec in action_specs: 92 | action = np.random.uniform( 93 | action_spec.minimum, action_spec.maximum, size=action_spec.shape) 94 | actions.append(action) 95 | time_step = env.step(actions) 96 | 97 | for i in range(len(action_specs)): 98 | print( 99 | "Player {}: reward = {}, discount = {}, observations = {}.".format( 100 | i, time_step.reward[i], time_step.discount, 101 | time_step.observation[i])) 102 | ``` 103 | 在运行代码中,你还可以修改队伍人数和游戏时长,如果改成11v11、90分钟,就变成了一场FIFA模拟赛(滑稽)。 104 | 105 | [Github源码链接](https://github.com/deepmind/dm_control/tree/master/dm_control/locomotion/soccer) 106 | 107 | --- 108 | 109 | ### 2019.1.21 110 | 111 | DeepMind联合牛津提出注意力神经过程
112 | 1月21日消息,来自DeepMind和牛津大学的研究者认为,神经过程(NP)存在着一个根本的不足——欠拟合,对其所依据的观测数据的输入给出了不准确的预测。他们通过将注意力纳入NP来解决这个问题,允许每个输入位置关注预测的相关上下文点。研究表明,这大大提高了预测的准确性,显著加快了训练速度,并扩大了可以建模的函数范围。 113 | 114 | ---- 115 | -------------------------------------------------------------------------------- /DRL-OpenSource/.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NeuronDance/DeepRL/8342ea71be1dffe26581d964f8011bcb973f85d6/DRL-OpenSource/.DS_Store -------------------------------------------------------------------------------- /DRL-OpenSource/Baidu-PARL/README.md: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NeuronDance/DeepRL/8342ea71be1dffe26581d964f8011bcb973f85d6/DRL-OpenSource/Baidu-PARL/README.md -------------------------------------------------------------------------------- /DRL-OpenSource/Google-Dopamine(多巴胺)/README.md: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NeuronDance/DeepRL/8342ea71be1dffe26581d964f8011bcb973f85d6/DRL-OpenSource/Google-Dopamine(多巴胺)/README.md -------------------------------------------------------------------------------- /DRL-OpenSource/Google-TensorForce/README.md: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NeuronDance/DeepRL/8342ea71be1dffe26581d964f8011bcb973f85d6/DRL-OpenSource/Google-TensorForce/README.md -------------------------------------------------------------------------------- /DRL-OpenSource/Intel-Coach/README.md: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NeuronDance/DeepRL/8342ea71be1dffe26581d964f8011bcb973f85d6/DRL-OpenSource/Intel-Coach/README.md -------------------------------------------------------------------------------- /DRL-OpenSource/OpenAI-baselines/README.md: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NeuronDance/DeepRL/8342ea71be1dffe26581d964f8011bcb973f85d6/DRL-OpenSource/OpenAI-baselines/README.md -------------------------------------------------------------------------------- /DRL-OpenSource/README.md: -------------------------------------------------------------------------------- 1 | ## OpenSource Framework 2 | 3 | This README displays some Deep Reinforcement Learning Framework,including "Baidu-PARL","Google-Dopamine","Google-TensorForce","Intel-Coach","OpenAI-baselines",“Ray” etc. 
4 | 5 | 6 | 7 | 8 | ### Contributors 9 | 10 | @[Skylark0924](https://github.com/Skylark0924) 11 | -------------------------------------------------------------------------------- /DRL-OpenSource/RLlab/README.md: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NeuronDance/DeepRL/8342ea71be1dffe26581d964f8011bcb973f85d6/DRL-OpenSource/RLlab/README.md -------------------------------------------------------------------------------- /DRL-OpenSource/Ray/README.md: -------------------------------------------------------------------------------- 1 | # Ray 2 | 3 | Ray 是由UC Berkeley 推出的高性能分布式框架,其中包括如下三种用于机器学习的加速库: 4 | 5 | - Tune:可扩展的超参数调整库([文档](https://ray.readthedocs.io/en/latest/tune.html)) 6 | - RLlib:可扩展的强化学习库([文档](https://ray.readthedocs.io/en/latest/rllib.html)) 7 | - [Distributed Training](https://ray.readthedocs.io/en/latest/distributed_training.html) 8 | 9 | Ray文档见[链接](https://ray.readthedocs.io/en/latest/) 10 | 11 | ## Ray 安装 12 | 13 | ### 最新版安装 14 | 15 | ``` 16 | pip install -U ray # also recommended: ray[debug] 17 | ``` 18 | 19 | ### 源码安装 20 | 21 | #### 依赖库 22 | 23 | **Ubuntu:** 24 | 25 | ``` 26 | sudo apt-get update 27 | sudo apt-get install -y build-essential curl unzip psmisc 28 | 29 | # If you are not using Anaconda, you need the following. 30 | sudo apt-get install python-dev # For Python 2. 31 | sudo apt-get install python3-dev # For Python 3. 32 | 33 | pip install cython==0.29.0 34 | ``` 35 | 36 | **MacOS:** 37 | 38 | ``` 39 | brew update 40 | brew install wget 41 | 42 | pip install cython==0.29.0 43 | ``` 44 | 45 | **Anaconda:** 46 | 47 | ``` 48 | conda install libgcc 49 | ``` 50 | 51 | #### Ray 安装 52 | 53 | ``` 54 | git clone https://github.com/ray-project/ray.git 55 | 56 | # Install Bazel. 57 | ray/ci/travis/install-bazel.sh 58 | 59 | # Optionally build the dashboard (requires Node.js, see below for more information). 60 | pushd ray/python/ray/dashboard/client 61 | npm ci 62 | npm run build 63 | popd 64 | 65 | # Install Ray. 66 | cd ray/python 67 | pip install -e . --verbose # Add --user if you see a permission denied error. 68 | ``` 69 | 70 | ## RLlib 71 | 72 | ### Algorithms 73 | 74 | 先介绍一下RLlib,RLlib中提供了几乎所有state-of-the-art 的DRL算法的实现,包括tensorflow和pytorch 75 | 76 | - High-throughput architectures: Ape-X | IMPALA | APPO 77 | - Gradient-based: (A2C, A3C) | (DDPG, TD3) |(DQN, Rainbow, Parametric DQN) | Policy Gradients | PPO | SAC 78 | - Derivative-free: ARS | Evolution Strategies | (QMIX, VDN, IQN) | MADDPG | MARWIL 79 | 80 | RLlib Algorithms 详见 [链接](https://ray.readthedocs.io/en/latest/rllib-algorithms.html#) 81 | 82 | ### 使用方式 83 | 84 | 1. 直接使用 ALG_NAME.Trainer,以PPO举例: 85 | 86 | ```python 87 | import ray 88 | import ray.rllib.agents.ppo as ppo 89 | from ray.tune.logger import pretty_print 90 | 91 | ray.init() 92 | config = ppo.DEFAULT_CONFIG.copy() 93 | config["num_gpus"] = 0 94 | config["num_workers"] = 1 95 | config["eager"] = False 96 | trainer = ppo.PPOTrainer(config=config, env="CartPole-v0") 97 | 98 | # Can optionally call trainer.restore(path) to load a checkpoint. 99 | 100 | for i in range(1000): 101 | # Perform one iteration of training the policy with PPO 102 | result = trainer.train() 103 | print(pretty_print(result)) 104 | 105 | if i % 100 == 0: 106 | checkpoint = trainer.save() 107 | print("checkpoint saved at", checkpoint) 108 | ``` 109 | 110 | 2. 
使用 **Tune** (官方推荐), 同样是PPO: 111 | 112 | ```python 113 | import ray 114 | from ray import tune 115 | 116 | ray.init() 117 | tune.run( 118 | "PPO", 119 | stop={"episode_reward_mean": 200}, 120 | config={ 121 | "env": "CartPole-v0", 122 | "num_gpus": 0, 123 | "num_workers": 1, 124 | "lr": tune.grid_search([0.01, 0.001, 0.0001]), 125 | "eager": False, 126 | }, 127 | ) 128 | ``` 129 | 130 | ### Tensorboard支持 131 | 132 | Ray 会**自动**在根目录创建一个叫做ray_results的文件夹并保存checkpoint,直接在终端中运行: 133 | 134 | ``` 135 | tensorboard --logdir ~/ray_results 136 | ``` 137 | 138 | 即可打开tensorboard查看结果 139 | 140 | ### 可扩展性 141 | 142 | RLlib提供了自定义培训的几乎所有方面的方法,包括环境,神经网络模型,动作分布和策略定义: 143 | 144 | ![_images/rllib-components.svg](D:\Github\DeepRL\DRL-OpenSource\Ray\README.assets\rllib-components.svg) 145 | 146 | ## Tune 147 | 148 | TODO 149 | 150 | -------------------------------------------------------------------------------- /DRL-OpenSource/TensorLayer/tensorlayer.md: -------------------------------------------------------------------------------- 1 | ## TensorLayer 2 | 3 | Documentation Version: 2.1.1 4 | 5 | Jun 2019 Deep Reinforcement Learning Model ZOO Release !!. 6 | 7 | Good News: We won the Best Open Source Software Award @ACM Multimedia (MM) 2017. 8 | 9 | TensorLayer is a Deep Learning (DL) and Reinforcement Learning (RL) library extended from Google TensorFlow. It provides popular DL and RL modules that can be easily customized and assembled for tackling real-world machine learning problems. More details can be found here. 10 | -------------------------------------------------------------------------------- /DRL-PaperDaily/.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NeuronDance/DeepRL/8342ea71be1dffe26581d964f8011bcb973f85d6/DRL-PaperDaily/.DS_Store -------------------------------------------------------------------------------- /DRL-PaperDaily/README.md: -------------------------------------------------------------------------------- 1 | ### Deep Reinforcement Learning Paper Daily 2 | 3 | 4 | > This document used to display the latest papers about Deep Reinforcement Learning, 5 | 6 | ### Continuous updating...... 7 | 8 | Issue# 15:2020-2-20 9 | ---- 10 | 1. [Locally Private Distributed Reinforcement Learning](https://arxiv.org/abs/2001.11718) by Hajime Ono, Tsubasa Takahashi 11 | 2. [Effective Diversity in Population-Based Reinforcement Learning](https://arxiv.org/abs/2002.00632) by Jack Parker-Holder, Stephen Roberts 12 | 3. [Deep Reinforcement Learning for Autonomous Driving: A Survey](https://arxiv.org/abs/2002.00444) by B Ravi Kiran, Patrick Pérez 13 | 4. [Attractive or Faithful? Popularity-Reinforced Learning for Inspired Headline Generation](https://arxiv.org/abs/2002.02095) by Yun-Zhu Song, AAAI 2020 14 | 5. [Asymptotically Efficient Off-Policy Evaluation for Tabular Reinforcement Learning](https://arxiv.org/abs/2001.10742) by Ming Yin, Yu-Xiang Wang (Includes appendix. Accepted for AISTATS 2020) 15 | 16 | 17 | Issue# 14:2020-2-10 18 | ---- 19 | 1. [Model-based Multi-Agent Reinforcement Learning with Cooperative Prioritized Sweeping](https://arxiv.org/abs/2001.07527) by Eugenio Bargiacchi, Ann Nowé 20 | 2. [Reinforcement Learning with Probabilistically Complete Exploration](https://arxiv.org/abs/2001.06940) by Philippe Morere, Fabio Ramos 21 | 3. [Algorithms in Multi-Agent Systems: A Holistic Perspective from Reinforcement Learning and Game Theory](https://arxiv.org/abs/2001.06487) by Yunlong Lu, Kai Yan 22 | 4. 
[Local Policy Optimization for Trajectory-Centric Reinforcement Learning](https://arxiv.org/abs/2001.08092) by Patrik Kolaric, Daniel Nikovski 23 | 5. [On Simple Reactive Neural Networks for Behaviour-Based Reinforcement Learning](https://arxiv.org/abs/2001.07973) by Ameya Pore, Gerardo Aragon-Camarasa 24 | 6. [Graph Constrained Reinforcement Learning for Natural Language Action Spaces](https://arxiv.org/abs/2001.08837) by Prithviraj Ammanabrolu, Matthew Hausknecht(Accepted to ICLR 2020) 25 | 7. [Challenges and Countermeasures for Adversarial Attacks on Deep Reinforcement Learning](https://arxiv.org/abs/2001.09684) by Inaam Ilahi, Dusit Niyato 26 | 8. [Active Task-Inference-Guided Deep Inverse Reinforcement Learning](https://arxiv.org/abs/2001.09227) by Farzan Memarian, Ufuk Topcu 27 | 28 | 29 | Issue# 13:2020-1-20 30 | ---- 31 | 1. [Direct and indirect reinforcement learning](https://arxiv.org/abs/1912.10600) by Yang Guan, Bo Cheng 32 | 2. [Parameterized Indexed Value Function for Efficient Exploration in Reinforcement Learning](https://arxiv.org/abs/1912.10577) by Tian Tan, Vikranth R. Dwaracherla 33 | 3. [Continuous-Discrete Reinforcement Learning for Hybrid Control in Robotics](https://arxiv.org/abs/2001.00449) by Michael Neunert, Martin Riedmiller, Presented at the 3rd Conference on Robot Learning (CoRL 2019) 34 | 4. [Meta Reinforcement Learning with Autonomous Inference of Subtask Dependencies](https://arxiv.org/abs/2001.00248) by Sungryull Sohn, Honglak Lee, ICLR2020 35 | 5. [Long-Term Visitation Value for Deep Exploration in Sparse Reward Reinforcement Learning](https://arxiv.org/abs/2001.00119) by Simone Parisi, Joni Pajarinen 36 | 6. [Optimal Options for Multi-Task Reinforcement Learning Under Time Constraints](https://arxiv.org/abs/2001.01620) by Manuel Del Verme, Gianluca Baldassarre 37 | 7. [MushroomRL: Simplifying Reinforcement Learning Research](https://arxiv.org/abs/2001.01102) by Carlo D'Eramo, Jan Peters 38 | 39 | Issue# 12:2020-1-10 40 | ---- 41 | 42 | 1. [Predictive Coding for Boosting Deep Reinforcement Learning with Sparse Rewards](https://arxiv.org/abs/1912.13414) by Xingyu Lu, Pieter Abbeel 43 | 2. Interestingness Elements for Explainable Reinforcement Learning: Understanding Agents' Capabilities and Limitations](https://arxiv.org/abs/1912.09007) byPedro Sequeira, Melinda Gervasio 44 | 3. [Reward-Conditioned Policies](https://arxiv.org/abs/1912.13465) by Aviral Kumar, Sergey Levine 45 | 4. [Pseudo Random Number Generation: a Reinforcement Learning approach](https://arxiv.org/abs/1912.11531) by Luca Pasqualini, Maurizio Parton 46 | 5. [Distributed Reinforcement Learning for Decentralized Linear Quadratic Control: A Derivative-Free Policy Optimization Approach](https://arxiv.org/abs/1912.09135) by Yingying Li, Na Li 47 | 6. [Deep Reinforcement Learning for Motion Planning of Mobile Robots](https://arxiv.org/abs/1912.09260) by Leonid Butyrev, Christopher Mutschler 48 | 49 | Issue# 11:2019-12-19 50 | ---- 51 | later updating...... 52 | 53 | Issue# 10:2019-12-13 54 | ---- 55 | 1. [On-policy Reinforcement Learning with Entropy Regularization](https://arxiv.org/abs/1912.01557) by Jingbin Liu, Shuai Liu 56 | 2. [Human-Robot Collaboration via Deep Reinforcement Learning of Real-World Interactions](https://arxiv.org/abs/1912.01715) by Jonas Tjomsland, A. Aldo Faisal, **NeurIPS'19** Workshop on Robot Learning: Control and Interaction in the Real World 57 | 3. 
[Iterative Policy-Space Expansion in Reinforcement Learning](https://arxiv.org/abs/1912.02532) by Jan Malte Lichtenberg, Özgür Şimşek, **NeurIPS 2019** 58 | 4. [Deep Model Compression via Deep Reinforcement Learning](https://arxiv.org/abs/1912.02254) by Huixin Zhan, Yongcan Cao 59 | 5. [Observational Overfitting in Reinforcement Learning](https://arxiv.org/abs/1912.02975) by Xingyou Song, Behnam Neyshabur 60 | 6. [Reinforcement Learning Upside Down: Don't Predict Rewards -- Just Map Them to Actions](https://arxiv.org/abs/1912.02875) by Juergen Schmidhuber 61 | 7. [Learning Sparse Representations Incrementally in Deep Reinforcement Learning](https://arxiv.org/abs/1912.04002) by Fernando Hernandez-Garcia, **Richard S. Sutton** 62 | 89. [Entropy Regularization with Discounted Future State Distribution in Policy Gradient Methods](https://arxiv.org/abs/1912.05104) by Riashat Islam, Doina Precup, **NeurIPS2019** Optimization Foundations of Reinforcement Learning Workshop 63 | 64 | 65 | 66 | Issue# 9:2019-12-3 67 | ---- 68 | 1. [Learning Representations in Reinforcement Learning:An Information Bottleneck Approach](https://arxiv.org/abs/1911.05695) by Pei Yingjun, Hou Xinwen 69 | 2. [Learning to Communicate in Multi-Agent Reinforcement Learning : A Review()](https://arxiv.org/abs/1911.05438) by Mohamed Salah Zaïem, Etienne Bennequin 70 | 3. [Accelerating Training in Pommerman with Imitation and Reinforcement Learning](https://arxiv.org/abs/1911.04947) by Hardik Meisheri, Harshad Khadilkar, Presented at Deep Reinforcement Learning workshop, NeurIPS-2019(★★) 71 | 72 | 73 | 74 | Issue# 8:2019-11-18 75 | ---- 76 | 1. [Real-Time Reinforcement Learning](https://arxiv.org/abs/1911.04448) by Simon Ramstedt, Christopher Pal 77 | 2. [Provably Convergent Off-Policy Actor-Critic with Function Approximation](https://arxiv.org/abs/1911.04384) by Shangtong Zhang, Shimon Whiteson, Optimization Foundations of Reinforcement Learning Workshop at NeurIPS 2019 78 | 3. [Driving Reinforcement Learning with Models](https://arxiv.org/abs/1911.04400) by Pietro Ferraro, Giovanni Russo 79 | 4. [Non-Cooperative Inverse Reinforcement Learning](https://arxiv.org/abs/1911.04220) by Xiangyuan Zhang, Tamer Başar 80 | 5. [Learning to reinforcement learn for Neural Architecture Search](https://arxiv.org/abs/1911.03769) by J. Gomez Robles, J. Vanschoren 81 | 82 | 83 | 84 | Issue# 7:2019-11-15 85 | ---- 86 | 1. [Fully Bayesian Recurrent Neural Networks for Safe Reinforcement Learning](https://arxiv.org/abs/1911.03308) by Matt Benatan, Edward O. Pyzer-Knapp 87 | 2. [Model-free Reinforcement Learning with Robust Stability Guarantee](https://arxiv.org/abs/1911.02875) by Minghao Han, Wei Pan, NeurIPS 2019 Workshop on Robot Learning: Control and Interaction in the Real World, Vancouver, Canada 88 | 3. [Option Compatible Reward Inverse Reinforcement Learning](https://arxiv.org/abs/1911.02723) by Rakhoon Hwang, Hyung Ju Hwang 89 | 4. [Deep Reinforcement Learning for Distributed Uncoordinated Cognitive Radios Resource Allocation](https://arxiv.org/abs/1911.03366) by Ankita Tondwalkar, Andres Kwasinski, submitted in the IEEE ICC 2020 Conference 90 | 5. [Experienced Deep Reinforcement Learning with Generative Adversarial Networks (GANs) for Model-Free Ultra Reliable Low Latency Communication](https://arxiv.org/abs/1911.03264) by Ali Taleb Zadeh Kasgari, H. Vincent Poor 91 | 6. 
[Mapless Navigation among Dynamics with Social-safety-awareness: a reinforcement learning approach from 2D laser scans](https://arxiv.org/abs/1911.03074) by Jun Jin, Martin Jagersand 92 | 93 | 94 | Issue# 6:2019-11-8 95 | ---- 96 | 1. [Gym-Ignition: Reproducible Robotic Simulations for Reinforcement Learning](https://arxiv.org/abs/1911.01715) by Diego Ferigo, Daniele Pucci, Accepted in SII2020 97 | 2. [DeepRacer: Educational Autonomous Racing Platform for Experimentation with Sim2Real Reinforcement Learning](https://arxiv.org/abs/1911.01562) by Bharathan Balaji, Dhanasekar Karuppasamy 98 | 99 | 100 | Issue# 5:2019-11-7 101 | ---- 102 | 1. [Online Robustness Training for Deep Reinforcement Learning](https://arxiv.org/pdf/1911.00887.pdf) by Marc Fischer, Martin Vechev, 2019-11-3 103 | 2. [Maximum Entropy Diverse Exploration: Disentangling Maximum Entropy Reinforcement Learning](https://arxiv.org/pdf/1911.00828.pdf) by Andrew Cohen, Xiangrong Tong, 2019-11 104 | 3. [Gradient-based Adaptive Markov Chain Monte Carlo](https://arxiv.org/pdf/1911.01373.pdf) by Michalis K. Titsias, Petros Dellaportas, NeurIPS 2019 105 | 4. [Keeping Your Distance: Solving Sparse Reward Tasks Using Self-Balancing Shaped Rewards](https://arxiv.org/pdf/1911.01417.pdf) by Alexander Trott, Richard Socher, NeurIPS 2019 106 | 5. [Problem Dependent Reinforcement Learning Bounds Which Can Identify Bandit Structure in MDPs](https://arxiv.org/pdf/1911.00954.pdf) by Andrea Zanette, Emma Brunskill 107 | 108 | 109 | Issue# 4:2019-11-5 110 | ---- 111 | 1. [Cascaded LSTMs based Deep Reinforcement Learning for Goal-driven Dialogue](https://arxiv.org/abs/1910.14229) by Yue Ma, Hong Chen, NLPCC 2017 112 | 2. [Dynamic Cloth Manipulation with Deep Reinforcement Learning](https://arxiv.org/abs/1910.14475) by Rishabh Jangir, Carme Torras, ICRA'2020 113 | 3. [Meta-Learning to Cluster](https://arxiv.org/abs/1910.14134) by Yibo Jiang, Nakul Verma 114 | 115 | 116 | Issue# 3:2019-11-4 117 | ---- 118 | 1. [Robust Model-free Reinforcement Learning with Multi-objective Bayesian Optimization](https://arxiv.org/abs/1910.13399) by Matteo Turchetta, Sebastian Trimpe, ICRA2020 119 | 2. [Certified Adversarial Robustness for Deep Reinforcement Learning](https://arxiv.org/abs/1910.12908) by Björn Lütjens, Jonathan P, CORL2019 120 | 3. [Generalization of Reinforcement Learners with Working and Episodic Memory](https://arxiv.org/abs/1910.13406) by Meire Fortunato, Charles Blundell, NeurIPS 2019 121 | 4. [Constrained Reinforcement Learning Has Zero Duality Gap](https://arxiv.org/abs/1910.13393) by Santiago Paternain, Alejandro Ribeiro 122 | 5. [Feedback Linearization for Unknown Systems via Reinforcement Learning](https://arxiv.org/abs/1910.13272) by Tyler Westenbroek, Claire J. Tomlin 123 | 6. [Multiplayer AlphaZero](https://arxiv.org/abs/1910.13012) by Nick Petosa, Tucker Balch 124 | 125 | 126 | 127 | 128 | Issue# 2:2019-11-3 129 | ---- 130 | 1. [Learning Fairness in Multi-Agent Systems](https://arxiv.org/abs/1910.14472) by Jiechuan Jiang, Zongqing Lu, NeurIPS2019 131 | 2. [VASE: Variational Assorted Surprise Exploration for Reinforcement Learning](https://arxiv.org/abs/1910.14351) by Haitao Xu, Lech Szymanski. 132 | 3. [RLINK: Deep Reinforcement Learning for User Identity Linkage](https://arxiv.org/abs/1910.14273) by Xiaoxue Li, Jianlong Tan 133 | 134 | 135 | 136 | Issue# 1:2019-11-2 137 | ---- 138 | 1. 
[Distributed Model-Free Algorithm for Multi-hop Ride-sharing using Deep Reinforcement Learning](https://arxiv.org/abs/1910.14002) by Ashutosh Singh, Abubakr Alabbasi, and Vaneet Aggarwal, 2019-10-30 139 | 2. [DADI: Dynamic Discovery of Fair Information with Adversarial Reinforcement Learning](https://arxiv.org/abs/1910.13983) by Michiel A. Bakker, Duy Patrick Tu,2019-10-30 140 | 3. [Robust Distributed Accelerated Stochastic Gradient Methods for Multi-Agent Networks](https://arxiv.org/abs/1910.08701) by Alireza Falla, 2019-10-30 141 | 4. [Automatic Testing and Falsification with Dynamically Constrained Reinforcement Learning](https://arxiv.org/abs/1910.13645) by Xin Qin, Nikos Aréchiga, Andrew Best, Jyotirmoy Deshmukh, 2019-10-30 142 | 5. [RBED: Reward Based Epsilon Decay](https://arxiv.org/abs/1910.13701) by Aakash Maroti, 2019-10-30 -------------------------------------------------------------------------------- /DRL-PaperReadCodingPlan/.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NeuronDance/DeepRL/8342ea71be1dffe26581d964f8011bcb973f85d6/DRL-PaperReadCodingPlan/.DS_Store -------------------------------------------------------------------------------- /DRL-PaperReadCodingPlan/README.md: -------------------------------------------------------------------------------- 1 | ## 深度强化学习《论文阅读复现计划》 2 | 3 | 4 | 本栏目主要致力于阅读深度强化学习相关论文并复现,诚挚欢迎并邀请各位参与本项目。 5 | 6 | ### 方法 7 | + 可提供论文title、链接至对应时间周期内,并注明**开始**和**结束**时间以及相关状态。 8 | 9 | + 解读标准:尽可能全面,正确(解读模版制作中,待公布)。 10 | + 贡献笔记、作者博客将获得GitHub、DRL-Lab公众等推送机会。 11 | 12 | 13 | ### Plan List 14 | 15 | | Number | Title | 涉及领域 |发表时间|发表机构|难度级别| 学习状态|贡献作者|论文笔记 | 16 | | ------- | ---------- | ----------| ----------| ---------- | ---------- | ----------- |-----------|----------- | 17 | | **No.1**
2020-3-1|[Hindsight Experience Replay](https://arxiv.org/pdf/1707.01495.pdf) | 经验池机制 |2018-2 |OpenAI | ★★★ |Y|zhangsan, Tom Wang|@[NeuronDance](https://github.com/neurondance),@[Keavnn](https://github.com/StepNeverStop)|[Note1](https://github.com/NeuronDance/DeepRL/tree/master/DRL-PaperWeekly/Detail/1_Hindsight-Experience%20-Replay) 18 | | **No.2**
2020/3/*| | | | | | | | 19 | -------------------------------------------------------------------------------- /DRL-TopicResearch/.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NeuronDance/DeepRL/8342ea71be1dffe26581d964f8011bcb973f85d6/DRL-TopicResearch/.DS_Store -------------------------------------------------------------------------------- /DRL-TopicResearch/奖励函数研究/.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NeuronDance/DeepRL/8342ea71be1dffe26581d964f8011bcb973f85d6/DRL-TopicResearch/奖励函数研究/.DS_Store -------------------------------------------------------------------------------- /DRL-TopicResearch/奖励函数研究/README.md: -------------------------------------------------------------------------------- 1 | # Gym环境奖励函数研究 2 | 3 | 本项目为开源公益项目,致力于研究强化学习中的奖励函数工程,研究主题为Gym环境中例子。 4 | 5 | ### 环境简介 6 | OpenAI Gym是一款用于研发和比较强化学习算法的工具包,它支持训练智能体(agent)做任何事,是一个用于开发和比较RL 算法的工具包,与其他的数值计算库兼容,如tensorflow 或者theano 库。现在主要支持的是python 语言,以后将支持其他语言。官方提供的gym文档。 7 | 8 | OpenAI Gym包含两部分: 9 | 10 | > gym 开源 包含一个测试问题集,每个问题成为环境(environment),可以用于自己的强化学习算法开发,这些环境有共享的接口,允许用户设计通用的算法,例如:Atari、CartPole等。 11 | 12 | > OpenAI Gym 服务 13 | 提供一个站点和api ,允许用户对自己训练的算法进行性能比较。 14 | 15 | Gym环境整体分为一下: 16 | + Classic control and toy text: 17 | 提供了一些RL相关论文中的一些小问题,开始学习Gym从这开始! 18 | + Algorithmic: 19 | 提供了学习算法的环境,比如翻转序列这样的问题,虽然能很容易用直接编程实现,但是单纯用例子来训练RL模型有难度的。这些问题有一个很好的特性: 能够通过改变序列长度改变难度。 20 | + Atari: 21 | 这里提供了一些小游戏,比如我们小时候玩过的小蜜蜂,弹珠等等。这些问题对RL研究有着很大影响! 22 | + Board games: 23 | 提供了Go这样一个简单的下棋游戏,由于这个问题是多人游戏,Gym提供有opponent与你训练的agent进行对抗。 24 | + 2D and 3D robots: 25 | 机器人控制环境。 这些问题用 MuJoCo 作为物理引擎。 26 | 27 | ### 研究环境 28 | 29 | Gym 环境是深度强化学习发展的重要基础,相关实例包括一下环境,更多请查阅Gym官网 30 | 1. Acrobot-v1 31 | 2. CartPole-v1 32 | 3. Pendulum-v0 33 | 4. MountainCar-v0 34 | 5. MountainCarContinous-v0 35 | 6. Pong-v0 36 | 37 | 7. Ant-v0 38 | 8. Halfcheetah-v2 39 | 9. Hopper-v2 40 | 10. InvertedDoublePendulum-v2 41 | 11. InvertedPendulum-v2 42 | 12. Reacher-v2 43 | 13. Swimmer-v2 44 | 14. Walker2d-v2 45 | 46 | 15. FetchPickAndPlace-v0 47 | 16. FetchPush-v0 48 | 17. FechReach-v0 49 | 18. FetchSlide-v0 50 | 51 | 52 | ### 初期研究流程和步骤 53 | 整体学习以文档形式展现,格式如下: 54 | 55 | (1)克隆仓库,并在environment目录下创建对应环境的文件,例如:Pendulum-v0.md 56 | 57 | (2)在Pendulum-v0.md中 58 | > 59 | + 介绍环境,包括原理图,github对应链接等 60 | + 介绍奖励函数的设置过程和设置原理 61 | + 对gym环境中涉及奖励的源码进行讲解 62 | 63 | (3)总结 64 | 65 | 实例请看environment目录下:Pendulum-v0.md 66 | 67 | ### 中期目标 68 | #### 1. 稀疏奖励 69 | #### 2. 逆向强化学习 70 | #### 71 | 72 | 73 | 74 |
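下面补充一个最小的奖励函数改写示例(仅为示意草稿:假设使用经典 gym 接口,以 Pendulum-v0 为例,`ScaledRewardWrapper` 类名与 `scale` 参数均为本文为演示自拟,并非仓库或 Gym 官方代码)。Pendulum-v0 源码中的奖励大致形如 -(theta^2 + 0.1*theta_dot^2 + 0.001*action^2);在研究奖励函数时,通常可以借助 `gym.RewardWrapper` 在不改动环境源码的前提下替换或重塑奖励,便于对比不同奖励设计对训练效果的影响:

```python
import gym


class ScaledRewardWrapper(gym.RewardWrapper):
    """示例包装器:对原始奖励做简单缩放,可替换为任意自定义的 reward shaping 逻辑"""

    def __init__(self, env, scale=0.1):
        super().__init__(env)
        self.scale = scale

    def reward(self, reward):
        # 仅作演示:按比例缩放原始奖励;研究稀疏奖励、逆向强化学习时可在此处改写奖励
        return reward * self.scale


env = ScaledRewardWrapper(gym.make("Pendulum-v0"), scale=0.1)
obs = env.reset()
for _ in range(200):
    action = env.action_space.sample()  # 随机动作,仅用于观察奖励的变化
    obs, reward, done, info = env.step(action)
    if done:
        obs = env.reset()
env.close()
```

在撰写各环境的奖励函数分析文档(如 Pendulum-v0.md)时,可以把这类包装器得到的训练曲线与原始奖励的结果进行对比,作为奖励设计分析的一部分。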
75 | ### 致谢列表: 76 | 非常感谢以下同学对Gym环境解读工作所做的贡献: 77 | 78 | > 79 | > @GithubAccountName 贡献: Pendulum-v0贡献 80 | 81 | > @ChasingZenith 贡献:Ant-v0环境 82 | -------------------------------------------------------------------------------- /DRL-TopicResearch/奖励函数研究/environment/.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NeuronDance/DeepRL/8342ea71be1dffe26581d964f8011bcb973f85d6/DRL-TopicResearch/奖励函数研究/environment/.DS_Store -------------------------------------------------------------------------------- /DRL-WorkingCompany/.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NeuronDance/DeepRL/8342ea71be1dffe26581d964f8011bcb973f85d6/DRL-WorkingCompany/.DS_Store -------------------------------------------------------------------------------- /DRL-WorkingCompany/163.md: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NeuronDance/DeepRL/8342ea71be1dffe26581d964f8011bcb973f85d6/DRL-WorkingCompany/163.md -------------------------------------------------------------------------------- /DRL-WorkingCompany/HuaWei.md: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NeuronDance/DeepRL/8342ea71be1dffe26581d964f8011bcb973f85d6/DRL-WorkingCompany/HuaWei.md -------------------------------------------------------------------------------- /DRL-WorkingCompany/Inspir-AI.md: -------------------------------------------------------------------------------- 1 | 2 | 3 | # 启元世界 4 | 5 | ![](assets/markdown-img-paste-20190925113659331.png) 6 | 7 | 公司官网: http://www.inspirai.com 8 | 9 | 公司地址:北京市海淀区后屯路28号院KPHZ国际技术转移中心4层428室 10 | 11 | ### 关于启元(inspir-ai): 12 | 启元世界是一家 2017 年成立的以认知决策智能技术为核心的公司,由前阿里、Netflix、IBM 的科学家和高管发起,多位名牌大学的博士和硕士加入,并拥有伯克利、CMU 等知名机构的特聘顾问。启元世界的愿景是「打造决策智能、构建平行世界、激发人类潜能」,团队核心能力以深度学习、强化学习、超大规模并行计算为基础,拥有互联网、游戏等众多领域的成功经验,受到国内外一流投资人的青睐。 13 | 14 | ### 创始人: 15 | 袁泉,启元世界创始人 & CEO:曾担任阿里认知计算实验室负责人、资深总监,手机淘宝天猫推荐算法团队缔造者,打造了有好货、猜你喜欢等电商知名个性化产品,率团队荣获 2015 年双 11 CEO 特别贡献奖。加入阿里前,袁泉曾是 IBM 中国研究院的研究员,从事推荐等智能决策算法的研究,是 IBM 2011 年全球银行业 FOAK 创新项目发起人。在工业界大规模应用实践的同时,总结并发表了十余篇论文在国际顶级会议 ACM RecSys、KDD、SDM 等。袁泉拥有多项中美技术专利,长期担任 ACM RecSys、IEEE Transaction on Games 审稿人。 16 | 17 | 18 | ![](assets/markdown-img-paste-20190925113919903.png) 19 | 20 | ### 人才招聘 21 | 22 | 招聘对象 23 | 面向人群:海内外院校2019&2020届毕业生 24 | 毕业时间:2019年6月—2020年10月 25 | 26 | #### 招聘职位 27 | **1、强化学习算法研究员** 28 | 职位描述: 29 | ◆ 负责深度强化学习的前沿算法研究,推动和保持公司在国内外业界的技术领先性 30 | ◆以发表国际一流顶级会议文章,或设计优化算法原型系统为目标 31 | 职位要求: 32 | ◆机器学习、统计、数学、计算机科学、博弈论相关专业博士,或优秀硕士 33 | ◆在NIPS/ICML/ICLR/IROS/ACL/CVPR等著名国际会议或者顶级期刊上有论文发表经验 34 | ◆强化学习、深度学习理论和实践基础扎实;熟练使用Python/C++至少一门语言 35 | 工作地:北京/杭州 36 | 37 | **2、强化学习算法工程师** 38 | 职位描述: 39 | ◆ 负责深度强化学习算法的实现和优化 40 | ◆ 以参加某一国际相关竞赛,或完成某一前沿任务研发为目标 41 | 职位要求: 42 | ◆计算机、数学、自动化、统计、机器学习等相关专业,大学本科及以上学历 43 | ◆较强的算法设计和实现能力,在Kaggle、天池、ACM ICPC等国内外竞赛中获奖者优先 44 | ◆能够熟练的使用Python编程,有C++编程经验者优先 45 | ◆熟练使用TensorFlow、Ray、Caffe2等机器学习工具,熟悉底层实现为宜 46 | 工作地:北京/杭州 47 | 48 | **3、强化学习平台研发工程师** 49 | 职位描述: 50 | ◆ 参与公司机器学习平台的架构设计、实现和优化工作 51 | 职位要求: 52 | ◆ 计算机相关专业本科及以上学历 53 | ◆ 熟练应用C/C++、Python等语言,具有良好的编程习惯,熟悉多线程编程,内存管理,设计模式和Linux/Unix开发环境 54 | ◆ 掌握分布式系统相关知识,或熟悉GPU硬件架构和CUDA编程,有大规模/互联网系统开发经验者优先 55 | ◆ 良好的沟通能力,有责任心,自我驱动 56 | 工作地:北京/杭州 57 | 58 | **4、强化学习机器人工程师** 59 | 职位描述: 60 | ◆ 基于机器学习的方法进行机器人控制及路径规划,解决机器人在实际应用场景的迁移性、鲁棒性等问题,不断优化提升效果。 61 | 职位要求: 62 | ◆ 机器人、计算机、自动化、自控、机械等相关专业本科及以上学历; 63 | ◆ 
熟悉机器人系统结构和ROS等软件栈,能够熟练使用C++、Python等任一编程语言; 64 | ◆ 具备较强的动手能力,在国内外各类机器人竞赛中获奖者优先; 65 | ◆ 具有实际 SLAM、定位算法、图像视觉处理等经验者优先;有机器人相关项目经验者优先 66 | 工作地:北京/杭州 67 | 投递方式 68 | 69 | ___ 70 | 71 | *投递简历至公司邮箱: hr@inspirai.com* 72 | 73 | **时间节点** 74 | 1: 投递简历至公司邮箱 75 | >3月15日 ——4月22日 76 | 欢迎访问官网inspirai.com 77 | 投递简历至 hr@inspirai.com 78 | 79 | 2:在线答题和面试 80 | >3月22日——5月1日 81 | 通常有3轮面试,请留意邮箱或电话通知 82 | 83 | 3:发放offer 84 | >资深博士/专家成为实习mentor,开启强化学习的难忘旅程 85 | -------------------------------------------------------------------------------- /DRL-WorkingCompany/KuaiShou.md: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NeuronDance/DeepRL/8342ea71be1dffe26581d964f8011bcb973f85d6/DRL-WorkingCompany/KuaiShou.md -------------------------------------------------------------------------------- /DRL-WorkingCompany/MeiTuan.md: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NeuronDance/DeepRL/8342ea71be1dffe26581d964f8011bcb973f85d6/DRL-WorkingCompany/MeiTuan.md -------------------------------------------------------------------------------- /DRL-WorkingCompany/ReadMe.md: -------------------------------------------------------------------------------- 1 | ### 那些做深度强化学习的公司 2 | 3 | ## 阿里巴巴 4 | 5 | ## 腾讯 6 | 7 | ## 网易 8 | 9 | ## 今日头条 10 | 11 | ## 华为诺亚方舟实验室 12 | 13 | ## 启元世界(http://www.inspirai.com) 14 | ![](assets/markdown-img-paste-20190925113659331.png) 15 | 启元世界是一家 2017 年成立的以认知决策智能技术为核心的公司,由前阿里、Netflix、IBM 的科学家和高管发起,多位名牌大学的博士和硕士加入,并拥有伯克利、CMU 等知名机构的特聘顾问。启元世界的愿景是「打造决策智能、构建平行世界、激发人类潜能」,团队核心能力以深度学习、强化学习、超大规模并行计算为基础,拥有互联网、游戏等众多领域的成功经验,受到国内外一流投资人的青睐。 16 | 17 | 袁泉,启元世界创始人 & CEO:曾担任阿里认知计算实验室负责人、资深总监,手机淘宝天猫推荐算法团队缔造者,打造了有好货、猜你喜欢等电商知名个性化产品,率团队荣获 2015 年双 11 CEO 特别贡献奖。加入阿里前,袁泉曾是 IBM 中国研究院的研究员,从事推荐等智能决策算法的研究,是 IBM 2011 年全球银行业 FOAK 创新项目发起人。在工业界大规模应用实践的同时,总结并发表了十余篇论文在国际顶级会议 ACM RecSys、KDD、SDM 等。袁泉拥有多项中美技术专利,长期担任 ACM RecSys、IEEE Transaction on Games 审稿人。 18 | 19 | 地址:北京市海淀区后屯路28号院KPHZ国际技术转移中心4层428室 20 | 21 | ![](assets/markdown-img-paste-20190925113919903.png) 22 | ## VIVO 23 | 24 | ## 美团点评 25 | 26 | ## 北京快手科技有限公司 27 | 28 | ## 深圳星行科技有限公司 29 | 30 | ## Testin云测公司 31 | 32 | ##超参数科技(深圳)有限公司 33 | 34 | ## 北京初速度科技有限公司 35 | -------------------------------------------------------------------------------- /DRL-WorkingCompany/Tencent.md: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NeuronDance/DeepRL/8342ea71be1dffe26581d964f8011bcb973f85d6/DRL-WorkingCompany/Tencent.md -------------------------------------------------------------------------------- /DRL-WorkingCompany/Testin.md: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NeuronDance/DeepRL/8342ea71be1dffe26581d964f8011bcb973f85d6/DRL-WorkingCompany/Testin.md -------------------------------------------------------------------------------- /DRL-WorkingCompany/VIVO.md: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NeuronDance/DeepRL/8342ea71be1dffe26581d964f8011bcb973f85d6/DRL-WorkingCompany/VIVO.md -------------------------------------------------------------------------------- /DRL-WorkingCompany/assets/.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NeuronDance/DeepRL/8342ea71be1dffe26581d964f8011bcb973f85d6/DRL-WorkingCompany/assets/.DS_Store 
-------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2019 NeuronDance 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Deep Reinforcement Learning(深度强化学习) 2 | 3 |

4 | 5 | 6 | 7 | ![GitHub issues](https://img.shields.io/github/issues/NeuronDance/DeepRL) 8 | 9 |


10 | 11 | 12 | 本仓库由“深度强化学习实验室(DeepRL-Lab)”创建,希望能够为所有DRL研究者,学习者和爱好者提供一个学习指导。 13 | 14 | 15 | >如今机器学习发展如此迅猛,各类算法层出不群,特别是深度神经网络在计算机视觉、自然语言处理、时间序列预测等多个领域更是战果累累,可以说这波浪潮带动了很多人进入深度学习领域,也成就了其一番事业。而强化学习作为一门灵感来源于心理学中的行为主义理论的学科,其内容涉及概率论、统计学、逼近论、凸分析、计算复杂性理论、运筹学等多学科知识,难度之大,门槛之高,导致其发展速度特别缓慢。围棋作为人类的娱乐游戏中复杂度最高的一个,它横竖各有19条线,共有361个落子点,双方交替落子,状态空间高达10的171次方(注:宇宙中的原子总数是10的80次方,即使穷尽整个宇宙的物质也不能存下围棋的所有可能性) 16 | ### 1、Deep Reinforcement Learning? 17 | 时间 | 内容| 18 | -|-| 19 | 2015.10 | 由Google-DeepMind公司开发的AlphaGo程序击败了人类高级选手樊麾,成为第一个无需让子即可在19路棋盘上击败围棋职业棋手的计算机围棋程序,并写进了历史,论文发表在国际顶级期刊《Science》上| 20 | 2016.3| 透过自我对弈数以万计盘进行练习强化,AlphaGo在一场五番棋比赛中4:1击败顶尖职业棋手李世石。| 21 | 2016.12|Master(AlphaGo版本)开始出现于弈城围棋网和腾讯野狐围棋网,取得60连胜的成绩,以其空前的实力轰动了围棋界。| 22 | -|DeepMind 如约公布了他们最新版AlphaGo论文(Nature),介绍了迄今最强最新的版本AlphaGo Zero,使用纯强化学习,将价值网络和策略网络整合为一个架构,3天训练后就以100比0击败了上一版本的AlphaGo。AlphaGo已经退休,但技术永存。DeepMind已经完成围棋上的概念证明,接下来就是用强化学习创造改变世界的价值。| 23 | 24 | 围棋被攻克证明了强化学习发展的威力,作为AlphoGo的带头人,强化学习界的大神,David Sliver提出人工智能的终极目标是: 25 | 26 | **AI = DL(Deep Learning) + RL(Reinforcement Learning) == DRL(Deep Reinforcement Learning)** 27 | 28 | 29 | --- 30 | 31 | ### 2、Application? 32 | 在深度学习已经取得了很大的进步的基础上,深度强化学习真正的发展归功于神经网络、深度学习以及计算力的提升,David就是使用了神经网络逼近值函数后提出深度强化学习(Deep Reinforcement Learning,DRL),并证明了确定性策略等。纵观近四年的ICML,NPIS等顶级会议论文,强化学习的理论进步,应用领域逐渐爆发式增广,目前已经在如下领域有了广泛使用: 33 | > 34 | + 自动驾驶:自动驾驶载具(self-driving vehicle) 35 | + 控制论(离散和连续大动作空间): 玩具直升机、Gymm_cotrol物理部件控制、机器人行走、机械臂控制。 36 | + 游戏:Go, Atari 2600(DeepMind论文详解)等 37 | + 自然语言处理:机器翻译, 文本序列预测,问答系统,人机对话 38 | + 超参数学习:神经网络参数自动设计 39 | + 推荐系统:阿里巴巴黄皮书(商品推荐),广告投放。 40 | + 智能电网:电网负荷调试,调度等 41 | + 通信网络:动态路由, 流量分配等 42 | + 财务与财经系统分析与管理 43 | + 智能医疗 44 | + 智能交通网络及网络流 45 | + 物理化学实验:定量实验,核素碰撞,粒子束流调试等 46 | + 程序学习和网络安全:网络攻防等 47 | 48 | --- 49 | 50 | ### 3、一流研究机构有哪些? 51 | 机构名| Logo|官网|简介| 52 | -|-|-|-| 53 | DeepMind|![](assets/markdown-img-paste-20190222165835138.png)|[Access](https://deepmind.com/)|DeepMind是一家英国的人工智能公司。公司创建于2010年,最初名称是DeepMind科技(DeepMind Technologies Limited),在2014年被谷歌收购。| 54 | OpenAI|![](assets/markdown-img-paste-20190222165707224.png)|[Access](https://openai.com/)|OpenAI是一个非营利性人工智能(AI)研究组织,旨在促进和发展友好的人工智能,使人类整体受益。这家总部位于旧金山的组织成立于2015年底,旨在通过向公众开放其专利和研究,与其他机构和研究人员“自由合作”。创始人(尤其是伊隆马斯克和萨姆奥特曼)的部分动机是出于对通用人工智能风险的担忧。| 55 | UC Berkeley||[Access1](https://bair.berkeley.edu)
[Access2](http://hart.berkeley.edu/)|| 56 | ...|||| 57 | 58 | 59 | 60 | ### 4、业界大佬有哪些? 61 | Name|Company| Homepage|about| 62 | -|-|-|-| 63 | **Richard Sutton**|Deepmind|[page](http://incompleteideas.net/)|强化学习的祖师爷,著有《Reinforcement Learning: An Introduction》| 64 | **David Sliver**|DeepMind|[page](http://www0.cs.ucl.ac.uk/staff/d.silver/web/Home.html),[Google学术](https://scholar.google.com/citations?user=-8DNE4UAAAAJ&hl=zh-CN)|AlphaGo、AlphaStar掌门人,UCL公开课主讲人,他工作重点是将强化学习与深度学习相结合,包括一个学习直接从像素中学习Atari游戏的程序。领导AlphaGo项目,最终推出了第一个在Go全尺寸游戏中击败顶级职业玩家的计划。 AlphaGo随后获得了荣誉9丹专业认证;并因创新而获得戛纳电影节奖。然后他领导了AlphaZero的开发,它使用相同的AI来学习玩从头开始(仅通过自己玩而不是从人类游戏中学习),然后学习以相同的方式下棋和将棋,比任何其他计算机更高的水平方案| 65 | **Oriol Vinyals**|DeepMind||AlphaStar主要负责人 66 | **Pieter Abbeel**|UC Berkeley| [page](http://people.eecs.berkeley.edu/~pabbeel/),[Google学术](https://scholar.google.com/citations?user=vtwH6GkAAAAJ&hl=zh-CN)|机器人和强化学习专家 加州大学伯克利分校教授,EECS,BAIR,CHAI(2008-),伯克利机器人学习实验室主任,伯克利人工智能研究(BAIR)实验室联合主任,联合创始人,总裁兼首席科学家covariant.ai(2017-),研究科学家(2016-2017),顾问(2018-)OpenAI,联合创始人Gradescope(2014-2018:TurnItIn收购)| 67 | 68 | 69 | ### 5、如何学习? 70 | 内容|学习方法与资料| 71 | -|-| 72 | 补充数学基础(高数、线代、概率论)|[Access](https://github.com/NeuronDance/DeepRL/tree/master/AI-Basic-Resource)| 73 | 基础与课程学习|[Access](https://github.com/NeuronDance/DeepRL/tree/master/DRL-Course)
74 | 强化学习竞赛|[Access](https://github.com/NeuronDance/DeepRL/tree/master/DRL-Competition)
75 | 开源框架学习|[Access](https://github.com/NeuronDance/DeepRL/tree/master/DRL-OpenSource) 76 | 77 | 78 | 79 | 80 | 81 | 82 | ### 6、关于深度强化学习实验室 83 | -|-|-| 84 | 成员|包含教授、讲师、博士、硕士、本科、|**学术界**:清华、北大、山大、浙大、北航、东南、南大、大工、天大、中科大、北理工、国防科大、牛津大学、帝国理工、CMU、南洋理工、柏林工业、西悉尼大学、埃默里大学等
**工业界**:腾讯、阿里巴巴、网易、头条、华为、快手等 85 | 86 | 愿景|DeepRL| 87 | [1]. 提供最全面的深度强化学习书籍、资料、综述等学习资源。
[2]. 阐述深度强化学习的基本原理、前沿算法、场景应用、竞赛分析、论文分享等专业知识。
[3]. 分享最前沿的业界动态和行业发展趋势。
[4]. 成为所有深度强化学习领域的研究者与爱好者交流平台。 88 | 89 | ### @致谢 90 | 欢迎每一位伙伴积极为项目贡献微薄之力,共同点亮星星之火。
91 | 92 | 93 | **贡献者列表(排名不分先后)**:
94 | 95 | --- 96 | @[taoyafan](https://github.com/taoyafan),@[BluesChang](https://github.com/BluesChang),@[Wangergou123](https://github.com/Wangergou123),@[TianLin0509](https://github.com/TianLin0509),@[zanghyu](https://github.com/zanghyu),@[hijkzzz](https://github.com/hijkzzz),@[tengshiquan](https://github.com/tengshiquan) 97 | 98 | --- 99 | 100 | #### @联系方式 101 | Title|| 102 | -|-| 103 | 微信群聊|加微信助手:NeuronDance(进交流群)| 104 | CSDN博客|[深度强化学习(DRL)探索](https://blog.csdn.net/gsww404)
| 105 | 知乎专栏|[DeepRL基础探索](https://zhuanlan.zhihu.com/deeprl)/[DeepRL前沿论文解读](https://zhuanlan.zhihu.com/drl-paper) 106 | 微信公众号|如下图| 107 | 108 | ![](http://deeprlhub.com/assets/files/2021-12-24/1640349661-676524-wechatimg64.jpeg) 109 | --------------------------------------------------------------------------------