├── .github
│   └── FUNDING.yml
├── ACER.md
├── BSRL.md
├── BiCNet.md
├── C51-analysis.md
├── C51.md
├── CAPG.md
├── CDC.md
├── COF-PAC.md
├── COMA.md
├── ChallengesRealWorldRL.md
├── Coinrun.md
├── D4PG.md
├── DDPG.md
├── DDQN.md
├── DEBP.md
├── DPPO.md
├── DQN.md
├── DQfD.md
├── Dip.md
├── DirPG.md
├── Disagreement.md
├── Distral.md
├── DualMDP.md
├── Dueling.md
├── E2.md
├── ECMAC.md
├── EDDICT.md
├── EPG.md
├── EX2.md
├── GANAC.md
├── GANQL.md
├── GNLBE.md
├── GTD.md
├── GVF.md
├── GVG.md
├── GenerativeBelief.md
├── Geoff-PAC.md
├── HAL.md
├── HILP.md
├── HIRL.md
├── HIRO.md
├── I2As.md
├── IBP.md
├── IPG.md
├── IQN.md
├── ISMCI.md
├── Intervaltime.md
├── KL-RegulaRL.md
├── LEARN.md
├── LFOD.md
├── LICENSE
├── LOLA.md
├── LQR+GAIfO.md
├── LipschitzQ.md
├── MADDPG.md
├── MBDQN.md
├── MBIE-EB.md
├── MCAI.md
├── MCGE.md
├── MERL.md
├── MGRL.md
├── MMRB.md
├── MPO.md
├── MRL.md
├── MSRL.md
├── MetaSS.md
├── NDM.md
├── NEC.md
├── NashDQN.md
├── NoisyNet.md
├── OLRL.md
├── OP-GAIL.md
├── OPRE.md
├── OVPG.md
├── PCL.md
├── PEARL.md
├── PEB.md
├── PER.md
├── PGQ.md
├── PGS.md
├── PGSQL.md
├── PPO-CMA.md
├── PPO.md
├── PhiEB.md
├── ProMP.md
├── Programmable.md
├── Proposal.md
├── QEnsemble.md
├── QPROP.md
├── QR-DQN.md
├── RCFR.md
├── REACTOR.md
├── README.md
├── RECUR.md
├── REETDQN.md
├── RLCRC.md
├── RLNL.md
├── RLP.md
├── RLTUNER.md
├── ROMMEO.md
├── RVF.md
├── Rainbow.md
├── RayInterference.md
├── Reciprocity.md
├── RoboSumo.md
├── SGA.md
├── SOM.md
├── SPU.md
├── SRL.md
├── ST-DIM.md
├── SVRL.md
├── SoRB.md
├── TRPO.md
├── UBE.md
├── UML.md
├── UNREAL.md
├── VALOR.md
├── VICE.md
├── VISR.md
├── Viper.md
├── ZSTG.md
├── _config.yml
├── bad.md
├── bdope.md
├── content.md
├── database.csv
├── dmimic.md
├── images
│   ├── .gitkeep
│   ├── ACER.png
│   ├── TRPO.png
│   ├── Trust Region Policy Optimization.png
│   ├── awesome-drl.png
│   ├── incentizing.png
│   ├── landscape.jpeg
│   └── unreal.png
├── incentivizing.md
├── index.html
├── p10.md
├── parametric.md
├── reproducing.md
└── sc.md
/.github/FUNDING.yml:
--------------------------------------------------------------------------------
1 | # These are supported funding model platforms
2 |
3 | github: [tigerneil] # Replace with up to 4 GitHub Sponsors-enabled usernames e.g., [user1, user2]
4 | patreon: # Replace with a single Patreon username
5 | open_collective: # Replace with a single Open Collective username
6 | ko_fi: # Replace with a single Ko-fi username
7 | tidelift: # Replace with a single Tidelift platform-name/package-name e.g., npm/babel
8 | community_bridge: # Replace with a single Community Bridge project-name e.g., cloud-foundry
9 | liberapay: # Replace with a single Liberapay username
10 | issuehunt: # Replace with a single IssueHunt username
11 | otechie: # Replace with a single Otechie username
12 | custom: # Replace with up to 4 custom sponsorship URLs e.g., ['link1', 'link2']
13 |
--------------------------------------------------------------------------------
/ACER.md:
--------------------------------------------------------------------------------
1 | # Actor-Critic with Experience Replay
2 | It combines three breakthroughs:
3 | 1. Truncated importance sampling with bias correction
4 | 2. Stochastic dueling network architectures
5 | 3. A new trust region policy optimization method
6 |
7 | It utilizes recent developments in:
8 | * deep neural networks,
9 | * variance reduction techniques,
10 | * the off-policy Retrace algorithm (Munos et al., 2016)
11 | * parallel training of RL agents (A3C; Mnih et al., 2016)
12 |
13 | **Theoretical result:**
14 | The Retrace operator can be rewritten in terms of the proposed truncated importance sampling with bias correction technique.
15 |
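The following numpy snippet is a minimal sketch (not the paper's code; all names and numbers are illustrative) of the truncation-with-bias-correction identity behind this result, written out for a discrete action space:

```python
import numpy as np

# Truncation with bias correction for a discrete action space:
#   E_{a~mu}[ min(c, rho(a)) f(a) ] + E_{a~pi}[ max(0, (rho(a)-c)/rho(a)) f(a) ]
#     = E_{a~pi}[ f(a) ],   where rho(a) = pi(a) / mu(a).
# The first term keeps the importance weights bounded by c (low variance);
# the second term, an expectation under the target policy, removes the bias.

rng = np.random.default_rng(0)
n_actions = 5
pi = rng.dirichlet(np.ones(n_actions))    # target policy (hypothetical)
mu = rng.dirichlet(np.ones(n_actions))    # behaviour policy (hypothetical)
f = rng.normal(size=n_actions)            # e.g. per-action advantage estimates
c = 10.0                                  # truncation threshold

rho = pi / mu
truncated = np.sum(mu * np.minimum(c, rho) * f)                    # E_mu[min(c, rho) f]
correction = np.sum(pi * np.clip((rho - c) / rho, 0.0, None) * f)  # E_pi[((rho-c)/rho)_+ f]

print(truncated + correction)   # equals ...
print(np.sum(pi * f))           # ... the untruncated target E_pi[f]
```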
16 | 
17 |
--------------------------------------------------------------------------------
/BSRL.md:
--------------------------------------------------------------------------------
1 | Behaviour Suite for Reinforcement Learning
2 | > Ian Osband, Yotam Doron, Matteo Hessel, John Aslanides, Eren Sezener, Andre Saraiva, Katrina McKinney, Tor Lattimore, Csaba Szepesvari, Satinder Singh, Benjamin Van Roy, Richard Sutton, David Silver, Hado Van Hasselt
3 |
4 | ## Abstract
5 | This paper introduces the Behaviour Suite for Reinforcement Learning, or bsuite for short.
6 |
7 | bsuite is a collection of carefully-designed experiments that investigate core capabilities of reinforcement learning (RL) agents with two objectives.
8 |
9 | First, to collect clear, informative and scalable problems that capture key issues in the design of general and efficient learning algorithms.
10 |
11 | Second, to study agent behaviour through their performance on these shared benchmarks.
12 |
13 | To complement this effort, we open source github.com/deepmind/bsuite, which automates evaluation and analysis of any agent on bsuite.
14 |
15 | This library facilitates reproducible and accessible research on the core issues in RL, and ultimately the design of superior learning algorithms.
16 |
17 | Our code is Python, and easy to use within existing projects.
18 |
19 | We include examples with OpenAI Baselines, Dopamine as well as new reference implementations.
20 |
21 | Going forward, we hope to incorporate more excellent experiments from the research community, and commit to a periodic review of bsuite from a committee of prominent researchers.
22 |
--------------------------------------------------------------------------------
/BiCNet.md:
--------------------------------------------------------------------------------
1 | Multiagent Bidirectionally-Coordinated Nets
2 | for Learning to Play StarCraft Combat Games
3 |
4 | Peng Peng†, Quan Yuan†, Ying Wen‡, Yaodong Yang‡, Zhenkun Tang†, Haitao Long†, Jun Wang‡ ∗
5 | †Alibaba Group, ‡University College London
6 |
7 | Real-world artificial intelligence (AI) applications often require multiple agents
8 | to work in a collaborative effort. Efficient learning for intra-agent communication
9 | and coordination is an indispensable step towards general AI. In this paper, we take
10 | StarCraft combat game as the test scenario, where the task is to coordinate multiple
11 | agents as a team to defeat their enemies. To maintain a scalable yet effective
12 | communication protocol, we introduce a multiagent bidirectionally-coordinated
13 | network (BiCNet [’bIknet]) with a vectorised extension of actor-critic formulation.
14 | We show that BiCNet can handle different types of combats under diverse terrains
15 | with arbitrary numbers of AI agents for both sides. Our analysis demonstrates
16 | that without any supervision such as human demonstrations or labelled data,
17 | BiCNet could learn various types of coordination strategies similar to those
18 | of experienced game players. Moreover, BiCNet is easily adaptable to the tasks
19 | with heterogeneous agents. In our experiments, we evaluate our approach against
20 | multiple baselines under different scenarios; it shows state-of-the-art performance,
21 | and possesses potential values for large-scale real-world applications.
22 |
--------------------------------------------------------------------------------
/C51-analysis.md:
--------------------------------------------------------------------------------
1 | # An Analysis of Categorical Distributional Reinforcement Learning
2 |
3 | Distributional approaches to value-based reinforcement learning model the entire distribution of returns, rather than just their expected values, and have recently been shown to yield state-of-the-art empirical performance.
4 |
5 | This was demonstrated by the recently proposed C51 algorithm, based on categorical distributional reinforcement learning (CDRL) [Bellemare et al., 2017a]. However, the theoretical properties of CDRL algorithms are not yet well understood.
6 |
7 | In this paper, we introduce a framework to analyse CDRL algorithms, establish the importance of the projected distributional Bellman operator
8 | in distributional RL, draw fundamental connections between CDRL and the Cramér distance, and give a proof of convergence for sample-based categorical distributional reinforcement
9 | learning algorithms.
10 |
--------------------------------------------------------------------------------
/C51.md:
--------------------------------------------------------------------------------
1 | # A Distributional Perspective on Reinforcement Learning
2 | > Marc G. Bellemare, Will Dabney, Rémi Munos
3 |
4 | In this paper we argue for the fundamental importance of the value distribution: the distribution of the random return received by a reinforcement learning agent.
5 |
6 | This is in contrast to the common approach to reinforcement learning which models the expectation of this return, or value.
7 |
8 | Although there is an established body of literature studying the value distribution, thus far it has always been used for a specific purpose such as implementing risk-aware behaviour.
9 |
10 | We begin with theoretical results in both the policy evaluation and control settings, exposing a significant distributional instability in the latter.
11 |
12 | We then use the distributional perspective to design a **new algorithm** which applies Bellman's equation to the learning of approximate value distributions.
13 |
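One concrete instantiation of this idea, the C51 algorithm, represents the return distribution on a fixed grid of atoms and projects the Bellman target back onto that grid. Below is a minimal numpy sketch of that projection step for a single transition (support range, atom count, and names are illustrative, not the paper's code):

```python
import numpy as np

# Categorical projection: the target distribution of r + gamma * Z(s', a*) is
# projected back onto the fixed support {z_0, ..., z_{N-1}}.

def categorical_projection(next_probs, reward, gamma, v_min=-10.0, v_max=10.0):
    n_atoms = next_probs.shape[0]
    z = np.linspace(v_min, v_max, n_atoms)          # fixed support of the return
    delta_z = (v_max - v_min) / (n_atoms - 1)

    tz = np.clip(reward + gamma * z, v_min, v_max)  # shifted and clipped atoms
    b = (tz - v_min) / delta_z                      # fractional index on the support
    lower, upper = np.floor(b).astype(int), np.ceil(b).astype(int)

    m = np.zeros(n_atoms)                           # projected target distribution
    for j in range(n_atoms):
        if lower[j] == upper[j]:                    # atom lands exactly on the grid
            m[lower[j]] += next_probs[j]
        else:                                       # split mass between neighbours
            m[lower[j]] += next_probs[j] * (upper[j] - b[j])
            m[upper[j]] += next_probs[j] * (b[j] - lower[j])
    return m  # used as the cross-entropy target for the predicted distribution

probs = np.full(51, 1.0 / 51)                       # e.g. uniform next-state distribution
target = categorical_projection(probs, reward=1.0, gamma=0.99)
assert abs(target.sum() - 1.0) < 1e-6
```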
14 | We evaluate our algorithm using the suite of games from the Arcade Learning Environment.
15 |
16 | We obtain both state-of-the-art results and anecdotal evidence demonstrating the **importance of the value distribution** in approximate reinforcement learning.
17 |
18 | Finally, we combine theoretical and empirical evidence to highlight the ways in which the value distribution impacts learning in the approximate setting.
19 |
--------------------------------------------------------------------------------
/CAPG.md:
--------------------------------------------------------------------------------
1 | # Clipped Action Policy Gradient
2 | > Yasuhiro Fujita 1 Shin-ichi Maeda 1
3 |
4 | ## Abstract
5 | Many continuous control tasks have bounded action spaces.
6 |
7 | When policy gradient methods are applied to such tasks, out-of-bound actions need to be clipped before execution, while policies are usually optimized as if the actions are not clipped.
8 |
9 | We propose a policy gradient estimator that exploits the knowledge of actions being clipped to reduce the variance in estimation.
10 |
11 | We prove that our estimator, named clipped action policy gradient (CAPG), is unbiased and achieves lower variance than the conventional estimator that ignores action bounds.
12 |
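A simplified one-dimensional Gaussian sketch of the underlying idea (an illustration, not the authors' estimator verbatim): the executed, clipped action follows a distribution with a density inside the bounds and point masses at the bounds, and a clipping-aware policy gradient differentiates the log-probability under that distribution rather than the unclipped density.

```python
import torch
from torch.distributions import Normal

def clipped_action_log_prob(mean, std, action, low=-1.0, high=1.0):
    dist = Normal(mean, std)
    log_mass_low = torch.log(dist.cdf(torch.tensor(low)).clamp_min(1e-12))
    log_mass_high = torch.log((1.0 - dist.cdf(torch.tensor(high))).clamp_min(1e-12))
    if action <= low:        # all out-of-bound mass below `low` maps to the bound
        return log_mass_low
    if action >= high:       # all out-of-bound mass above `high` maps to the bound
        return log_mass_high
    return dist.log_prob(torch.tensor(action))   # interior: ordinary Gaussian density

mean = torch.tensor(1.3, requires_grad=True)     # hypothetical policy mean
std = torch.tensor(0.5)
logp = clipped_action_log_prob(mean, std, action=1.0)   # action hit the upper bound
logp.backward()
print(mean.grad)   # gradient of the clipped-action log-probability w.r.t. the mean
```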
13 | Experimental results demonstrate that CAPG generally outperforms the conventional estimator, indicating that it is a better policy gradient estimator for continuous control tasks.
14 |
15 | The source code is available at https://github.com/pfnet-research/capg.
16 |
--------------------------------------------------------------------------------
/CDC.md:
--------------------------------------------------------------------------------
1 | # The Complexity of Decentralized Control of Markov Decision Processes
2 | > Daniel S. Bernstein, Shlomo Zilberstein, and Neil Immerman
3 |
4 | ## Abstract
5 |
6 | Planning for distributed agents with partial state information is considered from a decision-theoretic perspective. We describe generalizations of both the MDP and POMDP models
7 | that allow for decentralized control.
8 |
9 | For even a small number of agents, the finite-horizon problems corresponding to both of our models are complete for nondeterministic exponential time.
10 |
11 | These complexity results illustrate a fundamental difference between centralized and decentralized control of Markov processes.
12 |
13 | In contrast to the MDP and POMDP problems, the problems we consider provably do not admit polynomial-time algorithms and most likely require doubly exponential time to solve in the worst case.
14 |
15 | We have thus provided mathematical evidence corresponding to the intuition that decentralized planning problems cannot easily be reduced to centralized problems and solved exactly using established techniques.
16 |
--------------------------------------------------------------------------------
/COF-PAC.md:
--------------------------------------------------------------------------------
1 | # Provably Convergent Off-Policy Actor-Critic with Function Approximation
2 | > Shangtong Zhang, Bo Liu, Hengshuai Yao, Shimon Whiteson
3 |
4 | ## Abstract
5 | We present the first provably convergent off-policy actor-critic algorithm (COF-PAC) with function approximation in a two-timescale form.
6 |
7 | Key to COF-PAC is the introduction of a new critic, the emphasis critic, which is trained via Gradient Emphasis Learning (GEM), a novel combination of the key ideas of Gradient Temporal Difference Learning and Emphatic Temporal Difference Learning.
8 |
9 | With the help of the emphasis critic and the canonical value function critic, we show convergence for COF-PAC, where the critics are linear and the actor can be nonlinear.
10 |
--------------------------------------------------------------------------------
/COMA.md:
--------------------------------------------------------------------------------
1 | # Counterfactual Multi-Agent Policy Gradients
2 |
3 | > Jakob N. Foerster
4 | > Gregory Farquhar
5 | > Triantafyllos Afouras
6 | > Nantas Nardelli
7 | > Shimon Whiteson
8 | > University of Oxford
9 |
10 | ## existing problems for MAS
11 | Cooperative multi-agent systems can be naturally used to model many real world problems, such as network packet routing or the coordination of autonomous vehicles.
12 |
13 | There is a great need for new reinforcement learning methods that can efficiently learn decentralised policies for such systems.
14 |
15 | ## solutions
16 | To this end, we propose a new multi-agent actor-critic method called counterfactual multi-agent (COMA)
17 | policy gradients.
18 |
19 | 1. COMA uses a **centralised critic** to estimate the Q-function and **decentralised actors** to optimise the agents’ policies.
20 | 2. To address the challenges of multi-agent credit assignment, it uses a **counterfactual baseline** that marginalises out a single agent’s action, while keeping the other agents’ actions fixed.
21 | 3. COMA also uses a critic representation that allows the counterfactual baseline to be computed efficiently in a single forward pass.
22 |
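A minimal sketch of the counterfactual baseline for one agent with discrete actions (toy numbers, not the paper's code): the critic's value for the chosen joint action is compared to its expectation over that agent's own actions, with the other agents' actions held fixed.

```python
import numpy as np

n_actions = 4
pi_a = np.array([0.1, 0.2, 0.3, 0.4])        # agent a's policy over its own actions
# Q(s, (u^{-a}, u'^a)) for each alternative action u'^a of agent a, with the other
# agents' chosen actions held fixed (one forward pass of the centralised critic).
q_alternatives = np.array([1.0, 0.5, 2.0, -1.0])
chosen_action = 2                             # the action agent a actually took

counterfactual_baseline = np.dot(pi_a, q_alternatives)
advantage = q_alternatives[chosen_action] - counterfactual_baseline
# `advantage` multiplies grad log pi_a(chosen_action) in the decentralised actor update.
print(advantage)
```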
23 | ## evaluation
24 | We evaluate COMA in the testbed of StarCraft unit micromanagement, using a decentralised variant with significant partial observability.
25 |
26 | COMA significantly improves average performance over other multi-agent actor-critic methods in this setting, and the best performing agents are competitive with state-of-the-art centralised controllers that get access to the full state.
27 |
--------------------------------------------------------------------------------
/ChallengesRealWorldRL.md:
--------------------------------------------------------------------------------
1 | # Challenges of Real-World Reinforcement Learning
2 | > Gabriel Dulac-Arnold, Daniel Mankowitz, and Todd Hester
3 |
4 | ## Abstract
5 | Reinforcement learning (RL) has proven its worth in a series of artificial domains, and is beginning to show some successes in real-world scenarios.
6 |
7 | However, many of the research advances in RL are often hard to leverage in real-world systems due to a series of assumptions that are rarely satisfied in practice.
8 |
9 | We present a set of nine unique challenges that must be addressed to productionize RL for real-world problems.
10 |
11 | For each of these challenges, we specify the exact meaning of the challenge, present some approaches from the literature, and specify some metrics for evaluating that challenge.
12 |
13 | An approach that addresses all nine challenges would be applicable to a large number of real world problems. We also present an example domain that has been modified to present these challenges as a testbed for practical RL research.
14 |
--------------------------------------------------------------------------------
/Coinrun.md:
--------------------------------------------------------------------------------
1 | ## Quantifying Generalization in Reinforcement Learning
2 | > Karl Cobbe, Oleg Klimov, Chris Hesse, Taehoon Kim, John Schulman
3 |
4 | ### Abstract
5 | In this paper, we investigate the problem of overfitting in deep reinforcement learning.
6 |
7 | Among the most common benchmarks in RL, it is customary to use the same environments for both training and testing.
8 |
9 | This practice offers relatively little insight into an agent’s ability to generalize.
10 |
11 | We address this issue by using procedurally generated environments to construct distinct training and test sets.
12 |
13 | Most notably, we introduce a new environment called CoinRun, designed as a benchmark for generalization in RL.
14 |
15 | Using CoinRun, we find that agents overfit to surprisingly large training sets.
16 |
17 | We then show that deeper convolutional architectures improve generalization, as do methods traditionally found in supervised learning, including L2 regularization, dropout, data augmentation and batch normalization.
18 |
--------------------------------------------------------------------------------
/D4PG.md:
--------------------------------------------------------------------------------
1 | # DISTRIBUTED DISTRIBUTIONAL DETERMINISTIC POLICY GRADIENTS
2 | > Gabriel Barth-Maron*, Matthew W. Hoffman*, David Budden, Will Dabney,
3 | Dan Horgan, Dhruva TB, Alistair Muldal, Nicolas Heess, Timothy Lillicrap
4 | DeepMind
5 |
6 | ## ABSTRACT
7 | This work adopts the very successful distributional perspective on reinforcement learning and adapts it to the continuous control setting.
8 |
9 | We combine this within a distributed framework for off-policy learning in order to develop what we call the Distributed Distributional Deep Deterministic Policy Gradient algorithm, D4PG.
10 |
11 | We also combine this technique with a number of additional, simple improvements such as the use of N-step returns and prioritized experience replay.
12 |
13 | Experimentally we examine the contribution of each of these individual components, and show how they interact, as well as their combined contributions.
14 |
15 | Our results show that across a wide variety of simple control tasks, difficult manipulation tasks, and a set of hard obstacle-based locomotion tasks the D4PG algorithm achieves state of the art performance.
16 |
--------------------------------------------------------------------------------
/DDPG.md:
--------------------------------------------------------------------------------
1 | Timothy P. Lillicrap, Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, Daan Wierstra
2 |
3 |
4 | We adapt the ideas underlying the success of Deep Q-Learning to the continuous action domain.
5 | We present an actor-critic, model-free algorithm based on the deterministic policy gradient that can operate over continuous action spaces.
6 | Using the same learning algorithm, network architecture and hyper-parameters, our algorithm robustly solves more than 20 simulated physics tasks, including classic problems such as cartpole swing-up, dexterous manipulation, legged locomotion and car driving.
7 | Our algorithm is able to find policies whose performance is competitive with those found by a planning algorithm with full access to the dynamics of the domain and its derivatives.
8 | We further demonstrate that for many of the tasks the algorithm can learn policies end-to-end: directly from raw pixel inputs.
9 |
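A minimal single-transition PyTorch sketch of the update pattern this describes (tiny illustrative networks and hyper-parameters, not the paper's architecture): the critic regresses onto a bootstrapped target computed with slowly-updated target networks, and the actor follows the deterministic policy gradient by maximizing Q(s, mu(s)).

```python
import copy
import torch
import torch.nn as nn

obs_dim, act_dim, gamma, tau = 3, 1, 0.99, 0.005
actor = nn.Sequential(nn.Linear(obs_dim, 32), nn.ReLU(), nn.Linear(32, act_dim), nn.Tanh())
critic = nn.Sequential(nn.Linear(obs_dim + act_dim, 32), nn.ReLU(), nn.Linear(32, 1))
actor_tgt, critic_tgt = copy.deepcopy(actor), copy.deepcopy(critic)
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-4)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

# One fake replayed transition (batch of 1) just to exercise the two updates.
s = torch.randn(1, obs_dim); a = torch.rand(1, act_dim) * 2 - 1
r = torch.randn(1, 1); s2 = torch.randn(1, obs_dim); done = torch.zeros(1, 1)

# Critic update: TD target uses the *target* actor and critic.
with torch.no_grad():
    y = r + gamma * (1 - done) * critic_tgt(torch.cat([s2, actor_tgt(s2)], dim=-1))
critic_loss = ((critic(torch.cat([s, a], dim=-1)) - y) ** 2).mean()
critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

# Actor update: deterministic policy gradient = ascend Q(s, mu(s)).
actor_loss = -critic(torch.cat([s, actor(s)], dim=-1)).mean()
actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

# Polyak averaging of the target networks.
for tgt, src in [(actor_tgt, actor), (critic_tgt, critic)]:
    for p_tgt, p in zip(tgt.parameters(), src.parameters()):
        p_tgt.data.mul_(1 - tau).add_(tau * p.data)
```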
--------------------------------------------------------------------------------
/DDQN.md:
--------------------------------------------------------------------------------
1 | Deep Reinforcement Learning with Double Q-learning
2 |
3 | Hado van Hasselt, Arthur Guez, David Silver
4 |
5 | The popular Q-learning algorithm is known to overestimate
6 | action values under certain conditions. It was not previously
7 | known whether, in practice, such overestimations are common,
8 | whether they harm performance, and whether they can
9 | generally be prevented. In this paper, we answer all these
10 | questions affirmatively. In particular, we first show that the
11 | recent DQN algorithm, which combines Q-learning with a
12 | deep neural network, suffers from substantial overestimations
13 | in some games in the Atari 2600 domain. We then show that
14 | the idea behind the Double Q-learning algorithm, which was
15 | introduced in a tabular setting, can be generalized to work
16 | with large-scale function approximation. We propose a specific
17 | adaptation to the DQN algorithm and show that the resulting
18 | algorithm not only reduces the observed overestimations,
19 | as hypothesized, but that this also leads to much better
20 | performance on several games.
21 |
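A minimal PyTorch sketch contrasting the two bootstrap targets (illustrative tensors, not the paper's code): Double DQN selects the next action with the online network but evaluates it with the target network, which is what curbs the overestimation.

```python
import torch

def dqn_target(q_target_next, rewards, gamma, dones):
    return rewards + gamma * (1 - dones) * q_target_next.max(dim=1).values

def double_dqn_target(q_online_next, q_target_next, rewards, gamma, dones):
    best_actions = q_online_next.argmax(dim=1, keepdim=True)       # select with online net
    evaluated = q_target_next.gather(1, best_actions).squeeze(1)   # evaluate with target net
    return rewards + gamma * (1 - dones) * evaluated

batch, n_actions, gamma = 4, 6, 0.99
rewards, dones = torch.randn(batch), torch.zeros(batch)
q_online_next, q_target_next = torch.randn(batch, n_actions), torch.randn(batch, n_actions)
print(dqn_target(q_target_next, rewards, gamma, dones))
print(double_dqn_target(q_online_next, q_target_next, rewards, gamma, dones))
```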
--------------------------------------------------------------------------------
/DEBP.md:
--------------------------------------------------------------------------------
1 | # Reinforcement Learning with Deep Energy-Based Policies
2 |
3 | Tuomas Haarnoja, Haoran Tang, Pieter Abbeel, Sergey Levine
4 |
5 | ## Contribution
6 | A method for learning expressive energy-based policies for continuous states and actions, which has been feasible only in tabular domains before.
7 |
8 | ## Algorithm
9 | We apply our method to learning maximum entropy policies, resulting in a new algorithm, called soft Q-learning, that expresses the optimal policy via a Boltzmann distribution.
10 |
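For intuition, a minimal discrete-action sketch of the soft value and Boltzmann-like policy that soft Q-learning targets (the paper's contribution is handling continuous actions with a sampling network, which this toy example does not attempt):

```python
import numpy as np

def soft_value(q_values, alpha):
    # V(s) = alpha * log sum_a exp(Q(s, a) / alpha), computed stably.
    z = q_values / alpha
    return alpha * (np.max(z) + np.log(np.sum(np.exp(z - np.max(z)))))

def soft_policy(q_values, alpha):
    # pi(a|s) proportional to exp(Q(s, a) / alpha).
    z = q_values / alpha - np.max(q_values / alpha)
    p = np.exp(z)
    return p / p.sum()

q = np.array([1.0, 2.0, 0.5])
alpha = 0.5                        # temperature: higher -> closer to uniform
print(soft_value(q, alpha), soft_policy(q, alpha))
# Soft Bellman backup target for a transition (s, a, r, s'):
#   Q(s, a) <- r + gamma * V_soft(s')
```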
11 | ## Tricks
12 | We use the recently proposed amortized Stein variational gradient descent to learn a stochastic sampling network that approximates samples from this distribution.
13 |
14 | The benefits of the proposed algorithm include:
15 | * improved exploration
16 | * compositionality that allows transferring skills between tasks, which we confirm in simulated experiments with swimming and walking robots.
17 |
18 | We also draw a connection to actor-critic methods, which can be viewed as performing approximate inference on the corresponding energy-based model.
19 |
--------------------------------------------------------------------------------
/DPPO.md:
--------------------------------------------------------------------------------
1 | Distributed PPO
2 |
3 | Nicolas Heess, Dhruva TB, Srinivasan Sriram, Jay Lemmon, Josh Merel, Greg Wayne, Yuval Tassa, Tom Erez, Ziyu Wang, Ali Eslami, Martin Riedmiller, David Silver
4 | > DeepMind
5 |
6 | The reinforcement learning paradigm allows, in principle, for complex behaviours to be learned directly from simple reward signals. In practice, however, it is common to carefully hand-design the reward function to encourage a particular solution, or to derive it from demonstration data. In this paper we explore how a rich environment can help to promote the learning of complex behavior. Specifically, we train agents in diverse environmental contexts, and find that this encourages the emergence of robust behaviours that perform well across a suite of tasks. We demonstrate this principle for locomotion – behaviours that are known for their sensitivity to the choice of reward. We train several simulated bodies on a diverse set of challenging terrains and obstacles, using a simple reward function based on forward progress. Using a novel scalable variant of policy gradient reinforcement learning, our agents learn to run, jump, crouch and turn as required by the environment without explicit reward-based guidance. A visual depiction of highlights of the learned behavior can be viewed in this video.
7 |
--------------------------------------------------------------------------------
/DQN.md:
--------------------------------------------------------------------------------
1 | Playing Atari with Deep Reinforcement Learning
2 |
3 | Volodymyr Mnih Koray Kavukcuoglu David Silver Alex Graves Ioannis Antonoglou
4 | Daan Wierstra Martin Riedmiller
5 | DeepMind Technologies
6 |
7 | We present the first deep learning model to successfully learn control policies directly
8 | from high-dimensional sensory input using reinforcement learning. The
9 | model is a convolutional neural network, trained with a variant of Q-learning,
10 | whose input is raw pixels and whose output is a value function estimating future
11 | rewards. We apply our method to seven Atari 2600 games from the Arcade Learning
12 | Environment, with no adjustment of the architecture or learning algorithm. We
13 | find that it outperforms all previous approaches on six of the games and surpasses
14 | a human expert on three of them.
15 |
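A minimal PyTorch sketch of the core update (a tiny MLP stands in for the paper's convolutional network; shapes and hyper-parameters are illustrative): the Q-network is trained by regression onto the bootstrapped Q-learning target computed on a minibatch sampled from a replay buffer.

```python
import torch
import torch.nn as nn

n_obs, n_actions, gamma = 4, 3, 0.99
q_net = nn.Sequential(nn.Linear(n_obs, 64), nn.ReLU(), nn.Linear(64, n_actions))
optimizer = torch.optim.RMSprop(q_net.parameters(), lr=2.5e-4)

# A fake minibatch standing in for samples drawn from the replay buffer.
obs = torch.randn(32, n_obs); actions = torch.randint(n_actions, (32,))
rewards = torch.randn(32); next_obs = torch.randn(32, n_obs); dones = torch.zeros(32)

with torch.no_grad():                                   # bootstrap target, no gradient
    target = rewards + gamma * (1 - dones) * q_net(next_obs).max(dim=1).values
q_sa = q_net(obs).gather(1, actions.unsqueeze(1)).squeeze(1)
loss = nn.functional.smooth_l1_loss(q_sa, target)       # Huber-style loss, as commonly used
optimizer.zero_grad(); loss.backward(); optimizer.step()
```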
--------------------------------------------------------------------------------
/DQfD.md:
--------------------------------------------------------------------------------
1 | # Learning from Demonstrations for Real World Reinforcement Learning
2 |
3 | Todd Hester, Matej Vecerik, Olivier Pietquin, Marc Lanctot, Tom Schaul, Bilal Piot, Dan Horgan, John Quan, Andrew Sendonaris, Gabriel Dulac-Arnold, Ian Osband, John Agapiou, Joel Z. Leibo, Audrunas Gruslys
4 |
5 | DeepMind
6 |
7 | https://arxiv.org/pdf/1704.03732.pdf
8 |
9 | Deep reinforcement learning (RL) has achieved several high profile successes in
10 | difficult decision-making problems. However, these algorithms typically require
11 | a huge amount of data before they reach reasonable performance. In fact, their
12 | performance during learning can be extremely poor. This may be acceptable for
13 | a simulator, but it severely limits the applicability of deep RL to many real-world
14 | tasks, where the agent must learn in the real environment. In this paper we study a
15 | setting where the agent may access data from previous control of the system. We
16 | present an algorithm, Deep Q-learning from Demonstrations (DQfD), that leverages
17 | this data to massively accelerate the learning process even from relatively
18 | small amounts of demonstration data and is able to automatically assess the necessary
19 | ratio of demonstration data while learning thanks to a prioritized replay
20 | mechanism. DQfD works by combining temporal difference updates with supervised
21 | classification of the demonstrator’s actions. We show that DQfD has better
22 | initial performance than Prioritized Dueling Double Deep Q-Networks (PDD
23 | DQN) as it starts with better scores on the first million steps on 41 of 42 games
24 | and on average it takes PDD DQN 82 million steps to catch up to DQfD’s performance.
25 | DQfD learns to out-perform the best demonstration given in 14 of 42
26 | games. In addition, DQfD leverages human demonstrations to achieve state-of-the-art
27 | results for 17 games. Finally, we show that DQfD performs better than
28 | three related algorithms for incorporating demonstration data into DQN.
29 |
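One common form of that supervised term is a large-margin classification loss over the demonstrator's actions, J_E(Q) = max_a [Q(s, a) + l(a_E, a)] - Q(s, a_E), which pushes the demonstrated action's value above all alternatives by a margin. A minimal PyTorch sketch (margin value and shapes are illustrative, not the paper's code):

```python
import torch

def large_margin_loss(q_values, demo_actions, margin=0.8):
    margins = torch.full_like(q_values, margin)
    margins.scatter_(1, demo_actions.unsqueeze(1), 0.0)        # zero margin at a_E
    augmented_max = (q_values + margins).max(dim=1).values     # max_a [Q(s,a) + l(a_E,a)]
    q_demo = q_values.gather(1, demo_actions.unsqueeze(1)).squeeze(1)
    return (augmented_max - q_demo).mean()

q = torch.randn(16, 5, requires_grad=True)     # Q(s, .) for a demonstration minibatch
a_demo = torch.randint(5, (16,))               # demonstrator's actions
loss = large_margin_loss(q, a_demo)
loss.backward()
```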
--------------------------------------------------------------------------------
/Dip.md:
--------------------------------------------------------------------------------
1 | # No Press Diplomacy: Modeling Multi-Agent Gameplay
2 | > Philip Paquette,
3 | Yuchen Lu,
4 | Steven Bocco,
5 | Max O. Smith,
6 | Satya Ortiz-Gagné,
7 | Jonathan K. Kummerfeld,
8 | Satinder Singh,
9 | Joelle Pineau,
10 | Aaron Courville
11 |
12 | ## Abstract
13 | Diplomacy is a seven-player non-stochastic, non-cooperative game, where agents acquire resources through a mix of teamwork and betrayal.
14 |
15 | Reliance on trust and coordination makes Diplomacy the first non-cooperative multi-agent benchmark for complex sequential social dilemmas in a rich environment.
16 |
17 | In this work, we focus on training an agent that learns to play the No Press version of Diplomacy where there is no dedicated communication channel between players.
18 |
19 | We present DipNet, a neural-network-based policy model for No Press Diplomacy.
20 |
21 | The model was trained on a new dataset of more than 150,000 human games.
22 |
23 | Our model is trained by supervised learning (SL) from expert trajectories, which is then used to initialize a reinforcement learning (RL) agent trained through self-play.
24 |
25 | Both the SL and RL agents demonstrate state-of-the-art No Press performance by beating popular rule-based bots.
26 |
--------------------------------------------------------------------------------
/DirPG.md:
--------------------------------------------------------------------------------
1 | # Direct Policy Gradients: Direct Optimization of Policies in Discrete Action Spaces
2 | > Guy Lorberbom, Chris J. Maddison, Nicolas Heess, Tamir Hazan, Daniel Tarlow
3 |
4 | ## Abstract
5 | Direct optimization [24] is an appealing approach to differentiating through discrete quantities [35, 19].
6 |
7 | Rather than relying on REINFORCE or continuous relaxations of discrete structures, it uses optimization in discrete space to compute gradients through a discrete argmax operation.
8 |
9 | In this paper, we develop reinforcement learning algorithms that use direct optimization to compute gradients of the expected return in environments with discrete actions.
10 |
11 | We call the resulting algorithms direct policy gradient algorithms and investigate their properties, showing that there is a built-in variance reduction technique and that a parameter that was previously viewed as a numerical approximation can be interpreted as controlling risk sensitivity.
12 |
13 | We also tackle challenges in algorithm design, leveraging ideas from A* Sampling [21] to develop a practical algorithm.
14 |
17 | Empirically, we show that the algorithm performs well in illustrative domains, and that it can make use of domain knowledge about upper bounds on return-to-go to speed up training.
18 |
--------------------------------------------------------------------------------
/Disagreement.md:
--------------------------------------------------------------------------------
1 | # Self-Supervised Exploration via Disagreement
2 | > Deepak Pathak, Dhiraj Gandhi, Abhinav Gupta
3 |
4 | ## Abstract
5 | Efficient exploration is a long-standing problem in sensorimotor learning.
6 |
7 | Major advances have been demonstrated in noise-free, non-stochastic domains such as video games and simulation.
8 |
9 | However, most of these formulations either get stuck in environments with stochastic dynamics or are too inefficient to be scalable to real robotics setups.
10 |
11 | In this paper, we propose a formulation for exploration inspired by the work in active learning literature.
12 |
13 | Specifically, we train an ensemble of dynamics models and incentivize the agent to explore such that the disagreement of those ensembles is maximized.
14 |
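A minimal sketch of that intrinsic reward (random functions stand in for the learned dynamics models; all names are illustrative): the agent is rewarded with the variance of the ensemble's predictions for the current state-action pair.

```python
import numpy as np

rng = np.random.default_rng(0)
ensemble_size, feat_dim = 5, 8

def predict_next_features(model_seed, state_action):
    # Stand-in for the k-th learned forward dynamics model f_k(s, a) -> phi(s').
    w = np.random.default_rng(model_seed).normal(size=(state_action.shape[0], feat_dim))
    return state_action @ w

state_action = rng.normal(size=12)                       # concatenated (s, a) features
predictions = np.stack([predict_next_features(k, state_action)
                        for k in range(ensemble_size)])  # (ensemble, feat_dim)
intrinsic_reward = predictions.var(axis=0).mean()        # disagreement across the ensemble
print(intrinsic_reward)
```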
15 | This allows the agent to learn skills by exploring in a self-supervised manner without any external reward.
16 |
17 | Notably, we further leverage the disagreement objective to optimize the agent’s policy in a differentiable manner, without using reinforcement learning, which results in a sample-efficient exploration.
18 |
19 | We demonstrate the efficacy of this formulation across a variety of benchmark environments including stochastic-Atari, Mujoco and Unity.
20 |
21 | Finally, we implement our differentiable exploration on a real robot which learns to interact with objects completely from scratch.
22 |
23 | Project videos and code are at https://pathak22.github.io/exploration-by-disagreement/.
24 |
--------------------------------------------------------------------------------
/Distral.md:
--------------------------------------------------------------------------------
1 | Distral: Robust Multitask Reinforcement Learning
2 |
3 | Yee Whye Teh, Victor Bapst, Wojciech Marian Czarnecki, John Quan,
4 | James Kirkpatrick, Raia Hadsell, Nicolas Heess, Razvan Pascanu
5 | DeepMind, London, UK
6 |
7 | Most deep reinforcement learning algorithms are data inefficient in complex and
8 | rich environments, limiting their applicability to many scenarios. One direction
9 | for improving data efficiency is multitask learning with shared neural network
10 | parameters, where efficiency may be improved through transfer across related tasks.
11 | In practice, however, this is not usually observed, because gradients from different
12 | tasks can interfere negatively, making learning unstable and sometimes even less
13 | data efficient. Another issue is the different reward schemes between tasks, which
14 | can easily lead to one task dominating the learning of a shared model. We propose
15 | a new approach for joint training of multiple tasks, which we refer to as Distral
16 | (Distill & transfer learning). Instead of sharing parameters between the different
17 | workers, we propose to share a “distilled” policy that captures common behaviour
18 | across tasks. Each worker is trained to solve its own task while constrained to
19 | stay close to the shared policy, while the shared policy is trained by distillation
20 | to be the centroid of all task policies. Both aspects of the learning process are
21 | derived by optimizing a joint objective function. We show that our approach
22 | supports efficient transfer on complex 3D environments, outperforming several
23 | related methods. Moreover, the proposed learning process is more robust and more
24 | stable—attributes that are critical in deep reinforcement learning.
25 |
--------------------------------------------------------------------------------
/DualMDP.md:
--------------------------------------------------------------------------------
1 | Learning to Design Games: Strategic Environments in Deep Reinforcement Learning
2 | Haifeng Zhang, Jun Wang, Zhiming Zhou, Weinan Zhang, Ying Wen, Yong Yu, Wenxin Li
3 |
4 | In typical reinforcement learning (RL), the environment is assumed given and the
5 | goal of the learning is to identify an optimal policy for the agent taking actions
6 | through its interactions with the environment. In this paper, we extend this setting
7 | by considering the environment is not given, but controllable and learnable
8 | through its interaction with the agent at the same time. Theoretically, we find a dual
9 | Markov decision process (MDP) w.r.t. the environment to that w.r.t. the agent, and
10 | solving the dual MDP-policy pair yields a policy gradient solution to optimizing
11 | the parametrized environment. Furthermore, environments with non-differentiable
12 | parameters are addressed by a proposed general generative framework. Experiments
13 | on a Maze generation task show the effectiveness of generating diverse and
14 | challenging Mazes against agents with various settings.
15 |
--------------------------------------------------------------------------------
/Dueling.md:
--------------------------------------------------------------------------------
1 | Dueling Network Architectures for Deep Reinforcement Learning
2 |
3 | Ziyu Wang, Tom Schaul, Matteo Hessel, Hado van Hasselt, Marc Lanctot, Nando de Freitas
4 | DeepMind
5 |
6 | In recent years there have been many successes of using deep representations in reinforcement learning. Still, many of these applications use conventional architectures, such as convolutional networks, LSTMs, or auto-encoders. In this paper, we present a new neural network architecture for model-free reinforcement learning. Our dueling network represents two separate estimators: one for the state value function and one for the state-dependent action advantage function. The main benefit of this factoring is to generalize learning across actions without imposing any change to the underlying reinforcement learning algorithm. Our results show that this architecture leads to better policy evaluation in the presence of many similar-valued actions. Moreover, the dueling architecture enables our RL agent to outperform the state-of-the-art on the Atari 2600 domain.
7 |
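A minimal PyTorch sketch of the dueling head and its commonly used mean-subtracted aggregation, Q(s, a) = V(s) + A(s, a) - mean_a' A(s, a') (illustrative sizes, not the paper's architecture):

```python
import torch
import torch.nn as nn

class DuelingHead(nn.Module):
    def __init__(self, feat_dim, n_actions):
        super().__init__()
        self.value = nn.Linear(feat_dim, 1)               # state-value stream
        self.advantage = nn.Linear(feat_dim, n_actions)   # action-advantage stream

    def forward(self, features):
        v = self.value(features)                          # (batch, 1)
        a = self.advantage(features)                      # (batch, n_actions)
        return v + a - a.mean(dim=1, keepdim=True)        # resolves V/A identifiability

head = DuelingHead(feat_dim=32, n_actions=6)
q_values = head(torch.randn(4, 32))
print(q_values.shape)   # torch.Size([4, 6])
```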
--------------------------------------------------------------------------------
/E2.md:
--------------------------------------------------------------------------------
1 | # Some Considerations on Learning to Explore via Meta-Reinforcement Learning
2 | > Bradly C. Stadie, Ge Yang, Rein Houthooft, Xi Chen, Yan Duan, Yuhuai Wu, Pieter Abbeel, Ilya Sutskever
3 |
4 | ## Abstract
5 | We interpret meta-reinforcement learning as the problem of learning how to quickly find a good sampling distribution in a new environment.
6 |
7 | This interpretation leads to the development of two new meta-reinforcement learning algorithms: E-MAML and E-RL2.
8 |
9 | Results are presented on a new environment we call ‘Krazy World’: a difficult high-dimensional gridworld which is designed to highlight the importance of correctly differentiating through sampling distributions in meta-reinforcement learning.
10 |
11 | Further results are presented on a set of maze environments.
12 |
13 | We show E-MAML and E-RL2 deliver better performance than baseline algorithms on both tasks.
14 |
--------------------------------------------------------------------------------
/ECMAC.md:
--------------------------------------------------------------------------------
1 | Emergent Complexity via Multi-Agent Competition
2 |
3 |
--------------------------------------------------------------------------------
/EDDICT.md:
--------------------------------------------------------------------------------
1 | # Entropic Desired Dynamics for Intrinsic Control
2 | > Steven Hansen, Guillaume Desjardins, Kate Baumli, David Warde-Farley, Nicolas Heess, Simon Osindero, Volodymyr Mnih
3 |
4 | **Abstract**
5 | An agent might be said, informally, to have mastery of its environment when it has maximised the effective number of states it can reliably reach. In practice, this often means maximizing the number of latent codes that can be discriminated from future states under some short time horizon (e.g. [15]). By situating these latent codes in a globally consistent coordinate system, we show that agents can reliably reach more states in the long term while still optimizing a local objective. A simple instantiation of this idea, Entropic Desired Dynamics for Intrinsic ConTrol (EDDICT), assumes fixed additive latent dynamics, which results in tractable learning and an interpretable latent space. Compared to prior methods, EDDICT’s globally consistent codes allow it to be far more exploratory, as demonstrated by improved state coverage and increased unsupervised performance on hard exploration games such as Montezuma’s Revenge.
--------------------------------------------------------------------------------
/EPG.md:
--------------------------------------------------------------------------------
1 | # Expected Policy Gradients for Reinforcement Learning
2 |
3 | Kamil Ciosek kamil.ciosek@cs.ox.ac.uk
4 | Shimon Whiteson shimon.whiteson@cs.ox.ac.uk
5 | Department of Computer Science, University of Oxford
6 |
7 | We propose expected policy gradients (EPG), which unify stochastic policy gradients (SPG) and deterministic policy gradients (DPG) for reinforcement learning. Inspired by expected
8 | sarsa, EPG integrates (or sums) across actions when estimating the gradient, instead of relying only on the action in the sampled trajectory. For continuous action spaces, we first
9 | derive a practical result for Gaussian policies and quadric critics and then extend it to an analytical method for the universal case, covering a broad class of actors and critics,
10 | including Gaussian, exponential families, and reparameterised policies with bounded support.
11 |
12 |
13 | For Gaussian policies, we show that it is optimal to explore using covariance proportional to $e^H$, where $H$ is the scaled Hessian of the critic with respect to the actions. EPG also
14 | provides a general framework for reasoning about policy gradient methods, which we use to establish a new general policy gradient theorem, of which the stochastic and deterministic
15 | policy gradient theorems are special cases. Furthermore, we prove that EPG reduces the variance of the gradient estimates without requiring deterministic policies and with little
16 | computational overhead. Finally, we show that EPG outperforms existing approaches on six challenging domains involving the simulated control of physical systems.
17 |
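A minimal sketch of the exploration rule quoted above, with a toy matrix standing in for the critic's Hessian with respect to the action and an assumed scale factor: directions in which the critic curves upward are explored more aggressively.

```python
import numpy as np
from scipy.linalg import expm

action_dim, scale = 2, 0.5
hessian = np.array([[1.0, 0.2],        # stand-in for d^2 Q / da^2 at the current action
                    [0.2, -0.5]])
covariance = expm(scale * hessian)      # symmetric positive definite by construction
mean_action = np.zeros(action_dim)      # e.g. the policy's mean action
exploratory_action = np.random.default_rng(0).multivariate_normal(mean_action, covariance)
print(covariance, exploratory_action)
```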
--------------------------------------------------------------------------------
/EX2.md:
--------------------------------------------------------------------------------
1 | https://arxiv.org/pdf/1703.01260.pdf
2 |
3 | Efficient exploration in high-dimensional environments remains a key challenge in reinforcement
4 | learning (RL).
5 |
6 | Deep reinforcement learning methods have demonstrated the ability to
7 | learn with highly general policy classes for complex
8 | tasks with high-dimensional inputs, such as raw images. However, many of the most
9 | effective exploration techniques rely on tabular representations, or on the ability to construct a
10 | generative model over states and actions.
11 |
12 | Both are exceptionally difficult when these inputs are
13 | complex and high dimensional. On the other hand, it is comparatively easy to build discriminative
14 | models on top of complex states such as images using standard deep neural networks.
15 |
16 | This paper introduces a novel approach, EX2, which approximates state visitation densities by
17 | training an ensemble of discriminators, and assigns reward bonuses to rarely visited states.
18 |
19 | We demonstrate that EX2 achieves comparable performance to the state-of-the-art methods on low-dimensional
20 | tasks, and its effectiveness scales into high-dimensional state spaces such as visual domains without hand-designing features or density
21 | models.
22 |
--------------------------------------------------------------------------------
/GANAC.md:
--------------------------------------------------------------------------------
1 | # Connecting Generative Adversarial Networks and Actor-Critic Methods
2 | David Pfau, Oriol Vinyals
3 |
4 | Both generative adversarial networks (GAN) in unsupervised learning and actor-critic methods in reinforcement learning (RL) have gained a reputation for being
5 | difficult to optimize.
6 |
7 | Practitioners in both fields have amassed a large number of strategies to mitigate these instabilities and improve training. Here we show that
8 | GANs can be viewed as actor-critic methods in an environment where the actor cannot affect the reward.
9 |
10 | We review the strategies for stabilizing training for each class of models, both those that generalize between the two and those that are particular to that model. We also review a number of extensions to GANs and RL
11 | algorithms with even more complicated information flow.
12 |
13 | We hope that by highlighting this formal connection we will encourage both GAN and RL communities to develop general, scalable, and stable algorithms for multilevel optimization with deep networks, and to draw inspiration across communities.
14 |
--------------------------------------------------------------------------------
/GANQL.md:
--------------------------------------------------------------------------------
1 | # GAN Q-learning
2 | > Thang Doan, Bogdan Mazoure, Clare Lyle
3 | > McGill University
4 |
5 | ## Abstract
6 | Distributional reinforcement learning (distributional RL) has seen empirical success in complex Markov Decision Processes (MDPs) in the setting of nonlinear function approximation.
7 |
8 | However there are many different ways in which one can leverage the distributional approach to reinforcement learning.
9 |
10 | In this paper, we propose GAN Q-learning, a novel distributional RL method based on generative adversarial networks (GANs) and analyze its performance in simple tabular environments, as well as OpenAI Gym.
11 |
12 | We empirically show that our algorithm leverages the flexibility and black-box approach of deep learning models while providing a viable alternative to traditional methods.
13 |
--------------------------------------------------------------------------------
/GNLBE.md:
--------------------------------------------------------------------------------
1 | # General non-linear Bellman equations
2 | > Hado van Hasselt, John Quan, Matteo Hessel, Zhongwen Xu, Diana Borsa, Andre Barreto
3 |
4 | We consider a general class of non-linear Bellman equations. These open up a design space of algorithms that have interesting properties, which has two potential advantages.
5 |
6 | First, we can perhaps better model natural phenomena. For instance, hyperbolic discounting has been proposed as a mathematical model that matches human and animal data well, and can therefore be used to explain preference orderings.
7 | We present a different mathematical model that matches the same data, but that makes very different predictions under other circumstances.
8 |
9 | Second, the larger design space can perhaps lead to algorithms that perform better, similar to how discount factors are often used in practice even when the true objective is undiscounted. We show that many of the resulting Bellman operators still converge to a fixed point, and therefore that the resulting algorithms are reasonable and inherit many beneficial properties of their linear counterparts
10 |
--------------------------------------------------------------------------------
/GTD.md:
--------------------------------------------------------------------------------
1 | # Nonlinear Distributional Gradient Temporal-Difference Learning
2 | > Chao Qu, Shie Mannor, and Huan Xu
3 |
4 | ## Abstract
5 | We devise a distributional variant of gradient temporal-difference (TD) learning.
6 |
7 | Distributional reinforcement learning has been demonstrated to outperform the regular one in the recent study [Bellemare et al., 2017a].
8 |
9 | In our paper, we design two new algorithms called distributional GTD2 and distributional TDC using the Cramér distance on the distributional version of the Bellman error objective function, which inherit advantages of both the nonlinear gradient TD algorithms and the distributional RL approach.
10 |
11 | We prove the asymptotic almost-sure convergence to a local optimal solution for general smooth function approximators, which include neural networks that have been widely used in recent studies to solve real-life RL problems.
12 |
13 | In each step, the computational complexity is linear w.r.t. the number of the parameters of the function approximator, thus can be implemented efficiently for neural networks.
14 |
--------------------------------------------------------------------------------
/GVF.md:
--------------------------------------------------------------------------------
1 | # Discovery of Useful Questions as Auxiliary Tasks
2 | > Vivek Veeriah, Matteo Hessel, Zhongwen Xu, Richard Lewis, Janarthanan Rajendran, Junhyuk Oh, Hado van Hasselt, David Silver, Satinder Singh
3 |
4 | ## Abstract
5 | Arguably, intelligent agents ought to be able to discover their own questions so that in learning answers for them they learn unanticipated useful knowledge and skills; this departs from the focus in much of machine learning on agents learning answers to externally defined questions.
6 |
7 | We present a novel method for a reinforcement learning (RL) agent to discover questions formulated as general value functions or GVFs, a fairly rich form of knowledge representation.
8 |
9 | Specifically, our method uses non-myopic meta-gradients to learn GVF-questions such that learning answers to them, as an auxiliary task, induces useful representations for the main task faced by the RL agent.
10 |
11 | We demonstrate that auxiliary tasks based on the discovered GVFs are sufficient, on their own, to build representations that support main task learning, and that they do so better than popular hand-designed auxiliary tasks from the literature. Furthermore, we show, in the context of Atari 2600 videogames, how such auxiliary tasks, meta-learned alongside the main task, can improve the data efficiency of an actor-critic agent.
12 |
--------------------------------------------------------------------------------
/GVG.md:
--------------------------------------------------------------------------------
1 | Robust Imitation of Diverse Behaviors
2 |
3 | Ziyu Wang∗, Josh Merel∗, Scott Reed, Greg Wayne, Nando de Freitas, Nicolas Heess
4 | > DeepMind
5 | Deep generative models have recently shown great promise in imitation learning for motor control. Given enough data, even supervised approaches can do one-shot imitation learning; however, they are vulnerable to cascading failures when the agent trajectory diverges from the demonstrations. Compared to purely supervised methods, Generative Adversarial Imitation Learning (GAIL) can learn more robust controllers from fewer demonstrations, but is inherently mode-seeking and more difficult to train. In this paper, we show how to combine the favourable aspects of these two approaches. The base of our model is a new type of variational autoencoder on demonstration trajectories that learns semantic policy embeddings. We show that these embeddings can be learned on a 9 DoF Jaco robot arm in reaching tasks, and then smoothly interpolated with a resulting smooth interpolation of reaching behavior. Leveraging these policy representations, we develop a new version of GAIL that (1) is much more robust than the purely-supervised controller, especially with few demonstrations, and (2) avoids mode collapse, capturing many diverse behaviors when GAIL on its own does not. We demonstrate our approach on learning diverse gaits from demonstration on a 2D biped and a 62 DoF 3D humanoid in the MuJoCo physics environment.
6 |
--------------------------------------------------------------------------------
/GenerativeBelief.md:
--------------------------------------------------------------------------------
1 | # Shaping Belief States with Generative Environment Models for RL
2 | > Karol Gregor, Danilo Jimenez Rezende, Frederic Besse, Yan Wu, Hamza Merzic, Aäron van den Oord
3 |
4 | ## Abstract
5 | When agents interact with a complex environment, they must form and maintain beliefs about the relevant aspects of that environment.
6 |
7 | We propose a way to efficiently train expressive generative models in complex environments.
8 |
9 | We show that a predictive algorithm with an expressive generative model can form stable belief-states in visually rich and dynamic 3D environments.
10 |
11 | More precisely, we show that the learned representation captures the layout of the environment as well as the position and orientation of the agent.
12 |
13 | Our experiments show that the model substantially improves data-efficiency on a number of reinforcement learning (RL) tasks compared with strong model-free baseline agents.
14 |
15 | We find that predicting multiple steps into the future (overshooting), in combination with an expressive generative model, is critical for stable representations to emerge.
16 |
17 | In practice, using expressive generative models in RL is computationally expensive and we propose a scheme to reduce this computational burden, allowing us to build agents that are competitive with model-free baselines.
18 |
--------------------------------------------------------------------------------
/Geoff-PAC.md:
--------------------------------------------------------------------------------
1 | # Generalized Off-Policy Actor-Critic
2 | > Shangtong Zhang, Wendelin Boehmer, Shimon Whiteson
3 |
4 | [Download from arxiv](https://arxiv.org/pdf/1903.11329.pdf)
5 |
6 | ## Abstract
7 | We propose a new objective, the counterfactual objective, unifying existing objectives for off-policy policy gradient algorithms in the continuing reinforcement learning (RL) setting.
8 |
9 | Compared to the commonly used excursion objective, which can be misleading about the performance of the target policy when deployed, our new objective better predicts such performance.
10 |
11 | We prove the Generalized Off-Policy Policy Gradient Theorem to compute the policy gradient of the counterfactual objective and use an emphatic approach to get an unbiased sample from this policy gradient, yielding the Generalized Off-Policy Actor-Critic (Geoff-PAC) algorithm.
12 |
13 | We demonstrate the merits of Geoff-PAC over existing algorithms in Mujoco robot simulation tasks, the first empirical success of emphatic algorithms in prevailing deep RL benchmarks.
14 |
--------------------------------------------------------------------------------
/HAL.md:
--------------------------------------------------------------------------------
1 | # Language as an Abstraction for Hierarchical Deep Reinforcement Learning
2 | > Yiding Jiang, Shixiang Gu, Kevin Murphy, Chelsea Finn
3 |
4 | ## Abstract
5 | Solving complex, temporally-extended tasks is a long-standing problem in reinforcement learning (RL).
6 |
7 | We hypothesize that one critical element of solving such problems is the notion of compositionality.
8 |
9 | With the ability to learn concepts and sub-skills that can be composed to solve longer tasks, i.e. hierarchical RL, we can acquire temporally-extended behaviors.
10 |
11 | However, acquiring effective yet general abstractions for hierarchical RL is remarkably challenging.
12 |
13 | In this paper, we propose to use language as the abstraction, as it provides unique compositional structure, enabling fast learning and combinatorial generalization, while retaining tremendous flexibility, making it suitable for a variety of problems.
14 |
15 | Our approach learns an instruction-following low-level policy and a high-level policy that can reuse abstractions across tasks, in essence, permitting agents to reason using structured language.
16 |
17 | To study compositional task learning, we introduce an open-source object interaction environment built using the MuJoCo physics engine and the CLEVR engine.
18 |
19 | We find that, using our approach, agents can learn to solve diverse, temporally-extended tasks such as object sorting and multi-object rearrangement, including from raw pixel observations.
20 |
21 | Our analysis finds that the compositional nature of language is critical for learning diverse sub-skills and systematically generalizing to new sub-skills in comparison to non-compositional abstractions that use the same supervision.
22 |
--------------------------------------------------------------------------------
/HILP.md:
--------------------------------------------------------------------------------
1 | # Foundation Policies with Hilbert Representations
2 | > Seohong Park UC Berkeley
3 | > Tobias Kreiman UC Berkeley
4 | > Sergey Levine UC Berkeley
5 |
6 | ## Abstract
7 | Unsupervised and self-supervised objectives, such as next token prediction, have enabled pre-training generalist models from large amounts of unlabeled data.
8 |
9 | In reinforcement learning (RL), however, finding a **truly** general and scalable unsupervised pre-training objective for **generalist policies** from offline data remains a major open question.
10 |
11 | While a number of methods have been proposed to enable generic self-supervised RL, based on principles such as goal-conditioned RL, behavioral cloning, and unsupervised skill learning, such methods remain limited in terms of either the diversity of the discovered behaviors, the need for high-quality demonstration data, or the lack of a clear prompting or adaptation mechanism for downstream tasks.
12 |
13 | In this work, we propose a novel unsupervised framework to pre-train **generalist** policies that capture **diverse, optimal, long-horizon behaviors** from unlabeled offline data such that they can be quickly adapted to any arbitrary new tasks in a zero-shot manner.
14 |
15 | Our key insight is to learn a **structured representation** that preserves the **temporal** structure of the underlying environment, and then to span this learned latent space with directional movements, which enables various zero-shot policy "prompting" schemes for downstream tasks.
16 |
17 | Through our experiments on simulated robotic locomotion and manipulation benchmarks, we show that our unsupervised policies can solve goal-conditioned and general RL tasks in a zero-shot fashion, even often outperforming prior methods designed specifically for each setting.
18 |
19 | Our code and videos are available at this [website](https://seohong.me/projects/hilp/).
20 |
21 | The code is available at [github.com/seohongpark/HILP](https://github.com/seohongpark/HILP).
22 |
--------------------------------------------------------------------------------
/HIRL.md:
--------------------------------------------------------------------------------
1 | # Trial without Error: Towards Safe Reinforcement Learning via Human Intervention
2 |
3 | William Saunders
4 | Girish Sastry
5 | Andreas Stuhlmüller
6 | Owain Evans
7 |
8 | AI systems are increasingly applied to complex tasks that involve interaction
9 | with humans. During training, such systems are potentially dangerous, as they
10 | haven’t yet learned to avoid actions that could cause serious harm. How can an AI
11 | system explore and learn without making a single mistake that harms humans or
12 | otherwise causes serious damage? For model-free reinforcement learning, having a
13 | human “in the loop” and ready to intervene is currently the only way to prevent all
14 | catastrophes. We formalize human intervention for RL and show how to reduce
15 | the human labor required by training a supervised learner to imitate the human’s
16 | intervention decisions. We evaluate this scheme on Atari games, with a Deep RL
17 | agent being overseen by a human for four hours. When the class of catastrophes
18 | is simple, we are able to prevent all catastrophes without affecting the agent’s
19 | learning (whereas an RL baseline fails due to catastrophic forgetting). However,
20 | this scheme is less successful when catastrophes are more complex: it reduces
21 | but does not eliminate catastrophes and the supervised learner fails on adversarial
22 | examples found by the agent. Extrapolating to more challenging environments, we
23 | show that our implementation would not scale (due to the infeasible amount of
24 | human labor required). We outline extensions of the scheme that are necessary if
25 | we are to train model-free agents without a single catastrophe.
26 |
--------------------------------------------------------------------------------
/HIRO.md:
--------------------------------------------------------------------------------
1 | # WHY DOES HIERARCHY (SOMETIMES) WORK SO WELL IN REINFORCEMENT LEARNING?
2 | > Ofir Nachum, Haoran Tang, Xingyu Lu, Shixiang Gu, Honglak Lee, Sergey Levine
3 |
4 | ## ABSTRACT
5 | Hierarchical reinforcement learning has demonstrated significant success at solving difficult reinforcement learning (RL) tasks.
6 |
7 | Previous works have motivated the use of hierarchy by appealing to a number of intuitive benefits, including learning over temporally extended transitions, exploring over temporally extended periods, and training and exploring in a more semantically meaningful action space, among others.
8 |
9 | However, in fully observed, Markovian settings, it is not immediately clear why hierarchical RL should provide benefits over standard “shallow” RL architectures.
10 |
11 | In this work, we isolate and evaluate the claimed benefits of hierarchical RL on a suite of tasks encompassing locomotion, navigation, and manipulation.
12 |
13 | Surprisingly, we find that most of the observed benefits of hierarchy can be attributed to improved exploration, as opposed to easier policy learning or imposed hierarchical structures.
14 |
15 | Given this insight, we present exploration techniques inspired by hierarchy that achieve performance competitive with hierarchical RL while at the same time being much simpler to use and implement.
16 |
--------------------------------------------------------------------------------
/I2As.md:
--------------------------------------------------------------------------------
1 | Imagination-Augmented Agents for Deep Reinforcement Learning
2 |
3 | Théophane Weber∗ Sébastien Racanière∗ David P. Reichert∗ Lars Buesing
4 | Arthur Guez Danilo Rezende Adria Puigdomènech Badia Oriol Vinyals
5 | Nicolas Heess Yujia Li Razvan Pascanu Peter Battaglia
6 | David Silver Daan Wierstra
7 |
8 | We introduce Imagination-Augmented Agents (I2As), a novel architecture for deep
9 | reinforcement learning combining model-free and model-based aspects. In contrast
10 | to most existing model-based reinforcement learning and planning methods,
11 | which prescribe how a model should be used to arrive at a policy, I2As learn to
12 | interpret predictions from a learned environment model to construct implicit plans
13 | in arbitrary ways, by using the predictions as additional context in deep policy
14 | networks. I2As show improved data efficiency, performance, and robustness to
15 | model misspecification compared to several baselines.
16 |
--------------------------------------------------------------------------------
/IBP.md:
--------------------------------------------------------------------------------
1 | Learning model-based planning from scratch
2 |
3 | Razvan Pascanu, Yujia Li, Oriol Vinyals, Nicolas Heess, Lars Buesing, Sebastien Racanière, David Reichert, Théophane Weber, Daan Wierstra, Peter Battaglia
4 |
5 | Conventional wisdom holds that model-based planning is a powerful approach
6 | to sequential decision-making. It is often very challenging in practice, however,
7 | because while a model can be used to evaluate a plan, it does not prescribe how
8 | to construct a plan. Here we introduce the “Imagination-based Planner”, the
9 | first model-based, sequential decision-making agent that can learn to construct,
10 | evaluate, and execute plans. Before any action, it can perform a variable number
11 | of imagination steps, which involve proposing an imagined action and evaluating
12 | it with its model-based imagination. All imagined actions and outcomes are
13 | aggregated, iteratively, into a “plan context” which conditions future real and
14 | imagined actions. The agent can even decide how to imagine: testing out alternative
15 | imagined actions, chaining sequences of actions together, or building a more
16 | complex “imagination tree” by navigating flexibly among the previously imagined
17 | states using a learned policy. And our agent can learn to plan economically, jointly
18 | optimizing for external rewards and computational costs associated with using
19 | its imagination. We show that our architecture can learn to solve a challenging
20 | continuous control problem, and also learn elaborate planning strategies in a
21 | discrete maze-solving task. Our work opens a new direction toward learning the
22 | components of a model-based planning system and how to use them.
23 |
--------------------------------------------------------------------------------
/IPG.md:
--------------------------------------------------------------------------------
1 | Interpolated Policy Gradient: Merging On-Policy and
2 | Off-Policy Gradient Estimation for Deep
3 | Reinforcement Learning
4 |
5 | Shixiang Gu, Timothy Lillicrap, Zoubin Ghahramani, Richard E. Turner, Bernhard Schölkopf, Sergey Levine
6 | > from Cambridge, MPI, DeepMind, UberAI, Berkeley
7 |
8 | ## Background
9 | * Off-policy model-free deep reinforcement learning methods using previously collected
10 | data can improve sample efficiency over on-policy policy gradient techniques.
11 | * On the other hand, on-policy algorithms are often more stable and easier to use.
12 |
13 | ## Goal
14 | merging on- and off-policy updates for deep reinforcement learning.
15 |
16 | ## Theoretical results
17 | * show that off-policy updates with a value function estimator can be interpolated
18 | with on-policy policy gradient updates whilst still satisfying performance bounds.
19 |
20 | Tool used: control variate methods to produce a family of policy gradient
21 | algorithms, with several recently proposed algorithms being special cases of this
22 | family.
23 |
24 | ## Empirical comparisons
25 | Compare these techniques with the
26 | remaining algorithmic details fixed, and show how different mixings of off-policy
27 | gradient estimates with on-policy samples contribute to improvements in empirical
28 | performance.
29 |
30 | ## IPG
31 | The final algorithm provides a generalization and unification of existing deep policy gradient techniques,
32 | * theoretical guarantees on the bias introduced by off-policy updates
33 | * improves on the state-of-the-art model-free deep RL methods on a number of OpenAI Gym continuous control benchmarks.
34 |
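35 | ## Interpolated update (hedged sketch)
36 | As a rough paraphrase in my own notation (not the paper's exact statement), the interpolated gradient with mixing coefficient $\nu \in [0, 1]$ has the form
37 |
38 | $\nabla_\theta J \approx (1-\nu)\, \mathbb{E}_{\pi}\big[\nabla_\theta \log \pi_\theta(a|s)\, \hat{A}(s,a)\big] + \nu\, \mathbb{E}_{\rho}\big[\nabla_\theta\, \mathbb{E}_{a \sim \pi_\theta} \hat{Q}_w(s,a)\big],$
39 |
40 | recovering the on-policy likelihood-ratio gradient at $\nu = 0$ and a purely critic-based off-policy update at $\nu = 1$.
41 |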
--------------------------------------------------------------------------------
/IQN.md:
--------------------------------------------------------------------------------
1 | # Implicit Quantile Networks for Distributional Reinforcement Learning
2 |
3 | > Will Dabney, Georg Ostrovski, David Silver, Remi Munos
4 |
5 | ## Abstract
6 | In this work, we build on recent advances in distributional reinforcement learning to give a generally applicable, flexible, and state-of-the-art distributional variant of DQN.
7 |
8 | We achieve this by using quantile regression to approximate the full quantile function for the state-action return distribution.
9 |
10 | By reparameterizing a distribution over the sample space, this yields an implicitly defined return distribution and gives rise to a large class of risk-sensitive policies.
11 |
12 | We demonstrate improved performance on the 57 Atari 2600 games in the ALE, and use our algorithm’s implicitly defined distributions to study the effects of risk-sensitive policies in Atari games.
13 |
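14 | The training signal behind this is the quantile regression Huber loss applied between sampled quantile estimates and Bellman targets. A minimal sketch of that loss for a single pair (a simplified stand-in, not the paper's network or its cosine quantile embedding):
15 |
16 | ```python
17 | import numpy as np
18 |
19 | def quantile_huber_loss(pred, target, tau, kappa=1.0):
20 |     """Quantile-regression Huber loss: |tau - 1{delta < 0}| * L_kappa(delta) / kappa."""
21 |     delta = target - pred
22 |     huber = np.where(np.abs(delta) <= kappa,
23 |                      0.5 * delta ** 2,
24 |                      kappa * (np.abs(delta) - 0.5 * kappa))
25 |     return float(np.abs(tau - (delta < 0)) * huber / kappa)
26 |
27 | # a prediction at the 0.9 quantile that undershoots its target is penalised heavily
28 | print(quantile_huber_loss(pred=1.0, target=2.5, tau=0.9))  # 0.9
29 | ```
30 |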
--------------------------------------------------------------------------------
/ISMCI.md:
--------------------------------------------------------------------------------
1 | # INTRINSIC SOCIAL MOTIVATION VIA CAUSAL INFLUENCE IN MULTI-AGENT RL
2 | > Natasha Jaques, Angeliki Lazaridou, Edward Hughes, Caglar Gulcehre, Pedro A. Ortega, DJ Strouse, Joel Z. Leibo, Nando de Freitas
3 |
4 | ## ABSTRACT
5 | We derive a new intrinsic social motivation for multi-agent reinforcement learning (MARL), in which agents are rewarded for having causal influence over another agent’s actions.
6 |
7 | Causal influence is assessed using counterfactual reasoning. The reward does not depend on observing another agent’s reward function, and is thus a more realistic approach to MARL than that taken in previous work.
8 |
9 | We show that the causal influence reward is related to maximizing the mutual information between agents’ actions.
10 |
11 | We test the approach in challenging social dilemma environments, where it consistently leads to enhanced cooperation between agents and higher collective reward.
12 |
13 | Moreover, we find that rewarding influence can lead agents to develop emergent communication protocols.
14 |
15 | We therefore employ influence to train agents to use an explicit communication channel, and find that it leads to more effective communication and higher collective reward.
16 |
17 | Finally, we show that influence can be computed by equipping each agent with an internal model that predicts the actions of other agents.
18 |
19 | This allows the social influence reward to be computed without the use of a centralised controller, and as such represents a significantly more general and scalable inductive bias for MARL with independent agents.
20 |
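21 | Concretely, agent A's influence on agent B in a given state can be measured by comparing B's action distribution conditioned on A's actually-taken action with B's marginal action distribution, obtained by averaging over counterfactual actions of A. A hedged numerical sketch of that quantity (toy distributions, not the paper's agent architecture):
22 |
23 | ```python
24 | import numpy as np
25 |
26 | def influence(p_b_given_a, p_a, a_taken):
27 |     """KL divergence between B's policy conditioned on A's taken action
28 |     and B's marginal policy under A's counterfactual action distribution."""
29 |     marginal = p_a @ p_b_given_a          # sum_a p(a) * p(b | a)
30 |     cond = p_b_given_a[a_taken]
31 |     return float(np.sum(cond * np.log(cond / marginal)))
32 |
33 | p_b_given_a = np.array([[0.9, 0.1],       # B's policy if A takes action 0
34 |                         [0.2, 0.8]])      # B's policy if A takes action 1
35 | p_a = np.array([0.5, 0.5])                # counterfactual distribution over A's actions
36 | print(influence(p_b_given_a, p_a, a_taken=0))  # ~0.29 nats of influence
37 | ```
38 |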
--------------------------------------------------------------------------------
/Intervaltime.md:
--------------------------------------------------------------------------------
1 | # Interval timing in deep reinforcement learning agents
2 | > Ben Deverett,
3 | Ryan Faulkner,
4 | Meire Fortunato,
5 | Greg Wayne,
6 | Joel Z. Leibo
7 |
8 | ## Abstract
9 | The measurement of time is central to intelligent behavior.
10 |
11 | We know that both animals and artificial agents can successfully use temporal dependencies to select actions.
12 |
13 | In artificial agents, little work has directly addressed (1) which architectural components are necessary for successful development of this ability, (2) how this timing ability comes to be represented in the units and actions of the agent, and (3) whether the resulting behavior of the system converges on solutions similar to those of biology.
14 |
15 | Here we studied interval timing abilities in deep reinforcement learning agents trained end-to-end on an interval reproduction paradigm inspired by experimental literature on mechanisms of timing.
16 |
17 | We characterize the strategies developed by recurrent and feedforward agents, which both succeed at temporal reproduction using distinct mechanisms, some of which bear specific and intriguing similarities to biological systems.
18 |
19 | These findings advance our understanding of how agents come to represent time, and they highlight the value of experimentally inspired approaches to characterizing agent abilities.
20 |
--------------------------------------------------------------------------------
/KL-RegulaRL.md:
--------------------------------------------------------------------------------
1 | https://openreview.net/pdf?id=S1lqMn05Ym
2 | INFORMATION ASYMMETRY IN KL-REGULARIZED RL
3 | Alexandre Galashov, Siddhant M. Jayakumar, Leonard Hasenclever, Dhruva Tirumala,
4 | Jonathan Schwarz, Guillaume Desjardins, Wojciech M. Czarnecki, Yee Whye Teh,
5 | Razvan Pascanu, Nicolas Heess
6 | DeepMind
7 | London, UK
8 | {agalashov,sidmj,leonardh,dhruvat,schwarzjn,gdesjardins,
9 | lejlot,ywteh,razp,heess}@google.com
10 | ABSTRACT
11 | Many real world tasks exhibit rich structure that is repeated across different parts
12 | of the state space or in time. In this work we study the possibility of leveraging
13 | such repeated structure to speed up and regularize learning. We start from the KL
14 | regularized expected reward objective which introduces an additional component,
15 | a default policy. Instead of relying on a fixed default policy, we learn it from data.
16 | But crucially, we restrict the amount of information the default policy receives,
17 | forcing it to learn reusable behaviours that help the policy learn faster. We formalize
18 | this strategy and discuss connections to information bottleneck approaches and
19 | to the variational EM algorithm. We present empirical results in both discrete
20 | and continuous action domains and demonstrate that, for certain tasks, learning a
21 | default policy alongside the policy can significantly speed up and improve learning.
22 |
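23 | As a hedged paraphrase of the objective (my notation), the agent maximises a per-step KL-regularised return of the form
24 |
25 | $\mathbb{E}_{\pi}\Big[\sum_t \gamma^t \big(r(s_t, a_t) - \alpha\, \mathrm{KL}\big[\pi(\cdot \mid x_t)\,\|\,\pi_0(\cdot \mid x_t^{D})\big]\big)\Big],$
26 |
27 | where the default policy $\pi_0$ is learned jointly but conditioned only on a restricted view $x_t^{D}$ of the agent's observation $x_t$; this information asymmetry is what forces $\pi_0$ to capture reusable behaviour.
28 |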
--------------------------------------------------------------------------------
/LEARN.md:
--------------------------------------------------------------------------------
1 | # Using Natural Language for Reward Shaping in Reinforcement Learning
2 |
3 | > Prasoon Goyal , Scott Niekum , Raymond J. Mooney
4 |
5 | ## Abstract
6 | Recent reinforcement learning (RL) approaches have shown strong performance in complex domains such as Atari games, but are often highly sample inefficient.
7 |
8 | A common approach to reduce interaction time with the environment is to use reward shaping, which involves carefully designing reward functions that provide the agent intermediate rewards for progress towards the goal.
9 |
10 | However, designing appropriate shaping rewards is known to be difficult as well as time-consuming.
11 |
12 | In this work, we address this problem by using natural language instructions to perform reward shaping.
13 |
14 | We propose the LanguagE-Action Reward Network (LEARN), a framework that maps free-form natural language instructions to intermediate rewards based on actions taken by the agent.
15 |
16 | These intermediate language-based rewards can seamlessly be integrated into any standard reinforcement learning algorithm.
17 |
18 | We experiment with Montezuma’s Revenge from the Arcade Learning Environment, a popular benchmark in RL.
19 |
20 | Our experiments on a diverse set of 15 tasks demonstrate that, for the same number of interactions with the environment, language-based rewards lead to successful completion of the task 60% more often on average, compared to learning without language.
21 |
--------------------------------------------------------------------------------
/LFOD.md:
--------------------------------------------------------------------------------
1 | A Laplacian Framework for Option Discovery in Reinforcement Learning
2 |
3 | https://arxiv.org/pdf/1703.00956.pdf
4 |
5 | Marlos C. Machado, Marc G. Bellemare, Michael Bowling
6 |
7 | Representation learning and option discovery are
8 | two of the biggest challenges in reinforcement
9 | learning (RL). Proto-value functions (PVFs) are
10 | a well-known approach for representation learning
11 | in MDPs. In this paper we address the option
12 | discovery problem by showing how PVFs
13 | implicitly define options. We do it by introducing
14 | eigenpurposes, intrinsic reward functions derived
15 | from the learned representations. The options
16 | discovered from eigenpurposes traverse the
17 | principal directions of the state space. They are
18 | useful for multiple tasks because they are discovered
19 | without taking the environment’s rewards
20 | into consideration. Moreover, different options
21 | act at different time scales, making them helpful
22 | for exploration. We demonstrate features of
23 | eigenpurposes in traditional tabular domains as
24 | well as in Atari 2600 games.
25 |
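26 | As a hedged paraphrase, an eigenpurpose is an intrinsic reward built from an eigenvector $e$ of the graph Laplacian (a proto-value function) over state features $\phi$: roughly $r^{e}(s, s') = e^{\top}\big(\phi(s') - \phi(s)\big)$, so the option that maximises it travels along that principal direction of the state space.
27 |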
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | MIT License
2 |
3 | Copyright (c) 2017 Xiaohu Zhu
4 |
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 |
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 |
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 |
--------------------------------------------------------------------------------
/LOLA.md:
--------------------------------------------------------------------------------
1 | Learning with Opponent-Learning Awareness
2 |
3 | Jakob N. Foerster2,†
4 | jakob.foerster@cs.ox.ac.uk
5 | Richard Y. Chen1,†
6 | richardchen@openai.com
7 | Maruan Al-Shedivat4
8 | alshedivat@cs.cmu.edu
9 | Shimon Whiteson2
10 | shimon.whiteson@cs.ox.ac.uk
11 | Pieter Abbeel1,3
12 | pieter@openai.com
13 | Igor Mordatch1
14 | mordatch@openai.com
15 |
16 | https://arxiv.org/pdf/1709.04326.pdf
17 |
18 | Multi-agent settings are quickly gathering importance in machine
19 | learning. Beyond a plethora of recent work on deep
20 | multi-agent reinforcement learning, hierarchical reinforcement
21 | learning, generative adversarial networks and decentralized
22 | optimization can all be seen as instances of this setting.
23 | However, the presence of multiple learning agents in
24 | these settings renders the training problem non-stationary
25 | and often leads to unstable training or undesired final results.
26 | We present Learning with Opponent-Learning Awareness
27 | (LOLA), a method that reasons about the anticipated
28 | learning of the other agents. The LOLA learning rule includes
29 | an additional term that accounts for the impact of
30 | the agent’s policy on the anticipated parameter update of the
31 | other agents. We show that the LOLA update rule can be ef-
32 | ficiently calculated using an extension of the likelihood ratio
33 | policy gradient update, making the method suitable for
34 | model-free reinforcement learning. This method thus scales
35 | to large parameter and input spaces and nonlinear function
36 | approximators. Preliminary results show that the encounter
37 | of two LOLA agents leads to the emergence of tit-for-tat
38 | and therefore cooperation in the infinitely iterated prisoners’
39 | dilemma, while independent learning does not. In this
40 | domain, LOLA also receives higher payouts compared to a
41 | naive learner, and is robust against exploitation by higher order
42 | gradient-based methods. Applied to infinitely repeated
43 | matching pennies, only LOLA agents converge to the Nash
44 | equilibrium. We also apply LOLA to a grid world task with
45 | an embedded social dilemma using deep recurrent policies.
46 | Again, by considering the learning of the other agent, LOLA
47 | agents learn to cooperate out of selfish interests.
48 |
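49 | As a hedged paraphrase of the core rule (my notation): instead of ascending its own value $V^1(\theta^1, \theta^2)$ at the current opponent parameters, a LOLA agent ascends the value it would obtain after the opponent's anticipated naive learning step,
50 |
51 | $\theta^1 \leftarrow \theta^1 + \eta\, \nabla_{\theta^1} V^1\big(\theta^1,\ \theta^2 + \Delta\theta^2\big), \qquad \Delta\theta^2 = \eta\, \nabla_{\theta^2} V^2(\theta^1, \theta^2),$
52 |
53 | with a first-order Taylor expansion of this look-ahead giving the extra correction term that can be estimated with likelihood-ratio policy gradients.
54 |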
--------------------------------------------------------------------------------
/LQR+GAIfO.md:
--------------------------------------------------------------------------------
1 | # Sample-efficient Adversarial Imitation Learning from Observation
2 | Faraz Torabi, Sean Geiger, Garrett Warnell, Peter Stone
3 |
4 | ## Abstract
5 | Imitation from observation is the framework of learning tasks by observing demonstrated state-only trajectories.
6 |
7 | Recently, adversarial approaches have achieved significant performance improvements over other methods for imitating complex behaviors.
8 |
9 | However, these adversarial imitation algorithms often require many demonstration examples and learning iterations to produce a policy that is successful at imitating a demonstrator’s behavior.
10 |
11 | This high sample complexity often prohibits these algorithms from being deployed on physical robots.
12 |
13 | In this paper, we propose an algorithm that addresses the sample inefficiency problem by utilizing ideas from trajectory centric reinforcement learning algorithms.
14 |
15 | We test our algorithm in experiments on an imitation task with a physical robot arm and its simulated version in Gazebo, and show the improvement in learning rate and efficiency.
16 |
--------------------------------------------------------------------------------
/LipschitzQ.md:
--------------------------------------------------------------------------------
1 | # Stochastic Lipschitz Q-Learning
2 | > Xu Zhu and David Dunson
3 |
4 | ## Abstract
5 | In an episodic Markov Decision Process (MDP) problem, an online algorithm chooses from a set of actions in a sequence of H trials, where H is the episode length, in order to maximize the total payoff of the chosen actions.
6 |
7 | Q-learning, as the most popular model-free reinforcement learning (RL) algorithm, directly parameterizes and updates value functions without explicitly modeling the environment.
8 |
9 | Recently, [12] studies the sample complexity of Q-learning with finite states and actions.
10 |
11 | Their algorithm achieves nearly optimal regret, which shows that Q-learning can be made sample efficient.
12 |
13 | However, in MDPs with large discrete state and action spaces [21] or continuous spaces [19], such methods cannot learn efficiently in this way.
14 |
15 | Hence, it is critical to develop new algorithms to solve this dilemma with provable guarantee on the sample complexity.
16 |
17 | With this motivation, we propose a novel algorithm that works for MDPs with a more general setting, which has infinitely many states and actions and assumes that the payoff function and transition kernel are Lipschitz continuous.
18 |
19 | We also provide corresponding theory justification for our algorithm.
20 |
21 | It achieves the regret $\tilde{\mathcal{O}}(K^{\frac{d+1}{d+2}}\sqrt{H^3})$, where K denotes the number of episodes and d denotes the dimension of the joint space.
22 |
23 | To the best of our knowledge, this is the first analysis in the model-free setting whose established regret matches the lower bound up to a logarithmic factor.
24 |
--------------------------------------------------------------------------------
/MADDPG.md:
--------------------------------------------------------------------------------
1 | ## Multi-Agent Actor-Critic for Mixed Cooperative-Competitive Environments
2 |
3 | Ryan Lowe∗, Yi Wu∗, Aviv Tamar, Jean Harb, Pieter Abbeel, Igor Mordatch
4 |
5 | > McGill University, UC Berkeley, OpenAI
6 |
7 | deep RL in MAS setting:
8 |
9 | ### Problems:
10 |
11 | * Q-learning is challenged by an inherent non-stationarity of the environment,
12 | * Policy gradient suffers from a variance that increases as the number of agents grows.
13 |
14 | Actor-critic methods that consider the action policies of other agents and are able to successfully learn policies that require complex multi-agent
15 | coordination.
16 |
17 | Training regimen utilizing an **ensemble** of policies for each agent that leads to more robust multi-agent policies.
18 |
19 | ### Experimentation
20 | We show the strength of our approach compared to existing methods in
21 | * cooperative
22 | * competitive scenarios,
23 |
24 | agent populations are able to discover various physical and informational coordination strategies.
25 |
--------------------------------------------------------------------------------
/MBDQN.md:
--------------------------------------------------------------------------------
1 | # Model-Based Stabilisation of Deep Reinforcement Learning
2 | > Felix Leibfried, Rasul Tutunov, Peter Vrancx, Haitham Bou-Ammar
3 | PROWLER.io, Cambridge (UK)
4 |
5 | ## Abstract
6 | Though successful in high-dimensional domains, deep reinforcement learning exhibits high sample complexity and suffers from stability issues as reported by researchers and practitioners in the field.
7 |
8 | These problems hinder the application of such algorithms in real-world and safety-critical scenarios.
9 |
10 | In this paper, we take steps towards stable and efficient reinforcement learning by following a model-based approach that is known to reduce agent-environment interactions.
11 |
12 | Namely, our method augments deep Q-networks (DQNs) with model predictions for transitions, rewards, and termination flags.
13 |
14 | Having the model at hand, we then conduct a rigorous theoretical study of our algorithm and show, for the first time, convergence to a stationary point. En route, we provide a counterexample showing that ‘vanilla’ DQNs can diverge, confirming practitioners’ and researchers’ experiences. Our proof is novel in its own right and can be extended to other forms of deep reinforcement learning. In particular, we believe exploiting the relation between reinforcement learning (with deep function approximators) and online learning can serve as a recipe for future proofs in the domain. Finally, we validate our theoretical results in 20 games from the Atari benchmark. Our results show that following the proposed model-based learning approach not only ensures convergence but also leads to a reduction in sample complexity and superior performance.
15 |
--------------------------------------------------------------------------------
/MBIE-EB.md:
--------------------------------------------------------------------------------
1 | # Approximate Exploration through State Abstraction
2 | > Adrien Ali Taïga, Aaron Courville, Marc G. Bellemare
3 |
4 | [Download from arxiv](https://arxiv.org/pdf/1808.09819.pdf)
5 |
6 | ## Abstract
7 | Although exploration in reinforcement learning is well understood from a theoretical point of view, provably correct methods remain impractical.
8 |
9 | In this paper we study the interplay between exploration and approximation, what we call approximate exploration. Our main goal is to further our theoretical understanding of pseudo-count based exploration bonuses (Bellemare et al., 2016), a practical exploration scheme based on density modelling.
10 |
11 | As a warm-up, we quantify the performance of an exploration algorithm, MBIE-EB (Strehl and Littman, 2008), when explicitly combined with state aggregation. This allows us to confirm that, as might be expected, approximation allows the agent to trade off between learning speed and quality of the learned policy.
12 |
13 | Next, we show how a given density model can be related to an abstraction and that the corresponding pseudocount bonus can act as a substitute in MBIE-EB combined with this abstraction, but may lead to either under- or over-exploration.
14 |
15 | Then, we show that a given density model also defines an implicit abstraction, and find a surprising mismatch between pseudo-counts derived either implicitly or explicitly. Finally we derive a new pseudo-count bonus alleviating this issue.
16 |
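17 | For reference, MBIE-EB's exploration scheme augments the reward with a count-based bonus of the form $\beta / \sqrt{N(s, a)}$, and pseudo-counts stand in for $N(s, a)$ when states cannot be enumerated. A minimal tabular sketch (the constant $\beta$ is illustrative):
18 |
19 | ```python
20 | from collections import defaultdict
21 | from math import sqrt
22 |
23 | class CountBonus:
24 |     """MBIE-EB-style exploration bonus: beta / sqrt(N(s, a))."""
25 |     def __init__(self, beta=0.05):
26 |         self.beta = beta
27 |         self.counts = defaultdict(int)
28 |
29 |     def __call__(self, state, action):
30 |         self.counts[(state, action)] += 1
31 |         return self.beta / sqrt(self.counts[(state, action)])
32 |
33 | bonus = CountBonus()
34 | print(bonus("s0", "left"))  # 0.05 on the first visit
35 | print(bonus("s0", "left"))  # ~0.035 on the second visit
36 | ```
37 |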
--------------------------------------------------------------------------------
/MCAI.md:
--------------------------------------------------------------------------------
1 | Learning human behaviors from motion capture by adversarial imitation
2 |
3 | Josh Merel, Yuval Tassa, Dhruva TB, Sriram Srinivasan, Jay Lemmon, Ziyu Wang,
4 | Greg Wayne, Nicolas Heess
5 |
6 | > DeepMind
7 |
8 | Rapid progress in deep reinforcement learning has made it increasingly feasible
9 | to train controllers for high-dimensional humanoid bodies. However, methods
10 | that use pure reinforcement learning with simple reward functions tend to produce
11 | non-humanlike and overly stereotyped movement behaviors. In this work,
12 | we extend generative adversarial imitation learning to enable training of generic
13 | neural network policies to produce humanlike movement patterns from limited
14 | demonstrations consisting only of partially observed state features, without access
15 | to actions, even when the demonstrations come from a body with different and
16 | unknown physical parameters. We leverage this approach to build sub-skill policies
17 | from motion capture data and show that they can be reused to solve tasks when
18 | controlled by a higher level controller. [video abstract]
19 |
--------------------------------------------------------------------------------
/MCGE.md:
--------------------------------------------------------------------------------
1 | # Monte Carlo Gradient Estimation in Machine Learning
2 | > Shakir Mohamed,Mihaela Rosca, Michael Figurnov, Andriy Mnih
3 |
4 | ## Abstract
5 | This paper is a broad and accessible survey of the methods we have at our disposal for Monte Carlo gradient estimation in machine learning and across the statistical sciences: the problem of computing the gradient of an expectation of a function with respect to parameters defining the distribution that is integrated; the problem of sensitivity analysis.
6 |
7 | In machine learning research, this gradient problem lies at the core of many learning problems, in supervised, unsupervised and reinforcement learning.
8 |
9 | We will generally seek to rewrite such gradients in a form that allows for Monte Carlo estimation, allowing them to be easily and efficiently used and analysed.
10 |
11 | We explore three strategies—the pathwise, score function, and measure-valued gradient estimators—exploring their historical developments, derivation, and underlying assumptions.
12 |
13 | We describe their use in other fields, show how they are related and can be combined, and expand on their possible generalisations.
14 |
15 | Wherever Monte Carlo gradient estimators have been derived and deployed in the past, important advances have followed.
16 |
17 | A deeper and more widely-held understanding of this problem will lead to further advances, and it is these advances that we wish to support.
18 |
19 |
20 | Keywords: gradient estimation, Monte Carlo, sensitivity analysis, score-function estimator, pathwise estimator, measure-valued estimator, variance reduction
21 |
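22 | The two most common strategies are easy to contrast on a toy problem: estimate $\nabla_\mu\, \mathbb{E}_{x \sim \mathcal{N}(\mu, 1)}[x^2]$, whose true value is $2\mu$. A minimal sketch of the score-function (REINFORCE) and pathwise (reparameterisation) estimators, with illustrative sample sizes:
23 |
24 | ```python
25 | import numpy as np
26 |
27 | rng = np.random.default_rng(0)
28 | mu, n = 1.5, 200_000
29 |
30 | # Score-function estimator: E[f(x) * d/dmu log N(x; mu, 1)] = E[x^2 * (x - mu)]
31 | x = rng.normal(mu, 1.0, n)
32 | score_estimate = np.mean(x ** 2 * (x - mu))
33 |
34 | # Pathwise estimator: reparameterise x = mu + eps, differentiate inside the expectation
35 | eps = rng.standard_normal(n)
36 | pathwise_estimate = np.mean(2.0 * (mu + eps))
37 |
38 | print(score_estimate, pathwise_estimate, 2 * mu)  # both approach 3.0
39 | ```
40 |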
--------------------------------------------------------------------------------
/MERL.md:
--------------------------------------------------------------------------------
1 | # Evolutionary Reinforcement Learning for Sample-Efficient Multiagent Coordination
2 | > Shauharda Khadka, Somdeb Majumdar, Kagan Tumer
3 |
4 | ## Abstract
5 | A key challenge for Multiagent RL (Reinforcement Learning) is the design of agent-specific, local rewards that are aligned with sparse global objectives.
6 |
7 | In this paper, we introduce MERL (Multiagent Evolutionary RL), a hybrid algorithm that does not require an explicit alignment between local and global objectives.
8 |
9 | MERL uses fast, policy-gradient based learning for each agent by utilizing their dense local rewards.
10 |
11 | Concurrently, an evolutionary algorithm is used to recruit agents into a team by directly optimizing the sparser global objective.
12 |
13 | We explore problems that require coupling (a minimum number of agents required to coordinate for success), where the degree of coupling is not known to the agents.
14 |
15 | We demonstrate that MERL’s integrated approach is more sample-efficient and retains performance better with increasing coupling orders compared to MADDPG, the state-of-the-art policy-gradient algorithm for multiagent coordination.
16 |
--------------------------------------------------------------------------------
/MGRL.md:
--------------------------------------------------------------------------------
1 | # Meta-Gradient Reinforcement Learning
2 | > Zhongwen Xu, Hado van Hasselt, David Silver
3 |
4 | ## Abstract
5 | The goal of reinforcement learning algorithms is to estimate and/or optimise the value function.
6 |
7 | However, unlike supervised learning, no teacher or oracle is available to provide the true value function.
8 |
9 | Instead, the majority of reinforcement learning algorithms estimate and/or optimise a proxy for the value function.
10 |
11 | This proxy is typically based on a sampled and bootstrapped approximation to the true value function, known as a return.
12 |
13 | The particular choice of return is one of the chief components determining the nature of the algorithm: the rate at which future rewards are discounted; when and how values should be bootstrapped; or even the nature of the rewards themselves.
14 |
15 | It is well-known that these decisions are crucial to the overall success of RL algorithms.
16 |
17 | We discuss a gradient-based meta-learning algorithm that is able to adapt the nature of the return, online, whilst interacting and learning from the environment.
18 |
19 | When applied to 57 games on the Atari 2600 environment over 200 million frames, our algorithm achieved a new state-of-the-art performance.
20 |
--------------------------------------------------------------------------------
/MMRB.md:
--------------------------------------------------------------------------------
1 | Minimax Regret Bounds for Reinforcement Learning
2 |
3 | Mohammad Gheshlaghi Azar, Ian Osband, Rémi Munos
4 |
5 |
6 |
7 |
--------------------------------------------------------------------------------
/MPO.md:
--------------------------------------------------------------------------------
1 | # Robust Reinforcement Learning for Continuous Control with Model Misspecification
2 | > Daniel J. Mankowitz, Nir Levine, Rae Jeong, Abbas Abdolmaleki, Jost Tobias Springenberg, Timothy Mann, Todd Hester, Martin Riedmiller
3 |
4 | ## Abstract
5 | We provide a framework for incorporating robustness – to perturbations in the transition dynamics which we refer to as model misspecification – into continuous control Reinforcement Learning (RL) algorithms.
6 |
7 | We specifically focus on incorporating robustness into a state-of-the-art continuous control RL algorithm called Maximum a-posteriori Policy Optimization (MPO).
8 |
9 | We achieve this by learning a policy that optimizes for a worst case, entropy-regularized, expected return objective and derive a corresponding robust entropy-regularized Bellman contraction operator.
10 |
11 | In addition, we introduce a less conservative, soft-robust, entropy-regularized objective with a corresponding Bellman operator.
12 |
13 | We show that both robust and soft-robust policies outperform their non-robust counterparts in nine Mujoco domains with environment perturbations.
14 |
15 | Finally, we present multiple investigative experiments that provide a deeper insight into the robustness framework; including an adaptation to another continuous control RL algorithm as well as comparing this approach to domain randomization.
16 |
17 | Performance videos can be found online at https://sites.google.com/view/robust-rl.
18 |
--------------------------------------------------------------------------------
/MRL.md:
--------------------------------------------------------------------------------
1 | # Malthusian Reinforcement Learning
2 | > Joel Z. Leibo,
3 | Julien Perolat,
4 | Edward Hughes,
5 | Steven Wheelwright,
6 | Adam H. Marblestone,
7 | Edgar Duéñez-Guzmán,
8 | Peter Sunehag,
9 | Iain Dunning,
10 | Thore Graepel
11 |
12 | ## ABSTRACT
13 | Here we explore a new algorithmic framework for multi-agent reinforcement learning, called Malthusian reinforcement learning, which extends self-play to include fitness-linked population size dynamics that drive ongoing innovation.
14 |
15 | In Malthusian RL, increases in a subpopulation’s average return drive subsequent increases in its size, just as Thomas Malthus argued in 1798 was the relationship between preindustrial income levels and population growth [24].
16 |
17 | Malthusian reinforcement learning harnesses the competitive pressures arising from growing and shrinking population size to drive agents to explore regions of state and policy spaces that they could not otherwise reach.
18 |
19 | Furthermore, in environments where there are potential gains from specialization and division of labor, we show that Malthusian reinforcement learning is better positioned to take advantage of such synergies than algorithms based on self-play.
20 |
--------------------------------------------------------------------------------
/MSRL.md:
--------------------------------------------------------------------------------
1 | # Multi-step Reinforcement Learning: A Unifying Algorithm
2 |
3 |
4 | Kristopher De Asis, J. Fernando Hernandez-Garcia, G. Zacharias Holland, Richard S. Sutton
5 |
6 | Unifying seemingly disparate algorithmic ideas
7 | to produce better performing algorithms has been
8 | a longstanding goal in reinforcement learning.
9 | As a primary example, TD(λ) elegantly unifies
10 | one-step TD prediction with Monte Carlo methods
11 | through the use of eligibility traces and the
12 | trace-decay parameter λ. Currently, there are a
13 | multitude of algorithms that can be used to perform
14 | TD control, including Sarsa, Q-learning,
15 | and Expected Sarsa. These methods are often
16 | studied in the one-step case, but they can be extended
17 | across multiple time steps to achieve better
18 | performance. Each of these algorithms is
19 | seemingly distinct, and no one dominates the others
20 | for all problems. In this paper, we study
21 | a new multi-step action-value algorithm called
22 | Q(σ) which unifies and generalizes these existing
23 | algorithms, while subsuming them as special
24 | cases. A new parameter, σ, is introduced to allow
25 | the degree of sampling performed by the algorithm
26 | at each step during its backup to be continuously
27 | varied, with Sarsa existing at one extreme
28 | (full sampling), and Expected Sarsa existing
29 | at the other (pure expectation). Q(σ) is generally
30 | applicable to both on- and off-policy learning,
31 | but in this work we focus on experiments in
32 | the on-policy case. Our results show that an intermediate
33 | value of σ, which results in a mixture
34 | of the existing algorithms, performs better than
35 | either extreme. The mixture can also be varied
36 | dynamically which can result in even greater performance.
37 |
38 |
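39 | A minimal sketch of the one-step Q(sigma) backup target described above (tabular, illustrative names); sigma = 1 recovers Sarsa and sigma = 0 recovers Expected Sarsa:
40 |
41 | ```python
42 | import numpy as np
43 |
44 | def q_sigma_target(r, gamma, q_next, pi_next, a_next, sigma):
45 |     """r + gamma * [sigma * Q(s', a') + (1 - sigma) * sum_a pi(a|s') Q(s', a)]."""
46 |     sample = q_next[a_next]                # Sarsa component (full sampling)
47 |     expectation = np.dot(pi_next, q_next)  # Expected Sarsa component (pure expectation)
48 |     return r + gamma * (sigma * sample + (1.0 - sigma) * expectation)
49 |
50 | q_next = np.array([1.0, 2.0, 0.5])   # Q(s', .)
51 | pi_next = np.array([0.2, 0.5, 0.3])  # pi(. | s')
52 | print(q_sigma_target(r=0.0, gamma=0.9, q_next=q_next,
53 |                      pi_next=pi_next, a_next=1, sigma=0.5))  # 1.5075
54 | ```
55 |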
--------------------------------------------------------------------------------
/MetaSS.md:
--------------------------------------------------------------------------------
1 | # Meta-learning of Sequential Strategies
2 | > Pedro A. Ortega, Jane X. Wang, Mark Rowland, Tim Genewein, Zeb Kurth-Nelson, Razvan Pascanu,
3 | Nicolas Heess, Joel Veness, Alex Pritzel, Pablo Sprechmann, Siddhant M. Jayakumar, Tom McGrath, Kevin
4 | Miller, Mohammad Azar, Ian Osband, Neil Rabinowitz, András György, Silvia Chiappa, Simon Osindero,
5 | Yee Whye Teh, Hado van Hasselt, Nando de Freitas, Matthew Botvinick, and Shane Legg
6 |
7 |
8 | ## Abstract
9 |
10 | In this report we review memory-based metalearning as a tool for building sample-efficient strategies that learn from past experience to adapt to any task within a target class.
11 |
12 | Our goal is to equip the reader with the conceptual foundations of this tool for building new, scalable agents that operate on broad domains.
13 |
14 | To do so, we present basic algorithmic templates for building near-optimal predictors and reinforcement learners which behave as if they had a probabilistic model that allowed them to efficiently exploit task structure.
15 |
16 | Furthermore, we recast memory-based meta-learning within a Bayesian framework, showing that the meta-learned strategies are near-optimal because they amortize Bayes-filtered data, where the adaptation is implemented in the memory dynamics as a state-machine of sufficient statistics.
17 |
18 | Essentially, memory-based meta-learning translates the hard problem of probabilistic sequential inference into a regression problem.
19 |
20 | ### keywords:
21 | meta-learning, generality, sample-efficiency, memory, sufficient statistics, Bayesian statistics, Thompson sampling, Bayes-optimality.
22 |
--------------------------------------------------------------------------------
/NDM.md:
--------------------------------------------------------------------------------
1 | Count-Based Exploration with Neural Density Models
2 | Georg Ostrovski, Marc G. Bellemare, Aaron van den Oord, Remi Munos
3 |
4 | Bellemare et al. (2016) introduced the notion of
5 | a pseudo-count, derived from a density model,
6 | to generalize count-based exploration to non
7 | tabular reinforcement learning. This pseudocount
8 | was used to generate an exploration bonus
9 | for a DQN agent and combined with a mixed
10 | Monte Carlo update was sufficient to achieve
11 | state of the art on the Atari 2600 game Montezuma’s
12 | Revenge. We consider two questions
13 | left open by their work: First, how important is
14 | the quality of the density model for exploration?
15 | Second, what role does the Monte Carlo update
16 | play in exploration? We answer the first question
17 | by demonstrating the use of PixelCNN, an advanced
18 | neural density model for images, to supply
19 | a pseudo-count. In particular, we examine the
20 | intrinsic difficulties in adapting Bellemare et al.’s
21 | approach when assumptions about the model are
22 | violated. The result is a more practical and general
23 | algorithm requiring no special apparatus. We
24 | combine PixelCNN pseudo-counts with different
25 | agent architectures to dramatically improve the
26 | state of the art on several hard Atari games. One
27 | surprising finding is that the mixed Monte Carlo
28 | update is a powerful facilitator of exploration in
29 | the sparsest of settings, including Montezuma’s
30 | Revenge.
31 |
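32 | For reference, the pseudo-count of Bellemare et al. (2016) that the density model supplies is derived from the probability the model assigns to a state before and after observing it once. A minimal hedged sketch of the pseudo-count and the resulting bonus (constants are illustrative):
33 |
34 | ```python
35 | from math import sqrt
36 |
37 | def pseudo_count(rho, rho_prime):
38 |     """rho = model density of x before seeing it; rho_prime = density after
39 |     one update on x (the 'recoding probability')."""
40 |     return rho * (1.0 - rho_prime) / (rho_prime - rho)
41 |
42 | def exploration_bonus(rho, rho_prime, beta=0.05):
43 |     return beta / sqrt(pseudo_count(rho, rho_prime) + 0.01)
44 |
45 | # the model's probability of x rises slightly after training on x once
46 | print(pseudo_count(0.010, 0.011))       # ~9.9 effective "visits"
47 | print(exploration_bonus(0.010, 0.011))  # small bonus for a familiar state
48 | ```
49 |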
--------------------------------------------------------------------------------
/NEC.md:
--------------------------------------------------------------------------------
1 | # Neural Episodic Control
2 | Alexander Pritzel, Benigno Uria, Sriram Srinivasan, Adria Puigdomènech Badia, Oriol Vinyals, Demis Hassabis, Daan Wierstra, Charles Blundell
3 |
4 | Deep reinforcement learning methods attain
5 | super-human performance in a wide range of environments.
6 | Such methods are grossly inefficient,
7 | often taking orders of magnitudes more data than
8 | humans to achieve reasonable performance. We
9 | propose Neural Episodic Control: a deep reinforcement
10 | learning agent that is able to rapidly
11 | assimilate new experiences and act upon them.
12 | Our agent uses a semi-tabular representation of
13 | the value function: a buffer of past experience containing
14 | slowly changing state representations and
15 | rapidly updated estimates of the value function.
16 | We show across a wide range of environments
17 | that our agent learns significantly faster than other
18 | state-of-the-art, general purpose deep reinforcement
19 | learning agents.
20 |
--------------------------------------------------------------------------------
/NashDQN.md:
--------------------------------------------------------------------------------
1 | # Deep Q-Learning for Nash Equilibria: Nash-DQN
2 | > Philippe Casgrain, Brian Ning, and Sebastian Jaimungal
3 |
4 | ## Abstract
5 | Model-free learning for multi-agent stochastic games is an active area of research.
6 |
7 | Existing reinforcement learning algorithms, however, are often restricted to zero-sum games, and are applicable only in small state-action spaces or other simplified settings.
8 |
9 | Here, we develop a new data efficient Deep-Q-learning methodology for model-free learning of Nash equilibria for general-sum stochastic games.
10 |
11 | The algorithm uses a local linear-quadratic expansion of the stochastic game, which leads to analytically solvable optimal actions.
12 |
13 | The expansion is parametrized by deep neural networks to give it sufficient flexibility to learn the environment without the need to experience all state-action pairs.
14 |
15 | We study symmetry properties of the algorithm stemming from label-invariant stochastic games and, as a proof of concept, apply our algorithm to learning optimal trading strategies in competitive electronic markets.
16 |
--------------------------------------------------------------------------------
/NoisyNet.md:
--------------------------------------------------------------------------------
1 | Noisy Networks for Exploration
2 |
3 | Meire Fortunato, Mohammad Gheshlaghi Azar, Bilal Piot,
4 | Jacob Menick, Ian Osband, Alex Graves, Vlad Mnih,
5 | Remi Munos, Demis Hassabis, Olivier Pietquin, Charles Blundell,
6 | Shane Legg
7 |
8 | https://arxiv.org/pdf/1706.10295.pdf
9 |
10 | We introduce NoisyNet, a deep reinforcement learning agent with parametric noise
11 | added to its weights, and show that the induced stochasticity of the agent’s policy
12 | can be used to aid efficient exploration. The parameters of the noise are learned
13 | with gradient descent along with the remaining network weights. NoisyNet is
14 | straightforward to implement and adds little computational overhead. We find that
15 | replacing the conventional exploration heuristics for A3C, DQN and dueling agents
16 | (entropy reward and $\epsilon$-greedy respectively) with NoisyNet yields substantially
17 | higher scores for a wide range of Atari games, in some cases advancing the agent
18 | from sub to super-human performance.
19 |
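20 | A noisy linear layer replaces each weight with $\mu + \sigma \odot \varepsilon$, where $\mu$ and $\sigma$ are learned and $\varepsilon$ is freshly sampled noise, so exploration comes from the perturbed parameters themselves. A minimal numpy forward-pass sketch with independent Gaussian noise (the paper also uses a cheaper factorised variant; the initial constants here are illustrative):
21 |
22 | ```python
23 | import numpy as np
24 |
25 | class NoisyLinear:
26 |     """y = (mu_w + sigma_w * eps_w) @ x + (mu_b + sigma_b * eps_b)."""
27 |     def __init__(self, n_in, n_out, sigma0=0.017, seed=0):
28 |         self.rng = np.random.default_rng(seed)
29 |         self.mu_w = self.rng.uniform(-1, 1, (n_out, n_in)) / np.sqrt(n_in)
30 |         self.mu_b = self.rng.uniform(-1, 1, n_out) / np.sqrt(n_in)
31 |         self.sigma_w = np.full((n_out, n_in), sigma0)
32 |         self.sigma_b = np.full(n_out, sigma0)
33 |
34 |     def forward(self, x):
35 |         w = self.mu_w + self.sigma_w * self.rng.standard_normal(self.mu_w.shape)
36 |         b = self.mu_b + self.sigma_b * self.rng.standard_normal(self.mu_b.shape)
37 |         return w @ x + b
38 |
39 | layer = NoisyLinear(4, 2)
40 | print(layer.forward(np.ones(4)))  # output differs on each call: that is the exploration
41 | ```
42 |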
--------------------------------------------------------------------------------
/OLRL.md:
--------------------------------------------------------------------------------
1 | # Observational Learning by Reinforcement Learning
2 | Diana Borsa, Bilal Piot, Rémi Munos, Olivier Pietquin
3 |
4 | ## Abstract
5 | Observational learning is a type of learning that occurs as a function of observing, retaining and possibly replicating or imitating the behaviour of another agent.
6 |
7 | It is a core mechanism appearing in various instances of social learning and has been found to be employed in several intelligent species, including humans.
8 |
9 | In this paper, we investigate to what extent the explicit modelling of other agents is necessary to achieve observational learning through machine learning.
10 |
11 | Especially, we argue that observational learning can emerge from pure Reinforcement Learning (RL), potentially coupled with memory.
12 |
13 | Through simple scenarios, we demonstrate that an RL agent can leverage the information provided by the observations of an other agent performing a task in a shared environment.
14 |
15 | The other agent is only observed through the effect of its actions on the environment and never explicitly modeled.
16 |
17 | Two key aspects are borrowed from observational learning:
18 | 1. the observer behaviour needs to change as a result of viewing a ’teacher’ (another agent) and
19 | 2. the observer needs to be motivated somehow to engage in making use of the other agent’s behaviour.
20 |
21 | The latter is naturally modeled by RL, by correlating the learning agent’s reward with the teacher agent’s behaviour.
22 |
--------------------------------------------------------------------------------
/OP-GAIL.md:
--------------------------------------------------------------------------------
1 | # ADDRESSING SAMPLE INEFFICIENCY AND REWARD BIAS IN INVERSE REINFORCEMENT LEARNING
2 | > Ilya Kostrikov, Kumar Krishna Agrawal, Sergey Levine, Jonathan Tompson
3 |
4 | ## ABSTRACT
5 | The Generative Adversarial Imitation Learning (GAIL) framework from Ho & Ermon (2016) is known for being surprisingly sample efficient in terms of demonstrations provided by an expert policy.
6 |
7 | However, the algorithm requires a significantly larger number of policy interactions with the environment in order to imitate the expert.
8 |
9 | In this work we address this problem by proposing a sample efficient algorithm for inverse reinforcement learning that incorporates both off-policy reinforcement learning and adversarial imitation learning.
10 |
11 | We also show that GAIL has a number of biases associated with the choice of reward function, which can unintentionally encode prior knowledge of some tasks, and prevent learning in others.
12 |
13 | We address these shortcomings by analyzing the issue and correcting invalid assumptions used when defining the learned reward function.
14 |
15 | We demonstrate that our algorithm achieves state-of-the-art performance for an inverse reinforcement learning framework on a variety of standard benchmark tasks, and from demonstrations provided from both learned agents and human experts.
16 |
--------------------------------------------------------------------------------
/OPRE.md:
--------------------------------------------------------------------------------
1 | # Options as responses:Grounding behavioural hierarchies in multi-agent RL
2 | > Alexander Sasha Vezhnevets, Yuhuai Wu, Rémi Leblond, Joel Z. Leibo
3 |
4 | ## Abstract
5 | We propose a novel hierarchical agent architecture for multi-agent reinforcement learning with concealed information.
6 |
7 | The hierarchy is grounded in the concealed information about other players, which resolves "the chicken or the egg" nature of option discovery.
8 |
9 | We factorise the value function over a latent representation of the concealed information and then re-use this latent space to factorise the policy into options.
10 |
11 | Low-level policies (options) are trained to respond to particular states of other agents grouped by the latent representation, while the top level (meta-policy) learns to infer the latent representation from its own observation thereby to select the right option.
12 |
13 | This grounding facilitates credit assignment across the levels of hierarchy.
14 |
15 | We show that this helps generalisation—performance against a held-out set of pre-trained competitors, while training in self- or population-play—and resolution of social dilemmas in self-play.
16 |
--------------------------------------------------------------------------------
/OVPG.md:
--------------------------------------------------------------------------------
1 | # An operator view of policy gradient methods
2 | > Dibya Ghosh, Marlos C. Machado, and Nicolas Le Roux
3 |
4 | ## Abstract
5 | We cast policy gradient methods as the repeated application of two operators: a policy improvement operator I, which maps any policy π to a better one Iπ, and a projection operator P, which finds the best approximation of Iπ in the set of realizable policies.
6 |
7 | We use this framework to introduce operator-based versions of traditional policy gradient methods such as Reinforce and PPO, which leads to a better understanding of their original counterparts.
8 |
9 | We also use the understanding we develop of the role of I and P to propose a new global lower bound of the expected return.
10 |
11 | This new perspective allows us to further bridge the gap between policy-based and value-based methods, showing how Reinforce and the Bellman optimality operator, for example, can be seen as two sides of the same coin.
12 |
13 | download link: https://arxiv.org/pdf/2006.11266.pdf
14 |
--------------------------------------------------------------------------------
/PCL.md:
--------------------------------------------------------------------------------
1 | # Bridging the Gap Between Value and Policy Based Reinforcement Learning
2 |
3 | Ofir Nachum, Mohammad Norouzi, Kelvin Xu, Dale Schuurmans
4 |
5 | A new notion of softmax temporal consistency that generalizes the standard hardmax Bellman consistency usually considered in value based reinforcement learning (RL).
6 |
7 | softmax consistent action values correspond to optimal policies that maximize entropy regularized expected reward.
8 |
9 | softmax consistent action values and the optimal policy must satisfy a mutual compatibility property that holds across any state-action subsequence.
10 |
11 | ## New algorithm
12 | Path Consistency Learning (PCL), that minimizes the total inconsistency measured along multi-step subsequences extracted from both on and off policy traces.
13 |
14 | ## Contributions
15 | * A complete characterization of softmax temporal consistency, which generalizes the commonly used hardmax Bellman consistency.
16 | * A proof that Q-values satisfying softmax temporal consistency directly determine the optimal policy that maximizes entropy regularized expected discounted reward.
17 | * Identification of a new multi-step path-wise softmax consistency property that relates the optimal Q-values at the end points of any path to the log-probabilities of the optimal policy along actions of that path.
18 | * An effective RL algorithm, Path Consistency Learning, that exploits multi-step path-wise consistency and combines elements of value and policy based RL.
19 | * Strong experimental results versus current actor-critic and Q-learning baselines.
20 |
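21 | As a hedged sketch of the quantity PCL minimises: for a sub-trajectory of length d, the path inconsistency combines the end-point values with entropy-regularised rewards along the way, and the loss is its square (names are illustrative; tau is the entropy-regularisation temperature):
22 |
23 | ```python
24 | import numpy as np
25 |
26 | def path_inconsistency(v_start, v_end, rewards, log_pis, gamma, tau):
27 |     """C = -V(s_t) + gamma^d V(s_{t+d}) + sum_i gamma^i (r_i - tau * log pi(a_i|s_i))."""
28 |     d = len(rewards)
29 |     discounts = gamma ** np.arange(d)
30 |     return (-v_start + gamma ** d * v_end
31 |             + float(np.sum(discounts * (np.asarray(rewards) - tau * np.asarray(log_pis)))))
32 |
33 | # PCL minimises 0.5 * C^2, with gradients flowing into both the value and the policy
34 | c = path_inconsistency(v_start=1.0, v_end=0.8, rewards=[0.1, 0.0],
35 |                        log_pis=[-0.2, -1.0], gamma=0.99, tau=0.1)
36 | print(0.5 * c ** 2)
37 | ```
38 |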
--------------------------------------------------------------------------------
/PEARL.md:
--------------------------------------------------------------------------------
1 | # Efficient Off-Policy Meta-Reinforcement Learning via Probabilistic Context Variables
2 | Kate Rakelly, Aurick Zhou, Deirdre Quillen, Chelsea Finn, Sergey Levine
3 | (Submitted on 19 Mar 2019)
4 | Deep reinforcement learning algorithms require large amounts of experience to learn an individual task.
5 |
6 | While in principle meta-reinforcement learning (meta-RL) algorithms enable agents to learn new skills from small amounts of experience, several major challenges preclude their practicality.
7 |
8 | Current methods rely heavily on on-policy experience, limiting their sample efficiency.
9 |
10 | They also lack mechanisms to reason about task uncertainty when adapting to new tasks, limiting their effectiveness in sparse reward problems.
11 |
12 | In this paper, we address these challenges by developing an off-policy meta-RL algorithm that disentangles task inference and control.
13 |
14 | In our approach, we perform online probabilistic filtering of latent task variables to infer how to solve a new task from small amounts of experience.
15 |
16 | This probabilistic interpretation enables posterior sampling for structured and efficient exploration.
17 |
18 | We demonstrate how to integrate these task variables with off-policy RL algorithms to achieve both meta-training and adaptation efficiency.
19 |
20 | Our method outperforms prior algorithms in sample efficiency by 20-100X as well as in asymptotic performance on several meta-RL benchmarks.
21 |
--------------------------------------------------------------------------------
/PEB.md:
--------------------------------------------------------------------------------
1 | # Time Limits in Reinforcement Learning
2 |
3 | > Fabio Pardo 1 Arash Tavakoli 1 Vitaly Levdik 1 Petar Kormushev 1
4 |
5 | ## Abstract
6 | In reinforcement learning, it is common to let an agent interact for a fixed amount of time with its environment before resetting it and repeating the process in a series of episodes.
7 |
8 | The task that the agent has to learn can either be to maximize its performance over (i) that fixed period, or (ii) an indefinite period where time limits are only used during training to diversify experience.
9 |
10 | In this paper, we provide a formal account for how time limits could effectively be handled in each of the two cases and explain why not doing so can cause state-aliasing and invalidation of experience replay, leading to suboptimal policies and training instability.
11 |
12 | In case (i), we argue that the terminations due to time limits are in fact part of the environment, and thus a notion of the remaining time should be included as part of the agent’s input to avoid violation of the Markov property.
13 |
14 | In case (ii), the time limits are not part of the environment and are only used to facilitate learning.
15 |
16 | We argue that this insight should be incorporated by bootstrapping from the value of the state at the end of each partial episode.
17 |
18 | For both cases, we illustrate empirically the significance of our considerations in improving the performance and stability of existing reinforcement learning algorithms, showing state-of-the-art results on several control tasks.
19 |
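20 | A minimal sketch of the case (ii) fix (partial-episode bootstrapping), assuming the replay tuple records whether an episode ended in a true terminal state or was only cut off by the training time limit (names are illustrative):
21 |
22 | ```python
23 | def td_target(reward, next_value, terminal, timed_out, gamma=0.99):
24 |     """One-step TD target that bootstraps through time-limit terminations.
25 |
26 |     terminal:  the environment itself ended the episode (true terminal state)
27 |     timed_out: the episode was cut off only by the training time limit
28 |     """
29 |     if terminal and not timed_out:
30 |         return reward                       # genuine terminal: no future value
31 |     return reward + gamma * next_value      # time limit (or no termination): bootstrap
32 | ```
33 |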
--------------------------------------------------------------------------------
/PER.md:
--------------------------------------------------------------------------------
1 | Prioritized Experience Replay
2 | Tom Schaul, John Quan, Ioannis Antonoglou, David Silver
3 |
4 | Experience replay lets online reinforcement learning agents remember and reuse
5 | experiences from the past. In prior work, experience transitions were uniformly
6 | sampled from a replay memory. However, this approach simply replays transitions
7 | at the same frequency that they were originally experienced, regardless of their
8 | significance. In this paper we develop a framework for prioritizing experience,
9 | so as to replay important transitions more frequently, and therefore learn more
10 | efficiently. We use prioritized experience replay in Deep Q-Networks (DQN), a
11 | reinforcement learning algorithm that achieved human-level performance across
12 | many Atari games. DQN with prioritized experience replay achieves a new state-of-the-art,
13 | outperforming DQN with uniform replay on 41 out of 49 games.
14 |
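15 | A minimal sketch of proportional prioritization with importance-sampling correction, using a toy array instead of the paper's sum-tree (hyperparameter values are illustrative):
16 |
17 | ```python
18 | import numpy as np
19 |
20 | def sample_prioritized(td_errors, batch_size, alpha=0.6, beta=0.4, eps=1e-6):
21 |     """Sample transition indices with probability proportional to |TD error|^alpha."""
22 |     priorities = (np.abs(td_errors) + eps) ** alpha
23 |     probs = priorities / priorities.sum()
24 |     idx = np.random.choice(len(td_errors), size=batch_size, p=probs)
25 |     weights = (len(td_errors) * probs[idx]) ** (-beta)   # importance-sampling correction
26 |     weights /= weights.max()                              # normalize for stability
27 |     return idx, weights
28 | ```
29 |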
--------------------------------------------------------------------------------
/PGQ.md:
--------------------------------------------------------------------------------
1 | # PGQ: COMBINING POLICY GRADIENT AND Q-LEARNING
2 | Brendan O’Donoghue, Rémi Munos, Koray Kavukcuoglu & Volodymyr Mnih (DeepMind)
3 |
4 | Policy gradient is an efficient technique for improving a policy in a reinforcement learning setting.
5 |
6 | However, vanilla online variants are on-policy only and not able to take advantage of off-policy data.
7 |
8 | We introduce a new technique that combines policy gradient with off-policy Q-learning, drawing experience from a replay buffer.
9 |
10 | **We establish a connection between the fixed points of the regularized policy gradient algorithm and the Q-values.**
11 |
12 | This lets us estimate the Q-values from the action preferences of the policy, to which we apply Q-learning updates.
13 |
14 | ## Algorithm
15 | ‘PGQ’, for policy gradient and Q-learning.
16 |
17 | We establish an equivalence between action-value fitting techniques and actor-critic algorithms,
18 | showing that regularized policy gradient techniques can be interpreted as advantage function learning algorithms.
19 |
20 |
21 | We conclude with some numerical examples that demonstrate improved data efficiency and stability of PGQ.
22 |
23 | We tested PGQ on the full suite of Atari games and achieved performance exceeding that of both asynchronous advantage
24 | actor-critic (A3C) and Q-learning.
25 |
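26 | A sketch of the connection above (assuming entropy-regularization temperature α, policy entropy H^π(s), and a critic V(s) estimating the expected action value under π): at the regularized fixed point the Q-values can be read off the policy's action preferences as
27 |
28 | $$\tilde{Q}(s, a) = \alpha\big(\log \pi(a \mid s) + H^{\pi}(s)\big) + V(s),$$
29 |
30 | and the algorithm applies Q-learning updates to this estimate alongside the policy gradient step.
31 |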
--------------------------------------------------------------------------------
/PGS.md:
--------------------------------------------------------------------------------
1 | # Policy Gradient Search: Online Planning and Expert Iteration without Search Trees
2 | > Thomas Anthony, Robert Nishihara, Philipp Moritz, Tim Salimans, and John Schulman
3 |
4 | ## Abstract
5 | Monte Carlo Tree Search (MCTS) algorithms perform simulation-based search to improve policies online. During search, the simulation policy is adapted to explore the most promising lines of play.
6 |
7 | MCTS has been used by state-of-the-art programs for many problems; however, a disadvantage of MCTS is that it estimates the values of states with Monte Carlo averages, stored in a search tree; this does not scale to games with very high branching factors.
8 |
9 | We propose an alternative simulation-based search method, Policy Gradient Search (PGS), which adapts a neural network simulation policy online via policy gradient updates, avoiding the need for a search tree. In Hex, PGS achieves comparable performance to MCTS, and an agent trained using Expert Iteration with PGS was able to defeat MoHex 2.0, the strongest open-source Hex agent, in 9x9 Hex.
10 |
--------------------------------------------------------------------------------
/PGSQL.md:
--------------------------------------------------------------------------------
1 | from OpenAI & UC Berkeley
2 | link: https://arxiv.org/pdf/1704.06440.pdf
3 |
4 | Two of the leading approaches for model-free reinforcement learning are policy gradient methods
5 | and Q-learning methods. Q-learning methods can be effective and sample-efficient when they work,
6 | however, it is not well-understood why they work, since empirically, the Q-values they estimate are very
7 | inaccurate. A partial explanation may be that Q-learning methods are secretly implementing policy
8 | gradient updates: we show that there is a precise equivalence between Q-learning and policy gradient
9 | methods in the setting of entropy-regularized reinforcement learning, that “soft” (entropy-regularized)
10 | Q-learning is exactly equivalent to a policy gradient method. We also point out a connection between
11 | Q-learning methods and natural policy gradient methods.
12 | Experimentally, we explore the entropy-regularized versions of Q-learning and policy gradients, and
13 | we find them to perform as well as (or slightly better than) the standard variants on the Atari benchmark.
14 | We also show that the equivalence holds in practical settings by constructing a Q-learning method that
15 | closely matches the learning dynamics of A3C without using a target network or ε-greedy exploration
16 | schedule.
17 |
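18 | A minimal sketch of the mapping behind the equivalence, assuming a discrete action space and temperature tau (illustrative, not the paper's code):
19 |
20 | ```python
21 | import numpy as np
22 |
23 | def soft_value(q, tau=0.01):
24 |     """Soft state value: V(s) = tau * logsumexp(Q(s, .) / tau), computed stably."""
25 |     q = np.asarray(q, dtype=float)
26 |     m = q.max()
27 |     return m + tau * np.log(np.exp((q - m) / tau).sum())
28 |
29 | def boltzmann_policy(q, tau=0.01):
30 |     """Policy implied by soft Q-values: pi(a|s) = exp((Q(s,a) - V(s)) / tau)."""
31 |     q = np.asarray(q, dtype=float)
32 |     return np.exp((q - soft_value(q, tau)) / tau)
33 | ```
34 |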
--------------------------------------------------------------------------------
/PPO-CMA.md:
--------------------------------------------------------------------------------
1 | # PPO-CMA: Proximal Policy Optimization with Covariance Matrix Adaptation
2 | > Perttu Hämäläinen, Amin Babadi, Xiaoxiao Ma, Jaakko Lehtinen
3 |
4 | ## Abstract
5 |
6 | Proximal Policy Optimization (PPO) is a highly popular model-free reinforcement learning (RL) approach.
7 |
8 | However, we observe that in a continuous action space, PPO can prematurely shrink the exploration variance, which leads to slow progress and may make the algorithm prone to getting stuck in local optima.
9 |
10 | Drawing inspiration from CMA-ES, a black-box evolutionary optimization method designed for robustness in similar situations, we propose PPO-CMA, a proximal policy optimization approach that adaptively expands the exploration variance to speed up progress.
11 |
12 | This can be considered as a form of action-space momentum.
13 |
14 | With only minor changes to PPO, our algorithm considerably improves performance in Roboschool continuous control benchmarks.
15 |
--------------------------------------------------------------------------------
/PPO.md:
--------------------------------------------------------------------------------
1 | Proximal Policy Optimization Algorithms
2 |
3 | John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, Oleg Klimov
4 |
5 | We propose a new family of policy gradient methods for reinforcement learning, which alternate between sampling data through interaction with the environment, and optimizing a "surrogate" objective function using stochastic gradient ascent. Whereas standard policy gradient methods perform one gradient update per data sample, we propose a novel objective function that enables multiple epochs of minibatch updates. The new methods, which we call proximal policy optimization (PPO), have some of the benefits of trust region policy optimization (TRPO), but they are much simpler to implement, more general, and have better sample complexity (empirically). Our experiments test PPO on a collection of benchmark tasks, including simulated robotic locomotion and Atari game playing, and we show that PPO outperforms other online policy gradient methods, and overall strikes a favorable balance between sample complexity, simplicity, and wall-time.
6 |
7 |
8 |
9 |
10 |
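11 | A minimal sketch of the clipped surrogate objective that enables the multiple epochs of minibatch updates described above (NumPy for clarity; in practice this is a differentiable loss over policy parameters):
12 |
13 | ```python
14 | import numpy as np
15 |
16 | def ppo_clip_objective(logp_new, logp_old, advantages, clip_eps=0.2):
17 |     """Clipped surrogate: E[min(r * A, clip(r, 1 - eps, 1 + eps) * A)], r = pi_new / pi_old."""
18 |     ratio = np.exp(logp_new - logp_old)
19 |     unclipped = ratio * advantages
20 |     clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
21 |     return np.mean(np.minimum(unclipped, clipped))
22 | ```
23 |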
--------------------------------------------------------------------------------
/PhiEB.md:
--------------------------------------------------------------------------------
1 | Count-Based Exploration in Feature Space for Reinforcement Learning
2 | Jarryd Martin, Suraj Narayanan S., Tom Everitt, Marcus Hutter
3 |
4 | We introduce a new count-based optimistic exploration
5 | algorithm for reinforcement learning
6 | (RL) that is feasible in environments with high-dimensional
7 | state-action spaces. The success of
8 | RL algorithms in these domains depends crucially
9 | on generalisation from limited training experience.
10 | Function approximation techniques enable
11 | RL agents to generalise in order to estimate the
12 | value of unvisited states, but at present few methods
13 | enable generalisation regarding uncertainty. This
14 | has prevented the combination of scalable RL algorithms
15 | with efficient exploration strategies that
16 | drive the agent to reduce its uncertainty. We
17 | present a new method for computing a generalised
18 | state visit-count, which allows the agent to estimate
19 | the uncertainty associated with any state. Our
20 | φ-pseudocount achieves generalisation by exploiting
21 | the same feature representation of the state
22 | space that is used for value function approximation.
23 | States that have less frequently observed features
24 | are deemed more uncertain. The φ-Exploration-Bonus
25 | algorithm rewards the agent for exploring
26 | in feature space rather than in the untransformed
27 | state space. The method is simpler and less computationally
28 | expensive than some previous proposals,
29 | and achieves near state-of-the-art results on high-dimensional
30 | RL benchmarks.
31 |
32 | https://arxiv.org/pdf/1706.08090.pdf
33 |
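34 | A rough sketch of how the bonus is used, assuming some generalised visit-count N̂_φ(s) derived from the value-function features (computing N̂_φ is the paper's contribution; the MBIE-EB-style bonus form and constants here are illustrative):
35 |
36 | ```python
37 | import numpy as np
38 |
39 | def phi_exploration_bonus(visit_pseudocount, beta=0.05):
40 |     """Optimism bonus added to the reward: larger for states whose features are rarely seen."""
41 |     return beta / np.sqrt(visit_pseudocount + 1e-8)
42 |
43 | # illustrative reward shaping during learning:
44 | # r_augmented = r + phi_exploration_bonus(n_hat)
45 | ```
46 |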
--------------------------------------------------------------------------------
/ProMP.md:
--------------------------------------------------------------------------------
1 | # ProMP: Proximal Meta-Policy Search
2 | Jonas Rothfuss, Dennis Lee, Ignasi Clavera, Tamim Asfour, Pieter Abbeel
3 |
4 | Credit assignment in Meta-reinforcement learning (Meta-RL) is still poorly understood.
5 |
6 | Existing methods either neglect credit assignment to pre-adaptation behavior or implement it naively.
7 |
8 | This leads to poor sample-efficiency during meta-training as well as ineffective task identification strategies.
9 |
10 | This paper provides a theoretical analysis of credit assignment in gradient-based Meta-RL.
11 |
12 | Building on the gained insights we develop a novel meta-learning algorithm that overcomes both the issue of poor credit assignment and previous difficulties in estimating meta-policy gradients.
13 |
14 | By controlling the statistical distance of both pre-adaptation and adapted policies during meta-policy search, the proposed algorithm enables efficient and stable meta-learning.
15 |
16 | Our approach leads to superior pre-adaptation policy behavior and consistently outperforms previous Meta-RL algorithms in sample-efficiency, wall-clock time, and asymptotic performance.
17 |
--------------------------------------------------------------------------------
/Programmable.md:
--------------------------------------------------------------------------------
1 | Misha Denil, Sergio Gómez Colmenarejo, Serkan Cabi, David Saxton, and Nando de Freitas
2 | > from DeepMind
3 |
4 | We build deep RL agents that execute declarative programs expressed in a formal language.
5 | The agents learn to ground the terms in this language in their environment, and can generalize their behavior at test time to execute new programs that refer to objects that were not referenced during training.
6 | The agents develop disentangled interpretable representations that allow them to generalize to a wide variety of zero-shot semantic tasks.
7 |
--------------------------------------------------------------------------------
/Proposal.md:
--------------------------------------------------------------------------------
1 | To do: compare the 10 hottest topics of each year to show how reinforcement learning research has evolved over time.
2 |
--------------------------------------------------------------------------------
/QEnsemble.md:
--------------------------------------------------------------------------------
1 | UCB and InfoGain Exploration via Q-Ensembles
2 |
3 | Richard Y. Chen
4 | OpenAI
5 | richardchen@openai.com
6 | Szymon Sidor
7 | OpenAI
8 | szymon@openai.com
9 | Pieter Abbeel
10 | OpenAI
11 | University of California, Berkeley
12 | pieter@openai.com
13 | John Schulman
14 | OpenAI
15 | joschu@openai.com
16 |
17 | We show how an ensemble of Q*-functions can be
18 | leveraged for more effective
19 | exploration in deep reinforcement learning. We build on well established algorithms
20 | from the bandit setting, and adapt them to the Q-learning setting. First we
21 | propose an exploration strategy based on upper-confidence bounds (UCB). Next,
22 | we define an “InfoGain” exploration bonus, which depends on the disagreement of
23 | the Q-ensemble. Our experiments show significant gains on the Atari benchmark.
24 |
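25 | A minimal sketch of the UCB-style action selection described above, assuming the K Q-heads have been evaluated at the current state (array shape K x num_actions; the trade-off coefficient is illustrative):
26 |
27 | ```python
28 | import numpy as np
29 |
30 | def ucb_action(q_values, lam=1.0):
31 |     """Pick the action maximizing ensemble mean plus lam times ensemble disagreement."""
32 |     q_values = np.asarray(q_values, dtype=float)   # shape (K, num_actions)
33 |     mean = q_values.mean(axis=0)
34 |     std = q_values.std(axis=0)                     # disagreement across the ensemble
35 |     return int(np.argmax(mean + lam * std))
36 | ```
37 |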
--------------------------------------------------------------------------------
/QPROP.md:
--------------------------------------------------------------------------------
1 | # Q-PROP: SAMPLE-EFFICIENT POLICY GRADIENT WITH AN OFF-POLICY CRITIC
2 |
3 |
4 | Shixiang Gu, Timothy Lillicrap, Zoubin Ghahramani, Richard E. Turner, Sergey Levine
5 |
6 | Model-free deep reinforcement learning (RL) methods have been successful in a
7 | wide variety of simulated domains. However, a major obstacle facing deep RL
8 | in the real world is their high sample complexity. Batch policy gradient methods
9 | offer stable learning, but at the cost of high variance, which often requires large
10 | batches. TD-style methods, such as off-policy actor-critic and Q-learning, are
11 | more sample-efficient but biased, and often require costly hyperparameter sweeps
12 | to stabilize. In this work, we aim to develop methods that combine the stability of
13 | policy gradients with the efficiency of off-policy RL. We present Q-Prop, a policy
14 | gradient method that uses a Taylor expansion of the off-policy critic as a control
15 | variate. Q-Prop is both sample efficient and stable, and effectively combines the
16 | benefits of on-policy and off-policy methods. We analyze the connection between
17 | Q-Prop and existing model-free algorithms, and use control variate theory to derive
18 | two variants of Q-Prop with conservative and aggressive adaptation. We show
19 | that conservative Q-Prop provides substantial gains in sample efficiency over trust
20 | region policy optimization (TRPO) with generalized advantage estimation (GAE),
21 | and improves stability over deep deterministic policy gradient (DDPG), the state-of-the-art
22 | on-policy and off-policy methods, on OpenAI Gym’s MuJoCo continuous
23 | control environments.
24 |
--------------------------------------------------------------------------------
/QR-DQN.md:
--------------------------------------------------------------------------------
1 | # Distributional Reinforcement Learning with Quantile Regression
2 | > Will Dabney, Mark Rowland, Marc G. Bellemare, Remi Munos
3 |
4 | ## Abstract
5 | In reinforcement learning an agent interacts with the environment by taking actions and observing the next state and reward.
6 |
7 | When sampled probabilistically, these state transitions, rewards, and actions can all induce randomness in the observed long-term return.
8 |
9 | Traditionally, reinforcement learning algorithms average over this randomness to estimate the value function.
10 |
11 | In this paper, we build on recent work advocating a distributional approach to reinforcement learning in which the distribution over returns is modeled explicitly instead of only estimating the mean.
12 |
13 | That is, we examine methods of learning the value distribution instead of the value function.
14 |
15 | We give results that close a number of gaps between the theoretical and algorithmic results given by Bellemare, Dabney, and Munos (2017).
16 |
17 | 1. We extend existing results to the approximate distribution setting.
18 | 2. We present a novel distributional reinforcement learning algorithm consistent with our theoretical formulation.
19 | 3. We evaluate this new algorithm on the Atari 2600 games, observing that it significantly outperforms many of the recent improvements on DQN, including the related distributional algorithm C51.
20 |
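21 | A minimal sketch of the quantile regression (Huber) loss minimized for fixed quantile fractions, with predicted quantiles theta and sampled target returns (NumPy, illustrative only):
22 |
23 | ```python
24 | import numpy as np
25 |
26 | def quantile_huber_loss(theta, targets, taus, kappa=1.0):
27 |     """theta: (N,) predicted quantiles; targets: (M,) target samples; taus: (N,) fractions."""
28 |     u = targets[None, :] - theta[:, None]                   # pairwise TD errors, shape (N, M)
29 |     huber = np.where(np.abs(u) <= kappa,
30 |                      0.5 * u ** 2,
31 |                      kappa * (np.abs(u) - 0.5 * kappa))
32 |     weight = np.abs(taus[:, None] - (u < 0).astype(float))  # asymmetric quantile weighting
33 |     return (weight * huber / kappa).mean()
34 | ```
35 |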
--------------------------------------------------------------------------------
/RCFR.md:
--------------------------------------------------------------------------------
1 | # Revisiting CFR+ and Alternating Updates
2 | Neil Burch, Matej Moravcik, Martin Schmid
3 | ## Abstract
4 | The CFR+ algorithm for solving imperfect information games is a variant of the popular CFR algorithm, with faster empirical performance on a range of problems.
5 |
6 | It was introduced with a theoretical upper bound on solution error, but subsequent work showed an error in one step of the proof.
7 |
8 | We provide updated proofs to recover the original bound.
9 |
--------------------------------------------------------------------------------
/REACTOR.md:
--------------------------------------------------------------------------------
1 | https://arxiv.org/pdf/1704.04651.pdf
2 |
3 | # introduction
4 | This is the continuing work from authors of Retrace algorithm.
5 |
6 | Reactor (for Retrace-actor), based on an off-policy multi-step return actor-critic architecture, improves
7 | data efficiency and numerical efficiency, and is friendly to parallelization.
8 |
9 | ## use of rnn
10 | The agent uses a deep **recurrent neural network** for function approximation.
11 |
12 | ## multiple policy usage
13 | The network outputs a target policy π (the actor), an action-value Q-function (the critic) evaluating the current policy π,
14 | and an estimated behavioural policy μ̂, which we use for off-policy correction.
15 |
16 | ## use of past experiences
17 | The agent maintains a memory buffer filled with past experiences, which makes it sample-efficient.
18 |
19 | ## training
20 | The critic is trained by the multi-step off-policy Retrace algorithm and the actor is trained by a novel β-leave-one-out
21 | policy gradient estimate (which uses both the off-policy corrected return and the estimated Q-function).
22 |
23 | ## multiple-step returns
24 | The agent is numerically efficient since it uses multi-step returns.
25 |
26 | ## parallelization
27 | Also both acting and learning can be parallelized.
28 |
29 |
30 | # experiments
31 | We evaluated our algorithm on 57 Atari 2600 games, where Reactor achieves state-of-the-art performance.
32 |
33 | # related methods
34 | ACER (from DeepMind as well)
35 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Awesome Deep Reinforcement Learning
2 |
3 | > **Mar 1 2024 update: HILP added**
4 | >
5 | > **July 2022 update: EDDICT added**
6 | >
7 | > **Mar 2022 update: a few papers released in early 2022**
8 | >
9 | > **Dec 2021 update: Unsupervised RL**
10 |
11 | ## Introduction to awesome drl
12 | Reinforcement learning is the fundamental framework for building AGI, so we collect important contributions to the field in this awesome-drl project.
13 |
14 | ## Landscape of Deep RL
15 |
16 | 
17 |
18 | ## Content
19 | - [Awesome Deep Reinforcement Learning](#awesome-deep-reinforcement-learning)
20 | - [Introduction to awesome drl](#introduction-to-awesome-drl)
21 | - [Landscape of Deep RL](#landscape-of-deep-rl)
22 | - [Content](#content)
23 | - [General guidances](#general-guidances)
24 | - [2022](#2022)
25 | - [Foundations and theory](#foundations-and-theory)
26 | - [General benchmark frameworks](#general-benchmark-frameworks)
27 | - [Unsupervised](#unsupervised)
28 | - [Offline](#offline)
29 | - [Value based](#value-based)
30 | - [Policy gradient](#policy-gradient)
31 | - [Explorations](#explorations)
32 | - [Actor-Critic](#actor-critic)
33 | - [Model-based](#model-based)
34 | - [Model-free + Model-based](#model-free--model-based)
35 | - [Hierarchical](#hierarchical)
36 | - [Option](#option)
37 | - [Connection with other methods](#connection-with-other-methods)
38 | - [Connecting value and policy methods](#connecting-value-and-policy-methods)
39 | - [Reward design](#reward-design)
40 | - [Unifying](#unifying)
41 | - [Faster DRL](#faster-drl)
42 | - [Multi-agent](#multi-agent)
43 | - [New design](#new-design)
44 | - [Multitask](#multitask)
45 | - [Observational Learning](#observational-learning)
46 | - [Meta Learning](#meta-learning)
47 | - [Distributional](#distributional)
48 | - [Planning](#planning)
49 | - [Safety](#safety)
50 | - [Inverse RL](#inverse-rl)
51 | - [No reward RL](#no-reward-rl)
52 | - [Time](#time)
53 | - [Adversarial learning](#adversarial-learning)
54 | - [Use Natural Language](#use-natural-language)
55 | - [Generative and contrastive representation learning](#generative-and-contrastive-representation-learning)
56 | - [Belief](#belief)
57 | - [PAC](#pac)
58 | - [Applications](#applications)
59 |
60 | Illustrations:
61 |
62 | 
63 |
64 | **Recommendations and suggestions are welcome**.
65 | ## General guidances
66 | * [Awesome Offline RL](https://github.com/hanjuku-kaso/awesome-offline-rl)
67 | * [Reinforcement Learning Today](http://reinforcementlearning.today/)
68 | * [Multiagent Reinforcement Learning by Marc Lanctot RLSS @ Lille](http://mlanctot.info/files/papers/Lanctot_MARL_RLSS2019_Lille.pdf) 11 July 2019
69 | * [RLDM 2019 Notes by David Abel](https://david-abel.github.io/notes/rldm_2019.pdf) 11 July 2019
70 | * [A Survey of Reinforcement Learning Informed by Natural Language](RLNL.md) 10 Jun 2019 [arxiv](https://arxiv.org/pdf/1906.03926.pdf)
71 | * [Challenges of Real-World Reinforcement Learning](ChallengesRealWorldRL.md) 29 Apr 2019 [arxiv](https://arxiv.org/pdf/1904.12901.pdf)
72 | * [Ray Interference: a Source of Plateaus in Deep Reinforcement Learning](RayInterference.md) 25 Apr 2019 [arxiv](https://arxiv.org/pdf/1904.11455.pdf)
73 | * [Principles of Deep RL by David Silver](p10.md)
74 | * [University AI's General introduction to deep rl (in Chinese)](https://www.jianshu.com/p/dfd987aa765a)
75 | * [OpenAI's spinningup](https://spinningup.openai.com/en/latest/)
76 | * [The Promise of Hierarchical Reinforcement Learning](https://thegradient.pub/the-promise-of-hierarchical-reinforcement-learning/) 9 Mar 2019
77 | * [Deep Reinforcement Learning that Matters](reproducing.md) 30 Jan 2019 [arxiv](https://arxiv.org/pdf/1709.06560.pdf)
78 |
79 | ## 2024
80 | * [Foundation Policies with Hilbert Representations](HILP.md) [arxiv](https://arxiv.org/abs/2402.15567) [repo](https://github.com/seohongpark/HILP) 23 Feb 2024
81 |
82 | ## 2022
83 | * Reinforcement Learning with Action-Free Pre-Training from Videos [arxiv](https://arxiv.org/abs/2203.13880) [repo](https://github.com/younggyoseo/apv)
84 |
85 | ## Generalist policies
86 | * [Foundation Policies with Hilbert Representations](HILP.md) [arxiv](https://arxiv.org/abs/2402.15567) [repo](https://github.com/seohongpark/HILP) 23 Feb 2024
87 |
88 | ## Foundations and theory
89 |
90 | * [General non-linear Bellman equations](GNLBE.md) 9 July 2019 [arxiv](https://arxiv.org/pdf/1907.07331.pdf)
91 | * [Monte Carlo Gradient Estimation in Machine Learning](MCGE.md) 25 Jun 2019 [arxiv](https://arxiv.org/pdf/1906.10652.pdf)
92 |
93 | ## General benchmark frameworks
94 |
95 | * [Brax](https://github.com/google/brax/)
96 |
97 | 
98 | * [Android-Env](https://github.com/deepmind/android_env)
99 | * 
100 | * [MuJoCo](http://mujoco.org/) | [MuJoCo Chinese version](https://github.com/tigerneil/mujoco-zh)
101 | * [Unsupervised RL Benchmark](https://github.com/rll-research/url_benchmark)
102 | * [Dataset for Offline RL](https://github.com/rail-berkeley/d4rl)
103 | * [Spriteworld: a flexible, configurable python-based reinforcement learning environment](https://github.com/deepmind/spriteworld)
104 | * [Chainerrl Visualizer](https://github.com/chainer/chainerrl-visualizer)
105 | * [Behaviour Suite for Reinforcement Learning](BSRL.md) 13 Aug 2019 [arxiv](https://arxiv.org/pdf/1908.03568.pdf) | [code](https://github.com/deepmind/bsuite)
106 | * [Quantifying Generalization in Reinforcement Learning](Coinrun.md) 20 Dec 2018 [arxiv](https://arxiv.org/pdf/1812.02341.pdf)
107 | * [S-RL Toolbox: Environments, Datasets and Evaluation Metrics for State Representation Learning](SRL.md) 25 Sept 2018
108 | * [dopamine](https://github.com/google/dopamine)
109 | * [StarCraft II](https://github.com/deepmind/pysc2)
110 | * [tfrl](https://github.com/deepmind/trfl)
111 | * [chainerrl](https://github.com/chainer/chainerrl)
112 | * [PARL](https://github.com/PaddlePaddle/PARL)
113 | * [DI-engine: a generalized decision intelligence engine. It supports various Deep RL algorithms](https://github.com/opendilab/DI-engine)
114 | * [PPO x Family: Course in Chinese for Deep RL](https://github.com/opendilab/PPOxFamily)
115 |
116 | ## Unsupervised
117 |
118 | * [URLB: Unsupervised Reinforcement Learning Benchmark](https://arxiv.org/abs/2110.15191) 28 Oct 2021
119 | * [APS: Active Pretraining with Successor Feature](https://arxiv.org/abs/2108.13956) 31 Aug 2021
120 | * [Behavior From the Void: Unsupervised Active Pre-Training](https://arxiv.org/abs/2103.04551) 8 Mar 2021
121 | * [Reinforcement Learning with Prototypical Representations](https://arxiv.org/abs/2102.11271) 22 Feb 2021
122 | * [Efficient Exploration via State Marginal Matching](https://arxiv.org/abs/1906.05274) 12 Jun 2019
123 | * [Self-Supervised Exploration via Disagreement](https://arxiv.org/abs/1906.04161) 10 Jun 2019
124 | * [Exploration by Random Network Distillation](https://arxiv.org/abs/1810.12894) 30 Oct 2018
125 | * [Diversity is All You Need: Learning Skills without a Reward Function](https://arxiv.org/abs/1802.06070) 16 Feb 2018
126 | * [Curiosity-driven Exploration by Self-supervised Prediction](https://arxiv.org/pdf/1705.05363) 15 May 2017
127 |
128 | ## Offline
129 | * [PerSim: Data-efficient Offline Reinforcement Learning with Heterogeneous Agents via Personalized Simulators](https://arxiv.org/abs/2102.06961) 10 Nov 2021
130 | * [A General Offline Reinforcement Learning Framework for Interactive Recommendation]() AAAI 2021
131 |
132 |
133 | ## Value based
134 |
135 | * [Harnessing Structures for Value-Based Planning and Reinforcement Learning](SVRL.md) 5 Feb 2020 [arxiv](https://arxiv.org/abs/1909.12255) | [code](https://github.com/YyzHarry/SV-RL)
136 | * [Recurrent Value Functions](RVF.md) 23 May 2019 [arxiv](https://arxiv.org/pdf/1905.09562.pdf)
137 | * [Stochastic Lipschitz Q-Learning](LipschitzQ.md) 24 Apr 2019 [arxiv](https://arxiv.org/pdf/1904.10653.pdf)
138 | * [TreeQN and ATreeC: Differentiable Tree-Structured Models for Deep Reinforcement Learning](https://arxiv.org/pdf/1710.11417) 8 Mar 2018
139 | * [DISTRIBUTED PRIORITIZED EXPERIENCE REPLAY](https://arxiv.org/pdf/1803.00933.pdf) 2 Mar 2018
140 | * [Rainbow: Combining Improvements in Deep Reinforcement Learning](Rainbow.md) 6 Oct 2017
141 | * [Learning from Demonstrations for Real World Reinforcement Learning](DQfD.md) 12 Apr 2017
142 | * [Dueling Network Architecture](Dueling.md)
143 | * [Double DQN](DDQN.md)
144 | * [Prioritized Experience](PER.md)
145 | * [Deep Q-Networks](DQN.md)
146 |
147 | ## Policy gradient
148 |
149 | * [Phasic Policy Gradient](PPG.md) 9 Sep 2020 [arxiv](https://arxiv.org/pdf/2009.04416.pdf) [code](https://github.com/openai/phasic-policy-gradient)
150 | * [An operator view of policy gradient methods](OVPG.md) 22 Jun 2020 [arxiv](https://arxiv.org/pdf/2006.11266.pdf)
151 | * [Direct Policy Gradients: Direct Optimization of Policies in Discrete Action Spaces](DirPG.md) 14 Jun 2019 [arxiv](https://arxiv.org/pdf/1906.06062.pdf)
152 | * [Policy Gradient Search: Online Planning and Expert Iteration without Search Trees](PGS.md) 7 Apr 2019 [arxiv](https://arxiv.org/pdf/1904.03646.pdf)
153 | * [SUPERVISED POLICY UPDATE FOR DEEP REINFORCEMENT LEARNING](SPU.md) 24 Dec 2018 [arxiv](https://arxiv.org/pdf/1805.11706v4.pdf)
154 | * [PPO-CMA: Proximal Policy Optimization with Covariance Matrix Adaptation](PPO-CMA.md) 5 Oct 2018 [arxiv](https://arxiv.org/pdf/1810.02541v6.pdf)
155 | * [Clipped Action Policy Gradient](CAPG.md) 22 June 2018
156 | * [Expected Policy Gradients for Reinforcement Learning](EPG.md) 10 Jan 2018
157 | * [Proximal Policy Optimization Algorithms](PPO.md) 20 July 2017
158 | * [Emergence of Locomotion Behaviours in Rich Environments](DPPO.md) 7 July 2017
159 | * [Interpolated Policy Gradient: Merging On-Policy and Off-Policy Gradient Estimation for Deep Reinforcement Learning](IPG.md) 1 Jun 2017
160 | * [Equivalence Between Policy Gradients and Soft Q-Learning](PGSQL.md)
161 | * [Trust Region Policy Optimization](TRPO.md)
162 | * [Reinforcement Learning with Deep Energy-Based Policies](DEBP.md)
163 | * [Q-PROP: SAMPLE-EFFICIENT POLICY GRADIENT WITH AN OFF-POLICY CRITIC](QPROP.md)
164 |
165 | ## Explorations
166 |
167 | * [Entropic Desired Dynamics for Intrinsic Control](EDDICT.md) 2021 [openreview](https://openreview.net/pdf?id=lBSSxTgXmiK)
168 | * [Self-Supervised Exploration via Disagreement](Disagreement.md) 10 Jun 2019 [arxiv](https://arxiv.org/pdf/1906.04161.pdf)
169 | * [Approximate Exploration through State Abstraction](MBIE-EB.md) 24 Jan 2019
170 | * [The Uncertainty Bellman Equation and Exploration](UBE.md) 15 Sep 2017
171 | * [Noisy Networks for Exploration](NoisyNet.md) 30 Jun 2017 [implementation](https://github.com/Kaixhin/NoisyNet-A3C)
172 | * [Count-Based Exploration in Feature Space for Reinforcement Learning](PhiEB.md) 25 Jun 2017
173 | * [Count-Based Exploration with Neural Density Models](NDM.md) 14 Jun 2017
174 | * [UCB and InfoGain Exploration via Q-Ensembles](QEnsemble.md) 11 Jun 2017
175 | * [Minimax Regret Bounds for Reinforcement Learning](MMRB.md) 16 Mar 2017
176 | * [Incentivizing Exploration In Reinforcement Learning With Deep Predictive Models](incentivizing.md)
177 | * [EX2: Exploration with Exemplar Models for Deep Reinforcement Learning](EX2.md)
178 |
179 | ## Actor-Critic
180 |
181 | * [Generalized Off-Policy Actor-Critic](Geoff-PAC.md) 27 Mar 2019
182 | * [Soft Actor-Critic Algorithms and Applications](https://arxiv.org/pdf/1812.05905.pdf) 29 Jan 2019
183 | * [The Reactor: A Sample-Efficient Actor-Critic Architecture](REACTOR.md) 15 Apr 2017
184 | * [SAMPLE EFFICIENT ACTOR-CRITIC WITH EXPERIENCE REPLAY](ACER.md)
185 | * [REINFORCEMENT LEARNING WITH UNSUPERVISED AUXILIARY TASKS](UNREAL.md)
186 | * [Continuous control with deep reinforcement learning](DDPG.md)
187 |
188 | ## Model-based
189 |
190 | * [Self-Consistent Models and Values](sc.md) 25 Oct 2021 [arxiv](https://arxiv.org/pdf/2110.12840.pdf)
191 | * [When to use parametric models in reinforcement learning?](parametric.md) 12 Jun 2019 [arxiv](https://arxiv.org/pdf/1906.05243.pdf)
192 | * [Model Based Reinforcement Learning for Atari](https://arxiv.org/pdf/1903.00374.pdf) 5 Mar 2019
193 | * [Model-Based Stabilisation of Deep Reinforcement Learning](MBDQN.md) 6 Sep 2018
194 | * [Learning model-based planning from scratch](IBP.md) 19 July 2017
195 |
196 | ## Model-free + Model-based
197 |
198 | * [Imagination-Augmented Agents for Deep Reinforcement Learning](I2As.md) 19 July 2017
199 |
200 | ## Hierarchical
201 |
202 | * [WHY DOES HIERARCHY (SOMETIMES) WORK SO WELL IN REINFORCEMENT LEARNING?](HIRO.md) 23 Sep 2019 [arxiv](https://arxiv.org/pdf/1909.10618.pdf)
203 | * [Language as an Abstraction for Hierarchical Deep Reinforcement Learning](HAL.md) 18 Jun 2019 [arxiv](https://arxiv.org/pdf/1906.07343.pdf)
204 |
205 | ## Option
206 |
207 | * [Variational Option Discovery Algorithms](VALOR.md) 26 July 2018
208 | * [A Laplacian Framework for Option Discovery in Reinforcement Learning](LFOD.md) 16 Jun 2017
209 |
210 | ## Connection with other methods
211 |
212 | * [Robust Imitation of Diverse Behaviors](GVG.md)
213 | * [Learning human behaviors from motion capture by adversarial imitation](GAIL.md)
214 | * [Connecting Generative Adversarial Networks and Actor-Critic Methods](GANAC.md)
215 |
216 | ## Connecting value and policy methods
217 |
218 | * [Bridging the Gap Between Value and Policy Based Reinforcement Learning](PCL.md)
219 | * [Policy gradient and Q-learning](PGQ.md)
220 |
221 | ## Reward design
222 |
223 | * [End-to-End Robotic Reinforcement Learning without Reward Engineering](VICE.md) 16 Apr 2019 [arxiv](https://arxiv.org/pdf/1904.07854.pdf)
224 | * [Reinforcement Learning with Corrupted Reward Channel](RLCRC.md) 23 May 2017
225 |
226 | ## Unifying
227 |
228 | * [Multi-step Reinforcement Learning: A Unifying Algorithm](MSRL.md)
229 |
230 | ## Faster DRL
231 |
232 | * [Neural Episodic Control](NEC.md)
233 |
234 | ## Multi-agent
235 |
236 | * [No Press Diplomacy: Modeling Multi-Agent Gameplay](Dip.md) 4 Sep 2019 [arxiv](https://arxiv.org/pdf/1909.02128.pdf)
237 | * [Options as responses: Grounding behavioural hierarchies in multi-agent RL](OPRE.md) 6 Jun 2019 [arxiv](https://arxiv.org/pdf/1906.01470.pdf)
238 | * [Evolutionary Reinforcement Learning for Sample-Efficient Multiagent Coordination](MERL.md) 18 Jun 2019 [arxiv](https://arxiv.org/pdf/1906.07315.pdf)
239 | * [A Regularized Opponent Model with Maximum Entropy Objective](ROMMEO.md) 17 May 2019 [arxiv](https://arxiv.org/pdf/1905.08087.pdf)
240 | * [Deep Q-Learning for Nash Equilibria: Nash-DQN](NashDQN.md) 23 Apr 2019 [arxiv](https://arxiv.org/pdf/1904.10554.pdf)
241 | * [Malthusian Reinforcement Learning](MRL.md) 3 Mar 2019 [arxiv](https://arxiv.org/pdf/1812.07019.pdf)
242 | * [Bayesian Action Decoder for Deep Multi-Agent Reinforcement Learning](bad.md) 4 Nov 2018
243 | * [INTRINSIC SOCIAL MOTIVATION VIA CAUSAL INFLUENCE IN MULTI-AGENT RL](ISMCI.md) 19 Oct 2018
244 | * [QMIX: Monotonic Value Function Factorisation for Deep Multi-Agent Reinforcement Learning](http://www.cs.ox.ac.uk/people/shimon.whiteson/pubs/rashidicml18.pdf) 30 Mar 2018
245 | * [Modeling Others using Oneself in Multi-Agent Reinforcement Learning](SOM.md) 26 Feb 2018
246 | * [The Mechanics of n-Player Differentiable Games](SGA.md) 15 Feb 2018
247 | * [Continuous Adaptation via Meta-Learning in Nonstationary and Competitive Environments](RoboSumo.md) 10 Oct 2017
248 | * [Learning with Opponent-Learning Awareness](LOLA.md) 13 Sep 2017
249 | * [Counterfactual Multi-Agent Policy Gradients](COMA.md)
250 | * [Multi-Agent Actor-Critic for Mixed Cooperative-Competitive Environments](MADDPG.md) 7 Jun 2017
251 | * [Multiagent Bidirectionally-Coordinated Nets for Learning to Play StarCraft Combat Games](BiCNet.md) 29 Mar 2017
252 |
253 | ## New design
254 |
255 | * [IMPALA: Scalable Distributed Deep-RL with Importance Weighted Actor-Learner Architectures](https://arxiv.org/pdf/1802.01561.pdf) 9 Feb 2018
256 | * [Reverse Curriculum Generation for Reinforcement Learning](RECUR.md)
257 | * [Trial without Error: Towards Safe Reinforcement Learning via Human Intervention](HIRL.md)
258 | * [Learning to Design Games: Strategic Environments in Deep Reinforcement Learning](DualMDP.md) 5 July 2017
259 |
260 | ## Multitask
261 |
262 | * [Kickstarting Deep Reinforcement Learning](https://arxiv.org/pdf/1803.03835.pdf) 10 Mar 2018
263 | * [Zero-Shot Task Generalization with Multi-Task Deep Reinforcement Learning](ZSTG.md) 7 Nov 2017
264 | * [Distral: Robust Multitask Reinforcement Learning](Distral.md) 13 July 2017
265 |
266 | ## Observational Learning
267 |
268 | * [Observational Learning by Reinforcement Learning](OLRL.md) 20 Jun 2017
269 |
270 | ## Meta Learning
271 |
272 | * [Discovery of Useful Questions as Auxiliary Tasks](GVF.md) 10 Sep 2019 [arxiv](https://arxiv.org/pdf/1909.04607.pdf)
273 | * [Meta-learning of Sequential Strategies](MetaSS.md) 8 May 2019 [arxiv](https://arxiv.org/pdf/1905.03030.pdf)
274 | * [Efficient Off-Policy Meta-Reinforcement Learning via Probabilistic Context Variables](PEARL.md) 19 Mar 2019 [arxiv](https://arxiv.org/pdf/1903.08254.pdf)
275 | * [Some Considerations on Learning to Explore via Meta-Reinforcement Learning](E2.md) 11 Jan 2019 [arxiv](https://arxiv.org/pdf/1803.01118.pdf)
276 | * [Meta-Gradient Reinforcement Learning](MGRL.md) 24 May 2018 [arxiv](https://arxiv.org/pdf/1805.09801.pdf)
277 | * [ProMP: Proximal Meta-Policy Search](ProMP.md) 16 Oct 2018 [arxiv](https://arxiv.org/pdf/1810.06784)
278 | * [Unsupervised Meta-Learning for Reinforcement Learning](UML.md) 12 Jun 2018
279 |
280 | ## Distributional
281 |
282 | * [GAN Q-learning](GANQL.md) 20 July 2018
283 | * [Implicit Quantile Networks for Distributional Reinforcement Learning](IQN.md) 14 Jun 2018
284 | * [Nonlinear Distributional Gradient Temporal-Difference Learning](GTD.md) 20 May 2018
285 | * [DISTRIBUTED DISTRIBUTIONAL DETERMINISTIC POLICY GRADIENTS](D4PG.md) 23 Apr 2018
286 | * [An Analysis of Categorical Distributional Reinforcement Learning](C51-analysis.md) 22 Feb 2018
287 | * [Distributional Reinforcement Learning with Quantile Regression](QR-DQN.md) 27 Oct 2017
288 | * [A Distributional Perspective on Reinforcement Learning](C51.md) 21 July 2017
289 |
290 | ## Planning
291 |
292 | * [Search on the Replay Buffer: Bridging Planning and Reinforcement Learning](SoRB.md) 12 June 2019 [arxiv](https://arxiv.org/pdf/1906.05253.pdf)
293 |
294 | ## Safety
295 |
296 | * [Robust Reinforcement Learning for Continuous Control with Model Misspecification](MPO.md) 18 Jun 2019 [arxiv](https://arxiv.org/pdf/1906.07516.pdf)
297 | * [Verifiable Reinforcement Learning via Policy Extraction](Viper.md) 22 May 2018 [arxiv](https://arxiv.org/pdf/1805.08328.pdf)
298 |
299 | ## Inverse RL
300 |
301 | * [ADDRESSING SAMPLE INEFFICIENCY AND REWARD BIAS IN INVERSE REINFORCEMENT LEARNING](OP-GAIL.md) 9 Sep 2018
302 |
303 | ## No reward RL
304 |
305 | * [Fast Task Inference with Variational Intrinsic Successor Features](VISR.md) 2 Jun 2019 [arxiv](https://arxiv.org/pdf/1906.05030.pdf)
306 | * [Curiosity-driven Exploration by Self-supervised Prediction](https://arxiv.org/pdf/1705.05363) 15 May 2017
307 |
308 | ## Time
309 |
310 | * [Interval timing in deep reinforcement learning agents](Intervaltime.md) 31 May 2019 [arxiv](https://arxiv.org/pdf/1905.13469.pdf)
311 | * [Time Limits in Reinforcement Learning](PEB.md)
312 |
313 | ## Adversarial learning
314 |
315 | * [Sample-efficient Adversarial Imitation Learning from Observation](LQR+GAIfO.md) 18 Jun 2019 [arxiv](https://arxiv.org/pdf/1906.07374.pdf)
316 |
317 | ## Use Natural Language
318 |
319 | * [Using Natural Language for Reward Shaping in Reinforcement Learning](LEARN.md) 31 May 2019 [arxiv](https://www.cs.utexas.edu/~ai-lab/downloadPublication.php?filename=http://www.cs.utexas.edu/users/ml/papers/goyal.ijcai19.pdf&pubid=127757)
320 |
321 | ## Generative and contrastive representation learning
322 |
323 | * [Unsupervised State Representation Learning in Atari](ST-DIM.md) 19 Jun 2019 [arxiv](https://arxiv.org/pdf/1906.08226.pdf)
324 |
325 | ## Belief
326 |
327 | * [Shaping Belief States with Generative Environment Models for RL](GenerativeBelief.md) 24 Jun 2019 [arxiv](https://arxiv.org/pdf/1906.09237v2.pdf)
328 |
329 | ## PAC
330 | * [Provably Convergent Off-Policy Actor-Critic with Function Approximation](COF-PAC.md) 11 Nov 2019 [arxiv](https://arxiv.org/pdf/1911.04384.pdf)
331 |
332 |
333 | ## Applications
334 | * [Benchmarks for Deep Off-Policy Evaluation](bdope.md) 30 Mar 2021 [arxiv](https://arxiv.org/pdf/2103.16596.pdf)
335 | * [Learning Reciprocity in Complex Sequential Social Dilemmas](Reciprocity.md) 19 Mar 2019 [arxiv](https://arxiv.org/pdf/1903.08082.pdf)
336 | * [DeepMimic: Example-Guided Deep Reinforcement Learning of Physics-Based Character Skills](dmimic.md) 9 Apr 2018
337 | * [TUNING RECURRENT NEURAL NETWORKS WITH REINFORCEMENT LEARNING](RLTUNER.md)
338 |
--------------------------------------------------------------------------------
/RECUR.md:
--------------------------------------------------------------------------------
1 | # Reverse Curriculum Generation for Reinforcement Learning
2 |
3 | Carlos Florensa, David Held, Markus Wulfmeier, Pieter Abbeel
4 |
5 | Many relevant tasks require an agent to reach a certain state, or to manipulate objects into a desired configuration. For example, we might want a robot to align and assemble a gear onto an axle or insert and turn a key in a lock. These tasks present considerable difficulties for reinforcement learning approaches, since the natural reward function for such goal-oriented tasks is sparse and prohibitive amounts of exploration are required to reach the goal and receive a learning signal. Past approaches tackle these problems by manually designing a task-specific reward shaping function to help guide the learning. Instead, we propose a method to learn these tasks without requiring any prior task knowledge other than obtaining a single state in which the task is achieved. The robot is trained in "reverse", gradually learning to reach the goal from a set of starting positions increasingly far from the goal. Our method automatically generates a curriculum of starting positions that adapts to the agent's performance, leading to efficient training on such tasks. We demonstrate our approach on difficult simulated fine-grained manipulation problems, not solvable by state-of-the-art reinforcement learning methods.
6 |
--------------------------------------------------------------------------------
/REETDQN.md:
--------------------------------------------------------------------------------
1 | Investigating Recurrence and Eligibility Traces in Deep Q-Networks
2 |
--------------------------------------------------------------------------------
/RLCRC.md:
--------------------------------------------------------------------------------
1 | Reinforcement Learning with Corrupted Reward Channel
2 |
3 | Tom Everitt, Victoria Krakovna, Laurent Orseau, Marcus Hutter, and Shane Legg
4 |
5 | https://arxiv.org/pdf/1705.08417.pdf
6 |
7 | No real-world reward function is perfect. Sensory errors and software bugs may result in RL agents
8 | observing higher (or lower) rewards than they should. For example, a reinforcement learning agent may
9 | prefer states where a sensory error gives it the maximum reward, but where the true reward is actually
10 | small. We formalise this problem as a generalised Markov Decision Problem called Corrupt Reward MDP.
11 | Traditional RL methods fare poorly in CRMDPs, even under strong simplifying assumptions and when
12 | trying to compensate for the possibly corrupt rewards. Two ways around the problem are investigated.
13 | First, by giving the agent richer data, such as in inverse reinforcement learning and semi-supervised
14 | reinforcement learning, reward corruption stemming from systematic sensory errors may sometimes be
15 | completely managed. Second, by using randomisation to blunt the agent’s optimisation, reward corruption
16 | can be partially managed under some assumptions.
17 |
--------------------------------------------------------------------------------
/RLNL.md:
--------------------------------------------------------------------------------
1 | # A Survey of Reinforcement Learning Informed by Natural Language
2 | > Jelena Luketina, Nantas Nardelli, Gregory Farquhar, Jakob Foerster, Jacob Andreas, Edward Grefenstette, Shimon Whiteson, Tim Rocktäschel
3 |
4 | ## Abstract
5 | To be successful in real-world tasks, Reinforcement Learning (RL) needs to exploit the compositional, relational, and hierarchical structure of the world, and learn to transfer it to the task at hand.
6 |
7 | Recent advances in representation learning for language make it possible to build models that acquire world knowledge from text corpora and integrate this knowledge into downstream decision making problems.
8 |
9 | We thus argue that the time is right to investigate a tight integration of natural language understanding into RL in particular.
10 |
11 | We survey the state of the field, including work on instruction following, text games, and learning from textual domain knowledge. Finally, we call for the development of new environments as well as further investigation into the potential uses of recent Natural Language Processing (NLP) techniques for such tasks.
12 |
--------------------------------------------------------------------------------
/RLP.md:
--------------------------------------------------------------------------------
1 | Randomized Linear Programming Solves the Discounted Markov Decision Problem In Nearly-Linear (Sometimes Sublinear) Run Time
2 |
3 | Mengdi Wang, Princeton
4 | https://arxiv.org/pdf/1704.01869.pdf
5 |
6 | We propose a novel randomized linear programming algorithm for approximating the optimal policy of the discounted Markov decision problem. By leveraging the value-policy duality and binary-tree
7 | data structures, the algorithm adaptively samples state-action-state transitions and makes exponentiated primal-dual updates.
8 |
9 | We show that it finds an ε-optimal policy using nearly-linear run time in the worst
10 | case.
11 |
12 | When the Markov decision process is ergodic and specified in some special data formats, the algorithm finds an ε-optimal policy using run time linear in the total number of state-action pairs, which
13 | is sublinear in the input size. These results provide a new venue and complexity benchmarks for solving stochastic dynamic programs.
14 |
--------------------------------------------------------------------------------
/RLTUNER.md:
--------------------------------------------------------------------------------
1 | The approach of training sequence models using supervised learning and next-step
2 | prediction suffers from known failure modes. For example, it is notoriously
3 | difficult to ensure multi-step generated sequences have coherent global structure. We
4 | propose a novel sequence-learning approach in which we use a pre-trained Recurrent
5 | Neural Network (RNN) to supply part of the reward value in a Reinforcement
6 | Learning (RL) model. Thus, we can refine a sequence predictor by optimizing
7 | for some imposed reward functions, while maintaining good predictive properties
8 | learned from data. We propose efficient ways to solve this by augmenting deep
9 | Q-learning with a cross-entropy reward and deriving novel off-policy methods for
10 | RNNs from KL control. We explore the usefulness of our approach in the context
11 | of music generation. An LSTM is trained on a large corpus of songs to predict
12 | the next note in a musical sequence. This Note-RNN is then refined using our
13 | method and rules of music theory. We show that by combining maximum likelihood
14 | (ML) and RL in this way, we can not only produce more pleasing melodies,
15 | but significantly reduce unwanted behaviors and failure modes of the RNN, while
16 | maintaining information learned from data.
17 |
--------------------------------------------------------------------------------
/ROMMEO.md:
--------------------------------------------------------------------------------
1 | # A Regularized Opponent Model with Maximum Entropy Objective
2 | > Zheng Tian, Ying Wen, Zhichen Gong, Faiz Punakkath, Shihao Zou, and Jun Wang
3 |
4 |
5 | ## Abstract
6 | In a single-agent setting, reinforcement learning (RL) tasks can be cast into an inference problem by introducing a binary random variable o, which stands for the “optimality”.
7 |
8 | In this paper, we redefine the binary random variable o in multi-agent setting and formalize multi-agent reinforcement learning (MARL) as probabilistic inference.
9 |
10 | We derive a variational lower bound of the likelihood of achieving the optimality and name it as Regularized Opponent Model with Maximum Entropy Objective (ROMMEO).
11 |
12 | From ROMMEO, we present a novel perspective on opponent modeling and show how it can improve the performance of training agents theoretically and empirically in cooperative games.
13 |
14 | To optimize ROMMEO, we first introduce a tabular Q-iteration method ROMMEO-Q with proof of convergence. We extend the exact algorithm to complex environments by proposing an approximate version, ROMMEO-AC.
15 |
16 | We evaluate these two algorithms on the challenging iterated matrix game and differential game respectively and show that they can outperform strong MARL baselines.
17 |
--------------------------------------------------------------------------------
/RVF.md:
--------------------------------------------------------------------------------
1 | # Recurrent Value Functions
2 | > Pierre Thodoroff, Nishanth Anand, Lucas Caccia, Doina Precup, Joelle Pineau
3 |
4 | ## Abstract
5 | Despite recent successes in Reinforcement Learning, value-based methods often suffer from high variance hindering performance.
6 |
7 | In this paper, we illustrate this in a continuous control setting where state-of-the-art methods perform poorly whenever sensor noise is introduced.
8 |
9 | To overcome this issue, we introduce Recurrent Value Functions (RVFs) as an alternative to estimate the value function of a state.
10 |
11 | We propose to estimate the value function of the current state using the value function of past states visited along the trajectory.
12 |
13 | Due to the nature of their formulation, RVFs have a natural way of learning an emphasis function that selectively emphasizes important states.
14 |
15 | First, we establish RVF’s asymptotic convergence properties in tabular settings.
16 |
17 | We then demonstrate their robustness on a partially observable domain and continuous control tasks.
18 |
19 | Finally, we provide a qualitative interpretation of the learned emphasis function.
20 |
--------------------------------------------------------------------------------
/Rainbow.md:
--------------------------------------------------------------------------------
1 | # Rainbow: Combining Improvements in Deep Reinforcement Learning
2 | Matteo Hessel, Joseph Modayil, Hado van Hasselt, Tom Schaul, Georg Ostrovski, Will Dabney, Dan Horgan, Bilal Piot, Mohammad Azar, David Silver
3 | ## Abstract
4 | The deep reinforcement learning community has made several independent improvements to the DQN algorithm. However, it is unclear which of these extensions are complementary and can be fruitfully combined. This paper examines six extensions to the DQN algorithm and empirically studies their combination. Our experiments show that the combination provides state-of-the-art performance on the Atari 2600 benchmark, both in terms of data efficiency and final performance. We also provide results from a detailed ablation study that shows the contribution of each component to overall performance.
--------------------------------------------------------------------------------
/RayInterference.md:
--------------------------------------------------------------------------------
1 | # Ray Interference: a Source of Plateaus in Deep Reinforcement Learning
2 | > Tom Schaul, Diana Borsa, Joseph Modayil and Razvan Pascanu
3 |
4 | ## Abstract
5 | Rather than proposing a new method, this paper investigates an issue present in existing learning algorithms.
6 |
7 | We study the learning dynamics of reinforcement learning (RL), specifically a characteristic coupling between learning and data generation that arises because RL agents control their future data distribution.
8 |
9 | In the presence of function approximation, this coupling can lead to a problematic type of ‘ray interference’, characterized by learning dynamics that sequentially traverse a number of performance plateaus, effectively constraining the agent to learn one thing at a time even when learning in parallel is better.
10 |
11 | We establish the conditions under which ray interference occurs, show its relation to saddle points and obtain the exact learning dynamics in a restricted setting. We characterize a number of its properties and discuss possible remedies.
12 |
--------------------------------------------------------------------------------
/Reciprocity.md:
--------------------------------------------------------------------------------
1 | # Learning Reciprocity in Complex Sequential Social Dilemmas
2 | > Tom Eccles,
3 | Edward Hughes,
4 | János Kramár,
5 | Steven Wheelwright,
6 | Joel Z. Leibo
7 |
8 |
9 | ## ABSTRACT
10 | Reciprocity is an important feature of human social interaction and underpins our cooperative nature.
11 |
12 | What is more, simple forms of reciprocity have proved remarkably resilient in matrix game social dilemmas.
13 |
14 | Most famously, the tit-for-tat strategy performs very well in tournaments of Prisoner’s Dilemma.
15 |
16 | Unfortunately this strategy is not readily applicable to the real world, in which options to cooperate or defect are temporally and spatially extended.
17 |
18 | Here, we present a general online reinforcement learning algorithm that displays reciprocal behavior towards its co-players.
19 |
20 | We show that it can induce pro-social outcomes for the wider group when learning alongside selfish agents, both in a 2-player Markov game, and in 5-player intertemporal social dilemmas.
21 |
22 | We analyse the resulting policies to show that the reciprocating agents are strongly influenced by their co-players’ behavior.
23 |
--------------------------------------------------------------------------------
/RoboSumo.md:
--------------------------------------------------------------------------------
1 | Continuous Adaptation via Meta-Learning in Nonstationary and Competitive Environments
2 |
3 | Maruan Al-Shedivat∗
4 | CMU
5 | Trapit Bansal
6 | UMass Amherst
7 | Yuri Burda
8 | OpenAI
9 | Ilya Sutskever
10 | OpenAI
11 | Igor Mordatch
12 | OpenAI
13 | Pieter Abbeel
14 | UC Berkeley, OpenAI
15 |
16 | Ability to **continuously learn and adapt** from limited experience in **nonstationary environments** is an important milestone on the path towards general intelligence. In this paper, we cast the problem of continuous adaptation into the learning-to-learn framework. We develop a simple gradient-based meta-learning algorithm suitable for adaptation in dynamically changing and adversarial scenarios. Additionally, we design a new multi-agent competitive environment, RoboSumo, and define iterated adaptation games for testing various aspects of continuous adaptation strategies.
17 |
18 | We demonstrate that meta-learning enables significantly more efficient adaptation than reactive baselines in the few-shot regime. Our experiments with a population of agents that learn and compete suggest that meta-learners are the fittest.
19 |
--------------------------------------------------------------------------------
/SGA.md:
--------------------------------------------------------------------------------
1 | # The Mechanics of n-Player Differentiable Games
2 | David Balduzzi, Sébastien Racanière, James Martens, Jakob Foerster, Karl Tuyls, Thore Graepel
3 |
4 | ## abstract
5 | The cornerstone underpinning deep learning is the guarantee that gradient descent on an objective converges to local minima.
6 |
7 | Unfortunately, this guarantee fails in settings, such as generative adversarial nets, where there are multiple interacting losses.
8 |
9 | The behavior of gradient-based methods in games is not well understood – and is becoming increasingly important as adversarial and multiobjective architectures proliferate.
10 |
11 | In this paper, we develop new techniques to understand and control the dynamics in general games.
12 |
13 | The key result is to decompose the second-order dynamics into two components.
14 |
15 | The first is related to potential games, which reduce to gradient descent on an implicit function; the second relates to Hamiltonian games, a new class of games that obey a conservation law, akin to conservation laws in classical mechanical systems.
16 |
17 | The decomposition motivates Symplectic Gradient Adjustment (SGA), a new algorithm for finding stable fixed points in general games.
18 |
19 | Basic experiments show SGA is competitive with recently proposed algorithms for finding stable fixed points in GANs – whilst at the same time being applicable to – and having guarantees in – much more general games.
20 |
--------------------------------------------------------------------------------
/SOM.md:
--------------------------------------------------------------------------------
1 | # Modeling Others using Oneself in Multi-Agent Reinforcement Learning
2 |
3 | > Roberta Raileanu, Emily Denton, Arthur Szlam, Rob Fergus
4 |
5 | ## Abstract
6 | We consider the multi-agent reinforcement learning setting with imperfect information.
7 |
8 | The reward function depends on the hidden goals of both agents, so the agents must infer the other players’ goals from their observed behavior in order to maximize their returns.
9 |
10 | We propose a new approach for learning in these domains: Self Other-Modeling (SOM), in which an agent uses its own policy to predict the other agent’s actions and update its belief of their hidden goal in an online manner.
11 |
12 | We evaluate this approach on three different tasks and show that the agents are able to learn better policies using their estimate of the other players’ goals, in both cooperative and competitive settings.
13 |
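A rough sketch of the "use your own policy to infer the other's goal" idea: keep the policy weights fixed and take gradient steps on an estimate of the other agent's hidden goal so that your policy, conditioned on that goal, explains the other agent's observed action. The network sizes, soft one-hot goal parameterization, and optimizer are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

obs_dim, goal_dim, n_actions = 8, 4, 5
policy = nn.Sequential(nn.Linear(obs_dim + goal_dim, 32), nn.Tanh(),
                       nn.Linear(32, n_actions))            # my own policy network

goal_logits = torch.zeros(goal_dim, requires_grad=True)      # belief over the other's goal
opt = torch.optim.Adam([goal_logits], lr=0.1)                 # only the belief is updated

def update_belief(other_obs, other_action):
    """One online inference step after observing the other agent act."""
    goal = F.softmax(goal_logits, dim=-1)                     # soft one-hot goal estimate
    logits = policy(torch.cat([other_obs, goal]))
    loss = F.cross_entropy(logits.unsqueeze(0), other_action.unsqueeze(0))
    opt.zero_grad(); loss.backward(); opt.step()              # policy weights stay fixed

update_belief(torch.randn(obs_dim), torch.tensor(2))
```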
--------------------------------------------------------------------------------
/SPU.md:
--------------------------------------------------------------------------------
1 | # SUPERVISED POLICY UPDATE FOR DEEP REINFORCEMENT LEARNING
2 | > Quan Vuong, Yiming Zhang, and Keith Ross
3 |
4 | ## ABSTRACT
5 | We propose a new sample-efficient methodology, called Supervised Policy Update (SPU), for deep reinforcement learning.
6 |
7 | Starting with data generated by the current policy, SPU formulates and solves a constrained optimization problem in the non-parameterized proximal policy space.
8 |
9 | Using supervised regression, it then converts the optimal non-parameterized policy to a parameterized policy, from which it draws new samples.
10 |
11 | The methodology is general in that it applies to both discrete and continuous action spaces, and can handle a wide variety of proximity constraints for the non-parameterized optimization problem.
12 |
13 | We show how the Natural Policy Gradient and Trust Region Policy Optimization (NPG/TRPO) problems, and the Proximal Policy Optimization (PPO) problem can be addressed by this methodology.
14 |
15 | The SPU implementation is much simpler than TRPO.
16 |
17 | In terms of sample efficiency, our extensive experiments show SPU outperforms TRPO in Mujoco simulated robotic tasks and outperforms PPO in Atari video game tasks.
18 |
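A schematic of the two-stage recipe described above, written for the textbook case of a KL proximity constraint; the exact constraints and solution forms used in the paper differ across its NPG/TRPO/PPO variants:

$$
\pi^{*}(\cdot\mid s) \;=\; \arg\max_{\pi}\; \mathbb{E}_{a\sim\pi}\!\left[A^{\pi_{\text{old}}}(s,a)\right]
\quad\text{s.t.}\quad
D_{\mathrm{KL}}\!\left(\pi(\cdot\mid s)\,\|\,\pi_{\text{old}}(\cdot\mid s)\right)\le\delta,
$$

whose penalized form (multiplier $\lambda$) gives $\pi^{*}(a\mid s)\propto \pi_{\text{old}}(a\mid s)\exp\!\big(A^{\pi_{\text{old}}}(s,a)/\lambda\big)$, followed by the supervised step

$$
\theta \;\leftarrow\; \arg\min_{\theta}\; \mathbb{E}_{s\sim\pi_{\text{old}}}\Big[D_{\mathrm{KL}}\big(\pi^{*}(\cdot\mid s)\,\|\,\pi_{\theta}(\cdot\mid s)\big)\Big].
$$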
--------------------------------------------------------------------------------
/SRL.md:
--------------------------------------------------------------------------------
1 | # S-RL Toolbox: Environments, Datasets and Evaluation Metrics for State Representation Learning
2 | > Antonin Raffin antonin.raffin@ensta-paristech.fr
3 | Ashley Hill ashley.hill@ensta-paristech.fr
4 | René Traoré rene.traore@ensta-paristech.fr
5 | Timothée Lesort timothee.lesort@ensta-paristech.fr
6 | Natalia Díaz-Rodríguez natalia.diaz@ensta-paristech.fr
7 | David Filliat david.filliat@ensta-paristech.fr
8 | U2IS, ENSTA ParisTech / INRIA FLOWERS Team, http://flowers.inria.fr
9 |
10 | ## Abstract
11 | State representation learning aims at learning compact representations from raw observations in robotics and control applications.
12 |
13 | Approaches used for this objective include autoencoders, forward models, inverse dynamics, and learning with generic priors on the state characteristics.
14 |
15 | However, the diversity in applications and methods makes the field lack standard evaluation datasets, metrics and tasks.
16 |
17 | This paper provides a set of environments, data generators, robotic control tasks, metrics and tools to facilitate iterative state representation learning and evaluation in reinforcement learning settings.
18 |
19 | **Keywords**: Deep learning, reinforcement learning, state representation learning, robotic
20 | priors
21 |
--------------------------------------------------------------------------------
/ST-DIM.md:
--------------------------------------------------------------------------------
1 | # Unsupervised State Representation Learning in Atari
2 | > Ankesh Anand, Evan Racah, Sherjil Ozair, Yoshua Bengio, Marc-Alexandre Côté, R Devon Hjelm
3 |
4 | ## Abstract
5 | State representation learning, or the ability to capture latent generative factors of an environment, is crucial for building intelligent agents that can perform a wide variety of tasks.
6 |
7 | Learning such representations without supervision from rewards is a challenging open problem.
8 |
9 | We introduce a method that learns state representations by maximizing mutual information across spatially and temporally distinct features of a neural encoder of the observations.
10 |
11 | We also introduce a new benchmark based on Atari 2600 games where we evaluate representations based on how well they capture the ground truth state variables.
12 |
13 | We believe this new framework for evaluating representation learning models will be crucial for future representation learning research.
14 |
15 | Finally, we compare our technique with other state-of-the-art generative and contrastive representation learning methods.
16 |
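A simplified InfoNCE-style sketch of the contrastive objective: features of an observation at time t should score higher against features from t+1 of the same trajectory than against other items in the batch. This is a global-global stand-in; the actual method also contrasts spatially local feature maps, and the encoder here is an assumed toy network.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

encoder = nn.Sequential(nn.Flatten(), nn.Linear(84 * 84, 256), nn.ReLU(),
                        nn.Linear(256, 64))
score_W = nn.Linear(64, 64, bias=False)    # bilinear scoring matrix

def infonce_loss(obs_t, obs_tp1):
    z_t, z_tp1 = encoder(obs_t), encoder(obs_tp1)    # (B, 64) each
    logits = score_W(z_t) @ z_tp1.t()                # (B, B) pairwise scores
    targets = torch.arange(obs_t.size(0))            # positives on the diagonal
    return F.cross_entropy(logits, targets)

loss = infonce_loss(torch.randn(16, 84, 84), torch.randn(16, 84, 84))
loss.backward()
```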
--------------------------------------------------------------------------------
/SVRL.md:
--------------------------------------------------------------------------------
1 | # Harnessing Structures for Value-Based Planning and Reinforcement Learning
2 | > Yuzhe Yang, Guo Zhang, Zhi Xu, Dina Katabi
3 |
4 | ## Abstract
5 | Value-based methods constitute a fundamental methodology in planning and deep reinforcement learning (RL).
6 |
7 | In this paper, we propose to exploit the underlying structures of the state-action value function, i.e., Q function, for both planning and deep RL.
8 |
9 | In particular, if the underlying system dynamics lead to some global structures of the Q function, one should be capable of inferring the function better by leveraging such structures.
10 |
11 | Specifically, we investigate the low-rank structure, which widely exists for big data matrices. We verify empirically the existence of low-rank Q functions in the context of control and deep RL tasks.
12 |
13 | As our key contribution, by leveraging Matrix Estimation (ME) techniques, we propose a general framework to exploit the underlying low-rank structure in Q functions.
14 |
15 | This leads to a more efficient planning procedure for classical control, and additionally, a simple scheme that can be applied to value-based RL techniques to consistently achieve better performance on "low-rank" tasks.
16 |
17 | Extensive experiments on control tasks and Atari games confirm the efficacy of our approach.
18 |
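A simplified stand-in for the matrix-estimation step: complete a partially observed Q-table by repeatedly projecting onto a low-rank approximation while re-imposing the observed entries. The paper uses general ME solvers; the hard-impute scheme, rank, and toy data below are assumptions.

```python
import numpy as np

def lowrank_complete(Q_obs, mask, rank=3, iters=100):
    """Fill missing Q(s, a) entries via alternating truncated-SVD projection."""
    Q = np.where(mask, Q_obs, 0.0)
    for _ in range(iters):
        U, s, Vt = np.linalg.svd(Q, full_matrices=False)
        Q_low = (U[:, :rank] * s[:rank]) @ Vt[:rank]    # best rank-r approximation
        Q = np.where(mask, Q_obs, Q_low)                # keep observed values
    return Q

# toy usage: a random rank-2 "Q-table" with 40% of entries observed
rng = np.random.default_rng(0)
true_Q = rng.normal(size=(50, 2)) @ rng.normal(size=(2, 6))
mask = rng.random(true_Q.shape) < 0.4
est_Q = lowrank_complete(true_Q, mask, rank=2)
print(np.abs(est_Q - true_Q).mean())   # small reconstruction error on unobserved entries
```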
--------------------------------------------------------------------------------
/SoRB.md:
--------------------------------------------------------------------------------
1 | # Search on the Replay Buffer: Bridging Planning and Reinforcement Learning
2 | > Benjamin Eysenbach, Ruslan Salakhutdinov, Sergey Levine
3 |
4 | ## Abstract
5 | The history of learning for control has been an exciting back and forth between two broad classes of algorithms: planning and reinforcement learning.
6 |
7 | Planning algorithms effectively reason over long horizons, but assume access to a local policy and distance metric over collision-free paths.
8 |
9 | Reinforcement learning excels at learning policies and the relative values of states, but fails to plan over long horizons.
10 |
11 | Despite the successes of each method in various domains, tasks that require reasoning over long horizons with limited feedback and high-dimensional observations remain exceedingly challenging for both planning and reinforcement learning algorithms.
12 |
13 | Frustratingly, these sorts of tasks are potentially the most useful, as they are simple to design (a human only needs to provide an example goal state) and avoid reward shaping, which can bias the agent towards finding a sub-optimal solution. We introduce a general-purpose control algorithm that combines the strengths of planning and reinforcement learning to effectively solve these tasks.
14 |
15 | Our aim is to decompose the task of reaching a distant goal state into a sequence of easier tasks, each of which corresponds to reaching a particular subgoal.
16 |
17 | Planning algorithms can automatically find these waypoints, but only if provided with suitable abstractions of the environment – namely, a graph consisting of nodes and edges.
18 |
19 | Our main insight is that this graph can be constructed via reinforcement learning, where a goal-conditioned value function provides edge weights, and nodes are taken to be previously seen observations in a replay buffer.
20 |
21 | Using graph search over our replay buffer, we can automatically generate this sequence of subgoals, even in image-based environments.
22 |
23 | Our algorithm, search on the replay buffer (SoRB), enables agents to solve sparse reward tasks over one hundred steps, and generalizes substantially better than standard RL algorithms.
24 |
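A sketch of the graph-search step: replay-buffer observations become nodes, a goal-conditioned distance (learned from the value function in the paper; stubbed with a plain metric here) gives edge weights, and shortest-path search returns the subgoal chain. The `networkx` usage, cutoff, and toy data are illustrative assumptions.

```python
import itertools
import networkx as nx
import numpy as np

def build_graph(buffer_obs, distance_fn, max_dist=5.0):
    """Nodes are buffer observations; edges keep only 'reachable' pairs."""
    g = nx.DiGraph()
    g.add_nodes_from(range(len(buffer_obs)))
    for i, j in itertools.permutations(range(len(buffer_obs)), 2):
        d = distance_fn(buffer_obs[i], buffer_obs[j])   # e.g. derived from the learned value
        if d < max_dist:
            g.add_edge(i, j, weight=d)
    return g

def plan_waypoints(g, buffer_obs, start, goal, distance_fn, max_dist=5.0):
    """Connect start/goal to nearby nodes, then return the chain of subgoal observations."""
    g = g.copy()
    g.add_node("start"); g.add_node("goal")
    for i, obs in enumerate(buffer_obs):
        if distance_fn(start, obs) < max_dist:
            g.add_edge("start", i, weight=distance_fn(start, obs))
        if distance_fn(obs, goal) < max_dist:
            g.add_edge(i, "goal", weight=distance_fn(obs, goal))
    path = nx.shortest_path(g, "start", "goal", weight="weight")
    return [buffer_obs[i] for i in path[1:-1]]          # intermediate waypoints

# toy usage: Euclidean distance standing in for the learned value-based distance
obs = [np.array([x, 0.0]) for x in range(10)]
dist = lambda a, b: float(np.linalg.norm(a - b))
graph = build_graph(obs, dist, max_dist=1.5)
print(plan_waypoints(graph, obs, np.array([-0.5, 0.0]), np.array([9.5, 0.0]), dist, max_dist=1.5))
```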
--------------------------------------------------------------------------------
/TRPO.md:
--------------------------------------------------------------------------------
1 | # Trust Region Policy Optimization
2 | John Schulman, Sergey Levine, Philipp Moritz, Michael Jordan, Pieter Abbeel
3 |
4 | > **Iterative procedure** for *optimizing policies*, with **guaranteed monotonic improvement**.
5 |
6 | ## Big picture:
7 |
8 | The theory guarantees monotonic improvement, but a practical algorithm requires a series of approximations.
9 | Practical algorithm: Trust Region Policy Optimization (TRPO)
10 |
11 | ## Characteristics of TRPO:
12 |
13 | 1. similar to natural policy gradient methods
14 | 2. effective for optimizing large nonlinear policies such as neural networks.
15 |
16 | ## Experiments:
17 |
18 | performs robustly on a wide variety of tasks:
19 | 1. learning simulated robotic swimming, hopping, and walking gaits
20 | 2. playing Atari games using images of the screen as input.
21 |
22 | ## To remind:
23 |
24 | 1. approximations deviate from the theory
25 | 2. TRPO **tends** to give monotonic improvement, with little tuning of hyperparameters.
26 |
27 | 
28 |
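For reference, the constrained surrogate problem solved at each iteration (standard form; the step size $\delta$ is a hyperparameter):

$$
\max_{\theta}\;\; \mathbb{E}_{s\sim\rho_{\theta_{\text{old}}},\,a\sim\pi_{\theta_{\text{old}}}}\!\left[\frac{\pi_{\theta}(a\mid s)}{\pi_{\theta_{\text{old}}}(a\mid s)}\,A^{\pi_{\theta_{\text{old}}}}(s,a)\right]
\quad\text{s.t.}\quad
\mathbb{E}_{s\sim\rho_{\theta_{\text{old}}}}\!\left[D_{\mathrm{KL}}\!\big(\pi_{\theta_{\text{old}}}(\cdot\mid s)\,\|\,\pi_{\theta}(\cdot\mid s)\big)\right]\le\delta .
$$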
--------------------------------------------------------------------------------
/UBE.md:
--------------------------------------------------------------------------------
1 | The Uncertainty Bellman Equation and Exploration
2 |
3 | Brendan O’Donoghue, Ian Osband, Remi Munos, Volodymyr Mnih
4 | DeepMind
5 | {bodonoghue, iosband, munos, vmnih}@google.com
6 | September 19, 2017
7 |
8 | Abstract
9 |
10 | We consider the exploration/exploitation problem in reinforcement learning. For exploitation, it is
11 | well known that the Bellman equation connects the value at any time-step to the expected value at
12 | subsequent time-steps. In this paper we consider a similar uncertainty Bellman equation (UBE), which
13 | connects the uncertainty at any time-step to the expected uncertainties at subsequent time-steps, thereby
14 | extending the potential exploratory benefit of a policy beyond individual time-steps. We prove that the
15 | unique fixed point of the UBE yields an upper bound on the variance of the estimated value of any fixed
16 | policy. This bound can be much tighter than traditional count-based bonuses that compound standard
17 | deviation rather than variance. Importantly, and unlike several existing approaches to optimism, this
18 | method scales naturally to large systems with complex generalization. Substituting our UBE-exploration
19 | strategy for ϵ-greedy improves DQN performance on 51 out of 57 games in the Atari suite.
20 |
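Schematically, the uncertainty Bellman equation mirrors the ordinary Bellman equation, propagating a local uncertainty term in place of reward (notation simplified from the paper, which works in the finite-horizon setting):

$$
u^{\pi}(s,a) \;=\; \nu(s,a) \;+\; \gamma^{2}\sum_{s',a'} P(s'\mid s,a)\,\pi(a'\mid s')\,u^{\pi}(s',a'),
$$

where $\nu(s,a)$ is the local uncertainty and the fixed point $u^{\pi}$ upper-bounds the variance of the estimated value of $\pi$.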
--------------------------------------------------------------------------------
/UML.md:
--------------------------------------------------------------------------------
1 | # Unsupervised Meta-Learning for Reinforcement Learning
2 | Abhishek Gupta
3 | University of California, Berkeley
4 | abhigupta@eecs.berkeley.edu
5 | Benjamin Eysenbach
6 | Google
7 | eysenbach@google.com
8 | Chelsea Finn
9 | University of California, Berkeley
10 | cbfinn@eecs.berkeley.edu
11 | Sergey Levine
12 | University of California, Berkeley
13 | svlevine@eecs.berkeley.edu
14 |
15 | ## Abstract
16 | Meta-learning is a powerful tool that builds on multi-task learning to learn how to quickly adapt a model to new tasks.
17 |
18 | In the context of reinforcement learning, meta-learning algorithms can acquire reinforcement learning procedures to solve new problems more efficiently by meta-learning prior tasks.
19 |
20 | The performance of meta-learning algorithms critically depends on the tasks available for meta-training:
21 |
22 | in the same way that supervised learning algorithms generalize best to test points drawn from the same distribution as the training points, meta-learning methods generalize best to tasks from the same distribution as the meta-training tasks.
23 |
24 | In effect, meta-reinforcement learning offloads the design burden from algorithm design to task design.
25 |
26 | If we can automate the process of task design as well, we can devise a meta-learning algorithm that is truly automated.
27 |
28 | In this work, we take a step in this direction, proposing a family of unsupervised meta-learning algorithms for reinforcement learning. We describe a general recipe for unsupervised meta-reinforcement learning, and describe an effective instantiation of this approach based on a recently proposed unsupervised exploration technique and model-agnostic meta-learning. We also discuss practical and conceptual considerations for developing unsupervised meta-learning methods.
29 |
30 | Our experimental results demonstrate that unsupervised meta-reinforcement learning effectively acquires accelerated reinforcement learning procedures without the need for manual task design, significantly exceeds the performance of learning from scratch, and even matches performance of meta-learning methods that use hand-specified task distributions.
31 |
--------------------------------------------------------------------------------
/UNREAL.md:
--------------------------------------------------------------------------------
1 | # REINFORCEMENT LEARNING WITH UNSUPERVISED AUXILIARY TASKS
2 |
3 | Max Jaderberg, Volodymyr Mnih, Wojciech Marian Czarnecki
4 | Tom Schaul, Joel Z Leibo, David Silver & Koray Kavukcuoglu
5 |
6 | This paper brings together the state-of-the-art Asynchronous Advantage Actor-Critic (A3C) framework (Mnih et al., 2016), outlined in Section 2, with auxiliary control tasks and auxiliary reward tasks, defined in Sections 3.1 and 3.2 respectively.
7 |
8 | * A3C
9 | * Auxiliary control tasks
10 | * Auxiliary reward tasks
11 |
12 | 
13 |
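In the paper the overall objective is a weighted sum of the base A3C loss and the auxiliary losses (pixel control, reward prediction, and value-function replay); schematically, with the $\lambda$ symbols denoting the usual weighting hyperparameters:

$$
\mathcal{L}_{\text{UNREAL}} \;=\; \mathcal{L}_{\text{A3C}} \;+\; \lambda_{\text{PC}}\,\mathcal{L}_{\text{PC}} \;+\; \lambda_{\text{RP}}\,\mathcal{L}_{\text{RP}} \;+\; \lambda_{\text{VR}}\,\mathcal{L}_{\text{VR}} .
$$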
--------------------------------------------------------------------------------
/VALOR.md:
--------------------------------------------------------------------------------
1 | > Joshua Achiam
2 | UC Berkeley & OpenAI
3 | Harrison Edwards
4 | OpenAI
5 | Dario Amodei
6 | OpenAI
7 | Pieter Abbeel
8 | UC Berkeley
9 |
10 | https://arxiv.org/pdf/1807.10299.pdf
11 |
12 | ## Abstract
13 | We explore methods for option discovery based on variational inference and make two algorithmic contributions. First: we highlight a tight connection between variational option discovery methods and variational autoencoders, and introduce Variational Autoencoding Learning of Options by Reinforcement (VALOR), a new method derived from the connection. In VALOR, the policy encodes contexts from a noise distribution into trajectories, and the decoder recovers the contexts from the complete trajectories. Second: we propose a curriculum learning approach where
14 | the number of contexts seen by the agent increases whenever the agent’s performance is strong enough (as measured by the decoder) on the current set of contexts.
15 |
16 | We show that this simple trick stabilizes training for VALOR and prior variational option discovery methods, allowing a single agent to learn many more modes of behavior than it could with a fixed context distribution. Finally, we investigate other topics related to variational option discovery, including fundamental limitations of the general approach and the applicability of learned options to downstream tasks.
17 |
--------------------------------------------------------------------------------
/VICE.md:
--------------------------------------------------------------------------------
1 | # End-to-End Robotic Reinforcement Learning without Reward Engineering
2 | > Avi Singh, Larry Yang, Kristian Hartikainen, Chelsea Finn, Sergey Levine
3 |
4 | ## Abstract
5 | The combination of deep neural network models and reinforcement learning algorithms can make it possible to learn policies for robotic behaviors that directly read in raw sensory inputs, such as camera images, effectively subsuming both estimation and control into one model.
6 |
7 | However, real-world applications of reinforcement learning must specify the goal of the task by means of a manually programmed reward function, which in practice requires either designing the very same perception pipeline that end-to-end reinforcement learning promises to avoid, or else instrumenting the environment with additional sensors to determine if the task has been performed successfully.
8 |
9 | In this paper, we propose an approach for removing the need for manual engineering of reward specifications by enabling a robot to learn from a modest number of examples of successful outcomes, followed by actively solicited queries, where the robot shows the user a state and asks for a label to determine whether that state represents successful completion of the task.
10 |
11 | While requesting labels for every single state would amount to asking the user to manually provide the reward signal, our method requires labels for only a tiny fraction of the states seen during training, making it an efficient and practical approach for learning skills without manually engineered rewards.
12 |
13 | We evaluate our method on real-world robotic manipulation tasks where the observations consist of images viewed by the robot’s camera.
14 |
15 | In our experiments, our method effectively learns to arrange objects, place books, and drape cloth, directly from images and without any manually specified reward functions, and with only 1-4 hours of interaction with the real world.
16 |
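A heavily simplified sketch of the classifier-as-reward idea described above: success examples (plus actively queried labels) are positives, states visited by the current policy are negatives, and the classifier's log-probability of success replaces the hand-engineered reward. The network, labels, and reward form are assumptions; the paper frames this more carefully as inference over success events.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

clf = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 1))   # success classifier
opt = torch.optim.Adam(clf.parameters(), lr=1e-3)

def update_classifier(success_states, policy_states):
    """Positives: user-provided success examples; negatives: on-policy states."""
    x = torch.cat([success_states, policy_states])
    y = torch.cat([torch.ones(len(success_states)), torch.zeros(len(policy_states))])
    loss = F.binary_cross_entropy_with_logits(clf(x).squeeze(-1), y)
    opt.zero_grad(); loss.backward(); opt.step()

def reward(state):
    """Learned reward used in place of a hand-engineered one."""
    with torch.no_grad():
        return F.logsigmoid(clf(state)).item()          # log p(success | state)

update_classifier(torch.randn(32, 16), torch.randn(128, 16))
print(reward(torch.randn(16)))
```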
--------------------------------------------------------------------------------
/VISR.md:
--------------------------------------------------------------------------------
1 | # Fast Task Inference with Variational Intrinsic Successor Features
2 | > Steven Hansen, Will Dabney, André Barreto, Tom Van de Wiele, David Warde-Farley, Volodymyr Mnih
3 |
4 | ## Abstract
5 | It has been established that diverse behaviors spanning the controllable subspace of a Markov decision process can be trained by rewarding a policy for being distinguishable from other policies [Gregor et al., 2016, Eysenbach et al., 2018, Warde-Farley et al., 2018].
6 |
7 | However, one limitation of this formulation is generalizing behaviors beyond the finite set being explicitly learned, as is needed for use on subsequent tasks.
8 |
9 | Successor features [Dayan, 1993, Barreto et al., 2017] provide an appealing solution to this generalization problem, but require defining the reward function as linear in some grounded feature space.
10 |
11 | In this paper, we show that these two techniques can be combined, and that each method solves the other’s primary limitation.
12 |
13 | To do so we introduce Variational Intrinsic Successor FeatuRes (VISR), a novel algorithm which learns controllable features that can be leveraged to provide enhanced generalization and fast task inference through the successor feature framework.
14 |
15 | We empirically validate VISR on the full Atari suite, in a novel setup wherein the rewards are only exposed briefly after a long unsupervised phase.
16 |
17 | Achieving human-level performance on 14 games and beating all baselines, we believe VISR represents a step towards agents that rapidly learn from limited feedback.
18 |
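The successor-features machinery the abstract leans on, in its standard form: rewards are assumed linear in features $\phi$, so fast task inference reduces to inferring the weight vector $w$:

$$
r_{w}(s) \;=\; \phi(s)^{\top} w, \qquad
\psi^{\pi}(s,a) \;=\; \mathbb{E}_{\pi}\!\Big[\textstyle\sum_{t\ge 0}\gamma^{t}\,\phi(s_t)\,\Big|\,s_0=s,\,a_0=a\Big], \qquad
Q^{\pi}_{w}(s,a) \;=\; \psi^{\pi}(s,a)^{\top} w .
$$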
--------------------------------------------------------------------------------
/Viper.md:
--------------------------------------------------------------------------------
1 | # Verifiable Reinforcement Learning via Policy Extraction
2 | > Osbert Bastani, Yewen Pu, Armando Solar-Lezama
3 |
4 | While deep reinforcement learning has successfully solved many challenging control tasks, its real-world applicability has been limited by the inability to ensure the safety of learned policies.
5 |
6 | We propose an approach to verifiable reinforcement learning by training decision tree policies, which can represent complex policies (since they are nonparametric), yet can be efficiently verified using existing techniques (since they are highly structured).
7 |
8 | The challenge is that decision tree policies are difficult to train.
9 |
10 | We propose VIPER, an algorithm that combines ideas from model compression and imitation learning to learn decision tree policies guided by a DNN policy (called the oracle) and its Q-function, and show that it substantially outperforms two baselines.
11 |
12 | We use VIPER to
13 | - (i) learn a provably robust decision tree policy for a variant of Atari Pong with a symbolic state space,
14 | - (ii) learn a decision tree policy for a toy game based on Pong that provably never loses, and
15 | - (iii) learn a provably stable decision tree policy for cart-pole.
16 |
17 | In each case, the decision tree policy achieves performance equal to that of the original DNN policy.
18 |
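A sketch of the Q-guided dataset-aggregation loop using scikit-learn. The oracle interface, the weighting of states by the oracle's Q-value gap, the β mixing schedule, and the classic gym-style `env` API are simplified assumptions, not the paper's exact procedure.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def viper_style_training(env, oracle_policy, oracle_q, n_iters=10, rollout_len=1000):
    """DAgger-like loop: roll out, label states with the oracle's action, and weight
    each state by how much the action choice matters according to the oracle's Q."""
    states, actions, weights = [], [], []
    tree = None
    for it in range(n_iters):
        s = env.reset()
        for _ in range(rollout_len):
            a_collect = oracle_policy(s) if tree is None else int(tree.predict([s])[0])
            q = oracle_q(s)                               # vector of Q-values for state s
            states.append(s)
            actions.append(int(np.argmax(q)))             # oracle label
            weights.append(float(np.max(q) - np.min(q)))  # how much this state matters
            s, _, done, _ = env.step(a_collect)
            if done:
                s = env.reset()
        tree = DecisionTreeClassifier(max_depth=8)
        tree.fit(np.array(states), np.array(actions), sample_weight=np.array(weights))
    return tree
```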
--------------------------------------------------------------------------------
/ZSTG.md:
--------------------------------------------------------------------------------
1 | # Zero-Shot Task Generalization with Multi-Task Deep Reinforcement Learning
2 |
3 | > Junhyuk Oh, Satinder Singh, Honglak Lee, Pushmeet Kohli
4 |
5 | As a step towards developing zero-shot task generalization capabilities in reinforcement learning (RL), we introduce a new RL problem where the
6 | agent should learn to execute sequences of instructions after learning useful skills that solve subtasks.
7 |
8 | In this problem, we consider two types of generalizations:
9 |
10 | * to previously unseen instructions:
11 | For generalization over unseen instructions, we propose a new objective which encourages learning correspondences between similar subtasks by making analogies.
12 |
13 | * to longer sequences of instructions:
14 | For generalization over sequential instructions, we present a hierarchical architecture where a meta controller learns to use the acquired skills for executing the instructions. To deal with delayed reward, we propose a new neural architecture in the meta controller that learns when to update the subtask, which makes learning more efficient. Experimental results on a stochastic 3D domain show that the proposed ideas are crucial for generalization to longer instructions as well as unseen instructions.
15 |
--------------------------------------------------------------------------------
/_config.yml:
--------------------------------------------------------------------------------
1 | theme: jekyll-theme-leap-day
--------------------------------------------------------------------------------
/bad.md:
--------------------------------------------------------------------------------
1 | Bayesian Action Decoder for Deep Multi-Agent Reinforcement Learning
2 | Jakob N. Foerster, Francis Song, Edward Hughes, Neil Burch, Iain Dunning, Shimon Whiteson, Matthew Botvinick, Michael Bowling
3 |
4 |
5 | When observing the actions of others, humans carry out inferences about why the others acted as they did, and what this implies about their view of the world.
6 |
7 | Humans also use the fact that their actions will be interpreted in this manner when observed by others, allowing them to act informatively and thereby communicate efficiently with others.
8 |
9 | Although learning algorithms have recently achieved superhuman performance in a number of two-player, zero-sum games, scalable multi-agent reinforcement learning algorithms that can discover effective strategies and conventions in complex, partially observable settings have proven elusive.
10 |
11 | We present the Bayesian action decoder (BAD), a new multi-agent learning method that uses an approximate Bayesian update to obtain a public belief that conditions on the actions taken by all agents in the environment.
12 |
13 | Together with the public belief, this Bayesian update effectively defines a new Markov decision process, the public belief MDP, in which the action space consists of deterministic partial policies, parameterised by deep neural networks, that can be sampled for a given public state.
14 |
15 | It exploits the fact that an agent acting only on this public belief state can still learn to use its private information if the action space is augmented to be over partial policies mapping private information into environment actions.
16 |
17 | The Bayesian update is also closely related to the theory of mind reasoning that humans carry out when observing others' actions.
18 |
19 | We first validate BAD on a proof-of-principle two-step matrix game, where it outperforms traditional policy gradient methods.
20 |
21 | We then evaluate BAD on the challenging, cooperative partial-information card game Hanabi, where in the two-player setting the method surpasses all previously published learning and hand-coded approaches.
22 |
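A sketch of the public-belief update at the heart of the method: given the sampled partial policy (a deterministic map from private features to actions), observing an action rules out the private features inconsistent with it. The discrete, tabular representation here is a simplification of the paper's approximate update.

```python
import numpy as np

def public_belief_update(belief, partial_policy, observed_action):
    """belief: probabilities over the acting agent's private feature f (shape [F]);
    partial_policy: deterministic map f -> action, sampled in the public-belief MDP;
    the Bayesian update keeps only features consistent with the observed action."""
    likelihood = np.array([1.0 if partial_policy[f] == observed_action else 0.0
                           for f in range(len(belief))])
    posterior = belief * likelihood
    return posterior / posterior.sum()

# toy usage: 3 possible private cards; the sampled partial policy plays card-dependent actions
belief = np.array([1 / 3, 1 / 3, 1 / 3])
partial_policy = {0: "hint", 1: "play", 2: "play"}
print(public_belief_update(belief, partial_policy, "play"))   # -> [0., 0.5, 0.5]
```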
--------------------------------------------------------------------------------
/bdope.md:
--------------------------------------------------------------------------------
1 | # Benchmarks for Deep Off-Policy Evaluation
2 | > Justin Fu, Mohammad Norouzi, T. Paine, et al.
3 | > Published 30 March 2021
4 | ## Abstract
5 |
6 | Off-policy evaluation (OPE) holds the promise of being able to leverage large, offline datasets for both obtaining and selecting complex policies for decision making. The ability to perform evaluation offline is particularly important in many real-world domains, such as healthcare, recommender systems, or robotics, where online data collection is an expensive and potentially dangerous process.
7 |
8 | Being able to accurately evaluate and select high-performing policies without requiring online interaction could yield significant benefits in safety, time, and cost for these applications. While many OPE methods have been proposed in recent years, comparing results between works is difficult because there is currently a lack of a comprehensive and unified benchmark.
9 |
10 | Moreover, it is difficult to measure how far algorithms have progressed, due to the lack of challenging evaluation tasks.
11 |
12 | In order to address this gap, we propose a new benchmark for off-policy evaluation which includes tasks on a range of challenging, high-dimensional control problems, with wide selections of datasets and policies for performing policy selection.
13 |
14 | The goal of our benchmark is to provide a standardized measure of progress, motivated by a set of principles designed to challenge and test the limits of existing OPE methods.
15 |
16 | We perform a comprehensive evaluation of state-of-the-art algorithms, and we will provide open-source access to all data and code to foster future research in this area.
17 |
--------------------------------------------------------------------------------
/content.md:
--------------------------------------------------------------------------------
1 | # Deep Reinforcement Learning
2 | ## Policy gradient methods
3 |
4 | * [Trust Region Policy Optimization](TRPO.md)
5 | * [Reinforcement Learning with Deep Energy-Based Policies](DEBP.md)
6 | * [Q-PROP: SAMPLE-EFFICIENT POLICY GRADIENT WITH AN OFF-POLICY CRITIC](QPROP.md)
7 |
8 | ## Explorations in DRL
9 |
10 | * [Incentivizing Exploration In Reinforcement Learning With Deep Predictive Models](incentivizing.md)
11 |
12 | ## Actor-Critic methods
13 |
14 | * [The Reactor: A Sample-Efficient Actor-Critic Architecture](REACTOR.md)
15 | * [SAMPLE EFFICIENT ACTOR-CRITIC WITH EXPERIENCE REPLAY](ACER.md)
16 | * [REINFORCEMENT LEARNING WITH UNSUPERVISED AUXILIARY TASKS](UNREAL.md)
17 | * [Continuous control with deep reinforcement learning](DDPG.md)
18 |
19 |
20 | ## Connection with other methods
21 |
22 | * [Connecting Generative Adversarial Networks and Actor-Critic Methods](GANAC.md)
23 |
24 | ## Connecting value and policy methods
25 | * [Bridging the Gap Between Value and Policy Based Reinforcement Learning](PCL.md)
26 | * [Policy gradient and Q-learning](PGQ.md)
27 |
28 | ## Unifying
29 | * [Multi-step Reinforcement Learning: A Unifying Algorithm](MSRL.md)
30 |
31 | ## Faster DRL
32 | * [Neural Episodic Control](NEC.md)
33 |
--------------------------------------------------------------------------------
/database.csv:
--------------------------------------------------------------------------------
1 | title, author, time, algorithm, code
2 |
--------------------------------------------------------------------------------
/dmimic.md:
--------------------------------------------------------------------------------
1 | Xue Bin Peng and Pieter Abbeel and Sergey Levine and Michiel van de Panne
2 |
3 | # Abstract
4 | Copying an element from a photo and pasting it into a painting is a challenging task. Applying photo compositing techniques
5 | in this context yields subpar results that look like a collage — and existing painterly stylization algorithms, which are global,
6 | perform poorly when applied locally. We address these issues with a dedicated algorithm that carefully determines the local
7 | statistics to be transferred. We ensure both spatial and inter-scale statistical consistency and demonstrate that both aspects
8 | are key to generating quality results. To cope with the diversity of abstraction levels and types of paintings, we introduce a
9 | technique to adjust the parameters of the transfer depending on the painting. We show that our algorithm produces significantly
10 | better results than photo compositing or global stylization techniques and that it enables creative painterly edits that would be
11 | otherwise difficult to achieve.
12 |
--------------------------------------------------------------------------------
/images/.gitkeep:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/tigerneil/awesome-deep-rl/ccfc8116065a57ef107717cd7e77ec71c171058f/images/.gitkeep
--------------------------------------------------------------------------------
/images/ACER.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/tigerneil/awesome-deep-rl/ccfc8116065a57ef107717cd7e77ec71c171058f/images/ACER.png
--------------------------------------------------------------------------------
/images/TRPO.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/tigerneil/awesome-deep-rl/ccfc8116065a57ef107717cd7e77ec71c171058f/images/TRPO.png
--------------------------------------------------------------------------------
/images/Trust Region Policy Optimization.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/tigerneil/awesome-deep-rl/ccfc8116065a57ef107717cd7e77ec71c171058f/images/Trust Region Policy Optimization.png
--------------------------------------------------------------------------------
/images/awesome-drl.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/tigerneil/awesome-deep-rl/ccfc8116065a57ef107717cd7e77ec71c171058f/images/awesome-drl.png
--------------------------------------------------------------------------------
/images/incentizing.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/tigerneil/awesome-deep-rl/ccfc8116065a57ef107717cd7e77ec71c171058f/images/incentizing.png
--------------------------------------------------------------------------------
/images/landscape.jpeg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/tigerneil/awesome-deep-rl/ccfc8116065a57ef107717cd7e77ec71c171058f/images/landscape.jpeg
--------------------------------------------------------------------------------
/images/unreal.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/tigerneil/awesome-deep-rl/ccfc8116065a57ef107717cd7e77ec71c171058f/images/unreal.png
--------------------------------------------------------------------------------
/incentivizing.md:
--------------------------------------------------------------------------------
1 | # Incentivizing Exploration In Reinforcement Learning With Deep Predictive Models
2 |
3 | **Bradly C. Stadie, Sergey Levine, Pieter Abbeel**
4 |
5 | Achieving efficient and scalable exploration in complex domains poses a major challenge in reinforcement learning.
6 |
7 | While Bayesian and PAC-MDP approaches to the exploration problem offer strong formal guarantees, they are often **impractical** in higher dimensions due to their reliance on enumerating the state-action space.
8 |
9 | Hence, exploration in complex domains is often performed with simple epsilon-greedy methods.
10 | In this paper, we consider the challenging Atari games domain, which requires processing raw pixel inputs and delayed rewards.
11 |
12 | We evaluate several more sophisticated exploration strategies, including **Thompson sampling and Boltzmann exploration**, and propose a new *exploration* method based on assigning **exploration bonuses** from a **concurrently learned model** of the system dynamics. By parameterizing our **learned model** with a neural network, we are able to develop a scalable and efficient approach to exploration bonuses that can be applied to tasks with complex, high-dimensional state spaces.
13 |
14 | In the Atari domain, our method provides the most consistent improvement across a range of games that pose a major challenge for prior methods. In addition to raw game-scores, we also develop an **AUC-100 metric** for the Atari Learning domain to evaluate the impact of exploration on this benchmark.
15 |
16 | Algorithm:
17 |
18 | 
19 |
20 | ## Contribution
21 | Propose a new exploration method based on assigning exploration bonuses from a concurrently learned model of the system dynamics.
22 |
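A sketch of the exploration-bonus mechanism: train a forward model of (encoded) dynamics alongside the agent and add a bonus proportional to its prediction error, so poorly modeled transitions look attractive. The encoding, normalization, and bonus scale are simplifications of the paper's scheme.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

enc_dim, act_dim = 32, 4
forward_model = nn.Sequential(nn.Linear(enc_dim + act_dim, 128), nn.ReLU(),
                              nn.Linear(128, enc_dim))
opt = torch.optim.Adam(forward_model.parameters(), lr=1e-3)

def exploration_bonus(phi_s, action_onehot, phi_next, beta=0.05):
    """Train the dynamics model on the transition and return a bonus that is
    large where the model is still surprised (i.e. the region is poorly explored)."""
    pred = forward_model(torch.cat([phi_s, action_onehot], dim=-1))
    err = F.mse_loss(pred, phi_next)
    opt.zero_grad(); err.backward(); opt.step()
    return beta * err.item()

# augmented reward used by the underlying RL agent:
# r_total = r_env + exploration_bonus(phi(s_t), a_t, phi(s_{t+1}))
bonus = exploration_bonus(torch.randn(enc_dim),
                          F.one_hot(torch.tensor(2), act_dim).float(),
                          torch.randn(enc_dim))
```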
--------------------------------------------------------------------------------
/index.html:
--------------------------------------------------------------------------------
(index.html: HTML markup lost in extraction; the recoverable page text follows.)
Awesome Deep Reinforcement Learning
Follow awesome-deep-rl on Github
updated Landscape of DRL
Landscape of DRL
This project is built for people who are learning and researching on latest deep reinforcement learning methods.
Illustrations:
Recommendations and suggestions are welcome.