├── IEEE Transactions on Systems, Man, and Cybernetics- Systems Volume 45 issue 3 2015 [doi 10.1109%2FTSMC.2014.2358639] Liu, Chunming; Xu, Xin; Hu, Dewen -- Multiobjective Reinforcement Learning- A Compr.pdf
├── README.md
└── DeepRL.md

/IEEE Transactions on Systems, Man, and Cybernetics- Systems Volume 45 issue 3 2015 [doi 10.1109%2FTSMC.2014.2358639] Liu, Chunming; Xu, Xin; Hu, Dewen -- Multiobjective Reinforcement Learning- A Compr.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/RL-Group/tutorials-and-papers/HEAD/IEEE Transactions on Systems, Man, and Cybernetics- Systems Volume 45 issue 3 2015 [doi 10.1109%2FTSMC.2014.2358639] Liu, Chunming; Xu, Xin; Hu, Dewen -- Multiobjective Reinforcement Learning- A Compr.pdf
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# tutorials-and-papers
Collection of tutorials, exercises, and papers on reinforcement learning (RL).

## Study tracks

- [Deep Reinforcement Learning](https://github.com/RL-Group/tutorials-and-papers/blob/master/DeepRL.md)

## Other collections of resources

- [awesome-rl](https://github.com/aikorea/awesome-rl)
- [awesome-deep-reinforcement-learning](https://github.com/williamd4112/awesome-deep-reinforcement-learning)

## Tutorials (theory)

- Nervana's [Demystifying Deep Reinforcement Learning](https://www.intelnervana.com/demystifying-deep-reinforcement-learning/)

## Implementations

- PyTorch
  - [PyTorch, DQN and gym](http://pytorch.org/tutorials/intermediate/reinforcement_q_learning.html#sphx-glr-intermediate-reinforcement-q-learning-py)
- TensorFlow
  - [Minimal and clean examples of reinforcement learning algorithms](https://github.com/rlcode/reinforcement-learning)

## Reviews

- Yuxi Li, [Deep Reinforcement Learning: An Overview](https://arxiv.org/abs/1701.07274) (66 pages)
- Arulkumaran et al., [A Brief Survey of Deep Reinforcement Learning](https://arxiv.org/abs/1708.05866) (14 pages)

## Theses

- David Silver (2009), [Reinforcement Learning and Simulation-Based Search in Computer Go](http://papersdb.cs.ualberta.ca/~papersdb/uploaded_files/1029/paper_thesis.pdf)
- John Schulman (2016), [Optimizing Expectations: From Deep Reinforcement Learning to Stochastic Computation Graphs](http://joschu.net/docs/thesis.pdf)

## Sutton & Barto
Deserves its own section.

- 2nd edition draft: [Reinforcement Learning: An Introduction](http://incompleteideas.net/sutton/book/bookdraft2017june.pdf) [PDF]
- Solutions:
  - the official solutions manual (no public link collected here yet)
  - [JKCooper2/rlai-exercises](https://github.com/JKCooper2/rlai-exercises)
- Implementations:
  - [Shangtong Zhang (Sutton's student)](https://github.com/ShangtongZhang/reinforcement-learning-an-introduction)
  - [Denny Britz (Google Brain Resident)](https://github.com/dennybritz/reinforcement-learning)

## Books

- RL
  - Marco Wiering and Martijn van Otterlo, [Reinforcement Learning: State-of-the-Art](https://smile.amazon.com/Reinforcement-Learning-State-Art-Optimization/dp/364227644X)
- DL
  - Goodfellow, Bengio, and Courville, [Deep Learning](http://www.deeplearningbook.org/)
- ML
  - Bishop, [Pattern Recognition and Machine Learning](https://smile.amazon.com/Pattern-Recognition-Learning-Information-Statistics/dp/0387310738)

## Libraries

- [Tensorforce](https://github.com/reinforceio/tensorforce): a TensorFlow library for applied reinforcement learning
- [keras-rl](https://github.com/matthiasplappert/keras-rl): deep RL library built on Keras and OpenAI Gym, with several DRL algorithms implemented
--------------------------------------------------------------------------------
/DeepRL.md:
--------------------------------------------------------------------------------
# Deep Reinforcement Learning

## Lecture 1: Markov Decision Processes and Solving Finite Problems

[Video](https://www.youtube.com/watch?v=IL3gVyJMmhg&index=7&list=PLkFD6_40KJIwTmSbCv9OVJB3YaO4sFwkX) | [Slides](http://rll.berkeley.edu/deeprlcoursesp17/docs/lec1.pdf)

## Lecture 2: Policy Gradient Methods

[Video](https://www.youtube.com/watch?v=BB-BhTn6DCM&index=8&list=PLkFD6_40KJIwTmSbCv9OVJB3YaO4sFwkX) | [Slides](http://rll.berkeley.edu/deeprlcoursesp17/docs/lec2.pdf)

* Karpathy's Deep RL Tutorial: [Deep Reinforcement Learning: Pong from Pixels](http://karpathy.github.io/2016/05/31/rl/)
* John Schulman's thesis: [Optimizing Expectations: From Deep Reinforcement Learning to Stochastic Computation Graphs](http://joschu.net/docs/thesis.pdf)
* R. J. Williams. [Simple statistical gradient-following algorithms for connectionist reinforcement learning](http://www-anw.cs.umass.edu/~barto/courses/cs687/williams92simple.pdf). Machine Learning (1992)
* R. S. Sutton, D. McAllester, S. Singh, and Y. Mansour. [Policy gradient methods for reinforcement learning with function approximation](https://web.eecs.umich.edu/~baveja/Papers/PolicyGradientNIPS99.pdf). NIPS. MIT Press, 2000
* Jan Peters and Stefan Schaal. [Policy Gradient Methods for Robotics](http://www-clmc.usc.edu/publications/P/peters-IROS2006.pdf) (2006)
* D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra, et al. [Deterministic Policy Gradient Algorithms](http://proceedings.mlr.press/v32/silver14.pdf). ICML, 2014
* N. Heess, G. Wayne, D. Silver, T. Lillicrap, Y. Tassa, et al. [Learning Continuous Control Policies by Stochastic Value Gradients](https://arxiv.org/abs/1510.09142). arXiv preprint arXiv:1510.09142 (2015)
* J. Tang and P. Abbeel. [On a connection between importance sampling and the likelihood ratio policy gradient](http://rll.berkeley.edu/~jietang/pubs/nips10_Tang.pdf). Advances in Neural Information Processing Systems, 2010, pp. 1000–1008
* A3C paper: V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. P. Lillicrap, et al. [Asynchronous methods for deep reinforcement learning](https://arxiv.org/abs/1602.01783) (2016)

## Lecture 3: Q-Function Learning Methods

[Video](https://www.youtube.com/watch?v=Wnl-Qh2UHGg&list=PLkFD6_40KJIwTmSbCv9OVJB3YaO4sFwkX&index=9) | [Slides](http://rll.berkeley.edu/deeprlcoursesp17/docs/lec3.pdf)

* T. Jaakkola, M. I. Jordan, and S. P. Singh. [On the convergence of stochastic iterative dynamic programming algorithms](https://www.researchgate.net/publication/220499733_On_the_Convergence_of_Stochastic_Iterative_Dynamic_Programming_Algorithms). Neural Computation (1994)
* C. J. Watkins and P. Dayan. [Q-learning](https://link.springer.com/content/pdf/10.1007/BF00992698.pdf). Machine Learning (1992)
* M. Riedmiller. [Neural fitted Q iteration – first experiences with a data efficient neural reinforcement learning method](https://pdfs.semanticscholar.org/2820/01869bd502c7917db8b32b75593addfbbc68.pdf). Machine Learning: ECML 2005. Springer, 2005

## Lecture 4: Advanced Q-Function Learning Methods

[Video](https://www.youtube.com/watch?v=h1-pj4Y9-kM&list=PLkFD6_40KJIwTmSbCv9OVJB3YaO4sFwkX&index=10) | [Slides](http://rll.berkeley.edu/deeprlcoursesp17/docs/lec4.pdf)

* V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, et al. [Playing Atari with Deep Reinforcement Learning](https://arxiv.org/abs/1312.5602) (2013)
* M. G. Bellemare, Y. Naddaf, J. Veness, and M. Bowling. [The Arcade Learning Environment: An Evaluation Platform for General Agents](https://arxiv.org/abs/1207.4708). Journal of Artificial Intelligence Research (2013)
* V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, et al. [Human-level control through deep reinforcement learning](https://storage.googleapis.com/deepmind-media/dqn/DQNNaturePaper.pdf). Nature (2015)
* H. van Hasselt. [Double Q-learning](https://papers.nips.cc/paper/3964-double-q-learning.pdf). NIPS, 2010
* H. van Hasselt, A. Guez, and D. Silver. [Deep reinforcement learning with double Q-learning](https://arxiv.org/abs/1509.06461). CoRR, abs/1509.06461 (2015)
* Z. Wang, N. de Freitas, and M. Lanctot. [Dueling network architectures for deep reinforcement learning](https://arxiv.org/abs/1511.06581). arXiv preprint arXiv:1511.06581 (2015)
* T. Schaul, J. Quan, I. Antonoglou, and D. Silver. [Prioritized experience replay](https://arxiv.org/abs/1511.05952). arXiv preprint arXiv:1511.05952 (2015)
* D. Silver, A. Huang, C. J. Maddison, et al. [Mastering the game of Go with deep neural networks and tree search](https://vk.com/doc-44016343_437229031?dl=56ce06e325d42fbc72). Nature (2016)

## Lecture 5: Advanced Policy Gradient Methods: Natural Gradient, TRPO, and More

Schulman: [Video](https://www.youtube.com/watch?v=_t5fpZuuf-4&index=15&list=PLkFD6_40KJIwTmSbCv9OVJB3YaO4sFwkX) | [Slides](http://rll.berkeley.edu/deeprlcoursesp17/docs/lec5.pdf)

Achiam: [Video](https://www.youtube.com/watch?v=ycCtmp4hcUs&list=PLkFD6_40KJIznC9CDbVTjAF2oyt8_VAe3&index=14) | [Slides](http://rll.berkeley.edu/deeprlcourse/f17docs/lecture_13_advanced_pg.pdf)

* S. Kakade. [A Natural Policy Gradient](https://papers.nips.cc/paper/2073-a-natural-policy-gradient.pdf). NIPS, 2001
* S. Kakade and J. Langford. [Approximately optimal approximate reinforcement learning](https://people.eecs.berkeley.edu/~pabbeel/cs287-fa09/readings/KakadeLangford-icml2002.pdf). ICML, 2002
* J. Peters and S. Schaal. [Natural actor-critic](https://homes.cs.washington.edu/~todorov/courses/amath579/reading/NaturalActorCritic.pdf). Neurocomputing (2008)
* J. Schulman, S. Levine, P. Moritz, M. I. Jordan, and P. Abbeel. [Trust Region Policy Optimization](https://arxiv.org/abs/1502.05477). ICML (2015)
* Joshua Achiam, David Held, Aviv Tamar, and Pieter Abbeel. [Constrained Policy Optimization](https://arxiv.org/abs/1705.10528) (2017)
* John Schulman, Filip Wolski, Prafulla Dhariwal, et al. [Proximal Policy Optimization Algorithms](https://arxiv.org/abs/1707.06347) (2017)
* Y. Duan, X. Chen, R. Houthooft, J. Schulman, and P. Abbeel. [Benchmarking Deep Reinforcement Learning for Continuous Control](https://arxiv.org/abs/1604.06778). ICML (2016)
* J. Martens and I. Sutskever. [Training deep and recurrent networks with Hessian-free optimization](http://www.cs.utoronto.ca/~jmartens/docs/HF_book_chapter.pdf). In: Neural Networks: Tricks of the Trade. Springer, 2012 (book chapter)
* S. Amari and S. C. Douglas. [Why Natural Gradient?](http://www.yaroslavvb.com/papers/amari-why.pdf) (1998)
* Jan Peters and Stefan Schaal. [Reinforcement learning of motor skills with policy gradients](http://www.kyb.mpg.de/fileadmin/user_upload/files/publications/attachments/Neural-Netw-2008-21-682_4867%5b0%5d.pdf) (2008)
* Nicolas Heess, Dhruva TB, Srinivasan Sriram, et al. [Emergence of Locomotion Behaviours in Rich Environments](https://arxiv.org/abs/1707.02286) (2017)
* Yuhuai Wu, Elman Mansimov, Shun Liao, et al. [Scalable trust-region method for deep reinforcement learning using Kronecker-factored approximation](https://arxiv.org/abs/1708.05144) (2017)

## Lecture 6: Variance Reduction for Policy Gradient Methods

[Video](https://www.youtube.com/watch?v=Fauwwkiy-bo&index=16&list=PLkFD6_40KJIwTmSbCv9OVJB3YaO4sFwkX) | [Slides](http://rll.berkeley.edu/deeprlcoursesp17/docs/lec6.pdf)

* A. Y. Ng, D. Harada, and S. Russell. [Policy invariance under reward transformations: Theory and application to reward shaping](http://www.robotics.stanford.edu/~ang/papers/shaping-icml99.pdf). ICML, 1999
* (Example) I. Mordatch, E. Todorov, and Z. Popović. [Discovery of complex behaviors through contact-invariant optimization](https://homes.cs.washington.edu/~todorov/papers/MordatchSIGGRAPH12.pdf). ACM Transactions on Graphics (TOG) 31.4 (2012), p. 43
* (Example) Y. Tassa, T. Erez, and E. Todorov. [Synthesis and stabilization of complex behaviors through online trajectory optimization](https://homes.cs.washington.edu/~todorov/papers/TassaIROS12.pdf). Intelligent Robots and Systems (IROS), 2012 IEEE/RSJ International Conference on. IEEE, 2012, pp. 4906–4913
* J. Schulman, P. Moritz, S. Levine, M. Jordan, and P. Abbeel. [High-dimensional continuous control using generalized advantage estimation](https://arxiv.org/abs/1506.02438) (2015)
* H. Kimura and S. Kobayashi. [An Analysis of Actor/Critic Algorithms Using Eligibility Traces: Reinforcement Learning with Imperfect Value Function](http://www.umiacs.umd.edu/~hal/courses/2016F_RL/Kimura98.pdf). ICML, 1998
* A. Y. Ng and M. Jordan. [PEGASUS: A policy search method for large MDPs and POMDPs](http://www.robotics.stanford.edu/~ang/papers/uai00-pegasus.pdf) (2000)

## Lecture 7: Policy Gradient Methods: Pathwise Derivative Methods and Wrap-up

[Video](https://www.youtube.com/watch?v=IDSA2wAACr0&list=PLkFD6_40KJIwTmSbCv9OVJB3YaO4sFwkX&index=17) | [Slides](http://rll.berkeley.edu/deeprlcoursesp17/docs/lec7.pdf)

* T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, et al. [Continuous control with deep reinforcement learning](https://arxiv.org/abs/1509.02971). arXiv preprint arXiv:1509.02971 (2015)
* (Example) S. Gu, T. Lillicrap, I. Sutskever, and S. Levine. [Continuous deep Q-learning with model-based acceleration](https://arxiv.org/abs/1603.00748) (2016)
* B. O’Donoghue, R. Munos, K. Kavukcuoglu, and V. Mnih. [PGQ: Combining policy gradient and Q-learning](https://arxiv.org/abs/1611.01626) (2016)
* Z. Wang, V. Bapst, N. Heess, V. Mnih, R. Munos, et al. [Sample Efficient Actor-Critic with Experience Replay](https://arxiv.org/abs/1611.01224) (2016)
* A. Harutyunyan, M. G. Bellemare, T. Stepleton, and R. Munos. [Q(λ) with Off-Policy Corrections](https://arxiv.org/abs/1602.04951) (2016)
* N. Jiang and L. Li. [Doubly robust off-policy value evaluation for reinforcement learning](https://arxiv.org/abs/1511.03722) (2016)
* O. Nachum, M. Norouzi, K. Xu, and D. Schuurmans. [Bridging the Gap Between Value and Policy Based Reinforcement Learning](https://arxiv.org/abs/1702.08892) (2017)
* T. Haarnoja, H. Tang, P. Abbeel, and S. Levine. [Reinforcement Learning with Deep Energy-Based Policies](https://arxiv.org/abs/1702.08165) (2017)

## Lecture 8: Exploration

[Video](https://www.youtube.com/watch?v=SfCa1HQMkuw&index=18&list=PLkFD6_40KJIwTmSbCv9OVJB3YaO4sFwkX) | [Slides](http://rll.berkeley.edu/deeprlcoursesp17/docs/2017.03.20.Exploration.pdf)

* Sébastien Bubeck and Nicolò Cesa-Bianchi. [Regret analysis of stochastic and nonstochastic multi-armed bandit problems](https://arxiv.org/abs/1204.5721). arXiv preprint arXiv:1204.5721 (2012)
* Peter Auer, Nicolò Cesa-Bianchi, and Paul Fischer. [Finite-Time Analysis of the Multi-Armed Bandit Problem](https://d2925a48-a-62cb3a1a-s-sites.googlegroups.com/site/anrexplora/bibliography/fta-2002.pdf). Machine Learning, 47 (2002), 235–256
* Daniel Russo and Benjamin Van Roy. [Learning to Optimize via Posterior Sampling](https://arxiv.org/abs/1301.2609). Mathematics of Operations Research (2014)
* O. Chapelle and L. Li. [An Empirical Evaluation of Thompson Sampling](https://papers.nips.cc/paper/4321-an-empirical-evaluation-of-thompson-sampling.pdf). NIPS, 2011
* Daniel Russo and Benjamin Van Roy. [Learning to Optimize via Information-Directed Sampling](https://arxiv.org/abs/1403.5556). NIPS (2014)
* Kearns and Singh. [Near-Optimal Reinforcement Learning in Polynomial Time](https://www.cis.upenn.edu/~mkearns/papers/KearnsSinghE3.pdf) (1999)
* Kakade. [On the sample complexity of reinforcement learning](https://homes.cs.washington.edu/~sham/papers/thesis/sham_thesis.pdf) (thesis, 2003)
* István Szita and András Lőrincz. [The many faces of optimism: a unifying approach](http://icml2008.cs.helsinki.fi/papers/490.pdf). ICML, 2008
* Teodor Mihai Moldovan and Pieter Abbeel. [Safe exploration in Markov decision processes](https://arxiv.org/abs/1205.4810). arXiv preprint arXiv:1205.4810 (2012)
* Strehl. [Probably Approximately Correct (PAC) Exploration in Reinforcement Learning](http://cs.brown.edu/~mlittman/theses/strehl.pdf) (thesis, 2007)
* Ian Osband and Benjamin Van Roy. [Bootstrapped Thompson Sampling and Deep Exploration](https://arxiv.org/abs/1507.00300) (2015)
* I. Osband, C. Blundell, A. Pritzel, and B. Van Roy. [Deep Exploration via Bootstrapped DQN](https://arxiv.org/abs/1602.04621) (2016)
* Yarin Gal and Zoubin Ghahramani. [Dropout as a Bayesian approximation: Representing model uncertainty in deep learning](https://arxiv.org/abs/1506.02142) (2015)
* Zachary Lipton et al. [Efficient Exploration for Dialogue Policy Learning with BBQ Networks & Replay Buffer Spiking](https://arxiv.org/abs/1608.05081) (2016)
* Stadie et al. [Incentivizing Exploration In Reinforcement Learning With Deep Predictive Models](https://arxiv.org/abs/1507.00814) (2015)
* Bellemare et al. [Unifying Count-Based Exploration and Intrinsic Motivation](https://arxiv.org/abs/1606.01868) (2016)
* Ostrovski et al. [Count-Based Exploration with Neural Density Models](https://arxiv.org/abs/1703.01310) (2017)
* Meire Fortunato, Mohammad Gheshlaghi Azar, et al. [Noisy Networks for Exploration](https://arxiv.org/abs/1706.10295) (2017)
* Houthooft et al. [VIME: Variational Information Maximizing Exploration](https://arxiv.org/abs/1605.09674) (2016)
* Duan et al. [RL^2: Fast Reinforcement Learning via Slow Reinforcement Learning](https://arxiv.org/abs/1611.02779) (2017)
* Wang et al. [Learning to Reinforcement Learn](https://arxiv.org/abs/1611.05763) (2017)
* S. P. Singh, A. G. Barto, and N. Chentanez. [Intrinsically motivated reinforcement learning](https://papers.nips.cc/paper/2552-intrinsically-motivated-reinforcement-learning.pdf). NIPS, 2005
* Pierre-Yves Oudeyer and Frederic Kaplan. [How can we define intrinsic motivation?](http://www.pyoudeyer.com/epirob08OudeyerKaplan.pdf) (2008)
* Shakir Mohamed and Danilo J. Rezende. [Variational Information Maximisation for Intrinsically Motivated Reinforcement Learning](https://arxiv.org/abs/1509.08731). arXiv, 2015 (includes a discussion of *common random numbers*)

## Scalability

* Tim Salimans, Jonathan Ho, Xi Chen, and Ilya Sutskever. [Evolution Strategies as a Scalable Alternative to Reinforcement Learning](https://arxiv.org/abs/1703.03864) (2017)
* Alfredo V. Clemente, Humberto N. Castejón, and Arjun Chandra. [Efficient Parallel Methods for Deep Reinforcement Learning](https://arxiv.org/abs/1705.04862) (2017)

## Frontiers

* Piotr Mirowski, Razvan Pascanu, Fabio Viola, et al. [Learning to Navigate in Complex Environments](https://arxiv.org/abs/1611.03673) (2017)
* Alexander Sasha Vezhnevets, Simon Osindero, et al. [FeUdal Networks for Hierarchical Reinforcement Learning](https://arxiv.org/abs/1703.01161) (2017)
* Nicolas Heess, Greg Wayne, Yuval Tassa, et al. [Learning and Transfer of Modulated Locomotor Controllers](https://arxiv.org/abs/1610.05182) (2016)
* Théophane Weber, Sébastien Racanière, et al. [Imagination-Augmented Agents for Deep Reinforcement Learning](https://arxiv.org/abs/1707.06203) (2017)
* S. Levine, C. Finn, T. Darrell, and P. Abbeel. [End-to-end training of deep visuomotor policies](https://arxiv.org/abs/1504.00702) (2016)
* Haoran Tang and Tuomas Haarnoja. [Learning Diverse Skills via Maximum Entropy Deep Reinforcement Learning](http://bair.berkeley.edu/blog/2017/10/06/soft-q-learning/) (2017)
* Matteo Hessel, Joseph Modayil, et al. [Rainbow: Combining Improvements in Deep Reinforcement Learning](https://arxiv.org/abs/1710.02298) (2017)
* Max Jaderberg, Volodymyr Mnih, et al. [Reinforcement Learning with Unsupervised Auxiliary Tasks](https://arxiv.org/abs/1611.05397) (2016)
* Aravind Rajeswaran, Vikash Kumar, et al. [Learning Complex Dexterous Manipulation with Deep Reinforcement Learning and Demonstrations](https://arxiv.org/abs/1709.10087) (2017)

## Optimization

* Kerby Shedden. [Optimization in statistics](http://dept.stat.lsa.umich.edu/~kshedden/Courses/Stat606/Notes/optim.pdf)
  - It contains a nice section about the *Fisher Information* and its relation to the *Hessian* of the log-likelihood (a short reminder of that identity is sketched below).
  - It also contains a section about *Conjugate-Gradient Methods*.
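  As a quick, hedged reminder of the identity those notes discuss (stated under the usual regularity conditions; the notation below is ours, not taken from the notes):

  ```latex
  % Fisher information as the expected outer product of the score, and
  % (under regularity conditions) as the negative expected Hessian of the
  % log-likelihood -- the identity behind the natural-gradient references above.
  F(\theta)
    = \mathbb{E}_{x \sim p_\theta}\!\left[ \nabla_\theta \log p_\theta(x)\,
        \nabla_\theta \log p_\theta(x)^{\top} \right]
    = -\,\mathbb{E}_{x \sim p_\theta}\!\left[ \nabla_\theta^{2} \log p_\theta(x) \right]
  ```

  The natural policy gradient papers in Lecture 5 (Kakade 2001; Peters & Schaal 2008) precondition the vanilla policy gradient with the inverse of this matrix.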
* Andrew Gibiansky. [Conjugate Gradient (article)](http://andrew.gibiansky.com/blog/machine-learning/conjugate-gradient/) (2014)
* Andrew Gibiansky. [Hessian Free Optimization (article)](http://andrew.gibiansky.com/blog/machine-learning/hessian-free-optimization/) (2014)
* Andrew Gibiansky. [Fully Connected Neural Network Algorithms (article)](http://andrew.gibiansky.com/blog/machine-learning/fully-connected-neural-networks/) (2014)
* Andrew Gibiansky. [Gauss Newton Matrix (article)](http://andrew.gibiansky.com/blog/machine-learning/gauss-newton-matrix/) (2014)

## Useful knowledge

* Art B. Owen. [Monte Carlo theory, methods and examples](http://statweb.stanford.edu/~owen/mc/) (2013)
  - Useful for learning about **variance reduction**
--------------------------------------------------------------------------------