├── .gitignore
├── README.md
├── hackers_guide
    ├── 1_chapt
    │   ├── base_case.js
    │   ├── multiple_gates.js
    │   └── single_neuron.js
    ├── 2_chapt
    │   ├── neural_net.js
    │   └── svm.js
    └── README.md
├── summaries
    ├── auto-encoding_var_bayes.md
    ├── autoencoders.md
    ├── build_machines_learn_think.md
    ├── end-to-end_tf.md
    ├── fully_char_level_nmt.md
    ├── implicit_drd_grn.md
    ├── intrinsic_dimension.md
    ├── learning_phrase_rep_RNN_encoder_decoder_mt.md
    ├── matching_networks.md
    ├── neural_machine_translation.md
    ├── overview_optimization.md
    ├── scheduled_sampling.md
    ├── seq2seq_nn.md
    ├── softmax_bottleneck.md
    ├── unreal.md
    ├── vanishing_gradients.md
    └── var_auto_sequence_class.md
└── web_resources.md


/.gitignore:
--------------------------------------------------------------------------------
1 | *.swp
2 | tags
3 | 


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
  1 | ### 2019-06
  2 | - _Discrete Flows: Invertible Generative Models of Discrete Data_, Tran et al, 2019. [arXiv](https://arxiv.org/abs/1905.10347)
  3 | - _Stochastic Beams and Where to Find Them: The Gumbel-Top-k Trick for Sampling Sequences Without Replacement_, Kool et al, ICML 2019. [arXiv](https://arxiv.org/abs/1903.06059)
  4 | - _Sorting out Lipschitz function approximation_, Anil et al, ICML 2019. [arXiv](https://arxiv.org/abs/1811.05381), [`ICML oral`](https://www.facebook.com/icml.imls/videos/552835701913736/?t=2667). The goal is to train a neural net that is provably a K-Lipschitz function without losing expressivity. Applications include adversarial training and WGANs; Lipschitz constraints are also required in some flows, such as the i-ResNet (below). A 3-layer net with 1-Lipschitz-constrained linear layers and tanh activations cannot learn the absolute value function. If the gradient norm is clipped to 1 starting at the output, the gradient norms can only decrease as you backprop, so by the time the update reaches the input layer most of the information is lost. The solution is a new activation function called GroupSort, which sorts activations within groups. It is non-linear, 1-Lipschitz, gradient norm preserving, continuous, and differentiable almost everywhere.
  5 | - _Residual Flows for Invertible Generative Modeling_, Chen et al, 2019. [arXiv](https://arxiv.org/abs/1906.02735). Residual Flows resolve the bias problem introduced in i-ResNets (paper below).
  6 | - _Invertible Residual Networks_, Behrmann et al, ICML 2019. [arXiv](https://arxiv.org/abs/1811.00995), [`ICML oral`](https://www.facebook.com/icml.imls/videos/552835701913736/?t=550). Proposes conditions to add to ResNets to make them invertible, such as a 1-Lipschitz constraint. The goal is to make general-purpose architectures, such as ResNets, invertible by adding only some Lipschitz conditions instead of strict architectural constraints. For example, Planar Flows must use specific layer functions to ensure invertibility. However, the method has some bias, which increases with network expressiveness (see the Residual Flows paper above for a solution).
  7 | 
  8 | - _Unified Bellman Equation for Causal Information and Value in Markov Decision Processes_, Tiomkin and Tishby. [arXiv]()
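
The GroupSort note above (Anil et al) can be illustrated with a small sketch. This is not the authors' implementation: the function names `group_sort` and `l2_norm` and the default `group_size=2` are assumptions made for illustration. Sorting within a group only permutes the entries, so the output has the same norm as the input.

```python
import math


def group_sort(activations, group_size=2):
    """Sort activations in ascending order within consecutive groups.

    Illustrative sketch only; the name and default group size are assumptions.
    """
    if group_size < 1 or len(activations) % group_size != 0:
        raise ValueError("len(activations) must be a multiple of group_size")
    out = []
    for start in range(0, len(activations), group_size):
        # Each group is sorted independently, so values never leave their group.
        out.extend(sorted(activations[start:start + group_size]))
    return out


def l2_norm(values):
    """Euclidean norm of a list of floats."""
    return math.sqrt(sum(v * v for v in values))


x = [3.0, 1.0, -2.0, 5.0]
y = group_sort(x, group_size=2)
print(y)  # [1.0, 3.0, -2.0, 5.0]
# The output is a permutation of the input, so the norm is unchanged.
print(math.isclose(l2_norm(x), l2_norm(y)))  # True
```

With `group_size` equal to the layer width, the sketch reduces to a full sort of the activations.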
  9 | 
 10 | ### 2019-01
 11 | - Evaluating Theory of Mind in Question Answering, Nematzadeh et al, EMNLP 2018. [arXiv](https://arxiv.org/abs/1808.09352)
 12 | - An Off-policy Policy Gradient Theorem Using Emphatic Weightings, Imani et al, NeurIPS 2018. [arXiv](https://arxiv.org/abs/1811.09013)
 13 | - Doubly Robust Off-policy Value Evaluation for Reinforcement Learning, Jiang and Li, 2015. [arXiv](https://arxiv.org/abs/1511.03722)
 14 | - Data-Efficient Off-Policy Policy Evaluation for Reinforcement Learning, Thomas and Brunskill, 2016. [arXiv](https://arxiv.org/abs/1604.00923)
 15 | - Implicit Reparameterization Gradients, Figurnov et al, NeurIPS 2018. [arXiv](https://arxiv.org/abs/1805.08498)
 16 | - Approximately Optimal Approximate Reinforcement Learning, Kakade and Langford, 2002. [pdf](https://dl.acm.org/citation.cfm?id=656005). Note: the paper which inspired the likes of TRPO
 17 | 
 18 | ### 2018-12
 19 | - Meta-Learning: A Survey, Vanschoren et al, 2018. [arXiv](https://arxiv.org/abs/1810.03548)
 20 | - Off-policy Learning with Recognizers, Precup et al, 2005. [pdf](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.420.3772&rep=rep1&type=pdf)
 21 | - Meta-Gradient Reinforcement Learning, Xu et al, 2018. [arXiv](https://arxiv.org/abs/1805.09801).
 22 | - Expected Policy Gradients, Ciosek et al, 2018. [arXiv](https://arxiv.org/abs/1706.05374).
 23 | - Mean Actor Critic, Allen et al, 2018. [arXiv](https://arxiv.org/abs/1709.00503), [`web version`](https://www.groundai.com/project/mean-actor-critic/). The usual policy gradient is an expectation over states and actions, but the authors suggest adding the explicit sum over actions back into the expectation over states (Eq. 4). Doing so results in a policy update that considers actions not taken in the environment. In domains where the Q estimate is good, MAC has lower variance; otherwise MAC performs worse.
 24 | - Near-Optimal Representation Learning for Hierarchical Reinforcement Learning, Nachum et al, NeurIPS 2018. [arXiv](https://arxiv.org/abs/1810.01257). Note: builds on HIRO, but focuses on optimal representations.
 25 | - Data-Efficient Hierarchical Reinforcement Learning, Nachum et al, NeurIPS 2018. [arXiv](https://arxiv.org/abs/1805.08296). Note: type of HRL called HIRO. High level policy gives low-level policy a goal state to reach.
 26 | - Neural Ordinary Differential Equations, Chen et al, NeurIPS 2018. [arXiv](https://arxiv.org/abs/1806.07366), [`code`](https://github.com/rtqichen/torchdiffeq). Best paper award
 27 | - Non-delusional Q-learning and value-iteration, Lu et al, NeurIPS 2018. [proceedings](https://papers.nips.cc/paper/8200-non-delusional-q-learning-and-value-iteration). Best paper award.
 28 | - Exploration by Random Network Distillation, Burda et al, 2018. [arXiv](https://arxiv.org/abs/1810.12894)
 29 | - Revisiting the Arcade Learning Environment, Machado et al, 2017. [arXiv](https://arxiv.org/abs/1709.06009). Note: known for suggesting sticky actions to make the environment non-deterministic. Sticky action: with some probability eps, the environment repeats the previous action instead of the one the agent chose.
 30 | - An Information-Theoretic Optimality Principle for Deep Reinforcement Learning, Leibfried et al, 2017. [arXiv](https://arxiv.org/abs/1708.01867). Note: addresses problem of Q-value overestimation
 31 | - Dueling Network Architectures for Deep Reinforcement Learning, Wang et al, 2015. [arXiv](https://arxiv.org/abs/1511.06581)
 32 | - Deep Reinforcement Learning in Large Discrete Action Spaces, Dulac-Arnold et al, 2015. [arXiv](https://arxiv.org/abs/1512.07679)
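
The sticky-action idea noted under "Revisiting the Arcade Learning Environment" can be sketched as a thin wrapper around an environment. This is only an illustration under assumptions: `EchoEnv` is a hypothetical toy environment, and the `StickyActions` class and its `repeat_prob` parameter are names invented here, not part of the ALE API.

```python
import random


class EchoEnv:
    """Hypothetical toy environment: step() returns the action it executed."""

    def step(self, action):
        return action


class StickyActions:
    """With probability repeat_prob, execute the previous action instead."""

    def __init__(self, env, repeat_prob, seed=None):
        self.env = env
        self.repeat_prob = repeat_prob
        self.rng = random.Random(seed)
        self.prev_action = None  # nothing to repeat on the first step

    def step(self, action):
        if self.prev_action is not None and self.rng.random() < self.repeat_prob:
            action = self.prev_action  # ignore the agent's choice this step
        self.prev_action = action      # remember what was actually executed
        return self.env.step(action)


always_sticky = StickyActions(EchoEnv(), repeat_prob=1.0, seed=0)
print([always_sticky.step(a) for a in [2, 0, 1]])  # [2, 2, 2]

never_sticky = StickyActions(EchoEnv(), repeat_prob=0.0, seed=0)
print([never_sticky.step(a) for a in [2, 0, 1]])   # [2, 0, 1]
```

Values of `repeat_prob` strictly between 0 and 1 make the executed action sequence random, which is what breaks determinism for an agent that would otherwise memorize a fixed action sequence.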
 33 | 
 34 | ### 2018-11
 35 | - Bisimulation Metrics for Continuous Markov Decision Processes, Ferns et al, 2011. [pdf](https://www.cs.mcgill.ca/~prakash/Pubs/siamFP11.pdf)
 36 | - Addressing Function Approximation Error in Actor-Critic Methods, Fujimoto et al, ICML 2018. [arXiv](https://arxiv.org/abs/1802.09477). TD3 agent
 37 | - The Mirage of Action-Dependent Baselines in Reinforcement Learning, Tucker et al, 2018. [arXiv](https://arxiv.org/abs/1802.10031). Note: decomposes variance into 3 sources: from trajectory, action-dependent baseline, and state visitation. Conclusion: variance-reduction from action-dependent baseline can be minimal.
 38 | - Backpropagation through the Void: Optimizing control variates for black-box gradient estimation, Grathwohl et al, ICLR 2018, [arXiv](https://arxiv.org/abs/1711.00123). Note: action dependent baseline, builds on REBAR
 39 | - REBAR: Low-variance, unbiased gradient estimates for discrete latent variable models, Tucker et al, ICLR 2017. [arXiv](https://arxiv.org/abs/1703.07370)
 40 | 
 41 | ### 2018-10
 42 | - Emergence of Language with Multi-agent Games: Learning to Communicate with Sequences of Symbols, Havrylov and Titov, NIPS 2017. [arXiv](https://arxiv.org/abs/1705.11192). Note: EC with referential games, trained with REINFORCE and Gumbel-Softmax, shows hierarchy of language
 43 | - Prior Convictions: Black-box Adversarial Attacks with Bandits and Priors, Ilyas et al 2018, ICLR 2019 submission. [openreview](https://openreview.net/forum?id=BkMiWhR5K7)
 44 | - Certified Defenses against Adversarial Examples, Raghunathan et al, 2018, [arXiv](https://arxiv.org/abs/1801.09344)
 45 | - Speaker-Follower Models for Vision-and-Language Navigation, Fried et al, NIPS 2018, [arXiv](https://arxiv.org/abs/1806.02724)
 46 | - Newsroom: A Dataset of 1.3 Million Summaries with Diverse Extractive Strategies, Grusky et al, NAACL 2018, [arXiv](https://arxiv.org/abs/1804.11283)
 47 | - Architectural Complexity Measures of Recurrent Neural Networks, Zhang et al, NIPS 2016, [arXiv](https://arxiv.org/abs/1602.08210)
 48 | - Gradient Estimation Using Stochastic Computation Graphs, Schulman et al, 2016. [arXiv](https://arxiv.org/abs/1506.05254)
 49 | - Variational Inference: A Review for Statisticians, Blei et al, 2018. [arXiv](https://arxiv.org/abs/1601.00670)
 50 | - Variational Inference with Normalizing Flows, Rezende et al, 2016. [arXiv](https://arxiv.org/abs/1505.05770)
 51 | - Large Scale GAN Training for High Fidelity Natural Image Synthesis, Brock et al, submission to ICLR 2019. [arXiv](https://arxiv.org/abs/1809.11096)
 52 | - BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, Devlin et al, 2018. [arXiv](https://arxiv.org/pdf/1810.04805.pdf)
 53 | - The Concrete Distribution: A Continuous Relaxation of Discrete Random Variables, Maddison et al, 2017. [arXiv](https://arxiv.org/abs/1611.00712). Note: the Concrete distribution is equivalent to the Gumbel-Softmax.
 54 | - Categorical Reparameterization with Gumbel-Softmax, Jang et al, 2017. [arXiv](https://arxiv.org/abs/1611.01144). Note: Gumbel-Softmax is equivalent to the Concrete distribution.
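
A minimal sketch of drawing one Gumbel-Softmax (Concrete) sample, as referenced in the two entries above: add Gumbel(0, 1) noise to the logits, divide by a temperature, and apply a softmax. The function names, the clamping constant `eps`, and the default temperature are assumptions for illustration, not taken from either paper's code.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def gumbel_softmax_sample(logits, temperature=1.0, rng=random):
    """Draw one relaxed (continuous) sample from a categorical distribution."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    eps = 1e-12  # assumed clamp that keeps both logarithms finite
    noisy = []
    for logit in logits:
        u = min(max(rng.random(), eps), 1.0 - eps)
        gumbel = -math.log(-math.log(u))  # Gumbel(0, 1) noise
        noisy.append((logit + gumbel) / temperature)
    return softmax(noisy)


sample = gumbel_softmax_sample([1.0, 0.5, -1.0], temperature=0.5,
                               rng=random.Random(0))
print(sample)       # three non-negative weights
print(sum(sample))  # approximately 1.0
```

Lower temperatures push samples toward one-hot vectors, while higher temperatures make them closer to uniform.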
 55 | 
 56 | ### 2018-09
 57 | - Universal Transformers, Dehghani et al, 2018. [arXiv](https://arxiv.org/abs/1807.03819), [`google blog post`](https://ai.googleblog.com/2018/08/moving-beyond-translation-with.html)
 58 | - Phrase-Based & Neural Unsupervised Machine Translation, Lample et al, EMNLP 2018. [arXiv](https://arxiv.org/abs/1804.07755)
 59 | - Hybrid Reward Architecture for Reinforcement Learning, Seijen et al, 2017. [arXiv](https://arxiv.org/abs/1706.04208).
 60 | 
 61 | ### 2018-08
 62 | - Vehicle Communication Strategies for Simulated Highway Driving, Resnick et al, 2017, NIPS 2017 Workshop on Emergent Communication.
 63 | - Emergent Communication through Negotiation, Cao et al, NIPS 2017 Workshop on Emergent Communication.
 64 | - Obfuscated Gradients Give a False Sense of Security: Circumventing Defenses to Adversarial Examples, Athalye et al, ICML 2018. [arXiv](https://arxiv.org/abs/1802.00420). Defeats 7 of 9 recently introduced adversarial defense methods. Won best paper at ICML.
 65 | - Meta-Gradient Reinforcement Learning, Xu et al 2018, [arXiv](https://arxiv.org/abs/1805.09801)
 66 | 
 67 | ### 2018-07
 68 | - Proximal Policy Optimization Algorithms, Schulman et al, 2017. [arXiv](https://arxiv.org/abs/1707.06347), [`openai blog`](https://blog.openai.com/openai-baselines-ppo/), OpenAI Five [`blogpost`], which applies scaled-up PPO to Dota 2.
 69 | - What you can cram into a single $&!#* vector: Probing sentence embeddings for linguistic properties, Conneau et al, ACL 2018. [arXiv](https://arxiv.org/abs/1805.01070). The authors go through 10 probing tasks to find out some of the things the embeddings capture, trained with various architectures.
 70 | - Style Transfer Through Back-Translation, Prabhumoye et al, ACL 2018. [arXiv](https://arxiv.org/abs/1804.09000)
 71 | - Hierarchical Neural Story Generation, Fan et al, ACL 2018. [arXiv](https://arxiv.org/abs/1805.04833). Generate a short story based on a "prompt", impressive results. Also has some cool tricks, like model fusion, a different type of attention, k=10 sampling, etc.
 72 | - Representation Learning for Grounded Spatial Reasoning, Janner et al, ACL 2018. [arXiv](https://arxiv.org/abs/1707.03938)
 73 | - Generating Sentences by Editing Prototypes, Guu et al, ACL 2018. [arXiv](https://arxiv.org/abs/1709.08878)
 74 | - A Stochastic Decoder for Neural Machine Translation, Schulz et al, ACL 2018. [arXiv](https://arxiv.org/abs/1805.10844)
 75 | - The Hitchhiker’s Guide to Testing Statistical Significance in Natural Language Processing, Dror et al, ACL 2018. [aclweb](http://aclweb.org/anthology/P18-1128)
 76 | - Stock Movement Prediction from Tweets and Historical Prices, Xu and Cohen, ACL 2018. [pdf](http://aclweb.org/anthology/P18-1183)
 77 | - Sharp Nearby, Fuzzy Far Away: How Neural Language Models Use Context, Khandelwal et al, ACL 2018. [arXiv](https://arxiv.org/abs/1805.04623)
 78 | - Backpropagating through Structured Argmax using a SPIGOT, Peng et al, ACL 2018. [arXiv](https://arxiv.org/abs/1805.04658)
 79 | - Long Short-Term Memory as a Dynamically Computed Element-wise Weighted Sum, Levy et al, ACL 2018. [arXiv](https://arxiv.org/abs/1805.03716)
 80 | 
 81 | ### 2018-06
 82 | - Self-Imitation Learning, Oh et al, 2018. [arXiv](https://arxiv.org/abs/1806.05635). Performs an on-policy A2C update and an off-policy SIL update, which samples positive experiences from a replay buffer and uses a form of AC.
 83 | - Improving Language Understanding with Unsupervised Learning, Radford et al, 2018. [openai](https://blog.openai.com/language-unsupervised/)
 84 | - Prioritized Experience Replay, Schaul et al, ICLR 2016. [arXiv](https://arxiv.org/abs/1511.05952)
 85 | - Scalable trust-region method for deep reinforcement learning using Kronecker-factored approximation, Wu et al, 2017. [arXiv]()
 86 | - Universal Statistics of Fisher Information in Deep Neural Networks: Mean Field Approach, Karakida et al, 2018. [arXiv](https://arxiv.org/abs/1806.01316)
 87 | - On Learning Intrinsic Rewards for Policy Gradient Methods, Zheng et al, 2018. [arXiv](https://arxiv.org/abs/1804.06459)
 88 | - Breaking the Softmax Bottleneck: A High-Rank RNN Language Model, Yang et al, ICLR 2018. [openreview](https://openreview.net/forum?id=HkwZSG-CZ), [arXiv](https://arxiv.org/abs/1711.03953), [`summary`](summaries/softmax_bottleneck.md). Given a language model output matrix A over time, where each row is the vocabulary distribution given a context, the authors hypothesize that A must be high rank to express complex language, and that a single softmax is not expressive enough. They propose a mixture of softmaxes.
 89 | - Measuring the Intrinsic Dimension of Objective Landscapes, Li et al, ICLR 2018. [openreview](), [arXiv](https://arxiv.org/abs/1804.08838), [`summary`](summaries/intrinsic_dimension.md). Intrinsic dimension is the dimension of the smallest parameter subspace (projected into the full parameter space) that achieves a certain performance. It is a measure of model-problem complexity.
 90 | - Control of Memory, Active Perception, and Action in Minecraft, Oh et al, ICML 2016. [arXiv](https://arxiv.org/abs/1605.09128)
 91 | - Multitask Learning, Rich Caruana, PhD thesis 1997. [pdf](http://reports-archive.adm.cs.cmu.edu/anon/1997/CMU-CS-97-203.pdf). Work in the 90s on transfer learning! Chapter 5 discusses auxiliary tasks for neural nets! 20 years before the UNREAL paper!
 92 | - Neural Map: Structured Memory for Deep Reinforcement Learning, Parisotto and Salakhutdinov, ICLR 2018. [arXiv](https://arxiv.org/abs/1702.08360). Instead of free external memory, have memory locations correlate with agent location, i.e. structured memory. Hugely outperforms memory nets and others on maze problems.
 93 | - On the State of the Art of Evaluation in Neural Language Models, ICLR 2018. [openreview](https://openreview.net/forum?id=ByJHuTgA-&). Some simple language models, like LSTM, actually achieve SOTA or near SOTA with proper hyperparams and simple additions, like shared embeddings and variational dropout (see Table 4 ablation).
 94 | - Reinforcement Learning with Unsupervised Auxiliary Tasks, Jaderberg et al, ICLR 2017. [openreview](https://openreview.net/forum?id=SJ6yPD5xg). Introduces the UNREAL model. See the Caruana PhD thesis (1997) above, which discusses auxiliary tasks for better representations!
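
The "mixture of softmaxes" idea from the softmax bottleneck entry above can be sketched as a weighted sum of several softmax distributions over the same vocabulary. This is a simplified illustration, not the paper's model: here the mixture weights and per-component logits are passed in directly as plain lists, whereas in a language model they would be computed from the context. All function and parameter names are assumptions.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def mixture_of_softmaxes(mixture_logits, component_logits):
    """Return sum_k pi_k * softmax(component_logits[k]) over the vocabulary."""
    if len(mixture_logits) != len(component_logits):
        raise ValueError("need one mixture logit per component")
    weights = softmax(mixture_logits)                 # pi_k, sums to 1
    components = [softmax(row) for row in component_logits]
    vocab_size = len(components[0])
    if any(len(c) != vocab_size for c in components):
        raise ValueError("all components must share one vocabulary size")
    return [
        sum(w * c[i] for w, c in zip(weights, components))
        for i in range(vocab_size)
    ]


probs = mixture_of_softmaxes(
    [0.0, 0.0],                          # two equally weighted components
    [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]],  # each favors a different word
)
print(probs)       # a valid distribution over three words
print(sum(probs))  # approximately 1.0
```

The mixing happens in probability space, so the log of the result is generally not a linear function of the component logits. That is the property the entry above points to when it says a single softmax is not expressive enough.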
 95 | 
 96 | ### 2018-03
 97 | - Parameter Space Noise for Exploration, Plappert et al, ICLR 2018. [arXiv](https://arxiv.org/abs/1706.01905). Instead of adding noise to the action space, add noise to the function approximator's parameters for better exploration.
 98 | - Continuous control with deep reinforcement learning, Lillicrap et al, ICLR 2016. [arXiv](https://arxiv.org/abs/1509.02971). Introduced Deep Deterministic Policy Gradient (DDPG), an off-policy actor-critic algorithm applicable to continuous action spaces.
 99 | - Deterministic Policy Gradient Algorithms, Silver et al, ICML 2014. [pdf](http://proceedings.mlr.press/v32/silver14.pdf). DPG is the expected gradient of the action-value function, easier to estimate than the traditional stochastic policy gradient.
100 | - Beyond Word Importance: Contextual Decomposition to Extract Interactions from LSTMs, Murdoch et al, ICLR 2018. [pdf](https://openreview.net/pdf?id=rkRwGg-0Z), [arXiv](https://arxiv.org/abs/1801.05453)
101 | - Emergence Of Linguistic Communication From Referential Games With Symbolic And Pixel Input, Lazaridou et al, ICLR 2018. [pdf](https://openreview.net/pdf?id=HJGv1Z-AW)
102 | - Emergent Communication in a Multi-Modal, Multi-Step Referential Game, Evtimova et al, ICLR 2018. [arXiv](https://arxiv.org/abs/1705.10369), [`code`](https://github.com/nyu-dl/MultimodalGame/blob/master/model.py)
103 | - Neural Speed Reading via Skim-RNN, Seo et al, ICLR 2018. [arXiv](https://arxiv.org/abs/1711.02085)
104 | - Dynamic Word Embeddings for Evolving Semantic Discovery, Yao et al, 2017. [arXiv](https://arxiv.org/abs/1703.00607)
105 | 
106 | ### 2018-02
107 | - One Model To Learn Them All, Kaiser et al, 2017. [arXiv](https://arxiv.org/abs/1706.05137)
108 | - An Analysis of Temporal-Difference Learning with Function Approximation, Tsitsiklis and Van Roy, 1997. [pdf](http://web.mit.edu/jnt/www/Papers/J063-97-bvr-td.pdf)
109 | - Steps Toward Artificial Intelligence, Minsky, 1961. [pdf](https://courses.csail.mit.edu/6.803/pdf/steps.pdf)
110 | - Eye on the Prize, Nilsson, 1995. [pdf](http://ai.stanford.edu/~nilsson/OnlinePubs-Nils/General%20Essays/AIMag16-02-002.pdf)
111 | - The Option-Critic Architecture, Bacon et al. [arXiv](https://arxiv.org/abs/1609.05140)
112 | - Learning Symmetric Collaborative Dialogue Agents with Dynamic Knowledge Graph Embeddings. He et al, 2017. [arXiv](https://arxiv.org/abs/1704.07130)
113 | - Learning to Win by Reading Manuals in a Monte-Carlo Framework, Branavan et al, 2012. [arXiv](https://arxiv.org/abs/1401.5390)
114 | 
115 | ### 2017-12
116 | - Generating Sentences by Editing Prototypes, Guu et al, 2017. [arXiv](https://arxiv.org/abs/1709.08878)
117 | - SenGen: Sentence Generating Neural Variational Topic Model, Nallapati et al, 2017. [arXiv](https://arxiv.org/abs/1708.00308)
118 | - Learning Sparse Neural Networks through L0 Regularization, Louizos et al 2017. [arXiv](https://arxiv.org/abs/1712.01312)
119 | - Sparsity and the Lasso, Tibshirani and Wasserman, 2015. [pdf](http://www.stat.cmu.edu/~larry/=sml/sparsity.pdf). Note: related to L0 paper above
120 | - Proving convexity, Loh 2013. [pdf](http://www.math.cmu.edu/~ploh/docs/math/mop2013/convexity-soln.pdf). Note: related to L0 paper above
121 | - Mathematics of Deep Learning, Vidal et al, 2017. [arXiv](https://arxiv.org/abs/1712.04741)
122 | - Bayesian Hypernetworks, Krueger et al, 2017. [arXiv](https://arxiv.org/abs/1710.04759)
123 | - SummaRuNNer: A Recurrent Neural Network based Sequence Model for Extractive Summarization of Documents, Nallapati et al, 2016. [arXiv](https://arxiv.org/abs/1611.04230)
124 | - Learning Online Alignments with Continuous Rewards Policy Gradient, Luo et al 2016. [arXiv](https://arxiv.org/abs/1608.01281)
125 | - Asynchronous Methods for Deep Reinforcement Learning, Mnih et al, 2016. [arXiv](https://arxiv.org/abs/1602.01783). Introduces A3C, Asynchronous Advantage Actor Critic
126 | - On The State of The Art In Neural Language Models, Anonymous, 2017. [iclr pdf](https://openreview.net/pdf?id=ByJHuTgA-)
127 | - Natural Language Inference with External Knowledge, Chen et al 2017. [arXiv](https://arxiv.org/abs/1711.04289)
128 | 
129 | ### 2017-11
130 | - Memory Augmented Neural Networks with Wormhole Connections, Gulcehre et al, 2017. [arXiv](https://arxiv.org/abs/1701.08718)
131 | - Emergence of Invariance and Disentangling in Deep Representations, Achille et al, 2017. [arXiv](https://arxiv.org/abs/1706.01350)
132 | - Distilling the Knowledge in a Neural Network, Hinton et al, 2015. [arXiv](https://arxiv.org/abs/1503.02531)
133 | - Seq2SQL: Generating Structured Queries From Natural Language Using Reinforcement Learning, Zhong et al, 2017. [arXiv](https://arxiv.org/abs/1709.00103)
134 | - Better Text Understanding Through Image-To-Text Transfer, Kurach, 2017. [arXiv](
135 | - Data Augmentation Generative Adversarial Networks, Antoniou et al, 2017. [arXiv](https://arxiv.org/abs/1711.04340)
136 | - Adversarial Training Methods for Semi-Supervised Text Classification, Miyato et al, 2017. [arXiv](https://arxiv.org/abs/1605.07725)
137 | - Deep Gradient Compression: Reducing the Communication Bandwidth for Distributed Training, Anonymous, 2017. [openreview](https://openreview.net/pdf?id=SkhQHMW0W)
138 | - Tying Word Vectors and Word Classifiers: A Loss Framework for Language Modeling, Inan et al 2017. [arXiv](https://arxiv.org/abs/1611.01462)
139 | - Building machines that learn and think for themselves, Botvinick et al, 2017. [cambridge](https://www.cambridge.org/core/journals/behavioral-and-brain-sciences/article/building-machines-that-learn-and-think-for-themselves/E28DBFEC380D4189FB7754B50066A96F)
140 | - Neural Discrete Representation Learning, van den Oord et al, 2017. [arXiv](https://arxiv.org/abs/1711.00937)
141 | - InfoGAN: Interpretable Representation Learning by Information Maximizing Generative Adversarial Nets, Chen et al, 2016. [arXiv](https://arxiv.org/abs/1606.03657), [`blog`](https://towardsdatascience.com/infogan-generative-adversarial-networks-part-iii-380c0c6712cd), [`code`](https://github.com/zjost/InfoGAN)
142 | - Evolution Strategies, Otoro 2017, blog part [1](http://blog.otoro.net/2017/10/29/visual-evolution-strategies/), [2](http://blog.otoro.net/2017/11/12/evolving-stable-strategies/)
143 | - Matrix Capsules with EM Routing. Anonymous (likely Hinton lab), 2017. [openreview](https://openreview.net/pdf?id=HJWLfGWRb).
144 | - Dynamic Routing Between Capsules, Sabour et al, 2017. [arXiv](https://arxiv.org/abs/1710.09829). [`code-keras`](https://github.com/XifengGuo/CapsNet-Keras), [`video review`](https://youtu.be/pPN8d0E3900)
145 | - Weighted Transformer Network for Machine Translation, Ahmed et al, 2017. [arXiv](https://arxiv.org/abs/1711.02132)
146 | - Unsupervised Machine Translation Using Monolingual Corpora Only, Lample et al, 2017. [arXiv](https://arxiv.org/abs/1711.00043)
147 | - Non-Autoregressive Neural Machine Translation, Gu et al, 2017. [arXiv](https://arxiv.org/abs/1711.02281)
148 | - Multi-Agent Actor-Critic for Mixed Cooperative-Competitive Environments, Lowe et al, 2017. [arXiv](https://arxiv.org/abs/1706.02275)
149 | 
150 | ### 2017-10
151 | - Adversarial Learning for Neural Dialogue Generation, Li et al, 2017. [arXiv](https://arxiv.org/abs/1701.06547)
152 | - Frustratingly Short Attention Spans in Neural Language Modeling, Daniluk et al, 2017. [arXiv](https://arxiv.org/abs/1702.04521)
153 | - Adversarial Training Methods for Semi-Supervised Text Classification, Miyato et al, 2017. [arXiv](https://arxiv.org/abs/1605.07725)
154 | - Progressive Growing of GANs for Improved Quality, Stability, and Variation, Karras et al, 2017. [pdf](http://research.nvidia.com/sites/default/files/pubs/2017-10_Progressive-Growing-of//karras2017gan-paper.pdf)
155 | - A Closer Look at Memorization in Deep Networks, Arpit et al, 2017. [arXiv](https://arxiv.org/abs/1706.05394)
156 | - Understanding deep learning requires rethinking generalization, Zhang et al, 2016. [arXiv](https://arxiv.org/abs/1611.03530)
157 | - The Loss Surfaces of Multilayer Networks, Choromanska et al, 2015. [arXiv](https://arxiv.org/abs/1412.0233)
158 | - Meta Learning Shared Hierarchies, Frans et al, 2017. [arXiv](https://arxiv.org/abs/1710.09767), [`author blog`](https://blog.openai.com/learning-a-hierarchy/)
159 | - Mastering the game of Go without human knowledge, Silver et al, 2017. [nature](https://www.nature.com/nature/journal/v550/n7676/full/nature24270.html), [`blog`](http://tim.hibal.org/blog/alpha-zero-how-and-why-it-works/)
160 | - Relevance of Unsupervised Metrics in Task-Oriented Dialogue for Evaluating Natural Language Generation, Sharma et al, 2017. [arXiv](https://arxiv.org/abs/1706.09799)
161 | - GuessWhat?! Visual object discovery through multi-modal dialogue, de Vries et al, 2017. [arXiv](https://arxiv.org/abs/1611.08481)
162 | - A Frame Tracking Model for Memory-Enhanced Dialogue Systems, Schulz et al, 2017. [arXiv](https://arxiv.org/abs/1706.01690)
163 | - A Deep Reinforced Model for Abstractive Summarization, Paulus et al, 2017. [arXiv](https://arxiv.org/abs/1705.04304), [`author blog`](https://einstein.ai/research/your-tldr-by-an-ai-a-deep-reinforced-model-for-abstractive-summarization)
164 | - (about ROUGE score for summarization) ROUGE: A Package for Automatic Evaluation of Summaries, Chin-Yew Lin, 2004. [acl](http://anthology.aclweb.org/W/W04/W04-1013.pdf)
165 | - Rainbow: Combining Improvements in Deep Reinforcement Learning, Hessel et al, 2017. [arXiv](https://arxiv.org/abs/1710.02298)
166 | - Language Modeling with Gated Convolutional Networks, Dauphin et al, 2017, [arXiv](https://arxiv.org/abs/1612.08083)
167 | - Convolutional Sequence to Sequence Learning, Gehring et al, 2017. [arXiv](https://arxiv.org/abs/1705.03122)
168 | - Emergence of Grounded Compositional Language in Multi-Agent Populations, Mordatch and Abbeel, 2017. [arXiv](https://arxiv.org/abs/1703.04908), [`author blog`](https://blog.openai.com/learning-to-communicate/). Note: related to Kottur et al 2017.
169 | 
170 | ### 2017-09
171 | - Natural Language Does Not Emerge 'Naturally' in Multi-Agent Dialog, Kottur et al, 2017. [arXiv](https://arxiv.org/abs/1706.08502), [`code`](https://github.com/batra-mlp-lab/lang-emerge)
172 | - Opening the black box of Deep Neural Networks via Information, Shwartz-Ziv and Tishby, 2017. [arXiv](https://arxiv.org/abs/1703.00810), [m-p review](https://blog.acolyer.org/2017/11/15/opening-the-black-box-of-deep-neural-networks-via-information-part-i/)
173 | - End-to-end Neural Coreference Resolution, Lee et al, 2017. [arXiv](https://arxiv.org/abs/1707.07045)
174 | - Deep Reinforcement Learning for Mention-Ranking Coreference Models, Clark et al, 2016. [arXiv](https://arxiv.org/abs/1609.08667)
175 | - Oriented Response Networks, Zhou et al 2017. [arXiv](https://arxiv.org/abs/1701.01833)
176 | - Training RNNs as Fast as CNNs, Lei et al, 2017. [arXiv](https://arxiv.org/abs/1709.02755)
177 | - Quasi-Recurrent Neural Networks, Bradbury et al 2017. [openreview](https://openreview.net/pdf?id=H1zJ-v5xl), [`author blog/code`](https://einstein.ai/research/new-neural-network-building-block-allows-faster-and-more-accurate-text-understanding)
178 | - A Deep Reinforcement Learning Chatbot, Serban et al, 2017. [arXiv](https://arxiv.org/abs/1709.02349)
179 | - Independently Controllable Factors, Thomas et al, 2017. [arXiv](https://arxiv.org/abs/1708.01289)
180 | - Attention Is All You Need, Vaswani et al, 2017. [arXiv](https://arxiv.org/abs/1706.03762), [`code`](https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/models/transformer.py), [`google blog`](https://research.googleblog.com/2017/08/transformer-novel-neural-network.html), [`reddit`](https://www.reddit.com/r/MachineLearning/comments/6gwqiw/r_170603762_attention_is_all_you_need_sota_nmt/)
181 | - Attention-over-Attention Neural Networks for Reading Comprehension, Cui et al 2017. [arXiv](https://arxiv.org/abs/1607.04423), [`code`](https://github.com/OlavHN/attention-over-attention)
182 | - Get To The Point: Summarization with Pointer-Generator Networks, See et al 2017. [arXiv](https://arxiv.org/abs/1704.04368), [`author blog`](http://www.abigailsee.com/2017/04/16/taming-rnns-for-better-summarization.html), [`code`](https://github.com/abisee/cnn-dailymail)
183 | - β-VAE: Learning Basic Visual Concepts with a Constrained Variational Framework, Higgins et al 2017. [pdf](https://openreview.net/pdf?id=Sy2fzU9gl)
184 | - Massive Exploration of Neural Machine Translation Architectures, Britz et al 2017. [arXiv](https://arxiv.org/abs/1703.03906v2)
185 | - Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks, 2017. [arXiv](https://arxiv.org/abs/1703.10593), [`examples`](https://junyanz.github.io/CycleGAN/), [`code-torch`](https://github.com/junyanz/CycleGAN), [`code-PyT`](https://github.com/junyanz/pytorch-CycleGAN-and-pix2pix)
186 | 
187 | ### 2017-08
188 | - A Brief Survey of Deep Reinforcement Learning, Arulkumaran et al 2017. [arXiv](https://arxiv.org/abs/1708.05866)
189 | - Regularizing and Optimizing LSTM Language Models, Merity et al 2017. [arXiv](http://lanl.arxiv.org/abs/1708.02182v1)
190 | - Improving Neural Machine Translation with Conditional Sequence Generative Adversarial Nets, Yang et al 2017. [arXiv](https://arxiv.org/abs/1703.04887)
191 | - Learning Discourse-level Diversity for Neural Dialog Models using Conditional Variational Autoencoders, Zhao et al 2017. [arXiv](https://arxiv.org/abs/1703.10960)
192 | - How to Train Your DRAGAN, Kodali et al 2017. [arXiv](https://arxiv.org/abs/1705.07215)
193 | - Improved Training of Wasserstein GANs, Gulrajani et al 2017. [arXiv](https://arxiv.org/abs/1704.00028), [`blog`](https://lernapparat.de/improved-wasserstein-gan/), [`blog`](http://lernapparat.de/more-improved-wgan/), [`code`](https://github.com/igul222/improved_wgan_training)
194 | - Wasserstein GAN, Arjovsky et al 2017. [arXiv](https://arxiv.org/abs/1701.07875), [`read-through`](http://www.alexirpan.com/2017/02/22/wasserstein-gan.html), [`Kantorovich-Rubinstein duality`](https://vincentherrmann.github.io/blog/wasserstein/), [`WGAN-tensorflow`](https://github.com/shekkizh/WassersteinGAN.tensorflow), [`blog/code`](https://wiseodd.github.io/techblog/2017/02/04/wasserstein-gan/)
195 | - Reading Scene Text in Deep Convolutional Sequences, He et al, 2016. [arXiv](https://arxiv.org/abs/1506.04395)
196 | 
197 | ### 2017-03
198 | - Recurrent Batch Normalization, Cooijmans et al, 2017. [arXiv](https://arxiv.org/abs/1603.09025), [`code-tf`](https://github.com/OlavHN/bnlstm)
199 | - An Actor-Critic Algorithm for Sequence Prediction, Bahdanau et al 2017. [arXiv](https://arxiv.org/abs/1607.07086), [`code`](https://github.com/rizar/actor-critic-public)
200 | - Scheduled Sampling for Sequence Prediction with Recurrent Neural Networks, Bengio et al, 2015. [arXiv](https://arxiv.org/abs/1506.03099), [`summary`](summaries/scheduled_sampling.md)
201 | - Hybrid computing using a neural network with dynamic external memory, published in [Nature](https://www.dropbox.com/s/0a40xi702grx3dq/2016-graves.pdf)
202 | - Neural Turing Machine, [arXiv](https://arxiv.org/abs/1410.5401)
203 | - Learning End-to-End Goal-Oriented Dialog, Bordes et al, 2017. [arXiv](https://arxiv.org/abs/1605.07683), [`code`](https://github.com/carpedm20/MemN2N-tensorflow)
204 | - End-To-End Memory Networks, Sukhbaatar et al, 2015, [arXiv](https://arxiv.org/abs/1503.08895)
205 | - Memory Networks, [arXiv](https://arxiv.org/abs/1410.3916)
206 | - Deep Photo Style Transfer, [arXiv](https://arxiv.org/abs/1703.07511)
207 | - Matching Networks for One Shot Learning, Vinyals et al, NIPS 2016. [arXiv](https://arxiv.org/abs/1606.04080). [`summary`](summaries/matching_networks.md), [`code`](https://github.com/zergylord/oneshot). [`karpathy notes`](http://www.shortscience.org/paper?bibtexKey=journals/corr/VinyalsBLKW16#karpathy), [`Colyer blog`](https://blog.acolyer.org/2017/01/03/matching-networks-for-one-shot-learning/)
208 | 
209 | ### 2017-01
210 | - Optimization As A Model For Few-Shot Learning, Sachin Ravi and Hugo Larochelle, ICLR 2017. [openreview](https://openreview.net/pdf?id=rJY0-Kcll), [video](https://www.youtube.com/watch?v=igJmB6d8y8E)
211 | - NIPS 2016 Tutorial: Generative Adversarial Networks, [annotated](https://drive.google.com/file/d/0ByV7wn2NzevOa2RqZmJVR2hrUTA/view?usp=sharing), [arXiv](https://arxiv.org/abs/1701.00160), [blog/code](https://wiseodd.github.io/techblog/2016/09/17/gan-tensorflow/)
212 | 
213 | ### 2016-11
214 | - [Fully Character-Level Neural Machine Translation without Explicit Segmentation](summaries/fully_char_level_nmt.md), [annotated](https://drive.google.com/open?id=0ByV7wn2NzevOQ0JtTTRuR0pjUlE), [arXiv](https://arxiv.org/abs/1610.03017)
215 | - [Neural Machine Translation by Jointly Learning to Align and Translate](summaries/neural_machine_translation.md), [annotated](https://drive.google.com/file/d/0ByV7wn2NzevOS3FmWHVNazhnczA/view?usp=sharing), [arXiv](https://arxiv.org/abs/1409.0473)
216 | - [Sequence to Sequence Learning with Neural Networks](summaries/seq2seq_nn.md), [annotated](https://drive.google.com/file/d/0ByV7wn2NzevOQ1l5aUF4RWYtenc/view?usp=sharing), [arXiv](https://arxiv.org/abs/1409.3215)
217 | - [Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation](summaries/learning_phrase_rep_RNN_encoder_decoder_mt.md), [arXiv](https://arxiv.org/abs/1406.1078)
218 | - [Implicit Discourse Relation Detection via a Deep Architecture with Gated Relevance Network](summaries/implicit_drd_grn.md), [annotated](https://drive.google.com/file/d/0ByV7wn2NzevOLUxtemFqejJmNVU/view?usp=sharing), [acl](https://www.aclweb.org/anthology/P/P16/P16-1163.pdf)
219 | 
220 | ### 2016-10
221 | - Learning Structured Output Representation using Deep Conditional Generative Models, Sohn et al 2015. (Conditional VAE) [nips](https://papers.nips.cc/paper/5775-learning-structured-output-representation-using-deep-conditional-generative-models), [blog/code](https://wiseodd.github.io/techblog/2016/12/17/conditional-vae/), [code](https://github.com/hwalsuklee/tensorflow-mnist-CVAE)
222 | - [Auto-Encoding Variational Bayes](summaries/auto-encoding_var_bayes.md), [annotated](https://drive.google.com/file/d/0ByV7wn2NzevOcjBIeVBZcTFUQ2s/view?usp=sharing), [arXiv](https://arxiv.org/abs/1312.6114), [blog/code](https://wiseodd.github.io/techblog/2016/12/10/variational-autoencoder/)
222 | - [Semi-supervised Variational Autoencoders for Sequence Classification](summaries/var_auto_sequence_class.md), [annotated](https://drive.google.com/file/d/0ByV7wn2NzevOTXEzLWlNQy1od0k/view?usp=sharing), [arXiv](https://arxiv.org/abs/1603.02514)
223 | - [Autoencoder review](summaries/autoencoders.md) by Keras author Francois Chollet
224 | 
225 | ## Datasets
226 | - UCI [machine learning repository](https://archive.ics.uci.edu/ml/datasets.html?format=&task=&att=&area=&numAtt=&numIns=&type=&sort=instDown&view=table). 360 datasets, some very large. Nice sorting feature, such as ">1000 instance/classification/text" results in [14 data sets](https://archive.ics.uci.edu/ml/datasets.html?format=&task=cla&att=&area=&numAtt=&numIns=greater1000&type=&sort=instDown&view=table)
227 | 
228 | ## Paper collections
229 | - ["Awesome deep learning papers"](https://github.com/terryum/awesome-deep-learning-papers/), a collection of the 100 best papers from the past few years
230 | - Paper collection by [songrotek](https://github.com/songrotek/Deep-Learning-Papers-Reading-Roadmap/blob/master/README.md)
231 | 
232 | ## Overview
233 | - [Nature Review article. LeCun, Bengio, Hinton. 2015](http://www.nature.com/articles/nature14539.epdf?referrer_access_token=K4awZz78b5Yn2_AoPV_4Y9RgN0jAjWel9jnR3ZoTv0PU8PImtLRceRBJ32CtadUBVOwHuxbf2QgphMCsA6eTOw64kccq9ihWSKdxZpGPn2fn3B_8bxaYh0svGFqgRLgaiyW6CBFAb3Fpm6GbL8a_TtQQDWKuhD1XKh_wxLReRpGbR_NdccoaiKP5xvzbV-x7b_7Y64ZSpqG6kmfwS6Q1rw%3D%3D&tracking_referrer=www.nature.com)
234 |   * Good short overview
235 | - [Schmidhuber, J. (2015). Deep learning in neural networks: An overview. Neural Networks, 61, 85–117.](http://arxiv.org/abs/1404.7828)
236 |   * Extensive overview
237 | 
238 | ## Neural Networks Basics
239 | 
240 | - [Michael Nielsen book on NN](http://neuralnetworksanddeeplearning.com/chap1.html)
241 | - [Hacker's guide to Neural Networks. Andrej Karpathy blog](http://karpathy.github.io/neuralnets/)
242 | - [Visualize NN training](http://experiments.mostafa.io/public/ffbpann/)
243 | 
244 | ## Backpropagation
245 | 
246 | - [A Gentle Introduction to Backpropagation. Sathyanarayana (2014)](http://numericinsight.com/uploads/A_Gentle_Introduction_to_Backpropagation.pdf)
247 | - [Learning representations by back-propagating errors. Rumelhart, Hinton, Williams, 1986](http://www.nature.com/nature/journal/v323/n6088/abs/323533a0.html)
248 |   * Seminal paper by Rumelhart, Hinton, and Williams on back-propagation.
249 | - [The Backpropagation Algorithm](http://page.mi.fu-berlin.de/rojas/neural/chapter/K7.pdf)
250 |   * Longer tutorial on the topic, 34 pages
251 | - [Overview of various optimization algorithms](http://sebastianruder.com/optimizing-gradient-descent/)
252 |   * [Summary](summaries/overview_optimization.md)
253 | 
254 | ## Misc
255 | - Multi-Task Learning Objectives for Natural Language Processing, [blog](http://ruder.io/multi-task-learning-nlp/index.html)
256 | 
257 | ## Recurrent Neural Network (RNN)
258 | 
259 | - [Blog intro, tutorial](http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-1-introduction-to-rnns/)
260 | - [Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. Cho et al. (2014)](http://arxiv.org/abs/1406.1078)
261 | - [Character-Aware Neural Language Models. Kim et al. 2015.](http://arxiv.org/pdf/1508.06615.pdf)
262 | - [The Unreasonable Effectiveness of Recurrent Neural Networks. Karpathy](http://karpathy.github.io/2015/05/21/rnn-effectiveness/)
263 |   * In-depth, with examples in vision and NLP. Provides code.
264 | - [Sequence-to-Sequence Learning with Neural Networks. Sutskever et al (2014)](http://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf)
265 |   * Ground-breaking work on machine translation with RNN and LSTM
266 | - [Training RNN. Sutskever thesis. 2013](http://www.cs.utoronto.ca/~ilya/pubs/ilya_sutskever_phd_thesis.pdf)
267 |   * In-depth, self-contained, 85 pages
268 | - [Understanding Natural Language with Deep Neural Networks Using Torch (2015)](http://devblogs.nvidia.com/parallelforall/understanding-natural-language-deep-neural-networks-using-torch/)
269 |   * See part on predicting next word with RNN.
270 | - [LSTM BASED RNN ARCHITECTURES FOR LARGE VOCABULARY SPEECH RECOGNITION](http://arxiv.org/pdf/1402.1128v1.pdf)
271 | - [Awesome Recurrent Neural Networks](https://github.com/kjw0612/awesome-rnn#lectures)
272 |   * Curated list of RNN resources
273 | 
274 | ## CNNs
275 | - [Karpathy cs231 review](http://cs231n.github.io/convolutional-networks/)
276 | - [Character-level Convolutional Networks for Text Classification](http://arxiv.org/abs/1509.01626)
277 |   * [Annotated](https://drive.google.com/open?id=0ByV7wn2NzevOZEw4QV9tbFNyVTQ)
278 | - [Collobert. Natural Language Processing (Almost) from Scratch (2011)](http://dl.acm.org/citation.cfm?id=2078186)
279 |   * Spurred interest in applying CNN to NLP.
280 | - [Multichannel Variable-Size Convolution for Sentence Classification. Yin, 2015](https://aclweb.org/anthology/K/K15/K15-1021.pdf)
281 |   * Interesting, borrows multichannel from image CNN, where each channel is a different word embedding.
282 | - [A CNN for Modelling Sentences. Kalchbrenner et al, 2014](http://phd.nal.co/papers/Kalchbrenner_DCNN_ACL14)
283 |   * Dynamic k-max pooling for variable length sentences.
284 | - [Semantic Relation Classification via Convolutional Neural Networks with Simple Negative Sampling. Xu et al, 2015](http://arxiv.org/pdf/1506.07650v1.pdf)
285 | - [Text Understanding from Scratch. Zhang, LeCun. (2015)](http://arxiv.org/abs/1502.01710)
286 | - [Kim. Convolutional Neural Networks for Sentence Classification (2014)](http://arxiv.org/pdf/1408.5882v2.pdf)
287 | - [Sensitivity Analysis of (And Practitioner's Guide to) CNN for Sentence Classification. Zhang, Wallace (2015)](http://arxiv.org/pdf/1510.03820v2.pdf)
288 |   * [Annotated](https://drive.google.com/open?id=0ByV7wn2NzevOY25JNlJQREVLZEU)
289 | - [Relation Extraction: Perspective from Convolutional Neural Networks. Nguyen, Grishman (2015)](http://www.cs.nyu.edu/~thien/pubs/vector15.pdf)
290 |   * [Annotated](https://drive.google.com/file/d/0ByV7wn2NzevObzAtV1QyUDl5X2M/view?usp=sharing)
291 | - [Convolutional Neural Network for Sentence Classification. Yahui Chen, 2015](https://uwspace.uwaterloo.ca/bitstream/handle/10012/9592/Chen_Yahui.pdf?sequence=3&isAllowed=y)
292 |   * Master's thesis, University of Waterloo
293 | 
294 | ## Deep Reinforcement Learning
295 | - [Playing Atari with Deep Reinforcement Learning. Mnih et al. (2013)](http://www.cs.toronto.edu/~vmnih/docs/dqn.pdf)
296 | - [YouTube Demo](https://www.youtube.com/watch?v=wfL4L_l4U9A)
297 | - Simple Reinforcement Learning with TensorFlow series, part [0](https://medium.com/emergent-future/simple-reinforcement-learning-with-tensorflow-part-0-q-learning-with-tables-and-neural-networks-d195264329d0)
298 | - Basic DQN in Keras, [`blog`](https://keon.io/deep-q-learning/), [`code`](https://github.com/keon/deep-q-learning)
299 | - Minimal and clean examples, [`code`](https://github.com/rlcode/reinforcement-learning)
300 | - Demystifying Deep RL, [`blog`](http://neuro.cs.ut.ee/demystifying-deep-reinforcement-learning/)
301 | - Berkeley course on DRL, [`course`](http://rll.berkeley.edu/deeprlcourse/)
302 | 
303 | ## Online Courses
304 | - [Deep Learning. Udacity, 2015](https://www.udacity.com/course/deep-learning--ud730)
305 |   * Very brief. It is more about getting a feel for DL and specifically about using TensorFlow for DL.
306 | - [Convolutional Neural Networks for Visual Recognition. Stanford, 2016](http://cs231n.stanford.edu/)
307 | - [Neural Network Course. Université de Sherbrooke, 2013](http://info.usherbrooke.ca/hlarochelle/neural_networks/description.html)
308 | - [Machine Learning Course, University of Oxford (2014-2015)](https://www.cs.ox.ac.uk/people/nando.defreitas/machinelearning/)
309 | - [Deep Learning for NLP, Stanford (2015)](http://cs224d.stanford.edu/)
310 |   * Click "syllabus" for full material
311 | - [Stanford Deep Learning tutorials](http://ufldl.stanford.edu/tutorial/)
312 |   * From basics of Machine Learning, to DNN, CNN, and others.
313 |   * Includes code.
314 | 
315 | ## Books
316 | - [Ian Goodfellow, Yoshua Bengio, Aaron Courville (2016). Deep Learning.](http://www.deeplearningbook.org)
317 | 


--------------------------------------------------------------------------------
/hackers_guide/1_chapt/base_case.js:
--------------------------------------------------------------------------------
 1 | // FROM: http://karpathy.github.io/neuralnets/
 2 | 
 3 | var print = function(str) {
 4 |   var str_new = document.getElementById('text').innerHTML + "<br>" + str;
 5 |   document.getElementById('text').innerHTML = str_new;
 6 | }
 7 | 
 8 | // circuit with single gate for now
 9 | var forwardMultiplyGate = function(x, y) { return x * y; };
10 | 
11 | // ----------------------------------------------------------
12 | // STRATEGY #1: RANDOM SEARCH
13 | // ----------------------------------------------------------
14 | var x = -2, y = 3; // some input values
15 | // try changing x,y randomly small amounts and keep track of what works best
16 | var tweak_amount = 0.01;
17 | var best_out = -Infinity;
18 | var best_x = x, best_y = y;
19 | for(var k = 0; k < 100; k++) {
20 |   var x_try = x + tweak_amount * (Math.random() * 2 - 1); // tweak x a bit
21 |   var y_try = y + tweak_amount * (Math.random() * 2 - 1); // tweak y a bit
22 |   var out = forwardMultiplyGate(x_try, y_try);
23 |   if(out > best_out) {
24 |     // best improvement yet! Keep track of the x and y
25 |     best_out = out; 
26 |     best_x = x_try, best_y = y_try;
27 |   }
28 | }
29 | print("Best x: " + best_x);
30 | print("Best y: " + best_y);
31 | print("Result: " + best_out);
32 | 
33 | // ----------------------------------------------------------
34 | // STRATEGY #2: PARTIAL DERIVATIVES --> GRADIENT
35 | // ----------------------------------------------------------
36 | // Based on following function: 
37 | // \frac{\partial f(x,y)}{\partial x} = \frac{f(x+h,y) - f(x,y)}{h}
38 | var x = -2, y = 3;
39 | var out = forwardMultiplyGate(x, y); // -6
40 | var h = 0.0001; //small change in variable to measure change in function
41 | // In theory we want the exact derivative, i.e. the limit of the expression
42 | // as h --> 0
43 | // The gradient is the vector of all the partial derivatives with respect to the inputs
44 | // It points in the direction of steepest increase of the function
45 | 
46 | // compute derivative with respect to x
47 | var xph = x + h; // -1.9999
48 | var out2 = forwardMultiplyGate(xph, y); // -5.9997
49 | var x_derivative = (out2 - out) / h; // 3.0
50 | 
51 | // compute derivative with respect to y
52 | var yph = y + h; // 3.0001
53 | var out3 = forwardMultiplyGate(x, yph); // -6.0002
54 | var y_derivative = (out3 - out) / h; // -2.0
55 | 
56 | // ----------------------------------------------------------
57 | // UPDATE BASED ON DERIVATIVES
58 | // ----------------------------------------------------------
59 | var step_size = 0.01;
60 | var out = forwardMultiplyGate(x, y); // before: -6
61 | x += step_size * x_derivative; // x becomes -1.97
62 | y += step_size * y_derivative; // y becomes 2.98
63 | var out_new = forwardMultiplyGate(x, y); // -5.87! exciting.
64 | print(" Output based on full partial deriv: " + out_new);
65 | 
66 | // ----------------------------------------------------------
67 | // STRATEGY #3: ANALYTIC GRADIENT
68 | // ----------------------------------------------------------
69 | /* 
70 | Previously we evaluated the circuit once per input to estimate each derivative numerically, so the cost of the gradient grows linearly with the number of inputs. That becomes too expensive for circuits with many inputs. Instead, we derive a direct expression, the analytic gradient.
71 | Plugging f(x,y) = xy into the definition of the derivative with respect to x, we get:
72 | \frac{\partial f(x,y)}{\partial x} = \frac{f(x+h,y) - f(x,y)}{h}
73 | = \frac{(x+h)y - xy}{h}
74 | = \frac{xy + hy - xy}{h}
75 | = \frac{hy}{h}
76 | = y
77 | The analytic derivative with respect to x is y, and with respect to y it is x. We no longer need to evaluate the circuit to compute the gradient.
78 | */
79 | var x = -2, y = 3;
80 | var out = forwardMultiplyGate(x, y); // before: -6
81 | var x_gradient = y; // by our complex mathematical derivation above
82 | var y_gradient = x;
83 | 
84 | var step_size = 0.01;
85 | x += step_size * x_gradient; // -1.97
86 | y += step_size * y_gradient; // 2.98
87 | var out_new = forwardMultiplyGate(x, y); // -5.87. Higher output! Nice.
88 | print("Analytic gradient output: " + out_new);
89 | 
90 | 
91 | 
92 | 
93 | 
94 | 
95 | 
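The numerical strategy above uses a one-sided difference, (f(x+h) - f(x)) / h. A common variant is the central difference, (f(x+h) - f(x-h)) / (2h), which is usually more accurate for the same h. The sketch below is an illustration only: `numericalGradient` and `multiply` are hypothetical names that do not appear in the original file, and the function takes its inputs as an array rather than as separate arguments.

```javascript
// Hypothetical helper (not part of base_case.js): central-difference
// estimate of the gradient of f, where f takes an array of inputs.
function numericalGradient(f, inputs, h) {
  h = h || 1e-4; // default step, same magnitude as in the file above
  return inputs.map(function (_, i) {
    var plus = inputs.slice();  // copy, then nudge input i up
    var minus = inputs.slice(); // copy, then nudge input i down
    plus[i] += h;
    minus[i] -= h;
    return (f(plus) - f(minus)) / (2 * h);
  });
}

// Same multiply gate as above, written for array input.
var multiply = function (v) { return v[0] * v[1]; };

var grad = numericalGradient(multiply, [-2, 3]);
console.log(grad); // close to the analytic gradient [y, x] = [3, -2]
```

For the multiply gate the estimate agrees with the analytic gradient up to floating-point rounding, because the function is linear in each input.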


--------------------------------------------------------------------------------
/hackers_guide/1_chapt/multiple_gates.js:
--------------------------------------------------------------------------------
 1 | // FROM: http://karpathy.github.io/neuralnets/
 2 | 
 3 | var print = function(str) {
 4 |   var str_new = document.getElementById('text').innerHTML + "<br>" + str;
 5 |   document.getElementById('text').innerHTML = str_new;
 6 | }
 7 | 
 8 | /* 
 9 | As in the base case, we compute gradients, but now across multiple gates. Each gate computes only its local derivatives, unaware of the rest of the circuit.
10 | */
11 | 
12 | // ----------------------------------------------------------
13 | // FUNCTION DEFINITION
14 | // ----------------------------------------------------------
15 | // We want to model the following expression:
16 | // f(x,y,z) = (x + y) z
17 | var forwardMultiplyGate = function(a, b) { 
18 |   return a * b;
19 | };
20 | var forwardAddGate = function(a, b) { 
21 |   return a + b;
22 | };
23 | var forwardCircuit = function(x,y,z) { 
24 |   var q = forwardAddGate(x, y);
25 |   var f = forwardMultiplyGate(q, z);
26 |   return f;
27 | };
28 | 
29 | var x = -2, y = 5, z = -4;
30 | var f = forwardCircuit(x, y, z); // output is -12
31 | print(f);
32 | 
33 | // ----------------------------------------------------------
34 | // GRADIENTS VIA CHAIN RULE (SIMPLE BACKPROPAGATION)
35 | // ----------------------------------------------------------
36 | /*
37 | To calculate the derivatives, we would start with the partial derivatives of function f (multiply gate) with respect to q and z, we would then calculate the partial derivatives of function q (add gate) with respect to x and y. 
38 | We then combine the two via the chain rule to get the gradient with respect to x, y and z. 
39 | */
40 | // initial conditions
41 | var x = -2, y = 5, z = -4;
42 | var q = forwardAddGate(x, y); // q is 3
43 | var f = forwardMultiplyGate(q, z); // output is -12
44 | 
45 | // gradient of the MULTIPLY gate with respect to its inputs
46 | // wrt is short for "with respect to"
47 | var derivative_f_wrt_z = q; // 3
48 | var derivative_f_wrt_q = z; // -4
49 | 
50 | // derivative of the ADD gate with respect to its inputs
51 | var derivative_q_wrt_x = 1.0;
52 | var derivative_q_wrt_y = 1.0;
53 | 
54 | // chain rule
55 | var derivative_f_wrt_x = derivative_q_wrt_x * derivative_f_wrt_q; // -4
56 | var derivative_f_wrt_y = derivative_q_wrt_y * derivative_f_wrt_q; // -4
57 | /* Note that although the derivatives of q with respect to x and y are 1, the derivative of f with respect to q is -4. Increasing q decreases f, so to increase f we want q to decrease. Since q = x + y, both x and y must decrease, which is why the derivative of f wrt x and y (via the chain rule) is -4. */
58 | 
59 | // final gradient, from above: [-4, -4, 3]
60 | var gradient_f_wrt_xyz = [derivative_f_wrt_x, derivative_f_wrt_y, derivative_f_wrt_z]
61 | 
62 | // let the inputs respond to the force/tug:
63 | var step_size = 0.01;
64 | x = x + step_size * derivative_f_wrt_x; // -2.04
65 | y = y + step_size * derivative_f_wrt_y; // 4.96
66 | z = z + step_size * derivative_f_wrt_z; // -3.97
67 | 
68 | // Our circuit now better give higher output:
69 | var q = forwardAddGate(x, y); // q becomes 2.92
70 | var f = forwardMultiplyGate(q, z); // output is -11.59, up from -12! Nice!
71 | 
72 | // ----------------------------------------------------------
73 | // GRADIENT CHECKING
74 | // ----------------------------------------------------------
75 | // initial conditions
76 | var x = -2, y = 5, z = -4;
77 | 
78 | // numerical gradient check
79 | var h = 0.0001;
80 | var x_derivative = (forwardCircuit(x+h,y,z) - forwardCircuit(x,y,z)) / h; // -4
81 | var y_derivative = (forwardCircuit(x,y+h,z) - forwardCircuit(x,y,z)) / h; // -4
82 | var z_derivative = (forwardCircuit(x,y,z+h) - forwardCircuit(x,y,z)) / h; // 3
83 | // We get [-4, -4, 3], same as with backpropagation
84 | 
85 | 
86 | 
87 | 
88 | 
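The gradient check above compares the two gradients by eye. One way to automate it is to report the largest relative error between the chain-rule gradient and a numerical estimate. This is a minimal sketch under stated assumptions: `circuit`, `analyticGradient`, and `maxRelativeError` are hypothetical names introduced here, and the check uses a central difference rather than the one-sided difference in the file.

```javascript
// f(x, y, z) = (x + y) * z, the same circuit as in the file above.
function circuit(x, y, z) { return (x + y) * z; }

// Chain-rule result derived above: [df/dx, df/dy, df/dz] = [z, z, x + y].
function analyticGradient(x, y, z) { return [z, z, x + y]; }

// Hypothetical helper: largest relative error between the analytic
// gradient and a central-difference estimate with step h.
function maxRelativeError(x, y, z, h) {
  var analytic = analyticGradient(x, y, z);
  var numeric = [
    (circuit(x + h, y, z) - circuit(x - h, y, z)) / (2 * h),
    (circuit(x, y + h, z) - circuit(x, y - h, z)) / (2 * h),
    (circuit(x, y, z + h) - circuit(x, y, z - h)) / (2 * h)
  ];
  var worst = 0;
  for (var i = 0; i < analytic.length; i++) {
    // guard against division by zero when both gradients vanish
    var scale = Math.max(Math.abs(analytic[i]), Math.abs(numeric[i]), 1e-12);
    worst = Math.max(worst, Math.abs(analytic[i] - numeric[i]) / scale);
  }
  return worst;
}

var err = maxRelativeError(-2, 5, -4, 1e-4);
console.log("max relative error: " + err); // expected to be tiny
```

A relative error is used instead of an absolute one so that the same threshold stays meaningful when gradients are very large or very small.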


--------------------------------------------------------------------------------
/hackers_guide/1_chapt/single_neuron.js:
--------------------------------------------------------------------------------
  1 | // FROM: http://karpathy.github.io/neuralnets/
  2 | 
  3 | var print = function(str) {
  4 |   var str_new = document.getElementById('output').innerHTML + "<br>" + str;
  5 |   document.getElementById('output').innerHTML = str_new;
  6 | }
  7 | 
  8 | /* 
  9 | More realistic example, we have a single neuron which activates with sigmoid:
 10 | f(x,y,a,b,c) = \sigma(ax + by + c)
 11 | 
 12 | Sigmoid squashes values to between 0 and 1. 
 13 | The partial derivative with respect to a single input:
 14 | \frac{\partial \sigma(x)}{\partial x} = \sigma(x) (1 - \sigma(x))
 15 | */
 16 | 
 17 | // --------------------------------------------------------------------------
 18 | // UNIT CLASS
 19 | // --------------------------------------------------------------------------
 20 | // every Unit corresponds to a wire in the diagrams
 21 | var Unit = function(value, grad) {
 22 |   // value computed in the forward pass
 23 |   this.value = value;
 24 |   // the derivative of circuit output w.r.t this unit, computed in backward pass
 25 |   this.grad = grad;
 26 | }
 27 | 
 28 | // --------------------------------------------------------------------------
 29 | // GATE CLASSES: FORWARD/BACKWARD DEFINITION
 30 | // --------------------------------------------------------------------------
 31 | // the backward functions compute only the local derivatives
 32 | 
 33 | // MULTIPLY GATE
 34 | var multiplyGate = function() {};
 35 | multiplyGate.prototype = {
 36 |   forward: function(u0, u1) {
 37 |     // From two input units, multiply their values and forward result in new parent Unit
 38 |     // store pointers to input Units u0 and u1 and output unit utop
 39 |     this.u0 = u0;
 40 |     this.u1 = u1;
 41 |     this.utop = new Unit(u0.value * u1.value, 0.0);
 42 |     return this.utop;
 43 |   },
 44 |   backward: function() {
 45 |     // take the gradient in output unit and chain it with the
 46 |     // local gradients, which we derived for multiply gate before
 47 |     // then write those gradients to those Units.
 48 |     this.u0.grad += this.u1.value * this.utop.grad;
 49 |     this.u1.grad += this.u0.value * this.utop.grad;
 50 |     // remember in multiplication the gradient wrt u0 is u1.value
 51 |   }
 52 | }
 53 | 
 54 | // ADD GATE
 55 | var addGate = function() {};
 56 | addGate.prototype = {
 57 |   forward: function(u0, u1) {
 58 |     this.u0 = u0;
 59 |     this.u1 = u1; // store pointers to input units
 60 |     this.utop = new Unit(u0.value + u1.value, 0.0);
 61 |     return this.utop;
 62 |   },
 63 |   backward: function() {
 64 |     // add gate. derivative wrt both inputs is 1
 65 |     this.u0.grad += 1 * this.utop.grad;
 66 |     this.u1.grad += 1 * this.utop.grad;
 67 |   }
 68 | }
 69 | 
 70 | // SIGMOID GATE
 71 | var sigmoidGate = function() {
 72 |   // helper function
 73 |   this.sig = function(x) {
 74 |     return 1 / (1 + Math.exp(-x));
 75 |   };
 76 | };
 77 | sigmoidGate.prototype = {
 78 |   forward: function(u0) {
 79 |     this.u0 = u0;
 80 |     this.utop = new Unit(this.sig(this.u0.value), 0.0);
 81 |     return this.utop;
 82 |   },
 83 |   backward: function() {
 84 |     var s = this.sig(this.u0.value);
 85 |     this.u0.grad += (s * (1 - s)) * this.utop.grad;
 86 |   }
 87 | }
 88 | 
 89 | // --------------------------------------------------------------------------
 90 | // MAIN
 91 | // --------------------------------------------------------------------------
 92 | // create input units
 93 | var a = new Unit(1.0, 0.0);
 94 | var b = new Unit(2.0, 0.0);
 95 | var c = new Unit(-3.0, 0.0);
 96 | var x = new Unit(-1.0, 0.0);
 97 | var y = new Unit(3.0, 0.0);
 98 | 
 99 | // create the gates
100 | var mulg0 = new multiplyGate(); // to multiply ax
101 | var mulg1 = new multiplyGate(); // to multiply by
102 | var addg0 = new addGate(); // to add ax + by
103 | var addg1 = new addGate(); // to add c to ax+by
104 | var sg0 = new sigmoidGate(); // sigmoid result of previous
105 | 
106 | var ax, by, axpby, axpbypc, s; // Units written by the forward pass below
107 | var forwardNeuron = function() {
108 |   ax = mulg0.forward(a, x); // a*x = -1
109 |   by = mulg1.forward(b, y); // b*y = 6
110 |   axpby = addg0.forward(ax, by); // a*x + b*y = 5
111 |   axpbypc = addg1.forward(axpby, c); // a*x + b*y + c = 2
112 |   s = sg0.forward(axpbypc); // sig(a*x + b*y + c) = 0.8808
113 | };
114 | 
115 | // INITIAL FORWARD
116 | print('initial values:');
117 | print('a: ' + a.value);
118 | print('b: ' + b.value);
119 | print('c: ' + c.value);
120 | print('x: ' + x.value);
121 | print('y: ' + y.value);
122 | forwardNeuron();
123 | print('initial circuit output: ' + s.value); // prints 0.8808
124 | 
125 | // LOOP!
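126 | var print_every = 10;
127 | for (var i = 0; i < 200; i++) {
128 |   // BACKWARD: gates accumulate with +=, so clear the input gradients left over from the previous step
129 |   a.grad = b.grad = c.grad = x.grad = y.grad = 0.0; s.grad = 1.0;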
130 |   sg0.backward(); // writes gradient into axpbypc
131 |   addg1.backward(); // writes gradients into axpby and c
132 |   addg0.backward(); // writes gradients into ax and by
133 |   mulg1.backward(); // writes gradients into b and y
134 |   mulg0.backward(); // writes gradients into a and x
135 | 
136 |   // UPDATE
137 |   var step_size = 0.01;
138 |   a.value += step_size * a.grad; // a.grad is -0.105
139 |   b.value += step_size * b.grad; // b.grad is 0.315
140 |   c.value += step_size * c.grad; // c.grad is 0.105
141 |   x.value += step_size * x.grad; // x.grad is 0.105
142 |   y.value += step_size * y.grad; // y.grad is 0.210
143 | 
144 |   // FORWARD RESULT
145 |   forwardNeuron();
146 |   if (i % print_every == 0) {
147 |     print('output at step ' + i + ' : ' + s.value);
148 |   }
149 | }
150 | 
151 | // FINAL
152 | print('final values:');
153 | print('a: ' + a.value);
154 | print('b: ' + b.value);
155 | print('c: ' + c.value);
156 | print('x: ' + x.value);
157 | print('y: ' + y.value);
158 | print('final output : ' + s.value);
159 | 
160 | 
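The gradient values quoted in the update comments above can be cross-checked without any gate objects by applying the sigmoid derivative and the chain rule directly. The sketch below is illustrative only: `sigmoid` and `neuronGradient` are hypothetical names, not part of the original file, and the inputs are the same initial values used in MAIN.

```javascript
function sigmoid(t) { return 1 / (1 + Math.exp(-t)); }

// Hypothetical helper: closed-form gradient of s = sigmoid(a*x + b*y + c)
// with respect to every input, using ds/dt = s * (1 - s).
function neuronGradient(a, b, c, x, y) {
  var s = sigmoid(a * x + b * y + c);
  var ds = s * (1 - s); // local derivative of the sigmoid gate
  return {
    output: s,
    a: x * ds, // d(ax)/da = x
    b: y * ds, // d(by)/db = y
    c: ds,     // the bias passes the gradient through unchanged
    x: a * ds, // d(ax)/dx = a
    y: b * ds  // d(by)/dy = b
  };
}

var g = neuronGradient(1.0, 2.0, -3.0, -1.0, 3.0);
console.log(g.output.toFixed(4)); // 0.8808
console.log([g.a, g.b, g.c, g.x, g.y].map(function (v) { return v.toFixed(3); }).join(" "));
// -0.105 0.315 0.105 0.105 0.210
```

The gate-based version produces the same numbers on the first backward pass; the closed form is only practical here because the circuit is small enough to differentiate by hand.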


--------------------------------------------------------------------------------
/hackers_guide/2_chapt/neural_net.js:
--------------------------------------------------------------------------------
  1 | // FROM: http://karpathy.github.io/neuralnets/
  2 | 
  3 | var print = function(str) {
  4 |   var str_new = document.getElementById('output').innerHTML + "<br>" + str;
  5 |   document.getElementById('output').innerHTML = str_new;
  6 | }
  7 | 
  8 | // --------------------------------------------------------------------
  9 | // TRAINING DATA
 10 | // --------------------------------------------------------------------
 11 | var data = []; var labels = [];
 12 | data.push([1.2, 0.7]); labels.push(1);
 13 | data.push([-0.3, -0.5]); labels.push(-1);
 14 | data.push([3.0, 0.1]); labels.push(1);
 15 | data.push([-0.1, -1.0]); labels.push(-1);
 16 | data.push([-1.0, 1.1]); labels.push(-1);
 17 | data.push([2.1, -3]); labels.push(1);
 18 | 
 19 | /*
 20 | The code below is a simple 2-layer neural network: one hidden layer of 3 neurons feeding a single output neuron. Unlike the other js files, it is not modular.
 21 | 
 22 | Activation function is ReLU, which is simply max(0, x).
 23 | */
 24 | // --------------------------------------------------------------------
 25 | // RANDOMLY INITIALIZE NETWORK PARAMETERS
 26 | // --------------------------------------------------------------------
 27 | // Neuron 1
 28 | var a1 = Math.random() - 0.5; // a random number between -0.5 and 0.5
 29 | var b1 = Math.random() - 0.5; // a random number between -0.5 and 0.5
 30 | var c1 = Math.random() - 0.5; // a random number between -0.5 and 0.5
 31 | 
 32 | // Neuron 2
 33 | var a2 = Math.random() - 0.5; // a random number between -0.5 and 0.5
 34 | var b2 = Math.random() - 0.5; // a random number between -0.5 and 0.5
 35 | var c2 = Math.random() - 0.5; // a random number between -0.5 and 0.5
 36 | 
 37 | // Neuron 3
 38 | var a3 = Math.random() - 0.5; // a random number between -0.5 and 0.5
 39 | var b3 = Math.random() - 0.5; // a random number between -0.5 and 0.5
 40 | var c3 = Math.random() - 0.5; // a random number between -0.5 and 0.5
 41 | 
 42 | // Output layer
 43 | var a4 = Math.random() - 0.5; // a random number between -0.5 and 0.5
 44 | var b4 = Math.random() - 0.5; // a random number between -0.5 and 0.5
 45 | var c4 = Math.random() - 0.5; // a random number between -0.5 and 0.5
 46 | var d4 = Math.random() - 0.5; // a random number between -0.5 and 0.5
 47 | 
 48 | // --------------------------------------------------------------------
 49 | // FORWARD/BACKWARD/UPDATE
 50 | // --------------------------------------------------------------------
 51 | for(var iter = 0; iter < 400; iter++) {
 52 |   // pick a random data point
 53 |   var i = Math.floor(Math.random() * data.length);
 54 |   var x = data[i][0];
 55 |   var y = data[i][1];
 56 |   var label = labels[i];
 57 | 
 58 |   // compute forward pass
 59 |   var n1 = Math.max(0, a1*x + b1*y + c1); // activation of 1st hidden neuron
 60 |   var n2 = Math.max(0, a2*x + b2*y + c2); // 2nd neuron
 61 |   var n3 = Math.max(0, a3*x + b3*y + c3); // 3rd neuron
 62 |   var score = a4*n1 + b4*n2 + c4*n3 + d4; // the score
 63 |   if (iter % 25 === 0) print(score)
 64 | 
 65 |   // compute the pull on top
 66 |   var pull = 0.0;
 67 |   if(label === 1 && score < 1) pull = 1; // we want higher output! Pull up.
 68 |   if(label === -1 && score > -1) pull = -1; // we want lower output! Pull down.
 69 | 
 70 |   // now compute backward pass to all parameters of the model
 71 | 
 72 |   // backprop through the last "score" neuron
 73 |   // First backprop is simple deriv of a*b type, ie da = x * top_gradient
 74 |   var dscore = pull;
 75 |   var da4 = n1 * dscore;
 76 |   var dn1 = a4 * dscore;
 77 |   var db4 = n2 * dscore;
 78 |   var dn2 = b4 * dscore;
 79 |   var dc4 = n3 * dscore;
 80 |   var dn3 = c4 * dscore;
 81 |   var dd4 = 1.0 * dscore; // phew
 82 | 
 83 |   // backprop the ReLU non-linearities, in place
 84 |   // i.e. just set gradients to zero if the neurons did not "fire"
 85 |   dn3 = n3 === 0 ? 0 : dn3;
 86 |   dn2 = n2 === 0 ? 0 : dn2;
 87 |   dn1 = n1 === 0 ? 0 : dn1;
 88 | 
 89 |   // backprop to parameters of neuron 1
 90 |   var da1 = x * dn1;
 91 |   var db1 = y * dn1;
 92 |   var dc1 = 1.0 * dn1;
 93 | 
 94 |   // backprop to parameters of neuron 2
 95 |   var da2 = x * dn2;
 96 |   var db2 = y * dn2;
 97 |   var dc2 = 1.0 * dn2;
 98 | 
 99 |   // backprop to parameters of neuron 3
100 |   var da3 = x * dn3;
101 |   var db3 = y * dn3;
102 |   var dc3 = 1.0 * dn3;
103 | 
104 |   // phew! End of backprop!
105 |   // note we could have also backpropped into x,y
106 |   // but we do not need these gradients. We only use the gradients
107 |   // on our parameters in the parameter update, and we discard x,y
108 | 
109 |   // add the pulls from the regularization, tugging all multiplicative
110 |   // parameters (i.e. not the biases) downward, proportional to their value
111 |   da1 += -a1; da2 += -a2; da3 += -a3;
112 |   db1 += -b1; db2 += -b2; db3 += -b3;
113 |   da4 += -a4; db4 += -b4; dc4 += -c4;
114 | 
115 |   // finally, do the parameter update
116 |   var step_size = 0.01;
117 |   a1 += step_size * da1;
118 |   b1 += step_size * db1;
119 |   c1 += step_size * dc1;
120 |   a2 += step_size * da2;
121 |   b2 += step_size * db2;
122 |   c2 += step_size * dc2;
123 |   a3 += step_size * da3;
124 |   b3 += step_size * db3;
125 |   c3 += step_size * dc3;
126 |   a4 += step_size * da4;
127 |   b4 += step_size * db4;
128 |   c4 += step_size * dc4;
129 |   d4 += step_size * dd4;
130 |   // wow this is tedious, please use for loops in prod.
131 |   // we're done!
132 | }
133 | 
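The training loop above prints raw scores but never reports how many points are classified correctly. A loop-based forward pass makes that easy to add. The sketch below is an assumption-laden illustration: the array layout (`hidden` as a list of `[a, b, c]` triples, `out` as `[a4, b4, c4, d4]`), the function names, and the hand-picked weights are all hypothetical and are not trained values from the file.

```javascript
// Forward pass with parameters stored in arrays instead of 13 variables.
// hidden[j] = [a, b, c] for hidden neuron j; out = [a4, b4, c4, d4].
function forward(hidden, out, x, y) {
  var score = out[hidden.length]; // start from the output bias d4
  for (var j = 0; j < hidden.length; j++) {
    var n = Math.max(0, hidden[j][0] * x + hidden[j][1] * y + hidden[j][2]); // ReLU
    score += out[j] * n;
  }
  return score;
}

// Fraction of points whose predicted sign matches the label (+1 or -1).
function trainingAccuracy(hidden, out, data, labels) {
  var correct = 0;
  for (var i = 0; i < data.length; i++) {
    var predicted = forward(hidden, out, data[i][0], data[i][1]) > 0 ? 1 : -1;
    if (predicted === labels[i]) correct++;
  }
  return correct / data.length;
}

// Same six training points as in the file above.
var data = [[1.2, 0.7], [-0.3, -0.5], [3.0, 0.1], [-0.1, -1.0], [-1.0, 1.1], [2.1, -3]];
var labels = [1, -1, 1, -1, -1, 1];

// Hand-picked illustrative weights (NOT trained): score = max(0, x) - 0.5.
var hidden = [[1, 0, 0], [0, 0, 0], [0, 0, 0]];
var out = [1, 0, 0, -0.5];

var accuracy = trainingAccuracy(hidden, out, data, labels);
console.log("training accuracy: " + accuracy);
```

Storing parameters in arrays also lets the backward pass and the update step be written as loops over `hidden` and `out`, which is what the closing comment in the file recommends.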


--------------------------------------------------------------------------------
/hackers_guide/2_chapt/svm.js:
--------------------------------------------------------------------------------
  1 | // FROM: http://karpathy.github.io/neuralnets/
  2 | 
  3 | var print = function(str) {
  4 |   var str_new = document.getElementById('output').innerHTML + "<br>" + str;
  5 |   document.getElementById('output').innerHTML = str_new;
  6 | }
  7 | 
  8 | /*
  9 | More realistic example, we have a single neuron which activates with sigmoid:
 10 | f(x,y,a,b,c) = \sigma(ax + by + c)
 11 | 
 12 | Sigmoid squashes values to between 0 and 1.
 13 | The partial derivative with respect to a single input:
 14 | \frac{\partial \sigma(x)}{\partial x} = \sigma(x) (1 - \sigma(x))
 15 | */
 16 | 
 17 | // --------------------------------------------------------------------------
 18 | // UNIT CLASS
 19 | // --------------------------------------------------------------------------
 20 | // every Unit corresponds to a wire in the diagrams
 21 | var Unit = function(value, grad) {
 22 |   // value computed in the forward pass
 23 |   this.value = value;
 24 |   // the derivative of circuit output w.r.t this unit, computed in backward pass
 25 |   this.grad = grad;
 26 | }
 27 | 
 28 | // --------------------------------------------------------------------------
 29 | // GATE CLASSES: FORWARD/BACKWARD DEFINITION
 30 | // --------------------------------------------------------------------------
 31 | // the backward functions compute only the local derivatives
 32 | 
 33 | // MULTIPLY GATE
 34 | var multiplyGate = function() {};
 35 | multiplyGate.prototype = {
 36 |   forward: function(u0, u1) {
 37 |     // From two input units, multiply their values and forward result in new parent Unit
 38 |     // store pointers to input Units u0 and u1 and output unit utop
 39 |     this.u0 = u0;
 40 |     this.u1 = u1;
 41 |     this.utop = new Unit(u0.value * u1.value, 0.0);
 42 |     return this.utop;
 43 |   },
 44 |   backward: function() {
 45 |     // take the gradient in output unit and chain it with the
 46 |     // local gradients, which we derived for multiply gate before
 47 |     // then write those gradients to those Units.
 48 |     this.u0.grad += this.u1.value * this.utop.grad;
 49 |     this.u1.grad += this.u0.value * this.utop.grad;
 50 |     // remember in multiplication the gradient wrt u0 is u1.value
 51 |   }
 52 | }
 53 | 
 54 | // ADD GATE
 55 | var addGate = function() {};
 56 | addGate.prototype = {
 57 |   forward: function(u0, u1) {
 58 |     this.u0 = u0;
 59 |     this.u1 = u1; // store pointers to input units
 60 |     this.utop = new Unit(u0.value + u1.value, 0.0);
 61 |     return this.utop;
 62 |   },
 63 |   backward: function() {
 64 |     // add gate. derivative wrt both inputs is 1
 65 |     this.u0.grad += 1 * this.utop.grad;
 66 |     this.u1.grad += 1 * this.utop.grad;
 67 |   }
 68 | }
 69 | 
 70 | // SIGMOID GATE
 71 | var sigmoidGate = function() {
 72 |   // helper function
 73 |   this.sig = function(x) {
 74 |     return 1 / (1 + Math.exp(-x));
 75 |   };
 76 | };
 77 | sigmoidGate.prototype = {
 78 |   forward: function(u0) {
 79 |     this.u0 = u0;
 80 |     this.utop = new Unit(this.sig(this.u0.value), 0.0);
 81 |     return this.utop;
 82 |   },
 83 |   backward: function() {
 84 |     var s = this.sig(this.u0.value);
 85 |     this.u0.grad += (s * (1 - s)) * this.utop.grad;
 86 |   }
 87 | }
 88 | 
 89 | // CIRCUIT CLASS
 90 | // A circuit: it takes 5 Units (x,y,a,b,c) and outputs a single Unit
 91 | // It can also compute the gradient w.r.t. its inputs
 92 | var Circuit = function() {
 93 |   // create some gates
 94 |   this.mulg0 = new multiplyGate();
 95 |   this.mulg1 = new multiplyGate();
 96 |   this.addg0 = new addGate();
 97 |   this.addg1 = new addGate();
 98 | };
 99 | Circuit.prototype = {
100 |   forward: function(x,y,a,b,c) {
101 |     this.ax = this.mulg0.forward(a, x); // a*x
102 |     this.by = this.mulg1.forward(b, y); // b*y
103 |     this.axpby = this.addg0.forward(this.ax, this.by); // a*x + b*y
104 |     this.axpbypc = this.addg1.forward(this.axpby, c); // a*x + b*y + c
105 |     return this.axpbypc;
106 |   },
107 |   backward: function(gradient_top) { // takes pull from above
108 |     this.axpbypc.grad = gradient_top;
109 |     this.addg1.backward(); // sets gradient in axpby and c
110 |     this.addg0.backward(); // sets gradient in ax and by
111 |     this.mulg1.backward(); // sets gradient in b and y
112 |     this.mulg0.backward(); // sets gradient in a and x
113 |   }
114 | }
115 | 
116 | // SVM CLASS
117 | var SVM = function() {
118 |   // Class variables a,b,c. Keep track of these while iterating
119 |   // through all samples
120 |   // Arbitrary (fixed) initial parameter values
121 |   this.a = new Unit(1.0, 0.0);
122 |   this.b = new Unit(-2.0, 0.0);
123 |   this.c = new Unit(-1.0, 0.0);
124 | 
125 |   this.circuit = new Circuit();
126 | };
127 | SVM.prototype = {
128 |   forward: function(x, y) { // assume x and y are Units
129 |     this.unit_out = this.circuit.forward(x, y, this.a, this.b, this.c);
130 |     return this.unit_out;
131 |   },
132 |   backward: function(label) { // label is +1 or -1
133 | 
134 |     // reset pulls on a,b,c
135 |     this.a.grad = 0.0;
136 |     this.b.grad = 0.0;
137 |     this.c.grad = 0.0;
138 | 
139 |     // compute the pull based on what the circuit output was
140 |     var pull = 0.0;
141 |     if(label === 1 && this.unit_out.value < 1) {
142 |       pull = 1; // the score was too low: pull up
143 |     }
144 |     if(label === -1 && this.unit_out.value > -1) {
145 |       pull = -1; // the score was too high for a negative example, pull down
146 |     }
147 |     this.circuit.backward(pull); // writes gradient into x,y,a,b,c
148 | 
149 |     // add regularization pull for parameters: towards zero and proportional to value
150 |     this.a.grad += -this.a.value;
151 |     this.b.grad += -this.b.value;
152 |   },
153 |   learnFrom: function(x, y, label) {
154 |     this.forward(x, y); // forward pass (set .value in all Units)
155 |     this.backward(label); // backward pass (set .grad in all Units)
156 |     this.parameterUpdate(); // parameters respond to tug
157 |   },
158 |   parameterUpdate: function() {
159 |     var step_size = 0.01;
160 |     this.a.value += step_size * this.a.grad;
161 |     this.b.value += step_size * this.b.grad;
162 |     this.c.value += step_size * this.c.grad;
163 |   }
164 | };
165 | 
166 | // --------------------------------------------------------------------------
167 | // MAIN
168 | // --------------------------------------------------------------------------
169 | 
170 | var data = []; var labels = [];
171 | data.push([1.2, 0.7]); labels.push(1);
172 | data.push([-0.3, -0.5]); labels.push(-1);
173 | data.push([3.0, 0.1]); labels.push(1);
174 | data.push([-0.1, -1.0]); labels.push(-1);
175 | data.push([-1.0, 1.1]); labels.push(-1);
176 | data.push([2.1, -3]); labels.push(1);
177 | var svm = new SVM();
178 | 
179 | // a function that computes the classification accuracy
180 | var evalTrainingAccuracy = function() {
181 |   var num_correct = 0;
182 |   for(var i = 0; i < data.length; i++) {
183 |     var x = new Unit(data[i][0], 0.0);
184 |     var y = new Unit(data[i][1], 0.0);
185 |     var true_label = labels[i];
186 | 
187 |     // see if the prediction matches the provided label
188 |     var predicted_label = svm.forward(x, y).value > 0 ? 1 : -1;
189 |     if(predicted_label === true_label) {
190 |       num_correct++;
191 |     }
192 |   }
193 |   return num_correct / data.length;
194 | };
195 | 
196 | // the learning loop
197 | for(var iter = 0; iter < 1000; iter++) {
198 |   // pick a random data point
199 |   var i = Math.floor(Math.random() * data.length);
200 |   var x = new Unit(data[i][0], 0.0);
201 |   var y = new Unit(data[i][1], 0.0);
202 |   var label = labels[i];
203 |   svm.learnFrom(x, y, label);
204 | 
205 |   if(iter % 25 === 0) { // every 25 iterations...
206 |     print('training accuracy at iter ' + iter + ': ' + evalTrainingAccuracy());
207 |   }
208 | }
209 | 
210 | 
211 | 


--------------------------------------------------------------------------------
/hackers_guide/README.md:
--------------------------------------------------------------------------------
1 | javascript code based on:
2 | http://karpathy.github.io/neuralnets/
3 | 


--------------------------------------------------------------------------------
/summaries/auto-encoding_var_bayes.md:
--------------------------------------------------------------------------------
 1 | [original paper](https://arxiv.org/abs/1312.6114) about var-bayes autoencoders by Kingma and Welling, 2013. Check out [implementation in Keras](https://github.com/fchollet/keras/blob/master/examples/variational_autoencoder.py) for example code
 2 | 
 3 | Also:
 4 | - [Presentation](https://home.zhaw.ch/~dueo/bbs/files/vae.pdf) about the paper to help understand.
 5 | - [Accompanying Python notebook](https://github.com/oduerr/dl_tutorial/tree/master/tensorflow/vae)
 6 | 
 7 | ## Background
 8 | - How can we perform efficient approximate inference and learning with directed probabilistic models whose continuous latent variables and/or parameters have intractable posterior distributions?
 9 | - Variational Bayes optimizes an approximation of the intractable posterior
10 | - See [lower bounds for estimation](https://www.stat.washington.edu/jaw/COURSES/580s/581/LECTNOTES/ch3-rev1.pdf)
11 | 
12 | ## Problem
13 | - Assume $X = \{x^i\}_{i=1}^N$. The generative process first draws $z^i$ from a prior $p_{\theta^*}(z)$, then draws the value $x^i$ from the conditional $p_{\theta^*}(x|z)$.
14 | - Algorithm must work in case of intractability and large dataset
15 | 
16 | 


--------------------------------------------------------------------------------
/summaries/autoencoders.md:
--------------------------------------------------------------------------------
 1 | see [this](https://blog.keras.io/building-autoencoders-in-keras.html) blogpost. The post is also a good tutorial for Autoencoders in Keras.
 2 | 
 3 | ### What are autoencoders good for?
 4 | - Not very good at data compression, much better algorithms out there
 5 | - Rarely used in practice in original form.
 6 | - Mostly used for data denoising and dimensionality reduction for visualization
 7 | 
 8 | ### Examples
 9 | - Autoencode MNIST from 784, down to 32, and then back. When applying to the test set, reconstruction from low 32 dimension looks like original but blurry.
10 | - Can add a sparsity constraint on the activity of the hidden representations, a type of regularizer
11 | - Deep autoencoder. Instead of layers input/hidden/output [784 32 784], we can try [784 128 64 32 64 128 784]. Note the progressive compression/decompression. However, only minute improvement in performance.
12 | - Convolution autoencoder: Much better than vanilla autoencoder due to the "higher entropic capacity of the encoded representation, 128 dimensions vs. 32 previously".
13 | - Denoising: Train an autoencoder to map noisy images to clean images. We can easily do this by adding Gaussian noise.
14 | - Variational autoencoder: Generative model, you learn parameters of a probability distribution modeling your data. You can then sample this distribution.
15 | 


--------------------------------------------------------------------------------
/summaries/build_machines_learn_think.md:
--------------------------------------------------------------------------------
1 | # [Building Machines That Learn and Think Like People](http://arxiv.org/abs/1604.00289)
2 | 
3 | 
4 | 


--------------------------------------------------------------------------------
/summaries/end-to-end_tf.md:
--------------------------------------------------------------------------------
 1 | # Generic ML datasets
 2 | - **Linear data** can be solved with linear classifiers such as logistic regression, SVM, and so on.
 3 | - **Moon data** has two clusters. A linear classifier cannot fully separate the two classes
 4 | - **Saturn data** has a core cluster and ring cluster, requires non-linear classifier
 5 | 
 6 | # Hidden layer, number of nodes
 7 | - A simple neural net with one hidden layer and two nodes varies dramatically in accuracy (0.88~0.96) due to random initialization. A hidden layer with 3 nodes gives consistent results around 0.97 accuracy.
 8 | - Sensitive to weight initialization: use one of two following
 9 |   * Truncated normals for weights, 0.1 for biases and ReLU
10 |   * Xavier initialization, 0 biases, tanh
11 | 


--------------------------------------------------------------------------------
/summaries/fully_char_level_nmt.md:
--------------------------------------------------------------------------------
 1 | on paper by Lee et al (2016)
 2 | 
 3 | Fully Character-Level Neural Machine Translation without Explicit Segmentation, [annotated](https://drive.google.com/open?id=0ByV7wn2NzevOQ0JtTTRuR0pjUlE), [arXiv](https://arxiv.org/abs/1610.03017)
 4 | 
 5 | [Theano implementation by author Lee](https://github.com/nyu-dl/dl4mt-c2c?utm_campaign=Revue%20newsletter&utm_medium=Newsletter&utm_source=revue)
 6 | 
 7 | ## Background
 8 | - Most MT research exclusively at word level
 9 | - NMT suffer from out-of-vocab words with languages with rich morphology
10 | - Character-level better suited for multilingual translation than word-level because:
11 |   - Does not suffer from out-of-vocab issues
12 |   - Can model rare morphological variants of a word
13 |   - No segmentation required
14 | - Recent trend in MT is NMT with encoder, decoder and attention mechanism:
15 |   - Encoder: Bidirectional RNN, concat of forward and backward hidden states
16 |   - Attention: lets the decoder attend to different source symbols for each target symbol. There is a context vector $c_{t'}$ for each time step $t'$, computed as a weighted sum of hidden states, i.e. the weights reflect the relevance of the inputs to the $t'$-th target token
17 |   - Decoder: at time $t'$, computes hidden state $s_{t'}$ as a function of the previous prediction, the previous hidden state and the source context vector $c_{t'}$. Note how the context vector is specific to that output time step. Next, a parametric function outputs the distribution over the next target symbol (beam search is then used to pick the translation)
18 | - Loss: model is trained to minimize the negative conditional log-likelihood of the probability of output target given previous target and input.
19 | - Some other work based on character, but mostly subword-to-subword, or subword-to-character. Here they propose fully character-to-character
20 | 
21 | ## Character level challenges
22 | - Sentences are much longer
23 | - The decoder softmax operation is much faster over characters than over words, which partly offsets the longer sequences
24 | - Attention cost grows quadratically with sentence length, so it is expensive over characters
25 | - Encoder must encode long sequence of chars to good representation
26 | 
27 | ## Model
28 | - Aggressively uses convolutions + pooling to shorten input and capture local regularities
29 | - **Encoder**: 
30 |   - Char embedding size 128
31 |   - 1D narrow convolution over padded sentence -> output length = input length
32 |   - Various filter sizes from width 1 to 8 (up to char n-gram of 8)
33 |   - Output of conv op is $Y \in \mathbb R^{N \times T_x}$, where $N$ is number of filter sizes, and $T_x$ is input length
34 |   - Max pooling over time over $Y$, without mixing between widths in $N$, with stride $s$. So $Y \to Y' \in \mathbb R^{N \times (T_x/s)}$, where $s$ was chosen to be 5.
35 |   - Highway network over pooling output to regulate information flow
36 |   - Finally goes to BiGRU
37 | - **Decoder**:
38 |   - Attention and decoder like [NMT](summaries/neural_machine_translation.md) model, but predict characters as opposed to words.
39 |   - Two-layer unidirectional GRU with 1024 units, beam search with width 20
40 | - See Table 2 for full model parameters
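
The strided max pooling over time in the encoder can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes non-overlapping windows of width $s$ and a shorter final window when $T_x$ is not a multiple of $s$. The function name `maxPoolOverTime` and the toy matrix are hypothetical.

```javascript
// Y: array of N rows (one row per filter), each row has length Tx.
// Returns N rows of length ceil(Tx / s). Rows are pooled independently,
// so there is no mixing between filters.
var maxPoolOverTime = function(Y, s) {
  return Y.map(function(row) {
    var pooled = [];
    for (var start = 0; start < row.length; start += s) {
      var end = Math.min(start + s, row.length); // assumption: last window may be shorter
      var best = row[start];
      for (var t = start + 1; t < end; t++) {
        if (row[t] > best) { best = row[t]; }
      }
      pooled.push(best);
    }
    return pooled;
  });
};

// hypothetical toy input: N = 2 filters, Tx = 10 time steps, stride s = 5
var convOut = [
  [0.1, 0.9, 0.3, 0.2, 0.4, 0.7, 0.0, 0.5, 0.6, 0.8],
  [1.0, 0.2, 0.1, 0.0, 0.3, 0.2, 0.2, 0.9, 0.1, 0.4]
];
var pooledOut = maxPoolOverTime(convOut, 5);
```

With stride 5 the sequence handed to the highway network and BiGRU is five times shorter, which is what keeps attention over characters affordable.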
41 | 
42 | ## Experiment Setup
43 | - Char2char model, includes only sentences with max 450 chars.
44 | - Adam optimizer, $\alpha$ of 0.0001 and minibatch size 64
45 | - Gradient clipping with threshold of 1
46 | - Weights initialized from uniform [-0.01, 0.01]
47 | - Multilingual char2char and bilingual char2char
48 | - A few sub-word models as baseline
49 | - Data scheduling to avoid overfitting to one language
50 | 
51 | ## Observations
52 | - Char2char always outperforms
53 | - For some languages bilingual char2char outperforms, in others the multilingual char2char outperforms
54 | - BLEU metrics encourage reference-like translations, so additional evaluation by humans on adequacy and fluency
55 | - Translation improvement by char2char mainly from fluency
56 | - Two weeks to train char2char
57 | - Char2char model not told any concept of word boundary, automatically learns
58 | 


--------------------------------------------------------------------------------
/summaries/implicit_drd_grn.md:
--------------------------------------------------------------------------------
 1 | acl paper [Implicit Discourse Relation Detection via a Deep Architecture with Gated Relevance Network](https://www.aclweb.org/anthology/P/P16/P16-1163.pdf)
 2 | 
 3 | ## Problem
 4 | - Discourse relation recognition, easy for explicit, difficult for implicit.
 5 | - Traditionally use word pairs, such as "warm, cold", but data is sparse.
 6 | 
 7 | ## Model
 8 | - Word2Vec embeddings, then Bidirectional LSTM to represent input over two separate text units X and Y
 9 | - Gated Relevance Network (GRN):
10 |   - Get positional representation from BiLSTM output
11 |   - Compute relevance score between every pair of x and y, with Bilinear Model and Single Layer Network.
12 |   - Bilinear Model: let $(h_{x_i}, h_{y_j})$ be the BiLSTM representations of a position in X and a position in Y. The Bilinear Model is the function $s(h_{x_i}, h_{y_j}) = h_{x_i}^T M h_{y_j}$, where $M \in \mathbb R^{d_h \times d_h}$ is the coefficient matrix to learn. Note the interaction between the two is linear.
13 |   - The Single Layer Network captures nonlinear interaction: standard single hidden layer neural net where the output is a nonlinear function over input plus bias, where the input is the concat of the pair.
14 |   - The two models are combined through a gate mechanism. The output of the GRN is: gate * linear + (1 - gate) * non-linear. So the gate controls the flow from the linear and non-linear scores.
15 |   - GRN is similar to Neural Tensor Network (Socher et al 2013) but with added gate.
16 |   - Output of GRN is an interaction score matrix
17 | - Max pooling over GRN output which feeds into dense hidden layer (MLP), and finally connects to output.
18 | - Train four binary classifiers to identify top level relations
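
To make the gating concrete, here is a small sketch of one relevance score for a single pair of positions. It is a simplification under stated assumptions: the scores are scalars, the non-linearity is tanh, and the gate is a sigmoid over the concatenated pair. All names (`bilinearScore`, `singleLayerScore`, `gatedRelevance`) and the parameter values are hypothetical.

```javascript
var sigmoid = function(z) { return 1 / (1 + Math.exp(-z)); };

var dot = function(u, v) {
  var total = 0.0;
  for (var i = 0; i < u.length; i++) { total += u[i] * v[i]; }
  return total;
};

var matVec = function(M, v) {
  return M.map(function(row) { return dot(row, v); });
};

// linear interaction: hx^T M hy
var bilinearScore = function(hx, M, hy) {
  return dot(hx, matVec(M, hy));
};

// non-linear interaction: u^T tanh(W [hx; hy] + b)
var singleLayerScore = function(hx, hy, W, b, u) {
  var hidden = matVec(W, hx.concat(hy)).map(function(z, i) {
    return Math.tanh(z + b[i]);
  });
  return dot(u, hidden);
};

// gate * linear + (1 - gate) * non-linear
// assumption: the gate is a sigmoid over the concatenated pair
var gatedRelevance = function(hx, hy, p) {
  var gate = sigmoid(dot(p.wg, hx.concat(hy)) + p.bg);
  var linear = bilinearScore(hx, p.M, hy);
  var nonLinear = singleLayerScore(hx, hy, p.W, p.b, p.u);
  return gate * linear + (1 - gate) * nonLinear;
};

// hypothetical toy parameters with d_h = 2
var grnParams = {
  M: [[0, 2], [0, 0]],
  W: [[1, 0, 0, 0], [0, 0, 0, 1]],
  b: [0, 0],
  u: [1, 1],
  wg: [0, 0, 0, 0],
  bg: 0
};
var relevance = gatedRelevance([1, 0], [0, 1], grnParams);
```

Computing this score for every pair of positions in X and Y gives the interaction score matrix that is then max pooled.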
19 | 
20 | ## Observations
21 | - LSTM alone has poor performance, loses too much local information
22 | - Cosine distance, Bilinear or Single Layer alone do not perform very well
23 | - Boost in LSTM when encoding to positional representation, boost with Bilinear and Single Layer relevance scores, and boost with extra gating on the relevance score. The mixture of scores performs best
24 | - Model performs best in all categories
25 | - Using the BiLSTM to encode, rather than using the words directly, gives much better emphasis in the relevance scores (Figure 3). Important word pairs for discourse are emphasised and irrelevant ones ignored.
26 | 


--------------------------------------------------------------------------------
/summaries/intrinsic_dimension.md:
--------------------------------------------------------------------------------
 1 | **Measuring the Intrinsic Dimension of Objective Landscapes**, Li et al, ICLR 2018. [openreview](), [arXiv](https://arxiv.org/abs/1804.08838)
 2 | 
 3 | tl;dr: Intrinsic Dimension is the minimal parameter subspace (projected to the total parameters) to achieve a certain performance. It is a measure of model-problem complexity.
 4 | 
 5 | First, let's describe the normal training of a neural network until convergence as the "direct method".
 6 | Consider an alternative to the direct method.
 7 | Given a set of model parameters, train a lower dimensional set of parameters which is then projected and added to the fixed larger set. The smaller set size is the "subspace". The size of the subspace controls the degrees of freedom of the model. Train several subspace sizes, each time training until convergence. As you increase the subspace size, at some point accuracy jumps and you achieve 90% of the performance of the direct method (and NOT 90% accuracy on the task). The smallest subspace size that achieves this 90% is called the intrinsic dimension.
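
The projection step can be written as $\theta = \theta_0 + P d$, where only the small vector $d$ is trained. Below is a minimal sketch of that mapping. The function name `projectToFullSpace` is hypothetical, and the fixed matrix `P` is hand-picked here for readability, whereas in practice it would be a fixed random projection.

```javascript
// theta0: frozen full-size parameters (length D)
// P:      frozen D x dSub projection matrix
// d:      trainable subspace parameters (length dSub)
var projectToFullSpace = function(theta0, P, d) {
  return theta0.map(function(t0, i) {
    var delta = 0.0;
    for (var j = 0; j < d.length; j++) { delta += P[i][j] * d[j]; }
    return t0 + delta;
  });
};

// hypothetical toy sizes: D = 4 full parameters, subspace size 2
var theta0 = [0.5, -0.2, 0.1, 0.0];
var P = [[1, 0], [0, 1], [1, 1], [0, 0]];
var fullParams = projectToFullSpace(theta0, P, [0.1, -0.3]);
```

Gradients flow only into `d`, so the optimizer moves within a `d.length`-dimensional slice of the full parameter space.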
 8 | 
 9 | What's interesting is that when you increase the original model capacity, the subspace parameters are projected to a larger space, and yet you barely need to increase the subspace size to solve the same problem. For example, MNIST is always a subspace of around 750 on an MLP model. But if you change to a CNN, the subspace for MNIST is much smaller, around 250, showing the CNN is a superior model on this dataset.
10 | 
11 | Also, harder problems need larger subspace, like Pong from pixels vs MNIST. If we view solving MDPs and supervised learning as function approximation, then we can compare different problems with the intrinsic dimension metric. Apparently, Pong from pixels is equivalent to CIFAR-10!
12 | 


--------------------------------------------------------------------------------
/summaries/learning_phrase_rep_RNN_encoder_decoder_mt.md:
--------------------------------------------------------------------------------
 1 | on seminal Cho et al [Encoder-Decoder paper](https://arxiv.org/abs/1406.1078)
 2 | 
 3 | - [Sequence-to-sequence](https://www.tensorflow.org/versions/r0.11/tutorials/seq2seq/index.html) in Tensorflow
 4 | - Elaborate [seq2seq](https://github.com/farizrahman4u/seq2seq) library for Keras
 5 | 
 6 | ## Model
 7 | - Two RNN that act as encoder and decoder pair
 8 | - The two networks are trained jointly to maximize the conditional probability of the output given the input
 9 | - Encoder: scans the input x linearly, updating the hidden state of the RNN at each symbol. At the end of the input, hidden state $c$ is a summary of the whole input sequence
10 | - Decoder: Generative model. Predict next $y_t$ given the hidden state $h_t$. However, unlike Encoder, $y_t$ and $h_t$ are both conditioned on $y_{t-1}$ and $c$. 
11 | - The Encoder-Decoder, once trained, can be used to either:
12 |   - Generate output translation given input. or,
13 |   - Score a given pair of input and output sequences produced by other algorithm
14 | 
15 | ## Details
16 | - Uses new type of hidden unit, motivated by LSTM
17 | - Reset gate: when the reset gate is close to 0, the hidden state is forced to ignore the previous hidden state and reset with the current input only. This allows the hidden state to **drop** any information that is found to be irrelevant later in the future, allowing a more compact representation
18 | - Update gate: controls how much information from the previous hidden state will carry over to the current hidden state. Helps remember long-term information like memory cell in LSTM
19 | - Encoder-Decoder ignores normalized frequencies of phrase pairs in original corpora
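
The interplay of the two gates is easier to see in code. The sketch below is a scalar version of the gated hidden unit, with hypothetical weight names in `p`; real models use weight matrices and vectors.

```javascript
var sigmoid = function(z) { return 1 / (1 + Math.exp(-z)); };

// One step of a scalar gated unit.
// p holds hypothetical scalar weights: wr, ur (reset), wz, uz (update), w, u (candidate).
var gruStep = function(x, hPrev, p) {
  var r = sigmoid(p.wr * x + p.ur * hPrev);            // reset gate
  var z = sigmoid(p.wz * x + p.uz * hPrev);            // update gate
  var hCand = Math.tanh(p.w * x + p.u * (r * hPrev));  // candidate state, previous state scaled by r
  return z * hPrev + (1 - z) * hCand;                  // z close to 1 carries the old state over
};

// reset and update gates both near 0: the new state depends on the input only
var forgetParams = { wr: -100, ur: 0, wz: -100, uz: 0, w: 0.5, u: 1.0 };
var hForget = gruStep(1.0, 0.8, forgetParams);

// update gate near 1: the previous state is copied through
var keepParams = { wr: 0, ur: 0, wz: 100, uz: 0, w: 0.5, u: 1.0 };
var hKeep = gruStep(1.0, 0.8, keepParams);
```

In the first call the state is effectively replaced by `tanh(0.5 * x)`. In the second it stays at the previous value, which is the long-term memory behaviour noted above.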
20 | 
21 | ## Experiment
22 | - Uses WMT dataset. Data from Europarl, UN, crawled data. For language model only, uses crawled data.
23 | - For RNN Encoder-Decoder, limits source and target vocabulary to most frequent 15k words, about 93% coverage
24 | - RNN E-D has 1000 hidden units in both encoder and decoder.
25 | - Uses rank-100 matrices, equivalent to learning an embedding of dimension 100 for each word.
26 | - Uses tanh as activation for new hidden state
27 | - Early stop after 10 epochs
28 | - Best performing model: proposed RNN for scoring, plus a traditional neural net language model (CSLM) to further improve results
29 | 
30 | ## Observations
31 | - Gating is crucial. Without gating, just using tanh does not give meaningful results.
32 | - The model is focused on learning linguistic regularities: distinguishing between plausible and implausible translations, or learning the manifold (region of probability concentration) of plausible translations.
33 | 


--------------------------------------------------------------------------------
/summaries/matching_networks.md:
--------------------------------------------------------------------------------
 1 | [Matching Networks for One Shot Learning](https://arxiv.org/abs/1606.04080), Vinyals et al, all from DeepMind, for NIPS 2016
 2 | 
 3 | ## Problem
 4 | Learning something from few examples is a huge challenge for DNNs, while it can be trivial for humans. A child can generalize the concept of "giraffe" from just a single example. For machines, it's hard. This is a particular challenge for many NLP datasets [my opinion].
 5 | 
 6 | This motivates the "one-shot" and "few-shot" learning paradigm.
 7 | 
 8 | ## Model
 9 | The authors present a model that tries to match an input \hat{x} with an example from a support set of image-label pairs S={(xi, yi)} of k examples.
10 | 
11 | The output \hat{y} is defined as the sum over i of a(\hat{x}, xi)yi, where a() is an attention function.
12 | 
13 | ## Attention
14 | - The proposed attention mechanism a() is a softmax over the cosine distance c(f(\hat{x}), g(xi)). The softmax exponentiates the cosine of \hat{x} and xi and divides it by the sum of exponentiated cosines over all x_j for j=1 to k. So a() is a weight for support example i which is multiplied by y_i. Therefore \hat{y} is a blend of the classes.
15 | - For example, imagine we have 3 classes, one support example per class, and a() has calculated the weights 0.3, 0.5 and 0.2. \hat{y} as a weighted sum would be equal to 0.3[1,0,0] + 0.5[0,1,0] + 0.2[0,0,1] = [0.3,0.5,0.2]. This should be more explicit in the paper.
16 | - The functions f() and g() that embed the two inputs are neural networks such as deep ConvNets.
17 | - They propose to modify the embedding function g to include the whole support S instead of only xi, this way the net can change an embedding if it deems it too close to another element in the set. g(xi) becomes g(xi,S).
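
The attention-weighted prediction from this section can be sketched in a few lines. This is only an illustration: it assumes the embeddings f(\hat{x}) and g(xi) are already computed and passed in as plain arrays, and the names `cosine`, `softmax`, `matchingPredict` and the toy data are hypothetical.

```javascript
// cosine between two embedding vectors
var cosine = function(u, v) {
  var dotUV = 0.0, normU = 0.0, normV = 0.0;
  for (var i = 0; i < u.length; i++) {
    dotUV += u[i] * v[i];
    normU += u[i] * u[i];
    normV += v[i] * v[i];
  }
  return dotUV / (Math.sqrt(normU) * Math.sqrt(normV));
};

// softmax over an array of scores (max subtracted for numerical stability)
var softmax = function(scores) {
  var maxScore = Math.max.apply(null, scores);
  var exps = scores.map(function(s) { return Math.exp(s - maxScore); });
  var total = exps.reduce(function(a, b) { return a + b; }, 0.0);
  return exps.map(function(e) { return e / total; });
};

// y_hat = sum_i a(x_hat, x_i) * y_i, with one-hot labels y_i
var matchingPredict = function(queryEmb, supportEmbs, supportLabels) {
  var attn = softmax(supportEmbs.map(function(e) { return cosine(queryEmb, e); }));
  var yHat = supportLabels[0].map(function() { return 0.0; });
  for (var i = 0; i < attn.length; i++) {
    for (var k = 0; k < yHat.length; k++) {
      yHat[k] += attn[i] * supportLabels[i][k];
    }
  }
  return yHat;
};

// hypothetical support set: 3 examples, one per class, 2-d embeddings
var supportEmbs = [[1, 0], [0, 1], [-1, 0]];
var supportLabels = [[1, 0, 0], [0, 1, 0], [0, 0, 1]];
var yHat = matchingPredict([0.9, 0.1], supportEmbs, supportLabels);
```

Here `yHat` sums to 1 and puts the most weight on the first class, since the query embedding is closest to the first support embedding.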
18 | 
19 | ## Results
20 | The approach shows SOTA results on Omniglot, ImageNet and PTB. Note these are smaller datasets selected for the task. The PTB "mini" dataset is proposed by this paper. 
21 | 
22 | ## Summary
23 | In summary, the model predicts a class indirectly by mapping input samples to samples from an example set by taking their labels. The task for the model is thus to learn how to best represent samples (via a neural net) to compute distance metrics as effectively as possible. Once it becomes good at this, it can match samples with classes it has never even seen in training!
24 | 


--------------------------------------------------------------------------------
/summaries/neural_machine_translation.md:
--------------------------------------------------------------------------------
 1 | on paper by Bahdanau, Cho, Bengio, ICLR 2015
 2 | 
 3 | Neural Machine Translation by Jointly Learning to Align and Translate, [annotated](https://drive.google.com/file/d/0ByV7wn2NzevOS3FmWHVNazhnczA/view?usp=sharing), [arXiv](https://arxiv.org/abs/1409.0473)
 4 | 
 5 | See TensorFlow library [seq2seq](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/ops/seq2seq.py) for usage
 6 | 
 7 | ## Background
 8 | - Unlike traditional phrase-based translation systems, which consist of many small sub-components that are tuned separately, NMT builds and trains a single large neural net to encode/decode the translation
 9 | - Issue with NMT encoder/decoder is compressing all source into single fixed-length vector
10 | 
11 | ## Model
12 | - Extends Encoder-Decoder by learning to align and translate
13 | - Does not encode whole input to single fixed-vector, but encodes into a sequence of vectors. Decoder chooses a subset of these vectors adaptively while decoding the translation
14 | - In the traditional encoder-decoder, for each $y_t$, the decoder predicts the conditional probability of $y_t$ given all previous words and the context. With an RNN, this is estimated as $g(y_{t-1}, s_t, c)$ where $g$ is a non-linear function, $s_t$ is the hidden state and $c$ is the context vector from the encoder
15 | - New decoder is conditioned on a distinct context vector $c_i$ for each target word $y_i$: $p(y_i|y_1,...,y_{i-1},x)=g(y_{i-1},s_i,c_i)$
16 | - Context $c_i$ is a weighted sum of annotations $h_j$ from the encoder (Eq. 5). The weights are a softmax over the scores of an alignment model. This effectively acts like an attention mechanism: it tells the decoder what to pay attention to and relieves the encoder of the burden of encoding everything into a fixed-length vector
17 | - Decoder has 1000 hidden units and single maxout hidden layer to compute conditional probability of each target word
18 | - Once trained, use beam search to find translation that maximizes the conditional probability
19 | - Encoder is BiRNN, 1000 hidden units
20 | 
21 | ## Experiments
22 | - Trained two models: the regular Encoder-Decoder (Cho et al. 2014), called RNNencdec, and the newly proposed model, named RNNsearch
23 | - Trained on sentences of max 30 words (RNNencdec-30, RNNsearch-30), and max 50 words (RNNencdec-50, RNNsearch-50)
24 | - In both cases, RNNsearch outperforms.
25 | 
26 | ## Remarks
27 | - Sequence of vectors and adaptive decoding frees model from having to squash all information of a source sentence into a fixed-length vector. Copes better with long sentences
28 | - Alignment learning significantly improves basic encoder-decoder
29 | - Astoundingly, RNNsearch-50 massively outperforms other models as sentences become very long, with no deterioration in BLEU score with sentences approaching length 60 where others go towards BLEU score 0.
30 | - The alignment matrix generally shows monotonic correlations. However, there are a number of non-trivial non-monotonic alignments among adjectives and nouns, since their order is different between English and French. Model correctly translated [European Economic Area] to [zone economique europeenne]
31 | 


--------------------------------------------------------------------------------
/summaries/overview_optimization.md:
--------------------------------------------------------------------------------
 1 | ## Overview optimization algorithms
 2 | [blog post](http://sebastianruder.com/optimizing-gradient-descent/index.html#gradientdescentvariants)
 3 | 
 4 | ### Gradient descent
 5 | #### Batch gradient descent
 6 | - Computes the gradient of the cost on entire training set, in one go
 7 | - Minus
 8 |   - Slow as single update requires measuring loss on all data
 9 |   - Converges to global minimum only for convex error surfaces
10 | 
11 | #### Stochastic gradient descent (SGD)
12 | - Plus
13 |   * Parameter update after each training sample, faster than batch
14 |   * Frequent updates cause objective function to fluctuate
15 |   * Can easily overshoot minimum, so better to decay learning rate
16 | - Should shuffle data at each epoch
17 | 
18 | #### Mini-batch gradient descent
19 | - Best of both, size should be 50 ~ 256
20 | - Plus
21 |   * Reduces variance of parameter updates
22 |   * Efficient computation of matrices
23 | - Minus
24 |   * Same issues with learning rate, can address problem with schedules
25 |   * Same learning rate applies to all parameters, problem
26 |   * Easy to get trapped in local minima and saddle points
27 | 
28 | ### Optimization (adaptive methods)
29 | #### Momentum
30 | - Helps SGD accelerate in relevant direction and dampens oscillations
31 | - Adds fraction of previous step vector to current step vector
32 | 
33 | #### Nesterov accelerated gradient (NAG)
34 | - Lookahead descent, slows down before hill slopes up
35 | - Calculates momentum as well as approximation of next parameter value
36 | 
37 | #### Adagrad
38 | - Adapts learning rate to the parameters
39 |   * Large updates for infrequent parameters
40 |   * Small updates for frequent parameters
41 |   * Different rate for every theta at every time step
42 | - Plus
43 |   * Good for sparse data
44 |   * No longer need to tune learning rate
45 | - Minus
46 |   * Because of accumulated sum in denom, learning rate shrinks and vanishes
47 | 
48 | #### Adadelta
49 | - Extends Adagrad, fixes decreasing learning rate
50 | - Restricts window of accumulated past gradients
51 | 
52 | #### Adaptive Moment Estimation (Adam)
53 | - Also computes adaptive learning rates for each parameter
54 | - Like Adadelta, stores decaying average of past squared gradients
55 | - Also stores decaying average of past gradients
56 | - Basically stores first two moments
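
A minimal single-parameter sketch of the Adam update may help tie the two stored averages together. It includes the bias-correction step, which the notes above do not mention. The factory name `makeAdam` and the toy objective $(w - 3)^2$ are hypothetical.

```javascript
// Returns an update function for one scalar parameter.
var makeAdam = function(lr, beta1, beta2, eps) {
  var m = 0.0; // decaying average of past gradients (first moment)
  var v = 0.0; // decaying average of past squared gradients (second moment)
  var t = 0;   // time step, used for bias correction
  return function(param, grad) {
    t += 1;
    m = beta1 * m + (1 - beta1) * grad;
    v = beta2 * v + (1 - beta2) * grad * grad;
    var mHat = m / (1 - Math.pow(beta1, t)); // bias-corrected first moment
    var vHat = v / (1 - Math.pow(beta2, t)); // bias-corrected second moment
    return param - lr * mHat / (Math.sqrt(vHat) + eps);
  };
};

// toy problem: minimize (w - 3)^2, whose gradient is 2 * (w - 3)
var adamStep = makeAdam(0.1, 0.9, 0.999, 1e-8);
var w = 0.0;
for (var step = 0; step < 1000; step++) {
  w = adamStep(w, 2 * (w - 3));
}
```

With several parameters, each one keeps its own `m` and `v`, which is what gives every parameter its own effective learning rate.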
57 | 
58 | ### Tips
59 | - Shuffle before each epoch
60 | - For progressively harder problems, use curriculum learning
61 | - Batch normalization so can use higher learning rates
62 | - Early stopping: stop if error no longer improves
63 | - Add gaussian noise to gradients
64 | 


--------------------------------------------------------------------------------
/summaries/scheduled_sampling.md:
--------------------------------------------------------------------------------
 1 | Scheduled Sampling for Sequence Prediction with Recurrent Neural Networks, [arXiv](https://arxiv.org/abs/1506.03099) Samy Bengio, Oriol Vinyals, Navdeep Jaitly, Noam Shazeer, 2015
 2 | 
 3 | TLDR; Scheduled sampling improves the quality of language generation by being more robust to mistakes. Use inverse sigmoid decay.
 4 | 
 5 | ## Problem
 6 | One of the issues with training RNNs for prediction is that each $y$ being predicted is in part conditioned on previous *true* $y$, whereas at inference time $y$ is conditioned on a *generated* $y$. This can yield errors that accumulate throughout the prediction process. The authors propose a training method, "scheduled sampling", to sometimes condition on the true and sometimes on the generated.
 7 | 
 8 | Traditional inference is conditioned on the *most likely* previous prediction. Prediction error can compound through the entire prediction. One way to deal with this is to use beam search, which maintains several probable sequences in memory. Beam search produces $k$ "best" sequences, as opposed to searching through all possibilities of $Y$. 
 9 | 
10 | ## Scheduled Sampling
11 | The authors propose a "curriculum learning" approach that forces the model to deal with mistakes. This is interesting since error correction is baked into the model.
12 | 
13 | While training, the sampling mechanism randomly decides to use $y_{t-1}$ or $\hat{y}_{t-1}$. The true previous token $y_{t-1}$ is used with probability $\epsilon_i$. So if $\epsilon_i = 1$, we are using the same training as usual, and when $\epsilon_i = 0$, we are always training on predicted values. The *curriculum learning* strategy selects the true previous token most of the time early in training and slowly shifts to selecting the predicted previous token. At the end of training, $\epsilon_i$ should favor sampling from the model. (See Figure 1).
14 | 
15 | The sampling variable $\epsilon$ decays according to a few schemes such as linear decay, exponential decay and inverse sigmoid decay (figure 2).
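
The three schedules and the per-step coin flip can be sketched as below. The functional forms are as I recall them from the paper; the constants `k`, `c`, `eps_min` and the helper name `choose_previous_token` are illustrative assumptions.

```python
import math
import random


def linear_decay(i, k=1.0, c=1e-4, eps_min=0.05):
    # epsilon_i = max(eps_min, k - c * i)
    return max(eps_min, k - c * i)


def exponential_decay(i, k=0.9995):
    # epsilon_i = k ** i, with k < 1
    return k ** i


def inverse_sigmoid_decay(i, k=1000.0):
    # epsilon_i = k / (k + exp(i / k)), with k >= 1.
    # Note: math.exp overflows once i / k exceeds roughly 709.
    return k / (k + math.exp(i / k))


def choose_previous_token(true_prev, predicted_prev, epsilon, rng):
    # Feed the true token with probability epsilon, else the model's own output.
    return true_prev if rng.random() < epsilon else predicted_prev


rng = random.Random(0)
epsilon = inverse_sigmoid_decay(5000)
token = choose_previous_token("cat", "dog", epsilon, rng)
```

Here `i` counts training iterations (mini-batches), so early in training the true token is almost always fed and late in training the model mostly conditions on its own predictions.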
16 | 
17 | ## Experiments
18 | ### Image Captioning
19 | - Trained on MSCOCO, 75k for training and 5k for dev set.
20 | - Each image has 5 possible captions, one is chosen at random.
21 | - Image preprocessed by pretrained CNN
22 | - Word generation done with LSTM(512), vocabulary size is 8857
23 | - Used inverse sigmoid decay
24 | 
25 | This approach led the team to first place for MSCOCO captioning challenge 2015.
26 | 
27 | ### Constituency Parsing
28 | Map a sentence onto a parse tree. Unlike image captioning, the task is much more deterministic, "uni-modal". Generally only one correct parse tree.
29 | 
30 | - One layer LSTM(512)
31 | - Words as embeddings of size 512
32 | - Attention mechanism
33 | - Inverse sigmoid decay
34 | 
35 | ### Speech Recognition
36 | 
37 | - Two layers of LSTM(250)
38 | - Baseline trained 14 epochs, scheduled sampling only needed 9 epochs.
39 | 


--------------------------------------------------------------------------------
/summaries/seq2seq_nn.md:
--------------------------------------------------------------------------------
 1 | Sutskever et al paper [Sequence to Sequence Learning with Neural Networks](https://arxiv.org/abs/1409.3215), [annotated](https://drive.google.com/file/d/0ByV7wn2NzevOQ1l5aUF4RWYtenc/view?usp=sharing)
 2 | 
 3 | - [Sequence-to-sequence](https://www.tensorflow.org/versions/r0.11/tutorials/seq2seq/index.html) in Tensorflow
 4 | - Elaborate [seq2seq](https://github.com/farizrahman4u/seq2seq) library for Keras
 5 | 
 6 | ## Background
 7 | - Some problems can be seen as sequence to sequence problems, mapping an input sequence to an output sequence, such as translation and question answering.
 8 | - One challenge for DNNs is that the dimensionality of the input/output must be fixed. This can be overcome with LSTMs
 9 | 
10 | ## Model
11 | - Model maps input ABC to output WXYZ. Does not use the RNN for scoring as Cho et al, but to produce the translation
12 | - After training, produce translation with left-to-right beam-search decoder
13 | - Ensemble of 5 deep LSTMs with beam of size 2
14 | - Reverse the order of the input, leave order of output
15 | - Use two LSTMs, much like Encoder-Decoder
16 | - 4 layer LSTMs. Deep LSTMs significantly outperformed shallow LSTMs
17 | - 1000 cells at each layer
18 | - 1000 dimensional embeddings
19 | 
20 | ## Experiment
21 | - Used the LSTM to rescore publicly available 1000-best lists of the SMT baseline and obtained close to SOTA results.
22 | - Much better results when reversing the input. Test perplexity drops from 5.8 to 4.7 and test BLEU increases from 25.9 to 30.6!
23 | - Used 160k most frequent words for source language and 80k most frequent for the target language
24 | - Predict a small number $B$ of most likely partial hypotheses with beam search. If a hypothesis is appended with "EOS", the hypothesis is added to the set of complete hypotheses. Beam search continues until all partial hypotheses are complete.
25 | - Most sentences are short, some are very long, which can waste computation in a minibatch. Therefore, made sure sentences in a minibatch are roughly the same length, yielding 2x speedup.
26 | - Parallelized on 8 GPUs: one LSTM layer per GPU (for 4 layers), and other 4 GPUs for softmax calculation. Results in 3.7x speedup.
27 | - Ensemble of 5 LSTMs with beam of size 2 is cheaper than a single LSTM with a beam of size 12
28 | - Surprisingly, performs well on long sentences
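
The beam-search decoding described in the list above can be sketched as follows. This is a toy version under stated assumptions: `toy_model` is a hypothetical lookup table standing in for the LSTM, and `beam_search` is an illustrative helper, not the paper's decoder.

```python
import math


def beam_search(next_log_probs, beam_size, eos="EOS", max_len=10):
    """Keep the beam_size best partial hypotheses; set aside those ending in eos."""
    partial = [((), 0.0)]  # (token tuple, cumulative log probability)
    complete = []
    for _ in range(max_len):
        candidates = []
        for seq, score in partial:
            for token, logp in next_log_probs(seq).items():
                candidates.append((seq + (token,), score + logp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        partial = []
        for seq, score in candidates[:beam_size]:
            if seq[-1] == eos:
                complete.append((seq, score))  # finished hypothesis leaves the beam
            else:
                partial.append((seq, score))
        if not partial:
            break
    return sorted(complete, key=lambda c: c[1], reverse=True)


def toy_model(prefix):
    # Hypothetical next-token distributions keyed by the prefix so far.
    table = {
        (): {"a": 0.6, "b": 0.4},
        ("a",): {"EOS": 0.55, "b": 0.45},
        ("b",): {"EOS": 0.9, "a": 0.1},
    }
    probs = table.get(prefix, {"EOS": 1.0})
    return {tok: math.log(p) for tok, p in probs.items()}


hypotheses = beam_search(toy_model, beam_size=2)
```

With a beam of 2 the best hypothesis is `("b", "EOS")` with probability 0.36, even though `"a"` is the more likely first token, which is the kind of case greedy decoding gets wrong.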
29 | 
30 | 


--------------------------------------------------------------------------------
/summaries/softmax_bottleneck.md:
--------------------------------------------------------------------------------
 1 | 
 2 | **Breaking the Softmax Bottleneck: A High-Rank RNN Language Model**, Yang et al, ICLR 2018. [openreview](https://openreview.net/forum?id=HkwZSG-CZ), [arXiv](https://arxiv.org/abs/1711.03953)
 3 | 
 4 | Language modeling consists of learning the joint language distribution factorized as a product of word probabilities conditioned over context.
 5 | 
 6 | Consider a language model output matrix A, where each row is the vocabulary distribution given a context. A word logit is produced by the inner product of h_c (rnn hidden state) and w_x (embedding vector), both of dimension d.
 7 | 
 8 | The authors hypothesize A must be high rank to express complex language. The rank of A can be as high as M, the vocabulary size. The single softmax is not expressive enough if d is too small, and thus it is learning a low-rank approximation of A.
 9 | 
10 | A more expressive model would be an Ngram model or simply a larger d; however, these approaches lead to significant increases in parameters and hurt generalization.
11 | 
12 | They propose a mixture of K softmax distributions (MoS). Each softmax distribution is weighted by pi_c, which is itself learned from the context. They empirically measure the MoS matrix A compared to the single softmax A and show that with a mixture of 15 softmaxes trained on PTB, A's rank is as high as M! They get rank 9981, while the single softmax is 400 (Table 6). They also achieve SOTA on PTB and WikiText-2
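
A minimal sketch of the mixture for a single context. It assumes the K context vectors and the K mixture logits have already been computed from the RNN state; how the paper derives them is not reproduced here, and all names and numbers are illustrative.

```python
import math


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def mixture_of_softmaxes(prior_logits, context_vectors, embeddings):
    """Weighted sum of K softmax distributions over the same vocabulary."""
    pi = softmax(prior_logits)  # mixture weights pi_c, one per component
    mixed = [0.0] * len(embeddings)
    for weight, h in zip(pi, context_vectors):
        component = softmax([dot(h, w) for w in embeddings])
        for x, p in enumerate(component):
            mixed[x] += weight * p
    return mixed


embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # M = 3 words, d = 2
context_vectors = [[2.0, -1.0], [-1.0, 2.0]]        # K = 2 components
probs = mixture_of_softmaxes([0.3, -0.3], context_vectors, embeddings)
```

Because the mixing happens in probability space, the log of the result is no longer a linear function of a single d-dimensional vector, which is what lets the rank of A exceed d.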
13 | 
14 | 


--------------------------------------------------------------------------------
/summaries/unreal.md:
--------------------------------------------------------------------------------
 1 | Reinforcement Learning with Unsupervised Auxiliary Tasks, Jaderberg et al, ICLR 2017. [openreview](https://openreview.net/forum?id=SJ6yPD5xg). See Caruana PhD thesis from 1997, discusses auxiliary tasks for better representations!
 2 | 
 3 | The paper is about learning additional tasks, which don't require additional data, with the same parameters from the model which learns to maximize rewards with RL. This leads to better representations and performance, in score and speed!
 4 | 
 5 | A base agent learns to maximize rewards with A3C. Experiences are pushed to a replay buffer. Auxiliary tasks sample from this buffer for off-policy training. Interestingly, buffer sampling is intentionally skewed so samples are evenly split between rewarding and non-rewarding events.
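
The skewed sampling can be sketched as below. This is an assumption-laden illustration: the buffer is a plain list of dicts with a `reward` key, and `sample_skewed` is a hypothetical helper, not the paper's implementation.

```python
import random


def sample_skewed(buffer, rng):
    """Draw a rewarding or a non-rewarding transition with equal probability."""
    rewarding = [t for t in buffer if t["reward"] != 0]
    non_rewarding = [t for t in buffer if t["reward"] == 0]
    if not rewarding or not non_rewarding:
        return rng.choice(buffer)  # fall back to uniform sampling
    pool = rewarding if rng.random() < 0.5 else non_rewarding
    return rng.choice(pool)


rng = random.Random(0)
# Toy buffer where only 2% of the transitions carry a reward.
buffer = [{"step": i, "reward": 1.0 if i % 50 == 0 else 0.0} for i in range(1000)]
draws = [sample_skewed(buffer, rng) for _ in range(2000)]
rewarding_fraction = sum(d["reward"] != 0 for d in draws) / len(draws)
```

Even though rewards are rare in the toy buffer, about half of the drawn samples are rewarding, so the auxiliary task is not dominated by zero-reward frames.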
 6 | 
 7 | The auxiliary tasks include:
 8 | - Pixel control: train agents that learn a separate policy for maximally changing the pixels in an input image. 
 9 | - Reward prediction: process a sequence of consecutive observations and require agent to predict the reward in following unseen frame
10 | - Value function replay: resample recent historical sequences from behaviour policy distribution (via replay buffer) and perform extra value function regression in addition to the on-policy A3C. By resampling previous experience, and randomly varying the temporal position of the truncating window over which the n-step return is computed, value function replay performs value iteration and exploits newly discovered features shaped by reward prediction (sampling distribution not skewed in this case)
11 | 


--------------------------------------------------------------------------------
/summaries/vanishing_gradients.md:
--------------------------------------------------------------------------------
 1 | Summary of [this section](http://neuralnetworksanddeeplearning.com/chap5.html#the_vanishing_gradient_problem)
 2 | 
 3 | ## Exposing the problem
 4 | - Train on MNIST dataset
 5 | - Base network: [784, 30, 10], input of 784, 1 hidden layer of 30 nodes, 10 class output
 6 |   * Accuracy: 0.9648
 7 | - Expand network: [784, 30, 30, 10], 2 hidden layers
 8 |   * Improvement in Accuracy: 0.9690
 9 | - Expand network: [784, 30, 30, 30, 10], 3 hidden layers
10 |   * Accuracy drops: 0.9657
11 | 
12 | In theory deep networks work better than shallow ones due to abstraction of features as the network gets deeper. Deep networks could solve a problem with dramatically fewer parameters than shallow networks. But sometimes deep networks don't perform well. Sometimes, when later layers in a deep network work well, early layers get stuck, possibly learning nothing. The opposite may be true, with earlier layers learning well and later ones being stuck.
13 | 
14 | ## Vanishing (and exploding) gradients
15 | - For a two hidden-layer net, second layer neurons learn much faster than neurons in the first layer
16 | - For each layer, let gl be the vector of gradients for the l-th layer, where each entry determines how quickly the corresponding hidden neuron learns
17 | - ||gl||, the length of the vector, determines the speed of learning of the l-th layer
18 | - For 2-hidden-layer network, ||g1|| = 0.07 and ||g2|| = 0.31. The second layer learns faster
19 | - For 3-hidden-layer network, lengths are 0.012, 0.06, 0.283. Again, earlier layers are slower
20 | - As we go deeper, gradients in earlier layers get much smaller, i.e. they vanish
21 | - In some instances, instead of vanishing, early layer gradients may explode
22 | 
23 | ## Vanishing (and exploding) gradient cause
24 | Simple network structure with backprop:
25 | 
26 | ![network derivative](http://neuralnetworksanddeeplearning.com/images/tikz38.png)
27 | 
28 | Where $\sigma (z_j)$ is the sigmoid, and $z_j = w_j a_{j-1} + b_j$ is the weighted input to the activation function in the next neuron. (See proof starting at formula 114). The expression is the partial derivative of the cost with respect to the first bias, $b_1$. Besides the last term, it follows the pattern of weight x sigmoid derivative.
29 | 
30 | Looking at the sigmoid derivative plot:
31 | ![sigmoid deriv](http://www.billharlan.com/papers/logistic/img39.png)
32 | - Notice the function peaks at 1/4.
33 | - Weights are initialized with standard gaussian: mean 0 and standard deviation of 1
34 | - So most weights satisfy $|w_j| < 1$
35 | - Terms $|w_j \sigma'(z_j)|$ are then less than 1/4
36 | - The product of all these terms results in a tiny gradient
37 | - If the weights are large enough that $|w_j \sigma'(z_j)| > 1$, the product instead grows exponentially as we move back through the layers, causing the gradient to explode
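
A quick numeric check of the argument in the list above, for a chain of single-neuron layers. The depth of 10, the weight values 0.9 and 100, and the choice `z = 0` (where the sigmoid derivative peaks at 1/4) are assumptions picked for illustration.

```python
import math


def sigmoid_prime(z):
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)


def backprop_factor(weights, zs):
    """Product of w_j * sigma'(z_j) over a chain of single-neuron layers."""
    product = 1.0
    for w, z in zip(weights, zs):
        product *= w * sigmoid_prime(z)
    return product


depth = 10
zs = [0.0] * depth  # sigma'(0) = 1/4, the largest value the derivative can take
small = backprop_factor([0.9] * depth, zs)    # every |w| < 1: each term < 1/4
large = backprop_factor([100.0] * depth, zs)  # every term equals 25
```

With weights below 1 the factor is about `3e-7` after only ten layers, while with weights of 100 it is about `1e14`, so the same product explains both the vanishing and the exploding case.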
38 | 
39 | In summary, choice of activation function, weight initialization, optimization algorithm and network architecture can cause unstable gradients.
40 | 
41 | ## Solution
42 | - Use activation functions which don't squash the input, such as ReLU. See [here](https://cs224d.stanford.edu/notebooks/vanishing_grad_example.html) for effect of sigmoid v ReLU.
43 | 
44 | 


--------------------------------------------------------------------------------
/summaries/var_auto_sequence_class.md:
--------------------------------------------------------------------------------
 1 | about paper [Semi-supervised Variational Autoencoders for Sequence Classification](https://arxiv.org/abs/1603.02514), [annotated](https://drive.google.com/file/d/0ByV7wn2NzevOTXEzLWlNQy1od0k/view?usp=sharing).
 2 | 
 3 | ## Problem
 4 | - SemiVAEs work well in image classification tasks, but fail for text classification if using a vanilla LSTM as the conditional generative model.
 5 | - We have more and more data, but very few labels accompanying the data
 6 | - We want unsupervised learning to extract useful features which we can then use in supervised tasks
 7 | - RNNs are good for sequence-to-sequence, but not good for high level features like topic, style, and sentiment. Variational Recurrent Autoencoders have been used for this.
 8 | 
 9 | ## Background
10 | - Conditional variational autoencoders can generate samples according to certain attributions of given labels.
11 | 
12 | ## The Model
13 | - Novel semi-supervised deep generative model for text classification, the model can generate sentences conditioned on labels
14 | - An RNN encodes input text x with the conditional input y (the label). The generative network then decodes the latent variable z, where $z \sim p(z|x,y)$.
15 | - Conditional LSTM network proposed as conditional generative model. Same traditional LSTM equations except one has extra term about $y$.
16 | 
17 | 


--------------------------------------------------------------------------------
/web_resources.md:
--------------------------------------------------------------------------------
 1 | 
 2 | 
 3 | ## Optimization
 4 | - Notes on gradient descent, Toussaint 2012. [pdf](http://ipvs.informatik.uni-stuttgart.de/mlr/marc/notes/gradientDescent.pdf). Check Algorithm 2 and 3
 5 | 
 6 | ## Probability
 7 | - Review of expectation, 2009. [pdf](http://math.arizona.edu/~jwatkins/g-expectation.pdf)
 8 | 
 9 | ### Topics
10 | - [Deep Q-Learning](https://keon.io/deep-q-learning/). Keon
11 | - [Policy method for Cartpole](http://kvfrans.com/simple-algoritms-for-solving-cartpole/), Kvfrans. [`repo`](https://github.com/kvfrans/openai-cartpole/blob/master/cartpole-policygradient.py)
12 | - Fundamentals of Policy Gradients, Seita, 2017-03. [`blog`](https://danieltakeshi.github.io/2017/03/28/going-deeper-into-reinforcement-learning-fundamentals-of-policy-gradients/)
13 | - Deep Deterministic Policy Gradients in TensorFlow, Emami, 2016-08. [`blog`](http://pemami4911.github.io/blog/2016/08/21/ddpg-rl.html#References)
14 | 
15 | ### Overview
16 | - Deep Reinforcement Learning: Pong from Pixels. Karpathy. [`blog`](http://karpathy.github.io/2016/05/31/rl/)
17 | - [Beginner's guide](https://www.analyticsvidhya.com/blog/2017/01/introduction-to-reinforcement-learning-implementation/).
18 | - [Riemannian manifolds lecture](https://www.youtube.com/watch?v=MtZV82LCNHc), [slides](https://www.robots.ox.ac.uk/~vgg/rg/slides/Oxford-Mar-2014.pdf)
19 | - [Information Geometry lecture](https://www.youtube.com/watch?v=zmUMBLEHhZg), [slides](http://videolectures.net/mlss05us_dasgupta_ig/)
20 | - [Adversarial Attacks, Robustness](https://adversarial-ml-tutorial.org), theory and practice
21 | 
22 | ### Tutorials
23 | - [Deep RL for checkers](https://chrislarson1.github.io/blog/2016/05/30/cnn-checkers/)
24 | - [Variational Inference tutorial](https://github.com/philschulz/VITutorial.git)
25 | - Blog on theory and code. Covers **q-learning** with frozen lake, deep q-learning on doom/space invaders, policy gradients on Doom, A2C/A3C with Sonic, PPO with Sonic. [link](https://simoninithomas.github.io/Deep_reinforcement_learning_Course/)
26 | 
27 | ## Repos
28 | - Minimal clean examples. Iteration methods, policy gradient, Grid world, CartPole, Atari, etc. [`repo`](https://github.com/rlcode/reinforcement-learning)
29 | - Many PyTorch tutorials, all levels, for image and text. [`repo`](https://github.com/yunjey/pytorch-tutorial)
30 | - OpenAi Universe starter code, A3C algo. [`repo`](https://github.com/openai/universe-starter-agent)
31 | - Minimalist REINFORCE for discrete and continuous actions. [`repo`](https://github.com/JamesChuanggg/pytorch-REINFORCE)
32 | - [RLCode](https://github.com/rlcode/reinforcement-learning). Minimal example of DQN, DDQN, PG, A2C, A3C
33 | - [ikostrikov](https://github.com/ikostrikov/pytorch-a2c-ppo-acktr): a2c, ppo, acktr
34 | 
35 | ## Tools
36 | - Beam search implementation [in PyTorch](https://github.com/eladhoffer/seq2seq.pytorch/blob/master/seq2seq/tools/beam_search.py)
37 | 
38 | ## Course
39 | - Deep RL Bootcamp, [`site`](https://sites.google.com/view/deep-rl-bootcamp/lectures)
40 | 
41 | # Datasets
42 | ## NLP
43 | - The Stanford Natural Language Inference (SNLI) Corpus. 570k human-written English sentences. Text entailment [`site`](https://nlp.stanford.edu/projects/snli/)
44 | ## Environments
45 | ### Simulators
46 | - [Gibson Environments: Real-World Perception for Embodied Agents](https://github.com/StanfordVL/GibsonEnv). A virtual environment for agents which is quite realistic.
47 | 
48 | ### Environments
49 | - [VizDoom](https://github.com/mwydmuch/ViZDoom). Doom environment using only visual information. Visuals include: FPV game pixels, object labelling visual, depth map, 2D map. Should probably use with a gym wrapper, like [this one](https://github.com/nsavinov/gym-vizdoom). To understand how to setup the engine, checkout [this minimalist example](https://github.com/mwydmuch/ViZDoom/blob/master/examples/python/basic.py). Also, checkout [this pytorch example](https://github.com/mwydmuch/ViZDoom/blob/master/examples/python/learning_pytorch.py).
50 | - [MAME toolkit](https://github.com/M-J-Murray/MAMEToolkit), wrapper around the popular MAME arcade emulator
51 | - [MiniWorld](https://github.com/maximecb/gym-miniworld), 2D and 3D environments, minimal dependencies, gym friendly
52 | 
53 | # DL
54 | - Unreasonable effectiveness of one neuron, [`blog`](https://rakeshchada.github.io/Sentiment-Neuron.html)
55 | 
56 | # Math
57 | - Matrix algebra review, 24 pages, [`pdf`](http://faculty.uml.edu/adoerr/92.321/pdf/week6.pdf)
58 | - Variational Inference, [`slides`](http://shakirm.com/papers/VITutorial.pdf)
59 | - Maximum likelihood, [`blog`](http://suriyadeepan.github.io/2017-01-22-mle-linear-regression/)
60 | 
61 | # Coding
62 | - PyTorch [tutorial](https://medium.com/towards-data-science/pytorch-tutorial-distilled-95ce8781a89c)
63 | 
64 | # Stats Course
65 | - [Stat Trek](http://stattrek.com/), stats and prob course
66 | 
67 | # NLP
68 | - [Textual entailment with tf](https://www.oreilly.com/learning/textual-entailment-with-tensorflow)
69 | 


--------------------------------------------------------------------------------