├── .gitignore ├── README.md ├── hackers_guide ├── 1_chapt │ ├── base_case.js │ ├── multiple_gates.js │ └── single_neuron.js ├── 2_chapt │ ├── neural_net.js │ └── svm.js └── README.md ├── summaries ├── auto-encoding_var_bayes.md ├── autoencoders.md ├── build_machines_learn_think.md ├── end-to-end_tf.md ├── fully_char_level_nmt.md ├── implicit_drd_grn.md ├── intrinsic_dimension.md ├── learning_phrase_rep_RNN_encoder_decoder_mt.md ├── matching_networks.md ├── neural_machine_translation.md ├── overview_optimization.md ├── scheduled_sampling.md ├── seq2seq_nn.md ├── softmax_bottleneck.md ├── unreal.md ├── vanishing_gradients.md └── var_auto_sequence_class.md └── web_resources.md /.gitignore: -------------------------------------------------------------------------------- 1 | *.swp 2 | tags 3 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | ### 2019-06 2 | - Discrete Flows: Invertible Generative Models of Discrete Data, Tran et al, 2019. [arXiv](https://arxiv.org/abs/1905.10347) 3 | - _Stochastic Beams and Where to Find Them: The Gumbel-Top-k Trick for Sampling Sequences Without Replacement_, Kool et al, ICML 2019. [arXiv](https://arxiv.org/abs/1903.06059) 4 | - _Sorting out Lipschitz function approximation_, Anil et al, ICML 2019. [arXiv](https://arxiv.org/abs/1811.05381), [`ICML oral`](https://www.facebook.com/icml.imls/videos/552835701913736/?t=2667). Goal is to train a neural net which is provably a K-Lip function without loss of expressivity. There are many applications, such as Adversarial training and WGAN. Also note Lip constraints are required in some flows such as the i-ResNet (below). If you try to learn the absolute function with a 3-layer net with 1-Lip constrained linear layer and tanh activation, you cannot learn the target function. 
If you clip the grad norm to 1 beginning with the output, the grad norms can only decrease as you backprop. Once you get to the input layer grad update, you've lost most of the information. The solution is a new activation function called GroupSort. GroupSort splits the activations into groups and sorts within each group; it is a non-linear operation that is 1-Lipschitz, gradient norm preserving, continuous, and differentiable almost everywhere. 5 | - _Residual Flows for Invertible Generative Modeling_, Chen et al 2019. [arXiv](https://arxiv.org/abs/1906.02735). Residual Flows resolve the bias problem introduced in i-ResNets (paper below). 6 | - _Invertible Residual Networks_, Behrmann et al, ICML 2019. [arXiv](https://arxiv.org/abs/1811.00995) [`ICML oral`](https://www.facebook.com/icml.imls/videos/552835701913736/?t=550). Proposes conditions to add to ResNets to make them invertible, such as a 1-Lip constraint. The goal is to make general-purpose architectures, such as ResNets, invertible only by adding some Lipschitz conditions instead of strict architectural constraints. For example, Planar Flows must use specific layer functions to ensure invertibility. However, the method does have some bias which increases along with network expressiveness (see paper above for solution). 7 | 8 | Unified Bellman Equation for Causal Information and Value in Markov Decision Processes, Tiomkin and Tishby, [arXiv]() 9 | 10 | ### 2019-01 11 | - Evaluating Theory of Mind in Question Answering, Nematzadeh et al, EMNLP 2018. [arXiv](https://arxiv.org/abs/1808.09352) 12 | - An Off-policy Policy Gradient Theorem Using Emphatic Weightings, Imani et al, NeurIPS 2018. [arXiv](https://arxiv.org/abs/1811.09013) 13 | - Doubly Robust Off-policy Value Evaluation for Reinforcement Learning, Jiang and Li, 2015. [arXiv](https://arxiv.org/abs/1511.03722) 14 | - Data-Efficient Off-Policy Policy Evaluation for Reinforcement Learning, Thomas and Brunskill, 2016. 
[arXiv](https://arxiv.org/abs/1604.00923) 15 | - Implicit Reparameterization Gradients, Figurnov et al, NeurIPS 2018. [arXiv](https://arxiv.org/abs/1805.08498) 16 | - Approximately Optimal Approximate Reinforcement Learning, Kakade and Langford, 2002. [pdf](https://dl.acm.org/citation.cfm?id=656005). Note: the paper which inspired the likes of TRPO 17 | 18 | ### 2018-12 19 | - Meta-Learning: A Survey, Vanschoren et al, 2018. [arXiv](https://arxiv.org/abs/1810.03548) 20 | - Off-policy Learning with Recognizers, Precup et al, 2005. [pdf](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.420.3772&rep=rep1&type=pdf) 21 | - Meta-Gradient Reinforcement Learning, Xu et al, 2018. [arXiv](https://arxiv.org/abs/1805.09801). 22 | - Expected Policy Gradients, Ciosek et al, 2018. [arXiv](https://arxiv.org/abs/1706.05374). 23 | - Mean Actor Critic, Allen et al, 2018. [arXiv](https://arxiv.org/abs/1709.00503), [`web version`](https://www.groundai.com/project/mean-actor-critic/). The usual policy gradient is an expectation over states and actions, but they suggest adding the explicit sum over actions back into the expectation over states (Eq. 4). Doing so results in a policy update that considers actions not taken in the environment. In domains where Q is good, MAC results in lower variance, otherwise MAC performs worse. 24 | - Near-Optimal Representation Learning for Hierarchical Reinforcement Learning, Nachum et al, NeurIPS 2018. [arXiv](https://arxiv.org/abs/1810.01257). Note: builds on HIRO, but focuses on optimal representations. 25 | - Data-Efficient Hierarchical Reinforcement Learning, Nachum et al, NeurIPS 2018. [arXiv](https://arxiv.org/abs/1805.08296). Note: type of HRL called HIRO. High level policy gives low-level policy a goal state to reach. 26 | - Neural Ordinary Differential Equations, Chen et al, NeurIPS 2018. [arXiv](https://arxiv.org/abs/1806.07366), [`code`](https://github.com/rtqichen/torchdiffeq). 
Best paper award 27 | - Non-delusional Q-learning and value-iteration, Lu et al, NeurIPS 2018. [proceedings](https://papers.nips.cc/paper/8200-non-delusional-q-learning-and-value-iteration). Best paper award. 28 | - Exploration by Random Network Distillation, Burda et al, 2018. [arXiv](https://arxiv.org/abs/1810.12894) 29 | - Revisiting the Arcade Learning Environment, Machado et al, 2017. [arXiv](https://arxiv.org/abs/1709.06009). Note: known for suggesting sticky actions to make environment non-deterministic. Sticky action: with some prob eps, environment repeats previous action. 30 | - An Information-Theoretic Optimality Principle for Deep Reinforcement Learning, Leibfried et al, 2017. [arXiv](https://arxiv.org/abs/1708.01867). Note: addresses problem of Q-value overestimation 31 | - Dueling Network Architectures for Deep Reinforcement Learning, Wang et al, 2015. [arXiv](https://arxiv.org/abs/1511.06581) 32 | - Deep Reinforcement Learning in Large Discrete Action Spaces, Dulac-Arnold, 2015. [arXiv](https://arxiv.org/abs/1512.07679) 33 | 34 | ### 2018-11 35 | - BISIMULATION METRICS FOR CONTINUOUS MARKOV DECISION PROCESSES, Ferns et al, 2011. [pdf](https://www.cs.mcgill.ca/~prakash/Pubs/siamFP11.pdf) 36 | - Addressing Function Approximation Error in Actor-Critic Methods, Fujimoto et al, ICML 2018. [arXiv](https://arxiv.org/abs/1802.09477). TD3 agent 37 | - The Mirage of Action-Dependent Baselines in Reinforcement Learning, Tucker et al, 2018. [arXiv](https://arxiv.org/abs/1802.10031). Note: decomposes variance into 3 sources: from trajectory, action-dependent baseline, and state visitation. Conclusion: variance-reduction from action-dependent baseline can be minimal. 38 | - Backpropagation through the Void: Optimizing control variates for black-box gradient estimation, Grathwohl et al, ICLR 2018, [arXiv](https://arxiv.org/abs/1711.00123). 
Note: action dependent baseline, builds on REBAR 39 | - REBAR: Low-variance, unbiased gradient estimates for discrete latent variable models, Tucker et al, ICLR 2017. [arXiv](https://arxiv.org/abs/1703.07370) 40 | 41 | ### 2018-10 42 | - Emergence of Language with Multi-agent Games: Learning to Communicate with Sequences of Symbols, Havrylov and Titov, NIPS 2017. [arXiv](https://arxiv.org/abs/1705.11192). Note: EC with referential games, trained with REINFORCE and Gumbel-Softmax, shows hierarchy of language 43 | - Prior Convictions: Black-box Adversarial Attacks with Bandits and Priors, Ilyas et al 2018, ICLR 2019 submission. [openreview](https://openreview.net/forum?id=BkMiWhR5K7) 44 | - Certified Defenses against Adversarial Examples, Raghunathan et al, 2018, [arXiv](https://arxiv.org/abs/1801.09344) 45 | - Speaker-Follower Models for Vision-and-Language Navigation, Fried et al, NIPS 2018, [arXiv](https://arxiv.org/abs/1806.02724) 46 | - Newsroom: A Dataset of 1.3 Million Summaries with Diverse Extractive Strategies, Grusky et al, NAACL 2018, [arXiv](https://arxiv.org/abs/1804.11283) 47 | - Architectural Complexity Measures of Recurrent Neural Networks, Zhang et al, NIPS 2016, [arXiv](https://arxiv.org/abs/1602.08210) 48 | - Gradient Estimation Using Stochastic Computation Graphs, Schulman et al, 2016. [arXiv](https://arxiv.org/abs/1506.05254) 49 | - Variational Inference: A Review for Statisticians, Blei et al, 2018. [arXiv](https://arxiv.org/abs/1601.00670) 50 | - Variational Inference with Normalizing Flows, Rezende et al, 2016. [arXiv](https://arxiv.org/abs/1505.05770) 51 | - Large Scale GAN Training for High Fidelity Natural Image Synthesis, Brock et al, submission to ICLR 2019. [arXiv](https://arxiv.org/abs/1809.11096) 52 | - BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, Devlin et al, 2018. 
[arXiv](https://arxiv.org/pdf/1810.04805.pdf) 53 | - The Concrete Distribution: A Continuous Relaxation of Discrete Random Variables, Maddison et al, 2017. [arXiv](https://arxiv.org/abs/1611.00712). Note: the Concrete is equivalent to the Gumbel-Softmax. 54 | - Categorical Reparameterization with Gumbel-Softmax, Jang et al, 2017. [arXiv](https://arxiv.org/abs/1611.01144). Note: Gumbel-Softmax is equivalent to the Concrete distribution. 55 | 56 | ### 2018-09 57 | - Universal Transformers, Dehghani et al, 2018. [arXiv](https://arxiv.org/abs/1807.03819), [`google blog post`](https://ai.googleblog.com/2018/08/moving-beyond-translation-with.html) 58 | - Phrase-Based & Neural Unsupervised Machine Translation, Lample et al, EMNLP 2018. [arXiv](https://arxiv.org/abs/1804.07755) 59 | - Hybrid Reward Architecture for Reinforcement Learning, Seijen et al, 2017. [arXiv](https://arxiv.org/abs/1706.04208). 60 | 61 | ### 2018-08 62 | - Vehicle Communication Strategies for Simulated Highway Driving, Resnick et al, 2017, NIPS 2017 Workshop on Emergent Communication. 63 | - Emergent Communication through Negotiation, Cao et al, NIPS 2017 Workshop on Emergent Communication. 64 | - Obfuscated Gradients Give a False Sense of Security: Circumventing Defenses to Adversarial Examples, Athalye et al, ICML 2018. [arXiv](https://arxiv.org/abs/1802.00420). Defeats 7 of 9 recently introduced adversarial defense methods. Won best paper at ICML. 65 | - Meta-Gradient Reinforcement Learning, Xu et al 2018, [arXiv](https://arxiv.org/abs/1805.09801) 66 | 67 | ### 2018-07 68 | - Proximal Policy Optimization Algorithms, Schulman et al, 2018. [arXiv](https://arxiv.org/abs/1707.06347), [`openai blog`](https://blog.openai.com/openai-baselines-ppo/), OpenAIFive [`blogpost`] which applies scaled up PPO on Dota2 69 | - What you can cram into a single $&!#* vector: Probing sentence embeddings for linguistic properties, Conneau et al, ACL 2018. [arXiv](https://arxiv.org/abs/1805.01070). 
The authors go through 10 probing tasks to find out some of the things the embeddings capture, trained with various architectures. 70 | - Style Transfer Through Back-Translation, Prabhumoye et al, ACL 2018. [arXiv](https://arxiv.org/abs/1804.09000) 71 | - Hierarchical Neural Story Generation, Fan et al, ACL 2018. [arXiv](https://arxiv.org/abs/1805.04833). Generate a short story based on a "prompt", impressive results. Also has some cool tricks, like model fusion, a different type of attention, k=10 sampling, etc. 72 | - Representation Learning for Grounded Spatial Reasoning, Janner et al, ACL 2018. [arXiv](https://arxiv.org/abs/1707.03938) 73 | - Generating Sentences by Editing Prototypes, Guu et al, ACL 2018. [arXiv](https://arxiv.org/abs/1709.08878) 74 | - A Stochastic Decoder for Neural Machine Translation, Schulz et al, ACL 2018. [arXiv](https://arxiv.org/abs/1805.10844) 75 | - The Hitchhiker’s Guide to Testing Statistical Significance in Natural Language Processing, Dror et al, ACL 2018. [aclweb](http://aclweb.org/anthology/P18-1128) 76 | - Stock Movement Prediction from Tweets and Historical Prices, Xu and Cohen, ACL 2018. [pdf](http://aclweb.org/anthology/P18-1183) 77 | - Sharp Nearby, Fuzzy Far Away: How Neural Language Models Use Context, Khandelwal et al, ACL 2018. [arXiv](https://arxiv.org/abs/1805.04623) 78 | - Backpropagating through Structured Argmax using a SPIGOT, Peng et al, ACL 2018. [arXiv](https://arxiv.org/abs/1805.04658) 79 | - Long Short-Term Memory as a Dynamically Computed Element-wise Weighted Sum, Levy et al, ACL 2018. [arXiv](https://arxiv.org/abs/1805.03716) 80 | 81 | ### 2018-06 82 | - Self-Imitation Learning, Oh et al, 2018. [arXiv](https://arxiv.org/abs/1806.05635). Performs an on-policy A2C update and an off-policy SIL update, which samples positive experiences from a replay buffer and uses a form of AC. 83 | - Improving Language Understanding with Unsupervised Learning, Radford et al, 2018. 
[openai](https://blog.openai.com/language-unsupervised/) 84 | - Prioritized Experience Replay, Schaul et al, ICLR 2016. [arXiv](https://arxiv.org/abs/1511.05952) 85 | - Scalable trust-region method for deep reinforcement learning using Kronecker-factored approximation, Wu et al, 2017. [arXiv]() 86 | - Universal Statistics of Fisher Information in Deep Neural Networks: Mean Field Approach, Karakida et al, 2018. [arXiv](https://arxiv.org/abs/1806.01316) 87 | - On Learning Intrinsic Rewards for Policy Gradient Methods, Zheng et al, 2018. [arXiv](https://arxiv.org/abs/1804.06459) 88 | - Breaking the Softmax Bottleneck: A High-Rank RNN Language Model, Yang et al, ICLR 2018. [openreview](https://openreview.net/forum?id=HkwZSG-CZ), [arXiv](https://arxiv.org/abs/1711.03953). [`summary`](summaries/softmax_bottleneck.md). Given a language model output matrix A over time, where each row is the vocabulary distribution given context, the authors hypothesize A must be high rank to express complex language, and the single softmax is not expressive enough. They propose a mixture of softmaxes. 89 | - Measuring the Intrinsic Dimension of Objective Landscapes, Li et al, ICLR 2018. [openreview](), [arXiv](https://arxiv.org/abs/1804.08838), [`summary`](summaries/intrinsic_dimension.md). Intrinsic Dimension is the minimal parameter subspace (projected to the total parameters) to achieve a certain performance. It is a measure of model-problem complexity. 90 | - Control of Memory, Active Perception, and Action in Minecraft, Oh et al, ICML 2016. [arXiv](https://arxiv.org/abs/1605.09128) 91 | - Multitask Learning, Rich Caruana, PhD thesis 1997. [pdf](http://reports-archive.adm.cs.cmu.edu/anon/1997/CMU-CS-97-203.pdf). Work in the 90s on transfer learning! Chapter 5 discusses auxiliary tasks for neural nets! 20 years before the UNREAL paper! 92 | - Neural Map: Structured Memory for Deep Reinforcement Learning, Parisotto and Salakhutdinov, ICLR 2018. 
[arXiv](https://arxiv.org/abs/1702.08360). Instead of free external memory, have memory locations correlate with agent location, i.e. structured memory. Hugely outperforms memory nets and others on maze problems. 93 | - On the State of the Art of Evaluation in Neural Language Models, ICLR 2018. [openreview](https://openreview.net/forum?id=ByJHuTgA-&). Some simple language models, like LSTM, actually achieve SOTA or near SOTA with proper hyperparams and simple additions, like shared embeddings and variational dropout (see Table 4 ablation). 94 | - Reinforcement Learning with Unsupervised Auxiliary Tasks, Jaderberg et al, ICLR 2017. [openreview](https://openreview.net/forum?id=SJ6yPD5xg). Introduces the UNREAL model. See Caruana PhD thesis above from 1997, discusses auxiliary tasks for better representations! 95 | 96 | ### 2018-03 97 | - Parameter Space Noise for Exploration, Plappert et al, ICLR 2018. [arXiv](https://arxiv.org/abs/1706.01905). Instead of adding noise to action space, add noise to the FA's parameters for better exploration. 98 | - Continuous control with deep reinforcement learning, Lillicrap et al, ICLR 2016. [arXiv](https://arxiv.org/abs/1509.02971). Introduced Deep Deterministic Policy Gradient (DDPG), an actor critic algorithm applicable to continuous action spaces, off-policy. 99 | - Deterministic Policy Gradient Algorithms, Silver et al, ICML 2014. [pdf](http://proceedings.mlr.press/v32/silver14.pdf). DPG is the expected gradient of the action-value function, easier to estimate than the traditional stochastic policy gradient. 100 | - Beyond Word Importance: Contextual Decomposition to Extract Interactions from LSTMs, Murdoch et al, 2018, ICLR 2018. [pdf](https://openreview.net/pdf?id=rkRwGg-0Z), [arXiv](https://arxiv.org/abs/1801.05453) 101 | - Emergence Of Linguistic Communication From Referential Games With Symbolic And Pixel Input, Lazaridou et al, ICLR 2018. 
[pdf](https://openreview.net/pdf?id=HJGv1Z-AW) 102 | - Emergent Communication in a Multi-Modal, Multi-Step Referential Game, Evtimova et al, ICLR 2018. [arXiv](https://arxiv.org/abs/1705.10369), [`code`](https://github.com/nyu-dl/MultimodalGame/blob/master/model.py) 103 | - Neural Speed Reading via Skim-RNN, Seo et al, ICLR 2018. [arXiv](https://arxiv.org/abs/1711.02085) 104 | - Dynamic Word Embeddings for Evolving Semantic Discovery, Yao et al, 2017. [arXiv](https://arxiv.org/abs/1703.00607) 105 | 106 | ### 2018-02 107 | - One Model To Learn Them All, Kaiser et al, 2017. [arXiv](https://arxiv.org/abs/1706.05137) 108 | - An Analysis of Temporal-Difference Learning with Function Approximation, Tsitsiklis and Van Roy, 1997. [pdf](http://web.mit.edu/jnt/www/Papers/J063-97-bvr-td.pdf) 109 | - Steps Toward Artificial Intelligence, Minsky, 1961. [pdf](https://courses.csail.mit.edu/6.803/pdf/steps.pdf) 110 | - Eye on the Prize, Nilsson, 1995. [pdf](http://ai.stanford.edu/~nilsson/OnlinePubs-Nils/General%20Essays/AIMag16-02-002.pdf) 111 | - The Option-Critic Architecture, Bacon et al. [arXiv](https://arxiv.org/abs/1609.05140) 112 | - Learning Symmetric Collaborative Dialogue Agents with Dynamic Knowledge Graph Embeddings. He et al, 2017. [arXiv](https://arxiv.org/abs/1704.07130) 113 | - Learning to Win by Reading Manuals in a Monte-Carlo Framework, Branavan et al, 2012. [arXiv](https://arxiv.org/abs/1401.5390) 114 | 115 | ### 2017-12 116 | - Generating Sentences by Editing Prototypes, Guu et al, 2017. [arXiv](https://arxiv.org/abs/1709.08878) 117 | - SenGen: Sentence Generating Neural Variational Topic Model, Nallapati et al, 2017. [arXiv](https://arxiv.org/abs/1708.00308) 118 | - Learning Sparse Neural Networks through L0 Regularization, Louizos et al 2017. [arXiv](https://arxiv.org/abs/1712.01312) 119 | - Sparsity and the Lasso, Tibshirani and Wasserman, 2015. [pdf](http://www.stat.cmu.edu/~larry/=sml/sparsity.pdf). 
Note: related to L0 paper above 120 | - Proving convexity, Loh 2013. [pdf](http://www.math.cmu.edu/~ploh/docs/math/mop2013/convexity-soln.pdf). Note: related to L0 paper above 121 | - Mathematics of Deep Learning, Vidal et al, 2017. [arXiv](https://arxiv.org/abs/1712.04741) 122 | - Bayesian Hypernetworks, Krueger et al, 2017. [arXiv](https://arxiv.org/abs/1710.04759) 123 | - SummaRuNNer: A Recurrent Neural Network based Sequence Model for Extractive Summarization of Documents, Nallapati et al, 2016. [arXiv](https://arxiv.org/abs/1611.04230) 124 | - Learning Online Alignments with Continuous Rewards Policy Gradient, Luo et al 2016. [arXiv](https://arxiv.org/abs/1608.01281) 125 | - Asynchronous Methods for Deep Reinforcement Learning. Mnih et al, 2016. [arXiv](https://arxiv.org/abs/1602.01783). Introduces A3C, Asynchronous Advantage Actor Critic 126 | - On The State of The Art In Neural Language Models, Anonymous, 2017. [iclr pdf](https://openreview.net/pdf?id=ByJHuTgA-) 127 | - Natural Language Inference with External Knowledge, Chen et al 2017. [arXiv](https://arxiv.org/abs/1711.04289) 128 | 129 | ### 2017-11 130 | - Memory Augmented Neural Networks with Wormhole Connections, Gulcehre et al, 2017. [arXiv](https://arxiv.org/abs/1701.08718) 131 | - Emergence of Invariance and Disentangling in Deep Representations, Achille et al, 2017. [arXiv](https://arxiv.org/abs/1706.01350) 132 | - Distilling the Knowledge in a Neural Network, Hinton et al, 2015. [arXiv](https://arxiv.org/abs/1503.02531) 133 | - Seq2SQL: Generating Structured Queries From Natural Language Using Reinforcement Learning, Zhong et al, 2017. [arXiv](https://arxiv.org/abs/1709.00103) 134 | - Better Text Understanding Through Image-To-Text Transfer, Kurach, 2017. [arXiv]( 135 | - Data Augmentation Generative Adversarial Networks, Antoniou et al, 2017. [arXiv](https://arxiv.org/abs/1711.04340) 136 | - Adversarial Training Methods for Semi-Supervised Text Classification, Miyato et al, 2017. 
[arXiv](https://arxiv.org/abs/1605.07725) 137 | - Deep Gradient Compression: Reducing the Communication Bandwidth for Distributed Training, Anonymous, 2017. [openreview](https://openreview.net/pdf?id=SkhQHMW0W) 138 | - Tying Word Vectors and Word Classifiers: A Loss Framework for Language Modeling, Inan et al 2017. [arXiv](https://arxiv.org/abs/1611.01462) 139 | - Building machines that learn and think for themselves, Botvinick et al, 2017. [cambridge](https://www.cambridge.org/core/journals/behavioral-and-brain-sciences/article/building-machines-that-learn-and-think-for-themselves/E28DBFEC380D4189FB7754B50066A96F) 140 | - Neural Discrete Representation Learning, van den Oord et al, 2017. [arXiv](https://arxiv.org/abs/1711.00937) 141 | - InfoGAN: Interpretable Representation Learning by Information Maximizing Generative Adversarial Nets, Chen et al, 2016. [arXiv](https://arxiv.org/abs/1606.03657), [`blog`](https://towardsdatascience.com/infogan-generative-adversarial-networks-part-iii-380c0c6712cd), [`code`](https://github.com/zjost/InfoGAN) 142 | - Evolution Strategies, Otoro 2017, blog part [1](http://blog.otoro.net/2017/10/29/visual-evolution-strategies/), [2](http://blog.otoro.net/2017/11/12/evolving-stable-strategies/) 143 | - Matrix Capsules with EM Routing. Anonymous (likely Hinton lab), 2017. [openreview](https://openreview.net/pdf?id=HJWLfGWRb). 144 | - Dynamic Routing Between Capsules, Sabour et al, 2017. [arXiv](https://arxiv.org/abs/1710.09829). [`code-keras`](https://github.com/XifengGuo/CapsNet-Keras), [`video review`](https://youtu.be/pPN8d0E3900) 145 | - Weighted Transformer Network for Machine Translation, Ahmed et al, 2017. [arXiv](https://arxiv.org/abs/1711.02132) 146 | - Unsupervised Machine Translation Using Monolingual Corpora Only, Lample et al, 2017. [arXiv](https://arxiv.org/abs/1711.00043) 147 | - Non-Autoregressive Neural Machine Translation, Gu et al, 2017. 
[arXiv](https://arxiv.org/abs/1711.02281) 148 | - Multi-Agent Actor-Critic for Mixed Cooperative-Competitive Environments, Lowe et al, 2017. [arXiv](https://arxiv.org/abs/1706.02275) 149 | 150 | ### 2017-10 151 | - Adversarial Learning for Neural Dialogue Generation, Li et al, 2017. [arXiv](https://arxiv.org/abs/1701.06547) 152 | - Frustratingly Short Attention Spans in Neural Language Modeling, Daniluk et al, 2017. [arXiv](https://arxiv.org/abs/1702.04521) 153 | - Adversarial Training Methods for Semi-Supervised Text Classification, Miyato et al, 2017. [arXiv](https://arxiv.org/abs/1605.07725) 154 | - Progressive Growing of GANs for Improved Quality, Stability, and Variation, Karras et al, 2017. [pdf](http://research.nvidia.com/sites/default/files/pubs/2017-10_Progressive-Growing-of//karras2017gan-paper.pdf) 155 | - A Closer Look at Memorization in Deep Networks, Arpit et al, 2017. [arXiv](https://arxiv.org/abs/1706.05394) 156 | - Understanding deep learning requires rethinking generalization, Zhang et al, 2016. [arXiv](https://arxiv.org/abs/1611.03530) 157 | - The Loss Surfaces of Multilayer Networks, Choromanska et al, 2015. [arXiv](https://arxiv.org/abs/1412.0233) 158 | - Meta Learning Shared Hierarchies, Frans et al, 2017. [arXiv](https://arxiv.org/abs/1710.09767), [`author blog`](https://blog.openai.com/learning-a-hierarchy/) 159 | - Mastering the game of Go without human knowledge, Silver et al, 2017. [arXiv](https://www.nature.com/nature/journal/v550/n7676/full/nature24270.html), [`blog`](http://tim.hibal.org/blog/alpha-zero-how-and-why-it-works/) 160 | - Relevance of Unsupervised Metrics in Task-Oriented Dialogue for Evaluating Natural Language Generation, Sharma et al, 2017. [arXiv](https://arxiv.org/abs/1706.09799) 161 | - GuessWhat?! Visual object discovery through multi-modal dialogue, de Vries et al, 2017. [arXiv](https://arxiv.org/abs/1611.08481) 162 | - A Frame Tracking Model for Memory-Enhanced Dialogue Systems, Schulz et al, 2017. 
[arXiv](https://arxiv.org/abs/1706.01690) 163 | - A Deep Reinforced Model for Abstractive Summarization, Paulus et al, 2017. [arXiv](https://arxiv.org/abs/1705.04304), [`author blog`](https://einstein.ai/research/your-tldr-by-an-ai-a-deep-reinforced-model-for-abstractive-summarization) 164 | - (about ROUGE score for summarization) ROUGE: A Package for Automatic Evaluation of Summaries, Chin-Yew Lin, 2004. [acl](http://anthology.aclweb.org/W/W04/W04-1013.pdf) 165 | - Rainbow: Combining Improvements in Deep Reinforcement Learning, Hessel et al, 2017. [arXiv](https://arxiv.org/abs/1710.02298) 166 | - Language Modeling with Gated Convolutional Networks, Dauphin et al, 2017, [arXiv](https://arxiv.org/abs/1612.08083) 167 | - Convolutional Sequence to Sequence Learning, Gehring et al, 2017. [arXiv](https://arxiv.org/abs/1705.03122) 168 | - Emergence of Grounded Compositional Language in Multi-Agent Populations, Mordatch and Abbeel, 2017. [arXiv](https://arxiv.org/abs/1703.04908), [`author blog`](https://blog.openai.com/learning-to-communicate/). Note: related to Kottur et al 2017. 169 | 170 | ### 2017-09 171 | - Natural Language Does Not Emerge 'Naturally' in Multi-Agent Dialog, Kottur et al, 2017. [arXiv](https://arxiv.org/abs/1706.08502), [`code`](https://github.com/batra-mlp-lab/lang-emerge) 172 | - Opening the black box of Deep Neural Networks via Information, Schwartz-Ziv and Tishby, 2017. [arXiv](https://arxiv.org/abs/1703.00810), [m-p review](https://blog.acolyer.org/2017/11/15/opening-the-black-box-of-deep-neural-networks-via-information-part-i/) 173 | - End-to-end Neural Coreference Resolution, Lee et al, 2017. [arXiv](https://arxiv.org/abs/1707.07045) 174 | - Deep Reinforcement Learning for Mention-Ranking Coreference Models, Clark et al, 2016. [arXiv](https://arxiv.org/abs/1609.08667) 175 | - Oriented Response Networks, Zhou et al 2017. [arXiv](https://arxiv.org/abs/1701.01833) 176 | - Training RNNs as Fast as CNNs, Lei et al, 2017. 
[arXiv](https://arxiv.org/abs/1709.02755) 177 | - Quasi-Recurrent Neural Networks, Bradbury et al 2017. [arXiv](https://openreview.net/pdf?id=H1zJ-v5xl), [`author blog/code`](https://einstein.ai/research/new-neural-network-building-block-allows-faster-and-more-accurate-text-understanding) 178 | - A Deep Reinforcement Learning Chatbot, Serban et al, 2017. [arXiv](https://arxiv.org/abs/1709.02349) 179 | - Independently Controllable Factors, Thomas et al, 2017. [arXiv](https://arxiv.org/abs/1708.01289) 180 | - Attention Is All You Need, Vaswani et al, 2017. [arXiv](https://arxiv.org/abs/1706.03762), [`code`](https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/models/transformer.py), [`google blog`](https://research.googleblog.com/2017/08/transformer-novel-neural-network.html), [`reddit`](https://www.reddit.com/r/MachineLearning/comments/6gwqiw/r_170603762_attention_is_all_you_need_sota_nmt/) 181 | - Attention-over-Attention Neural Networks for Reading Comprehension, Cui et al 2017. [arXiv](https://arxiv.org/abs/1607.04423), [`code`](https://github.com/OlavHN/attention-over-attention) 182 | - Get To The Point: Summarization with Pointer-Generator Networks, See et al 2017. [arXiv](https://arxiv.org/abs/1704.04368), [`author blog`](http://www.abigailsee.com/2017/04/16/taming-rnns-for-better-summarization.html), [`code`](https://github.com/abisee/cnn-dailymail) 183 | - β-VAE: LEARNING BASIC VISUAL CONCEPTS WITH A CONSTRAINED VARIATIONAL FRAMEWORK, Higgins et al 2017. [pdf](https://openreview.net/pdf?id=Sy2fzU9gl) 184 | - Massive Exploration of Neural Machine Translation Architectures, Britz et al 2017. [arXiv](https://arxiv.org/abs/1703.03906v2) 185 | - Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks, 2017. 
[arXiv](https://arxiv.org/abs/1703.10593), [`examples`](https://junyanz.github.io/CycleGAN/), [`code-torch`](https://github.com/junyanz/CycleGAN), [`code-PyT`](https://github.com/junyanz/pytorch-CycleGAN-and-pix2pix) 186 | 187 | ### 2017-08 188 | - A Brief Survey of Deep Reinforcement Learning, Arulkumaran et al 2017. [arXiv](https://arxiv.org/abs/1708.05866) 189 | - Regularizing and Optimizing LSTM Language Models, Merity et al 2017. [arXiv](http://lanl.arxiv.org/abs/1708.02182v1) 190 | - Improving Neural Machine Translation with Conditional Sequence Generative Adversarial Nets, Yang et al 2017. [arXiv](https://arxiv.org/abs/1703.04887) 191 | - Learning Discourse-level Diversity for Neural Dialog Models using Conditional Variational Autoencoders, Zhao et al 2017. [arXiv](https://arxiv.org/abs/1703.10960) 192 | - How to Train Your DRAGAN, Kodali et al 2017. [arXiv](https://arxiv.org/abs/1705.07215) 193 | - Improved Training of Wasserstein GANs, Gulrajani et al 2017. [arXiv](https://arxiv.org/abs/1704.00028), [`blog`](https://lernapparat.de/improved-wasserstein-gan/), [`blog`](http://lernapparat.de/more-improved-wgan/), [`code`](https://github.com/igul222/improved_wgan_training) 194 | - Wasserstein GAN, Arjovsky et al 2017. [arXiv](https://arxiv.org/abs/1701.07875), [`read-through`](http://www.alexirpan.com/2017/02/22/wasserstein-gan.html), [`Kantorovich-Rubinstein duality`](https://vincentherrmann.github.io/blog/wasserstein/), [`WGAN-tensorflow`](https://github.com/shekkizh/WassersteinGAN.tensorflow), [`blog/code`](https://wiseodd.github.io/techblog/2017/02/04/wasserstein-gan/) 195 | - Reading Scene Text in Deep Convolutional Sequences, He et al, 2016. [arXiv](https://arxiv.org/abs/1506.04395) 196 | 197 | ### 2017-03 198 | - Recurrent Batch Normalization, Cooijmans et al, 2017. [arXiv](https://arxiv.org/abs/1603.09025), [`code-tf`](https://github.com/OlavHN/bnlstm) 199 | - An Actor-Critic Algorithm for Sequence Prediction, Bahdanau et al 2017. 
[arXiv](https://arxiv.org/abs/1607.07086), [`code`](https://github.com/rizar/actor-critic-public) 200 | - Scheduled Sampling for Sequence Prediction with RNN, Bengio et al, 2015 [arXiv](https://arxiv.org/abs/1506.03099), [`summary`](summaries/scheduled_sampling.md), 201 | - Hybrid computing using a neural network with dynamic external memory, published in [Nature](https://www.dropbox.com/s/0a40xi702grx3dq/2016-graves.pdf) 202 | - Neural Turing Machine, [arXiv](https://arxiv.org/abs/1410.5401) 203 | - LEARNING END-TO-END GOAL-ORIENTED DIALOG, Bordes et al, 2017. [arXiv](https://arxiv.org/abs/1605.07683), [`code`](https://github.com/carpedm20/MemN2N-tensorflow) 204 | - End-To-End Memory Networks, Sukhbaatar et al, 2015, [arXiv](https://arxiv.org/abs/1503.08895) 205 | - Memory Networks, [arXiv](https://arxiv.org/abs/1410.3916) 206 | - Deep Photo Style Transfer, [arXiv](https://arxiv.org/abs/1703.07511) 207 | - Matching Networks for One Shot Learning, Vinyals et al, NIPS 2016. [arXiv](https://arxiv.org/abs/1606.04080). [`summary`](summaries/matching_networks.md), [`code`](https://github.com/zergylord/oneshot). [`karpathy notes`](http://www.shortscience.org/paper?bibtexKey=journals/corr/VinyalsBLKW16#karpathy), [`Colyer blog`](https://blog.acolyer.org/2017/01/03/matching-networks-for-one-shot-learning/) 208 | 209 | ### 2017-01 210 | - Optimization As A Model For Few-Shot Learning, Sachin Ravi and Hugo Larochelle, ICLR 2017. 
[openreview](https://openreview.net/pdf?id=rJY0-Kcll), [video](https://www.youtube.com/watch?v=igJmB6d8y8E) 211 | - NIPS 2016 Tutorial:Generative Adversarial Networks, [annotated](https://drive.google.com/file/d/0ByV7wn2NzevOa2RqZmJVR2hrUTA/view?usp=sharing),[arXiv](https://arxiv.org/abs/1701.00160), [blog/code](https://wiseodd.github.io/techblog/2016/09/17/gan-tensorflow/) 212 | 213 | ### 2016-11 214 | - [Fully Character-Level Neural Machine Translation without Explicit Segmentation](summaries/fully_char_level_nmt.md), [annotated](https://drive.google.com/open?id=0ByV7wn2NzevOQ0JtTTRuR0pjUlE), [arXiv](https://arxiv.org/abs/1610.03017) 215 | - [Neural Machine Translation by Jointly Learning to Align and Translate](summaries/neural_machine_translation.md), [annotated](https://drive.google.com/file/d/0ByV7wn2NzevOS3FmWHVNazhnczA/view?usp=sharing), [arXiv](https://arxiv.org/abs/1409.0473) 216 | - [Sequence to Sequence Learning with Neural Networks](summaries/seq2seq_nn.md), [annotated](https://drive.google.com/file/d/0ByV7wn2NzevOQ1l5aUF4RWYtenc/view?usp=sharing), [arXiv](https://arxiv.org/abs/1409.3215) 217 | - [Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation](summaries/learning_phrase_rep_RNN_encoder_decoder_mt.md), [arXiv](https://arxiv.org/abs/1406.1078) 218 | - [Implicit Discourse Relation Detection via a Deep Architecture with Gated Relevance Network](summaries/implicit_drd_grn.md), [annotated](https://drive.google.com/file/d/0ByV7wn2NzevOLUxtemFqejJmNVU/view?usp=sharing), [acl](https://www.aclweb.org/anthology/P/P16/P16-1163.pdf) 219 | 220 | ### 2016-10 221 | - Learning Structured Output Representation using Deep Conditional Generative Models, Sohn et al 2015. 
(Conditional VAE) [nips](https://papers.nips.cc/paper/5775-learning-structured-output-representation-using-deep-conditional-generative-models), [blog/code](https://wiseodd.github.io/techblog/2016/12/17/conditional-vae/), [code](https://github.com/hwalsuklee/tensorflow-mnist-CVAE) 222 | - [Auto-Encoding Variational Bayes](summaries/auto-encoding_var_bayes.md), [annotated](https://drive.google.com/file/d/0ByV7wn2NzevOcjBIeVBZcTFUQ2s/view?usp=sharing), [arXiv](https://arxiv.org/abs/1312.6114), [blog/code](https://wiseodd.github.io/techblog/2016/12/10/variational-autoencoder/), - [Semi-supervised Variational Autoencoders for Sequence Classification](summaries/var_auto_sequence_class.md), [annotated](https://drive.google.com/file/d/0ByV7wn2NzevOTXEzLWlNQy1od0k/view?usp=sharing), [arXiv](https://arxiv.org/abs/1603.02514) 223 | - [Autoencoder review](summaries/autoencoders.md) by Keras author Francois Chollet 224 | 225 | ## Datasets 226 | - UCI [machine learning repository](https://archive.ics.uci.edu/ml/datasets.html?format=&task=&att=&area=&numAtt=&numIns=&type=&sort=instDown&view=table). 360 datasets, some very large. Useful filtering: for example, ">1000 instances/classification/text" returns [14 data sets](https://archive.ics.uci.edu/ml/datasets.html?format=&task=cla&att=&area=&numAtt=&numIns=greater1000&type=&sort=instDown&view=table) 227 | 228 | ## Paper collections 229 | - ["Awesome deep learning papers"](https://github.com/terryum/awesome-deep-learning-papers/), a collection of the 100 best papers from the past few years 230 | - Paper collection by [songrotek](https://github.com/songrotek/Deep-Learning-Papers-Reading-Roadmap/blob/master/README.md) 231 | 232 | ## Overview 233 | - [Nature Review article. LeCun, Bengio, Hinton.
2015](http://www.nature.com/articles/nature14539.epdf?referrer_access_token=K4awZz78b5Yn2_AoPV_4Y9RgN0jAjWel9jnR3ZoTv0PU8PImtLRceRBJ32CtadUBVOwHuxbf2QgphMCsA6eTOw64kccq9ihWSKdxZpGPn2fn3B_8bxaYh0svGFqgRLgaiyW6CBFAb3Fpm6GbL8a_TtQQDWKuhD1XKh_wxLReRpGbR_NdccoaiKP5xvzbV-x7b_7Y64ZSpqG6kmfwS6Q1rw%3D%3D&tracking_referrer=www.nature.com) 234 | * Good short overview 235 | - [Schmidhuber, J. (2015). Deep learning in neural networks: An overview. Neural Networks, 61, 85–117.](http://arxiv.org/abs/1404.7828) 236 | * Extensive overview 237 | 238 | ## Neural Networks Basics 239 | 240 | - [Michael Nielsen book on NN](http://neuralnetworksanddeeplearning.com/chap1.html) 241 | - [Hacker's guide to Neural Networks. Andrej Karpathy blog](http://karpathy.github.io/neuralnets/) 242 | - [Visualize NN training](http://experiments.mostafa.io/public/ffbpann/) 243 | 244 | ## Backpropagation 245 | 246 | - [A Gentle Introduction to Backpropagation. Sathyanarayana (2014)](http://numericinsight.com/uploads/A_Gentle_Introduction_to_Backpropagation.pdf) 247 | - [Learning representations by back-propagating errors. Hinton et al, 1986](http://www.nature.com/nature/journal/v323/n6088/abs/323533a0.html) 248 | * Seminal paper by Hinton et al on back-propagation. 249 | - [The Backpropagation Algorithm](http://page.mi.fu-berlin.de/rojas/neural/chapter/K7.pdf) 250 | * Longer tutorial on the topic, 34 pages 251 | - [Overview of various optimization algorithms](http://sebastianruder.com/optimizing-gradient-descent/) 252 | * [Summary](summaries/overview_optimization.md) 253 | 254 | ## Misc 255 | - Multi-Task Learning Objectives for Natural Language Processing, [blog](http://ruder.io/multi-task-learning-nlp/index.html) 256 | 257 | ## Recurrent Neural Network (RNN) 258 | 259 | - [Blog intro, tutorial](http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-1-introduction-to-rnns/) 260 | - [Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. Cho et al. 
2014](http://arxiv.org/abs/1406.1078) 261 | - [Character-Aware Neural Language Models. Kim et al. 2015.](http://arxiv.org/pdf/1508.06615.pdf) 262 | - [The Unreasonable Effectiveness of Recurrent Neural Networks. Karpathy](http://karpathy.github.io/2015/05/21/rnn-effectiveness/) 263 | * In-depth, with examples in vision and NLP. Provides code 264 | - [Sequence-to-Sequence Learning with Neural Networks. Sutskever et al (2014)](http://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf) 265 | * Ground-breaking work on machine translation with RNN and LSTM 266 | - [Training RNN. Sutskever thesis. 2013](http://www.cs.utoronto.ca/~ilya/pubs/ilya_sutskever_phd_thesis.pdf) 267 | * In-depth, self-contained, 85 pages 268 | - [Understanding Natural Language with Deep Neural Networks Using Torch (2015)](http://devblogs.nvidia.com/parallelforall/understanding-natural-language-deep-neural-networks-using-torch/) 269 | * See part on predicting next word with RNN. 270 | - [LSTM BASED RNN ARCHITECTURES FOR LARGE VOCABULARY SPEECH RECOGNITION](http://arxiv.org/pdf/1402.1128v1.pdf) 271 | - [Awesome Recurrent Neural Networks](https://github.com/kjw0612/awesome-rnn#lectures) 272 | * Curated list of RNN resources 273 | 274 | ## CNNs 275 | - [Karpathy cs231n review](http://cs231n.github.io/convolutional-networks/) 276 | - [Character-level Convolutional Networks for Text Classification](http://arxiv.org/abs/1509.01626) 277 | * [Annotated](https://drive.google.com/open?id=0ByV7wn2NzevOZEw4QV9tbFNyVTQ) 278 | - [Collobert. Natural Language Processing (Almost) from Scratch (2011)](http://dl.acm.org/citation.cfm?id=2078186) 279 | * Spurred interest in applying CNN to NLP. 280 | - [Multichannel Variable-Size Convolution for Sentence Classification. Yin, 2015](https://aclweb.org/anthology/K/K15/K15-1021.pdf) 281 | * Interesting, borrows multichannel from image CNN, where each channel is a different word embedding. 282 | - [A CNN for Modelling Sentences.
Kalchbrenner et al, 2014](http://phd.nal.co/papers/Kalchbrenner_DCNN_ACL14) 283 | * Dynamic k-max pooling for variable-length sentences. 284 | - [Semantic Relation Classification via Convolutional Neural Networks with Simple Negative Sampling. Xu et al, 2015](http://arxiv.org/pdf/1506.07650v1.pdf) 285 | - [Text Understanding from Scratch. Zhang, LeCun. (2015)](http://arxiv.org/abs/1502.01710) 286 | - [Kim. Convolutional Neural Networks for Sentence Classification (2014)](http://arxiv.org/pdf/1408.5882v2.pdf) 287 | - [Sensitivity Analysis of (And Practitioner's Guide to) CNN for Sentence Classification. Zhang, Wallace (2015)](http://arxiv.org/pdf/1510.03820v2.pdf) 288 | * [Annotated](https://drive.google.com/open?id=0ByV7wn2NzevOY25JNlJQREVLZEU) 289 | - [Relation Extraction: Perspective from Convolutional Neural Networks. Nguyen, Grishman (2015)](http://www.cs.nyu.edu/~thien/pubs/vector15.pdf) 290 | * [Annotated](https://drive.google.com/file/d/0ByV7wn2NzevObzAtV1QyUDl5X2M/view?usp=sharing) 291 | - [Convolutional Neural Network for Sentence Classification. Yahui Chen, 2015](https://uwspace.uwaterloo.ca/bitstream/handle/10012/9592/Chen_Yahui.pdf?sequence=3&isAllowed=y) 292 | * Master's thesis, University of Waterloo 293 | 294 | ## Deep Reinforcement Learning 295 | - [Playing Atari with Deep Reinforcement Learning. Mnih et al.
(2014)](http://www.cs.toronto.edu/~vmnih/docs/dqn.pdf) 296 | - [Youtube Demo](https://www.youtube.com/watch?v=wfL4L_l4U9A) 297 | - Simple Reinforcement Learning with TensorFlow series, part [0](https://medium.com/emergent-future/simple-reinforcement-learning-with-tensorflow-part-0-q-learning-with-tables-and-neural-networks-d195264329d0) 298 | - Basic DQN in Keras, [`blog`](https://keon.io/deep-q-learning/), [`code`](https://github.com/keon/deep-q-learning) 299 | - Minimal and clean examples, [`code`](https://github.com/rlcode/reinforcement-learning) 300 | - Demystifying Deep RL, [`blog`](http://neuro.cs.ut.ee/demystifying-deep-reinforcement-learning/) 301 | - Berkeley course on DRL, [`course`](http://rll.berkeley.edu/deeprlcourse/) 302 | 303 | ## Online Courses 304 | - [Deep Learning. Udacity, 2015](https://www.udacity.com/course/deep-learning--ud730) 305 | * Very brief. It is more about getting a feel for DL and specifically about using TensorFlow for DL. 306 | - [Convolutional Neural Networks for Visual Recognition. Stanford, 2016](http://cs231n.stanford.edu/) 307 | - [Neural Network Course. Université de Sherbrooke, 2013](http://info.usherbrooke.ca/hlarochelle/neural_networks/description.html) 308 | - [Machine Learning Course, University of Oxford(2014-2015)](https://www.cs.ox.ac.uk/people/nando.defreitas/machinelearning/) 309 | - [Deep Learning for NLP, Stanford (2015)](http://cs224d.stanford.edu/) 310 | * Click "syllabus" for full material 311 | - [Stanford Deep Learning tutorials](http://ufldl.stanford.edu/tutorial/) 312 | * From basics of Machine Learning, to DNN, CNN, and others. 313 | * Includes code. 314 | 315 | ## Books 316 | - [Ian Goodfellow, Yoshua Bengio, Aaron Courville (2016). 
Deep Learning.](http://www.deeplearningbook.org) 317 | -------------------------------------------------------------------------------- /hackers_guide/1_chapt/base_case.js: -------------------------------------------------------------------------------- 1 | // FROM: http://karpathy.github.io/neuralnets/ 2 | 3 | var print = function(str) { 4 | str_new = document.getElementById('text').innerHTML + "<br>" + str; 5 | document.getElementById('text').innerHTML = str_new; 6 | } 7 | 8 | // circuit with single gate for now 9 | var forwardMultiplyGate = function(x, y) { return x * y; }; 10 | 11 | // ---------------------------------------------------------- 12 | // STRATEGY #1: RANDOM SEARCH 13 | // ---------------------------------------------------------- 14 | var x = -2, y = 3; // some input values 15 | // try changing x,y by small random amounts and keep track of what works best 16 | var tweak_amount = 0.01; 17 | var best_out = -Infinity; 18 | var best_x = x, best_y = y; 19 | for(var k = 0; k < 100; k++) { 20 | var x_try = x + tweak_amount * (Math.random() * 2 - 1); // tweak x a bit 21 | var y_try = y + tweak_amount * (Math.random() * 2 - 1); // tweak y a bit 22 | var out = forwardMultiplyGate(x_try, y_try); 23 | if(out > best_out) { 24 | // best improvement yet! Keep track of the x and y 25 | best_out = out; 26 | best_x = x_try, best_y = y_try; 27 | } 28 | } 29 | print("Best x: " + best_x); 30 | print("Best y: " + best_y); 31 | print("Result: " + best_out); 32 | 33 | // ---------------------------------------------------------- 34 | // STRATEGY #2: PARTIAL DERIVATIVES --> GRADIENT 35 | // ---------------------------------------------------------- 36 | // Based on following function: 37 | // \frac{\partial f(x,y)}{\partial x} = \frac{f(x+h,y) - f(x,y)}{h} 38 | var x = -2, y = 3; 39 | var out = forwardMultiplyGate(x, y); // -6 40 | var h = 0.0001; // small change in variable to measure change in function 41 | // In theory we would want the derivative, i.e. the limit of the expression 42 | // as h --> 0 43 | // The gradient with respect to all inputs is a vector of all the partial derivatives 44 | // The gradient is the direction of the steepest increase of the function 45 | 46 | // compute derivative with respect to x 47 | var xph = x + h; // -1.9999 48 | var out2 = forwardMultiplyGate(xph, y); // -5.9997 49 | var x_derivative = (out2 - out) / h; // 3.0 50 | 51 | // compute
derivative with respect to y 52 | var yph = y + h; // 3.0001 53 | var out3 = forwardMultiplyGate(x, yph); // -6.0002 54 | var y_derivative = (out3 - out) / h; // -2.0 55 | 56 | // ---------------------------------------------------------- 57 | // UPDATE BASED ON DERIVATIVES 58 | // ---------------------------------------------------------- 59 | var step_size = 0.01; 60 | var out = forwardMultiplyGate(x, y); // before: -6 61 | x += step_size * x_derivative; // x becomes -1.97 62 | y += step_size * y_derivative; // y becomes 2.98 63 | var out_new = forwardMultiplyGate(x, y); // -5.87! exciting. 64 | print(" Output based on full partial deriv: " + out_new); 65 | 66 | // ---------------------------------------------------------- 67 | // STRATEGY #3: ANALYTIC GRADIENT 68 | // ---------------------------------------------------------- 69 | /* 70 | Previously we evaluated the function one extra time for every input, so the cost of the numerical gradient is linear in the number of inputs. For circuits with many inputs this is too expensive in practice. Instead, we derive a direct expression, an analytic gradient. 71 | Plugging our expression into the definition of the derivative with respect to x, we get: 72 | \frac{\partial f(x,y)}{\partial x} = \frac{f(x+h,y) - f(x,y)}{h} 73 | = \frac{(x+h)y - xy}{h} 74 | = \frac{xy + hy - xy}{h} 75 | = \frac{hy}{h} 76 | = y 77 | The analytic derivative with respect to x is y, and with respect to y it is x! We can simplify our derivative calculation. 78 | */ 79 | var x = -2, y = 3; 80 | var out = forwardMultiplyGate(x, y); // before: -6 81 | var x_gradient = y; // by our complex mathematical derivation above 82 | var y_gradient = x; 83 | 84 | var step_size = 0.01; 85 | x += step_size * x_gradient; // -1.97 86 | y += step_size * y_gradient; // 2.98 87 | var out_new = forwardMultiplyGate(x, y); // -5.87. Higher output! Nice.
88 | print("Analytic gradient output: " + out_new); 89 | 90 | 91 | 92 | 93 | 94 | 95 | -------------------------------------------------------------------------------- /hackers_guide/1_chapt/multiple_gates.js: -------------------------------------------------------------------------------- 1 | // FROM: http://karpathy.github.io/neuralnets/ 2 | 3 | var print = function(str) { 4 | str_new = document.getElementById('text').innerHTML + "<br>" + str; 5 | document.getElementById('text').innerHTML = str_new; 6 | } 7 | 8 | /* 9 | Building on the base case, we extend the gradient calculation to multiple gates. Each gate computes only its local derivatives, unaware of the complexity of the whole circuit. 10 | */ 11 | 12 | // ---------------------------------------------------------- 13 | // FUNCTION DEFINITION 14 | // ---------------------------------------------------------- 15 | // We want to model the following expression: 16 | // f(x,y,z) = (x + y) z 17 | var forwardMultiplyGate = function(a, b) { 18 | return a * b; 19 | }; 20 | var forwardAddGate = function(a, b) { 21 | return a + b; 22 | }; 23 | var forwardCircuit = function(x,y,z) { 24 | var q = forwardAddGate(x, y); 25 | var f = forwardMultiplyGate(q, z); 26 | return f; 27 | }; 28 | 29 | var x = -2, y = 5, z = -4; 30 | var f = forwardCircuit(x, y, z); // output is -12 31 | print(f); 32 | 33 | // ---------------------------------------------------------- 34 | // GRADIENTS VIA CHAIN RULE (SIMPLE BACKPROPAGATION) 35 | // ---------------------------------------------------------- 36 | /* 37 | To calculate the derivatives, we start with the partial derivatives of function f (multiply gate) with respect to q and z. We then calculate the partial derivatives of function q (add gate) with respect to x and y. 38 | We then combine the two via the chain rule to get the gradient with respect to x, y and z.
39 | */ 40 | // initial conditions 41 | var x = -2, y = 5, z = -4; 42 | var q = forwardAddGate(x, y); // q is 3 43 | var f = forwardMultiplyGate(q, z); // output is -12 44 | 45 | // gradient of the MULTIPLY gate with respect to its inputs 46 | // wrt is short for "with respect to" 47 | var derivative_f_wrt_z = q; // 3 48 | var derivative_f_wrt_q = z; // -4 49 | 50 | // derivative of the ADD gate with respect to its inputs 51 | var derivative_q_wrt_x = 1.0; 52 | var derivative_q_wrt_y = 1.0; 53 | 54 | // chain rule 55 | var derivative_f_wrt_x = derivative_q_wrt_x * derivative_f_wrt_q; // -4 56 | var derivative_f_wrt_y = derivative_q_wrt_y * derivative_f_wrt_q; // -4 57 | /* Note: although the derivatives of q with respect to x and y are both 1, the derivative of f with respect to q is -4. To increase f, q must decrease, and since q = x + y, both x and y must decrease. This is why the derivative of f wrt x and y (via the chain rule) is -4. */ 58 | 59 | // final gradient, from above: [-4, -4, 3] 60 | var gradient_f_wrt_xyz = [derivative_f_wrt_x, derivative_f_wrt_y, derivative_f_wrt_z]; 61 | 62 | // let the inputs respond to the force/tug: 63 | var step_size = 0.01; 64 | x = x + step_size * derivative_f_wrt_x; // -2.04 65 | y = y + step_size * derivative_f_wrt_y; // 4.96 66 | z = z + step_size * derivative_f_wrt_z; // -3.97 67 | 68 | // Our circuit should now give a higher output: 69 | var q = forwardAddGate(x, y); // q becomes 2.92 70 | var f = forwardMultiplyGate(q, z); // output is -11.59, up from -12! Nice!
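/*
Worked arithmetic for the chain rule step above, restated with this file's own values as a sanity check (a sketch of the same computation, nothing new is assumed):
f = q z with q = x + y, evaluated at x = -2, y = 5, z = -4, so q = 3 and f = -12.
\frac{\partial f}{\partial x} = \frac{\partial f}{\partial q} \frac{\partial q}{\partial x} = z \cdot 1 = -4
\frac{\partial f}{\partial y} = \frac{\partial f}{\partial q} \frac{\partial q}{\partial y} = z \cdot 1 = -4
\frac{\partial f}{\partial z} = q = 3
With step_size = 0.01 the update gives x = -2 + 0.01(-4) = -2.04, y = 5 + 0.01(-4) = 4.96, z = -4 + 0.01(3) = -3.97,
so q = -2.04 + 4.96 = 2.92 and f = 2.92 \cdot (-3.97) = -11.5924, which is higher than -12.
*/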
71 | 72 | // ---------------------------------------------------------- 73 | // GRADIENT CHECKING 74 | // ---------------------------------------------------------- 75 | // initial conditions 76 | var x = -2, y = 5, z = -4; 77 | 78 | // numerical gradient check 79 | var h = 0.0001; 80 | var x_derivative = (forwardCircuit(x+h,y,z) - forwardCircuit(x,y,z)) / h; // -4 81 | var y_derivative = (forwardCircuit(x,y+h,z) - forwardCircuit(x,y,z)) / h; // -4 82 | var z_derivative = (forwardCircuit(x,y,z+h) - forwardCircuit(x,y,z)) / h; // 3 83 | // We get [-4, -4, 3], same as with backpropagation 84 | 85 | 86 | 87 | 88 | -------------------------------------------------------------------------------- /hackers_guide/1_chapt/single_neuron.js: -------------------------------------------------------------------------------- 1 | // FROM: http://karpathy.github.io/neuralnets/ 2 | 3 | var print = function(str) { 4 | str_new = document.getElementById('output').innerHTML + "<br>" + str; 5 | document.getElementById('output').innerHTML = str_new; 6 | } 7 | 8 | /* 9 | More realistic example, we have a single neuron which activates with sigmoid: 10 | f(x,y,a,b,c) = \sigma(ax + by + c) 11 | 12 | Sigmoid squashes values to between 0 and 1. 13 | The partial derivative with respect to a single input: 14 | \frac{\partial \sigma(x)}{\partial x} = \sigma(x) (1 - \sigma(x)) 15 | */ 16 | 17 | // -------------------------------------------------------------------------- 18 | // UNIT CLASS 19 | // -------------------------------------------------------------------------- 20 | // every Unit corresponds to a wire in the diagrams 21 | var Unit = function(value, grad) { 22 | // value computed in the forward pass 23 | this.value = value; 24 | // the derivative of circuit output w.r.t this unit, computed in backward pass 25 | this.grad = grad; 26 | } 27 | 28 | // -------------------------------------------------------------------------- 29 | // GATE CLASSES: FORWARD/BACKWARD DEFINITION 30 | // -------------------------------------------------------------------------- 31 | // the backward functions compute only the local derivatives 32 | 33 | // MULTIPLY GATE 34 | var multiplyGate = function() {}; 35 | multiplyGate.prototype = { 36 | forward: function(u0, u1) { 37 | // From two input units, multiply their values and forward result in new parent Unit 38 | // store pointers to input Units u0 and u1 and output unit utop 39 | this.u0 = u0; 40 | this.u1 = u1; 41 | this.utop = new Unit(u0.value * u1.value, 0.0); 42 | return this.utop; 43 | }, 44 | backward: function() { 45 | // take the gradient in output unit and chain it with the 46 | // local gradients, which we derived for multiply gate before 47 | // then write those gradients to those Units.
48 | this.u0.grad += this.u1.value * this.utop.grad; 49 | this.u1.grad += this.u0.value * this.utop.grad; 50 | // remember in multiplication the gradient wrt u0 is u1.value 51 | } 52 | } 53 | 54 | // ADD GATE 55 | var addGate = function() {}; 56 | addGate.prototype = { 57 | forward: function(u0, u1) { 58 | this.u0 = u0; 59 | this.u1 = u1; // store pointers to input units 60 | this.utop = new Unit(u0.value + u1.value, 0.0); 61 | return this.utop; 62 | }, 63 | backward: function() { 64 | // add gate. derivative wrt both inputs is 1 65 | this.u0.grad += 1 * this.utop.grad; 66 | this.u1.grad += 1 * this.utop.grad; 67 | } 68 | } 69 | 70 | // SIGMOID GATE 71 | var sigmoidGate = function() { 72 | // helper function 73 | this.sig = function(x) { 74 | return 1 / (1 + Math.exp(-x)); 75 | }; 76 | }; 77 | sigmoidGate.prototype = { 78 | forward: function(u0) { 79 | this.u0 = u0; 80 | this.utop = new Unit(this.sig(this.u0.value), 0.0); 81 | return this.utop; 82 | }, 83 | backward: function() { 84 | var s = this.sig(this.u0.value); 85 | this.u0.grad += (s * (1 - s)) * this.utop.grad; 86 | } 87 | } 88 | 89 | // -------------------------------------------------------------------------- 90 | // MAIN 91 | // -------------------------------------------------------------------------- 92 | // create input units 93 | var a = new Unit(1.0, 0.0); 94 | var b = new Unit(2.0, 0.0); 95 | var c = new Unit(-3.0, 0.0); 96 | var x = new Unit(-1.0, 0.0); 97 | var y = new Unit(3.0, 0.0); 98 | 99 | // create the gates 100 | var mulg0 = new multiplyGate(); // to multiply ax 101 | var mulg1 = new multiplyGate(); // to multiply by 102 | var addg0 = new addGate(); // to add ax + by 103 | var addg1 = new addGate(); // to add c to ax+by 104 | var sg0 = new sigmoidGate(); // sigmoid result of previous 105 | 106 | // Do the forward pass 107 | var forwardNeuron = function() { 108 | ax = mulg0.forward(a, x); // a*x = -1 109 | by = mulg1.forward(b, y); // b*y = 6 110 | axpby = addg0.forward(ax, by); // a*x + 
b*y = 5 111 | axpbypc = addg1.forward(axpby, c); // a*x + b*y + c = 2 112 | s = sg0.forward(axpbypc); // sig(a*x + b*y + c) = 0.8808 113 | }; 114 | 115 | // INITIAL FORWARD 116 | print('initial values:'); 117 | print('a: ' + a.value); 118 | print('b: ' + b.value); 119 | print('c: ' + c.value); 120 | print('x: ' + x.value); 121 | print('y: ' + y.value); 122 | forwardNeuron(); 123 | print('initial circuit output: ' + s.value); // prints 0.8808 124 | 125 | // LOOP! 126 | var print_every = 10; 127 | for (var i = 0; i < 200; i++) { 128 | // BACKWARD: backward() accumulates with +=, so first clear the input gradients left over from the previous iteration 129 | a.grad = b.grad = c.grad = x.grad = y.grad = 0.0; s.grad = 1.0; 130 | sg0.backward(); // writes gradient into axpbypc 131 | addg1.backward(); // writes gradients into axpby and c 132 | addg0.backward(); // writes gradients into ax and by 133 | mulg1.backward(); // writes gradients into b and y 134 | mulg0.backward(); // writes gradients into a and x 135 | 136 | // UPDATE (gradient values in the comments are those of the first iteration) 137 | var step_size = 0.01; 138 | a.value += step_size * a.grad; // a.grad is -0.105 139 | b.value += step_size * b.grad; // b.grad is 0.315 140 | c.value += step_size * c.grad; // c.grad is 0.105 141 | x.value += step_size * x.grad; // x.grad is 0.105 142 | y.value += step_size * y.grad; // y.grad is 0.210 143 | 144 | // FORWARD RESULT 145 | forwardNeuron(); 146 | if (i % print_every == 0) { 147 | print('output at step ' + i + ' : ' + s.value); 148 | } 149 | } 150 | 151 | // FINAL 152 | print('final values:'); 153 | print('a: ' + a.value); 154 | print('b: ' + b.value); 155 | print('c: ' + c.value); 156 | print('x: ' + x.value); 157 | print('y: ' + y.value); 158 | print('final output : ' + s.value); 159 | 160 | -------------------------------------------------------------------------------- /hackers_guide/2_chapt/neural_net.js: -------------------------------------------------------------------------------- 1 | // FROM: http://karpathy.github.io/neuralnets/ 2 | 3 | var print = function(str) { 4 | str_new = document.getElementById('output').innerHTML + "<br>" + str; 5 | document.getElementById('output').innerHTML = str_new; 6 | } 7 | 8 | // -------------------------------------------------------------------- 9 | // TRAINING DATA 10 | // -------------------------------------------------------------------- 11 | var data = []; var labels = []; 12 | data.push([1.2, 0.7]); labels.push(1); 13 | data.push([-0.3, -0.5]); labels.push(-1); 14 | data.push([3.0, 0.1]); labels.push(1); 15 | data.push([-0.1, -1.0]); labels.push(-1); 16 | data.push([-1.0, 1.1]); labels.push(-1); 17 | data.push([2.1, -3]); labels.push(1); 18 | 19 | /* 20 | The code below is a simple neural network with 3 hidden neurons and one output neuron. The code is not modular like the other js code. 21 | 22 | Activation function is ReLU, which is simply max(0, x). 23 | */ 24 | // -------------------------------------------------------------------- 25 | // RANDOMLY INITIALIZE NETWORK PARAMETERS 26 | // -------------------------------------------------------------------- 27 | // Neuron 1 28 | var a1 = Math.random() - 0.5; // a random number between -0.5 and 0.5 29 | var b1 = Math.random() - 0.5; // a random number between -0.5 and 0.5 30 | var c1 = Math.random() - 0.5; // a random number between -0.5 and 0.5 31 | 32 | // Neuron 2 33 | var a2 = Math.random() - 0.5; // a random number between -0.5 and 0.5 34 | var b2 = Math.random() - 0.5; // a random number between -0.5 and 0.5 35 | var c2 = Math.random() - 0.5; // a random number between -0.5 and 0.5 36 | 37 | // Neuron 3 38 | var a3 = Math.random() - 0.5; // a random number between -0.5 and 0.5 39 | var b3 = Math.random() - 0.5; // a random number between -0.5 and 0.5 40 | var c3 = Math.random() - 0.5; // a random number between -0.5 and 0.5 41 | 42 | // Output layer 43 | var a4 = Math.random() - 0.5; // a random number between -0.5 and 0.5 44 | var b4 = Math.random() - 0.5; // a random number between -0.5 and 0.5 45 | var c4 = Math.random() - 0.5; // a random number between -0.5 and 0.5 46 | var d4 = Math.random() - 0.5;
// a random number between -0.5 and 0.5 47 | 48 | // -------------------------------------------------------------------- 49 | // FORWARD/BACKWARD/UPDATE 50 | // -------------------------------------------------------------------- 51 | for(var iter = 0; iter < 400; iter++) { 52 | // pick a random data point 53 | var i = Math.floor(Math.random() * data.length); 54 | var x = data[i][0]; 55 | var y = data[i][1]; 56 | var label = labels[i]; 57 | 58 | // compute forward pass 59 | var n1 = Math.max(0, a1*x + b1*y + c1); // activation of 1st hidden neuron 60 | var n2 = Math.max(0, a2*x + b2*y + c2); // 2nd neuron 61 | var n3 = Math.max(0, a3*x + b3*y + c3); // 3rd neuron 62 | var score = a4*n1 + b4*n2 + c4*n3 + d4; // the score 63 | if (iter % 25 === 0) print(score) 64 | 65 | // compute the pull on top 66 | var pull = 0.0; 67 | if(label === 1 && score < 1) pull = 1; // we want higher output! Pull up. 68 | if(label === -1 && score > -1) pull = -1; // we want lower output! Pull down. 69 | 70 | // now compute backward pass to all parameters of the model 71 | 72 | // backprop through the last "score" neuron 73 | // First backprop is simple deriv of a*b type, ie da = x * top_gradient 74 | var dscore = pull; 75 | var da4 = n1 * dscore; 76 | var dn1 = a4 * dscore; 77 | var db4 = n2 * dscore; 78 | var dn2 = b4 * dscore; 79 | var dc4 = n3 * dscore; 80 | var dn3 = c4 * dscore; 81 | var dd4 = 1.0 * dscore; // phew 82 | 83 | // backprop the ReLU non-linearities, in place 84 | // i.e. just set gradients to zero if the neurons did not "fire" 85 | var dn3 = n3 === 0 ? 0 : dn3; 86 | var dn2 = n2 === 0 ? 0 : dn2; 87 | var dn1 = n1 === 0 ? 
0 : dn1; 88 | 89 | // backprop to parameters of neuron 1 90 | var da1 = x * dn1; 91 | var db1 = y * dn1; 92 | var dc1 = 1.0 * dn1; 93 | 94 | // backprop to parameters of neuron 2 95 | var da2 = x * dn2; 96 | var db2 = y * dn2; 97 | var dc2 = 1.0 * dn2; 98 | 99 | // backprop to parameters of neuron 3 100 | var da3 = x * dn3; 101 | var db3 = y * dn3; 102 | var dc3 = 1.0 * dn3; 103 | 104 | // phew! End of backprop! 105 | // note we could have also backpropped into x,y 106 | // but we do not need these gradients. We only use the gradients 107 | // on our parameters in the parameter update, and we discard x,y 108 | 109 | // add the pulls from the regularization, tugging all multiplicative 110 | // parameters (i.e. not the biases) downward, proportional to their value 111 | da1 += -a1; da2 += -a2; da3 += -a3; 112 | db1 += -b1; db2 += -b2; db3 += -b3; 113 | da4 += -a4; db4 += -b4; dc4 += -c4; 114 | 115 | // finally, do the parameter update 116 | var step_size = 0.01; 117 | a1 += step_size * da1; 118 | b1 += step_size * db1; 119 | c1 += step_size * dc1; 120 | a2 += step_size * da2; 121 | b2 += step_size * db2; 122 | c2 += step_size * dc2; 123 | a3 += step_size * da3; 124 | b3 += step_size * db3; 125 | c3 += step_size * dc3; 126 | a4 += step_size * da4; 127 | b4 += step_size * db4; 128 | c4 += step_size * dc4; 129 | d4 += step_size * dd4; 130 | // wow this is tedious, please use for loops in prod. 131 | // we're done! 132 | } 133 | -------------------------------------------------------------------------------- /hackers_guide/2_chapt/svm.js: -------------------------------------------------------------------------------- 1 | // FROM: http://karpathy.github.io/neuralnets/ 2 | 3 | var print = function(str) { 4 | str_new = document.getElementById('output').innerHTML + "<br>" + str; 5 | document.getElementById('output').innerHTML = str_new; 6 | } 7 | 8 | /* 9 | More realistic example, we have a single neuron which activates with sigmoid: 10 | f(x,y,a,b,c) = \sigma(ax + by + c) 11 | 12 | Sigmoid squashes values to between 0 and 1. 13 | The partial derivative with respect to a single input: 14 | \frac{\partial \sigma(x)}{\partial x} = \sigma(x) (1 - \sigma(x)) 15 | */ 16 | 17 | // -------------------------------------------------------------------------- 18 | // UNIT CLASS 19 | // -------------------------------------------------------------------------- 20 | // every Unit corresponds to a wire in the diagrams 21 | var Unit = function(value, grad) { 22 | // value computed in the forward pass 23 | this.value = value; 24 | // the derivative of circuit output w.r.t this unit, computed in backward pass 25 | this.grad = grad; 26 | } 27 | 28 | // -------------------------------------------------------------------------- 29 | // GATE CLASSES: FORWARD/BACKWARD DEFINITION 30 | // -------------------------------------------------------------------------- 31 | // the backward functions compute only the local derivatives 32 | 33 | // MULTIPLY GATE 34 | var multiplyGate = function() {}; 35 | multiplyGate.prototype = { 36 | forward: function(u0, u1) { 37 | // From two input units, multiply their values and forward result in new parent Unit 38 | // store pointers to input Units u0 and u1 and output unit utop 39 | this.u0 = u0; 40 | this.u1 = u1; 41 | this.utop = new Unit(u0.value * u1.value, 0.0); 42 | return this.utop; 43 | }, 44 | backward: function() { 45 | // take the gradient in output unit and chain it with the 46 | // local gradients, which we derived for multiply gate before 47 | // then write those gradients to those Units.
48 | this.u0.grad += this.u1.value * this.utop.grad; 49 | this.u1.grad += this.u0.value * this.utop.grad; 50 | // remember in multiplication the gradient wrt u0 is u1.value 51 | } 52 | } 53 | 54 | // ADD GATE 55 | var addGate = function() {}; 56 | addGate.prototype = { 57 | forward: function(u0, u1) { 58 | this.u0 = u0; 59 | this.u1 = u1; // store pointers to input units 60 | this.utop = new Unit(u0.value + u1.value, 0.0); 61 | return this.utop; 62 | }, 63 | backward: function() { 64 | // add gate. derivative wrt both inputs is 1 65 | this.u0.grad += 1 * this.utop.grad; 66 | this.u1.grad += 1 * this.utop.grad; 67 | } 68 | } 69 | 70 | // SIGMOID GATE 71 | var sigmoidGate = function() { 72 | // helper function 73 | this.sig = function(x) { 74 | return 1 / (1 + Math.exp(-x)); 75 | }; 76 | }; 77 | sigmoidGate.prototype = { 78 | forward: function(u0) { 79 | this.u0 = u0; 80 | this.utop = new Unit(this.sig(this.u0.value), 0.0); 81 | return this.utop; 82 | }, 83 | backward: function() { 84 | var s = this.sig(this.u0.value); 85 | this.u0.grad += (s * (1 - s)) * this.utop.grad; 86 | } 87 | } 88 | 89 | // CIRCUIT CLASS 90 | // A circuit: it takes 5 Units (x,y,a,b,c) and outputs a single Unit 91 | // It can also compute the gradient w.r.t. 
its inputs 92 | var Circuit = function() { 93 | // create some gates 94 | this.mulg0 = new multiplyGate(); 95 | this.mulg1 = new multiplyGate(); 96 | this.addg0 = new addGate(); 97 | this.addg1 = new addGate(); 98 | }; 99 | Circuit.prototype = { 100 | forward: function(x,y,a,b,c) { 101 | this.ax = this.mulg0.forward(a, x); // a*x 102 | this.by = this.mulg1.forward(b, y); // b*y 103 | this.axpby = this.addg0.forward(this.ax, this.by); // a*x + b*y 104 | this.axpbypc = this.addg1.forward(this.axpby, c); // a*x + b*y + c 105 | return this.axpbypc; 106 | }, 107 | backward: function(gradient_top) { // takes pull from above 108 | this.axpbypc.grad = gradient_top; 109 | this.addg1.backward(); // sets gradient in axpby and c 110 | this.addg0.backward(); // sets gradient in ax and by 111 | this.mulg1.backward(); // sets gradient in b and y 112 | this.mulg0.backward(); // sets gradient in a and x 113 | } 114 | } 115 | 116 | // SVM CLASS 117 | var SVM = function() { 118 | // Class variables a,b,c. 
Keep track of these while iterating 119 | // through all samples 120 | // Fixed initial parameter values 121 | this.a = new Unit(1.0, 0.0); 122 | this.b = new Unit(-2.0, 0.0); 123 | this.c = new Unit(-1.0, 0.0); 124 | 125 | this.circuit = new Circuit(); 126 | }; 127 | SVM.prototype = { 128 | forward: function(x, y) { // assume x and y are Units 129 | this.unit_out = this.circuit.forward(x, y, this.a, this.b, this.c); 130 | return this.unit_out; 131 | }, 132 | backward: function(label) { // label is +1 or -1 133 | 134 | // reset pulls on a,b,c 135 | this.a.grad = 0.0; 136 | this.b.grad = 0.0; 137 | this.c.grad = 0.0; 138 | 139 | // compute the pull based on what the circuit output was 140 | var pull = 0.0; 141 | if(label === 1 && this.unit_out.value < 1) { 142 | pull = 1; // the score was too low for a positive example: pull up 143 | } 144 | if(label === -1 && this.unit_out.value > -1) { 145 | pull = -1; // the score was too high for a negative example: pull down 146 | } 147 | this.circuit.backward(pull); // writes gradient into x,y,a,b,c 148 | 149 | // add regularization pull for parameters: towards zero and proportional to value 150 | this.a.grad += -this.a.value; 151 | this.b.grad += -this.b.value; 152 | }, 153 | learnFrom: function(x, y, label) { 154 | this.forward(x, y); // forward pass (set .value in all Units) 155 | this.backward(label); // backward pass (set .grad in all Units) 156 | this.parameterUpdate(); // parameters respond to tug 157 | }, 158 | parameterUpdate: function() { 159 | var step_size = 0.01; 160 | this.a.value += step_size * this.a.grad; 161 | this.b.value += step_size * this.b.grad; 162 | this.c.value += step_size * this.c.grad; 163 | } 164 | }; 165 | 166 | // -------------------------------------------------------------------------- 167 | // MAIN 168 | // -------------------------------------------------------------------------- 169 | 170 | var data = []; var labels = []; 171 | data.push([1.2, 0.7]); labels.push(1); 172 | data.push([-0.3, -0.5]); 
labels.push(-1); 173 | data.push([3.0, 0.1]); labels.push(1); 174 | data.push([-0.1, -1.0]); labels.push(-1); 175 | data.push([-1.0, 1.1]); labels.push(-1); 176 | data.push([2.1, -3]); labels.push(1); 177 | var svm = new SVM(); 178 | 179 | // a function that computes the classification accuracy 180 | var evalTrainingAccuracy = function() { 181 | var num_correct = 0; 182 | for(var i = 0; i < data.length; i++) { 183 | var x = new Unit(data[i][0], 0.0); 184 | var y = new Unit(data[i][1], 0.0); 185 | var true_label = labels[i]; 186 | 187 | // see if the prediction matches the provided label 188 | var predicted_label = svm.forward(x, y).value > 0 ? 1 : -1; 189 | if(predicted_label === true_label) { 190 | num_correct++; 191 | } 192 | } 193 | return num_correct / data.length; 194 | }; 195 | 196 | // the learning loop 197 | for(var iter = 0; iter < 1000; iter++) { 198 | // pick a random data point 199 | var i = Math.floor(Math.random() * data.length); 200 | var x = new Unit(data[i][0], 0.0); 201 | var y = new Unit(data[i][1], 0.0); 202 | var label = labels[i]; 203 | svm.learnFrom(x, y, label); 204 | 205 | if(iter % 25 == 0) { // every 25 iterations... 206 | print('training accuracy at iter ' + iter + ': ' + evalTrainingAccuracy()); 207 | } 208 | } 209 | 210 | 211 | -------------------------------------------------------------------------------- /hackers_guide/README.md: -------------------------------------------------------------------------------- 1 | javascript code based on: 2 | http://karpathy.github.io/neuralnets/ 3 | -------------------------------------------------------------------------------- /summaries/auto-encoding_var_bayes.md: -------------------------------------------------------------------------------- 1 | [original paper](https://arxiv.org/abs/1312.6114) about var-bayes autoencoders by Kingma and Welling, 2013. 
Check out [implementation in Keras](https://github.com/fchollet/keras/blob/master/examples/variational_autoencoder.py) for example code 2 | 3 | Also: 4 | - [Presentation](https://home.zhaw.ch/~dueo/bbs/files/vae.pdf) about the paper to help understand. 5 | - [Accompanying Python notebook](https://github.com/oduerr/dl_tutorial/tree/master/tensorflow/vae) 6 | 7 | ## Background 8 | - How can we perform efficient approximate inference and learning with directed probabilistic models whose continuous latent variables and/or parameters have intractable posterior distributions? 9 | - Variational Bayes optimizes an approximation of the intractable posterior 10 | - See [lower bounds for estimation](https://www.stat.washington.edu/jaw/COURSES/580s/581/LECTNOTES/ch3-rev1.pdf) 11 | 12 | ## Problem 13 | - Assume $X = \{x^i\}_{i=1}^N$. The process consists of getting $z^i$, generated from a prior $p_{\theta^*}(z)$, and a value $x^i$ generated from the conditional $p_{\theta^*}(x|z)$. 14 | - Algorithm must work in case of intractability and large dataset 15 | 16 | -------------------------------------------------------------------------------- /summaries/autoencoders.md: -------------------------------------------------------------------------------- 1 | see [this](https://blog.keras.io/building-autoencoders-in-keras.html) blogpost. The post is also a good tutorial for Autoencoders in Keras. 2 | 3 | ### What are autoencoders good for? 4 | - Not very good at data compression, much better algorithms out there 5 | - Rarely used in practice in original form. 6 | - Mostly used for data denoising and dimensionality reduction for visualization 7 | 8 | ### Examples 9 | - Autoencode MNIST from 784, down to 32, and then back. When applying to the test set, reconstruction from the low 32-dimensional code looks like the original but blurry. 10 | - Can add a sparsity constraint on the activity of the hidden representations, a type of regularizer 11 | - Deep autoencoder. 
Instead of layers input/hidden/output [784 32 784], we can try [784 128 64 32 64 128 784]. Note the progressive compression/decompression. However, only minute improvement in performance. 12 | - Convolutional autoencoder: Much better than vanilla autoencoder due to the "higher entropic capacity of the encoded representation, 128 dimensions vs. 32 previously". 13 | - Denoising: Train an autoencoder to map noisy images to clean images. We can easily do this by adding Gaussian noise. 14 | - Variational autoencoder: Generative model, you learn parameters of a probability distribution modeling your data. You can then sample this distribution. 15 | -------------------------------------------------------------------------------- /summaries/build_machines_learn_think.md: -------------------------------------------------------------------------------- 1 | # [Building Machines That Learn and Think Like People](http://arxiv.org/abs/1604.00289) 2 | 3 | 4 | -------------------------------------------------------------------------------- /summaries/end-to-end_tf.md: -------------------------------------------------------------------------------- 1 | # Generic ML datasets 2 | - **Linear data** can be solved with linear classifiers such as logistic regression, SVM, and so on. 3 | - **Moon data** has two clusters. A linear classifier cannot fully separate the two classes 4 | - **Saturn data** has a core cluster and ring cluster, requires non-linear classifier 5 | 6 | # Hidden layer, number of nodes 7 | - Simple neural net with one hidden layer and two nodes will dramatically vary in accuracy (0.88~0.96) due to random initialization. A hidden layer with 3 nodes gives consistent results around 0.97 accuracy. 
8 | - Sensitive to weight initialization: use one of the following two 9 | * Truncated normals for weights, 0.1 for biases and ReLU 10 | * Xavier initialization, 0 biases, tanh 11 | -------------------------------------------------------------------------------- /summaries/fully_char_level_nmt.md: -------------------------------------------------------------------------------- 1 | on paper by Lee et al (2016) 2 | 3 | Fully Character-Level Neural Machine Translation without Explicit Segmentation, [annotated](https://drive.google.com/open?id=0ByV7wn2NzevOQ0JtTTRuR0pjUlE), [arXiv](https://arxiv.org/abs/1610.03017) 4 | 5 | [Theano implementation by author Lee](https://github.com/nyu-dl/dl4mt-c2c?utm_campaign=Revue%20newsletter&utm_medium=Newsletter&utm_source=revue) 6 | 7 | ## Background 8 | - Most MT research exclusively at word level 9 | - NMT suffers from out-of-vocab words with languages with rich morphology 10 | - Character-level better suited for multilingual translation than word-level because: 11 | - Does not suffer from out-of-vocab issues 12 | - Can model rare morphological variants of a word 13 | - No segmentation required 14 | - Recent trend in MT is NMT with encoder, decoder and attention mechanism: 15 | - Encoder: Bidirectional RNN, concat of forward and backward hidden states 16 | - Attention: lets decoder attend more to different source symbols for each target symbol. There is a context vector $c_{t'}$ for each time step $t'$ as weighted sum of hidden states, i.e. the weights reflect relevance of inputs to the $t'$-th target token 17 | - Decoder: at time $t'$, computes hidden state $s_{t'}$ as a function of previous prediction, previous hidden and source context vector $c_{t'}$. Note how the context vector is specific for that output time step. 
Next, the prediction is produced by a parametric function of these quantities (beam search is then used to pick the output sequence) 18 | - Loss: model is trained to minimize the negative conditional log-likelihood of the probability of output target given previous target and input. 19 | - Some other work based on character, but mostly subword-to-subword, or subword-to-character. Here they propose fully character-to-character 20 | 21 | ## Character level challenges 22 | - Sentences are much longer 23 | - The decoder softmax operation is much faster over characters 24 | - Attention mechanism with characters grows quadratically 25 | - Encoder must encode long sequence of chars to good representation 26 | 27 | ## Model 28 | - Aggressively uses convolutions + pooling to shorten input and capture local regularities 29 | - **Encoder**: 30 | - Char embedding size 128 31 | - 1D narrow convolution over padded sentence -> output length = input length 32 | - Various filter sizes from width 1 to 8 (up to char n-gram of 8) 33 | - Output of conv op is $Y \in \mathbb R^{N \times T_x}$, where $N$ is number of filter sizes, and $T_x$ is input length 34 | - Max pooling over time over $Y$, without mixing between widths in $N$, with stride $s$. So new $Y \to Y' \in \mathbb R^{N \times (T_x/s)}$, where $s$ was chosen to be 5. 35 | - Highway network over pooling output to regulate information flow 36 | - Finally goes to BiGRU 37 | - **Decoder**: 38 | - Attention and decoder like [NMT](summaries/neural_machine_translation.md) model, but predict characters as opposed to words. 39 | - Two-layer unidirectional GRU with 1024 units, beam search with width 20 40 | - See Table 2 for full model parameters 41 | 42 | ## Experiment Setup 43 | - Char2char model, includes only sentences with max 450 chars. 
44 | - Adam optimizer, $\alpha$ of 0.0001 and minibatch size 64 45 | - Gradient clipping with threshold of 1 46 | - Weights initialized from uniform [-0.01, 0.01] 47 | - Multilingual char2char and bilingual char2char 48 | - A few sub-word models as baseline 49 | - Data scheduling to avoid overfitting to one language 50 | 51 | ## Observations 52 | - Char2char always outperforms 53 | - For some languages bilingual char2char outperforms, in others the multilingual char2char outperforms 54 | - BLEU metrics encourage reference-like translations, so additional evaluation by humans on adequacy and fluency 55 | - Translation improvement by char2char mainly from fluency 56 | - Two weeks to train char2char 57 | - Char2char model not told any concept of word boundary, automatically learns 58 | -------------------------------------------------------------------------------- /summaries/implicit_drd_grn.md: -------------------------------------------------------------------------------- 1 | ACL paper [Implicit Discourse Relation Detection via a Deep Architecture with Gated Relevance Network](https://www.aclweb.org/anthology/P/P16/P16-1163.pdf) 2 | 3 | ## Problem 4 | - Discourse relation recognition, easy for explicit, difficult for implicit. 5 | - Traditionally use word pairs, such as "warm, cold", but data is sparse. 6 | 7 | ## Model 8 | - Word2Vec embeddings, then Bidirectional LSTM to represent input over two separate text units X and Y 9 | - Gated Relevance Network (GRN): 10 | - Get positional representation from BiLSTM output 11 | - Compute relevance score between every pair of x and y, with Bilinear Model and Single Layer Network. 12 | - Bilinear Model: let $(h_{xi}, h_{yj})$ be the vectorized representations from the BiLSTM of positions in X and Y. Bilinear Model is the function: $s(h_{xi}, h_{yj}) = h_{xi}^T M h_{yj}$, where $M \in \mathbb R^{d_h \times d_h}$ is the matrix coefficient to learn. Note the relationship between the two is linear. 
13 | - The Single Layer Network captures nonlinear interaction: standard single hidden neural net where output is nonlinear function over input plus bias, where input is concat of the pair. 14 | - The two models are incorporated through the gate mechanism. The output of the GRN is: gate * linear + (1 - gate) * non-linear. So the gate controls flow from linear and non-linear. 15 | - GRN is similar to Neural Tensor Network (Socher et al 2013) but with added gate. 16 | - Output of GRN is an interaction score matrix 17 | - Max pooling over GRN output which feeds into dense hidden layer (MLP), and finally connects to output. 18 | - Train four binary classifiers to identify top level relations 19 | 20 | ## Observations 21 | - LSTM alone has poor performance, loses too much local information 22 | - Cosine distance, Bilinear or Single Layer alone do not perform very well 23 | - Boost in LSTM when encoding to positional representation, boost with Bilinear and Single Layer relevance scores, and boost with extra gating on the relevance score. The mixture of scores performs best 24 | - Model performs best in all categories 25 | - Using the BiLSTM to encode rather than just words directly gets much better emphasis on relevance scores (Figure 3). Important word pairs for discourse are emphasised and irrelevant ones ignored. 26 | -------------------------------------------------------------------------------- /summaries/intrinsic_dimension.md: -------------------------------------------------------------------------------- 1 | **Measuring the Intrinsic Dimension of Objective Landscapes**, Li et al, ICLR 2018. [openreview](), [arXiv](https://arxiv.org/abs/1804.08838) 2 | 3 | tl;dr: Intrinsic Dimension is the minimal parameter subspace (projected to the total parameters) to achieve a certain performance. It is a measure of model-problem complexity. 4 | 5 | First, let's describe the normal training of a neural network until convergence as the "direct method". 
6 | Consider an alternative to the direct method. 7 | Given a set of model parameters, train a lower dimensional set of parameters which is then projected and added to the fixed larger set. The smaller set size is the "subspace". The size of the subspace controls the degrees of freedom of the model. Train with several subspace sizes, each time training until convergence. As you increase the subspace size, at one point accuracy jumps and you achieve 90% of the performance of the direct method (and NOT 90% accuracy on the task). The smallest subspace size that achieves this 90% is called the intrinsic dimension. 8 | 9 | What's interesting is that when you increase the original model capacity, the subspace parameters are projected to a larger space, and yet you barely need to increase the subspace size to solve the same problem. For example, MNIST always needs a subspace of around 750 with an MLP model. But if you change to a CNN, the subspace for MNIST is much smaller, around 250, showing the CNN is a superior model on this dataset. 10 | 11 | Also, harder problems need a larger subspace, like Pong from pixels vs MNIST. If we view solving MDPs and supervised learning as function approximation, then we can compare different problems with the intrinsic dimension metric. Apparently, Pong from pixels is equivalent to CIFAR-10! 
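
The subspace procedure described in this summary can be sketched as follows. This is only an illustration under assumptions: the function names, the fixed projection matrix `P` (random in practice) and the example numbers are hypothetical and not taken from the paper's code. `projectToFullSpace` forms the full parameters as the fixed initial set plus the projected trainable subspace vector, and `intrinsicDimension` picks the smallest subspace size that reaches a given fraction of the direct method's performance.

```javascript
// Full parameters = fixed initial parameters + projection of the small trainable vector.
// theta0: array of length D, P: D x d matrix (array of rows), thetaSub: array of length d.
function projectToFullSpace(theta0, P, thetaSub) {
  return theta0.map(function (t0, i) {
    var offset = 0;
    for (var j = 0; j < thetaSub.length; j++) {
      offset += P[i][j] * thetaSub[j];
    }
    return t0 + offset;
  });
}

// runs: [{ size: subspaceSize, performance: scoreAfterConvergence }, ...]
// Returns the smallest subspace size reaching fraction * directPerformance,
// or null if no run gets there.
function intrinsicDimension(runs, directPerformance, fraction) {
  var threshold = fraction * directPerformance;
  var sorted = runs.slice().sort(function (a, b) { return a.size - b.size; });
  for (var i = 0; i < sorted.length; i++) {
    if (sorted[i].performance >= threshold) {
      return sorted[i].size;
    }
  }
  return null;
}

// Hypothetical numbers, for illustration only.
var exampleRuns = [
  { size: 750, performance: 0.90 },
  { size: 10, performance: 0.30 },
  { size: 100, performance: 0.80 }
];
var exampleDim = intrinsicDimension(exampleRuns, 0.98, 0.9);
```

Only `thetaSub` would be updated during training; `theta0` and `P` stay fixed, which is what limits the model's degrees of freedom to the subspace size.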
12 | -------------------------------------------------------------------------------- /summaries/learning_phrase_rep_RNN_encoder_decoder_mt.md: -------------------------------------------------------------------------------- 1 | on seminal Cho et al [Encoder-Decoder paper](https://arxiv.org/abs/1406.1078) 2 | 3 | - [Sequence-to-sequence](https://www.tensorflow.org/versions/r0.11/tutorials/seq2seq/index.html) in Tensorflow 4 | - Elaborate [seq2seq](https://github.com/farizrahman4u/seq2seq) library for Keras 5 | 6 | ## Model 7 | - Two RNNs that act as encoder and decoder pair 8 | - The two networks are trained jointly to maximize the conditional probability of the output given the input 9 | - Encoder: scan linearly over input x, at each symbol update the hidden state of the RNN. At the end of the input, hidden state $c$ is a summary of the whole input sequence 10 | - Decoder: Generative model. Predict next $y_t$ given the hidden state $h_t$. However, unlike Encoder, $y_t$ and $h_t$ are both conditioned on $y_{t-1}$ and $c$. 11 | - The Encoder-Decoder, once trained, can be used to either: 12 | - Generate output translation given input. or, 13 | - Score a given pair of input and output sequences produced by another algorithm 14 | 15 | ## Details 16 | - Uses new type of hidden unit, motivated by LSTM 17 | - Reset gate: when reset gate is close to 0, the hidden state is forced to ignore the previous hidden state and reset with the current input only. This allows the hidden state to **drop** any information that is found to be irrelevant later in the future, allowing more compact representation 18 | - Update gate: controls how much information from the previous hidden state will carry over to the current hidden state. Helps remember long-term information like memory cell in LSTM 19 | - Encoder-Decoder ignores normalized frequencies of phrase pairs in original corpora 20 | 21 | ## Experiment 22 | - Uses WMT dataset. Data from Europarl, UN, crawled data. 
For language model only, uses crawled data. 23 | - For RNN Encoder-Decoder, limits source and target vocabulary to most frequent 15k words, about 93% coverage 24 | - RNN E-D has 1000 hidden units in both encoder and decoder. 25 | - Uses rank-100 matrices, equivalent to learning an embedding of dimension 100 for each word. 26 | - Uses tanh as activation for new hidden state 27 | - Early stop after 10 epochs 28 | - Best performing model: Proposed RNN for scoring and add traditional neural net language model (CLSM) to improve the RNN 29 | 30 | ## Observations 31 | - Gating is crucial. Without gating, just using tanh does not give meaningful results. 32 | - The model is focused towards learning linguistic regularities: distinguishing between plausible and implausible translations, or the manifold (regions of probability concentration) of plausible translations. 33 | -------------------------------------------------------------------------------- /summaries/matching_networks.md: -------------------------------------------------------------------------------- 1 | [Matching Networks for One Shot Learning](https://arxiv.org/abs/1606.04080), Vinyals et al, all from DeepMind, for NIPS 2016 2 | 3 | ## Problem 4 | Learning something from few examples is a huge challenge for DNNs, when it can be trivial for humans. A child can generalize the concept of "giraffe" from just a single example. For machines, it's hard. This is a particular challenge for many NLP datasets [my opinion]. 5 | 6 | This motivates the "one-shot" and "few-shot" learning paradigm. 7 | 8 | ## Model 9 | The authors present a model that tries to match an input \hat{x} with an example from a support set of image-label pairs S={(xi, yi)} of k examples. 10 | 11 | The output \hat{y} is defined as the sum of a(\hat{x}, xi)yi, where a() is an attention function. 12 | 13 | ## Attention 14 | - The proposed attention mechanism a() is to use softmax over cosine distance c(f(\hat{x}), g(xi)). 
The softmax normalizes the exponentiated cosine of \hat{x} and x_i by dividing by the sum of exponentiated cosines over all x_j for j=1 to k. So a() is a weight for that class i which is multiplied by y_i. Therefore \hat{y} is a blend of the classes. 15 | - For example, imagine we have 3 classes and a() has calculated a probability for each of 3 classes in yi: 0.3, 0.5 and 0.2. \hat{y} as a weighted sum would be equal to 0.3[1,0,0] + 0.5[0,1,0] + 0.2[0,0,1] = [0.3,0.5,0.2]. This should be more explicit in the paper. 16 | - The functions f() and g() that embed the two inputs are neural networks such as deep ConvNets. 17 | - They propose to modify the embedding function g to include the whole support S instead of only xi, this way the net can change an embedding if it deems it too close to another element in the set. g(xi) becomes g(xi,S). 18 | 19 | ## Results 20 | The approach shows SOTA results on Omniglot, ImageNet and PTB. Note these are smaller datasets selected for the task. The PTB "mini" dataset is proposed by this paper. 21 | 22 | ## Summary 23 | In summary, the model predicts a class indirectly by mapping input samples to samples from an example set by taking their labels. The task for the model is thus to learn how to best represent samples (via a neural net) to compute distance metrics as effectively as possible. Once it becomes good at this, it can match samples with classes it has never even seen in training! 
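
The attention-weighted prediction described above can be illustrated with a small sketch. It assumes the embeddings f(\hat{x}) and g(xi) are already computed, non-zero vectors (the embedding networks themselves are not shown), and the function names are hypothetical rather than taken from the paper.

```javascript
// Cosine similarity between two non-zero vectors of equal length.
function cosine(a, b) {
  var dot = 0, normA = 0, normB = 0;
  for (var i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Numerically stable softmax over an array of scores.
function softmax(scores) {
  var max = Math.max.apply(null, scores);
  var exps = scores.map(function (s) { return Math.exp(s - max); });
  var total = exps.reduce(function (acc, e) { return acc + e; }, 0);
  return exps.map(function (e) { return e / total; });
}

// queryEmbedding: f(x_hat); supportEmbeddings: [g(x_1), ..., g(x_k)];
// supportLabels: one-hot label vectors y_1, ..., y_k.
// Returns y_hat = sum_i a(x_hat, x_i) * y_i.
function matchingPredict(queryEmbedding, supportEmbeddings, supportLabels) {
  var attention = softmax(supportEmbeddings.map(function (g) {
    return cosine(queryEmbedding, g);
  }));
  var yHat = supportLabels[0].map(function () { return 0; });
  attention.forEach(function (weight, i) {
    supportLabels[i].forEach(function (y, c) {
      yHat[c] += weight * y;
    });
  });
  return yHat;
}

// Toy support set with one example per class (made-up embeddings).
var yHatExample = matchingPredict(
  [1, 0],
  [[1, 0], [0, 1], [-1, 0]],
  [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
);
```

Because the attention weights sum to 1 and the labels are one-hot, the returned vector is a distribution over classes, with the most weight on the class whose support example is closest to the query.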
24 | -------------------------------------------------------------------------------- /summaries/neural_machine_translation.md: -------------------------------------------------------------------------------- 1 | on paper by Bahdanau, Cho, Bengio, ICLR 2015 2 | 3 | Neural Machine Translation by Jointly Learning to Align and Translate, [annotated](https://drive.google.com/file/d/0ByV7wn2NzevOS3FmWHVNazhnczA/view?usp=sharing), [arXiv](https://arxiv.org/abs/1409.0473) 4 | 5 | See TensorFlow library [seq2seq](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/ops/seq2seq.py) for usage 6 | 7 | ## Background 8 | - Unlike traditional phrase-based translation systems, which consist of many small sub-components that are tuned separately, NMT builds and trains a single large neural net to encode/decode translation 9 | - Issue with NMT encoder/decoder is compressing all source into single fixed-length vector 10 | 11 | ## Model 12 | - Extends Encoder-Decoder by learning to align and translate 13 | - Does not encode whole input to single fixed-vector, but encodes into a sequence of vectors. Decoder chooses a subset of these vectors adaptively while decoding the translation 14 | - Traditional encoder-decoder, for each $y_t$, the decoder predicts the conditional probability of $y_t$ given all previous words and context. With RNN, this is estimated as $g(y_{t-1}, s_t, c)$ where: $g$ is a non-linear function, $s_t$ is hidden state and $c$ is context vector from encoder 15 | - New decoder is conditioned on distinct context vector $c_i$ for each target word $y_i$: $p(y_i|y_1,...,y_{i-1},x)=g(y_{i-1},s_i,c_i)$ 16 | - Context $c_i$ is weighted sum of annotations $h_j$ from encoder (Eq. 5). The weight is a softmax over an alignment model. 
This effectively acts like an attention mechanism, tells decoder what to pay attention to and relieves encoder burden of encoding all into fixed-length vector 17 | - Decoder has 1000 hidden units and single maxout hidden layer to compute conditional probability of each target word 18 | - Once trained, use beam search to find translation that maximizes the conditional probability 19 | - Encoder is BiRNN, 1000 hidden units 20 | 21 | ## Experiments 22 | - Trained both a regular Encoder-Decoder (Cho et al. 2014), called RNNencdec, and the new proposed model, named RNNsearch 23 | - Trained on sentences of max 30 words (RNNencdec-30, RNNsearch-30), and max 50 words (RNNencdec-50, RNNsearch-50) 24 | - In both cases, RNNsearch outperforms. 25 | 26 | ## Remarks 27 | - Sequence of vectors and adaptive decoding frees model from having to squash all information of a source sentence into a fixed-length vector. Copes better with long sentences 28 | - Alignment learning significantly improves basic encoder-decoder 29 | - Astoundingly, RNNsearch-50 massively outperforms other models as sentences become very long, with no deterioration in BLEU score with sentences approaching length 60 where others go towards BLEU score 0. 30 | - The alignment matrix generally shows monotonic correlations. However, there are a number of non-trivial non-monotonic alignments among adjectives and nouns since order is different between English and French. 
Model correctly translated [European Economic Area] to [zone economique europeenne] 31 | -------------------------------------------------------------------------------- /summaries/overview_optimization.md: -------------------------------------------------------------------------------- 1 | ## Overview optimization algorithms 2 | [blog post](http://sebastianruder.com/optimizing-gradient-descent/index.html#gradientdescentvariants) 3 | 4 | ### Gradient descent 5 | #### Batch gradient descent 6 | - Computes the gradient of the cost on entire training set, in one go 7 | - Minus 8 | - Slow as single update requires measuring loss on all data 9 | - Converges to global minimum only for convex error surfaces 10 | 11 | #### Stochastic gradient descent (SGD) 12 | - Plus 13 | * Parameter update after each training sample, faster than batch 14 | * Frequent updates cause objective function to fluctuate 15 | * Can easily overshoot minimum, so better to decay learning rate 16 | - Should shuffle data at each epoch 17 | 18 | #### Mini-batch gradient descent 19 | - Best of both, size should be 50 ~ 256 20 | - Plus 21 | * Reduces variance of parameter updates 22 | * Efficient computation of matrices 23 | - Minus 24 | * Same issues with learning rate, can address problem with schedules 25 | * Same learning rate applies to all parameters, problem 26 | * Easy to get trapped in local minima and saddle points 27 | 28 | ### Optimization (adaptive methods) 29 | #### Momentum 30 | - Helps SGD accelerate in relevant direction and dampens oscillations 31 | - Adds fraction of previous step vector to current step vector 32 | 33 | #### Nesterov accelerated gradient (NAG) 34 | - Lookahead descent, slows down before hill slopes up 35 | - Calculates momentum as well as approximation of next parameter value 36 | 37 | #### Adagrad 38 | - Adapts learning rate to the parameters 39 | * Large updates for infrequent parameters 40 | * Small updates for frequent parameters 41 | * Different rate for every 
theta at every time step 42 | - Plus 43 | * Good for sparse data 44 | * No longer need to tune learning rate 45 | - Minus 46 | * Because of accumulated sum in denominator, learning rate shrinks and vanishes 47 | 48 | #### Adadelta 49 | - Extends Adagrad, fixes decreasing learning rate 50 | - Restricts window of accumulated past gradients 51 | 52 | #### Adaptive Moment Estimation (Adam) 53 | - Also computes adaptive learning rates for each parameter 54 | - Like Adadelta, stores decaying average of past squared gradients 55 | - Also stores decaying average of past gradients 56 | - Basically stores estimates of the first two moments of the gradient 57 | 58 | ### Tips 59 | - Shuffle before each epoch 60 | - For progressively harder problems, use curriculum learning 61 | - Batch normalization so can use higher learning rates 62 | - Early stopping: stop if error no longer improves 63 | - Add Gaussian noise to gradients 64 | -------------------------------------------------------------------------------- /summaries/scheduled_sampling.md: -------------------------------------------------------------------------------- 1 | Scheduled Sampling for Sequence Prediction with Recurrent Neural Networks, [arXiv](https://arxiv.org/abs/1506.03099) Samy Bengio, Oriol Vinyals, Navdeep Jaitly, Noam Shazeer, 2015 2 | 3 | TLDR; Scheduled sampling improves the quality of language generation by being more robust to mistakes. Use inverse sigmoid decay. 4 | 5 | ## Problem 6 | One of the issues with training RNNs for prediction is that each $y$ being predicted is in part conditioned on previous *true* $y$, whereas at inference time $y$ is conditioned on a *generated* $y$. This can yield errors that accumulate throughout the prediction process. The authors propose a training method, "scheduled sampling", to sometimes condition on the true and sometimes on the generated. 7 | 8 | Traditional inference is conditioned on the *most likely* previous prediction. Prediction error can compound through the entire prediction. 
One way to deal with this is to use beam search, which maintains several probable sequences in memory. Beam search produces $k$ "best" sequences, as opposed to searching through all possibilities of $Y$. 9 | 10 | ## Scheduled Sampling 11 | The authors propose a "curriculum learning" approach that forces the model to deal with mistakes. This is interesting since error correction is baked into the model. 12 | 13 | While training, the sampling mechanism randomly decides to use $y_{t-1}$ or $\hat{y}_{t-1}$. The true previous $y_{t-1}$ token is used with probability $\epsilon_i$. So if $\epsilon_i = 1$, we are using the same training as usual, and when $\epsilon_i = 0$, we are always training on predicted values. *Curriculum learning* strategy for training selects the true previous most of the time and slowly shifts to selecting a predicted previous. At the end of training, $\epsilon_i$ should favor sampling from the model. (See Figure 1). 14 | 15 | The sampling variable $\epsilon$ decays according to a few schemes such as linear decay, exponential decay and inverse sigmoid decay (figure 2). 16 | 17 | ## Experiments 18 | ### Image Captioning 19 | - Trained on MSCOCO, 75k for training and 5k for dev set. 20 | - Each image has 5 possible captions, one is chosen at random. 21 | - Image preprocessed by pretrained CNN 22 | - Word generation done with LSTM(512), vocabulary size is 8857 23 | - Used inverse sigmoid decay 24 | 25 | This approach led the team to first place for MSCOCO captioning challenge 2015. 26 | 27 | ### Constituency Parsing 28 | Map a sentence onto a parse tree. Unlike image captioning, the task is much more deterministic, "uni-modal". Generally only one correct parse tree. 29 | 30 | - One layer LSTM(512) 31 | - Words as embeddings of size 512 32 | - Attention mechanism 33 | - Inverse sigmoid decay 34 | 35 | ### Speech Recognition 36 | 37 | - Two layers of LSTM(250) 38 | - Baseline trained 14 epochs, scheduled sampling only needed 9 epochs. 
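
The three decay schemes and the per-token coin flip from the Scheduled Sampling section can be sketched as below. The formulas follow the forms given in the paper (linear: max(eps_min, k - c*i); exponential: k^i with k < 1; inverse sigmoid: k / (k + exp(i / k)) with k >= 1); the function names, the injectable `rand` argument and any hyperparameter values are assumptions made for illustration.

```javascript
// Linear decay: epsilon_i = max(minEpsilon, k - c * i).
function linearDecay(i, k, minEpsilon, c) {
  return Math.max(minEpsilon, k - c * i);
}

// Exponential decay: epsilon_i = k^i, with k < 1.
function exponentialDecay(i, k) {
  return Math.pow(k, i);
}

// Inverse sigmoid decay: epsilon_i = k / (k + exp(i / k)), with k >= 1.
function inverseSigmoidDecay(i, k) {
  return k / (k + Math.exp(i / k));
}

// With probability epsilon feed the true previous token, otherwise the model's
// own previous prediction. rand is injectable so the choice can be tested.
function chooseDecoderInput(trueToken, predictedToken, epsilon, rand) {
  var draw = (rand || Math.random)();
  return draw < epsilon ? trueToken : predictedToken;
}
```

For instance, `chooseDecoderInput(yTrue, yPred, inverseSigmoidDecay(step, 1000))` would be called once per decoded token during training, where `step` counts training iterations and `1000` is an arbitrary illustrative value of k.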
39 | -------------------------------------------------------------------------------- /summaries/seq2seq_nn.md: -------------------------------------------------------------------------------- 1 | Sutskever et al paper [Sequence to Sequence Learning with Neural Networks](https://arxiv.org/abs/1409.3215), [annotated](https://drive.google.com/file/d/0ByV7wn2NzevOQ1l5aUF4RWYtenc/view?usp=sharing) 2 | 3 | - [Sequence-to-sequence](https://www.tensorflow.org/versions/r0.11/tutorials/seq2seq/index.html) in TensorFlow 4 | - Elaborate [seq2seq](https://github.com/farizrahman4u/seq2seq) library for Keras 5 | 6 | ## Background 7 | - Some problems can be seen as sequence to sequence problems, mapping an input sequence to an output sequence, such as translation and question answering. 8 | - One challenge for DNNs is that the dimensionality of the input/output must be fixed. This can be overcome with LSTMs 9 | 10 | ## Model 11 | - Model maps input ABC to output WXYZ. Unlike Cho et al, the RNN is not used for scoring but to produce the translation 12 | - After training, produce translation with left-to-right beam-search decoder 13 | - Ensemble of 5 deep LSTMs with beam of size 2 14 | - Reverse the order of the input, leave order of output 15 | - Use two LSTMs, much like Encoder-Decoder 16 | - 4 layer LSTMs. Deep LSTMs significantly outperformed shallow LSTMs 17 | - 1000 cells at each layer 18 | - 1000 dimensional embeddings 19 | 20 | ## Experiment 21 | - Used the LSTM to rescore publicly available 1000-best lists of the SMT baseline and obtained close to SOTA results. 22 | - Much better results when reversing the input. Test perplexity drops from 5.8 to 4.7 and test BLEU increases from 25.9 to 30.6! 23 | - Used 160k most frequent words for source language and 80k most frequent for the target language 24 | - Maintain a small number $B$ of most likely partial hypotheses with beam search. If a hypothesis is appended with "EOS", the hypothesis is added to the set of complete hypotheses.
Beam search continues until all partial hypotheses are complete. 25 | - Most sentences are short, some are very long, which can waste computation in a minibatch. Therefore, made sure sentences in a minibatch are roughly the same length, yielding 2x speedup. 26 | - Parallelized on 8 GPUs: one LSTM layer per GPU (for 4 layers), and other 4 GPUs for softmax calculation. Results in 3.7x speedup. 27 | - Ensemble of 5 LSTMs with beam of size 2 is cheaper than a single LSTM with a beam of size 12 28 | - Surprisingly, performs well on long sentences 29 | 30 | -------------------------------------------------------------------------------- /summaries/softmax_bottleneck.md: -------------------------------------------------------------------------------- 1 | 2 | **Breaking the Softmax Bottleneck: A High-Rank RNN Language Model**, Yang et al, ICLR 2018. [openreview](https://openreview.net/forum?id=HkwZSG-CZ), [arXiv](https://arxiv.org/abs/1711.03953) 3 | 4 | Language modeling consists of learning the joint language distribution factorized as a product of word probabilities conditioned on context. 5 | 6 | Consider a language model output matrix A, where each row is the vocabulary distribution given a context. A word logit is produced by the inner product of h_c (RNN hidden state) and w_x (embedding vector), both of dimension d. 7 | 8 | The authors hypothesize A must be high rank to express complex language. The rank of A can be as high as M, the vocabulary size. The single softmax is not expressive enough if d is too small, and thus it is learning a low-rank approximation of A. 9 | 10 | A more expressive model would be Ngram or simply increasing d, however these approaches lead to significant increases in parameters and hurt generality. 11 | 12 | They propose a mixture of K softmax distributions (MoS). Each softmax distribution is weighted by pi_c, which is itself learned from the context.
They empirically measure the MoS matrix A compared to the single softmax A and show that with a mixture of 15 softmaxes trained on PTB, A's rank is as high as M! They get rank 9981, while the single softmax is 400 (Table 6). They also achieve SOTA on PTB and WikiText-2 13 | 14 | -------------------------------------------------------------------------------- /summaries/unreal.md: -------------------------------------------------------------------------------- 1 | Reinforcement Learning with Unsupervised Auxiliary Tasks, Jaderberg et al, ICLR 2017. [openreview](https://openreview.net/forum?id=SJ6yPD5xg). See Caruana PhD thesis from 1997, discusses auxiliary tasks for better representations! 2 | 3 | The paper is about learning additional tasks, which don't require additional data, with the same parameters as the model which learns to maximize rewards with RL. This leads to better representations and performance, in score and speed! 4 | 5 | A base agent learns to maximize rewards with A3C. Experiences are pushed to a replay buffer. Auxiliary tasks sample from this buffer for off-policy training. Interestingly, buffer sampling is intentionally skewed so samples are evenly split between rewarding and negative events. 6 | 7 | The auxiliary tasks include: 8 | - Pixel control: train agents that learn a separate policy for maximally changing the pixels in an input image. 9 | - Reward prediction: process a sequence of consecutive observations and require the agent to predict the reward in the following unseen frame 10 | - Value function replay: resample recent historical sequences from behaviour policy distribution (via replay buffer) and perform extra value function regression in addition to the on-policy A3C.
By resampling previous experience, and randomly varying the temporal position of the truncating window over which the n-step return is computed, value function replay performs value iteration and exploits newly discovered features shaped by reward prediction (sampling distribution not skewed in this case) 11 | -------------------------------------------------------------------------------- /summaries/vanishing_gradients.md: -------------------------------------------------------------------------------- 1 | Summary of [this section](http://neuralnetworksanddeeplearning.com/chap5.html#the_vanishing_gradient_problem) 2 | 3 | ## Exposing the problem 4 | - Train on MNIST dataset 5 | - Base network: [784, 30, 10], input of 784, 1 hidden layer of 30 nodes, 10 class output 6 | * Accuracy: 0.9648 7 | - Expand network: [784, 30, 30, 10], 2 hidden layers 8 | * Improvement in Accuracy: 0.9690 9 | - Expand network: [784, 30, 30, 30, 10], 3 hidden layers 10 | * Accuracy drops: 0.9657 11 | 12 | In theory deep networks work better than shallow ones due to abstraction of features as the network gets deeper. Deep networks could solve a problem with dramatically fewer parameters than shallow networks. But sometimes deep networks don't perform well. Sometimes, when later layers in a deep network work well, early layers get stuck, possibly learning nothing. The opposite may be true, with earlier layers learning well and later ones being stuck. 13 | 14 | ## Vanishing (and exploding) gradients 15 | - For a two hidden-layer net, second layer neurons learn much faster than neurons in the first layer 16 | - For each layer, let gl be the vector of gradients for the l-th layer, where each entry determines how quickly a neuron in that hidden layer learns 17 | - ||gl||, the length of the vector, determines the speed of learning of the l-th layer 18 | - For 2-hidden-layer network, ||g1|| = 0.07 and ||g2|| = 0.31. The second layer learns faster 19 | - For 3-hidden-layer network, lengths are 0.012, 0.06, 0.283.
Again, earlier layers are slower 20 | - As we go deeper, gradients in earlier layers get much smaller, i.e. they vanish 21 | - In some instances, instead of vanishing, early layer gradients may explode 22 | 23 | ## Vanishing (and exploding) gradient cause 24 | Simple network structure with backprop: 25 | 26 | ![network derivative](http://neuralnetworksanddeeplearning.com/images/tikz38.png) 27 | 28 | Where $\sigma (z_j)$ is the sigmoid, and $z_j = w_j a_{j-1} + b_j$ is the weighted input to the activation function in the next neuron. (See proof starting at formula 114). The expression is the partial derivative of the cost with respect to the first bias, b1. Besides the last term, it follows the pattern of weight x sigmoid derivative. 29 | 30 | Looking at the sigmoid derivative plot: 31 | ![sigmoid deriv](http://www.billharlan.com/papers/logistic/img39.png) 32 | - Notice the function peaks at 1/4. 33 | - Weights are initialized with standard Gaussian: mean 0 and standard deviation of 1 34 | - Weights usually satisfy $|w_j| < 1$ 35 | - Terms $|w_j \sigma'(z_j)|$ are then less than 1/4 36 | - The product of all these terms results in a tiny gradient 37 | - If the weights are large enough that $|w_j \sigma'(z_j)| > 1$, the gradient grows exponentially as we move back through the layers, causing it to explode 38 | 39 | In summary, choice of activation function, weight initialization, optimization algorithm and network architecture can cause unstable gradients. 40 | 41 | ## Solution 42 | - Use activation functions which don't squash the input, such as ReLU. See [here](https://cs224d.stanford.edu/notebooks/vanishing_grad_example.html) for effect of sigmoid v ReLU.
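
The product of weight x sigmoid derivative terms can be checked numerically. The sketch below is only an illustration, assuming the simplified chain network from the referenced chapter (one sigmoid neuron per layer); `first_bias_gradient` is a hypothetical helper name, and it returns the gradient with respect to the first bias divided by the final $\partial C / \partial a_L$ factor.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)


def first_bias_gradient(weights, biases, x=0.5):
    """Return sigma'(z_1) * prod_{j >= 2} w_j * sigma'(z_j) for a chain of
    single-neuron sigmoid layers, i.e. dC/db_1 without the dC/da_L factor."""
    a = x
    zs = []
    # Forward pass: record the weighted input z_j of every layer.
    for w, b in zip(weights, biases):
        z = w * a + b
        zs.append(z)
        a = sigmoid(z)
    # Backward pass: multiply the weight x sigmoid-derivative terms.
    grad = sigmoid_prime(zs[0])
    for w, z in zip(weights[1:], zs[1:]):
        grad *= w * sigmoid_prime(z)
    return grad


if __name__ == "__main__":
    # Weights of magnitude 1: every factor is at most 1/4, so the gradient shrinks with depth.
    for depth in (2, 4, 8):
        print("small weights, depth", depth,
              first_bias_gradient([1.0] * depth, [0.0] * depth))
    # Large weights with biases chosen so that z_j = 0: every factor is 100 * 1/4 = 25.
    for depth in (2, 4, 8):
        print("large weights, depth", depth,
              first_bias_gradient([100.0] * depth, [-50.0] * depth))
```

In the second case the biases are picked so that each $z_j = 0$, where the sigmoid derivative takes its maximum value of 1/4. This is the condition under which large weights make the gradient explode rather than vanish.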
43 | 44 | -------------------------------------------------------------------------------- /summaries/var_auto_sequence_class.md: -------------------------------------------------------------------------------- 1 | About the paper [Semi-supervised Variational Autoencoders for Sequence Classification](https://arxiv.org/abs/1603.02514), [annotated](https://drive.google.com/file/d/0ByV7wn2NzevOTXEzLWlNQy1od0k/view?usp=sharing). 2 | 3 | ## Problem 4 | - SemiVAE works well in image classification tasks, but fails for text classification if using a vanilla LSTM as the conditional generative model. 5 | - We have more and more data, but very few labels accompanying the data 6 | - We want unsupervised learning to extract useful features which we can then use in supervised tasks 7 | - RNNs are good for sequence-to-sequence, but not good for high level features like topic, style, and sentiment. Variational Recurrent Autoencoders have been used for this. 8 | 9 | ## Background 10 | - Conditional variational autoencoders can generate samples according to certain attributes of given labels. 11 | 12 | ## The Model 13 | - Novel semi-supervised deep generative model for text classification. The model can generate sentences conditioned on labels 14 | - An RNN encodes input text x with the conditional input y (the label). The generative network then decodes the latent variable z, where $z \sim p(z|x,y)$. 15 | - Conditional LSTM network proposed as conditional generative model. Same as the traditional LSTM equations, except one has an extra term for $y$. 16 | 17 | -------------------------------------------------------------------------------- /web_resources.md: -------------------------------------------------------------------------------- 1 | 2 | 3 | ## Optimization 4 | - Notes on gradient descent, Toussaint 2012. [pdf](http://ipvs.informatik.uni-stuttgart.de/mlr/marc/notes/gradientDescent.pdf). Check Algorithm 2 and 3 5 | 6 | ## Probability 7 | - Review of expectation, 2009.
[pdf](http://math.arizona.edu/~jwatkins/g-expectation.pdf) 8 | 9 | ### Topics 10 | - [Deep Q-Learning](https://keon.io/deep-q-learning/). Keon 11 | - [Policy method for Cartpole](http://kvfrans.com/simple-algoritms-for-solving-cartpole/), Kvfrans. [`repo`](https://github.com/kvfrans/openai-cartpole/blob/master/cartpole-policygradient.py) 12 | - Fundamentals of Policy Gradients, Seita, 2017-03. [`blog`](https://danieltakeshi.github.io/2017/03/28/going-deeper-into-reinforcement-learning-fundamentals-of-policy-gradients/) 13 | - Deep Deterministic Policy Gradients in TensorFlow, Emami, 2016-08. [`blog`](http://pemami4911.github.io/blog/2016/08/21/ddpg-rl.html#References) 14 | 15 | ### Overview 16 | - Deep Reinforcement Learning: Pong from Pixels. Karpathy. [`blog`](http://karpathy.github.io/2016/05/31/rl/) 17 | - [Beginner's guide](https://www.analyticsvidhya.com/blog/2017/01/introduction-to-reinforcement-learning-implementation/). 18 | - [Riemannian manifolds lecture](https://www.youtube.com/watch?v=MtZV82LCNHc), [slides](https://www.robots.ox.ac.uk/~vgg/rg/slides/Oxford-Mar-2014.pdf) 19 | - [Information Geometry lecture](https://www.youtube.com/watch?v=zmUMBLEHhZg), [slides](http://videolectures.net/mlss05us_dasgupta_ig/) 20 | - [Adversarial Attacks, Robustness](https://adversarial-ml-tutorial.org), theory and practice 21 | 22 | ### Tutorials 23 | - [Deep RL for checkers](https://chrislarson1.github.io/blog/2016/05/30/cnn-checkers/) 24 | - [Variational Inference tutorial](https://github.com/philschulz/VITutorial.git) 25 | - Blog on theory and code. Covers **q-learning** with frozen lake, deep q-learning on doom/space invaders, policy gradients on Doom, A2C/A3C with Sonic, PPO with Sonic. [link](https://simoninithomas.github.io/Deep_reinforcement_learning_Course/) 26 | 27 | ## Repos 28 | - Minimal clean examples. Iteration methods, policy gradient, Grid world, CartPole, Atari, etc. 
[`repo`](https://github.com/rlcode/reinforcement-learning) 29 | - Many PyTorch tutorials, all levels, for image and text. [`repo`](https://github.com/yunjey/pytorch-tutorial) 30 | - OpenAI Universe starter code, A3C algo. [`repo`](https://github.com/openai/universe-starter-agent) 31 | - Minimalist REINFORCE for discrete and continuous actions. [`repo`](https://github.com/JamesChuanggg/pytorch-REINFORCE) 32 | - [RLCode](https://github.com/rlcode/reinforcement-learning). Minimal example of DQN, DDQN, PG, A2C, A3C 33 | - [ikostrikov](https://github.com/ikostrikov/pytorch-a2c-ppo-acktr): a2c, ppo, acktr 34 | 35 | ## Tools 36 | - Beam search implementation [in PyTorch](https://github.com/eladhoffer/seq2seq.pytorch/blob/master/seq2seq/tools/beam_search.py) 37 | 38 | ## Course 39 | - Deep RL Bootcamp, [`site`](https://sites.google.com/view/deep-rl-bootcamp/lectures) 40 | 41 | # Datasets 42 | ## NLP 43 | - The Stanford Natural Language Inference (SNLI) Corpus. 570k human-written English sentences. Text entailment [`site`](https://nlp.stanford.edu/projects/snli/) 44 | ## Environments 45 | ### Simulators 46 | - [Gibson Environments: Real-World Perception for Embodied Agents](https://github.com/StanfordVL/GibsonEnv). A virtual environment for agents which is quite realistic. 47 | 48 | ### Environments 49 | - [VizDoom](https://github.com/mwydmuch/ViZDoom). Doom environment using only visual information. Visuals include: FPV game pixels, object labelling visual, depth map, 2D map. Should probably use with a gym wrapper, like [this one](https://github.com/nsavinov/gym-vizdoom). To understand how to set up the engine, check out [this minimalist example](https://github.com/mwydmuch/ViZDoom/blob/master/examples/python/basic.py). Also, check out [this pytorch example](https://github.com/mwydmuch/ViZDoom/blob/master/examples/python/learning_pytorch.py).
50 | - [MAME toolkit](https://github.com/M-J-Murray/MAMEToolkit), wrapper around the popular MAME arcade emulator 51 | - [MiniWorld](https://github.com/maximecb/gym-miniworld), 2D and 3D environments, minimal dependencies, gym friendly 52 | 53 | # DL 54 | - Unreasonable effectiveness of one neuron, [`blog`](https://rakeshchada.github.io/Sentiment-Neuron.html) 55 | 56 | # Math 57 | - Matrix algebra review, 24 pages, [`pdf`](http://faculty.uml.edu/adoerr/92.321/pdf/week6.pdf) 58 | - Variational Inference, [`slides`](http://shakirm.com/papers/VITutorial.pdf) 59 | - Maximum likelihood, [`blog`](http://suriyadeepan.github.io/2017-01-22-mle-linear-regression/) 60 | 61 | # Coding 62 | - PyTorch [tutorial](https://medium.com/towards-data-science/pytorch-tutorial-distilled-95ce8781a89c) 63 | 64 | # Stats Course 65 | - [Stat Trek](http://stattrek.com/), stats and prob course 66 | 67 | # NLP 68 | - [Textual entailment with tf](https://www.oreilly.com/learning/textual-entailment-with-tensorflow) 69 | --------------------------------------------------------------------------------