├── CAS.md
├── DRRN.md
├── Glove.md
├── LSTM.md
├── MTCNN.md
├── MonteCarloBayesianRL.md
├── ProGroGAN.md
├── R-CNN.md
├── README.md
├── Seq2SeqLearning.md
├── TDLearning.md
├── TFX.md
├── WebNav.md
├── asyncMethodsDRL.md
├── autocaption.md
├── capsule.md
├── casCNN.md
├── contControl.md
├── contControlDRL.md
├── contDeepRL.md
├── copyMechanism.md
├── deepCompression.md
├── deterministicPolicyGradients.md
├── discussionThreads.md
├── doubleQLearning.md
├── duelingNetworkArch.md
├── end-to-end-LSTM.md
├── faceNet.md
├── fb_AML.md
├── fitLAM.md
├── genAdversarial.md
├── hand_tracking.md
├── humanLevelControl.md
├── infoTheoreticalEmb.md
├── mineCraft.md
├── ml-test-score.md
├── modelFreeEpisodicControl.md
├── multiAgentRL.md
├── playingAtari.md
├── rldm2015_silver_reinforcement_learning.pdf
├── seq2Seq.md
├── smartReply.md
├── spatial_search.md
├── trustBasedOptimization.md
└── webSpidering.md
/CAS.md:
--------------------------------------------------------------------------------
1 | ## Composition Aware Search
2 |
3 | Having just read Spatial-Semantic Search ([notes](https://github.com/domarps/papers-i-read/blob/master/spatial_search.md)), it is easy to see that this paper draws a lot of inspiration from Region-Based Image Retrieval (RBIR) papers. It is a bit concerning that the paper neither outlines evaluation methodologies on datasets such as Visual Genome and MS-COCO nor mentions user studies to understand model effectiveness. Such benchmarks are important tools for measuring performance, but in a rapidly evolving field like RBIR it can be difficult to keep up with the state of the art.
4 |
5 | Also, it is important to understand how storage mechanisms other than IVFADC can maintain a large index while still enabling fast retrieval of images.
6 |
7 | > Our index is likely where the bulk of future work will be valuable. Currently, it requires
8 | > about 720 GiB of storage space to serve our entire collection of 120 million images, which is still
9 | > large.
10 |
11 | > Specifically, we might see large gains using the inverted multi-index to find candidate images, and using product quantization to heavily compress the stored vectors.
12 |
13 |
14 |
15 |
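16 | For reference, the IVFADC scheme mentioned above (a coarse inverted file plus product-quantized residuals, with asymmetric distance computation at query time) is what libraries such as `faiss` implement. A minimal sketch; the dimensions and index parameters are illustrative assumptions, not the paper's:
17 |
18 | ```python
19 | import numpy as np
20 | import faiss  # assumes the faiss library is installed
21 |
22 | d = 256                                            # embedding dimension (illustrative)
23 | xb = np.random.rand(100000, d).astype("float32")   # image embeddings to index
24 | xq = np.random.rand(5, d).astype("float32")        # query embeddings
25 |
26 | coarse = faiss.IndexFlatL2(d)                      # coarse quantizer for the inverted file
27 | index = faiss.IndexIVFPQ(coarse, d, 1024, 32, 8)   # 1024 lists, 32 subquantizers, 8 bits each
28 | index.train(xb)
29 | index.add(xb)
30 | index.nprobe = 16                                  # inverted lists visited per query
31 | D, I = index.search(xq, 10)                        # top-10 neighbours per query
32 | ```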
--------------------------------------------------------------------------------
/DRRN.md:
--------------------------------------------------------------------------------
1 | [Deep Reinforcement Learning With an Action Space Defined by Natural Language](http://arxiv.org/abs/1511.04636)
2 | ============================================================================
3 |
4 | The paper extends Deep Q-networks to “unbounded” action spaces by embedding the action descriptions the environment provides and greedily choosing the action with the highest Q-value. A novel deep reinforcement relevance network (DRRN) is developed to handle actions defined through natural language in a text-game setting. It is demonstrated that although the DRRN architecture uses fewer parameters, it converges faster than the baseline deep Q-networks.
5 |
6 | Let 𝕊 denote the state space and 𝔸 the entire action space, including all the unique actions over time. A vanilla Q-learning recursion needs to maintain a table of size |𝕊|×|𝔸|, which is not applicable to large state-space problems. An important structural difference from the DQN is that the DRRN extracts a fixed-dimensional embedding not only from the state text but also from each action text, so the distributed representations can capture both the semantic and the syntactic information in the text. The authors argue that the success of the DRRN in handling a natural-language action space comes from mapping both the action texts and the state texts into a finite-dimensional embedding space. The “experience-replay” strategy is employed for learning, using a fixed exploration policy to interact with the environment and obtain data trajectories. This results in embeddings of the state text that are better aligned with its relevant action text, so that the corresponding inner product, which is the Q-value of the action, is higher (a minimal sketch of this scoring appears at the end of these notes). The DRRN is evaluated on two popular text games and compared with two baselines: a linear model and an NN-RL (with two hidden layers). The inputs are the state and action description strings represented as bags of words, and the number of outputs equals the maximum number of actions. Softmax selection is used for the exploration vs. exploitation tradeoff.
7 |
8 | One shortcoming of the paper is that its experiments do not exhibit the complexity of natural language to the full extent. For instance, the largest game tested uses only 2258 words. More importantly, the description of a state at each time step is almost always limited to the visual description of the current scene, lacking any use of higher-level concepts present in natural languages. The environment only provides a small (2-4) number of actions to be evaluated, so the agent never has to pick an action from a large set. This architecture has also been leveraged to learn to attend to actions in settings where multiple actions are taken at each state (Slate MDPs).
9 |
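10 | A minimal PyTorch sketch of the two-tower scoring described above, where the state text and each candidate action text are embedded separately and Q(s, a) is their inner product. The layer sizes and bag-of-words inputs are illustrative assumptions, not the paper's configuration:
11 |
12 | ```python
13 | import torch
14 | import torch.nn as nn
15 |
16 | class DRRN(nn.Module):
17 |     def __init__(self, vocab_size: int, embed_dim: int = 100, hidden: int = 128):
18 |         super().__init__()
19 |         self.state_net = nn.Sequential(nn.Linear(vocab_size, hidden), nn.Tanh(), nn.Linear(hidden, embed_dim))
20 |         self.action_net = nn.Sequential(nn.Linear(vocab_size, hidden), nn.Tanh(), nn.Linear(hidden, embed_dim))
21 |
22 |     def forward(self, state_bow, action_bows):
23 |         s = self.state_net(state_bow)       # (embed_dim,)
24 |         a = self.action_net(action_bows)    # (num_actions, embed_dim)
25 |         return a @ s                        # Q(s, a) per candidate action
26 |
27 | net = DRRN(vocab_size=2258)
28 | q = net(torch.rand(2258), torch.rand(4, 2258))          # 4 candidate actions this turn
29 | action = torch.multinomial(torch.softmax(q, dim=0), 1)  # softmax exploration
30 | ```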
--------------------------------------------------------------------------------
/Glove.md:
--------------------------------------------------------------------------------
1 |
2 |
3 | - GloVe: vector representations of unigrams
4 |   - Cast as a co-occurrence matrix factorization problem
5 |   - Uses a sliding context window
6 |
7 |
8 |
9 | Given a sequence, predict another sequence
10 | Encoders/decoders are GRU/LSTM
11 |
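12 | A tiny illustration of the sliding-window co-occurrence counts that GloVe factorizes (GloVe weights each pair by 1/distance); this helper is illustrative, not the reference implementation:
13 |
14 | ```python
15 | from collections import defaultdict
16 |
17 | def cooccurrence(tokens, window=5):
18 |     counts = defaultdict(float)
19 |     for i, w in enumerate(tokens):
20 |         for j in range(max(0, i - window), i):
21 |             counts[(w, tokens[j])] += 1.0 / (i - j)  # weight decays with distance
22 |             counts[(tokens[j], w)] += 1.0 / (i - j)  # keep the matrix symmetric
23 |     return counts
24 |
25 | print(cooccurrence("the cat sat on the mat".split(), window=2))
26 | ```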
--------------------------------------------------------------------------------
/LSTM.md:
--------------------------------------------------------------------------------
1 | Long Short Term Memory Networks
2 | ===============================
3 |
4 | Can learn time series with long lags between events.
5 | Part of the state-of-the-art deep learning models for sequence recognition, e.g. speech or handwriting.
6 | A deep RNN can contain several LSTM layers stacked on each other.
7 |
8 | Four inputs -- the cell input plus the three gate signals below
9 |
10 | Three Gates
11 | - Output Gate
12 | - Input Gate
13 | - Forget Gate: controls how much of the previous cell state carries over to the next state
14 |
15 | sigmoid or tanh (full signal or zero energy)
16 |
17 | LSTM Layer with Projection
18 |
19 | Step 1: pre-nonlinearity signal calculation
20 |
21 | h_{t-1}: initialize with some constant, say 0.1
22 | Multiply by the weight matrix
23 | Step 2: non-linearities
24 | - Input and forget gates, e.g. sigmoid
25 | - Cell state update, e.g. hyperbolic tangent
26 | - Output signal update, e.g. element-wise product, then the projection matrix (mapping from one dimension to another)
27 |
28 |
29 |
30 |
31 | 95% of the time is spent on computing the matrix product
32 |
33 | Need to optimize this matrix multiplication:
34 | Low-rank approximation (LRA) with SVD
35 | SVD: factorize the weight matrix as
36 | W = U * E * V^T
37 | U and V are unitary matrices
38 | E is a diagonal matrix with singular values in descending order
39 |
40 | Complexity analysis of LRA
41 | Number of multiplies with the feature vector:
42 | before LRA it is rows(W) * cols(W); keeping only the top r singular values reduces it to r * (rows(W) + cols(W)) -- see the sketch at the end of these notes
43 |
44 | LRA Impact on Accuracy
45 | LRA results in a slight degradation in accuracy
46 |
47 | Quantization of matrices
48 | Many quantization schemes exist
49 | The basic idea is to represent 4-byte floats as 2- or 1-byte integers
50 |
51 |
52 |
53 |
54 |
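55 | A minimal numpy sketch of the two tricks above (low-rank approximation of a weight matrix via SVD, followed by a crude per-matrix quantization). The matrix sizes, rank, and scaling scheme are illustrative assumptions, not values from any particular system:
56 |
57 | ```python
58 | import numpy as np
59 |
60 | # Low-rank approximation of a weight matrix W = U * E * V^T
61 | n_out, n_in, rank = 1024, 512, 64
62 | W = np.random.randn(n_out, n_in).astype(np.float32)
63 | U, s, Vt = np.linalg.svd(W, full_matrices=False)
64 | A = U[:, :rank] * s[:rank]           # (n_out, rank)
65 | B = Vt[:rank, :]                     # (rank, n_in)
66 |
67 | x = np.random.randn(n_in).astype(np.float32)
68 | y_full = W @ x                       # n_out * n_in multiplies
69 | y_lra = A @ (B @ x)                  # rank * (n_out + n_in) multiplies
70 | print(np.abs(y_full - y_lra).max())  # approximation error grows as the rank shrinks
71 |
72 | # Crude 1-byte quantization of A (real schemes quantize per row or per block)
73 | scale = np.abs(A).max() / 127.0
74 | A_q = np.round(A / scale).astype(np.int8)
75 | A_deq = A_q.astype(np.float32) * scale
76 | ```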
--------------------------------------------------------------------------------
/MTCNN.md:
--------------------------------------------------------------------------------
1 | ## Face alignment using MTCNN
2 |
3 | The Dlib face detector misses some of the hard examples (partial occlusion, silhouettes, etc.). This makes the training set too "easy", which actually deteriorates model performance on certain benchmarks. To solve this, other face landmark detectors have been tested. One face landmark detector that has proven to work very well in this setting is the Multi-task CNN.
4 |
--------------------------------------------------------------------------------
/MonteCarloBayesianRL.md:
--------------------------------------------------------------------------------
1 | [Monte Carlo Bayesian Reinforcement Learning](http://arxiv.org/abs/1206.6449)
2 | =============================================================================
3 | #### TL;DR
4 | The Partially Observable Markov Decision Process (POMDP) is an elegant and general model for planning under uncertainty. Applications of POMDPs include control of autonomous vehicles, dialog systems, and systems for providing assistance to the elderly. Learning problems such as reinforcement learning, making recommendations, and active learning can also be posed as POMDPs. Unfortunately, solving POMDPs is computationally intractable. When the state space is not too large, the authors define conditions under which solving POMDPs becomes computationally easier and describe algorithms for solving such problems. The algorithms are extended to very large or infinite state spaces using Monte Carlo methods.
5 |
6 | ##### POMDP Background
7 | Partially Observable Markov Decision Process (POMDP): <S, A, T, R, *Ω*, O>
8 | - State S, Actions A and observations *Ω*
9 | - Transition Function T
10 | - Observation Probability O
11 | - Reward Function R
12 |
13 |
14 | ##### Solving Discrete POMDPs
15 | * Intractability
16 | - Finite-horizon POMDP is **PSPACE-complete** (intractable in the worst case)
17 | - No polynomial-sized policy
18 | * Easier to Approximate Subclasses
19 | - Many problems of practical interest are actually not so hard
20 | * Reachable Beliefs
21 | * Covering Number
22 | * SARSOP
23 |
24 | ##### Bayesian Reinforcement Learning(BRL)
25 |
26 | * Markov Decision Process (MDP) <S, A, T, R> with unknown parameters *θ*
27 | * Cast as POMDP with
28 | - States *S*′=(*S* × *θ*)
29 | - Transition matrix dependent on *θ*
30 | - States fully observed
31 | - Prior belief b0(*θ*)
32 | * Problem with employing approaches for solving discrete POMDPs
33 | - Parameters *θ* are often continuous.
34 | - Think driver imperfection, reaction time, acceleration, deceleration parameters
35 |
36 | ##### Monte Carlo Bayesian Reinforcement Learning
37 | * Offline Phase
38 |   - Sample K hypotheses from the prior
39 |   - Solve the resulting discrete POMDP
40 | * Online Phase
41 |   - Execute the policy
42 | * Advantages
43 | - Not restricted to special prior distributions, unlike most BRL methods
44 | - Requires only that priors are easy to sample
45 | - Exploits effective discrete POMDP solvers
46 | - Handles both fully and partially observable domains
47 |
48 | ##### Conclusions
49 | * POMDP elegant and general model for many interesting sequential decision problems
50 | * For interesting applications, need to *scale* up to very large or continuous state spaces
51 |
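52 | A minimal sketch of the offline sampling step, assuming a Dirichlet prior over each transition row of an unknown MDP (the prior and sizes are illustrative, not the paper's experiments). The belief over *θ* then reduces to a distribution over the K sampled hypotheses, giving a discrete POMDP over S × {1..K}:
53 |
54 | ```python
55 | import numpy as np
56 |
57 | rng = np.random.default_rng(0)
58 | n_states, n_actions, K = 5, 2, 10
59 | alpha = np.ones((n_states, n_actions, n_states))  # Dirichlet prior per (s, a) row
60 |
61 | def sample_mdp():
62 |     T = np.empty((n_states, n_actions, n_states))
63 |     for s in range(n_states):
64 |         for a in range(n_actions):
65 |             T[s, a] = rng.dirichlet(alpha[s, a])  # one sampled transition row
66 |     return T
67 |
68 | hypotheses = [sample_mdp() for _ in range(K)]     # K candidate models theta_1..theta_K
69 | belief = np.full(K, 1.0 / K)                      # uniform initial belief over the hypotheses
70 | ```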
--------------------------------------------------------------------------------
/ProGroGAN.md:
--------------------------------------------------------------------------------
1 | ## Progressive Growing of GANs for Improved Quality, Stability, and Variation
2 |
3 |
4 | The main idea is that they train a GAN at a low resolution first, then iteratively add layers at higher and higher resolutions. They show extremely convincing results at 1024x1024.
5 |
6 | #### WGAN technique
7 |
8 | * They use the improved WGAN technique, but this is orthogonal to their contribution (as evidence, they also demonstrate good results without WGAN).
9 | * by starting at low resolution and slowly ramping up, they are able to train faster (around 5x). This is pretty intuitive: it doesn't make sense to start adding details before the core structure of the content is sound.
10 | * From the abstract: “We also propose a simple way to increase the variation in generated images, and achieve a record inception score of 8.80 in unsupervised CIFAR10.”
11 |
--------------------------------------------------------------------------------
/R-CNN.md:
--------------------------------------------------------------------------------
1 | [Rich feature hierarchies for accurate object detection and semantic segmentation](https://arxiv.org/abs/1311.2524)
2 | ============================================================================
3 |
4 | - The goal of R-CNN is to identify the main objects present in an image using the bounding box technique.
5 | - R-CNN proposes a set of regions in the image and checks whether any of them actually correspond to an object.
6 | - This proposal process is known as Selective Search, which looks at the image through windows of different sizes and, for each size, groups adjacent pixels by texture, color, or intensity to identify objects.
7 | - The output is a bounding box and label for each object in the image.
8 |
9 |
10 |
11 |
12 |
13 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | ## 2019-07
2 | - [And the Bit Goes Down: Revisiting the Quantization of Neural Networks](https://arxiv.org/pdf/1907.05686.pdf)
3 |
4 |
5 | ## 2019-06
6 | - [Applying Deep Learning to Airbnb Search](https://arxiv.org/abs/1810.09591)
7 |
8 | ## 2019-04
9 |
10 | - [A survey of Product Quantization](https://www.jstage.jst.go.jp/article/mta/6/1/6_2/_article/-char/en)
11 | - [Product Quantization for Nearest neighbor search](https://lear.inrialpes.fr/pubs/2011/JDS11/jegou_searching_with_quantization.pdf)
12 |
13 |
14 | ## 2019-03
15 | - [Learning Deep Features for Discriminative Localization](http://cnnlocalization.csail.mit.edu/Zhou_Learning_Deep_Features_CVPR_2016_paper.pdf)
16 |
17 |
18 | ## 2019-02
19 | - [Concept Mask: Large-Scale Segmentation from Semantic Concepts](https://arxiv.org/abs/1808.06032)
20 |
21 | ## 2018-06 - 2019-01
22 |
23 | What was Pramod doing?
24 |
25 | ## 2018-04
26 | - [Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks](https://arxiv.org/pdf/1703.10593.pdf)
27 | - [CycleGAN, a Master of Steganography](https://arxiv.org/pdf/1712.02950.pdf)
28 | - [Image to Image translation with Conditional Adversarial Networks](https://github.com/phillipi/pix2pix)
29 |
30 | ## 2018-03
31 | - [ScaledML 2018](https://medium.com/@domarps/the-scaled-ml-conference-2018-e1df716abc58)
32 | - [TensorFlow Estimators: Managing Simplicity vs. Flexibility in High-Level Machine Learning Frameworks](https://dl.acm.org/citation.cfm?doid=3097983.3098171)
33 | ## 2018-01
34 | - [Fine-tuned Language Models for Text Classification](https://github.com/domarps/papers-i-read/blob/master/fitLAM.md) [[arXiv](https://arxiv.org/abs/1801.06146)]
35 | ## 2017-12
36 | - [What’s your ML test score? A rubric for ML production systems](https://github.com/domarps/papers-i-read/blob/master/ml-test-score.md) [[Google Research](https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/45742.pdf)]
37 | - [Cyclical Learning Rates for Training Neural Networks](https://medium.com/@domarps/democratizing-state-of-the-art-sota-techniques-in-ai-6bb473fed44a) [[arXiv](https://arxiv.org/abs/1506.01186)]
38 | - [Snapshot Ensembles : Train 1, Get M for free](https://medium.com/@domarps/democratizing-state-of-the-art-sota-techniques-in-ai-6bb473fed44a) [[arXiv](https://arxiv.org/pdf/1704.00109.pdf)]
39 |
40 |
41 | ## 2017-11
42 | - [**TFX**: A TensorFlow-Based Production-Scale Machine Learning Platform](https://github.com/domarps/papers-i-read/blob/master/TFX.md) [[ACM](https://dl.acm.org/citation.cfm?id=3098021)]
43 | - [Progressive Growing of GANs for Improved Quality, Stability, and Variation](https://github.com/domarps/papers-i-read/blob/master/ProGroGAN.md) [[NVIDIA](http://research.nvidia.com/publication/2017-10_Progressive-Growing-of)]
44 | - [Dynamic Routing between Capsules](https://github.com/domarps/papers-i-read/blob/master/capsule.md) [[NIPS'17](http://papers.nips.cc/paper/6975-dynamic-routing-between-capsules.pdf)]
45 | - [Spatial-Semantic Image Search by Visual Feature Synthesis](https://github.com/domarps/papers-i-read/blob/master/spatial_search.md) [[CVPR'17](http://web.cecs.pdx.edu/~fliu/papers/cvpr2017-search.pdf)]
46 | - [Composition Aware Search](https://github.com/domarps/papers-i-read/blob/master/CAS.md) [[arXiv/whitepaper](https://www.shutterstock.com/labs/compositionsearch/static/cas-final.pdf)]
47 | - [Joint Face Detection and Alignment using Multi-task Cascaded Convolutional Networks](https://github.com/domarps/papers-i-read/blob/master/MTCNN.md) [[IEEE](https://kpzhang93.github.io/MTCNN_face_detection_alignment/index.html)]
48 | - [FaceNet: A Unified Embedding for Face Recognition and Clustering](https://github.com/domarps/papers-i-read/blob/master/faceNet.md) [[arXiv](https://arxiv.org/pdf/1503.03832.pdf)]
49 | - [Skeleton Key: Image Captioning by Skeleton-Attribute Decomposition](https://github.com/domarps/papers-i-read/blob/master/autocaption.md) [[CVPR'17](http://acsweb.ucsd.edu/~yuw176/report/cvpr_2017.pdf)]
50 |
51 |
52 | ## 2017-10
53 | - Oh, hi tensorflow-1.4! Still no arXiv!!
54 |
55 | ## 2017-09
56 | - [A Convolutional Neural Network Cascade for Face Detection](https://github.com/domarps/papers-i-read/blob/master/casCNN.md) [[CVPR'15](http://www.cv-foundation.org/openaccess/content_cvpr_2015/papers/Li_A_Convolutional_Neural_2015_CVPR_paper.pdf)]
57 | - [fast.ai : A Pytorch-first approach to AI](http://www.fast.ai/2017/09/08/introducing-pytorch-for-fastai/)
58 |
59 | ## 2017-07/2017-08
60 | - `Tensorflow` and `boost`! Yes, I am deep learning in C++ -- so no arXiv!!
61 |
62 |
63 | ## 2017-06
64 | - [Rich feature hierarchies for accurate object detection and semantic segmentation](https://github.com/domarps/papers-i-read/blob/master/R-CNN.md) [[arXiv](https://arxiv.org/abs/1311.2524)]
65 |
66 |
67 | ## 2016-08
68 | - [Deterministic Policy Gradients](https://github.com/domarps/papers-i-read/blob/master/deterministicPolicyGradients.md) [[JMLR](http://jmlr.org/proceedings/papers/v32/silver14.pdf)]
69 | - [Model-Free Episodic Control](https://github.com/domarps/papers-i-read/blob/master/modelFreeEpisodicControl.md) [[arXiv](https://arxiv.org/abs/1606.04460)]
70 | - [Human-level control through deep reinforcement learning](https://github.com/domarps/papers-i-read/blob/master/humanLevelControl.md) [[Nature](http://www.nature.com/nature/journal/v518/n7540/full/nature14236.html)]
71 |
72 | ## 2016-07
73 | - [Information-theoretical label embeddings for large-scale image classification](https://github.com/domarps/papers-i-read/blob/master/infoTheoreticalEmb.md)
74 |   [[arXiv](https://arxiv.org/abs/1607.05691)]
75 | - [Incorporating Copying Mechanism in Sequence-to-Sequence Learning](https://github.com/domarps/papers-i-read/blob/master/copyMechanism.md) [[arXiv](https://arxiv.org/abs/1603.06393)]
76 | - [Deep Reinforcement Learning with a Combinatorial Action Space for Predicting and Tracking Popular Discussion Threads](https://github.com/domarps/papers-i-read/blob/master/discussionThreads.md) [[arXiv](https://arxiv.org/abs/1606.03667)]
77 | - [Generative Adversarial Text to Image Synthesis](https://github.com/domarps/papers-i-read/blob/master/genAdversarial.md) [[arXiv](https://arxiv.org/abs/1605.05396)]
78 | - [Sequence to Sequence Learning with Neural Networks](https://github.com/domarps/papers-i-read/blob/master/seq2Seq.md) [[arXiv](https://arxiv.org/abs/1409.3215)]
79 |
80 | ## 2016-06
81 | - [Trust Region Policy Optimization](https://github.com/domarps/papers-i-read/blob/master/trustBasedOptimization.md) [[arXiv](https://arxiv.org/abs/1502.05477)]
82 | - [End-to-end LSTM-based dialog control optimized with supervised and reinforcement learning](https://github.com/domarps/papers-i-read/blob/master/end-to-end-LSTM.md) [[arXiv](https://arxiv.org/abs/1606.01269#)]
83 | - [Continuous Deep Q-Learning with Model-based Acceleration](https://github.com/domarps/papers-i-read/blob/master/contDeepRL.md) [[arXiv](http://arxiv.org/abs/1603.00748)]
84 | - [Asynchronous Methods for Deep Reinforcement Learning](https://github.com/domarps/papers-i-read/blob/master/asyncMethodsDRL.md) [[arXiv](https://arxiv.org/abs/1602.01783)]
85 | - [Dueling Network Architectures for Deep Reinforcement Learning - ICML'16 Best Paper Award](https://github.com/domarps/papers-i-read/blob/master/duelingNetworkArch.md) [[arXiv](http://arxiv.org/abs/1511.06581)]
86 | - [Deep Reinforcement Learning with Double Q-learning](https://github.com/domarps/papers-i-read/blob/master/doubleQLearning.md) [[arXiv](http://arxiv.org/abs/1509.06461)]
87 | - [Monte Carlo Bayesian Reinforcement Learning](https://github.com/domarps/papers-i-read/blob/master/MonteCarloBayesianRL.md) [[arXiv](http://arxiv.org/abs/1206.6449)]
88 | - [Control of Memory, Active Perception, and Action in Minecraft](https://github.com/domarps/papers-i-read/blob/master/mineCraft.md) [[arXiv](http://arxiv.org/abs/1605.09128)]
89 | - [Deep Multiagent Reinforcement Learning for Partially Observable Parameterized Environments](https://github.com/domarps/papers-i-read/blob/master/multiAgentRL.md) [[arXiv](http://arxiv.org/pdf/1511.04143.pdf)]
90 | - [Deep Compression - ICLR'16 Best Paper Award](https://github.com/domarps/papers-i-read/blob/master/deepCompression.md) [[arXiv](http://arxiv.org/abs/1510.00149)]
91 | - [Quoc Le's Tutorials on Deep Learning](http://www.trivedigaurav.com/blog/quoc-les-lectures-on-deep-learning/) [[Tutorial1](https://cs.stanford.edu/~quocle/tutorial1.pdf)] [[Tutorial2](https://cs.stanford.edu/~quocle/tutorial2.pdf)]
92 | - [Sequence to Sequence Learning with Neural Networks](seq2Seq.md) [[arXiv](https://arxiv.org/abs/1409.3215)]
93 |
94 | ## 2016-05
95 | - [Continuous Control with Deep Reinforcement Learning](https://github.com/domarps/papers-i-read/blob/master/contControlDRL.md) [[arXiv](http://arxiv.org/abs/1509.02971)]
96 | - [Playing Atari with Deep Reinforcement Learning](https://github.com/domarps/papers-i-read/blob/master/playingAtari.md) [[arXiv](https://arxiv.org/abs/1312.5602)]
97 | - [Deep Reinforcement Learning With an Action Space Defined by Natural Language](https://github.com/domarps/papers-i-read/blob/master/DRRN.md) [[arXiv](http://arxiv.org/abs/1511.04636)]
98 | - [WebNav : A New Large-Scale Task for Natural Language based Sequential Decision Making](https://github.com/domarps/papers-i-read/blob/master/WebNav.md) [[arXiv](http://arxiv.org/abs/1602.02261)]
99 | - [Using Reinforcement Learning to Spider the Web Efficiently](https://github.com/domarps/papers-i-read/blob/master/webSpidering.md) [[ACM](http://dl.acm.org/citation.cfm?id=657633)]
100 | - [Focused Crawling using Temporal Difference Learning](https://github.com/domarps/papers-i-read/blob/master/TDLearning.md) [[Springer](http://link.springer.com/chapter/10.1007%2F978-3-540-24674-9_16)]
101 |
102 | ## 2016-04
103 | - LINE: Large-scale Information Network Embedding
104 | - [PTE: Predictive Text Embedding through Large-scale Heterogeneous Text Networks](https://chara.cs.illinois.edu/sites/cs591txt/files/0408-presentation.pdf)
105 |
106 | ## 2016-03
107 | - [Link Prediction in Social Networks](https://uofi.box.com/s/nhgzf85bytamdglppbjgc0h4gkzq5zyo)
108 |
109 | ## 2016-02
110 | - Co-Author Relationship Prediction in Heterogeneous Bibliographic Networks [[link](http://www.ccs.neu.edu/home/yzsun/papers/asonam11_pathpredict.pdf)]
111 |
112 | ## 2015-11
113 | - Visual Question Answering (VQA) [[arXiv](http://arxiv.org/abs/1505.00468)]
114 |
115 |
116 |
--------------------------------------------------------------------------------
/Seq2SeqLearning.md:
--------------------------------------------------------------------------------
1 | [Recurrent Neural Networks as Sequence Decoders](https://arxiv.org/abs/1409.3215)
2 | ========================================================================
3 | #### TL;DR: Recurrent Neural Networks have seen a lot of attention as powerful models that are able to decode sequences from signals. The paper reviews some recent successes in machine translation, image understanding, and beyond. The key component of such methods is the use of a recurrent neural network architecture that is trained end-to-end to optimize the probability of the output sequence given those signals.
4 |
--------------------------------------------------------------------------------
/TDLearning.md:
--------------------------------------------------------------------------------
1 | Focused Crawling using Temporal Difference-Learning
2 | ===================================================
3 |
4 | The paper discusses how TD learning can be leveraged for link prediction in focused crawling and presents initial evaluations on a confined dataset. In their approach, every web page is represented as a finite-dimensional feature vector in which each value encodes the presence or absence of a specific keyword, which is key to classifying whether a page is relevant. The state of each page is determined by Temporal Difference Learning in order to minimize the state space. The relevance of a page depends on the set of keywords present in the page.
5 |
6 | Neural networks are used to estimate the values of the different states. During a training session, the crawler randomly follows pages for a defined number of steps or until it reaches a relevant page. Each step represents taking an action *a*, moving the agent from state *s_t* to *s_{t+1}*. The respective reward *r_{t+1}* and the features of the state *s_t* are used as input to the neural network, which is tuned to evaluate the state's potential of belonging to the right path.
7 |
8 | In crawling mode, the crawler maintains a priority list of links to be followed; the priorities are computed by the neural network (see the sketch below). The state value of a child page is inherited from the value of its parent (the current page), or from the average value of its parents when the page is pointed to by multiple pages.
9 |
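10 | A minimal sketch of the two modes described above; a linear value function over keyword features stands in for the paper's neural network, and all names and sizes are illustrative assumptions:
11 |
12 | ```python
13 | import heapq
14 | import numpy as np
15 |
16 | gamma, alpha = 0.9, 0.01
17 | w = np.zeros(8)                          # linear value function over 8 keyword features
18 |
19 | def V(features):
20 |     return float(w @ features)
21 |
22 | def td0_update(features, reward, next_features):
23 |     global w                             # one TD(0) step toward r_{t+1} + gamma * V(s_{t+1})
24 |     td_error = reward + gamma * V(next_features) - V(features)
25 |     w += alpha * td_error * features
26 |     return td_error
27 |
28 | frontier = []                            # crawling mode: links prioritised by the parent page's value
29 | def enqueue(url, parent_features):
30 |     heapq.heappush(frontier, (-V(parent_features), url))  # heapq is a min-heap
31 | ```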
--------------------------------------------------------------------------------
/TFX.md:
--------------------------------------------------------------------------------
1 | ## TFX: A TensorFlow-Based Production-Scale Machine Learning Platform
2 |
3 |
4 | The main idea is to describe a homegrown machine learning platform.
5 |
6 | ## AI ?
7 |
8 | The AI model is after all just a tiny part of what goes into using machine learning in production systems. TFX has been claimed to have empowered Google Play engineers to standardize these moving parts, simplify the platform configuration, and reduce production time from **O(weeks)** to **O(days)**, while providing platform stability that minimizes disruptions. Here is the platform:
9 |
10 | 
11 |
12 | ## Data?
13 |
14 | - Early identification and elimination of data anomalies, before they propagate downstream, saves a lot of wasted time later on.
15 |
16 |
17 | ## Multi-model training, evaluation
18 |
19 | - *Warm Start* : When training a new version of a NN, the parameters corresponding to warm-start features are initialised from a previously trained version of the model, and fine-tuning begins from there.
20 | - Supports only Tensorflow models.
21 |
22 |
23 | ## Tensorflow Serving
24 |
25 | - Production-scale Deployment
26 | - How does this stack up against Flask or AWS Lambda?
27 |
28 | The paper is an easy read, with no math! Highly recommend checking it out.
29 |
30 | References:
31 |
32 | - [Blog](https://blog.acolyer.org/2017/10/03/tfx-a-tensorflow-based-production-scale-machine-learning-platform/)
33 | - [KDD 2017](http://www.kdd.org/kdd2017/papers/view/tfx-a-tensorflow-based-production-scale-machine-learning-platform)
34 |
--------------------------------------------------------------------------------
/WebNav.md:
--------------------------------------------------------------------------------
1 | ## [WebNav: A New Large-Scale Task for Natural Language based Sequential Decision Making](http://arxiv.org/abs/1602.02261)
2 |
3 | TLDR; The authors propose a web navigation task where an agent must find a target page containing a search query (typically a few sentences) by navigating a web graph with restrictions on memory, path length and number of explorable nodes. They train feedforward and recurrent neural networks and evaluate their performance against that of human volunteers.
4 |
5 |
6 | #### Key Points
7 |
8 |
9 | - Datasets: Wiki-[NUM_ALLOWED_HOPS]: WikiNav-4 (6k train), WikiNav-8 (890k train), WikiNav-16 (12M train). The authors evaluate various query lengths for all data sets.
10 | - Vector representation of pages: BoW of pre-trained word2vec embeddings.
11 | - State-dependent action space: All possible outgoing links on the current page. At each step, the agent can peek at the neighboring nodes and see their full content.
12 | - During training, a single correct path is fed to the agent. Beam search is used to make predictions.
13 | - NeuAgent-FF uses a single tanh layer. NeuAgent-Rec uses LSTM.
14 | - Human performance typically worse than that of Neural agents
15 |
16 |
17 | #### Notes/Questions
18 |
19 | - Is it reasonable to allow the agents to "peek" at neighboring pages? Humans can make decisions based on the hyperlink context. In practice, peeking at each page may not be feasible if there are many links on the page.
20 | - I'm not sure if I buy the fact that this task requires Natural Language Understanding. Agents are just matching query word vectors against pages, which is no indication of NLU. An indication of NLU would be if the query was posed in a question format, which is typically short. But here, the authors use several sentences as queries and longer queries lead to better results, suggesting that the agents don't actually have any understanding of language. They just match text.
21 | - Authors say that NeuAgent-Rec performed consistently better for high hop length, but I don't see that in the data.
22 | - The training method seems a bit strange to me because the agent is fed only one correct path, but in reality there are a large number of correct paths and target pages. It may be more sensible to train the agent with all possible target pages and paths to answer a query.
23 |
--------------------------------------------------------------------------------
/asyncMethodsDRL.md:
--------------------------------------------------------------------------------
1 | [Asynchronous Methods for Deep Reinforcement Learning](https://theberkeleyview.wordpress.com/2016/04/25/asynchronous-methods-for-deep-reinforcement-learning/)
2 | ============================================================
3 |
--------------------------------------------------------------------------------
/autocaption.md:
--------------------------------------------------------------------------------
1 | ## Visual Attribute Prediction with Instance Segmentation
2 |
3 |
4 | - Perform instance-level segmentation given an image, then perform attribute prediction on the segmented instances
5 | - Predict a low-resolution heatmap -- then, along with the original image, perform segmentation
6 |
7 | #### Method
8 |
9 | - Step 1 : Instance Level Segmentation
10 | - Step 2 : Attribute Prediction
11 |
12 | #### Applications
13 |
14 | - help improve image search
15 | - help image captioning (segment objects + predict attributes)
16 | - improve image tagging with segmentation (for ambiguous objects,
17 |   verify the correct object label/tag)
18 |
--------------------------------------------------------------------------------
/capsule.md:
--------------------------------------------------------------------------------
1 | ## Dynamic Routing Between Capsules
2 |
3 | Unfortunately, this reader is not intelligent enough to claim that I read the Capsules paper and can explain it well. Heck, even Hinton's co-author has claimed this publicly:
4 |
5 | 
6 |
7 | I rest my case.
8 |
9 | I am going to curate some of the best no-nonsense, non-clickbaity explanations here for you:
10 |
11 | [Hinton-endorsed video explanation](https://www.youtube.com/watch?v=pPN8d0E3900)
12 |
--------------------------------------------------------------------------------
/casCNN.md:
--------------------------------------------------------------------------------
1 | ## A Convolutional Neural Network Cascade for Face Detection
2 |
3 |
--------------------------------------------------------------------------------
/contControl.md:
--------------------------------------------------------------------------------
1 | [Continuous Control with Deep Reinforcement Learning](https://arxiv.org/abs/1509.02971)
2 | ===================================================
3 |
4 |
5 |
--------------------------------------------------------------------------------
/contControlDRL.md:
--------------------------------------------------------------------------------
1 | ## [Continuous Control with Deep Reinforcement Learning](http://arxiv.org/pdf/1509.02971v5.pdf)
2 |
3 | #### TL;DR
4 | The authors have developed a network that can control simulated physical systems via reinforcement learning. Their previous work with deep Q-networks learned to play Atari games via [Q-learning](https://www.cs.toronto.edu/~vmnih/docs/dqn.pdf). However, this cannot work for real systems because the action space is (essentially) infinite since 3D space is continuous. The model hopes to revive attempts to bring value-function approximation from RL algorithms into developmental robotics problems.
5 |
6 |
7 | #### Key Points
8 |
9 | - `Model-Free Reinforcement Learning`: system dynamics are not known
10 | - `continuous action problems` require Q-learning to maximize a complex, non-linear function at each update
11 | - `actor-critic architecture` assigns a separately parameterized `actor` policy
12 | - `Deep DPG (DDPG)` combines a previous approach, Deterministic Policy Gradient (DPG), with DQN
13 | - The off-policy algorithm is preferred over policy gradient methods, which can succeed on high-dimensional problems only at the cost of a large sample space.
14 | - DDPG learns policies using low-dimensional observations using an actor-critic architecture
15 | - virtual robots which learn to perform the task themselves are controlled by the same network via the same set of parameters
16 | - improved stability in the system
17 | - virtual robots are used since it takes 2.5 million steps of experience for training in a virtual environment
18 |
19 | #### Deep Deterministic Policy Gradients
20 |
21 | It is a `Model-free, Deep Actor-Critic` architecture. It has two neural networks: the actor learns a policy \pi and the critic evaluates the policy to estimate Q-values. The actor processes the state of the world and outputs the action as well as the 6 parameter values associated with the state of the world. These outputs are fed to the critic, which evaluates how well the actor has learnt the current state.
22 |
23 | 
24 |
25 |
26 | The critic is trained using the same temporal difference update as the *standard DQN* -- the target is the immediate reward plus the gamma-discounted Q-value of the next state.
27 |
28 | However, the actor network is trained with gradients that are generated from the critic.
29 |
30 | Ask the critic : `How should I change the current action in order to get a higher Q-value estimate?`
31 |
32 | Answer : `The critic's gradient tells you how the 4 actions and the 6 parameters should be changed to obtain a higher Q-value.`
33 |
34 |
35 |
36 | 
37 |
38 |
39 | #### Notes/Questions
40 | > Interestingly, all of our experiments used substantially fewer steps of experience than was used by DQN learning to find solutions in the Atari domain. Nearly all of the problems we looked at were solved within 2.5 million steps of experience (and usually far fewer), a factor of 20 fewer steps than DQN requires for good Atari solutions. This suggests that, given more simulation time, DDPG may solve even more difficult problems than those considered here.
41 |
42 | - Nearly 2.5 million steps needed for learning the basic rules of the system for practical applications
43 | - If the original Deep Q-network problem is converted to a continuous one by mapping each button to the 0-1 interval, how will DDPG compare?
44 | - Adapting this method to continuous tasks typically requires optimizing two function approximators on different objectives.
45 | - This continuous action space `differs` from the natural language action space in that the action space is known. In the [DRRN approach](https://github.com/domarps/papers-i-read/blob/master/DRRN.md), the action space is inherently discrete although a continuous representation of it has been learnt.
46 | - Another continuous variant of the Q-learning algorithm, Normalized Advantage Functions (NAF), is a considerably simpler alternative to DDPG (a sketch of the DDPG updates follows below).
47 |
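48 | A minimal PyTorch sketch of the two DDPG updates described above (critic trained on the DQN-style TD target, actor trained on gradients from the critic). Target networks, the replay buffer, and exploration noise are omitted, and all sizes are illustrative assumptions:
49 |
50 | ```python
51 | import torch
52 | import torch.nn as nn
53 |
54 | obs_dim, act_dim, gamma = 8, 2, 0.99
55 | actor = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, act_dim), nn.Tanh())
56 | critic = nn.Sequential(nn.Linear(obs_dim + act_dim, 64), nn.ReLU(), nn.Linear(64, 1))
57 | actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-4)
58 | critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)
59 |
60 | def ddpg_step(s, a, r, s2):
61 |     with torch.no_grad():  # TD target: reward plus gamma-discounted Q of the next state
62 |         y = r + gamma * critic(torch.cat([s2, actor(s2)], dim=-1))
63 |     critic_loss = ((critic(torch.cat([s, a], dim=-1)) - y) ** 2).mean()
64 |     critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()
65 |
66 |     actor_loss = -critic(torch.cat([s, actor(s)], dim=-1)).mean()  # ascend the critic's gradient
67 |     actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
68 |
69 | ddpg_step(torch.randn(32, obs_dim), torch.randn(32, act_dim), torch.randn(32, 1), torch.randn(32, obs_dim))
70 | ```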
--------------------------------------------------------------------------------
/contDeepRL.md:
--------------------------------------------------------------------------------
1 | [Continuous Deep Q-Learning with Model-based Acceleration](http://arxiv.org/abs/1603.00748)
2 | ===========================================================================================
3 |
4 | #### TL;DR:
5 | The paper aims to bring the generality of model-free Deep Reinforcement Learning into real-world domains by reducing sample complexity. The paper has two key contributions:
6 | * The `Normalized Advantage Function (NAF)` approach to parameterizing the Q-value function enables efficient training of the Q-learning procedure with experience replay.
7 | * `Imagination Rollouts` are a new type of model-guided exploration, where a dynamics model is used to generate both -ve and +ve trajectories that are added to the experience replay buffer.
8 |
9 | The authors describe an approach for applying Q-learning to problems with continuous states and actions, while using neural networks as function approximators. The Q-function is represented in terms of the value function and the advantage function, with the advantage function parameterized quadratically w.r.t. the action (see the sketch at the end of these notes).
10 |
11 | The feasibility of the Q-learning task is achieved by estimating the mean/co-variance of the NAF using a deep network -- thus making it possible to evaluate the max of the Q-function using a simple forward pass through the value and advantage function networks.
12 |
13 |
14 | ##### Notes/Questions to ask:
15 |
16 | * The approach, requiring the estimation of the parameters V(s), mu(s), and P(s), is an interesting alternative to the estimation of Q(s,a) and pi(s) in actor-critic-type methods.
17 | * The authors' claim of improved performance hinges on two facts: the quadratic function acts as a regularizer, and the learning signal for all the estimators comes directly from the environment.
18 |
19 |
20 | ##### References:
21 | * [Paper Reviews](http://icml.cc/2016/reviews/1274.txt)
22 | * [Deep-RL-TensorFlow](https://github.com/carpedm20/deep-rl-tensorflow)
23 |
24 |
25 |
26 |
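27 | A minimal numpy sketch of the NAF decomposition described above: Q(s,a) = V(s) + A(s,a), with A(s,a) = -1/2 (a - mu(s))^T P(s) (a - mu(s)) and P(s) = L(s) L(s)^T. Here mu, V and L are placeholders for the network heads, with illustrative sizes:
28 |
29 | ```python
30 | import numpy as np
31 |
32 | def naf_q(a, mu, V, L):
33 |     P = L @ L.T                       # positive semi-definite precision term
34 |     d = a - mu
35 |     return V - 0.5 * d @ P @ d        # quadratic advantage added to the state value
36 |
37 | act_dim = 3
38 | mu = np.zeros(act_dim)                # argmax_a Q(s, a) is simply mu(s)
39 | L = np.tril(np.random.randn(act_dim, act_dim))
40 | print(naf_q(np.array([0.2, -0.1, 0.4]), mu, V=1.0, L=L))
41 | print(naf_q(mu, mu, V=1.0, L=L))      # the maximum: Q(s, mu(s)) = V(s)
42 | ```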
--------------------------------------------------------------------------------
/copyMechanism.md:
--------------------------------------------------------------------------------
1 | [Incorporating Copying Mechanism in Sequence-to-Sequence Learning](https://arxiv.org/abs/1603.06393)
2 | =================================================
3 |
--------------------------------------------------------------------------------
/deepCompression.md:
--------------------------------------------------------------------------------
1 | [Deep Compression, DSD training and EIE: deep neural network model compression, regularization and hardware acceleration](https://www.youtube.com/watch?v=kQAhW9gh6aU)
2 | ========================================================================================================================
3 |
4 | #### TL;DR
5 | Neural networks are both computationally intensive and memory intensive, making them difficult to deploy on mobile phones and embedded systems with limited hardware resources. To address this limitation, “Deep Compression” is developed to compress the deep neural networks by 10x-49x without loss of prediction accuracy. The authors claim that the "Dense-Sparse-Dense" training method regularizes CNN/RNN/LSTMs to improve the prediction accuracy of a wide range of neural networks given the same model size. The "Efficient Inference Engine" works directly on the deep-compressed DNN model and accelerates the inference, taking advantage of weight sparsity, activation sparsity and weight sharing, which is 13x faster and 3000x more energy efficient than a TitanX GPU.
6 |
7 | #### Slides I understood (or hope to understand):
8 | 
9 | 
10 | 
11 | 
12 | 
13 | 
14 | 
15 | 
16 | 
17 | 
18 | 
19 | 
20 | 
21 | 
22 | 
23 | 
24 | 
25 | 
26 | 
27 | 
28 | 
29 | 
30 |
31 | #### References:
32 | * [SqueezeNet Code](https://github.com/songhan/SqueezeNet-Deep-Compression)
33 |
--------------------------------------------------------------------------------
/deterministicPolicyGradients.md:
--------------------------------------------------------------------------------
1 | [Deterministic Policy Gradient Algorithms](http://jmlr.org/proceedings/papers/v32/silver14.pdf)
2 | ======================================================
3 |
4 | ### Policy Gradient Algorithms for End-to-end policy optimization
5 |
6 | * They calculate noisy estimates of the gradient of the `expected reward` of the policy and then update the policy in the gradient direction.
7 | * Traditionally, a stochastic policy gives a probability distribution over actions, after having observed a large pool of training examples
8 |
9 | ### Issues with vanilla-PG
10 | * `Credit Assignment` : Difficulty in ascertaining the good actions from the others
11 | * Continuous Action Space
12 |
13 | ### Actor-Critic Algorithms
14 |
15 | * The critic's output drives the learning of both the actor and the critic.
16 | * Represent the policy function independent of the value function
17 | * `Policy = actor` and the `Value = critic`
18 | * Actor produces an action given the current state of the environment
19 | * Critic produces a Temporal Difference error signal given the state and the resultant reward. Note that since the critic estimates the action-value function `Q`, it requires the output of the actor as well.
20 | * Deep RL can enable the representation of both actor and critic as Neural networks
21 |
22 |
23 |
24 |
25 |
26 |
--------------------------------------------------------------------------------
/discussionThreads.md:
--------------------------------------------------------------------------------
1 | [Deep RL for Predicting and Tracking Popular Discussion Threads](http://arxiv.org/abs/1606.03667)
2 | ==================================================
3 |
4 | #### TL;DR: Non-personalized Media Recommendation and Trend Spotting in the social network of a target community.
5 |
6 | Since the actual degree of hotness (reward) of a potential hot-topic candidate is not immediately known, a Reinforcement Learning mechanism can estimate future reactions. A Deep RL architecture can handle the `combinatorial action` space defined by natural language. The work is claimed to serve as a benchmark for tracking hot topics in social media with deep RL.
7 |
8 | #### Discussion:
9 |
10 | * The approach differs from the Wolpertinger architecture for reducing the computational complexity of evaluating all actions -- here the actions defined by natural language `vary over different states`.
11 | * Since the `episode` is defined as a trail of text comments beginning from the original poster to the end, the agent picks the best action which affects both the next state and the future expected reward.
12 | * The action is chosen from a given set of candidates which makes modeling Q value difficult.
13 |
14 | #### Architectures:
15 |
16 | * Per-action DQN
17 | * DRRN
18 | * DRRN-Sum
19 | * DRRN-BiLSTM
20 |
21 |
22 |
23 |
24 |
25 |
26 |
--------------------------------------------------------------------------------
/doubleQLearning.md:
--------------------------------------------------------------------------------
1 | [Deep Reinforcement Learning with Double Q-learning](https://hadovanhasselt.wordpress.com/2015/12/10/deep-reinforcement-learning-with-double-q-learning-2/)
2 | ==================================================
3 |
4 | ### TL;DR
5 | The paper demonstrates that, when combined with deep neural networks, Q-learning learns overoptimistic action values in deterministic environments such as Atari video games. The proposed course correction is to employ a variant of Double Q-learning (see the sketch below). The resulting Double DQN algorithm greatly improves over the performance of the DQN algorithm.
6 |
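7 | A minimal sketch of the Double DQN target (the online network selects the next action, the target network evaluates it); `online_net` and `target_net` are assumed callables returning a (batch, n_actions) tensor of Q-values:
8 |
9 | ```python
10 | import torch
11 |
12 | def double_dqn_target(reward, next_state, done, online_net, target_net, gamma=0.99):
13 |     # reward and done are (batch, 1) tensors
14 |     with torch.no_grad():
15 |         best_action = online_net(next_state).argmax(dim=1, keepdim=True)  # select with the online net
16 |         next_q = target_net(next_state).gather(1, best_action)            # evaluate with the target net
17 |         return reward + gamma * (1.0 - done) * next_q
18 | ```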
--------------------------------------------------------------------------------
/duelingNetworkArch.md:
--------------------------------------------------------------------------------
1 | [Dueling Network Architectures for Deep Reinforcement Learning](http://arxiv.org/abs/1511.06581)
2 | =============================================================
3 | ##### TL;DR:
4 | An alternative architecture + learning scheme for Deep Q-Networks(DQN) is proposed in the paper. The Q-function approximation is achieved by decomposing the Q function into a state value and the advantage value. Concretely, the action values Q(s,a) can be written as the sum of a state-dependent offset V(s) and the advantage A(s,a) for specifically taking action a in that state, such that:
5 | Q(s,a) = V(s) + A(s,a)
6 |
7 | The action value is represented with a network architecture that features *two* channels, representing the state value and the advantages. The channels are merged to obtain an estimated action value. The following image (stolen from [Link](https://hadovanhasselt.wordpress.com/2016/06/20/best-paper-at-icml-dueling-network-architectures-for-deep-reinforcement-learning/)) captures the distinction between a conventional DQN and the dueling network
8 |
9 | 
10 |
11 | Following the convolutional layers, we observe that the network is forked into two streams of fully connected feed-forward computation.
12 |
13 | ###### Notes/Questions:
14 | * This paper highlights that, until now, standard neural networks (CNNs, MLPs and LSTMs) have been used to perform this Q-function approximation. The authors suggest that the task of Q-function approximation can benefit from an architecture *specially* designed for reinforcement learning.
15 | * This approach, albeit a minor modification to the architecture and training of DQNs, improves the state of the art as observed in the Atari Learning Environment benchmark.
16 | * The work is a huge motivator for the optimal representation of Q value approximators. It has been demonstrated many times that even **slightly better** approximations to the Q function yield significant improvements in reinforcement learners, so reinforcement-learning specific architectures to improve the Q function approximation seems a natural approach to take.
17 |
18 | ###### References:
19 | * [Reviews](http://icml.cc/2016/reviews/927.txt)
20 | * [Author's Blogpost](https://hadovanhasselt.wordpress.com/2016/06/20/best-paper-at-icml-dueling-network-architectures-for-deep-reinforcement-learning/)
21 |
22 |
23 |
24 |
25 |
26 |
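27 | A minimal PyTorch sketch of the dueling head described above, sitting on top of shared convolutional features. The paper combines the streams as Q(s,a) = V(s) + (A(s,a) - mean_a A(s,a)) to keep V and A identifiable; the layer sizes here are illustrative assumptions:
28 |
29 | ```python
30 | import torch
31 | import torch.nn as nn
32 |
33 | class DuelingHead(nn.Module):
34 |     def __init__(self, feat_dim: int, n_actions: int, hidden: int = 512):
35 |         super().__init__()
36 |         self.value = nn.Sequential(nn.Linear(feat_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))
37 |         self.advantage = nn.Sequential(nn.Linear(feat_dim, hidden), nn.ReLU(), nn.Linear(hidden, n_actions))
38 |
39 |     def forward(self, features):
40 |         v = self.value(features)                    # (batch, 1)
41 |         a = self.advantage(features)                # (batch, n_actions)
42 |         return v + a - a.mean(dim=1, keepdim=True)  # (batch, n_actions) of Q-values
43 |
44 | q = DuelingHead(feat_dim=3136, n_actions=18)(torch.randn(4, 3136))
45 | ```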
--------------------------------------------------------------------------------
/end-to-end-LSTM.md:
--------------------------------------------------------------------------------
1 | [End-to-end LSTM-based dialog control optimized with supervised and reinforcement learning](https://arxiv.org/abs/1606.01269#)
2 | ============================================================================================================
3 |
4 | ##### TL;DR:
5 | The goal of the paper is to model task-oriented dialog systems. A Recurrent Neural Network (LSTM) maps the raw dialog history to a distribution over system actions. A custom API declaring various business rules enables the LSTM to take actions in the real world. The LSTM optimization process is performed via Supervised Learning and Reinforcement Learning.
6 |
7 | #### Key Points:
8 |
9 | The problem of Dialog Systems is essentially solved by tackling two major sub-problems:
10 | * **State Tracking** : Automatic inference of dialog history representation, because of state space issues
11 | * **Action Selection** : Best of both worlds (SL + RL)
12 |   * Supervised learning can provide “good” dialogs for the neural network to memorize
13 | * Reinforcement learning enables LSTM to perform action selection
14 |
15 | #### References
16 | * [Denny Britz's notes](https://github.com/dennybritz/deeplearning-papernotes/blob/master/notes/e2e-dialog-control-sl-rl.md)
17 |
18 |
19 |
--------------------------------------------------------------------------------
/faceNet.md:
--------------------------------------------------------------------------------
1 | ## FaceNet: A Unified Embedding for Face Recognition and Clustering
2 |
--------------------------------------------------------------------------------
/fb_AML.md:
--------------------------------------------------------------------------------
1 | ### Summary
2 |
3 | - CV is only a fraction of the AI efforts.
4 | - Evaluating hardware on a `Performance-per-watt` perspective
5 | - CPUs for inference; GPUs and CPUs for training.
6 |
--------------------------------------------------------------------------------
/fitLAM.md:
--------------------------------------------------------------------------------
1 | Fine-tuned Language Models for Text Classification
2 | ================================================
3 |
4 | Highlights:
5 |
6 | - fitLAM beats multiple SOTA (sometimes by 20%)
7 | - Leverages several techniques, some of which are largely inspired by Computer Vision
8 |
--------------------------------------------------------------------------------
/genAdversarial.md:
--------------------------------------------------------------------------------
1 | [Generative Adversarial Text to Image Synthesis](https://arxiv.org/abs/1605.05396)
2 | =============================================================================
3 | #### TL;DR
4 | The paper extends generative adversarial networks (GANs), learning a generator network and a discriminator network that can generate images from a sentence as well as map an image to the sentence. The approach requires solving two sub-problems:
5 | * Learn a text feature representation that captures the important visual details
6 | * Use these features to synthesize a compelling image that a human may mistake for real
7 |
8 | #### The Scourge of Multi-modality
9 | Deep learning has made great strides in solving the problem; however, one issue that is not solved is that the distribution of images conditioned on a text description is highly multimodal, in the sense that there are very many plausible configurations of pixels that correctly illustrate the description.
10 |
11 | #### Generative Adversarial Networks [[Karpathy's visualization](http://cs.stanford.edu/people/karpathy/gan/)] [[Soumith's eyescream](http://soumith.ch/eyescream/)]
12 | * Optimize the Generator Network (G) to *fool* the adversarially-trained discriminator (D) into predicting
13 | that synthetic images are real.
14 | * Both are competing in a two-player minimax game -- the discriminator tries to distinguish real training data from synthetic images while the generator tries to fool the discriminator (see the sketch at the end of these notes).
15 |
16 | #### char-CNN-RNN
17 |
18 | #### Contributions
19 | The contribution lies in the architecture, as well as two components:
20 | * A matching-aware discriminator (CLS)
21 | * A manifold interpolation technique (INT)
22 |
23 |
24 | ##### Discussions/Questions
25 | * The main limitation of this model and other similar ones is that, so far, they are only able to generate images from the very narrow distribution they were trained on (birds or flowers here).
26 |
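27 | A minimal sketch of the two-player minimax losses described above (unconditional, non-saturating form; the text conditioning and the CLS/INT components are omitted). `D` and `G` are assumed callables returning logits and images respectively, with matching batch sizes:
28 |
29 | ```python
30 | import torch
31 | import torch.nn as nn
32 |
33 | bce = nn.BCEWithLogitsLoss()
34 |
35 | def gan_losses(D, G, real_images, z):
36 |     fake_images = G(z)                  # real_images and z share the same batch size
37 |     ones, zeros = torch.ones(z.size(0), 1), torch.zeros(z.size(0), 1)
38 |     d_loss = bce(D(real_images), ones) + bce(D(fake_images.detach()), zeros)  # D: real vs synthetic
39 |     g_loss = bce(D(fake_images), ones)                                        # G: fool the discriminator
40 |     return d_loss, g_loss
41 | ```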
--------------------------------------------------------------------------------
/hand_tracking.md:
--------------------------------------------------------------------------------
1 | Efficient and precise interactive hand tracking through joint, continuous optimization of pose and correspondences [PDF](http://www.samehkhamis.com/taylor-siggraph2016.pdf) [video]()
2 |
3 | #### Objective
4 |
5 | Real-time inference of precise position of hand using a smooth model of hand surface
6 |
7 |
8 |
--------------------------------------------------------------------------------
/humanLevelControl.md:
--------------------------------------------------------------------------------
1 | [Human-level control through deep reinforcement learning](http://www.nature.com/nature/journal/v518/n7540/full/nature14236.html)
2 | ===============================================================
3 |
4 |
5 |
6 |
7 | #### Background reading
8 |
9 | - [Annotated lectures of David Silver's tutorial at RLDM 2015](https://github.com/domarps/papers-i-read/blob/master/rldm2015_silver_reinforcement_learning.pdf)
10 |
--------------------------------------------------------------------------------
/infoTheoreticalEmb.md:
--------------------------------------------------------------------------------
1 | [Information-theoretical label embeddings for large-scale image classification](https://arxiv.org/abs/1607.05691)
2 | ==================================================
3 |
--------------------------------------------------------------------------------
/mineCraft.md:
--------------------------------------------------------------------------------
1 | [Control of Memory, Active Perception, and Action in Minecraft](https://arxiv.org/abs/1605.09128)
2 | ================================================================
3 |
4 | ##### TL;DR:
5 | The authors introduce and evaluate 3 neural network architectures to extend DQN with a form of memory. Two notable contributions :
6 | * A new Deep RL benchmark on maze-based Minecraft games is proposed with a difficulty intermediate between Atari Games and Continuous Tasks that thoroughly exercise memory needs.
7 | * Network architectures for incorporating memory into deep RL and an empirical evaluation to convincingly demonstrate an improvement in generalization from adding memory.
8 | The paper is another step towards addressing an open question in Deep RL : how can memory be used to deal with partial observability?
9 |
10 | ##### Discussion:
11 | The task combines delayed rewards with partial observability and a high-dimensional visual input. Care is taken to observe the generalization of architectures via task variations and a division of task instances into test and train buckets.
12 |
13 | ##### Notes/Questions to ask:
14 | * The proposed memory-based RL architecture should also be evaluated on standard benchmarks such as Atari.
15 | * It is also important to understand the influence of external memory size on performance.
16 | * The part describing the memory, read-function and the controller were a bit harder to follow without specific background.
17 |
18 | ##### References:
19 |
20 | * [Video Demo](https://www.youtube.com/watch?v=jQg8p-V8jF4)
21 | * [Paper Review](http://icml.cc/2016/reviews/1242.txt)
22 |
--------------------------------------------------------------------------------
/ml-test-score.md:
--------------------------------------------------------------------------------
1 |
2 |
--------------------------------------------------------------------------------
/modelFreeEpisodicControl.md:
--------------------------------------------------------------------------------
1 | [Model-Free Episodic Control](https://arxiv.org/pdf/1606.04460v1.pdf)
2 | ================================
3 |
--------------------------------------------------------------------------------
/multiAgentRL.md:
--------------------------------------------------------------------------------
1 | Deep Multiagent Reinforcement Learning for Partially Observable Parameterized Environments
2 | ========================================================================================================================
3 |
4 | #### TL;DR
5 | As software and hardware agents begin to perform tasks of genuine interest, they will be faced with environments too complex for humans to predetermine the correct actions to take.
6 | Three characteristics shared by many complex domains are:
7 | * high-dimensional state and action spaces
8 | * partial observability
9 | * multiple learning agents
10 |
11 | To tackle such problems, algorithms combine deep neural network function approximation with reinforcement learning. The paper first describes using Recurrent Neural Networks (RNNs) to handle partial observability in Atari games. Next, the multiagent soccer domain Half-Field-Offense is described, and approaches for learning effective policies in its parameterized-continuous action space are enumerated.
12 |
13 | #### What's next in store?
14 | Hierarchical RL : the possibility to observe different goal states.
15 |
16 | #### Key terms:
17 | - n-step Q-learning
18 | - On-Policy Monte Carlo
19 | - Taking an on-policy approach at first before switching to off-policy (Matthew's future work)
20 |
21 | #### Key Slides from the seminar:
22 |
23 | ##### Presentation Outline
24 |
25 | 
26 |
27 | ##### Markov Decision Process
28 |
29 | Simply observe the current state - learn the action to execute.
30 |
31 | 
32 |
33 |
34 |
35 | ##### Partially Observable MDP
36 |
37 | Instead of receiving the full state of the world, the agent only receives observations (which may be noisy and incomplete) -- the agent still performs actions a_t and receives rewards r_t.
38 |
39 | 
40 |
41 | ##### Introductory slide on RL
42 |
43 | 
44 |
45 | ##### Q-Value Function: expected sum of $\gamma$-discounted rewards from taking action a in state s.
46 |
47 | An optimal Q-function yields an optimal policy -- it is important to correctly estimate the value of every action from every state, so that acting optimally reduces to simply choosing the action that maximizes the Q-function in each state (see the sketch below).
48 |
49 | 
50 |
51 | ##### Deep Neural Networks
52 |
53 | 
54 |
55 | ##### Recurrent Q-Learning for POMDPs -- The Atari Environment
56 |
57 | Observation: the current game screen of the Atari game (a 160x210 image with 3 channels)
58 |
59 | 
60 | 
61 |
62 | Are Atari games MDPs or POMDPs? It depends on the number of game screens used in the state representation.
63 |
64 | Many games are partially observable from a single frame -- you can tell the position of the ball but not its velocity!
65 |
66 | ##### Deep Q-Network
67 | The most successful approach to playing Atari games -- it estimates the Q-values for each of the 18 possible actions in an Atari game. The DQN accepts the last 4 game screens as input.
68 | Learning is via TD reinforcement learning -- maintain a replay memory *D*, sample transitions from it, and make the target of the neural net *y* the *reward* plus the *gamma-discounted maximum Q-value of the next state encountered* (sketched below).
69 |
70 | 
71 |
72 | ##### Flickering Atari
73 |
74 | DeepMind established that DQNs perform very well on MDPs, but the motivation here is to test their performance on POMDPs.
75 |
76 | 
77 |
78 | ##### DQN Flickering Pong
79 | Here the game state must be *inferred* with high probability from the previous history of observations! The DQN does not learn the flickering version of Pong very well -- it seems to have trouble establishing the position of the ball or inferring its velocity. Half of the 4 game screens are noisy -- the DQN, however, treats them as normal game screens; the flickering is not intended to handicap the algorithm.
80 |
81 | ##### Deep Recurrent Q-Network
82 | Two major changes:
83 | - The fully connected layer of the DQN is **replaced** with an LSTM (the same number of nodes is present in both layers)
84 | - The recurrence in the LSTM layer hopefully extracts the relevant information from the screen at the current timestep.
85 |
86 | 
87 |
88 | BPTT: Back Propagation Through Time for the last ~~4~~ 10 timesteps.
89 |
90 | The LSTM gives the DQN some redundancy to combat the noisy observations coming in. The hope is to infer the current state of the world even though the observations are noisy. It is important to note that the LSTM has *inferred* the velocity of the Pong ball despite observing just a single frame at every timestep (visualized via a sequence of inputs that **maximizes the activation of a given LSTM unit**).
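A rough PyTorch sketch of the change described above: the DQN's fully connected layer is swapped for an LSTM and the network receives one frame per timestep. Layer sizes are placeholders chosen to be self-consistent, not the exact ones from the talk.

```python
import torch
import torch.nn as nn

class DRQN(nn.Module):
    """DQN whose fully connected layer is replaced by an LSTM (sizes are illustrative)."""
    def __init__(self, num_actions=18):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=4, stride=2), nn.ReLU(),
        )
        self.lstm = nn.LSTM(input_size=32 * 9 * 9, hidden_size=256, batch_first=True)
        self.head = nn.Linear(256, num_actions)

    def forward(self, frames, hidden=None):
        # frames: (batch, time, 1, 84, 84) -- one frame per timestep, not a 4-frame stack
        b, t = frames.shape[:2]
        z = self.conv(frames.reshape(b * t, 1, 84, 84)).reshape(b, t, -1)
        z, hidden = self.lstm(z, hidden)     # recurrence integrates information over time
        return self.head(z), hidden          # Q-values for every timestep

q, _ = DRQN()(torch.zeros(2, 10, 1, 84, 84))   # e.g. BPTT over the last 10 timesteps
```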
91 |
92 |
93 | ##### Performance Analysis
94 |
95 | Note : Pong maxes out at 21
96 |
97 | 
98 |
99 | ##### State Action Spaces
100 |
101 | 
102 |
103 | Observe the view cones. The exact positions and the velocities of the agents are not observable.
104 |
105 | ##### Lillicrap's DDPG
106 |
107 | 
108 |
109 | ##### Training
110 |
111 | 
112 |
113 | ##### Approaches to bound the DDPG action space
114 |
115 | - Squashing Gradients
116 |
117 | 
118 |
119 | - Zero Gradients
120 | 
121 |
122 | - Invert Gradients (a minimal sketch follows this list)
123 | 
124 |
125 | ##### Q-Learning Spectrum
126 |
127 | 
128 |
129 | ##### Low Bias, High Variance Q-learning v/s High Bias, Low Variance Q-learning
130 |
131 | 
132 |
133 | ##### Monte-Carlo
134 |
135 | 
136 |
137 | ##### Experiments
138 |
139 | 
140 |
141 | ##### Inverting Gradients -- can we use eligibility traces for update targets?
142 |
143 | We would like to acquire the high-variance, low-bias behaviour of the Monte-Carlo approach while also staying computationally efficient (a generic n-step target is sketched below).
144 |
145 | 
146 |
147 | ##### Snapshot of the Game
148 |
149 | 
150 |
151 | ##### Off-Policy Monte Carlo
152 |
153 | 
154 |
155 | ##### Deep Multiagent RL
156 |
157 | 
158 |
159 | More challenging, as the action space is twice as large; analogous to learning to write: we have two hands, but we use only one to write.
160 |
161 | 
162 |
163 | 
164 |
165 | ##### Related Work
166 |
167 | 
168 |
169 |
170 | ##### Future Directions
171 | Reducing sample complexity while learning better policies; non-differentiable components in RNNs.
172 |
173 | 
174 |
175 |
176 | #### Discussion(Notes/Questions):
177 | - Instead of working in the full joint action space, which has exponential complexity
178 | - The idea is to have a sequence of tasks to learn (a curriculum); a promising direction for shaping rewards
179 | - Implementation question: mini-batch size of 32, 10 steps of backpropagation
180 | - Training time: the LSTM receives a single frame at each of 10 time steps
181 | - RL is a little slower at learning policies, but flexible enough to handle non-differentiabilities
182 | - Deterministic Policy Gradient vs. Stochastic Policy Gradient
183 |   * to address continuous action spaces
184 |   * trade-off between TD methods for estimating Q-values and policy methods
185 | - The critic must know [ToDo]
186 | - The loss may not be able to tell you that you have converged.
187 |
188 |
--------------------------------------------------------------------------------
/playingAtari.md:
--------------------------------------------------------------------------------
1 | ## [Playing Atari with Deep Reinforcement Learning](https://www.cs.toronto.edu/~vmnih/docs/dqn.pdf)
2 |
3 |
4 | #####TL;DR:
5 | Deep learning techniques can approximate the value function of an RL algorithm much more accurately than standard methods
6 | because they learn adequate features for doing so. This insight led to a first breakthrough when a CNN was combined with a slightly modified version of the standard Q-learning algorithm to get impressive results on a large set of challenging video game problems.
7 |
8 |
9 | ### Playing Atari games with raw-pixel images
10 | ##### Why Atari?
11 | In sharp contrast to the static datasets used in the evaluation pipelines of supervised learning, sequential decision making (a.k.a. reinforcement learning) requires closed-loop systems. Researchers designing their own simulators to test their algorithms is a blatant conflict of interest; a game console such as the Atari 2600 is a very modest platform and serves as an efficient test-bed.
12 |
13 | 
14 |
15 | #### Reinforcement Learning in n-grams
16 |
17 | * Future **Discounted** return starting from a given state at time t conditioned on the actions taken in the trajectory.
18 | 
19 | * Policy maps states to actions
20 | * Q-Value of executing action *a* in state *s* and thereafter following policy *\pi*
21 | 
22 |
23 | * The optimal Q-value function Q*, which maximizes over different policies -- expanding it, we get the Bellman equation (involving both the zero-step prediction and the one-step prediction of the same Q-value), shown below and written out after this list:
24 |
25 | 
26 |
27 | * Q* defines an optimal policy
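For reference, the Bellman optimality equation shown in the slide above is, in standard notation:

```latex
Q^{*}(s, a) \;=\; \mathbb{E}_{s'}\!\left[\, r \;+\; \gamma \max_{a'} Q^{*}(s', a') \;\middle|\; s, a \right]
```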
28 |
29 | #### Approximating Q* in Practice
30 |
31 | The issue is complicated in practice since the transition model is unknown and the feature space is a large high-dimensional one.
32 |
33 | #### Deep Q Networks (Mnih et al. 2013)
34 |
35 | **Train a convolutional network to predict/optimize Q-values (model-free)**
36 | * Learn *online* from interaction data **off-policy (Q-learning like)**
37 | * Minimally processed input -- no hand coded feature extraction
38 | * Sample architecture and hyperparameters for different games
39 | * State-of-the-art results
40 |
41 | #### Deep Q Network Architecture
42 |
43 | 
44 |
45 | The input is the last 4 frames of the game, and the filters are computed independently: essentially these are 8 x 8 x 4 (a rough architecture sketch follows below).
46 | Note: the two layers of convolutions use Rectified Linear activations, applying the same filter across the image to extract lower-level features of the images.
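A rough PyTorch rendering of the architecture described above; the filter counts and the 84x84 preprocessed input follow my recollection of the Mnih et al. (2013) setup, so treat the exact numbers as assumptions:

```python
import torch
import torch.nn as nn

# Input: the last 4 preprocessed 84x84 frames stacked along the channel dimension.
dqn = nn.Sequential(
    nn.Conv2d(4, 16, kernel_size=8, stride=4), nn.ReLU(),   # first conv layer: 8x8x4 filters
    nn.Conv2d(16, 32, kernel_size=4, stride=2), nn.ReLU(),  # second conv layer
    nn.Flatten(),
    nn.Linear(32 * 9 * 9, 256), nn.ReLU(),
    nn.Linear(256, 18),                                     # one Q-value per possible action
)

q_values = dqn(torch.zeros(1, 4, 84, 84))   # -> shape (1, 18)
```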
47 |
48 | Notice that the target y is independent of the theta parameter.
49 | 
50 |
51 |
52 |
53 |
--------------------------------------------------------------------------------
/rldm2015_silver_reinforcement_learning.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/domarps/papers-i-read/3d5ed5a71a0a44cc6fc701630d511408ccdbb8fa/rldm2015_silver_reinforcement_learning.pdf
--------------------------------------------------------------------------------
/seq2Seq.md:
--------------------------------------------------------------------------------
1 | [Sequence to Sequence Learning with Neural Networks](https://arxiv.org/abs/1409.3215)
2 | =================================================
3 | ####Motivation:
4 | Seq2Seq addresses problems such as machine translation, language modeling, and speech recognition. At a high level, the model takes in a sequence of inputs, looks at each element of the sequence, and tries to predict the next element. Language models are generative -- once trained, they can be used to generate sequences of information by feeding their previous outputs back into the model (see the sketch below).
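A minimal sketch of that generative loop. The `next_token_distribution` stub is a hypothetical stand-in for a trained model, so this only illustrates feeding outputs back in, not the paper's encoder-decoder architecture:

```python
import numpy as np

VOCAB = ["<eos>", "the", "cat", "sat"]

def next_token_distribution(prefix):
    """Stand-in for a trained language model: returns a probability over the vocabulary."""
    rng = np.random.default_rng(len(prefix))
    p = rng.random(len(VOCAB))
    return p / p.sum()

def generate(max_len=10):
    """Feed each predicted token back in as input until <eos> or the length limit."""
    prefix = []
    for _ in range(max_len):
        token = VOCAB[int(np.argmax(next_token_distribution(prefix)))]  # greedy choice
        if token == "<eos>":
            break
        prefix.append(token)
    return prefix

print(generate())
```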
5 |
6 | ####TL;DR:
7 |
--------------------------------------------------------------------------------
/smartReply.md:
--------------------------------------------------------------------------------
1 | [Smart Reply: Automated Response Suggestion for Email](https://arxiv.org/abs/1606.04870)
2 | ============================
3 |
--------------------------------------------------------------------------------
/spatial_search.md:
--------------------------------------------------------------------------------
1 | ## Spatial-Semantic Image Search by Visual Feature Synthesis
2 |
3 | - The paper attempts to rethink text-based and image-based search by proposing a novel Region-Based Image Retrieval (RBIR) methodology.
4 | - RBIR addresses the relationship between the spatial query and its semantics: it can handle queries composed of semantic concepts (i.e., object categories) and their spatial layout.
5 | - This is because users want to constrain their retrieval both _semantically_ and _spatially_.
6 | - The paper leverages how those elements are spatially arranged.
7 | - Users can create, manipulate and annotate **bounding boxes** to specify their search intent, and a neural agent automatically retrieves relevant images.
8 |
9 |
10 | #### Visual Feature Synthesis
11 |
12 | - Represent database images using pre-trained deep visual features
13 | - Train a ConvNet model to synthesize visual representation from the query
14 | - Use the synthesized feature to retrieve the database images
15 |
16 |
17 | 
18 |
19 | Specifically, the method transforms the user's canvas query into a spatial-semantic representation: each spatial location is associated with the semantic word vector (_word2vec_) of the concept placed there. A Convolutional Neural Network (CNN) then synthesizes the appropriate visual feature. The network is trained with three loss functions: `Similarity Loss`, `Discriminative Loss` and `Ranking Loss`.
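A rough sketch of how such a spatial-semantic representation could be assembled from the user's boxes. The grid size, toy word vectors, and averaging rule here are my own assumptions, not the paper's exact construction:

```python
import numpy as np

# Toy word vectors standing in for word2vec embeddings (dimension chosen arbitrarily).
WORD_VECS = {"dog": np.array([1.0, 0.0, 0.0]), "tree": np.array([0.0, 1.0, 0.0])}

def canvas_to_grid(boxes, grid=8, dim=3):
    """Map each canvas cell to the (averaged) word vector of the boxes covering it.

    boxes: list of (label, x0, y0, x1, y1) with coordinates normalized to [0, 1].
    Returns a (grid, grid, dim) tensor that a CNN could consume.
    """
    rep = np.zeros((grid, grid, dim))
    count = np.zeros((grid, grid, 1))
    for label, x0, y0, x1, y1 in boxes:
        c0, r0 = int(x0 * grid), int(y0 * grid)
        c1, r1 = int(np.ceil(x1 * grid)), int(np.ceil(y1 * grid))
        rep[r0:r1, c0:c1] += WORD_VECS[label]
        count[r0:r1, c0:c1] += 1
    return np.where(count > 0, rep / np.maximum(count, 1), 0.0)

grid = canvas_to_grid([("dog", 0.1, 0.5, 0.4, 0.9), ("tree", 0.6, 0.1, 0.95, 0.8)])
```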
20 |
21 | Quantitative evaluation on the Visual Genome and MS-COCO datasets shows that the method outperforms the other compared methods.
22 |
--------------------------------------------------------------------------------
/trustBasedOptimization.md:
--------------------------------------------------------------------------------
1 | [Trust Region Policy Optimization](https://arxiv.org/abs/1502.05477)
2 | ============================================
3 |
4 | ####TL;DR:
5 | The success of neural networks in supervised learning relies on the fact that learning reduces to a nonlinear optimization problem. A better understanding of monotonic policy improvement would let RL fully wield the power of nonlinear function approximators (which the field has been rather slow to adopt). The authors aim to *achieve monotonic policy improvement*, both in theory and in practice. A good update scheme should be applicable to arbitrary policy parameterizations and should provide guidance about step-size selection.
6 |
7 | ###Contributions:
8 |
9 | A policy-update scheme with a monotonic improvement guarantee, which in turn inspires a practical trust-region algorithm whose robustness is demonstrated on challenging domains.
10 |
11 | ####Problem Setup:
12 | Consider a Markov Decision Process (MDP)
13 | 
14 |
15 | ####Neat Identity:
16 |
17 | 
18 |
19 | ####Surrogate Loss Function:
20 |
21 | 
22 |
23 |
24 | ####MM Algorithm:
25 | 
26 |
27 | ####Improvement Theorem
28 | 
29 |
30 | ##References:
31 | [ICML 2015 presentation](http://videolectures.net/icml2015_schulman_policy_optimization/)
32 |
33 |
34 |
35 |
36 |
37 |
--------------------------------------------------------------------------------
/webSpidering.md:
--------------------------------------------------------------------------------
1 | Using Reinforcement Learning to Spider the Web Efficiently
2 | ==========================================================
3 |
4 | One of the major issues faced by the crawlers is that it is increasingly arduous to learn that some sets of off-topic documents often lead reliably to highly relevant documents. An important observation of the topic-specific spidering is that the environment presents situations with delayed reward – this makes reinforcement learning the appropriate framework.
5 |
6 | To help explain how reinforcement learning relates to spidering, the on-topic documents are immediate *rewards*. An *action* is following a particular hyperlink and the *state* is the set of on-topic documents remaining to be found and the set of hyperlinks to be discovered. One preliminary observation is that the state space is huge and difficult to generalize as it encompasses not only the on-topic documents remaining to be explored but also the set of the hyperlinks which serve as possible actions. Also, the number of the available actions is large and difficult to generalize if we consider the number of distinct hyperlinks within the Web as our action space, as described earlier.
7 |
8 | To explicitly address this problem, Rennie and McCallum do not model the state space and instead, capture relevant distinctions between the actions using only the words in the *neighborhood* of the corresponding hyperlink. For every action, a single Q-value is determined by calculating the discounted sum of rewards received by following the optimal policy after traversing the given hyperlink. Thus the Q-value becomes a mapping from a “bag of words” to a scalar.
9 |
10 | Efficient spidering is achieved when this mapping is learnt from a training dataset containing the bag-of-words/Q-value pairs. The authors chose greedy search, which results in following the hyperlinks that maximize immediate reward.
11 |
12 | The paper also discusses a method for using Naive Bayes as a function approximator. In addition, the text classification task is built on a Bayesian learning framework, where the text data is generated by a parametric model and the training data (BOW/Q-value tuples) is used to calculate the MAP estimates of the model parameters. Each class is modeled by a multinomial over words. Therefore, classification is essentially the simple matter of selecting the most probable class given the document's words (see the sketch below).
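One way to realize the bag-of-words-to-Q-value mapping, loosely following the paper's description: discretize the future discounted reward into bins, train a multinomial naive Bayes text classifier over the bins, and estimate Q as the probability-weighted average of the bins' representative values. The snippet below uses scikit-learn and made-up data purely for illustration:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hyperlink "neighborhood" texts paired with discretized future-reward (Q-value) bins.
train_texts = ["faculty research publications", "campus parking map",
               "papers preprints postscript", "athletics schedule tickets"]
train_bins = [1, 0, 1, 0]            # bin index per example (toy labels)
BIN_VALUES = np.array([0.1, 2.0])    # representative Q-value for each bin (assumed)

vectorizer = CountVectorizer()
clf = MultinomialNB().fit(vectorizer.fit_transform(train_texts), train_bins)

def estimate_q(neighborhood_text):
    """Expected Q-value: average the bins' representative values, weighted by P(bin | words)."""
    probs = clf.predict_proba(vectorizer.transform([neighborhood_text]))[0]
    return float(probs @ BIN_VALUES)

# The spider then greedily follows the hyperlink whose neighborhood text scores highest.
print(estimate_q("new research publications"))
```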
13 |
--------------------------------------------------------------------------------