├── Artificial Intelligence.md ├── Bayesian Inference and Learning.md ├── Causal Inference.md ├── Deep Learning.md ├── Information Retrieval.md ├── Knowledge Representation and Reasoning.md ├── Machine Learning.md ├── Natural Language Processing.md ├── Probabilistic Programming.md ├── Recommender Systems.md ├── Reinforcement Learning.md └── interesting recent papers.md /Causal Inference.md: -------------------------------------------------------------------------------- 1 | 2 | 3 | * [**overview**](#overview) 4 | * [**theory**](#theory) 5 | * [**interesting papers**](#interesting-papers) 6 | 7 | 8 | 9 | --- 10 | ### overview 11 | 12 | ["Why Correlation Usually != Causation"](https://gwern.net/Causality) by Gwern Branwen 13 | 14 | ["Do we still need models or just more data and compute?"](https://staff.fnwi.uva.nl/m.welling/wp-content/uploads/Model-versus-Data-AI-1.pdf) by Max Welling 15 | 16 | ["ML beyond Curve Fitting: An Intro to Causal Inference and do-Calculus"](http://inference.vc/untitled) by Ferenc Huszar 17 | ["Causal Inference 2: Illustrating Interventions via a Toy Example"](https://inference.vc/causal-inference-2-illustrating-interventions-in-a-toy-example) by Ferenc Huszar 18 | ["Causal Inference 3: Counterfactuals"](https://inference.vc/causal-inference-3-counterfactuals) by Ferenc Huszar 19 | 20 | ["Causal Data Science"](https://medium.com/@akelleh/causal-data-science-721ed63a4027) by Adam Kelleher: 21 | - ["If Correlation Doesn’t Imply Causation, Then What Does?"](https://medium.com/@akelleh/if-correlation-doesnt-imply-causation-then-what-does-c74f20d26438) 22 | - ["Understanding Bias: A Prerequisite For Trustworthy Results"](https://medium.com/@akelleh/understanding-bias-a-pre-requisite-for-trustworthy-results-ee590b75b1be) 23 | - ["Speed vs. Accuracy: When Is Correlation Enough? When Do You Need Causation?"](https://medium.com/@akelleh/speed-vs-accuracy-when-is-correlation-enough-when-do-you-need-causation-708c8ca93753) 24 | - ["A Technical Primer on Causality"](https://medium.com/@akelleh/a-technical-primer-on-causality-181db2575e41) 25 | - ["The Data Processing Inequality"](https://medium.com/@akelleh/the-data-processing-inequality-da242b40800b) 26 | - ["Causal Graph Inference"](https://medium.com/@akelleh/causal-graph-inference-b3e3afd47110) 27 | 28 | ["If Correlation Doesn’t Imply Causation, then What Does?"](http://michaelnielsen.org/ddi/if-correlation-doesnt-imply-causation-then-what-does) by Michael Nielsen 29 | 30 | ["Latent Variables and Model Mis-specification"](https://jsteinhardt.wordpress.com/2017/01/10/latent-variables-and-model-mis-specification/) by Jacob Steinhardt 31 | 32 | ["Causality in Machine Learning"](http://unofficialgoogledatascience.com/2017/01/causality-in-machine-learning.html) by Muralidharan et al. 33 | 34 | ---- 35 | 36 | ["The Seven Tools of Causal Inference with Reflections on Machine Learning"](https://dl.acm.org/citation.cfm?id=3241036) by Judea Pearl `paper` ([talk](https://youtube.com/watch?v=nWaM6XmQEmU) `video`) 37 | ["Theoretical Impediments to Machine Learning"](http://web.cs.ucla.edu/~kaoru/theoretical-impediments.pdf) by Judea Pearl `paper` 38 | 39 | ["Causality for Machine Learning"](https://arxiv.org/abs/1911.10500) by Bernhard Scholkopf `paper` 40 | ["Towards Causal Representation Learning"](https://arxiv.org/abs/2102.11107) by Scholkopf et al. 
`paper` 41 | 42 | ["On Pearl’s Hierarchy and the Foundations of Causal Inference"](https://causalai.net/r60.pdf) by Bareinboim, Correa, Ibeling, Icard `paper` ([talk](https://youtube.com/watch?v=fNuMHDrh6AY) `video`) 43 | 44 | ["Causality"](http://www.homepages.ucl.ac.uk/~ucgtrbd/papers/causality.pdf) by Ricardo Silva `paper` 45 | 46 | ["Introduction to Causal Inference"](http://jmlr.org/papers/volume11/spirtes10a/spirtes10a.pdf) by Peter Spirtes `paper` 47 | 48 | ["Graphical Causal Models"](http://www.stat.cmu.edu/~cshalizi/uADA/12/lectures/ch22.pdf) by Cosma Shalizi `paper` 49 | 50 | ---- 51 | 52 | ["The Book of Why: The New Science of Cause and Effect"](https://amazon.com/Book-Why-Science-Cause-Effect/dp/046509760X) by Judea Pearl and Dana Mackenzie `book` ([overview](http://bayes.cs.ucla.edu/WHY/why-intro.pdf)) 53 | ["Causal Inference in Statistics: A Primer"](https://books.google.co.uk/books/about/Causal_Inference_in_Statistics.html?id=IqCECwAAQBAJ) by Judea Pearl, Madelyn Glymour, Nicholas Jewell `book` 54 | ["Causality: Models, Reasoning, and Inference"](https://dropbox.com/s/m2m1935e6tohii9/Pearl%20-%20Causality%3A%20Models%2C%20Reasoning%2C%20and%20Inference.pdf) by Judea Pearl `book` ([epilogue](http://bayes.cs.ucla.edu/BOOK-2K/causality2-epilogue.pdf)) 55 | ["Elements of Causal Inference"](https://mitpress.mit.edu/books/elements-causal-inference) by Jonas Peters, Dominik Janzing, Bernhard Scholkopf `book` 56 | ["Causal Inference Book"](https://www.hsph.harvard.edu/miguel-hernan/causal-inference-book/) by Miguel Hernan and James Robins `book` 57 | 58 | ---- 59 | 60 | [tutorial](https://youtube.com/watch?v=CTcQlRSnvvM) by Bernhard Scholkopf `video` 61 | [tutorial](https://youtube.com/watch?v=zvrcyqcN9Wo) by Jonas Peters `video` 62 | [tutorial](https://youtube.com/watch?v=_wFagI5Fn9I) by Jonas Peters `video` 63 | 64 | [course](https://youtube.com/channel/UCbOJ2eEdvf2wOPrAmA72Gzg) by Brady Neal `video` 65 | 66 | ["Causal Inference in Everyday Machine Learning"](https://youtube.com/watch?v=HOgx_SBBzn0) tutorial by Ferenc Huszar `video` 67 | ["Causal Inference in Online Systems: Methods, Pitfalls and Best Practices"](https://mediasite.kellogg.northwestern.edu/Mediasite/Play/8e78dc83c6fb4d20abeeb18028a8f7071d?catalog=1533bdef-0c88-4513-ad97-5fce50c92e62) tutorial by Amit Sharma `video` ([slides](https://github.com/amit-sharma/causal-inference-tutorial)) 68 | ["Counterfactual Evaluation and Learning for Search, Recommendation and Ad Placement"](http://www.cs.cornell.edu/~adith/CfactSIGIR2016/) tutorial by Thorsten Joachims and Adith Swaminathan `video` 69 | ["Counterfactual Reasoning and Massive Data Sets"](https://youtube.com/watch?v=s37cIYDOM6s) by Leon Bottou `video` 70 | ["Counterfactual Inference"](https://facebook.com/nipsfoundation/videos/1291139774361116) tutorial by Susan Athey `video` 71 | ["Causal Inference for Observational Studies"](http://techtalks.tv/talks/causal-inference-for-observational-studies/62355/) tutorial by David Sontag and Uri Shalit `video` ([slides](https://cs.nyu.edu/~shalit/slides.pdf)) 72 | ["Connections between Causality and Machine Learning"](https://youtube.com/watch?v=9pm0eXuiTZs) by Jonas Peters `video` 73 | 74 | ---- 75 | 76 | ["Science vs Data: Contesting the Soul of Data Science"](https://youtube.com/watch?v=X_1MG4ViVGM) by Judea Pearl `video` 77 | ["The Foundations of Causal Inference with Reflections on Machine Learning and Artificial Intelligence"](https://youtube.com/watch?v=nWaM6XmQEmU) by Judea Pearl `video` 78 | ["The New Science of Cause 
and Effect"](https://youtube.com/watch?v=ZaPV1OSEpHw) by Judea Pearl `video` 79 | ["The Mathematics of Causal Inference with Reflections on Machine Learning"](https://youtube.com/watch?v=bcRl7sXR1hE) by Judea Pearl `video` 80 | ["The Mathematics of Causal Inference, with Reflections on Machine Learning and the Logic of Science"](https://youtube.com/watch?v=zHjdd--W6o4) by Judea Pearl `video` 81 | 82 | ["On the Causal Foundations of AI (Explainability & Decision-Making)"](https://youtube.com/watch?v=fNuMHDrh6AY) by Elias Bareinboim `video` 83 | ["Causal Data Science: A General Framework for Data Fusion and Causal Inference"](https://youtube.com/watch?v=dUsokjG4DHc) by Elias Bareinboim `video` 84 | "Towards Causal Reinforcement Learning" ([[1]](https://youtube.com/watch?v=QRTgLWfFBMM), [[2]](https://youtube.com/watch?v=2hGvd_9ho6s)) by Elias Bareinboim `video` 85 | ["Causal Reinforcement Learning"](https://youtube.com/watch?v=bwz3NpVfz6k) by Elias Bareinboim `video` 86 | 87 | ["Learning Causal Mechanisms"](https://facebook.com/iclr.cc/videos/2123421684353553?t=294) by Bernhard Scholkopf `video` 88 | ["The Role of Causality for Interpretability"](https://vimeo.com/252188186) by Bernhard Scholkopf `video` 89 | ["Causal Learning"](https://vimeo.com/238274659#t=13m22s) by Bernhard Scholkopf `video` 90 | ["Toward Causal Machine Learning"](https://youtube.com/watch?v=ooeRlw3U2zU) by Bernhard Scholkopf `video` 91 | ["Statistical and Causal Approaches to Machine Learning"](https://youtu.be/ek9jwRA2Jio?t=26m) by Bernhard Scholkopf `video` 92 | 93 | ["The Missing Signal"](https://youtube.com/watch?v=DfJeaa--xO0) by Leon Bottou `video` 94 | ["Learning Representations Using Causal Invariance"](https://facebook.com/722677142/posts/10155953319752143?t=714) by Leon Bottou `video` 95 | 96 | ---- 97 | 98 | [workshop](https://sites.google.com/view/nips2018causallearning) at NeurIPS 2018 ([videos](https://youtube.com/playlist?list=PLJscN9YDD1bu1dCKuXSV1qYmicx3g9t7A)) 99 | [symposium](https://why19.causalai.net) at AAAI 2019 100 | 101 | 102 | 103 | --- 104 | ### theory 105 | 106 | Causal inference is a problem of uncovering cause-effect relations between variables of data generating system. Causal structures provide understanding about how the system will behave under changing and unseen environments. Knowledge about these causal dynamics allows to answer "what if" questions, describing potential responses of the system under hypothetical manipulations and interventions. 107 | 108 | What if some railways are closed, what will passengers do? What if we incentivize members of a social network to propagate an idea, how influential can they be? What if some genes in a cell are knocked-out, which phenotypes can we expect? Such questions need to be addressed via a combination of experimental and observational data, and require a careful approach to modelling heterogeneous datasets and structural assumptions concerning the causal relations among components of the system. 109 | 110 | Causal model is a set of assumptions about the data generating process, which cannot be expressed as properties of the joint distribution of observed variables. 111 | 112 | ---- 113 | 114 | "In retrospect, my greatest challenge was to break away from probabilistic thinking and accept, first, that people are not probability thinkers but cause-effect thinkers and, second, that causal thinking cannot be captured in the language of probability; it requires a formal language of its own." 
115 | 116 | "What is more likely, that a daughter will have blue eyes given that her mother has blue eyes or the other way around — that the mother will have blue eyes given that the daughter has blue eyes? Most people will say the former — they'll prefer the causal direction. But it turns out the two probabilities are the same, because the number of blue-eyed people in every generation remains stable. I took it as evidence that people think causally, not probabilistically — they're biased by having easy access to causal explanations, even though probability theory tells you something different. 117 | There are many biases in our judgment that are created by our inclination to attribute causal relationships where they do not belong. We see the world as a collection of causal relationships and not as a collection of statistical or associative relationships. Most of the time, we can get by, because they are closely tied together. Once in a while we fail. The blue-eye story is an example of such failure. 118 | The slogan, "Correlation doesn't imply causation" leads to many paradoxes. For instance, the size of a child's thumb is highly correlated with their reading ability. So, naively, if you want to be taller, you should learn to read better. This kind of paradoxical example convinces us that correlation does not imply causation. Still, people fall into that trap quite often because they crave causal explanations. The mind is a causal processor, not an association processor. Once you acknowledge that, the question remains how we reconcile the discrepancies between the two. How do we organize causal relationships in our mind? How do we operate on and update such a mental presentation?" 119 | 120 | "I now take causal relations as the fundamental building block that of physical reality and of human understanding of that reality, and I regard probabilistic relationships as but the surface phenomena of the causal machinery that underlies and propels our understanding of our world." 121 | 122 | *(Judea Pearl)* 123 | 124 | ---- 125 | 126 | "If we examine the information that drives machine learning today, we find that it is almost entirely statistical. In other words, learning machines improve their performance by optimizing parameters over a stream of sensory inputs received from the environment. It is a slow process, analogous in many respects to the evolutionary survival-of-the-fittest process that explains how species like eagles and snakes have developed superb vision systems over millions of years. It cannot explain however the super-evolutionary process that enabled humans to build eyeglasses and telescopes over barely one thousand years. What humans possessed that other species lacked was a mental representation, a blue-print of their environment which they could manipulate at will to imagine alternative hypothetical environments for planning and learning. Anthropologists like N. Harari, and S. Mithen are in general agreement that the decisive ingredient that gave our homo sapiens ancestors the ability to achieve global dominion, about 40,000 years ago, was their ability to sketch and store a representation of their environment, interrogate that representation, distort it by mental acts of imagination and finally answer “What if?” kind of questions. Examples are interventional questions: “What if I act?” and retrospective or explanatory questions: “What if I had acted differently?” No learning machine in operation today can answer such questions about actions not taken before. 
Moreover, most learning machines today do not utilize a representation from which such questions can be answered. We postulate that the major impediment to achieving accelerated learning speeds as well as human level performance can be overcome by removing these barriers and equipping learning machines with causal reasoning tools. This postulate would have been speculative twenty years ago, prior to the mathematization of counterfactuals. Not so today. Advances in graphical and structural models have made counterfactuals computationally manageable and thus rendered metastatistical learning worthy of serious exploration." 127 | 128 | "An extremely useful insight unveiled by the logic of causal reasoning is the existence of a sharp classification of causal information, in terms of the kind of questions that each class is capable of answering. The classification forms a 3-level hierarchy in the sense that questions at one level can only be answered if information from that level or higher levels is available." 129 | 130 | - association P(y|x) - seeing (what is?) 131 | 132 | How would seeing X change my belief in Y? 133 | What does a symptom tell me about a disease? 134 | 135 | - intervention P(y|do(x),z) - doing (what if?) 136 | 137 | What if I do X? 138 | What if I take aspirin, will my headache be cured? 139 | What if we ban cigarettes? 140 | 141 | - counterfactuals P(y_x|x',y') - imagining, retrospection (why?) 142 | 143 | Was it X that caused Y? 144 | What if I had acted differently? 145 | Was it the aspirin that stopped my headache? 146 | What if I had not been smoking the past 2 years? 147 | 148 | "The first level, Association, invokes purely statistical relationships, defined by the naked data. For instance, observing a customer who buys toothpaste makes it more likely that he/she buys floss; such association can be inferred directly from the observed data using conditional expectation. Questions at this layer, because they require no causal information, are placed at the bottom level on the hierarchy. 149 | The second level, Intervention, ranks higher than Association because it involves not just seeing what is, but changing what we see. A typical question at this level would be: What happens if we double the price? Such questions cannot be answered from sales data alone, because they involve a change in customers' behavior, in reaction to the new pricing. These choices may differ substantially from those taken in previous price-raising situations, unless we replicate precisely the market conditions that existed when the price reached double its current value. 150 | The third level, Counterfactuals, is placed at the top of the hierarchy because they subsume interventional and associational questions. A typical question in the counterfactual category is “What if I had acted differently” thus necessitating retrospective reasoning. 151 | If we have a model that can answer counterfactual queries, we can also answer questions about interventions and observations. For example, the interventional question “What will happen if we double the price?” can be answered by asking the counterfactual question: “What would happen had the price been twice its current value?” Likewise, associational questions can be answered once we can answer interventional questions; we simply ignore the action part and let observations take over. 152 | The translation does not work in the opposite direction. Interventional questions cannot be answered from purely observational information (i.e., from statistical data alone). 
No counterfactual question involving retrospection can be answered from purely interventional information, such as that acquired from controlled experiments; we cannot re-run an experiment on subjects who were treated with a drug and see how they behave had they not been given the drug." 153 | 154 | [*(Judea Pearl)*](http://web.cs.ucla.edu/~kaoru/theoretical-impediments.pdf) 155 | 156 | ---- 157 | 158 | tuple (d1, d2, d3, d4) - (population, observational/experimental, sampling selection, measured variables) 159 | 160 | (Los Angeles, experimental with randomized Z1, selection on Age, (X1, Z1, W, M, Y1)) 161 | (New York, observational, selection on SES, (X1, X2, Z1, N, Y2)) 162 | (Texas, experimental with randomized Z2, (X2, Z1, W, L, M, Y1)) 163 | 164 | *statistics - descriptive*: 165 | (d1, samples(observations), d3, d4) -> (d1, distribution(observations), d3, d4) *(Bernoulli, Poisson, Kolmogorov)* 166 | *statistics - experimental*: 167 | (d1, samples(do(X)), d3, d4) -> (d1, distribution(do(X)), d3, d4) *(Fisher, Cox, Goodman)* 168 | *causal inference from observational studies*: 169 | (d1, distribution(observations), d3, d4) -> (d1, distribution(do(X)), d3, d4) *(Rubin, Robins, Dawid, Pearl)* 170 | *experimental inference (generalized instrumental variables)*: 171 | (d1, distribution(do(Z)), d3, d4) -> (d1, distribution(do(X)), d3, d4) *(P. Wright, S. Wright)* 172 | *sampling selection bias*: 173 | (d1, d2, select(Age), d4) -> (d1, d2, {}, d4) *(Heckman)* 174 | *transportability (external validity)*: 175 | (bonobos, d2, d3, d4) -> (humans, d2, d3, d4) *(Shadish, Cook, Campbell)* 176 | 177 | [*(Elias Bareinboim)*](https://youtu.be/dUsokjG4DHc?t=8m13s) 178 | 179 | ---- 180 | 181 | "Under the probabilistic interpretation of causation from Pearl, the causal structure underlying a set of random variables X=(X1, ..., Xd), with joint distribution P, is often described in terms of a Directed Acyclic Graph, denoted by G = (V, E). In this graph, each vertex Vi ∈ V is associated with the random variable Xi ∈ X, and an edge Eji ∈ E from Vj to Vi denotes the causal relationship “Xi ← Xj”. More specifically, these causal relationships are defined by a structural equation model: each Xi ← fi(Pa(Xi), Ni), where fi is a function, Pa(Xi) is the parental set of Vi ∈ V, and Ni is some independent noise variable. Then, causal inference is the task of recovering G from S ∼ P^n." 182 | 183 | "The causal graph and the intervention types and targets may be (partially) unknown. This is a realistic setting in many practical applications. For example, in biology, many interventions that can be performed on organisms are known to result in measurable downstream effects, but the exact mechanism and direct intervention targets are unknown, and therefore it is not clear whether the knowledge gained may be transferred to other species. In pharmaceutical research, it is desirable to target the root causes of illness directly and minimize side-effects; however, as the causal mechanisms are often poorly understood, it is unclear what exactly a drug is doing and whether the results of a particular study on a subpopulation of patients (say, middle-aged males in the US) will generalize to other subpopulations (e.g., elderly women with dementia). In policy decisions, changing tax rules may have different repercussions for different socio-economic classes, but the exact workings of an economy can only be modeled to a certain extent. 
Machine learning may help to make such predictions more data-driven, but should then correctly take into account the transfer of distributions that result from interventions and context changes. For prediction in IID setting, imitating the exterior of a process is enough (i.e. can disregard causal structure). Anything else can benefit from causal learning." 184 | 185 | 186 | 187 | --- 188 | ### interesting papers 189 | 190 | [recent papers](http://deeplearningpatterns.com/doku.php?id=causal_analysis) 191 | 192 | 193 | 194 | ---- 195 | #### ["The Seven Tools of Causal Inference with Reflections on Machine Learning"](https://dl.acm.org/citation.cfm?id=3241036) Pearl 196 | > "The dramatic success in machine learning has led to an explosion of artificial intelligence applications and increasing expectations for autonomous systems that exhibit human-level intelligence. These expectations have, however, met with fundamental obstacles that cut across many application areas. One such obstacle is adaptability, or robustness. Machine learning researchers have noted current systems lack the ability to recognize or react to new circumstances they have not been specifically programmed or trained for." 197 | 198 | - `video` (Pearl) 199 | 200 | 201 | #### ["Causality for Machine Learning"](https://arxiv.org/abs/1911.10500) Scholkopf 202 | > "Graphical causal inference as pioneered by Judea Pearl arose from research on artificial intelligence, and for a long time had little connection to the field of machine learning. This article discusses where links have been and should be established, introducing key concepts along the way. It argues that the hard open problems of machine learning and AI are intrinsically related to causality, and explains how the field is beginning to understand them." 203 | 204 | 205 | #### ["Causal Inference and the Data-fusion Problem"](https://pnas.org/content/113/27/7345) Bareinboim, Pearl 206 | > "We review concepts, principles, and tools that unify current approaches to causal analysis and attend to new challenges presented by big data. In particular, we address the problem of data fusion - piecing together multiple datasets collected under heterogeneous conditions (i.e., different populations, regimes, and sampling methods) to obtain valid answers to queries of interest. The availability of multiple heterogeneous datasets presents new opportunities to big data analysts, because the knowledge that can be acquired from combined data would not be possible from any individual source alone. However, the biases that emerge in heterogeneous environments require new analytical tools. Some of these biases, including confounding, sampling selection, and cross-population biases, have been addressed in isolation, largely in restricted parametric models. We here present a general, nonparametric framework for handling these biases and, ultimately, a theoretical solution to the problem of data fusion in causal inference tasks." 207 | 208 | - `video` (Bareinboim) 209 | - `video` (Bareinboim) 210 | 211 | 212 | #### ["On Causal and Anticausal Learning"](https://arxiv.org/abs/1206.6471) Schoelkopf et al. 213 | `ICML 2012` 214 | > "We consider the problem of function estimation in the case where an underlying causal model can be inferred. This has implications for popular scenarios such as covariate shift, concept drift, transfer learning and semi-supervised learning. We argue that causal knowledge may facilitate some approaches for a given problem, and rule out others. 
In particular, we formulate a hypothesis for when semi-supervised learning can help, and corroborate it with empirical results." 215 | 216 | - `video` (Lipton) 217 | 218 | 219 | #### ["Counterfactual Reasoning and Learning Systems: The Example of Computational Advertising"](https://arxiv.org/abs/1209.2355) Bottou et al. 220 | > "This work shows how to leverage causal inference to understand the behavior of complex learning systems interacting with their environment and predict the consequences of changes to the system. Such predictions allow both humans and algorithms to select the changes that would have improved the system performance. This work is illustrated by experiments on the ad placement system associated with the Bing search engine." 221 | 222 | - `video` (Bottou) 223 | - `video` (Bottou) 224 | - `video` (Bottou) 225 | - `video` (Huszar) 226 | 227 | 228 | #### ["Causal Bootstrapping"](https://arxiv.org/abs/1910.09648) Little, Badawy 229 | > "To draw scientifically meaningful conclusions and build reliable engineering models of quantitative phenomena, statistical models must take cause and effect into consideration (either implicitly or explicitly). This is particularly challenging when the relevant measurements are not obtained from controlled experimental (interventional) settings, so that cause and effect can be obscured by spurious, indirect influences. Modern predictive techniques from machine learning are capable of capturing high-dimensional, complex, nonlinear relationships between variables while relying on few parametric or probabilistic modelling assumptions. However, since these techniques are associational, applied to observational data they are prone to picking up spurious influences from non-experimental (observational) data, making their predictions unreliable. Techniques from causal inference, such as probabilistic causal diagrams and do-calculus, provide powerful (nonparametric) tools for drawing causal inferences from such observational data. However, these techniques are often incompatible with modern, nonparametric machine learning algorithms since they typically require explicit probabilistic models. Here, we develop causal bootstrapping, a set of techniques for augmenting classical nonparametric bootstrap resampling with information about the causal relationship between variables. This makes it possible to resample observational data such that, if it is possible to identify an interventional relationship from that data, new data representing that relationship can be simulated from the original observational data. In this way, we can use modern machine learning algorithms unaltered to make statistically powerful, yet causally-robust, predictions. We develop several causal bootstrapping algorithms for drawing interventional inferences from observational data, for classification and regression problems, and demonstrate, using synthetic and real-world examples, the value of this approach." 230 | 231 | 232 | #### ["Discovering Causal Signals in Images"](https://arxiv.org/abs/1605.08179) Lopez-Paz, Nishihara, Chintala, Scholkopf, Bottou 233 | > "This paper establishes the existence of observable footprints that reveal the "causal dispositions" of the object categories appearing in collections of images. We achieve this goal in two steps. 
First, we take a learning approach to observational causal discovery, and build a classifier that achieves state-of-the-art performance on finding the causal direction between pairs of random variables, given samples from their joint distribution. Second, we use our causal direction classifier to effectively distinguish between features of objects and features of their contexts in collections of static images. Our experiments demonstrate the existence of a relation between the direction of causality and the difference between objects and their contexts, and by the same token, the existence of observable signals that reveal the causal dispositions of objects." 234 | 235 | > "First, we take a learning approach to observational causal inference, and build a classifier that achieves state-of-the-art performance on finding the causal direction between pairs of random variables, when given samples from their joint distribution. Second, we use our causal direction finder to effectively distinguish between features of objects and features of their contexts in collections of static images. Our experiments demonstrate the existence of (1) a relation between the direction of causality and the difference between objects and their contexts, and (2) observable causal signals in collections of static images." 236 | 237 | > "Causal features are those that cause the presence of the object of interest in the image (that is, those features that cause the object’s class label), while anticausal features are those caused by the presence of the object in the image (that is, those features caused by the class label)." 238 | 239 | > "Paper aims to verify experimentally that the higher-order statistics of image datasets can inform about causal relations. Authors conjecture that object features and anticausal features are closely related and vice-versa context features and causal features are not necessarily related. Context features give the background while object features are what it would be usually inside bounding boxes in an image dataset." 240 | 241 | > "Better algorithms for causal direction should, in principle, help learning features that generalize better when the data distribution changes. Causality should help with building more robust features by awareness of the generating process of the data." 242 | 243 | - `video` (Bottou) 244 | - `post` 245 | - `notes` 246 | 247 | 248 | #### ["Learning Representations for Counterfactual Inference"](http://arxiv.org/abs/1605.03661) Johansson, Shalit, Sontag 249 | > "Observational studies are rising in importance due to the widespread accumulation of data in fields such as healthcare, education, employment and ecology. We consider the task of answering counterfactual questions such as, "Would this patient have lower blood sugar had she received a different medication?". We propose a new algorithmic framework for counterfactual inference which brings together ideas from domain adaptation and representation learning. In addition to a theoretical justification, we perform an empirical comparison with previous approaches to causal inference from observational data. Our deep learning algorithm significantly outperforms the previous state-of-the-art." 250 | 251 | > "In this paper we focus on counterfactual inference, which is a widely applicable special case of causal inference. We cast counterfactual inference as a type of domain adaptation problem, and derive a novel way of learning representations suited for this problem. 
Our models rely on a novel type of regularization criteria: learning balanced representations, representations which have similar distributions among the treated and untreated populations. We show that trading off a balancing criterion with standard data fitting and regularization terms is both practically and theoretically prudent. Open questions which remain are how to generalize this method for cases where more than one treatment is in question, deriving better optimization algorithms and using richer discrepancy measures." 252 | 253 | - `video` (Johansson) 254 | - `video` (Shalit) 255 | - `notes` 256 | - `code` 257 | 258 | 259 | #### ["Causal Effect Inference with Deep Latent-Variable Models"](https://arxiv.org/abs/1705.08821) Louizos, Shalit, Mooij, Sontag, Zemel, Welling 260 | > "Learning individual-level causal effects from observational data, such as inferring the most effective medication for a specific patient, is a problem of growing importance for policy makers. The most important aspect of inferring causal effects from observational data is the handling of confounders, factors that affect both an intervention and its outcome. A carefully designed observational study attempts to measure all important confounders. However, even if one does not have direct access to all confounders, there may exist noisy and uncertain measurement of proxies for confounders. We build on recent advances in latent variable modeling to simultaneously estimate the unknown latent space summarizing the confounders and the causal effect. Our method is based on Variational Autoencoders which follow the causal structure of inference with proxies. We show our method is significantly more robust than existing methods, and matches the state-of-the-art on previous benchmarks focused on individual treatment effects." 261 | 262 | - `code` 263 | 264 | 265 | #### ["Implicit Causal Models for Genome-wide Association Studies"](https://arxiv.org/abs/1710.10742) Tran, Blei 266 | > "Progress in probabilistic generative models has accelerated, developing richer models with neural architectures, implicit densities, and with scalable algorithms for their Bayesian inference. However, there has been limited progress in models that capture causal relationships, for example, how individual genetic factors cause major human diseases. In this work, we focus on two challenges in particular: How do we build richer causal models, which can capture highly nonlinear relationships and interactions between multiple causes? How do we adjust for latent confounders, which are variables influencing both cause and effect and which prevent learning of causal relationships? To address these challenges, we synthesize ideas from causality and modern probabilistic modeling. For the first, we describe implicit causal models, a class of causal models that leverages neural architectures with an implicit density. For the second, we describe an implicit causal model that adjusts for confounders by sharing strength across examples. In experiments, we scale Bayesian inference on up to a billion genetic measurements. We achieve state of the art accuracy for identifying causal factors: we significantly outperform existing genetics methods by an absolute difference of 15-45.3%." 
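A minimal sketch of the general idea of an implicit causal model (not the paper's GWAS model): a structural causal model whose mechanisms are neural networks fed with noise, so that each conditional density is defined only implicitly by a sampling procedure. The z → x → y structure, the tiny networks and all dimensions below are made-up illustrations.

```python
# toy implicit SCM: every variable is produced by a small network of its
# parents plus fresh noise, so the joint density p(z, x, y) has no closed form
# but is trivial to sample from (and to fit with likelihood-free inference)
import numpy as np

rng = np.random.default_rng(0)

def mlp(inp, w1, w2):
    return np.tanh(inp @ w1) @ w2                      # tiny two-layer network

d_noise, d_hidden = 2, 8
w = {k: rng.normal(size=s) for k, s in {
    "z1": (d_noise, d_hidden), "z2": (d_hidden, 1),      # confounder mechanism
    "x1": (1 + d_noise, d_hidden), "x2": (d_hidden, 1),  # cause mechanism
    "y1": (2 + d_noise, d_hidden), "y2": (d_hidden, 1),  # effect mechanism
}.items()}

def sample(n):
    # ancestral sampling: z -> x -> y, fresh noise injected at every node
    z = mlp(rng.normal(size=(n, d_noise)), w["z1"], w["z2"])
    x = mlp(np.hstack([z, rng.normal(size=(n, d_noise))]), w["x1"], w["x2"])
    y = mlp(np.hstack([x, z, rng.normal(size=(n, d_noise))]), w["y1"], w["y2"])
    return z, x, y

z, x, y = sample(10000)   # draws from the implicit joint distribution
```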
267 | 268 | - `video` (Tran) 269 | - `video` (Tran) 270 | - `slides` 271 | - `post` 272 | 273 | 274 | #### ["Learning Functional Causal Models with Generative Neural Networks"](https://arxiv.org/abs/1709.05321) Goudet, Kalainathan, Caillou, Lopez-Paz, Guyon, Sebag, Tritas, Tubaro 275 | `CGNN` 276 | > "We introduce a new approach to functional causal modeling from observational data. The approach, called Causal Generative Neural Networks, leverages the power of neural networks to learn a generative model of the joint distribution of the observed variables, by minimizing the Maximum Mean Discrepancy between generated and observed data. An approximate learning criterion is proposed to scale the computational cost of the approach to linear complexity in the number of observations. The performance of CGNN is studied throughout three experiments. First, we apply CGNN to the problem of cause-effect inference, where two CGNNs model P(Y|X,noise) and P(X|Y,noise) identify the best causal hypothesis out of X → Y and Y → X. Second, CGNN is applied to the problem of identifying v-structures and conditional independences. Third, we apply CGNN to problem of multivariate functional causal modeling: given a skeleton describing the dependences in a set of random variables {X1,…,Xd}, CGNN orients the edges in the skeleton to uncover the directed acyclic causal graph describing the causal structure of the random variables. On all three tasks, CGNN is extensively assessed on both artificial and real-world data, comparing favorably to the state-of-the-art. Finally, we extend CGNN to handle the case of confounders, where latent variables are involved in the overall causal model." 277 | 278 | - `video` (Goudet) 279 | - `code` 280 | - `paper` ["Causal Generative Neural Networks"](https://arxiv.org/abs/1711.08936) by Goudet et al. 281 | 282 | 283 | #### ["SAM: Structural Agnostic Model, Causal Discovery and Penalized Adversarial Learning"](https://arxiv.org/abs/1803.04929) Kalainathan, Goudet, Guyon, Lopez-Paz, Sebag 284 | > "We present the Structural Agnostic Model, a framework to estimate end-to-end non-acyclic causal graphs from observational data. In a nutshell, SAM implements an adversarial game in which a separate model generates each variable, given real values from all others. In tandem, a discriminator attempts to distinguish between the joint distributions of real and generated samples. Finally, a sparsity penalty forces each generator to consider only a small subset of the variables, yielding a sparse causal graph. SAM scales easily to hundreds variables. Our experiments show the state-of-the-art performance of SAM on discovering causal structures and modeling interventions, in both acyclic and non-acyclic graphs." 285 | 286 | 287 | #### ["Woulda, Coulda, Shoulda: Counterfactually-Guided Policy Search"](https://arxiv.org/abs/1811.06272) Buesing, Weber, Zwols, Racaniere, Guez, Lespiau, Heess 288 | `CF-GPS` `counterfactual inference` `ICLR 2019` 289 | > Learning policies on data synthesized by models can in principle quench the thirst of reinforcement learning algorithms for large amounts of real experience, which is often costly to acquire. However, simulating plausible experience de novo is a hard problem for many complex environments, often resulting in biases for model-based policy evaluation and search. Instead of de novo synthesis of data, here we assume logged, real experience and model alternative outcomes of this experience under counterfactual actions, i.e. actions that were not actually taken. 
Based on this, we propose the Counterfactually-Guided Policy Search algorithm for learning policies in POMDPs from off-policy experience. It leverages structural causal models for counterfactual evaluation of arbitrary policies on individual off-policy episodes. CF-GPS can improve on vanilla model-based RL algorithms by making use of available logged data to de-bias model predictions. In contrast to off-policy algorithms based on Importance Sampling which re-weight data, CF-GPS leverages a model to explicitly consider alternative outcomes, allowing the algorithm to make better use of experience data. We find empirically that these advantages translate into improved policy evaluation and search results on a non-trivial grid-world task. Finally, we show that CF-GPS generalizes the previously proposed Guided Policy Search and that reparameterization-based algorithms such Stochastic Value Gradient can be interpreted as counterfactual methods." 290 | 291 | > "Instead of relying on data synthesized from scratch by a model, we train policies on model predictions of alternate outcomes of past experience from the true environment under counterfactual actions, i.e. actions that had not actually been taken, while everything else remaining the same. At the heart of CF-GPS are structural causal models which model the environment with two ingredients: 1) Independent random variables, called scenarios here, summarize all aspects of the environment that cannot be influenced by the agent. 2) Deterministic transition functions (also called causal mechanisms) take these scenarios, together with the agent’s actions, as input and produce the predicted outcome. The central idea of CF-GPS is that, instead of running an agent on scenarios sampled de novo from a model, we infer scenarios in hindsight from given off-policy data, and then evaluate and improve the agent on these specific scenarios using given or learned causal mechanisms." 292 | 293 | > "We show that CF-GPS generalizes and empirically improves on a vanilla model-based RL algorithm, by mitigating model mismatch via “grounding” or “anchoring” model-based predictions in inferred scenarios. As a result, this approach explicitly allows to trade-off historical data for model bias. CF-GPS differs substantially from standard off-policy RL algorithms based on Importance Sampling, where historical data is re-weighted with respect to the importance weights to evaluate or learn new policies. In contrast, CF-GPS explicitly reasons counterfactually about given off-policy data." 294 | 295 | > "We formulate model-based RL in POMDPs in terms of structural causal models, thereby connecting concepts from reinforcement learning and causal inference." 296 | > "We provide the first results, to the best of our knowledge, showing that counterfactual reasoning in structural causal models on off-policy data can facilitate solving non-trivial RL tasks." 297 | > "We show that two previously proposed classes of RL algorithms, namely Guided Policy Search and Stochastic Value Gradient methods can be interpreted as counterfactual methods, opening up possible generalizations." 298 | 299 | > "Simulating plausible synthetic experience de novo is a hard problem for many environments, often resulting in biases for model-based RL algorithms. The main takeaway from this work is that we can improve policy learning by evaluating counterfactual actions in concrete, past scenarios. Compared to only considering synthetic scenarios, this procedure mitigates model bias." 
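A minimal abduction-action-prediction sketch of the counterfactual replay described above, not the paper's POMDP implementation: the additive-noise mechanism, the two actions and all numbers are assumptions made for illustration.

```python
# counterfactual evaluation on logged data: infer the noise ("scenario") that
# explains each logged outcome, then replay the same scenario under the action
# that was not taken
import numpy as np

rng = np.random.default_rng(1)

def mechanism(action, u):
    # assumed-known deterministic mechanism: outcome = f(action, scenario)
    return 2.0 * action + u

# logged off-policy experience: actions taken and outcomes observed
logged_a = rng.integers(0, 2, size=5).astype(float)
true_u = rng.normal(size=5)                 # scenarios (never observed directly)
logged_y = mechanism(logged_a, true_u)

inferred_u = logged_y - 2.0 * logged_a      # 1. abduction: invert the mechanism
cf_a = 1.0 - logged_a                       # 2. action: the action not taken
cf_y = mechanism(cf_a, inferred_u)          # 3. prediction: replay the scenario

print(cf_y.mean() - logged_y.mean())        # counterfactual policy advantage
```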
300 | 301 | > "We assumed that there are no additional hidden confounders in the environment and that the main challenge in modelling the environment is capturing the distribution of the noise sources p(U), whereas we assumed that the transition and reward kernels given the noise is easy to model. This seems a reasonable assumption in some environments, such as the partially observed grid-world considered here, but not all. Probably the most restrictive assumption is that we require the inference over the noise U given data hT to be sufficiently accurate. We showed in our example, that we could learn a parametric model of this distribution from privileged information, i.e. from joint samples u, hT from the true environment. However, imperfect inference over the scenario U could result e.g. in wrongly attributing a negative outcome to the agent’s actions, instead environment factors. This could in turn result in too optimistic predictions for counterfactual actions. Future research is needed to investigate if learning a sufficiently strong SCM is possible without privileged information for interesting RL domains. If, however, we can trust the transition and reward kernels of the model, we can substantially improve model-based RL methods by counterfactual reasoning on off-policy data, as demonstrated in our experiments and by the success of Guided Policy Search and Stochastic Value Gradient methods." 302 | 303 | ---- 304 | > "The proposed approach here is general but only instantiated (in terms of inference algorithms and experiments) for when the initial starting state is unknown in a deterministic POMDP environment, where the dynamics and reward model is known. The authors show that they can use inference over the full trajectory (or some multi-time-step subpart) to get a (often delta function) posterior over the initial starting state, which then allows them to build a more accurate initial state distribution for use in their model simulations than approaches that do not use more than 1 step to do so. This is interesting, but it’s not quite clear where this sort of situation would arise in practice, and the proposed experimental results are limited to one simulated toy domain." 305 | 306 | 307 | #### ["Causal Reasoning from Meta-reinforcement Learning"](https://arxiv.org/abs/1901.08162) Dasgupta et al. 308 | > "Discovering and exploiting the causal structure in the environment is a crucial challenge for intelligent agents. Here we explore whether causal reasoning can emerge via meta-reinforcement learning. We train a recurrent network with model-free reinforcement learning to solve a range of problems that each contain causal structure. We find that the trained agent can perform causal reasoning in novel situations in order to obtain rewards. The agent can select informative interventions, draw causal inferences from observational data, and make counterfactual predictions. Although established formal causal reasoning algorithms also exist, in this paper we show that such reasoning can arise from model-free reinforcement learning, and suggest that causal reasoning in complex settings may benefit from the more end-to-end learning-based approaches presented here. This work also offers new strategies for structured exploration in reinforcement learning, by providing agents with the ability to perform - and interpret - experiments." 309 | 310 | > "Agents trained in this manner performed causal reasoning in three data settings: observational, interventional, and counterfactual. 
Our approach did not require explicit encoding of formal principles of causal inference. Rather, by optimizing an agent to perform a task that depended on causal structure, the agent learned implicit strategies to generate and use different kinds of available data for causal reasoning, including drawing causal inferences from passive observation, actively intervening, and making counterfactual predictions, all on held out causal CBNs that the agents had never previously seen. A consistent result in all three data settings was that our agents learned to perform good experiment design or active learning. That is, they learned a non-random data collection policy where they actively chose which nodes to intervene (or condition) on in the information phase, and thus could control the kinds of data they saw, leading to higher performance in the quiz phase than that from an agent with a random data collection policy." 311 | 312 | > "We showed that agents learned to perform do-calculus. We saw that, the trained agent with access to only observational data received more reward than the highest possible reward achievable without causal knowledge. We further observed that this performance increase occurred selectively in cases where do-calculus made a prediction distinguishable from the predictions based on correlations – i.e. where the externally intervened node had a parent, meaning that the intervention resulted in a different graph." 313 | 314 | > "We showed that agents learned to resolve unobserved confounders using interventions (which is impossible with only observational data). We saw that agents with access to interventional data performed better than agents with access to only observational data only in cases where the intervened node shared an unobserved parent (a confounder) with other variables in the graph." 315 | 316 | > "We showed that agents learned to use counterfactuals. We saw that agents with additional access to the specific randomness in the test phase performed better than agents with access to only interventional data. We found that the increased performance was observed only in cases where the maximum mean value in the graph was degenerate, and optimal choice was affected by the latent randomness – i.e. where multiple nodes had the same value on average and the specific randomness could be used to distinguish their actual values in that specific case." 317 | 318 | 319 | #### ["General Identifiability with Arbitrary Surrogate Experiments"](http://auai.org/uai2019/proceedings/papers/144.pdf) Lee, Correa, Bareinboim 320 | `UAI 2019` 321 | > "We study the problem of causal identification from an arbitrary collection of observational and experimental distributions, and substantive knowledge about the phenomenon under investigation, which usually comes in the form of a causal graph. We call this problem g-identifiability, or gID for short. The gID setting encompasses two well-known problems in causal inference, namely, identifiability and z-identifiability — the former assumes that an observational distribution is necessarily available, and no experiments can be performed, conditions that are both relaxed in the gID setting; the latter assumes that all combinations of experiments are available, i.e., the power set of the experimental set Z, which gID does not require a priori. In this paper, we introduce a general strategy to prove non-gID based on hedgelets and thickets, which leads to a necessary and sufficient graphical condition for the corresponding decision problem. 
We further develop a procedure for systematically computing the target effect, and prove that it is sound and complete for gID instances. In other words, failure of the algorithm in returning an expression implies that the target effect is not computable from the available distributions. Finally, as a corollary of these results, we show that do-calculus is complete for the task of g-identifiability." 322 | 323 | > "In one line of investigation, this task is formalized through the question of whether the effect that an intervention on a set of variables X will have on another set of outcome variables Y (denoted Px(y)) can be uniquely computed from the probability distribution P over the observed variables V and a causal diagram G. This is known as the problem of identification, and has received great attention in the literature, starting with a number of sufficient conditions, and culminating in a complete graphical and algorithmic characterization. Despite the generality of such results, it’s the case that in some real-world applications the quantity Px(y) is not identifiable (i.e., not uniquely computable) from the observational data and the causal diagram." 324 | 325 | > "On an alternative thread in the literature, causal effects (Px(y)) are obtained directly through controlled experimentation. In the biomedical sciences, for instance, considerable resources are spent every year by the FDA, the NIH, and others, in supporting large-scale, systematic, and controlled experimentation, which comes under the rubric of Randomized Controlled Trials. The same method is also leveraged in the context of reinforcement learning, for example, when an autonomous agent is deployed in an environment and is given the capability of performing interventions and observing how they unfold in time. Through this process, experimental data is gathered, and used in the construction of a strategy, also known as policy, with the goal of optimizing the agent’s cumulative reward (e.g., survival, profitability, happiness). Despite all the inferential power entailed by this approach, there are real-world settings where controlling the variables in X is not feasible, possibly due to economical, technical, or ethical constraints." 326 | 327 | > "In this paper, we note that these two approaches can be seen as extremes in a spectrum of possible research designs, which can be combined to solve very natural, albeit non-trivial, causal inference problems. In fact, this generalized setting has been investigated in the literature under the rubric of z-identifiability (zID, for short). Formally, zID asks whether Px(y) can be uniquely computed from the combination of the observational distribution P(V) and the experimental distributions Pz'(V), for all Z'⊆ Z for some Z ⊆ V." 
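A numeric illustration of the simplest identifiable case (back-door adjustment), not the gID algorithm itself: in the graph Z → X, Z → Y, X → Y the interventional quantity P(y|do(x)) equals Σz P(y|x,z)·P(z) and is therefore computable from purely observational data. The structural equations and parameters below are made up for illustration.

```python
# back-door adjustment: the naive conditional P(y=1|x=1) is confounded by z,
# while the adjusted estimate matches what an actual intervention do(X=1) gives
import numpy as np

rng = np.random.default_rng(2)
n = 500_000

def simulate(do_x=None):
    z = rng.binomial(1, 0.5, n)                                  # confounder
    x = rng.binomial(1, 0.2 + 0.6 * z) if do_x is None else np.full(n, do_x)
    y = rng.binomial(1, 0.1 + 0.3 * x + 0.4 * z)                 # outcome
    return z, x, y

z, x, y = simulate()                         # observational data only
naive = y[x == 1].mean()                     # P(y=1 | x=1), confounded
adjusted = sum(y[(x == 1) & (z == v)].mean() * (z == v).mean() for v in (0, 1))
_, _, y_do = simulate(do_x=1)                # ground truth via intervention

print(naive, adjusted, y_do.mean())          # adjusted ≈ interventional estimate
```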
328 | -------------------------------------------------------------------------------- /Information Retrieval.md: -------------------------------------------------------------------------------- 1 | 2 | 3 | * [**overview**](#overview) 4 | * [**interesting papers**](#interesting-papers) 5 | - [**ranking**](#interesting-papers---ranking) 6 | - [**document models**](#interesting-papers---document-models) 7 | - [**entity-centric search**](#interesting-papers---entity-centric-search) 8 | 9 | 10 | 11 | --- 12 | ### overview 13 | 14 | ["Foundations of Information Retrieval"](https://drive.google.com/file/d/0B-GJrccmbImkZ3pjNl9sczQxd3M) by Maarten de Rijke `slides` `SIGIR 2017` 15 | 16 | ["What Every Software Engineer Should Know about Search"](https://medium.com/startup-grind/what-every-software-engineer-should-know-about-search-27d1df99f80d) by Max Grigorev 17 | 18 | ["An Introduction to Information Retrieval"](https://nlp.stanford.edu/IR-book/) by Manning, Raghavan, Schutze `book` 19 | ["Search Engines. Information Retrieval in Practice"](http://ciir.cs.umass.edu/irbook/) by Croft, Metzler, Strohman `book` 20 | 21 | ---- 22 | 23 | [course](https://youtube.com/user/victorlavrenko/playlists) by Victor Lavrenko `video` 24 | 25 | [course](https://compscicenter.ru/courses/information-retrieval/2016-autumn/) from Yandex `video` `in russian` 26 | course from Mail.ru 27 | ([first part](https://youtube.com/playlist?list=PLrCZzMib1e9o_BlrSB5bFkLq8h2i4pQjz), 28 | [second part](https://youtube.com/playlist?list=PLrCZzMib1e9o7YIhOfJtD1EaneGOGkN-_)) `video` `in russian` 29 | [course](https://youtube.com/playlist?list=PLrCZzMib1e9rIikWB2NlBUF1z7HvaO_IO) from Mail.ru `video` `in russian` 30 | 31 | [overview](https://youtube.com/watch?v=3R6vBd_Y8O4) of ranking by Sergey Nikolenko `video` `in russian` 32 | overview of ranking by Nikita Volkov 33 | ([first part](https://youtube.com/watch?v=GctrEpJinhI), 34 | [second part](https://youtube.com/watch?v=GZmXKBzIfkA)) `video` `in russian` 35 | 36 | [course](http://nzhiltsov.github.io/IR-course/) by Nikita Zhiltsov `in russian` 37 | 38 | ---- 39 | 40 | ["Neural Models for Information Retrieval"](https://youtube.com/watch?v=g1Pgo5yTIKg) by Bhaskar Mitra `video` 41 | 42 | ["An Introduction to Neural Information Retrieval"](https://www.microsoft.com/en-us/research/uploads/prod/2017/06/fntir2018-neuralir-mitra.pdf) by Bhaskar Mitra and Nick Craswell `paper` 43 | 44 | ["Neural Networks for Information Retrieval"](http://nn4ir.com) tutorials `slides` `ECIR 2018` `WSDM 2018` `SIGIR 2017` 45 | ["Neural Text Embeddings for Information Retrieval"](https://microsoft.com/en-us/research/event/wsdm-2017-tutorial-neural-text-embeddings-information-retrieval/) 46 | tutorial by Bhaskar Mitra and Nick Craswell 47 | ([slides](https://slideshare.net/BhaskarMitra3/neural-text-embeddings-for-information-retrieval-wsdm-2017), [paper](https://arxiv.org/abs/1705.01509)) `WSDM 2017` 48 | 49 | --- 50 | 51 | challenges: 52 | - full text document retrieval, passage retrieval, question answering 53 | - web search, searching social media, distributed information retrieval, entity ranking 54 | - learning to rank combined with neural network based representation learning 55 | - user and task modelling, personalized search, diversity 56 | - query formulation assistance, query recommendation, conversational search 57 | - multimedia retrieval 58 | - learning dense representations for long documents 59 | - dealing with rare queries and rare words 60 | - modelling text at different granularities 
(character, word, passage, document) 61 | - compositionality of vector representations 62 | - jointly modelling queries, documents, entities and other structured data 63 | 64 | 65 | 66 | --- 67 | ### interesting papers 68 | 69 | - [**ranking**](#interesting-papers---ranking) 70 | - [**document models**](#interesting-papers---document-models) 71 | - [**entity-centric search**](#interesting-papers---entity-centric-search) 72 | 73 | ---- 74 | 75 | - [**question answering over texts**](https://github.com/brylevkirill/notes/blob/master/Knowledge%20Representation%20and%20Reasoning.md#interesting-papers---question-answering-over-texts) 76 | - [**question answering over knowledge bases**](https://github.com/brylevkirill/notes/blob/master/Knowledge%20Representation%20and%20Reasoning.md#interesting-papers---question-answering-over-knowledge-bases) 77 | - [**information extraction and integration**](https://github.com/brylevkirill/notes/blob/master/Knowledge%20Representation%20and%20Reasoning.md#interesting-papers---information-extraction-and-integration) 78 | 79 | ---- 80 | 81 | #### ["Neural Information Retrieval: at the End of the Early Years"](https://link.springer.com/content/pdf/10.1007%2Fs10791-017-9321-y.pdf) Onal et al. 82 | > "In this paper, we survey the current landscape of Neural IR research, paying special attention to the use of learned distributed representations of textual units. We highlight the successes of neural IR thus far, catalog obstacles to its wider adoption, and suggest potentially promising directions for future research." 83 | 84 | 85 | #### ["Neural Models for Information Retrieval"](https://arxiv.org/abs/1705.01509) Mitra, Craswell 86 | > "Neural ranking models for information retrieval use shallow or deep neural networks to rank search results in response to a query. Traditional learning to rank models employ machine learning techniques over hand-crafted IR features. By contrast, neural models learn representations of language from raw text that can bridge the gap between query and document vocabulary. Unlike classical IR models, these new machine learning based approaches are data-hungry, requiring large scale training data before they can be deployed. This tutorial introduces basic concepts and intuitions behind neural IR models, and places them in the context of traditional retrieval models. We begin by introducing fundamental concepts of IR and different neural and non-neural approaches to learning vector representations of text. We then review shallow neural IR methods that employ pre-trained neural term embeddings without learning the IR task end-to-end. We introduce deep neural networks next, discussing popular deep architectures. Finally, we review the current DNN models for information retrieval. We conclude with a discussion on potential future directions for neural IR." 87 | 88 | - `video` (Mitra) 89 | - `slides` 90 | 91 | 92 | #### ["Critically Examining the “Neural Hype”: Weak Baselines and the Additivity of Effectiveness Gains from Neural Ranking Models"](https://arxiv.org/abs/1904.09171) Yang et al. 93 | > "Is neural IR mostly hype? In a recent SIGIR Forum article, Lin expressed skepticism that neural ranking models were actually improving ad hoc retrieval effectiveness in limited data scenarios. He provided anecdotal evidence that authors of neural IR papers demonstrate “wins” by comparing against weak baselines. 
This paper provides a rigorous evaluation of those claims in two ways: First, we conducted a meta-analysis of papers that have reported experimental results on the TREC Robust04 test collection. We do not find evidence of an upward trend in effectiveness over time. In fact, the best reported results are from a decade ago and no recent neural approach comes close. Second, we applied five recent neural models to rerank the strong baselines that Lin used to make his arguments. A significant improvement was observed for one of the models, demonstrating additivity in gains. While there appears to be merit to neural IR approaches, at least some of the gains reported in the literature appear illusory." 94 | 95 | - `video` (Pavlov) `in russian` 96 | 97 | 98 | 99 | --- 100 | ### interesting papers - ranking 101 | 102 | 103 | #### ["Learning Rank Functionals: An Empirical Study"](https://arxiv.org/abs/1407.6089) Tran et al. 104 | > "Ranking is a key aspect of many applications, such as information retrieval, question answering, ad placement and recommender systems. Learning to rank has the goal of estimating a ranking model automatically from training data. In practical settings, the task often reduces to estimating a rank functional of an object with respect to a query. In this paper, we investigate key issues in designing an effective learning to rank algorithm. These include data representation, the choice of rank functionals, the design of the loss function so that it is correlated with the rank metrics used in evaluation. For the loss function, we study three techniques: approximating the rank metric by a smooth function, decomposition of the loss into a weighted sum of element-wise losses and into a weighted sum of pairwise losses. We then present derivations of piecewise losses using the theory of high-order Markov chains and Markov random fields. In experiments, we evaluate these design aspects on two tasks: answer ranking in a Social Question Answering site, and Web Information Retrieval." 105 | 106 | 107 | #### ["Learning to Rank using Gradient Descent"](http://icml.cc/2015/wp-content/uploads/2015/06/icml_ranking.pdf) Burges et al. 108 | `learning to rank using relevance labels` `RankNet` 109 | > "We investigate using gradient descent methods for learning ranking functions; we propose a simple probabilistic cost function, and we introduce RankNet, an implementation of these ideas using a neural network to model the underlying ranking function. We present test results on toy data and on data from a commercial internet search engine." 110 | 111 | > "We have proposed a probabilistic cost for training systems to learn ranking functions using pairs of training examples. The approach can be used for any differentiable function; we explored using a neural network formulation, RankNet. RankNet is simple to train and gives excellent performance on a real world ranking problem with large amounts of data. Comparing the linear RankNet with other linear systems clearly demonstrates the benefit of using our pair-based cost function together with gradient descent; the two layer net gives further improvement. For future work it will be interesting to investigate extending the approach to using other machine learning methods for the ranking function; however evaluation speed and simplicity is a critical constraint for such systems." 
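A minimal sketch of the pairwise probabilistic cost described above, in plain NumPy: a linear scorer stands in for RankNet's neural network, and the features, pairs and learning rate are made-up illustrative values rather than the paper's setup.

```python
# Sketch of RankNet-style pairwise training (not the authors' code).
# A linear scoring function replaces the neural net; data is synthetic.
import numpy as np

rng = np.random.default_rng(0)

def score(w, x):
    return x @ w  # linear scorer for brevity; RankNet uses a neural network

def pairwise_loss_and_grad(w, x_i, x_j):
    """Cross-entropy loss for a pair where x_i should be ranked above x_j."""
    s_diff = score(w, x_i) - score(w, x_j)
    p_ij = 1.0 / (1.0 + np.exp(-s_diff))   # modeled P(i ranked above j)
    loss = -np.log(p_ij + 1e-12)           # target probability is 1
    grad = -(1.0 - p_ij) * (x_i - x_j)     # d loss / d w for the linear scorer
    return loss, grad

# toy data: (more relevant, less relevant) feature-vector pairs
pairs = [(rng.normal(size=5) + 0.5, rng.normal(size=5)) for _ in range(1000)]
w = np.zeros(5)
for epoch in range(20):
    losses = []
    for x_i, x_j in pairs:
        loss, grad = pairwise_loss_and_grad(w, x_i, x_j)
        w -= 0.01 * grad                   # plain SGD step
        losses.append(loss)
    print(epoch, float(np.mean(losses)))   # pairwise cross-entropy should decrease
```

The point of the sketch is that the cost depends only on score differences over pairs, so any differentiable scoring function can be plugged in, as the quote above notes.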
112 | 113 | - `video` (Burges) 114 | - `video` (Nikolenko) `in russian` 115 | - `code` 116 | 117 | 118 | #### ["Learning to Rank with Nonsmooth Cost Functions"](https://papers.nips.cc/paper/2971-learning-to-rank-with-nonsmooth-cost-functions) Burges et al. 119 | `learning to rank using relevance labels` `LambdaRank` 120 | > "The quality measures used in information retrieval are particularly difficult to optimize directly, since they depend on the model scores only through the sorted order of the documents returned for a given query. Thus, the derivatives of the cost with respect to the model parameters are either zero, or are undefined. In this paper, we propose a class of simple, flexible algorithms, called LambdaRank, which avoids these difficulties by working with implicit cost functions. We describe LambdaRank using neural network models, although the idea applies to any differentiable function class. We give necessary and sufficient conditions for the resulting implicit cost function to be convex, and we show that the general method has a simple mechanical interpretation. We demonstrate significantly improved accuracy, over a state-of-the-art ranking algorithm, on several datasets. We also show that LambdaRank provides a method for significantly speeding up the training phase of that ranking algorithm. Although this paper is directed towards ranking, the proposed method can be extended to any non-smooth and multivariate cost functions." 121 | 122 | ---- 123 | > "LambdaRank is a method for learning arbitrary information retrieval measures; it can be applied to any algorithm that learns through gradient descent. LambdaRank is a listwise method, in that the cost depends on the sorted order of the documents. The key LambdaRank insight is to define the gradient of the cost with respect to the score that the model assigns to a given xi after all of the xi have been sorted by their scores si; thus the gradients take into account the rank order of the documents, as defined by the current model. LambdaRank is an empirical algorithm, in that the form that the gradients take was chosen empirically: the λ’s are those gradients, and the contribution to a given feature vector xi’s λi from a pair (xi, xj), y(xi) != y(xj), is just the gradient of the logistic regression loss (viewed as a function of si - sj) multiplied by the change in Z caused by swapping the rank positions of the two documents while keeping all other documents fixed, where Z is the information retrieval measure being learned. λi is then the sum of contributions for all such pairs. Remarkably, it has been shown that a LambdaRank model trained on Z, for Z equal to Normalized Cumulative Discounted Gain (NDCG), Mean Reciprocal Rank, or Mean Average Precision (three commonly used IR measures), given sufficient training data, consistently finds a local optimum of that IR measure (in the space of the measure viewed as a function of the model parameters)." 124 | 125 | - `video` (Nikolenko) `in russian` 126 | - `paper` ["Learning to Rank Using an Ensemble of Lambda-Gradient Models"](http://proceedings.mlr.press/v14/burges11a/burges11a.pdf) by Burges et al. (optimizing Expected Reciprocal Rank) 127 | 128 | 129 | #### ["From RankNet to LambdaRank to LambdaMART: An Overview"](https://www.microsoft.com/en-us/research/publication/from-ranknet-to-lambdarank-to-lambdamart-an-overview/) Burges 130 | `learning to rank using relevance labels` `LambdaMART` 131 | > "LambdaMART is the boosted tree version of LambdaRank, which is based on RankNet. 
RankNet, LambdaRank, and LambdaMART have proven to be very successful algorithms for solving real world ranking problems: for example an ensemble of LambdaMART rankers won Track 1 of the 2010 Yahoo! Learning To Rank Challenge. The details of these algorithms are spread across several papers and reports, and so here we give a self-contained, detailed and complete description of them." 132 | 133 | ---- 134 | > "While LambdaRank was originally instantiated using neural nets, it was found that a boosted tree multiclass classifier (McRank) gave improved performance. Combining these ideas led to LambdaMART, which instantiates the LambdaRank idea using gradient boosted decision trees. This work showed that McRank’s improved performance over LambdaRank (instantiated in a neural net) is due to the difference in the expressiveness of the underlying models (boosted decision trees versus neural nets) rather than being due to an inherent limitation of the lambda-gradient idea." 135 | 136 | > "LambdaMART combines LambdaRank and MART (Multiple Additive Regression Trees). While MART uses gradient boosted decision trees for prediction tasks, LambdaMART uses gradient boosted decision trees using a cost function derived from LambdaRank for solving a ranking task. On experimental datasets, LambdaMART has shown better results than LambdaRank and the original RankNet." 137 | 138 | > "Cascade of trees, in which each new tree contributes to a gradient step in the direction that minimises the loss function. The ensemble of these trees is the final model. LambdaMART uses this ensemble but it replaces that gradient with the lambda (gradient computed given the candidate pairs) presented in LambdaRank." 139 | 140 | - `video` (Nikolenko) `in russian` 141 | - `post` 142 | - `paper` ["Learning to Rank Using an Ensemble of Lambda-Gradient Models"](http://proceedings.mlr.press/v14/burges11a/burges11a.pdf) by Burges et al. 143 | 144 | 145 | #### ["An Efficient Boosting Algorithm for Combining Preferences"](http://jmlr.org/papers/volume4/freund03a/freund03a.pdf) Freund, Iyer, Schapire, Singer 146 | `learning to rank using relevance labels` `RankBoost` 147 | > "We study the problem of learning to accurately rank a set of objects by combining a given collection of ranking or preference functions. This problem of combining preferences arises in several applications, such as that of combining the results of different search engines, or the "collaborative-filtering" problem of ranking movies for a user based on the movie rankings provided by other users. In this work, we begin by presenting a formal framework for this general problem. We then describe and analyze an efficient algorithm called RankBoost for combining preferences based on the boosting approach to machine learning. We give theoretical results describing the algorithm's behavior both on the training data, and on new test data not seen during training. We also describe an efficient implementation of the algorithm for a particular restricted but common case. We next discuss two experiments we carried out to assess the performance of RankBoost. In the first experiment, we used the algorithm to combine different web search strategies, each of which is a query expansion for a given domain. The second experiment is a collaborative-filtering task for making movie recommendations." 
148 | 149 | - `video` (Nikolenko) `in russian` 150 | 151 | 152 | #### ["Neural Ranking Models with Weak Supervision"](https://arxiv.org/abs/1704.08803) Dehghani, Zamani, Severyn, Kamps, Croft 153 | `unsupervised learning to rank` 154 | > "Despite the impressive improvements achieved by unsupervised deep neural networks in computer vision and NLP tasks, such improvements have not yet been observed in ranking for information retrieval. The reason may be the complexity of the ranking problem, as it is not obvious how to learn from queries and documents when no supervised signal is available. Hence, in this paper, we propose to train a neural ranking model using weak supervision, where labels are obtained automatically without human annotators or any external resources (e.g., click data). To this aim, we use the output of an unsupervised ranking model, such as BM25, as a weak supervision signal. We further train a set of simple yet effective ranking models based on feed-forward neural networks. We study their effectiveness under various learning scenarios (point-wise and pair-wise models) and using different input representations (i.e., from encoding query-document pairs into dense/sparse vectors to using word embedding representation). We train our networks using tens of millions of training instances and evaluate it on two standard collections: a homogeneous news collection (Robust) and a heterogeneous large-scale web collection (ClueWeb). Our experiments indicate that employing proper objective functions and letting the networks to learn the input representation based on weakly supervised data leads to impressive performance, with over 13% and 35% MAP improvements over the BM25 model on the Robust and the ClueWeb collections. Our findings also suggest that supervised neural ranking models can greatly benefit from pre-training on large amounts of weakly labeled data that can be easily obtained from unsupervised IR models." 155 | 156 | - `post` (Dehghani) 157 | - `slides` 158 | 159 | 160 | #### ["Gathering Additional Feedback on Search Results by Multi-Armed Bandits with Respect to Production Ranking"](http://www.www2015.it/documents/proceedings/proceedings/p1177.pdf) Vorobev, Lefortier, Gusev, Serdyukov 161 | `online learning to rank using click data` `BBRA` 162 | > "Given a repeatedly issued query and a document with a not-yet-confirmed potential to satisfy the users’ needs, a search system should place this document on a high position in order to gather user feedback and obtain a more confident estimate of the document utility. On the other hand, the main objective of the search system is to maximize expected user satisfaction over a rather long period, what requires showing more relevant documents on average. The state-of-the-art approaches to solving this exploration-exploitation dilemma rely on strongly simplified settings making these approaches infeasible in practice. We improve the most flexible and pragmatic of them to handle some actual practical issues. The first one is utilizing prior information about queries and documents, the second is combining bandit-based learning approaches with a default production ranking algorithm. We show experimentally that our framework enables to significantly improve the ranking of a leading commercial search engine." 
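A toy illustration of the exploration-exploitation trade-off discussed above, not the paper's algorithm: UCB1 over candidate documents for a single repeated query, with assumed click probabilities and a position-independent click model.

```python
# Toy exploration-exploitation for result placement (illustrative only):
# UCB1 chooses which document to promote for a repeated query.
import math
import random

true_ctr = [0.10, 0.12, 0.25, 0.05]   # unknown document utilities (assumed)
clicks = [0] * len(true_ctr)
shows = [0] * len(true_ctr)

def choose_doc(t):
    # show each document once, then pick by upper confidence bound
    for d in range(len(true_ctr)):
        if shows[d] == 0:
            return d
    return max(range(len(true_ctr)),
               key=lambda d: clicks[d] / shows[d] + math.sqrt(2 * math.log(t) / shows[d]))

random.seed(0)
for t in range(1, 10001):
    d = choose_doc(t)
    shows[d] += 1
    clicks[d] += random.random() < true_ctr[d]

print([round(c / s, 3) for c, s in zip(clicks, shows)])  # estimates concentrate on the best document
```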
163 | 164 | 165 | #### ["Online Learning to Rank in Stochastic Click Models"](https://arxiv.org/abs/1703.02527) Zoghi, Tunys, Ghavamzadeh, Kveton, Szepesvari, Wen 166 | `online learning to rank using click data` `BatchRank` 167 | > "Online learning to rank is a core problem in information retrieval and machine learning. Many provably efficient algorithms have been recently proposed for this problem in specific click models. The click model is a model of how the user interacts with a list of documents. Though these results are significant, their impact on practice is limited, because all proposed algorithms are designed for specific click models and lack convergence guarantees in other models. In this work, we propose BatchRank, the first online learning to rank algorithm for a broad class of click models. The class encompasses two most fundamental click models, the cascade and position-based models. We derive a gap-dependent upper bound on the T-step regret of BatchRank and evaluate it on a range of web search queries. We observe that BatchRank outperforms ranked bandits and is more robust than CascadeKL-UCB, an existing algorithm for the cascade model." 168 | 169 | - `video` (Szepesvari) 170 | 171 | 172 | #### ["A Neural Click Model for Web Search"](http://www2016.net/proceedings/proceedings/p531.pdf) Borisov, Markov, Rijke, Serdyukov 173 | `click prediction` 174 | > "Understanding user browsing behavior in web search is key to improving web search effectiveness. Many click models have been proposed to explain or predict user clicks on search engine results. They are based on the probabilistic graphical model (PGM) framework, in which user behavior is represented as a sequence of observable and hidden events. The PGM framework provides a mathematically solid way to reason about a set of events given some information about other events. But the structure of the dependencies between the events has to be set manually. Different click models use different hand-crafted sets of dependencies. We propose an alternative based on the idea of distributed representations: to represent the user’s information need and the information available to the user with a vector state. The components of the vector state are learned to represent concepts that are useful for modeling user behavior. And user behavior is modeled as a sequence of vector states associated with a query session: the vector state is initialized with a query, and then iteratively updated based on information about interactions with the search engine results. This approach allows us to directly understand user browsing behavior from click-through data, i.e., without the need for a predefined set of rules as is customary for PGM-based click models. We illustrate our approach using a set of neural click models. Our experimental results show that the neural click model that uses the same training data as traditional PGM-based click models, has better performance on the click prediction task (i.e., predicting user click on search engine results) and the relevance prediction task (i.e., ranking documents by their relevance to a query). An analysis of the best performing neural click model shows that it learns similar concepts to those used in traditional click models, and that it also learns other concepts that cannot be designed manually." 
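A rough sketch of the vector-state idea described above, with untrained, randomly initialized parameters and assumed dimensions; the paper's actual architecture and training procedure differ.

```python
# Structural sketch of a neural click model (not the paper's implementation):
# a recurrent state is initialized from the query, updated after each result,
# and read out as a click probability.
import numpy as np

rng = np.random.default_rng(0)
D = 32                                          # state / embedding size (assumed)
W_s = rng.normal(scale=0.1, size=(D, 2 * D))    # state-update weights (untrained)
w_click = rng.normal(scale=0.1, size=D)         # click-readout weights (untrained)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def click_probabilities(query_vec, result_vecs):
    """Scan results top-to-bottom, updating the state and predicting clicks."""
    state = np.tanh(query_vec)                  # state initialized with the query
    probs = []
    for r in result_vecs:
        state = np.tanh(W_s @ np.concatenate([state, r]))  # update on interaction
        probs.append(float(sigmoid(w_click @ state)))      # P(click on this result)
    return probs

query = rng.normal(size=D)                      # stand-ins for learned embeddings
serp = [rng.normal(size=D) for _ in range(10)]
print(click_probabilities(query, serp))
```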
175 | 176 | 177 | #### ["A Click Sequence Model for Web Search"](https://arxiv.org/abs/1805.03411) Borisov, Wardenaar, Markov, Rijke 178 | `click prediction` 179 | > "Getting a better understanding of user behavior is important for advancing information retrieval systems. Existing work focuses on modeling and predicting single interaction events, such as clicks. In this paper, we for the first time focus on modeling and predicting sequences of interaction events. And in particular, sequences of clicks. We formulate the problem of click sequence prediction and propose a click sequence model (CSM) that aims to predict the order in which a user will interact with search engine results. CSM is based on a neural network that follows the encoder-decoder architecture. The encoder computes contextual embeddings of the results. The decoder predicts the sequence of positions of the clicked results. It uses an attention mechanism to extract necessary information about the results at each timestep. We optimize the parameters of CSM by maximizing the likelihood of observed click sequences. We test the effectiveness of CSM on three new tasks: (i) predicting click sequences, (ii) predicting the number of clicks, and (iii) predicting whether or not a user will interact with the results in the order these results are presented on a search engine result page (SERP). Also, we show that CSM achieves state-of-the-art results on a standard click prediction task, where the goal is to predict an unordered set of results a user will click on." 180 | 181 | 182 | 183 | --- 184 | ### interesting papers - document models 185 | 186 | [**interesting papers - question answering over texts**](https://github.com/brylevkirill/notes/blob/master/Knowledge%20Representation%20and%20Reasoning.md#interesting-papers---question-answering-over-texts) 187 | 188 | ---- 189 | #### ["Learning Deep Structured Semantic Models for Web Search using Clickthrough Data"](http://research.microsoft.com/apps/pubs/default.aspx?id=198202) Huang, He, Gao, Deng, Acero, Heck 190 | `DSSM` 191 | - 192 | 193 | 194 | #### ["A Latent Semantic Model with Convolutional-Pooling Structure for Information Retrieval"](http://www.msr-waypoint.com/pubs/226585/cikm2014_cdssm_final.pdf) Shen, He, Gao, Deng, Mesnil 195 | `CLSM` 196 | > "In this paper, we propose a new latent semantic model that incorporates a convolutional-pooling structure over word sequences to learn low-dimensional, semantic vector representations for search queries and Web documents. In order to capture the rich contextual structures in a query or a document, we start with each word within a temporal context window in a word sequence to directly capture contextual features at the word n-gram level. Next, the salient word n-gram features in the word sequence are discovered by the model and are then aggregated to form a sentence-level feature vector. Finally, a non-linear transformation is applied to extract high-level semantic information to generate a continuous vector representation for the full text string. The proposed convolutional latent semantic model is trained on clickthrough data and is evaluated on a Web document ranking task using a large-scale, real-world data set. Results show that the proposed model effectively captures salient semantic information in queries and documents for the task while significantly outperforming previous state-of-the-art semantic models." 
197 | 198 | > "In this paper, we have reported a novel deep learning architecture called the CLSM, motivated by the convolutional structure of the CNN, to extract both local contextual features at the word-n-gram level (via the convolutional layer) and global contextual features at the sentence-level (via the max-pooling layer) from text. The higher layer(s) in the overall deep architecture makes effective use of the extracted context-sensitive features to generate latent semantic vector representations which facilitates semantic matching between documents and queries for Web search applications. We have carried out extensive experimental studies of the proposed model whereby several state-of-the-art semantic models are compared and significant performance improvement on a large-scale real-world Web search data set is observed. Extended from our previous work on DSSM and C-DSSM models, the CLSM and its variations have also been demonstrated giving superior performance on a range of natural language processing tasks beyond information retrieval, including semantic parsing and question answering, entity search and online recommendation." 199 | 200 | - `video` (Gulin) `in russian` 201 | - `code` 202 | 203 | 204 | #### ["Modeling Interestingness with Deep Neural Networks"](http://research.microsoft.com/apps/pubs/default.aspx?id=226584) Gao, Pantel, Gamon, He, Deng 205 | > "This paper presents a deep semantic similarity model, a special type of deep neural networks designed for text analysis, for recommending target documents to be of interest to a user based on a source document that she is reading. We observe, identify, and detect naturally occurring signals of interestingness in click transitions on the Web between source and target documents, which we collect from commercial Web browser logs. The DSSM is trained on millions of Web transitions, and maps source-target document pairs to feature vectors in a latent space in such a way that the distance between source documents and their corresponding interesting targets in that space is minimized. The effectiveness of the DSSM is demonstrated using two interestingness tasks: automatic highlighting and contextual entity search. The results on large-scale, real-world datasets show that the semantics of documents are important for modeling interestingness and that the DSSM leads to significant quality improvement on both tasks, outperforming not only the classic document models that do not use semantics but also state-of-the-art topic models." 206 | 207 | - `video` (Yih) 208 | 209 | 210 | #### ["Learning to Rank Short Text Pairs with Convolutional Deep Neural Networks"](http://disi.unitn.it/~severyn/papers/sigir-2015-long.pdf) Severyn, Moschitti 211 | > "Learning a similarity function between pairs of objects is at the core of learning to rank approaches. In information retrieval tasks we typically deal with query-document pairs, in question answering - question-answer pairs. However, before learning can take place, such pairs needs to be mapped from the original space of symbolic words into some feature space encoding various aspects of their relatedness, e.g. lexical, syntactic and semantic. Feature engineering is often a laborious task and may require external knowledge sources that are not always available or difficult to obtain. 
Recently, deep learning approaches have gained a lot of attention from the research community and industry for their ability to automatically learn optimal feature representation for a given task, while claiming state-of-the-art performance in many tasks in computer vision, speech recognition and natural language processing. In this paper, we present a convolutional neural network architecture for reranking pairs of short texts, where we learn the optimal representation of text pairs and a similarity function to relate them in a supervised way from the available training data. Our network takes only words in the input, thus requiring minimal preprocessing. In particular, we consider the task of reranking short text pairs where elements of the pair are sentences. We test our deep learning system on two popular retrieval tasks from TREC: Question Answering and Microblog Retrieval. Our model demonstrates strong performance on the first task beating previous state-of-the-art systems by about 3% absolute points in both MAP and MRR and shows comparable results on tweet reranking, while enjoying the benefits of no manual feature engineering and no additional syntactic parsers." 212 | 213 | - `code` 214 | - `code` 215 | 216 | 217 | #### ["A Dual Embedding Space Model for Document Ranking"](https://arxiv.org/abs/1602.01137) Mitra, Nalisnick, Craswell, Caruana 218 | `DESM` 219 | > "A fundamental goal of search engines is to identify, given a query, documents that have relevant text. This is intrinsically difficult because the query and the document may use different vocabulary, or the document may contain query words without being relevant. We investigate neural word embeddings as a source of evidence in document ranking. We train a word2vec embedding model on a large unlabelled query corpus, but in contrast to how the model is commonly used, we retain both the input and the output projections, allowing us to leverage both the embedding spaces to derive richer distributional relationships. During ranking we map the query words into the input space and the document words into the output space, and compute a query-document relevance score by aggregating the cosine similarities across all the query-document word pairs." 220 | 221 | > "We postulate that the proposed Dual Embedding Space Model (DESM) captures evidence on whether a document is about a query term in addition to what is modelled by traditional term-frequency based approaches. Our experiments show that the DESM can re-rank top documents returned by a commercial Web search engine, like Bing, better than a term-matching based signal like TF-IDF. However, when ranking a larger set of candidate documents, we find the embeddings-based approach is prone to false positives, retrieving documents that are only loosely related to the query. We demonstrate that this problem can be solved effectively by ranking based on a linear mixture of the DESM and the word counting features." 222 | 223 | - `video` (Mitra) 224 | - `code` 225 | 226 | 227 | #### ["Query Expansion with Locally-Trained Word Embeddings"](https://arxiv.org/abs/1605.07891) Diaz, Mitra, Craswell 228 | > "Continuous space word embeddings have received a great deal of attention in the natural language processing and machine learning communities for their ability to model term similarity and other relationships. We study the use of term relatedness in the context of query expansion for ad hoc information retrieval. 
We demonstrate that word embeddings such as word2vec and GloVe, when trained globally, underperform corpus and query specific embeddings for retrieval tasks. These results suggest that other tasks benefiting from global embeddings may also benefit from local embeddings." 229 | 230 | > "The success of local embeddings on this task should alarm natural language processing researchers using global embeddings as a representational tool. For one, the approach of learning from vast amounts of data is only effective if the data is appropriate for the task at hand. And, when provided, much smaller high-quality data can provide much better performance. Beyond this, our results suggest that the approach of estimating global representations, while computationally convenient, may overlook insights possible at query time, or evaluation time in general. A similar local embedding approach can be adopted for any natural language processing task where topical locality is expected and can be estimated. Although we used a query to re-weight the corpus in our experiments, we could just as easily use alternative contextual information (e.g. a sentence, paragraph, or document) in other tasks." 231 | 232 | > "Although local embeddings provide effectiveness gains, they can be quite inefficient compared to global embeddings. We believe that there is opportunity to improve the efficiency by considering offline computation of local embeddings at a coarser level than queries but more specialized than the corpus. If the retrieval algorithm is able to select the appropriate embedding at query time, we can avoid training the local embedding." 233 | 234 | - `video` (Mitra) 235 | 236 | 237 | #### ["Learning to Match Using Local and Distributed Representations of Text for Web Search"](https://arxiv.org/abs/1610.08136) Mitra, Diaz, Craswell 238 | `Duet` 239 | > "Models such as latent semantic analysis and those based on neural embeddings learn distributed representations of text, and match the query against the document in the latent semantic space. In traditional information retrieval models, on the other hand, terms have discrete or local representations, and the relevance of a document is determined by the exact matches of query terms in the body text. We hypothesize that matching with distributed representations complements matching with traditional local representations, and that a combination of the two is favorable. We propose a novel document ranking model composed of two separate deep neural networks, one that matches the query and the document using a local representation, and another that matches the query and the document using learned distributed representations. The two networks are jointly trained as part of a single neural network. We show that this combination or ‘duet’ performs significantly better than either neural network individually on a Web page ranking task, and also significantly outperforms traditional baselines and other recently proposed models based on neural networks." 
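A highly simplified sketch of the "duet" idea, assuming made-up embeddings and a fixed mixture weight: one score comes from exact term matching (local representation), another from embedding similarity (distributed representation), and the two are combined; in the paper both parts are deep subnetworks trained jointly.

```python
# Illustrative combination of local and distributed matching (not the authors' model).
import numpy as np

rng = np.random.default_rng(0)
vocab = {w: i for i, w in enumerate("cheap flights to london from new york".split())}
emb = rng.normal(scale=0.1, size=(len(vocab), 16))   # stand-in for learned embeddings

def local_score(query, doc):
    # exact-match evidence: how often each query term occurs in the document
    return sum(doc.count(t) for t in query) / (len(query) * max(len(doc), 1))

def distributed_score(query, doc):
    # embedding evidence: cosine similarity of mean query / document vectors
    q = np.mean([emb[vocab[t]] for t in query], axis=0)
    d = np.mean([emb[vocab[t]] for t in doc], axis=0)
    return float(q @ d / (np.linalg.norm(q) * np.linalg.norm(d) + 1e-9))

def duet_score(query, doc, alpha=0.5):
    # a fixed mixture just illustrates the combination; the paper trains both parts jointly
    return alpha * local_score(query, doc) + (1 - alpha) * distributed_score(query, doc)

print(duet_score("cheap flights".split(), "flights from london to new york".split()))
```

The design intuition is the one stated in the abstract: exact matching and semantic matching make different errors, so adding the two signals tends to beat either alone.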
240 | 241 | - `video` (Mitra) 242 | - `code` 243 | - `code` 244 | 245 | 246 | 247 | --- 248 | ### interesting papers - entity-centric search 249 | 250 | [**interesting papers - question answering over knowledge bases**](https://github.com/brylevkirill/notes/blob/master/Knowledge%20Representation%20and%20Reasoning.md#interesting-papers---question-answering-over-knowledge-bases) 251 | [**interesting papers - information extraction and integration**](https://github.com/brylevkirill/notes/blob/master/Knowledge%20Representation%20and%20Reasoning.md#interesting-papers---information-extraction-and-integration) 252 | 253 | ---- 254 | #### ["Fast and Space-Efficient Entity Linking in Queries"](http://labs.yahoo.com/publication/fast-and-space-efficient-entity-linking-in-queries/) Blanco, Ottaviano, Meij 255 | > "Entity linking deals with identifying entities from a knowledge base in a given piece of text and has become a fundamental building block for web search engines, enabling numerous downstream improvements from better document ranking to enhanced search results pages. A key problem in the context of web search queries is that this process needs to run under severe time constraints as it has to be performed before any actual retrieval takes place, typically within milliseconds. In this paper we propose a probabilistic model that leverages user-generated information on the web to link queries to entities in a knowledge base. There are three key ingredients that make the algorithm fast and space-efficient. First, the linking process ignores any dependencies between the different entity candidates, which allows for a O(k^2) implementation in the number of query terms. Second, we leverage hashing and compression techniques to reduce the memory footprint. Finally, to equip the algorithm with contextual knowledge without sacrificing speed, we factor the distance between distributional semantics of the query words and entities into the model. We show that our solution significantly outperforms several state-of-the-art baselines by more than 14% while being able to process queries in sub-millisecond times—at least two orders of magnitude faster than existing systems." 256 | 257 | 258 | #### ["Jigs and Lures: Associating Web Queries with Structured Entities"](http://www.aclweb.org/anthology/P11-1009) Pantel, Fuxman 259 | > "We propose methods for estimating the probability that an entity from an entity database is associated with a web search query. Association is modeled using a query entity click graph, blending general query click logs with vertical query click logs. Smoothing techniques are proposed to address the inherent data sparsity in such graphs, including interpolation using a query synonymy model. A large-scale empirical analysis of the smoothing techniques, over a 2-year click graph collected from a commercial search engine, shows significant reductions in modeling error. The association models are then applied to the task of recommending products to web queries, by annotating queries with products from a large catalog and then mining query-product associations through web search session analysis. Experimental analysis shows that our smoothing techniques improve coverage while keeping precision stable, and overall, that our top-performing model affects 9% of general web queries with 94% precision." 
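A toy version of estimating query-entity association from a click graph, not the paper's model: a maximum-likelihood estimate of P(entity | query) interpolated with global entity popularity to smooth sparse queries; the counts and interpolation weight are made up.

```python
# Illustrative smoothed query-entity association from click counts.
from collections import Counter, defaultdict

clicks = [("iphone 7 case", "iPhone 7"), ("iphone 7 case", "iPhone 7 Case"),
          ("iphone 7 case", "iPhone 7 Case"), ("buy iphone", "iPhone 7")]

query_entity = defaultdict(Counter)
entity_total = Counter()
for q, e in clicks:
    query_entity[q][e] += 1
    entity_total[e] += 1

def p_entity_given_query(query, entity, lam=0.8):
    """Interpolate the per-query estimate with global entity popularity."""
    q_counts = query_entity[query]
    ml = q_counts[entity] / sum(q_counts.values()) if q_counts else 0.0
    backoff = entity_total[entity] / sum(entity_total.values())
    return lam * ml + (1 - lam) * backoff

print(p_entity_given_query("iphone 7 case", "iPhone 7 Case"))
```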
260 | 261 | 262 | #### ["Active Objects: Actions for Entity-Centric Search"](http://research.microsoft.com/apps/pubs/default.aspx?id=161389) Lin, Pantel, Gamon, Kannan, Fuxman 263 | > "We introduce an entity-centric search experience, called Active Objects, in which entity-bearing queries are paired with actions that can be performed on the entities. For example, given a query for a specific flashlight, we aim to present actions such as reading reviews, watching demo videos, and finding the best price online. In an annotation study conducted over a random sample of user query sessions, we found that a large proportion of queries in query logs involve actions on entities, calling for an automatic approach to identifying relevant actions for entity-bearing queries. In this paper, we pose the problem of finding actions that can be performed on entities as the problem of probabilistic inference in a graphical model that captures how an entity bearing query is generated. We design models of increasing complexity that capture latent factors such as entity type and intended actions that determine how a user writes a query in a search box, and the URL that they click on. Given a large collection of real-world queries and clicks from a commercial search engine, the models are learned efficiently through maximum likelihood estimation using an EM algorithm. Given a new query, probabilistic inference enables recommendation of a set of pertinent actions and hosts. We propose an evaluation methodology for measuring the relevance of our recommended actions, and show empirical evidence of the quality and the diversity of the discovered actions." 264 | 265 | > "Search as an action broker: A promising future search scenario involves modeling the user intents (or “verbs”) underlying the queries and brokering the webpages that accomplish the intended actions. In this vision, the broker is aware of all entities and actions of interest to its users, understands the intent of the user, ranks all providers of actions, and provides direct actionable results through APIs with the providers." 266 | -------------------------------------------------------------------------------- /Probabilistic Programming.md: -------------------------------------------------------------------------------- 1 | Probabilistic models can be represented using programs that make stochastic choices. Operations on models such as learning and inference can be represented as meta-programs that find probable executions of model programs given constraints on execution traces. 
2 | 3 | 4 | * [**overview**](#overview) 5 | * [**applications**](#applications) 6 | * [**projects**](#projects) 7 | * [**interesting papers**](#interesting-papers) 8 | - [**applications**](#interesting-papers---applications) 9 | 10 | 11 | 12 | --- 13 | ### overview 14 | 15 | [introduction](http://intelligence.org/2014/09/04/daniel-roy/) by Daniel Roy 16 | [introduction](http://habrahabr.ru/post/242993/) by Alexey Popov `in russian` 17 | 18 | ["An Introduction to Probabilistic Programming"](https://arxiv.org/abs/1809.10756) by Meent, Paige, Yang, Wood `paper` 19 | 20 | [Forest](http://forestdb.org) - a repository for generative models 21 | 22 | ---- 23 | 24 | ["Engineering and Reverse Engineering Intelligence with Probabilistic Programs, Program Induction, and Deep Learning"](https://vimeo.com/248502450) by Josh Tenenbaum and Vikash Mansinghka `video` 25 | 26 | [overview](http://youtube.com/watch?v=-8QMqSWU76Q) by Vikash Mansinghka `video` 27 | 28 | [tutorial](http://research.microsoft.com/apps/video/dl.aspx?id=259568) by Frank Wood `video` 29 | [tutorial](http://youtube.com/watch?v=6Lqt07enBGs) by Frank Wood `video` 30 | 31 | [PROBPROG 2018](https://probprog.cc) conference ([videos](https://youtube.com/playlist?list=PL_PW0E_Tf2qvXBEpl10Y39RULTN-ExzZQ)) 32 | 33 | ---- 34 | 35 | "Probabilistic programming systems represent generative models as programs in a language that has specialized syntax for definition and conditioning of random variables. A backend provides one or more general-purpose inference methods for any program in this language, resulting in an abstraction boundary between model and inference algorithm design. Models specified as programs are often concise, modular, and easy to modify or extend. This allows definition of structured priors that are specifically tailored to an application domain in a manner that is efficient in terms of the dimensionality of its latent variables, albeit at the expense of performing inference with generic methods that may not take advantage of model-specific optimizations." 36 | 37 | "Probabilistic models provide a framework for describing abstract prior knowledge and using it to reason under uncertainty. Probabilistic programs are a powerful tool for probabilistic modeling. A probabilistic programming language is a deterministic programming language augmented with random sampling and Bayesian conditioning operators. Performing inference on these programs then involves reasoning about the space of executions which satisfy some constraints, such as observed values. A universal probabilistic programming language, one built on a Turing-complete language, can represent any computable probability distribution, including open-world models, Bayesian nonparametrics, and stochastic recursion. A distribution is the meaning of a program." 38 | 39 | "One of the key characteristics of higher-order probabilistic programming languages equipped with eval is that program text both can be generated and evaluated. In higher-order languages (Lisp, Scheme, Church, Anglican and Venture) functions are first class objects - evaluating program text that defines a valid procedure returns a procedure that can be applied to arguments. This means that, among other things, program text can be programmatically generated by a program and then evaluated. In a probabilistic programming context this means that we can do inference about the program text that gave rise to an observed output or output relationship. In short, we can get computers to program themselves." 
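A library-free sketch of the "inference as finding probable executions" view in the quotes above: a generative program makes random choices, an observation constrains its output, and rejection sampling keeps only the executions consistent with that constraint; the biased-coin model is a made-up toy example.

```python
# Sketch of sampling + conditioning without any PPL library (illustrative only).
import random

def program():
    """Generative model: pick a coin bias, then flip the coin 10 times."""
    bias = random.random()                        # latent random choice
    flips = [random.random() < bias for _ in range(10)]
    return bias, sum(flips)

def infer_bias(observed_heads, num_samples=200000):
    """Posterior mean of the latent bias given the observed number of heads."""
    accepted = [bias for bias, heads in (program() for _ in range(num_samples))
                if heads == observed_heads]       # keep executions matching the observation
    return sum(accepted) / len(accepted)

random.seed(0)
print(infer_bias(observed_heads=9))               # roughly (9+1)/(10+2) = 0.83 for a uniform prior
```

Real probabilistic programming systems replace this brute-force rejection step with generic inference backends (MCMC, SMC, variational methods) that work for any program in the language.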
40 | 41 | 42 | 43 | --- 44 | ### applications 45 | 46 | [Microsoft Office 365 Clutter](https://microsoft.com/research/blog/probabilistic-programming-goes-large-scale-from-reducing-email-clutter-to-any-machine-learning-task) ([overview](https://youtu.be/g_LSbqLBdM0?t=4m44s) `video`) *(uses Infer.NET)* 47 | [Microsoft TrueSkill](http://trueskill.org) ([overview](https://youtu.be/g_LSbqLBdM0?t=5m56s) `video`, [paper](https://github.com/brylevkirill/notes/blob/master/Bayesian%20Inference%20and%20Learning.md#trueskilltm-a-bayesian-skill-rating-system-herbrich-minka-graepel) `summary`, [paper](https://github.com/brylevkirill/notes/blob/master/Bayesian%20Inference%20and%20Learning.md#trueskill-2-an-improved-bayesian-skill-rating-system-minka-cleven-zaykov) `summary`) *(uses Infer.NET)* 48 | [Microsoft Azure ML Matchbox](https://devblogs.microsoft.com/dotnet/dot-net-recommendation-system-for-net-applications-using-azure-machine-learning) ([overview](https://youtu.be/g_LSbqLBdM0?t=7m48s) `video`, [paper](https://github.com/brylevkirill/notes/blob/master/Bayesian%20Inference%20and%20Learning.md#matchbox-large-scale-bayesian-recommendations-stern-herbrich-graepel) `summary`) *(uses Infer.NET)* 49 | [Microsoft Satori Alexandria](https://devblogs.microsoft.com/dotnet/announcing-ml-net-0-6-machine-learning-net) ([overview](https://youtu.be/g_LSbqLBdM0?t=8m53s) `video`, [paper](https://github.com/brylevkirill/notes/blob/master/Knowledge%20Representation%20and%20Reasoning.md#alexandria-unsupervised-high-precision-knowledge-base-construction-using-a-probabilistic-program-winn-et-al) `summary`) *(uses Infer.NET)* 50 | [Microsoft Excel](http://research.microsoft.com/en-us/projects/tabular/) ([overview](https://youtube.com/watch?v=jsJZkSpLmq4) `video`) *(uses Infer.NET)* 51 | 52 | Facebook HackPPL ([overview](https://youtube.com/watch?v=gn6M8MX8jpI) `video`, [paper](https://research.fb.com/publications/hackppl-a-universal-probabilistic-programming-language)) 53 | [Facebook Prophet](https://facebook.github.io/prophet/) ([overview](https://youtube.com/watch?v=pOYAXv15r3A) `video`, [post](https://research.fb.com/blog/2017/02/prophet-forecasting-at-scale/), [paper](http://lethalletham.com/ForecastingAtScale.pdf)) *(uses Stan)* 54 | 55 | [machine teaching](http://blogs.microsoft.com/next/2015/07/10/the-next-evolution-of-machine-learning-machine-teaching/) 56 | 57 | [graphics in reverse](http://newsoffice.mit.edu/2015/better-probabilistic-programming-0413) 58 | 59 | [**commonsense reasoning**](https://github.com/brylevkirill/notes/blob/master/Knowledge%20Representation%20and%20Reasoning.md#reasoning---commonsense-reasoning) 60 | 61 | 62 | 63 | --- 64 | ### projects 65 | 66 | - languages for models & systems that simplify / automate aspects of inference (Edward, Stan, PyMC, PyRo, WebPPL, BLOG) 67 | - models and queries defined in terms of complex stochastic computations (BayesDB) 68 | - programs and languages as formal representations of probabilistic objects (Venture) 69 | 70 | ---- 71 | 72 | - [*Infer.NET*](https://github.com/dotnet/infer) 73 | 74 | ["Model-Based Machine Learning"](http://mbmlbook.com) by John Winn et al. 
`book` 75 | 76 | [overview](https://youtube.com/watch?v=g_LSbqLBdM0) by Yordan Zaykov `video` 77 | [overview](http://youtube.com/watch?v=ZHERrzVDTiU) by Boris Yangel `video` `in russian` 78 | 79 | [applications](#applications) 80 | 81 | - [*Stan*](https://github.com/stan-dev) 82 | 83 | [overview](https://youtube.com/watch?v=6NXRCtWQNMg) by Bob Carpenter `video` 84 | [overview](https://vimeo.com/132156595) by Bob Carpenter `video` 85 | 86 | [Prophet](https://facebookincubator.github.io/prophet/) from Facebook 87 | 88 | - [*TensorFlow Probability*](https://tensorflow.org/probability) 89 | 90 | [overview](https://youtube.com/watch?v=BrwKURU-wpk) by Joshua Dilon `video` 91 | 92 | [TensorFlow Distributions](https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/distributions) ([paper](https://arxiv.org/abs/1711.10604) by Dillon et al.) 93 | 94 | ["Probabilistic Programming and Bayesian Methods for Hackers"](http://camdavidsonpilon.github.io/Probabilistic-Programming-and-Bayesian-Methods-for-Hackers/#tensorflow) by Cam Davidson-Pilon `book` *(examples in TensorFlow Probability)* 95 | 96 | [Edward2](https://github.com/tensorflow/probability/tree/master/tensorflow_probability/python/edward2) language ([overview](https://youtube.com/watch?v=NLhVfI8jYdQ) by Dustin Tran and Chris Suter `video`) 97 | [Edward](https://github.com/blei-lab/edward) language ([paper](#deep-probabilistic-programming-tran-hoffman-saurous-brevdo-murphy-blei) by Tran et al. `summary`, [overview](https://youtube.com/watch?v=4XZkHtHtQsk) by Dustin Tran `video`, [overview](https://youtu.be/1zNNLHyeWok?t=5m) by Dustin Tran `video`, [overview](https://youtube.com/watch?v=PvyVahNl8H8) by Dustin Tran `video` ([slides](http://dustintran.com/talks/Tran_Edward.pdf))) 98 | 99 | - [*Pyro*](https://github.com/uber/pyro) 100 | 101 | ["Pyro: Deep Universal Probabilistic Programming"](https://arxiv.org/abs/1810.09538) by Bingham et al. 
`paper` 102 | 103 | - [*PyMC*](https://github.com/pymc-devs) 104 | 105 | ["Probabilistic Programming and Bayesian Methods for Hackers"](http://camdavidsonpilon.github.io/Probabilistic-Programming-and-Bayesian-Methods-for-Hackers/) by Cam Davidson-Pilon `book` *(examples in PyMC)* 106 | 107 | [overview](https://youtube.com/watch?v=LlzVlqVzeD8) by Thomas Wiecki `video` 108 | 109 | - [*WebPPL*](https://github.com/probmods/webppl) 110 | 111 | - [*BLOG*](https://github.com/BayesianLogic) 112 | 113 | [overview](https://youtu.be/JzBrp5LnNCo?t=14m4s) by Stuart Russell `video` 114 | [overview](https://youtube.com/watch?v=rdsPMCYMcZA) by Stuart Russell `video` 115 | 116 | - [*Venture*](https://github.com/venture) 117 | 118 | [overview](https://youtu.be/-8QMqSWU76Q?t=35m30s) by Vikash Mansinghka `video` 119 | [overview](https://youtu.be/Rte-y6ThwAQ?t=35m59s) by Vikash Mansinghka `video` 120 | 121 | - [*BayesDB*](http://probcomp.org/bayesdb) 122 | 123 | [**BayesDB**](https://github.com/brylevkirill/notes/blob/master/Knowledge%20Representation%20and%20Reasoning.md#probabilistic-database---bayesdb) project `summary` 124 | 125 | 126 | 127 | --- 128 | ### interesting papers 129 | 130 | [**interesting papers - bayesian inference and learning**](https://github.com/brylevkirill/notes/blob/master/Bayesian%20Inference%20and%20Learning.md#interesting-papers) 131 | [**interesting papers - bayesian deep learning**](https://github.com/brylevkirill/notes/blob/master/Deep%20Learning.md#interesting-papers---bayesian-deep-learning) 132 | 133 | 134 | 135 | ---- 136 | #### ["Probabilistic Programming"](http://research.microsoft.com/pubs/208585/fose-icse2014.pdf) Gordon, Henzinger, Nori, Rajamani 137 | > "Probabilistic programs are usual functional or imperative programs with two added constructs: (1) the ability to draw values at random from distributions, and (2) the ability to condition values of variables in a program via observations. Models from diverse application areas such as computer vision, coding theory, cryptographic protocols, biology and reliability analysis can be written as probabilistic programs. Probabilistic inference is the problem of computing an explicit representation of the probability distribution implicitly specified by a probabilistic program. Depending on the application, the desired output from inference may vary - we may want to estimate the expected value of some function f with respect to the distribution, or the mode of the distribution, or simply a set of samples drawn from the distribution. In this paper, we describe connections this research area called “Probabilistic Programming” has with programming languages and software engineering, and this includes language design, and the static and dynamic analysis of programs. We survey current state of the art and speculate on promising directions for future research." 138 | 139 | 140 | #### ["Probabilistic (Logic) Programming Concepts"](https://lirias.kuleuven.be/bitstream/123456789/490338/1/deraedt_kimmig_mlj15.pdf) De Raedt, Kimmig 141 | > "A multitude of different probabilistic programming languages exists today, all extending a traditional programming language with primitives to support modeling of complex, structured probability distributions. Each of these languages employs its own probabilistic primitives, and comes with a particular syntax, semantics and inference procedure. This makes it hard to understand the underlying programming concepts and appreciate the differences between the different languages. 
To obtain a better understanding of probabilistic programming, we identify a number of core programming concepts underlying the primitives used by various probabilistic languages, discuss the execution mechanisms that they require and use these to position and survey state-of-the-art probabilistic languages and their implementation. While doing so, we focus on probabilistic extensions of logic programming languages such as Prolog, which have been considered for over 20 years." 142 | 143 | > "First, there is an ongoing quest for efficient inference approaches for languages that support a broad range of programming concepts. Promising directions include lifted inference, which aims at exploiting symmetries and abstraction over individuals to speed up inference, knowledge compilation, which has contributed many data structures for compactly representing and efficiently querying various types of knowledge, and approximate methods such as MCMC, which is used in many probabilistic programming languages, but still requires proposal functions to be custom made for the program at hand. There also is a need for a clear understanding of the relative computational complexity of the various probabilistic languages and concepts that exist to date. Another question that has only seen partial answers so far is how to efficiently deal with evidence and constraints in different inference techniques. Adapting and extending program transformation and analysis techniques to the probabilistic setting promises opportunities to recognize and exploit program parts that are amenable to more efficient inference. Concepts such as time and dynamics require inference approaches that on the one hand exploit repeated structure, but on the other hand can also deal with changing structure over time. Last but not least, it still is a challenge to learn probabilistic programs, although a wide variety of learning techniques for probabilistic programming has already been developed. Many key challenges for both parameter and structure learning remain, many of which are related to efficient inference, as learning requires inference." 144 | 145 | - `video` (Kimmig) 146 | - `video` (De Raedt) 147 | - `video` (De Raedt) 148 | - `slides` 149 | 150 | 151 | #### ["Deep Probabilistic Programming"](https://arxiv.org/abs/1701.03757) Tran, Hoffman, Saurous, Brevdo, Murphy, Blei 152 | > "We propose Edward, a Turing-complete probabilistic programming language. Edward builds on two compositional representations - random variables and inference. By treating inference as a first class citizen, on a par with modeling, we show that probabilistic programming can be as flexible and computationally efficient as traditional deep learning. For flexibility, Edward makes it easy to fit the same model using a variety of composable inference methods, ranging from point estimation, to variational inference, to MCMC. In addition, Edward can reuse the modeling representation as part of inference, facilitating the design of rich variational models and generative adversarial networks. For efficiency, Edward is integrated into TensorFlow, providing significant speedups over existing probabilistic systems. For example, on a benchmark logistic regression task, Edward is at least 35x faster than Stan and PyMC3." 153 | 154 | 155 | #### ["Automatic Variational Inference in Stan"](http://arxiv.org/abs/1506.03431) Kucukelbir, Ranganath, Gelman, Blei 156 | > "Variational inference is a scalable technique for approximate Bayesian inference. 
Deriving variational inference algorithms requires tedious model-specific calculations; this makes it difficult to automate. We propose an automatic variational inference algorithm, automatic differentiation variational inference. The user only provides a Bayesian model and a dataset; nothing else. We make no conjugacy assumptions and support a broad class of models. The algorithm automatically determines an appropriate variational family and optimizes the variational objective. We implement ADVI in Stan (code available now), a probabilistic programming framework. We compare ADVI to MCMC sampling across hierarchical generalized linear models, nonconjugate matrix factorization, and a mixture model. We train the mixture model on a quarter million images. With ADVI we can use variational inference on any model we write in Stan." 157 | 158 | > "We develop automatic differentiation variational inference in Stan. ADVI leverages automatic transformations, an implicit non-Gaussian variational approximation, and automatic differentiation. This is a valuable tool. We can explore many models, and analyze large datasets with ease." 159 | 160 | - `video` (18:30) (Kucukelbir) 161 | 162 | 163 | #### ["Automatic Differentiation Variational Inference"](http://arxiv.org/abs/1603.00788) Kucukelbir, Tran, Ranganath, Gelman, Blei 164 | > "Probabilistic modeling is iterative. A scientist posits a simple model, fits it to her data, refines it according to her analysis, and repeats. However, fitting complex models to large data is a bottleneck in this process. Deriving algorithms for new models can be both mathematically and computationally challenging, which makes it difficult to efficiently cycle through the steps. To this end, we develop automatic differentiation variational inference. Using our method, the scientist only provides a probabilistic model and a dataset, nothing else. ADVI automatically derives an efficient variational inference algorithm, freeing the scientist to refine and explore many models. ADVI supports a broad class of models - no conjugacy assumptions are required. We study ADVI across ten different models and apply it to a dataset with millions of observations. ADVI is integrated into Stan, a probabilistic programming system; it is available for immediate use." 165 | 166 | 167 | #### ["Deep Amortized Inference for Probabilistic Programs"](https://arxiv.org/abs/1610.05735) Ritchie, Horsfall, Goodman 168 | > "Probabilistic programming languages are a powerful modeling tool, able to represent any computable probability distribution. Unfortunately, probabilistic program inference is often intractable, and existing PPLs mostly rely on expensive, approximate sampling-based methods. To alleviate this problem, one could try to learn from past inferences, so that future inferences run faster. This strategy is known as amortized inference; it has recently been applied to Bayesian networks and deep generative models. This paper proposes a system for amortized inference in PPLs. In our system, amortization comes in the form of a parameterized guide program. Guide programs have similar structure to the original program, but can have richer data flow, including neural network components. These networks can be optimized so that the guide approximately samples from the posterior distribution defined by the original program. 
We present a flexible interface for defining guide programs and a stochastic gradient-based scheme for optimizing guide parameters, as well as some preliminary results on automatically deriving guide programs. We explore in detail the common machine learning pattern in which a ‘local’ model is specified by ‘global’ random values and used to generate independent observed data points; this gives rise to amortized local inference supporting global model learning." 169 | 170 | > "In this paper, we presented a system for amortized inference in probabilistic programs. Amortization is achieved through parameterized guide programs which mirror the structure of the original program but can be trained to approximately sample from the posterior. We introduced an interface for specifying guide programs which is flexible enough to reproduce state-of-the-art variational inference methods. We also demonstrated how this interface supports model learning in addition to amortized inference. We developed and proved the correctness of an optimization method for training guide programs, and we evaluated its ability to optimize guides for Bayesian networks, topic models, and deep generative models." 171 | 172 | - `video` (Wood) 173 | 174 | 175 | #### ["Nonstandard Interpretations of Probabilistic Programs for Efficient Inference"](https://web.stanford.edu/~ngoodman/papers/WGSS-NIPS11.pdf) Wingate, Goodman, Stuhlmuller, Siskind 176 | > "Probabilistic programming languages allow modelers to specify a stochastic process using syntax that resembles modern programming languages. Because the program is in machine-readable format, a variety of techniques from compiler design and program analysis can be used to examine the structure of the distribution represented by the probabilistic program. We show how nonstandard interpretations of probabilistic programs can be used to craft efficient inference algorithms: information about the structure of a distribution (such as gradients or dependencies) is generated as a monad-like side computation while executing the program. These interpretations can be easily coded using special-purpose objects and operator overloading. We implement two examples of nonstandard interpretations in two different languages, and use them as building blocks to construct inference algorithms: automatic differentiation, which enables gradient based methods, and provenance tracking, which enables efficient construction of global proposals." 177 | 178 | > "We have shown how nonstandard interpretations of probabilistic programs can be used to extract structural information about a distribution, and how this information can be used as part of a variety of inference algorithms. The information can take the form of gradients, Hessians, fine-grained dependencies, or bounds. Empirically, we have implemented two such interpretations and demonstrated how this information can be used to find regions of high likelihood quickly, and how it can be used to generate samples with improved statistical properties versus random-walk style MCMC. There are other types of interpretations which could provide additional information. For example, interval arithmetic could be used to provide bounds or as part of adaptive importance sampling. Each of these interpretations can be used alone or in concert with each other; one of the advantages of the probabilistic programming framework is the clean separation of models and inference algorithms, making it easy to explore combinations of inference algorithms for complex models. 
More generally, this work begins to illuminate the close connections between probabilistic inference and programming language theory. It is likely that other techniques from compiler design and program analysis could be fruitfully applied to inference problems in probabilistic programs." 179 | 180 | > "With an outline of probabilistic programming in hand, we now turn to nonstandard interpretations. The idea of nonstandard interpretations originated in model theory and mathematical logic, where it was proposed that a set of axioms could be interpreted by different models. For example, differential geometry can be considered a nonstandard interpretation of classical arithmetic. In programming, a nonstandard interpretation replaces the domain of the variables in the program with a new domain, and redefines the semantics of the operators in the program to be consistent with the new domain. This allows reuse of program syntax while implementing new functionality. For example, the expression “a ∗ b” can be interpreted equally well if a and b are either scalars or matrices, but the “∗” operator takes on different meanings. Practically, many useful nonstandard interpretations can be implemented with operator overloading: variables are redefined to be objects with operators that implement special functionality, such as tracing, reference counting, or profiling." 181 | 182 | 183 | 184 | --- 185 | ### interesting papers - applications 186 | 187 | [**interesting papers - bayesian inference and learning - applications**](https://github.com/brylevkirill/notes/blob/master/Bayesian%20Inference%20and%20Learning.md#interesting-papers---applications) 188 | 189 | ---- 190 | #### ["Human-level Concept Learning Through Probabilistic Program Induction"](http://web.mit.edu/cocosci/Papers/Science-2015-Lake-1332-8.pdf) Lake, Salakhutdinov, Tenenbaum 191 | > "People learning new concepts can often generalize successfully from just a single example, yet machine learning algorithms typically require tens or hundreds of examples to perform with similar accuracy. People can also use learned concepts in richer ways than conventional algorithms - for action, imagination, and explanation. We present a computational model that captures these human learning abilities for a large class of simple visual concepts: handwritten characters from the world’s alphabets. The model represents concepts as simple programs that best explain observed examples under a Bayesian criterion. On a challenging one-shot classification task, the model achieves human-level performance while outperforming recent deep learning approaches. We also present several “visual Turing tests” probing the model’s creative generalization abilities, which in many cases are indistinguishable from human behavior." 
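
The "Bayesian criterion" above can be illustrated at toy scale: score each candidate class by the posterior predictive probability of a query point given that class's single stored example. The sketch below does this for 1-D Gaussian data with a conjugate prior on the class mean (all numbers and the model are toy assumptions for illustration; the paper's actual model represents characters as structured stroke-level programs):

```python
import numpy as np

# toy one-shot classification by a Bayesian criterion:
# each class c is defined by a single observed example x_c;
# a query x* is assigned to the class with the highest
# posterior predictive density p(x* | x_c).

SIGMA = 1.0                           # assumed known observation noise
PRIOR_MU, PRIOR_VAR = 0.0, 10.0**2    # broad prior on the class mean

def posterior_predictive(x_query, x_example):
    """p(x_query | x_example) under a N(mu, SIGMA^2) likelihood and N(PRIOR_MU, PRIOR_VAR) prior on mu."""
    # posterior over mu after one observation (conjugate Gaussian update)
    post_var = 1.0 / (1.0 / PRIOR_VAR + 1.0 / SIGMA**2)
    post_mu = post_var * (PRIOR_MU / PRIOR_VAR + x_example / SIGMA**2)
    # predictive variance = posterior variance + observation noise
    pred_var = post_var + SIGMA**2
    return np.exp(-0.5 * (x_query - post_mu)**2 / pred_var) / np.sqrt(2 * np.pi * pred_var)

one_shot_examples = {"class_a": -2.0, "class_b": 3.0}   # one example per class
x_star = 2.2
scores = {c: posterior_predictive(x_star, x) for c, x in one_shot_examples.items()}
print(max(scores, key=scores.get))   # -> class_b
```
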
192 | 193 | > "Vision program outperformed humans in identifying handwritten characters, given a single training example" 194 | 195 | > "This work brings together three key ideas -- compositionality, causality, and learning-to-learn -- challenging (in a good way) the traditional deep learning approach" 196 | 197 | - `video` (Lake) 198 | - `video` (Tenenbaum) 199 | - `video` (Lake) 200 | - `notes` 201 | - `code` 202 | 203 | 204 | #### ["Picture: A Probabilistic Programming Language for Scene Perception"](http://mrkulk.github.io/www_cvpr15/) Kulkarni, Kohli, Tenenbaum, Mansinghka 205 | > "Recent progress on probabilistic modeling and statistical learning, coupled with the availability of large training datasets, has led to remarkable progress in computer vision. Generative probabilistic models, or “analysis-by-synthesis” approaches, can capture rich scene structure but have been less widely applied than their discriminative counterparts, as they often require considerable problem-specific engineering in modeling and inference, and inference is typically seen as requiring slow, hypothesize-and-test Monte Carlo methods. Here we present Picture, a probabilistic programming language for scene understanding that allows researchers to express complex generative vision models, while automatically solving them using fast general-purpose inference machinery. Picture provides a stochastic scene language that can express generative models for arbitrary 2D/3D scenes, as well as a hierarchy of representation layers for comparing scene hypotheses with observed images by matching not simply pixels, but also more abstract features (e.g., contours, deep neural network activations). Inference can flexibly integrate advanced Monte Carlo strategies with fast bottom-up data-driven methods. Thus both representations and inference strategies can build directly on progress in discriminatively trained systems to make generative vision more robust and efficient. We use Picture to write programs for generative 3D face analysis, 3D human pose estimation, and 3D object reconstruction – each in under 50 lines of code, and each competitive with specially engineered baselines." 206 | 207 | - `video` (Tenenbaum) 208 | - `video` (Mansinghka) 209 | - `video` (Mansinghka) 210 | - `video` (Mansinghka) 211 | - `video` (Reed) 212 | 213 | 214 | #### ["Practical Optimal Experiment Design with Probabilistic Programs"](https://arxiv.org/abs/1608.05046) Ouyang, Tessler, Ly, Goodman 215 | > "Scientists often run experiments to distinguish competing theories. This requires patience, rigor, and ingenuity - there is often a large space of possible experiments one could run. But we need not comb this space by hand - if we represent our theories as formal models and explicitly declare the space of experiments, we can automate the search for good experiments, looking for those with high expected information gain. Here, we present a general and principled approach to experiment design based on probabilistic programming languages. PPLs offer a clean separation between declaring problems and solving them, which means that the scientist can automate experiment design by simply declaring her model and experiment spaces in the PPL without having to worry about the details of calculating information gain. We demonstrate our system in two case studies drawn from cognitive psychology, where we use it to design optimal experiments in the domains of sequence prediction and categorization.
We find strong empirical validation that our automatically designed experiments were indeed optimal. We conclude by discussing a number of interesting questions for future research." 216 | 217 | 218 | #### ["Semantic Parsing to Probabilistic Programs for Situated Question Answering"](http://arxiv.org/abs/1606.07046) Krishnamurthy, Tafjord 219 | > "Situated question answering is the problem of answering questions about an environment such as an image. This problem requires interpreting both a question and the environment, and is challenging because the set of interpretations is large, typically superexponential in the number of environmental objects. Existing models handle this challenge by making strong -- and untrue -- independence assumptions. We present Parsing to Probabilistic Programs (P3), a novel situated question answering model that utilizes approximate inference to eliminate these independence assumptions and enable the use of global features of the question/environment interpretation. Our key insight is to treat semantic parses as probabilistic programs that execute nondeterministically and whose possible executions represent environmental uncertainty. We evaluate our approach on a new, publicly-released data set of 5000 science diagram questions, finding that our approach outperforms several competitive baselines." 220 | 221 | > "We present Parsing to Probabilistic Programs (P3), a novel model for situated question answering that embraces approximate inference to enable the use of arbitrary features of the language and environment. P3 trains a semantic parser to predict logical forms that are probabilistic programs whose possible executions represent environmental uncertainty. We demonstrate this model on a challenging new data set of 5000 science diagram questions, finding that it outperforms several competitive baselines and that its global features improve accuracy. P3 has several advantageous properties. First, P3 can be easily applied to new problems: one simply has to write an initialization program and define the execution features. Second, the initialization program can be used to encode a wide class of assumptions about the environment. For example, the model can assume that every noun refers to a single object. The combination of semantic parsing and probabilistic programming makes P3 an expressive and flexible model with many potential applications." 222 | 223 | 224 | #### ["TerpreT: A Probabilistic Programming Language for Program Induction"](https://arxiv.org/abs/1608.04428) Gaunt, Brockschmidt, Singh, Kushman, Kohli, Taylor, Tarlow 225 | > "We study machine learning formulations of inductive program synthesis; that is, given input-output examples, we would like to synthesize source code that maps inputs to corresponding outputs. Our aims in this work are to develop new machine learning approaches to the problem based on neural networks and graphical models, and to understand the capabilities of machine learning techniques relative to traditional alternatives, such as those based on constraint solving from the programming languages community. Our key contribution is the proposal of TerpreT, a domain-specific language for expressing program synthesis problems. TerpreT is similar to a probabilistic programming language: a model is composed of a specification of a program representation (declarations of random variables) and an interpreter that describes how programs map inputs to outputs (a model connecting unknowns to observations). 
The inference task is to observe a set of input-output examples and infer the underlying program. TerpreT has two main benefits. First, it enables rapid exploration of a range of domains, program representations, and interpreter models. Second, it separates the model specification from the inference algorithm, allowing proper like-to-like comparisons between different approaches to inference. From a single TerpreT specification we can automatically perform inference using four different back-ends that include machine learning and program synthesis approaches. These are based on gradient descent (thus each specification can be seen as a differentiable interpreter), linear program relaxations for graphical models, discrete satisfiability solving, and the Sketch program synthesis system. We illustrate the value of TerpreT by developing several interpreter models and performing an extensive empirical comparison between alternative inference algorithms on a variety of program models. Our key, and perhaps surprising, empirical finding is that constraint solvers dominate the gradient descent and LP-based formulations. We conclude with some suggestions on how the machine learning community can make progress on program synthesis." 226 | 227 | > "These works raise questions of (a) whether new models can be designed specifically to synthesize interpretable source code that may contain looping and branching structures, and (b) whether searching over program space using techniques developed for training deep neural networks is a useful alternative to the combinatorial search methods used in traditional IPS. In this work, we make several contributions in both of these directions." 228 | 229 | > "Shows that differentiable interpreter-based program induction is inferior to discrete search-based techniques used by the programming languages community. We are then left with the question of how to make progress on program induction using machine learning techniques." 230 | 231 | - `video` (Gaunt) 232 | 233 | 234 | #### ["Black-Box Policy Search with Probabilistic Programs"](https://arxiv.org/abs/1507.04635) Meent, Paige, Tolpin, Wood 235 | > "In this work, we explore how probabilistic programs can be used to represent policies in sequential decision problems. In this formulation, a probabilistic program is a black-box stochastic simulator for both the problem domain and the agent. We relate classic policy gradient techniques to recently introduced black-box variational methods which generalize to probabilistic program inference. We present case studies in the Canadian traveler problem, Rock Sample, and a benchmark for optimal diagnosis inspired by Guess Who. Each study illustrates how programs can efficiently represent policies using moderate numbers of parameters." 236 | 237 | > "In this paper we put forward the idea that probabilistic programs can be a productive medium for describing both a problem domain and the agent in sequential decision problems. Programs can often incorporate assumptions about the structure of a problem domain to represent the space of policies in a more targeted manner, using a much smaller number of variables than would be needed in a more general formulation. By combining probabilistic programming with black-box variational inference we obtain a generalized variant of well-established policy gradient techniques that allow us to define and learn policies with arbitrary levels of algorithmic sophistication in moderately high-dimensional parameter spaces. 
Fundamentally, policy programs represent some form of assumptions about what contextual information is most relevant to a decision, whereas the policy parameters represent domain knowledge that generalizes across episodes." 238 | -------------------------------------------------------------------------------- /Recommender Systems.md: -------------------------------------------------------------------------------- 1 | 2 | 3 | * [**overview**](#overview) 4 | * [**interesting papers**](#interesting-papers) 5 | - [**modeling items and users**](#interesting-papers---modeling-items-and-users) 6 | - [**deep learning**](#interesting-papers---deep-learning) 7 | - [**modeling dynamics**](#interesting-papers---modeling-dynamics) 8 | - [**interactive learning**](#interesting-papers---interactive-learning) 9 | 10 | 11 | 12 | --- 13 | ### overview 14 | 15 | ["Model-Based Machine Learning: Making Recommendations"](http://mbmlbook.com/Recommender.html) by John Winn, Christopher Bishop and Thomas Diethe 16 | 17 | ["Recommender Systems: The Textbook"](http://charuaggarwal.net/Recommender-Systems.htm) by Charu Aggarwal `book` 18 | ["Recommender Systems Handbook"](http://www.cs.ubbcluj.ro/~gabis/DocDiplome/SistemeDeRecomandare/Recommender_systems_handbook.pdf) by Ricci, Rokach, Shapira, Kantor `book` 19 | 20 | ---- 21 | 22 | [overview](https://youtube.com/watch?v=gCaOa3W9kM0) by Alex Smola `video` `2015` 23 | [overview](https://youtube.com/watch?v=xMr7I-OypVY) by Alex Smola `video` `2012` 24 | 25 | [overview](https://youtube.com/watch?v=VJOtr47V0eo) by Xavier Amatriain and Deepak Agarwal `video` `RecSys 2016` 26 | [overview](http://videolectures.net/kdd2014_amatriain_mobasher_recommender_problem) by Xavier Amatriain `video` `KDD 2014` 27 | [overview](http://technocalifornia.blogspot.ru/2014/08/introduction-to-recommender-systems-4.html) by Xavier Amatriain `video` `MLSS 2014` 28 | 29 | ["Recent Trends in Personalization: A Netflix Perspective"](https://slideslive.com/38917692/recent-trends-in-personalization-a-netflix-perspective) by Justin Basilico `video` `ICML 2019` 30 | ["Beyond Being Accurate: Solving Real-World Recommendation Problems with Neural Modeling"](https://youtube.com/watch?v=FNmh-eGYAv0) by Ed Chi from Google `video` `2020` 31 | 32 | [ACM RecSys](https://youtube.com/channel/UC2nEn-yNA1BtdDNWziphPGA) conference `video` 33 | 34 | ---- 35 | 36 | course by Sergey Nikolenko ([part 1](https://youtube.com/watch?v=mr8u54jsveA), [part 2](https://youtube.com/watch?v=cD47Ssp_Flk), [part 3](https://youtube.com/watch?v=OFyb8ukrRDo)) `video` `in russian` 37 | [course](https://youtube.com/playlist?list=PL-_cKNuVAYAWkYunGd6zKk7UxmExS-GHl) by Evgeny Sokolov `video` `in russian` 38 | 39 | [overview](https://youtube.com/watch?v=umyNVwePCtw) of recent research by Dmitry Bugaychenko `video` `in russian` 40 | [overview](https://youtube.com/watch?v=N0NUwz3xWX4) of recent research by Dmitry Ushanov `video` `in russian` 41 | 42 | [overview](https://youtube.com/watch?v=Us4KJkJiYrM) by Michael Rozner `video` `in russian` 43 | [overview](https://youtube.com/watch?v=kfhqzkcfMqI) by Konstantin Vorontsov `video` `in russian` 44 | [overview](https://youtube.com/watch?v=Te_6TqEhyTI) by Victor Kantor `video` `in russian` 45 | [overview](https://youtube.com/watch?v=5ir_fCgzfLM) by Vladimir Gulin `video` `in russian` 46 | [overview](https://youtube.com/watch?v=MLljnzsz9Dk) by Alexey Dral `video` `in russian` 47 | 48 | [overview](https://youtube.com/watch?v=CQnCioCq4C0) of applications at Yandex by Boris Sharchilev `video` `in 
russian` 49 | [overview](https://youtube.com/watch?v=iGAMPnv-0VY) of applications at Yandex by Igor Lifar and Dmitry Ushanov `video` `in russian` 50 | [overview](https://youtube.com/watch?v=OJ0nJb3LfNo) of applications at Yandex by Andrey Zimovnov `video` `in russian` 51 | [overview](https://youtube.com/watch?v=JKTneRi2vn8) of applications at Yandex by Michael Rozner `video` `in russian` 52 | 53 | [overview](https://youtube.com/watch?v=TlBDO8UgMOE) of applications at VK by Danila Savenkov `video` `in russian` 54 | [overview](https://vk.com/video-187376020_456239022?list=5e2c11518f60c8d169) of applications at VK by Oleg Lashinin `video` `in russian` 55 | [overview](https://vk.com/video-187376020_456239020?list=a28424fc797e02beaf) of applications at VK by Semyon Polyakov `video` `in russian` 56 | 57 | ---- 58 | 59 | challenges: 60 | - diversity vs accuracy 61 | - personalization vs popularity 62 | - novelty vs relevance 63 | - contextual dimensions (time) 64 | - presentation bias 65 | - explaining vs selecting items 66 | - influencing user vs predicting future 67 | 68 | 69 | 70 | --- 71 | ### interesting papers 72 | 73 | - [**modeling items and users**](#interesting-papers---modeling-items-and-users) 74 | - [**deep learning**](#interesting-papers---deep-learning) 75 | - [**modeling dynamics**](#interesting-papers---modeling-dynamics) 76 | - [**interactive learning**](#interesting-papers---interactive-learning) 77 | 78 | 79 | [**selected papers**](https://yadi.sk/d/RtAsSjLG3PhrT2) 80 | 81 | 82 | 83 | ---- 84 | #### ["On the Difficulty of Evaluating Baselines: A Study on Recommender Systems"](https://arxiv.org/abs/1905.01395) Rendle, Zhang, Koren 85 | > "Numerical evaluations with comparisons to baselines play a central role when judging research in recommender systems. In this paper, we show that running baselines properly is difficult. We demonstrate this issue on two extensively studied datasets. First, we show that results for baselines that have been used in numerous publications over the past five years for the Movielens 10M benchmark are suboptimal. With a careful setup of a vanilla matrix factorization baseline, we are not only able to improve upon the reported results for this baseline but even outperform the reported results of any newly proposed method. Secondly, we recap the tremendous effort that was required by the community to obtain high quality results for simple methods on the Netflix Prize. Our results indicate that empirical findings in research papers are questionable unless they were obtained on standardized benchmarks where baselines have been tuned extensively by the research community." 86 | 87 | - `video` (Korepanov) `in russian` 88 | 89 | 90 | 91 | --- 92 | ### interesting papers - modeling items and users 93 | 94 | 95 | #### ["Two Decades of Recommender Systems at Amazon.com"](https://www.computer.org/csdl/mags/ic/2017/03/mic2017030012.html) Smith, Linden 96 | `Amazon` 97 | > learning item-to-item similarity on offline data (e.g. item2 often bought with item1) 98 | 99 | > "problem with algorithms based on computing correlations between users and items: did you watch a movie because you liked it or because we showed it to you or both? 
requires computing causal interventions instead of correlations: p(Y|X) -> p(Y|X,do(R))" 100 | 101 | - `video` (Wilke) 102 | - `post` 103 | 104 | 105 | #### ["Real-time Personalization using Embeddings for Search Ranking at Airbnb"](https://kdd.org/kdd2018/accepted-papers/view/real-time-personalization-using-embeddings-for-search-ranking-at-airbnb) Grbovic, Cheng 106 | `Airbnb` `KDD 2018` 107 | > "Search Ranking and Recommendations are fundamental problems of crucial interest to major Internet companies, including web search engines, content publishing websites and marketplaces. However, despite sharing some common characteristics a one-size-fits-all solution does not exist in this space. Given a large difference in content that needs to be ranked, personalized and recommended, each marketplace has a somewhat unique challenge. Correspondingly, at Airbnb, a short-term rental marketplace, search and recommendation problems are quite unique, being a two-sided marketplace in which one needs to optimize for host and guest preferences, in a world where a user rarely consumes the same item twice and one listing can accept only one guest for a certain set of dates. In this paper we describe Listing and User Embedding techniques we developed and deployed for purposes of Real-time Personalization in Search Ranking and Similar Listing Recommendations, two channels that drive 99% of conversions. The embedding models were specifically tailored for Airbnb marketplace, and are able to capture guest’s short-term and long-term interests, delivering effective home listing recommendations. We conducted rigorous offline testing of the embedding models, followed by successful online tests before fully deploying them into production." 108 | 109 | - `post` 110 | - `video` 111 | - `video` (Grbovic) 112 | - `video` (31:33) (Alammar) 113 | - `slides` (Grbovic) 114 | - `post` 115 | 116 | 117 | #### ["Billion-scale Commodity Embedding for E-commerce Recommendation in Alibaba"](https://arxiv.org/abs/1803.02349) Wang et al. 118 | `Alibaba` `KDD 2018` 119 | > "Recommender systems have been the most important technology for increasing the business in Taobao, the largest online consumer-to-consumer platform in China. There are three major challenges facing RS in Taobao: scalability, sparsity and cold start. In this paper, we present our technical solutions to address these three challenges. The methods are based on a well-known graph embedding framework. We first construct an item graph from users’ behavior history, and learn the embeddings of all items in the graph. The item embeddings are employed to compute pairwise similarities between all items, which are then used in the recommendation process. To alleviate the sparsity and cold start problems, side information is incorporated into the graph embedding framework. We propose two aggregation methods to integrate the embeddings of items and the corresponding side information. Experimental results from offline experiments show that methods incorporating side information are superior to those that do not. Further, we describe the platform upon which the embedding methods are deployed and the workflow to process the billion-scale data in Taobao. Using A/B test, we show that the online Click-Through-Rates are improved comparing to the previous collaborative filtering based methods widely used in Taobao, further demonstrating the effectiveness and feasibility of our proposed methods in Taobao’s live production environment." 
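
Both this and the Airbnb paper above follow the same basic recipe: treat each user's interaction sequence (or random walks over the item co-occurrence graph) as a "sentence" of item ids, train skip-gram embeddings over those sequences, and recommend by nearest neighbours in the embedding space. A minimal sketch of that recipe on toy sessions (assumes gensim >= 4.0; the production systems add side information, weighted walks and hard negative sampling on top):

```python
from gensim.models import Word2Vec   # assumes gensim >= 4.0

# toy "sessions": ordered item ids per user (stand-ins for clicks/bookings/purchases)
sessions = [
    ["item_1", "item_7", "item_3", "item_7"],
    ["item_2", "item_3", "item_7", "item_5"],
    ["item_1", "item_5", "item_2"],
    ["item_4", "item_6", "item_4", "item_6"],
]

# skip-gram with negative sampling over item ids, exactly as for words
model = Word2Vec(
    sentences=sessions,
    vector_size=32,   # embedding dimension
    window=3,         # co-occurrence context within a session
    sg=1,             # skip-gram
    negative=5,       # negative samples per positive pair
    min_count=1,
    epochs=50,
    seed=0,
)

# item-to-item recommendation = nearest neighbours in the embedding space
print(model.wv.most_similar("item_7", topn=3))
```
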
120 | 121 | - `video` 122 | - `video` (36:56) (Alammar) 123 | 124 | 125 | #### ["Variational Autoencoders for Collaborative Filtering"](https://arxiv.org/abs/1802.05814) Liang, Krishnan, Hoffman, Jebara 126 | `Mult-VAE` ` VAE-CF` `Netflix` 127 | > "We extend variational autoencoders to collaborative filtering for implicit feedback. This non-linear probabilistic model enables us to go beyond the limited modeling capacity of linear factor models which still largely dominate collaborative filtering research. We introduce a generative model with multinomial likelihood and use Bayesian inference for parameter estimation. Despite widespread use in language modeling and economics, the multinomial likelihood receives less attention in the recommender systems literature. We introduce a different regularization parameter for the learning objective, which proves to be crucial for achieving competitive performance. Remarkably, there is an efficient way to tune the parameter using annealing. The resulting model and learning algorithm has information-theoretic connections to maximum entropy discrimination and the information bottleneck principle. Empirically, we show that the proposed approach significantly outperforms several state-of-the-art baselines, including two recently-proposed neural network approaches, on several real-world datasets. We also provide extended experiments comparing the multinomial likelihood with other commonly used likelihood functions in the latent factor collaborative filtering literature and show favorable results. Finally, we identify the pros and cons of employing a principled Bayesian inference approach and characterize settings where it provides the most significant improvements." 128 | 129 | > "Recommender systems is more of a "small data" than a "big data" problem." 130 | > "VAE generalizes linear latent factor model and recovers Gaussian matrix factorization as a special linear case. No iterative procedure required to rank all the items given a user's watch history - only need to evaluate inference and generative functions." 131 | > "We introduce a regularization parameter for the learning objective to trade-off the generative power for better predictive recommendation performance. For recommender systems, we don't necessarily need all the statistical property of a generative model. We trade off the ability of performing ancestral sampling for better fitting the data." 132 | 133 | - `video` (Liang) 134 | - `code` 135 | - `code` 136 | 137 | 138 | #### ["DropoutNet: Addressing Cold Start in Recommender Systems"](https://papers.nips.cc/paper/7081-dropoutnet-addressing-cold-start-in-recommender-systems) Volkovs, Yu, Poutanen 139 | `DropoutNet` 140 | > "Latent models have become the default choice for recommender systems due to their performance and scalability. However, research in this area has primarily focused on modeling user-item interactions, and few latent models have been developed for cold start. Deep learning has recently achieved remarkable success showing excellent results for diverse input types. Inspired by these results we propose a neural network based latent model called DropoutNet to address the cold start problem in recommender systems. Unlike existing approaches that incorporate additional content-based objective terms, we instead focus on the optimization and show that neural network models can be explicitly trained for cold start through dropout." 
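
The notes below spell out the mechanism: during training, the preference-based part of the input (e.g. user factors from a pretrained latent model) is randomly zeroed out, so the network learns to reconstruct the latent model's scores from content features alone - exactly the cold-start condition. A rough PyTorch sketch of that idea (dimensions, architecture and the toy data are illustrative assumptions, not the paper's configuration):

```python
import torch
import torch.nn as nn

class DropoutNetSketch(nn.Module):
    """Maps (pretrained preference factors, content features) to a new user embedding."""
    def __init__(self, d_pref=64, d_content=300, d_out=64, p_dropout=0.5):
        super().__init__()
        self.p_dropout = p_dropout   # probability of hiding the preference input
        self.net = nn.Sequential(
            nn.Linear(d_pref + d_content, 256), nn.ReLU(),
            nn.Linear(256, d_out),
        )

    def forward(self, u_pref, u_content):
        if self.training:
            # zero out the *entire* preference vector for a random subset of users,
            # simulating cold start; content features are always kept
            keep = (torch.rand(u_pref.size(0), 1) > self.p_dropout).float()
            u_pref = u_pref * keep
        return self.net(torch.cat([u_pref, u_content], dim=-1))

# training objective in the paper's spirit: make the transformed score reproduce
# the pretrained latent model's score, with and without the input dropout
model = DropoutNetSketch()
u_pref, u_content = torch.randn(8, 64), torch.randn(8, 300)
v_emb = torch.randn(8, 64)                 # item factors (kept fixed here for brevity)
target = (u_pref * v_emb).sum(-1)          # scores from the pretrained latent model
pred = (model(u_pref, u_content) * v_emb).sum(-1)
loss = ((pred - target) ** 2).mean()
loss.backward()
```
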
141 | 142 | > "Our approach is based on the observation that cold start is equivalent to the missing data problem where preference information is missing. Hence, instead of adding additional objective terms to model content, we modify the learning procedure to explicitly condition the model for the missing input. The key idea behind our approach is that by applying dropout to input mini-batches, we can train DNNs to generalize to missing input. By selecting an appropriate amount of dropout we show that it is possible to learn a DNN-based latent model that performs comparably to state-of-the-art on warm start while significantly outperforming it on cold start. The resulting model is simpler than most hybrid approaches and uses a single objective function, jointly optimizing all components to maximize recommendation accuracy." 143 | 144 | > "Training with dropout has a two-fold effect: pairs with dropout encourage the model to only use content information, while pairs without dropout encourage it to ignore content and simply reproduce preference input. The net effect is balanced between these two extremes. The model learns to reproduce the accuracy of the input latent model when preference data is available while also generalizing to cold start." 145 | 146 | > "An additional advantage of our approach is that it can be applied on top of any existing latent model to provide/enhance its cold start capability. This requires virtually no modification to the original model thus minimizing the implementation barrier for any production environment that’s already running latent models." 147 | 148 | - `video` (Ushanov) `in russian` 149 | 150 | 151 | #### ["Content-based Recommendations with Poisson Factorization"](http://www.cs.toronto.edu/~lcharlin/papers/GopalanCharlinBlei_nips14.pdf) Gopalan, Charlin, Blei 152 | `CTPF` 153 | > "We develop collaborative topic Poisson factorization (CTPF), a generative model of articles and reader preferences. CTPF can be used to build recommender systems by learning from reader histories and content to recommend personalized articles of interest. In detail, CTPF models both reader behavior and article texts with Poisson distributions, connecting the latent topics that represent the texts with the latent preferences that represent the readers. This provides better recommendations than competing methods and gives an interpretable latent space for understanding patterns of readership. Further, we exploit stochastic variational inference to model massive real-world datasets. For example, we can fit CTPF to the full arXiv usage dataset, which contains over 43 million ratings and 42 million word counts, within a day. We demonstrate empirically that our model outperforms several baselines, including the previous state-of-the art approach." 154 | 155 | > collaborative topic models: 156 | > - blending factorization-based and content-based recommendation 157 | > - describing user preferences with interpretable topics 158 | 159 | - `video` (26:36) (Blei) 160 | - `code` 161 | 162 | 163 | #### ["Scalable Recommendation with Hierarchical Poisson Factorization"](http://auai.org/uai2015/proceedings/papers/208.pdf) Gopalan, Hofman, Blei 164 | `HPF` 165 | > "We develop hierarchical Poisson matrix factorization, a novel method for providing users with high quality recommendations based on implicit feedback, such as views, clicks, or purchases. In contrast to existing recommendation models, HPF has a number of desirable properties. 
First, we show that HPF more accurately captures the long-tailed user activity found in most consumption data by explicitly considering the fact that users have finite attention budgets. This leads to better estimates of users’ latent preferences, and therefore superior recommendations, compared to competing methods. Second, HPF learns these latent factors by only explicitly considering positive examples, eliminating the often costly step of generating artificial negative examples when fitting to implicit data. Third, HPF is more than just one method- it is the simplest in a class of probabilistic models with these properties, and can easily be extended to include more complex structure and assumptions. We develop a variational algorithm for approximate posterior inference for HPF that scales up to large data sets, and we demonstrate its performance on a wide variety of real-world recommendation problems - users rating movies, listening to songs, reading scientific papers, and reading news articles." 166 | 167 | > discovering correlated preferences (devising new utility models and other factors such as time of day, date, in stock, customer demographic information) 168 | 169 | - `video` (Blei) 170 | - `code` 171 | 172 | 173 | #### ["Exponential Family Embeddings"](https://arxiv.org/abs/1608.00778) Rudolph, Ruiz, Mandt, Blei 174 | > "Word embeddings are a powerful approach for capturing semantic similarity among terms in a vocabulary. In this paper, we develop exponential family embeddings, a class of methods that extends the idea of word embeddings to other types of high-dimensional data. As examples, we studied neural data with real-valued observations, count data from a market basket analysis, and ratings data from a movie recommendation system. The main idea is to model each observation conditioned on a set of other observations. This set is called the context, and the way the context is defined is a modeling choice that depends on the problem. In language the context is the surrounding words; in neuroscience the context is close-by neurons; in market basket data the context is other items in the shopping cart. Each type of embedding model defines the context, the exponential family of conditional distributions, and how the latent embedding vectors are shared across data. We infer the embeddings with a scalable algorithm based on stochastic gradient descent. On all three applications - neural activity of zebrafish, users' shopping behavior, and movie ratings - we found exponential family embedding models to be more effective than other types of dimension reduction. They better reconstruct held-out data and find interesting qualitative structure." 175 | 176 | > identifying substitutes and co-purchases in high-scale consumer data (basket analysis) 177 | 178 | - `video` (Blei) 179 | - `code` 180 | - `code` 181 | 182 | 183 | #### ["E-commerce in Your Inbox: Product Recommendations at Scale"](https://arxiv.org/abs/1606.07154) Grbovic et al. 184 | `user2vec` `prod2vec` `Yahoo` 185 | > "In recent years online advertising has become increasingly ubiquitous and effective. Advertisements shown to visitors fund sites and apps that publish digital content, manage social networks, and operate e-mail services. Given such large variety of internet resources, determining an appropriate type of advertising for a given platform has become critical to financial success. Native advertisements, namely ads that are similar in look and feel to content, have had great success in news and social feeds. 
However, to date there has not been a winning formula for ads in e-mail clients. In this paper we describe a system that leverages user purchase history determined from e-mail receipts to deliver highly personalized product ads to Yahoo Mail users. We propose to use a novel neural language-based algorithm specifically tailored for delivering effective product recommendations, which was evaluated against baselines that included showing popular products and products predicted based on co-occurrence. We conducted rigorous offline testing using a large-scale product purchase data set, covering purchases of more than 29 million users from 172 e-commerce websites. Ads in the form of product recommendations were successfully tested on online traffic, where we observed a steady 9% lift in click-through rates over other ad formats in mail, as well as comparable lift in conversion rates. Following successful tests, the system was launched into production during the holiday season of 2014." 186 | 187 | - `video` (Djuric) 188 | 189 | 190 | #### ["Metadata Embeddings for User and Item Cold-start Recommendations"](https://arxiv.org/abs/1507.08439) Kula 191 | `LightFM` 192 | > "I present a hybrid matrix factorisation model representing users and items as linear combinations of their content features' latent factors. The model outperforms both collaborative and content-based models in cold-start or sparse interaction data scenarios (using both user and item metadata), and performs at least as well as a pure collaborative matrix factorisation model where interaction data is abundant. Additionally, feature embeddings produced by the model encode semantic information in a way reminiscent of word embedding approaches, making them useful for a range of related tasks such as tag recommendations." 193 | 194 | - `video` (Kula) 195 | - `code` 196 | 197 | 198 | #### ["Causal Inference for Recommendation"](http://people.hss.caltech.edu/~fde/UAI2016WS/papers/Liang.pdf) Liang, Charlin, Blei 199 | > "We develop a causal inference approach to recommender systems. Observational recommendation data contains two sources of information: which items each user decided to look at and which of those items each user liked. We assume these two types of information come from different models - the exposure data comes from a model by which users discover items to consider; the click data comes from a model by which users decide which items they like. Traditionally, recommender systems use the click data alone (or ratings data) to infer the user preferences. But this inference is biased by the exposure data, i.e., that users do not consider each item independently at random. We use causal inference to correct for this bias. On real-world data, we demonstrate that causal inference for recommender systems leads to improved generalization to new data." 200 | 201 | - `slides` (Liang) 202 | - `slides` (Blei) 203 | 204 | 205 | 206 | --- 207 | ### interesting papers - deep learning 208 | 209 | 210 | #### ["Deep Learning based Recommender System: A Survey and New Perspectives"](https://arxiv.org/abs/1707.07435) Zhang et al. 211 | > "With the ever-growing volume of online information, recommender systems have been an effective strategy to overcome such information overload. The utility of recommender systems cannot be overstated, given its widespread adoption in many web applications, along with its potential impact to ameliorate many problems related to over-choice. 
In recent years, deep learning has garnered considerable interest in many research fields such as computer vision and natural language processing, owing not only to stellar performance but also the attractive property of learning feature representations from scratch. The influence of deep learning is also pervasive, recently demonstrating its effectiveness when applied to information retrieval and recommender systems research. Evidently, the field of deep learning in recommender system is flourishing. This article aims to provide a comprehensive review of recent research efforts on deep learning based recommender systems. More concretely, we provide and devise a taxonomy of deep learning based recommendation models, along with providing a comprehensive summary of the state-of-the-art. Finally, we expand on current trends and provide new perspectives pertaining to this new exciting development of the field." 212 | 213 | 214 | #### ["Are We Really Making Much Progress? A Worrying Analysis of Recent Neural Recommendation Approaches"](https://arxiv.org/abs/1907.06902) Dacrema, Cremonesi, Jannach 215 | > "Deep learning techniques have become the method of choice for researchers working on algorithmic aspects of recommender systems. With the strongly increased interest in machine learning in general, it has, as a result, become difficult to keep track of what represents the state-of-the-art at the moment, e.g., for top-n recommendation tasks. At the same time, several recent publications point out problems in today's research practice in applied machine learning, e.g., in terms of the reproducibility of the results or the choice of the baselines when proposing new models. In this work, we report the results of a systematic analysis of algorithmic proposals for top-n recommendation tasks. Specifically, we considered 18 algorithms that were presented at top-level research conferences in the last years. Only 7 of them could be reproduced with reasonable effort. For these methods, it however turned out that 6 of them can often be outperformed with comparably simple heuristic methods, e.g., based on nearest-neighbor or graph-based techniques. The remaining one clearly outperformed the baselines but did not consistently outperform a well-tuned non-neural linear ranking method. Overall, our work sheds light on a number of potential problems in today's machine learning scholarship and calls for improved scientific practices in this area." 216 | 217 | - `video` (Lashinin) `in russian` 218 | - `video` (Bugaychenko) `in russian` 219 | - `post` `in russian` 220 | 221 | 222 | #### ["Deep Learning Recommendation Model for Personalization and Recommendation Systems"](https://arxiv.org/abs/1906.00091) Naumov et al. 223 | `DLRM` `Facebook` 224 | > "With the advent of deep learning, neural network-based recommendation models have emerged as an important tool for tackling personalization and recommendation tasks. These networks differ significantly from other deep learning networks due to their need to handle categorical features and are not well studied or understood. In this paper, we develop a state-of-the-art deep learning recommendation model and provide its implementation in both PyTorch and Caffe2 frameworks. In addition, we design a specialized parallelization scheme utilizing model parallelism on the embedding tables to mitigate memory constraints while exploiting data parallelism to scale-out compute from the fully-connected layers. 
We compare DLRM against existing recommendation models and characterize its performance on the Big Basin AI platform, demonstrating its usefulness as a benchmark for future algorithmic experimentation and system co-design." 225 | 226 | - `post` 227 | - `paper` ["The Architectural Implications of Facebook's DNN-based Personalized Recommendation"](https://arxiv.org/abs/1906.03109) by Gupta et al. 228 | 229 | 230 | #### ["Recommending What Video to Watch Next: A Multitask Ranking System"](https://daiwk.github.io/assets/youtube-multitask.pdf) Zhao et al. 231 | `YouTube` 232 | > "In this paper, we introduce a large scale multi-objective ranking system for recommending what video to watch next on an industrial video sharing platform. The system faces many real-world challenges, including the presence of multiple competing ranking objectives, as well as implicit selection biases in user feedback. To tackle these challenges, we explored a variety of soft-parameter sharing techniques such as Multi-gate Mixture-of-Experts so as to efficiently optimize for multiple ranking objectives. Additionally, we mitigated the selection biases by adopting a Wide & Deep framework. We demonstrated that our proposed techniques can lead to substantial improvements on recommendation quality on one of the world's largest video sharing platforms." 233 | 234 | - `video` (Bugaychenko) `in russian` 235 | - `notes` 236 | - `paper` [**"Wide & Deep Learning"**](#wide--deep-learning-cheng-et-al) by Cheng et al. `summary` 237 | 238 | 239 | #### ["Wide & Deep Learning"](https://arxiv.org/abs/1606.07792) Cheng et al. 240 | `Google` 241 | > "Generalized linear models with nonlinear feature transformations are widely used for large-scale regression and classification problems with sparse inputs. Memorization of feature interactions through a wide set of cross-product feature transformations are effective and interpretable, while generalization requires more feature engineering effort. With less feature engineering, deep neural networks can generalize better to unseen feature combinations through low-dimensional dense embeddings learned for the sparse features. However, deep neural networks with embeddings can over-generalize and recommend less relevant items when the user-item interactions are sparse and high-rank. In this paper, we present Wide & Deep learning---jointly trained wide linear models and deep neural networks---to combine the benefits of memorization and generalization for recommender systems. We productionized and evaluated the system on Google Play, a commercial mobile app store with over one billion active users and over one million apps. Online experiment results show that Wide & Deep significantly increased app acquisitions compared with wide-only and deep-only models. We have also open-sourced our implementation in TensorFlow." 242 | 243 | - `post` 244 | - `video` (Cheng) 245 | - `post` 246 | - `code` 247 | 248 | 249 | #### ["Deep Neural Networks for YouTube Recommendations"](http://research.google.com/pubs/pub45530.html) Covington, Adams, Sargin 250 | `YouTube` 251 | > "YouTube represents one of the largest scale and most sophisticated industrial recommendation systems in existence. In this paper, we describe the system at a high level and focus on the dramatic performance improvements brought by deep learning. The paper is split according to the classic two-stage information retrieval dichotomy: first, we detail a deep candidate generation model and then describe a separate deep ranking model. 
We also provide practical lessons and insights derived from designing, iterating and maintaining a massive recommendation system with enormous user-facing impact." 252 | 253 | > "We have described our deep neural network architecture for recommending YouTube videos, split into two distinct problems: candidate generation and ranking. Our deep collaborative filtering model is able to effectively assimilate many signals and model their interaction with layers of depth, outperforming previous matrix factorization approaches used at YouTube. There is more art than science in selecting the surrogate problem for recommendations and we found classifying a future watch to perform well on live metrics by capturing asymmetric co-watch behavior and preventing leakage of future information. Withholding discriminative signals from the classifier was also essential to achieving good results - otherwise the model would overfit the surrogate problem and not transfer well to the homepage. We demonstrated that using the age of the training example as an input feature removes an inherent bias towards the past and allows the model to represent the time-dependent behavior of popular videos. This improved offline holdout precision results and increased the watch time dramatically on recently uploaded videos in A/B testing. Ranking is a more classical machine learning problem yet our deep learning approach outperformed previous linear and tree-based methods for watch time prediction. Recommendation systems in particular benefit from specialized features describing past user behavior with items. Deep neural networks require special representations of categorical and continuous features which we transform with embeddings and quantile normalization, respectively. Layers of depth were shown to effectively model non-linear interactions between hundreds of features. Logistic regression was modified by weighting training examples with watch time for positive examples and unity for negative examples, allowing us to learn odds that closely model expected watch time. This approach performed much better on watch-time weighted ranking evaluation metrics compared to predicting click-through rate directly." 254 | 255 | - `video` (Covington) 256 | - `video` (Nada) 257 | - `notes` 258 | - `notes` 259 | - `code` 260 | 261 | 262 | #### ["Latent Cross: Making Use of Context in Recurrent Recommender Systems"](https://dl.acm.org/citation.cfm?id=3159727) Beutel et al. 263 | `YouTube` 264 | > "The success of recommender systems often depends on their ability to understand and make use of the context of the recommendation request. Significant research has focused on how time, location, interfaces, and a plethora of other contextual features affect recommendations. However, in using deep neural networks for recommender systems, researchers often ignore these contexts or incorporate them as ordinary features in the model. In this paper, we study how to effectively treat contextual data in neural recommender systems. We begin with an empirical analysis of the conventional approach to context as features in feed-forward recommenders and demonstrate that this approach is inefficient in capturing common feature crosses. We apply this insight to design a state-of-the-art RNN recommender system. We first describe our RNN-based recommender system in use at YouTube.
Next, we offer “Latent Cross,” an easy-to-use technique to incorporate contextual data in the RNN by embedding the context feature first and then performing an element-wise product of the context embedding with model’s hidden states. We demonstrate the improvement in performance by using this Latent Cross technique in multiple experimental settings." 265 | 266 | - `video` (Chi) 267 | 268 | 269 | 270 | --- 271 | ### interesting papers - modeling dynamics 272 | 273 | 274 | #### ["Recurrent Recommender Networks"](https://dl.acm.org/citation.cfm?id=3018689) Wu, Ahmed, Beutel, Smola, Jing 275 | > "Recommender systems traditionally assume that user profiles and movie attributes are static. Temporal dynamics are purely reactive, that is, they are inferred after they are observed, e.g. after a user’s taste has changed or based on hand-engineered temporal bias corrections for movies. We propose Recurrent Recommender Networks that are able to predict future behavioral trajectories. This is achieved by endowing both users and movies with a Long Short-Term Memory autoregressive model that captures dynamics, in addition to a more traditional low-rank factorization. On multiple real-world datasets, our model offers excellent prediction accuracy and it is very compact, since we need not learn latent state but rather just the state transition function." 276 | 277 | > "A common approach to practical recommender systems is to study problems of the form introduced in the Netflix contest. That is, given a set of tuples consisting of users, movies, timestamps and ratings, the goal is to find ratings for alternative combinations of the first three attributes (user, movie, time). Performance is then measured by the deviation of the prediction from the actual rating. This formulation is easy to understand and it has led to numerous highly successful approaches, such as Probabilistic Matrix Factorization, nearest neighbor based approaches, and clustering. Moreover, it is easy to define appropriate performance measures (deviation between rating estimates and true ratings over the matrix), simply by selecting a random subset of the tuples for training and the rest for testing purposes. Unfortunately, these approaches are lacking when it comes to temporal and causal aspects inherent in the data. The following examples illustrate this in some more detail: 278 | > Change in Movie Perception. Plan 9 from Outer Space has achieved cult movie status by being arguably one of the world’s worst movies. As a result of the social notion of being a movie that is so bad that it is great to watch, the perception changed over time from a truly awful movie to a popular one. To capture this appropriately, the movie attribute parameters would have to change over time to track such a trend. While maybe not quite as pronounced, similar effects hold for movie awards such as the Oscars. After all, it is much more challenging to hold a contrarian view about a critically acclaimed movie than about, say, Star Wars 1, The Phantom Menace. 279 | > Seasonal Changes. While not quite so extreme, the relative appreciation of romantic comedies, Christmas movies and summer blockbusters is seasonal. Beyond the appreciation, users are unlikely to watch movies about overweight bearded old men wearing red robes in summer. 280 | > User Interest. User’s preferences change over time. This is well established in online communities and it arguably also applies to online consumption. 
A user might take a liking to a particular actor, might discover the intricacies of a specific genre, or her interest in a particular show might wane, due to maturity or a change in lifestyle. Any such aspects render existing profiles moot, yet it is difficult to model all such changes explicitly." 281 | 282 | > "Beyond the mere need of modeling temporal evolution, evaluating ratings with the benefit of hindsight also violates basic requirements of causality. For instance, knowing that a user will have developed a liking for Pedro Almodóvar in one month in the future makes it much easier to estimate what his opinion about La Mala Educacion might be. In other words, we violate causality in our statistical analysis when we use future ratings for the benefit of estimating current reviews. It also makes it impossible to translate reported accuracies on benchmarks into meaningful assessments as to whether such a system would work well in practice. While the Netflix prize generated a flurry of research, evaluating different models’ success on future predictions is hindered by the mixed distribution of training and testing data. Rather, by having an explicit model of profile dynamics, we can predict future behavior based on current trends. A model capable of capturing the actual data distribution inherent in recommender systems needs to be able to model both the temporal dynamics within each user and movie, in addition to capturing the rating interaction between both sets. This suggests the use of latent variable models to infer the unobserved state governing their behavior." 283 | 284 | > "Nonlinear nonparametric recommender systems have proven to be somewhat elusive. In particular, nonlinear substitutes of the inner product formulation showed only limited promise in our experiments. To the best of our knowledge this is the first paper addressing movie recommendation in a fully causal and integrated fashion. That is, we believe that this is the first model which attempts to capture the dynamics of both users and movies. Moreover, our model is nonparametric. This allows us to model the data rather than having to assume a specific form of a state space." 285 | 286 | > "Recurrent Recommender Networks are very concise since we only learn the dynamics rather than the state. This is one of the key differences to typical latent variable models where considerable effort is spent on estimating the latent state." 287 | 288 | > "Experiments show that our model outperforms all others in terms of forward prediction, i.e. in the realistic scenario where we attempt to estimate future ratings given data that occurred strictly prior to the to-be-predicted ratings. We show that our model is able to capture exogenous dynamics (e.g. an Oscar award) and endogenous dynamics (e.g. Christmas movies) quite accurately. Moreover, we demonstrate that the model is able to predict changes in future user preferences accurately." 289 | 290 | 291 | #### ["Latent LSTM Allocation: Joint Clustering and Non-Linear Dynamic Modeling of Sequential Data"](http://proceedings.mlr.press/v70/zaheer17a/zaheer17a.pdf) Zaheer, Ahmed, Smola 292 | `Google` 293 | > "Recurrent neural networks, such as LSTM networks, are powerful tools for modeling sequential data like user browsing history or natural language text. However, to generalize across different user types, LSTMs require a large number of parameters, notwithstanding the simplicity of the underlying dynamics, rendering it uninterpretable, which is highly undesirable in user modeling.
The increase in complexity and parameters arises due to a large action space in which many of the actions have similar intent or topic. In this paper, we introduce Latent LSTM Allocation for user modeling combining hierarchical Bayesian models with LSTMs. In LLA, each user is modeled as a sequence of actions, and the model jointly groups actions into topics and learns the temporal dynamics over the topic sequence, instead of action space directly. This leads to a model that is highly interpretable, concise, and can capture intricate dynamics. We present an efficient Stochastic EM inference algorithm for our model that scales to millions of users/documents. Our experimental evaluations show that the proposed model compares favorably with several state-of-the-art baselines." 294 | 295 | - `video` (Zaheer) 296 | - `video` (Smola) 297 | - `paper` ["State Space LSTM Models with Particle MCMC Inference"](https://arxiv.org/abs/1711.11179) by Zheng et al. 298 | 299 | 300 | #### ["State Space LSTM Models with Particle MCMC Inference"](https://arxiv.org/abs/1711.11179) Zheng, Zaheer, Ahmed, Wang, Xing, Smola 301 | > "LSTM is one of the most powerful sequence models. Despite the strong performance, however, it lacks the nice interpretability as in state space models. In this paper, we present a way to combine the best of both worlds by introducing State Space LSTM models that generalizes the earlier work of combining topic models with LSTM. However we do not make any factorization assumptions in our inference algorithm. We present an efficient sampler based on sequential Monte Carlo method that draws from the joint posterior directly. Experimental results confirms the superiority and stability of this SMC inference algorithm on a variety of domains." 302 | 303 | 304 | 305 | --- 306 | ### interesting papers - interactive learning 307 | 308 | 309 | #### ["Making Contextual Decisions with Low Technical Debt"](http://arxiv.org/abs/1606.03966) Agarwal et al. 310 | `Microsoft Project Custom Decision` 311 | - 312 | 313 | 314 | #### ["Top-K Off-Policy Correction for a REINFORCE Recommender System"](https://arxiv.org/abs/1812.02353) Chen, Beutel, Covington, Jain, Belletti, Chi 315 | `YouTube` 316 | - 317 | 318 | 319 | #### ["Q&R: A Two-Stage Approach toward Interactive Recommendation"](http://alexbeutel.com/papers/q-and-r-kdd2018.pdf) Christakopoulou, Beutel, Li, Jain, Chi 320 | `YouTube` 321 | > "Recommendation systems, prevalent in many applications, aim to surface to users the right content at the right time. Recently, researchers have aspired to develop conversational systems that offer seamless interactions with users, more effectively eliciting user preferences and offering better recommendations. Taking a step towards this goal, this paper explores the two stages of a single round of conversation with a user: which question to ask the user, and how to use their feedback to respond with a more accurate recommendation. Following these two stages, first, we detail an RNN-based model for generating topics a user might be interested in, and then extend a state-of-the-art RNN-based video recommender to incorporate the user’s selected topic. We describe our proposed system Q&R, i.e., Question & Recommendation, and the surrogate tasks we utilize to bootstrap data for training our models. We evaluate different components of Q&R on live traffic in various applications within YouTube: User Onboarding, Home-page Recommendation, and Notifications. 
Our results demonstrate that our approach improves upon state-of-the-art recommendation models, including RNNs, and makes these applications more useful, such as a >1% increase in video notifications opened. Further, our design choices can be useful to practitioners wanting to transition to more conversational recommendation systems." 322 | 323 | > "To the best of our knowledge, this is the first work on learned interactive recommendation (i.e., asking questions and giving recommendations) demonstrated in a large-scale industrial setting. In building Q&R, we set out to improve the user experience of casual users in YouTube. Users become 18% more likely to complete the User Onboarding experience, and when they do, the numbers of topics they select goes up by 77.7%." 324 | 325 | > "We provide a novel neural-based recommendation approach, which factorizes video recommendation to a two-fold problem: user history-to-topic, and topic & user history-to-video." 326 | 327 | > "Having shed light on a single round of conversation, the area of research in industrial conversational recommendation systems seems to be wide-open for exploration, with incorporating multi-turn conversations and multiple types of data sources, as well as developing models for deciding when to trigger a conversational experience, being exciting topics to be explored in the future." 328 | 329 | - `video` 330 | - `video` (Beutel) 331 | 332 | 333 | #### ["Towards Conversational Recommender Systems"](https://chara.cs.illinois.edu/sites/fa16-cs591txt/pdf/Christakopoulou-2016-KDD.pdf) Christakopoulou, Radlinski, Hofmann 334 | > "People often ask others for restaurant recommendations as a way to discover new dining experiences. This makes restaurant recommendation an exciting scenario for recommender systems and has led to substantial research in this area. However, most such systems behave very differently from a human when asked for a recommendation. The goal of this paper is to begin to reduce this gap. In particular, humans can quickly establish preferences when asked to make a recommendation for someone they do not know. We address this cold-start recommendation problem in an online learning setting. We develop a preference elicitation framework to identify which questions to ask a new user to quickly learn their preferences. Taking advantage of latent structure in the recommendation space using a probabilistic latent factor model, our experiments with both synthetic and real world data compare different types of feedback and question selection strategies. We find that our framework can make very effective use of online user feedback, improving personalized recommendations over a static model by 25% after asking only 2 questions. Our results demonstrate dramatic benefits of starting from offline embeddings, and highlight the benefit of bandit-based explore-exploit strategies in this setting." 335 | 336 | - `video` 337 | - `video` (Christakopoulou) 338 | - `video` (Christakopoulou) 339 | --------------------------------------------------------------------------------