├── .gitignore
├── README.md
├── related_paper_queue.md
└── src
├── DGM4NLP.jpg
├── MI1.png
├── MI2.png
├── MI3.png
├── MI4.png
├── MINotes.md
├── MINotes.pdf
├── VI4NLP_Recipe.pdf
├── annotated_arae.pdf
├── roadmap.01.png
└── titlepage.jpeg
/.gitignore:
--------------------------------------------------------------------------------
1 | src/.DS_Store
2 | .DS_Store
3 | local
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 |
2 |
3 | 
4 |
5 | DGMs 4 NLP. Deep Generative Models for Natural Language Processing. A Roadmap.
6 |
7 | Yao Fu, University of Edinburgh, yao.fu@ed.ac.uk
8 |
9 | \*\*Update\*\*: [How does GPT Obtain its Ability? Tracing Emergent Abilities of Language Models to their Sources](https://yaofu.notion.site/How-does-GPT-Obtain-its-Ability-Tracing-Emergent-Abilities-of-Language-Models-to-their-Sources-b9a57ac0fcf74f30a1ab9e3e36fa1dc1)
10 |
11 | \*\*Update\*\*: [A Closer Look at Large Language Models' Emergent Abilities](https://yaofu.notion.site/A-Closer-Look-at-Large-Language-Models-Emergent-Abilities-493876b55df5479d80686f68a1abd72f)
12 |
13 | \*\*Update\*\*: [Large Language Models](#large-language-models)
14 |
15 | \*\*Update\*\*: [Long-range Dependency](#long-range-transformers); [Why S4 is Good at Long Sequence: Remembering a Sequence with Online Function Approximation](https://yaofu.notion.site/Why-S4-is-Good-at-Long-Sequence-Remembering-a-Sequence-with-Online-Function-Approximation-836fc54a49aa413b84997a265132f13f)
16 |
17 | \*\*TODO 1\*\*: Calibration; Prompting; Long-range transformers; State-space Models
18 |
19 | \*\*TODO 2\*\*: Matrix Factorization and Word embedding; Kernels; Gaussian Process
20 |
21 | \*\*TODO 3\*\*: Relationship between inference and RL;
22 |
23 |
24 |
25 | ----
26 | ## Introduction
27 |
28 | ### Prelude
29 |
30 | (written in early 2019, originated from the [DGM seminar at Columbia](http://stat.columbia.edu/~cunningham/teaching/GR8201/))
31 |
32 | Why do we want deep generative models? Because we want to learn the basic factors that generate language. Human language contains rich latent factors: the continuous ones might be emotion and intention, while the discrete/structural ones might be POS/NER tags or syntax trees. Many of them are latent because, in most cases, we only observe the sentence. They are also generative: humans produce language based on an overall idea, the current emotion, the syntax, and all the other things we can or cannot name.
33 |
34 | How do we model the generative process of language in a statistically principled way? Can we have a flexible framework that lets us incorporate explicit supervision when we have labels, add distant supervision or logical/statistical constraints when we have no labels but other prior knowledge, or simply infer whatever makes the most sense when we have neither labels nor priors? Can we exploit the modeling power of advanced neural architectures while remaining mathematically and probabilistically grounded? DGMs allow us to achieve these goals.
35 |
36 | Let us begin the journey.
37 |
38 | ### Chronology
39 | * 2013: VAE
40 | * 2014: GAN; Sequence to sequence; Attention Mechanism
41 | * 2015: Normalizing Flow; Diffusion Models
42 | * 2016: Gumbel-softmax; Google's Neural Machine Translation System (GNMT)
43 | * 2017: Transformers; ELMo
44 | * 2018: BERT
45 | * 2019: Probing and Bertology; GPT2
46 | * 2020: GPT3; Contrastive Learning; Compositional Generalization; Diffusion Models
47 | * 2021: Prompting; Score-based Generative Models;
48 | * 2022: State-space Models
49 |
50 | ## Table of Contents
51 |
52 | 
53 |
54 | - [Introduction](#introduction)
55 | - [Prelude](#prelude)
56 | - [Chronology](#chronology)
57 | - [Table of Contents](#table-of-contents)
58 | - [Resources](#resources)
59 | - [DGM Seminars](#dgm-seminars)
60 | - [Courses](#courses)
61 | - [Books](#books)
62 | - [NLP Side](#nlp-side)
63 | - [Generation](#generation)
64 | - [Decoding and Search, General](#decoding-and-search-general)
65 | - [Constrained Decoding](#constrained-decoding)
66 | - [Non-autoregressive Decoding](#non-autoregressive-decoding)
67 | - [Decoding from Pretrained Language Model](#decoding-from-pretrained-language-model)
68 | - [Structured Prediction](#structured-prediction)
69 | - [Syntax](#syntax)
70 | - [Semantics](#semantics)
71 | - [Grammar Induction](#grammar-induction)
72 | - [Compositionality](#compositionality)
73 | - [ML Side](#ml-side)
74 | - [Sampling Methods](#sampling-methods)
75 | - [Variational Inference, VI](#variational-inference-vi)
76 | - [VAEs](#vaes)
77 | - [Reparameterization](#reparameterization)
78 | - [GANs](#gans)
79 | - [Flows](#flows)
80 | - [Score-based Generative Models](#score-based-generative-models)
81 | - [Diffusion Models](#diffusion-models)
82 | - [Advanced Topics](#advanced-topics)
83 | - [Neural Architectures](#neural-architectures)
84 | - [RNNs](#rnns)
85 | - [Transformers](#transformers)
86 | - [Language Model Pretraining](#language-model-pretraining)
87 | - [Neural Network Learnability](#neural-network-learnability)
88 | - [Long-range Transformers](#long-range-transformers)
89 | - [State-Space Models](#state-space-models)
90 | - [Large Language Models](#large-language-models)
91 | - [Solutions and Frameworks for Running Large Language Models](#solutions-and-frameworks-for-running-large-language-models)
92 | - [List of Large Language Models](#list-of-large-language-models)
93 | - [Emergent Abilities](#emergent-abilities)
94 | - [Optimization](#optimization)
95 | - [Gradient Estimation](#gradient-estimation)
96 | - [Discrete Structures](#discrete-structures)
97 | - [Inference](#inference)
98 | - [Efficient Inference](#efficient-inference)
99 | - [Posterior Regularization](#posterior-regularization)
100 | - [Geometry](#geometry)
101 | - [Randomization](#randomization)
102 | - [Generalization Theory](#generalization-theory)
103 | - [Representation](#representation)
104 | - [Information Theory](#information-theory)
105 | - [Disentanglement and Interpretability](#disentanglement-and-interpretability)
106 | - [Invariance](#invariance)
107 | - [Analysis and Critiques](#analysis-and-critiques)
108 |
109 | Citation:
110 | ```
111 | @article{yao2019DGM4NLP,
112 | title = "Deep Generative Models for Natural Language Processing",
113 | author = "Yao Fu",
114 | year = "2019",
115 | url = "https://github.com/FranxYao/Deep-Generative-Models-for-Natural-Language-Processing"
116 | }
117 | ```
118 |
119 | ## Resources
120 |
121 | * [How to write Variational Inference and Generative Models for NLP: a recipe](https://github.com/FranxYao/Deep-Generative-Models-for-Natural-Language-Processing/blob/master/src/VI4NLP_Recipe.pdf). This is strongly suggested for beginners writing papers about VAEs for NLP.
122 |
123 | * A Tutorial on Deep Latent Variable Models of Natural Language ([link](https://arxiv.org/abs/1812.06834)), EMNLP 18
124 | * Yoon Kim, Sam Wiseman and Alexander M. Rush, Harvard
125 |
126 | * Latent Structure Models for NLP. ACL 2019 tutorial [link](https://deep-spin.github.io/tutorial/)
127 | * André Martins, Tsvetomila Mihaylova, Nikita Nangia, Vlad Niculae.
128 |
129 | ### DGM Seminars
130 |
131 | * Columbia STAT 8201 - [Deep Generative Models](http://stat.columbia.edu/~cunningham/teaching/GR8201/), by [John Cunningham](https://stat.columbia.edu/~cunningham/)
132 |
133 | * Stanford CS 236 - [Deep Generative Models](https://deepgenerativemodels.github.io/), by Stefano Ermon
134 |
135 | * U Toronto CS 2541 - [Differentiable Inference and Generative Models](https://www.cs.toronto.edu/~duvenaud/courses/csc2541/index.html), CS 2547 [Learning Discrete Latent Structures](https://duvenaud.github.io/learn-discrete/), CSC 2547 Fall 2019: [Learning to Search](https://duvenaud.github.io/learning-to-search/). By David Duvenaud
136 |
137 | * U Toronto STA 4273 Winter 2021 - [Minimizing Expectations](https://www.cs.toronto.edu/~cmaddis/courses/sta4273_w21/). By Chris Maddison
138 |
139 | * Berkeley CS294-158 - [Deep Unsupervised Learning](https://sites.google.com/view/berkeley-cs294-158-sp20/home). By Pieter Abbeel
140 |
141 | * Columbia STCS 8101 - [Representation Learning: A Probabilistic Perspective](http://www.cs.columbia.edu/~blei/seminar/2020-representation/index.html). By David Blei
142 |
143 | * Stanford CS324 - [Large Language Models](https://stanford-cs324.github.io/winter2022/). By Percy Liang, Tatsunori Hashimoto and Christopher Re
144 |
145 | * U Toronto CSC2541 - [Neural Net Training Dynamics](https://www.cs.toronto.edu/~rgrosse/courses/csc2541_2021/). By Roger Grosse.
146 |
147 | ### Courses
148 |
149 | The foundation of DGMs is built upon probabilistic graphical models, so we take a look at the following resources:
150 |
151 | * Blei's Foundation of Graphical Models course, STAT 6701 at Columbia ([link](http://www.cs.columbia.edu/~blei/fogm/2019F/index.html))
152 | * Foundation of probabilistic modeling, graphical models, and approximate inference.
153 |
154 | * Xing's Probabilistic Graphical Models, 10-708 at CMU ([link](https://sailinglab.github.io/pgm-spring-2019/))
155 | * A really heavy course with extensive materials.
156 | * 5 modules in total: exact inference, approximate inference, DGMs, reinforcement learning, and non-parametrics.
157 | * All the lecture notes, video recordings, and homework assignments are open-sourced.
158 |
159 | * Collins' Natural Language Processing, COMS 4995 at Columbia ([link](http://www.cs.columbia.edu/~mcollins/cs4705-spring2019/))
160 | * Many inference methods for structured models are introduced. Also take a look at related notes from [Collins' homepage](http://www.cs.columbia.edu/~mcollins/)
161 | * Also check out [bilibili](https://www.bilibili.com/video/av29608234?from=search&seid=10252913399572988135)
162 |
163 | ### Books
164 |
165 | * Pattern Recognition and Machine Learning. Christopher M. Bishop. 2006
166 | * Probably the most classic textbook
167 | * The _core part_ of this book, in my understanding, is chapters 8-13, especially chapter 10, which introduces variational inference.
168 |
169 | * Machine Learning: A Probabilistic Perspective. Kevin P. Murphy. 2012
170 | * Compared with the PRML Bishop book, this book may be used as a super-detailed handbook for various graphical models and inference methods.
171 |
172 | * Graphical Models, Exponential Families, and Variational Inference. 2008
173 | * Martin J. Wainwright and Michael I. Jordan
174 |
175 | * Linguistic Structure Prediction. 2011
176 | * Noah Smith
177 |
178 | * The Syntactic Process. 2000
179 | * Mark Steedman
180 |
181 | ----
182 |
183 |
184 | ## NLP Side
185 |
186 |
187 | ### Generation
188 |
189 | * Generating Sentences from a Continuous Space, CoNLL 15
190 | * Samuel R. Bowman, Luke Vilnis, Oriol Vinyals, Andrew M. Dai, Rafal Jozefowicz, Samy Bengio
191 |
192 | * Neural variational inference for text processing, ICML 16
193 | * Yishu Miao, Lei Yu, Phil Blunsom, Deepmind
194 |
195 | * Learning Neural Templates for Text Generation. EMNLP 2018
196 | * Sam Wiseman, Stuart M. Shieber, Alexander Rush. Harvard
197 |
198 | * Residual Energy Based Models for Text Generation. ICLR 20
199 | * Yuntian Deng, Anton Bakhtin, Myle Ott, Arthur Szlam, Marc'Aurelio Ranzato. Harvard and FAIR
200 |
201 | * Paraphrase Generation with Latent Bag of Words. NeurIPS 2019.
202 | * Yao Fu, Yansong Feng, and John P. Cunningham. Columbia
203 |
204 |
205 |
206 | ### Decoding and Search, General
207 |
208 | * Fairseq Decoding Library. [[github](https://github.com/pytorch/fairseq/blob/master/fairseq/search.py)]
209 |
210 | * Controllable Neural Text Generation [[Lil'Log](https://lilianweng.github.io/lil-log/2021/01/02/controllable-neural-text-generation.html)]
211 |
212 | * Best-First Beam Search. TACL 2020
213 | * Clara Meister, Tim Vieira, Ryan Cotterell
214 |
215 | * The Curious Case of Neural Text Degeneration. ICLR 2020
216 | * Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, Yejin Choi
217 |
218 | * Comparison of Diverse Decoding Methods from Conditional Language Models. ACL 2019
219 | * Daphne Ippolito, Reno Kriz, Maria Kustikova, João Sedoc, Chris Callison-Burch
220 |
221 | * Stochastic Beams and Where to Find Them: The Gumbel-Top-k Trick for Sampling Sequences Without Replacement. ICML 19
222 | * Wouter Kool, Herke van Hoof, Max Welling
223 |
224 | * Conditional Poisson Stochastic Beam Search. EMNLP 2021
225 | * Clara Meister, Afra Amini, Tim Vieira, Ryan Cotterell
226 |
227 | * Massive-scale Decoding for Text Generation using Lattices. 2021
228 | * Jiacheng Xu and Greg Durrett
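
A minimal sketch of one of the strategies above, nucleus (top-p) sampling from *The Curious Case of Neural Text Degeneration*, assuming a NumPy probability vector over the vocabulary; the function name is mine, not from any library:

```python
import numpy as np

def nucleus_sample(probs, p=0.9, rng=None):
    """Sample a token id from the smallest set of tokens whose cumulative probability reaches p."""
    rng = rng or np.random.default_rng()
    order = np.argsort(probs)[::-1]                                 # token ids, most probable first
    cutoff = int(np.searchsorted(np.cumsum(probs[order]), p)) + 1   # size of the nucleus
    nucleus = order[:cutoff]
    return rng.choice(nucleus, p=probs[nucleus] / probs[nucleus].sum())
```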
229 |
230 |
231 |
232 | ### Constrained Decoding
233 | * Lexically Constrained Decoding for Sequence Generation Using Grid Beam Search. ACL 2017
234 | * Chris Hokamp, Qun Liu
235 |
236 | * Fast Lexically Constrained Decoding with Dynamic Beam Allocation for Neural Machine Translation. NAACL 2018
237 | * Matt Post, David Vilar
238 |
239 | * Improved Lexically Constrained Decoding for Translation and Monolingual Rewriting. NAACL 2019
240 | * J. Edward Hu, Huda Khayrallah, Ryan Culkin, Patrick Xia, Tongfei Chen, Matt Post, Benjamin Van Durme
241 |
242 | * Towards Decoding as Continuous Optimisation in Neural Machine Translation. EMNLP 2017
243 | * Cong Duy Vu Hoang, Gholamreza Haffari and Trevor Cohn.
244 |
245 | * Gradient-guided Unsupervised Lexically Constrained Text Generation. EMNLP 2020
246 | * Lei Sha
247 |
248 | * Controlled Text Generation as Continuous Optimization with Multiple Constraints. 2021
249 | * Sachin Kumar, Eric Malmi, Aliaksei Severyn, Yulia Tsvetkov
250 |
251 | * NeuroLogic Decoding: (Un)supervised Neural Text Generation with Predicate Logic Constraints. NAACL 2021
252 | * Ximing Lu, Peter West, Rowan Zellers, Ronan Le Bras, Chandra Bhagavatula, Yejin Choi
253 |
254 | * NeuroLogic A*esque Decoding: Constrained Text Generation with Lookahead Heuristics. 2021
255 | * Ximing Lu, Sean Welleck, Peter West, Liwei Jiang, Jungo Kasai, Daniel Khashabi, Ronan Le Bras, Lianhui Qin, Youngjae Yu, Rowan Zellers, Noah A. Smith, Yejin Choi
256 |
257 | * COLD Decoding: Energy-based Constrained Text Generation with Langevin Dynamics. 2022
258 | * Lianhui Qin, Sean Welleck, Daniel Khashabi, Yejin Choi
259 |
260 |
261 | ### Non-autoregressive Decoding
262 |
263 | Note: I have not fully gone through this chapter, please give me suggestions!
264 |
265 | * Non-Autoregressive Neural Machine Translation. ICLR 2018
266 | * Jiatao Gu, James Bradbury, Caiming Xiong, Victor O.K. Li, Richard Socher
267 |
268 | * Fully Non-autoregressive Neural Machine Translation: Tricks of the Trade.
269 | * Jiatao Gu, Xiang Kong.
270 |
271 | * Fast Decoding in Sequence Models Using Discrete Latent Variables. ICML 2018
272 | * Łukasz Kaiser, Aurko Roy, Ashish Vaswani, Niki Parmar, Samy Bengio, Jakob Uszkoreit, Noam Shazeer
273 |
274 | * Cascaded Text Generation with Markov Transformers. Arxiv 20
275 | * Yuntian Deng and Alexander Rush
276 |
277 | * Glancing Transformer for Non-Autoregressive Neural Machine Translation. ACL 2021
278 | * Lihua Qian, Hao Zhou, Yu Bao, Mingxuan Wang, Lin Qiu, Weinan Zhang, Yong Yu, Lei Li
279 | * This one is now deployed at ByteDance
280 |
281 |
282 | ### Decoding from Pretrained Language Model
283 |
284 | TODO: more about it
285 |
286 | * Prompt Papers, ThuNLP ([link](https://github.com/thunlp/PromptPapers))
287 |
288 | * CTRL: A Conditional Transformer Language Model for Controllable Generation. Arxiv 2019
289 | * Nitish Shirish Keskar, Bryan McCann, Lav R. Varshney, Caiming Xiong, Richard Socher
290 |
291 | * Plug and Play Language Models: a Simple Approach to Controlled Text Generation
292 | * Sumanth Dathathri, Andrea Madotto, Janice Lan, Jane Hung, Eric Frank, Piero Molino, Jason Yosinski, Rosanne Liu
293 |
294 | ### Structured Prediction
295 |
296 | * Torch-Struct: Deep Structured Prediction Library. [github](https://github.com/harvardnlp/pytorch-struct), [paper](https://arxiv.org/abs/2002.00876), [documentation](http://nlp.seas.harvard.edu/pytorch-struct/)
297 | * Alexander M. Rush. Cornell University
298 |
299 | * An introduction to Conditional Random Fields. 2012
300 | * Charles Sutton and Andrew McCallum.
301 |
302 |
303 | * Inside-Outside and Forward-Backward Algorithms Are Just Backprop. 2016.
304 | * Jason Eisner
305 | * Learning with Fenchel-Young Losses. JMLR 2019
306 | * Mathieu Blondel, André F. T. Martins, Vlad Niculae
307 |
308 | * Structured Attention Networks. ICLR 2017
309 | * Yoon Kim, Carl Denton, Luong Hoang, Alexander M. Rush
310 |
311 | * Differentiable Dynamic Programming for Structured Prediction and Attention. ICML 2018
312 | * Arthur Mensch and Mathieu Blondel.
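
To make the "inference is just backprop" point above concrete, here is a minimal sketch (mine, in plain PyTorch, not taken from Torch-Struct) of the forward algorithm for a linear-chain CRF; backpropagating through the log-partition recovers the posterior marginals as the gradient of the unary scores:

```python
import torch

def crf_log_partition(unary, trans):
    """Forward algorithm in log space. unary: (T, K) emission scores; trans: (K, K), trans[i, j] = score(i -> j)."""
    T, K = unary.shape
    alpha = unary[0]
    for t in range(1, T):
        alpha = torch.logsumexp(alpha.unsqueeze(1) + trans, dim=0) + unary[t]
    return torch.logsumexp(alpha, dim=0)   # log Z

unary = torch.randn(5, 3, requires_grad=True)
trans = torch.randn(3, 3)
crf_log_partition(unary, trans).backward()
print(unary.grad)   # unary.grad[t, k] ~= p(y_t = k | x), i.e. the forward-backward marginals
```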
313 |
314 |
315 |
316 | ### Syntax
317 |
318 | * Recurrent Neural Network Grammars. NAACL 16
319 | * Chris Dyer, Adhiguna Kuncoro, Miguel Ballesteros, and Noah Smith.
320 |
321 | * Unsupervised Recurrent Neural Network Grammars, NAACL 19
322 | * Yoon Kim, Alexander Rush, Lei Yu, Adhiguna Kuncoro, Chris Dyer, and Gabor Melis
323 |
324 | * Differentiable Perturb-and-Parse: Semi-Supervised Parsing with a Structured Variational Autoencoder, ICLR 19
325 | * Caio Corro, Ivan Titov, Edinburgh
326 |
327 |
328 | ### Semantics
329 |
330 | * The Syntactic Process. 2000
331 | * Mark Steedman
332 |
333 | * Linguistically-Informed Self-Attention for Semantic Role Labeling. EMNLP 2018 Best paper award
334 | * Emma Strubell, Patrick Verga, Daniel Andor, David Weiss and Andrew McCallum. UMass Amherst and Google AI Language
335 |
336 | * Semantic Parsing with Semi-Supervised Sequential Autoencoders. 2016
337 | * Tomas Kocisky, Gabor Melis, Edward Grefenstette, Chris Dyer, Wang Ling, Phil Blunsom, Karl Moritz Hermann
338 |
339 | ### Grammar Induction
340 | * Grammar Induction and Unsupervised Learning, paper list. ([link](https://github.com/FranxYao/nlp-fundamental-frontier/blob/main/nlp/grammar_induction.md))
341 | * Yao Fu
342 |
343 | ### Compositionality
344 |
345 | * [Compositional Generalization in NLP](https://github.com/FranxYao/CompositionalGeneralizationNLP). Paper list
346 | * Yao Fu
347 |
348 | * Generalization without Systematicity: On the Compositional Skills of Sequence-to-Sequence Recurrent Networks. ICML 2018
349 | * Brenden Lake and Marco Baroni
350 |
351 | * Improving Text-to-SQL Evaluation Methodology. ACL 2018
352 | * Catherine Finegan-Dollak, Jonathan K. Kummerfeld, Li Zhang, Karthik Ramanathan, Sesh Sadasivam, Rui Zhang, Dragomir Radev
353 |
354 | ----
355 |
356 | ## ML Side
357 |
358 |
359 | ### Sampling Methods
360 |
361 | * Probabilistic inference using Markov chain Monte Carlo methods. 1993
362 | * Radford M Neal
363 |
364 | * Elements of Sequential Monte Carlo ([link](https://arxiv.org/abs/1903.04797))
365 | * Christian A. Naesseth, Fredrik Lindsten, Thomas B. Schön
366 |
367 | * A Conceptual Introduction to Hamiltonian Monte Carlo ([link](https://arxiv.org/abs/1701.02434))
368 | * Michael Betancourt
369 |
370 | * Candidate Sampling ([link](https://www.tensorflow.org/extras/candidate_sampling.pdf))
371 | * Google Tensorflow Blog
372 |
373 | * Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. AISTATS 2010
374 | * Michael Gutmann, Aapo Hyvärinen. University of Helsinki
375 |
376 | * A* Sampling. NIPS 2014 Best paper award
377 | * Chris J. Maddison, Daniel Tarlow, Tom Minka. University of Toronto and MSR
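
As a minimal running example of the MCMC methods listed above, a random-walk Metropolis sampler (a NumPy sketch; the target below is just an illustrative standard Gaussian):

```python
import numpy as np

def metropolis_hastings(log_prob, x0, n_steps=10_000, step_size=0.5, rng=None):
    """Random-walk Metropolis with a symmetric Gaussian proposal; log_prob is an unnormalized log density."""
    rng = rng or np.random.default_rng(0)
    x = np.asarray(x0, dtype=float)
    lp = log_prob(x)
    samples = []
    for _ in range(n_steps):
        proposal = x + step_size * rng.standard_normal(x.shape)
        lp_new = log_prob(proposal)
        if np.log(rng.uniform()) < lp_new - lp:        # accept with probability min(1, p(x') / p(x))
            x, lp = proposal, lp_new
        samples.append(x.copy())
    return np.array(samples)

samples = metropolis_hastings(lambda x: -0.5 * np.sum(x ** 2), x0=np.zeros(2))
```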
378 |
379 |
380 |
381 | ### Variational Inference, VI
382 |
383 | * Cambridge Variational Inference Reading Group ([link](http://www.statslab.cam.ac.uk/~sp825/vi.html))
384 | * Sam Power. University of Cambridge
385 |
386 | * Variational Inference: A Review for Statisticians.
387 | * David M. Blei, Alp Kucukelbir, Jon D. McAuliffe.
388 |
389 | * Stochastic Variational Inference
390 | * Matthew D. Hoffman, David M. Blei, Chong Wang, John Paisley
391 |
392 | * Variational Bayesian Inference with Stochastic Search. ICML 12
393 | * John Paisley, David Blei, Michael Jordan. Berkeley and Princeton
394 |
395 |
396 |
397 | ### VAEs
398 |
399 | * Auto-Encoding Variational Bayes, ICLR 14
400 | * Diederik P. Kingma, Max Welling
401 |
402 | * beta-VAE: Learning Basic Visual Concepts with a Constrained Variational Framework. ICLR 2017
403 | * Irina Higgins, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, Alexander Lerchner
404 |
405 | * Importance Weighted Autoencoders. ICLR 2015
406 | * Yuri Burda, Roger Grosse, Ruslan Salakhutdinov
407 |
408 | * Stochastic Backpropagation and Approximate Inference in Deep Generative Models. ICML 14
409 | * Danilo Jimenez Rezende, Shakir Mohamed, Daan Wierstra
410 | * Reparameterization w. deep gaussian models.
411 |
412 | * Semi-amortized variational autoencoders, ICML 18
413 | * Yoon Kim, Sam Wiseman, Andrew C. Miller, David Sontag, Alexander M. Rush, Harvard
414 |
415 | * Adversarially Regularized Autoencoders, ICML 18
416 | * Jake (Junbo) Zhao, Yoon Kim, Kelly Zhang, Alexander M. Rush, Yann LeCun.
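
A minimal sketch of the negative ELBO with a Gaussian encoder and Bernoulli decoder, in the spirit of Kingma & Welling above; written in PyTorch with made-up layer sizes, and assuming binarized inputs:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyVAE(nn.Module):
    def __init__(self, x_dim=784, z_dim=16, h_dim=128):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim, h_dim), nn.ReLU())
        self.mu = nn.Linear(h_dim, z_dim)
        self.logvar = nn.Linear(h_dim, z_dim)
        self.dec = nn.Sequential(nn.Linear(z_dim, h_dim), nn.ReLU(), nn.Linear(h_dim, x_dim))

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)           # reparameterization trick
        rec = F.binary_cross_entropy_with_logits(self.dec(z), x, reduction="sum") / x.size(0)
        kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp()) / x.size(0)
        return rec + kl                                                    # negative ELBO, to be minimized
```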
417 |
418 |
419 |
420 |
421 | ### Reparameterization
422 | More on reparameterization: how to reparameterize Gaussian mixtures, permutation matrices, and rejection samplers (Gamma and Dirichlet). A minimal sketch of the basic discrete relaxation is given below.
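
A minimal sketch of the Gumbel-softmax / Concrete relaxation referenced above (PyTorch; the helper name is mine, not from a library):

```python
import torch
import torch.nn.functional as F

def gumbel_softmax_sample(logits, tau=1.0, eps=1e-10):
    """Relaxed one-hot sample from Categorical(softmax(logits)); differentiable w.r.t. logits."""
    u = torch.rand_like(logits)
    gumbel = -torch.log(-torch.log(u + eps) + eps)      # Gumbel(0, 1) noise
    return F.softmax((logits + gumbel) / tau, dim=-1)   # tau -> 0 recovers discrete one-hot samples
```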
423 |
424 | * Stochastic Backpropagation through Mixture Density Distributions, Arxiv 16
425 | * Alex Graves
426 |
427 | * Reparameterization Gradients through Acceptance-Rejection Sampling Algorithms. AISTATS 2017
428 | * Christian A. Naesseth, Francisco J. R. Ruiz, Scott W. Linderman, David M. Blei
429 |
430 | * Implicit Reparameterization Gradients. NeurIPS 2018.
431 | * Michael Figurnov, Shakir Mohamed, and Andriy Mnih
432 |
433 | * Categorical Reparameterization with Gumbel-Softmax. ICLR 2017
434 | * Eric Jang, Shixiang Gu, Ben Poole
435 |
436 | * The Concrete Distribution: A Continuous Relaxation of Discrete Random Variables. ICLR 2017
437 | * Chris J. Maddison, Andriy Mnih, and Yee Whye Teh
438 |
439 | * Invertible Gaussian Reparameterization: Revisiting the Gumbel-Softmax. 2020
440 | * Andres Potapczynski, Gabriel Loaiza-Ganem, John P. Cunningham
441 |
442 | * Reparameterizable Subset Sampling via Continuous Relaxations. IJCAI 2019
443 | * Sang Michael Xie and Stefano Ermon
444 |
445 |
446 |
447 |
448 |
449 | ### GANs
450 |
451 | * Generative Adversarial Networks, NIPS 14
452 | * Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, Yoshua Bengio
453 |
454 | * Towards principled methods for training generative adversarial networks, ICLR 2017
455 | * Martin Arjovsky and Leon Bottou
456 |
457 | * Wasserstein GAN
458 | * Martin Arjovsky, Soumith Chintala, Léon Bottou
459 |
460 | * InfoGAN: Interpretable Representation Learning by Information Maximizing Generative Adversarial Nets. NIPS 2016
461 | * Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, Pieter Abbeel. UC Berkeley. OpenAI
462 |
463 | * Adversarially Learned Inference. ICLR 2017
464 | * Vincent Dumoulin, Ishmael Belghazi, Ben Poole, Olivier Mastropietro, Alex Lamb, Martin Arjovsky, Aaron Courville
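
A minimal sketch of the original objective in its non-saturating form (PyTorch; `generator` and `discriminator` are placeholder modules, with the discriminator assumed to return one logit per example):

```python
import torch
import torch.nn.functional as F

def gan_losses(discriminator, generator, real, z):
    """Non-saturating GAN losses: D maximizes log D(x) + log(1 - D(G(z))), G maximizes log D(G(z))."""
    fake = generator(z)
    ones, zeros = torch.ones(real.size(0), 1), torch.zeros(real.size(0), 1)
    d_loss = (F.binary_cross_entropy_with_logits(discriminator(real), ones)
              + F.binary_cross_entropy_with_logits(discriminator(fake.detach()), zeros))
    g_loss = F.binary_cross_entropy_with_logits(discriminator(fake), ones)
    return d_loss, g_loss
```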
465 |
466 | ### Flows
467 |
468 | * Flow Based Deep Generative Models, from [Lil's log](https://lilianweng.github.io/lil-log/2018/10/13/flow-based-deep-generative-models.html)
469 |
470 | * Variational Inference with Normalizing Flows, ICML 15
471 | * Danilo Jimenez Rezende, Shakir Mohamed
472 |
473 | * Learning About Language with Normalizing Flows
474 | * Graham Neubig, CMU, [slides](http://www.phontron.com/slides/neubig19generative.pdf)
475 |
476 | * Improved Variational Inference with Inverse Autoregressive Flow
477 | * Diederik P Kingma, Tim Salimans, Rafal Jozefowicz, Xi Chen, Ilya Sutskever, Max Welling
478 |
479 | * Density estimation using Real NVP. ICLR 17
480 | * Laurent Dinh, Jascha Sohl-Dickstein, Samy Bengio
481 |
482 | * Unsupervised Learning of Syntactic Structure with Invertible Neural Projections. EMNLP 2018
483 | * Junxian He, Graham Neubig, Taylor Berg-Kirkpatrick
484 |
485 | * Latent Normalizing Flows for Discrete Sequences. ICML 2019.
486 | * Zachary M. Ziegler and Alexander M. Rush
487 |
488 | * Discrete Flows: Invertible Generative Models of Discrete Data. 2019
489 | * Dustin Tran, Keyon Vafa, Kumar Krishna Agrawal, Laurent Dinh, Ben Poole
490 |
491 | * FlowSeq: Non-Autoregressive Conditional Sequence Generation with Generative Flow. EMNLP 2019
492 | * Xuezhe Ma, Chunting Zhou, Xian Li, Graham Neubig, Eduard Hovy
493 |
494 | * Variational Neural Machine Translation with Normalizing Flows. ACL 2020
495 | * Hendra Setiawan, Matthias Sperber, Udhay Nallasamy, Matthias Paulik. Apple
496 |
497 | * On the Sentence Embeddings from Pre-trained Language Models. EMNLP 2020
498 | * Bohan Li, Hao Zhou, Junxian He, Mingxuan Wang, Yiming Yang, Lei Li
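
A minimal sketch of the building block behind many of the flows above: a RealNVP-style affine coupling layer whose log-Jacobian-determinant is just the sum of the scales (PyTorch; layer sizes are arbitrary):

```python
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    """y1 = x1, y2 = x2 * exp(s(x1)) + t(x1); log |det J| = sum(s(x1))."""
    def __init__(self, dim, hidden=64):
        super().__init__()
        self.half = dim // 2
        self.net = nn.Sequential(nn.Linear(self.half, hidden), nn.Tanh(),
                                 nn.Linear(hidden, 2 * (dim - self.half)))

    def forward(self, x):
        x1, x2 = x[:, :self.half], x[:, self.half:]
        s, t = self.net(x1).chunk(2, dim=-1)
        y2 = x2 * torch.exp(s) + t
        return torch.cat([x1, y2], dim=-1), s.sum(dim=-1)   # transformed x and log |det Jacobian|
```

Stacking such layers (with permutations in between) and adding the log-determinants to the base log-density gives the exact log-likelihood, log p(x) = log p_z(z) + sum of log-dets.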
499 |
500 | ### Score-based Generative Models
501 | > FY: Need to see how score-based generative models and diffusion models can be used for discrete sequences
502 |
503 | * [Generative Modeling by Estimating Gradients of the Data Distribution](https://yang-song.github.io/blog/2021/score/). Blog 2021
504 | * Yang Song
505 |
506 | * [Score Based Generative Modeling Papers](https://scorebasedgenerativemodeling.github.io/)
507 | * researchers at the University of Oxford
508 |
509 | * [Generative Modeling by Estimating Gradients of the Data Distribution](https://arxiv.org/abs/1907.05600). NeurIPS 2019
510 | * Yang Song, Stefano Ermon
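
The core sampling idea, following an estimated score (the gradient of the log density) with Langevin dynamics, in a minimal sketch (PyTorch; `score_fn` stands in for a trained score network, and the noise-level annealing of Song & Ermon is omitted):

```python
import torch

def langevin_sample(score_fn, x, n_steps=100, step_size=1e-3):
    """Unadjusted Langevin dynamics: x <- x + (step / 2) * grad log p(x) + sqrt(step) * noise."""
    for _ in range(n_steps):
        x = x + 0.5 * step_size * score_fn(x) + step_size ** 0.5 * torch.randn_like(x)
    return x
```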
511 |
512 | ### Diffusion Models
513 |
514 | * [What are Diffusion Models?](https://lilianweng.github.io/lil-log/2021/07/11/diffusion-models.html) 2021
515 | * Lilian Weng
516 |
517 | * [Awesome-Diffusion-Models](https://github.com/heejkoo/Awesome-Diffusion-Models)
518 | * Heejoon Koo
519 |
520 | * [Deep Unsupervised Learning using Nonequilibrium Thermodynamics](https://arxiv.org/abs/1503.03585). 2015
521 | * Jascha Sohl-Dickstein, Eric A. Weiss, Niru Maheswaranathan, Surya Ganguli
522 |
523 | * [Denoising Diffusion Probabilistic Models](https://arxiv.org/abs/2006.11239). NeurIPS 2020
524 | * Jonathan Ho, Ajay Jain, Pieter Abbeel
525 |
526 | * [Argmax Flows and Multinomial Diffusion: Learning Categorical Distributions](https://arxiv.org/abs/2102.05379). NeurIPS 2021
527 | * Emiel Hoogeboom, Didrik Nielsen, Priyank Jaini, Patrick Forré, Max Welling
528 |
529 | * [Structured Denoising Diffusion Models in Discrete State-Spaces](https://arxiv.org/abs/2107.03006). NeurIPS 2021
530 | * Jacob Austin, Daniel D. Johnson, Jonathan Ho, Daniel Tarlow, Rianne van den Berg
531 |
532 | * [Autoregressive Diffusion Models](https://arxiv.org/abs/2110.02037). ICLR 2022
533 | * Emiel Hoogeboom, Alexey A. Gritsenko, Jasmijn Bastings, Ben Poole, Rianne van den Berg, Tim Salimans
534 |
535 | * [Diffusion-LM Improves Controllable Text Generation](https://arxiv.org/abs/2205.14217). 2022
536 | * Xiang Lisa Li, John Thickstun, Ishaan Gulrajani, Percy Liang, Tatsunori B. Hashimoto
537 |
538 | * Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding. 2022
539 | * Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed Ghasemipour, Burcu Karagol Ayan, S. Sara Mahdavi, Rapha Gontijo Lopes, Tim Salimans, Jonathan Ho, David J Fleet, Mohammad Norouzi
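
A minimal sketch of the simplified DDPM training objective (predict the injected noise), assuming PyTorch, a precomputed tensor `alphas_bar` of cumulative products of (1 - beta_t), and a placeholder noise-prediction network `eps_model(x_t, t)`:

```python
import torch
import torch.nn.functional as F

def ddpm_loss(eps_model, x0, alphas_bar):
    b = x0.size(0)
    t = torch.randint(0, len(alphas_bar), (b,))                    # random timesteps
    a_bar = alphas_bar[t].view(b, *([1] * (x0.dim() - 1)))
    eps = torch.randn_like(x0)
    x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * eps             # closed-form forward (noising) process
    return F.mse_loss(eps_model(x_t, t), eps)                      # the simple objective of Ho et al. 2020
```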
540 |
541 | ----
542 | ## Advanced Topics
543 |
544 | ### Neural Architectures
545 |
546 |
547 | #### RNNs
548 |
549 | * Ordered Neurons: Integrating Tree Structures into Recurrent Neural Networks
550 | * Yikang Shen, Shawn Tan, Alessandro Sordoni, Aaron Courville. Mila, MSR
551 |
552 | * RNNs can generate bounded hierarchical languages with optimal memory
553 | * John Hewitt, Michael Hahn, Surya Ganguli, Percy Liang, Christopher D. Manning
554 |
555 | #### Transformers
556 |
557 | * Analyzing Multi-Head Self-Attention: Specialized Heads Do the Heavy Lifting, the Rest Can Be Pruned. ACL 2019
558 | * Elena Voita, David Talbot, Fedor Moiseev, Rico Sennrich, Ivan Titov
559 |
560 | * Theoretical Limitations of Self-Attention in Neural Sequence Models. TACL 2019
561 | * Michael Hahn
562 |
563 | * Rethinking Attention with Performers. 2020
564 | * Krzysztof Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane, Tamas Sarlos, Peter Hawkins, Jared Davis, Afroz Mohiuddin, Lukasz Kaiser, David Belanger, Lucy Colwell, Adrian Weller
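
For reference when reading the analysis papers above, the operation they study, scaled dot-product attention, in a minimal PyTorch sketch:

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v, mask=None):
    """softmax(Q K^T / sqrt(d)) V over the last two dimensions."""
    scores = q @ k.transpose(-2, -1) / q.size(-1) ** 0.5
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))
    return F.softmax(scores, dim=-1) @ v
```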
565 |
566 | #### Language Model Pretraining
567 |
568 | * THUNLP: Pre-trained Language Model paper list ([link](https://github.com/thunlp/PLMpapers))
569 | * Xiaozhi Wang and Zhengyan Zhang, Tsinghua University
570 |
571 | * Tomohide Shibata's [BERT-related Papers](https://github.com/tomohideshibata/BERT-related-papers)
572 |
573 | #### Neural Network Learnability
574 | * [Neural Network Learnability](https://github.com/FranxYao/Semantics-and-Compositional-Generalization-in-Natural-Language-Processing#neural-network-learnability). Yao Fu
575 |
576 |
577 | #### Long-range Transformers
578 |
579 | * Long Range Arena: A Benchmark for Efficient Transformers
580 | * Yi Tay, Mostafa Dehghani, Samira Abnar, Yikang Shen, Dara Bahri, Philip Pham, Jinfeng Rao, Liu Yang, Sebastian Ruder, Donald Metzler
581 |
582 | #### State-Space Models
583 |
584 | * HiPPO: Recurrent Memory with Optimal Polynomial Projections. NeurIPS 2020
585 | * Albert Gu, Tri Dao, Stefano Ermon, Atri Rudra, Christopher Ré
586 |
587 | * Combining Recurrent, Convolutional, and Continuous-time Models with the Linear State Space Layer. NeurIPS 2021
588 | * Albert Gu, Isys Johnson, Karan Goel, Khaled Saab, Tri Dao, Atri Rudra, Christopher Ré
589 |
590 | * Efficiently Modeling Long Sequences with Structured State Spaces. ICLR 2022
591 | * Albert Gu, Karan Goel, and Christopher Ré
592 |
593 | * [Why S4 is Good at Long Sequence: Remembering a Sequence with Online Function Approximation.](https://yaofu.notion.site/Why-S4-is-Good-at-Long-Sequence-Remembering-a-Sequence-with-Online-Function-Approximation-836fc54a49aa413b84997a265132f13f) 2022
594 | * Yao Fu
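
The object these papers build on, stated as code: a discretized linear state-space model run as a recurrence (a NumPy sketch; S4 additionally parameterizes A with the HiPPO construction and evaluates the recurrence as a long convolution, which is not shown here):

```python
import numpy as np

def ssm_recurrence(A, B, C, u):
    """x_k = A x_{k-1} + B u_k, y_k = C x_k for a scalar input sequence u of length T."""
    x = np.zeros(A.shape[0])
    ys = []
    for u_k in u:
        x = A @ x + B * u_k
        ys.append(C @ x)
    return np.array(ys)
```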
595 |
596 |
597 | ### Large Language Models
598 |
599 | #### Solutions and Frameworks for Running Large Language Models
600 |
601 | * Serving OPT-175B using Alpa (350 GB GPU memory in total) [link](https://alpa.ai/tutorials/opt_serving.html)
602 |
603 | #### List of Large Language Models
604 |
605 | * GPT3 (175B). Language Models are Few-Shot Learners. May 2020
606 |
607 | * Megatron-Turing NLG (530B). Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, A Large-Scale Generative Language Model. Jan 2022
608 |
609 | * LaMDA (137B). LaMDA: Language Models for Dialog Applications. Jan 2022
610 |
611 | * Gopher (280B). Scaling Language Models: Methods, Analysis & Insights from Training Gopher. Dec 2021
612 |
613 | * Chinchilla (70B). Training Compute-Optimal Large Language Models. Mar 2022
614 |
615 | * PaLM (540B). PaLM: Scaling Language Modeling with Pathways. Apr 2022
616 |
617 | * OPT (175B). OPT: Open Pre-trained Transformer Language Models. May 2022
618 |
619 | * BLOOM (176B): BigScience Large Open-science Open-access Multilingual Language Model. May 2022
620 |
621 | * BlenderBot 3 (175B): a deployed conversational agent that continually learns to responsibly engage. Aug 2022
622 |
623 |
624 |
625 | #### Emergent Abilities
626 |
627 | * Scaling Laws for Neural Language Models. 2020
628 | * Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, Dario Amodei
629 |
630 | * Emergent Abilities of Large Language Models. 2022
631 | * Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, Ed H. Chi, Tatsunori Hashimoto, Oriol Vinyals, Percy Liang, Jeff Dean, William Fedus.
632 |
633 |
634 | ### Optimization
635 |
636 | #### Gradient Estimation
637 |
638 | * [Minimizing Expectations](https://www.cs.toronto.edu/~cmaddis/courses/sta4273_w21/). Chris Maddison
639 |
640 | * Monte Carlo Gradient Estimation in Machine Learning
641 | * Shakir Mohamed, Mihaela Rosca, Michael Figurnov, Andriy Mnih. DeepMind
642 |
643 | * Variational Inference for Monte Carlo Objectives. ICML 16
644 | * Andriy Mnih, Danilo J. Rezende. DeepMind
645 |
646 | * REBAR: Low-variance, unbiased gradient estimates for discrete latent variable models. NIPS 17
647 | * George Tucker, Andriy Mnih, Chris J. Maddison, Dieterich Lawson, Jascha Sohl-Dickstein. Google Brain, DeepMind, Oxford
648 |
649 | * Backpropagation Through the Void: Optimizing Control Variates for Black-box Gradient Estimation. ICLR 18
650 | * Will Grathwohl, Dami Choi, Yuhuai Wu, Geoffrey Roeder, David Duvenaud. U Toronto and Vector Institute
651 |
652 | * Backpropagating through Structured Argmax using a SPIGOT. ACL 2018 Best Paper Honorable Mention.
653 | * Hao Peng, Sam Thomson, and Noah A. Smith
654 |
655 | * Understanding the Mechanics of SPIGOT: Surrogate Gradients for Latent Structure Learning. EMNLP 2020
656 | * Tsvetomila Mihaylova, Vlad Niculae, and André F. T. Martins
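
A minimal sketch contrasting the two basic estimators discussed throughout this subsection, the score-function (REINFORCE) estimator and the pathwise (reparameterization) estimator, on a toy objective E_{z ~ N(mu, 1)}[z^2], whose true gradient is 2 mu (PyTorch):

```python
import torch

mu = torch.tensor(1.0, requires_grad=True)

# Score-function (REINFORCE): grad E[f(z)] = E[f(z) * grad log q(z | mu)]
z = (mu + torch.randn(100_000)).detach()                           # samples with no pathwise gradient
surrogate = (z ** 2 * torch.distributions.Normal(mu, 1.0).log_prob(z)).mean()
grad_sf, = torch.autograd.grad(surrogate, mu)

# Pathwise (reparameterized): z = mu + eps, eps ~ N(0, 1)
eps = torch.randn(100_000)
grad_rp, = torch.autograd.grad(((mu + eps) ** 2).mean(), mu)

print(grad_sf.item(), grad_rp.item())   # both approach 2.0; the pathwise estimate has far lower variance
```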
657 |
658 |
659 |
660 | #### Discrete Structures
661 |
662 | * Learning with Differentiable Perturbed Optimizers. NeurIPS 2020
663 | * Quentin Berthet, Mathieu Blondel, Olivier Teboul, Marco Cuturi, Jean-Philippe Vert, Francis Bach
664 |
665 | * Gradient Estimation with Stochastic Softmax Tricks. NeurIPS 2020
666 | * Max B. Paulus, Dami Choi, Daniel Tarlow, Andreas Krause, Chris J. Maddison.
667 |
668 | * Differentiable Dynamic Programming for Structured Prediction and Attention. ICML 18
669 | * Arthur Mensch, Mathieu Blondel. Inria Parietal and NTT Communication Science Laboratories
670 |
671 | * Stochastic Optimization of Sorting Networks via Continuous Relaxations
672 | * Aditya Grover, Eric Wang, Aaron Zweig, Stefano Ermon
673 |
674 | * Differentiable Ranks and Sorting using Optimal Transport
675 | * Marco Cuturi, Olivier Teboul, Jean-Philippe Vert
676 |
677 | * Reparameterizing the Birkhoff Polytope for Variational Permutation Inference. AISTATS 2018
678 | * Scott W. Linderman, Gonzalo E. Mena, Hal Cooper, Liam Paninski, John P. Cunningham.
679 |
680 | * A Regularized Framework for Sparse and Structured Neural Attention. NeurIPS 2017
681 |
682 | * SparseMAP: Differentiable Sparse Structured Inference. ICML 2018
683 |
684 |
685 | ### Inference
686 |
687 | * Topics in Advanced Inference. Yingzhen Li. ([Link](http://yingzhenli.net/home/pdf/topics_approx_infer.pdf))
688 |
689 | #### Efficient Inference
690 |
691 | * Nested Named Entity Recognition with Partially-Observed TreeCRFs. AAAI 2021
692 | * Yao Fu, Chuanqi Tan, Mosha Chen, Songfang Huang, Fei Huang
693 |
694 | * Rao-Blackwellized Stochastic Gradients for Discrete Distributions. ICML 2019.
695 | * Runjing Liu, Jeffrey Regier, Nilesh Tripuraneni, Michael I. Jordan, Jon McAuliffe
696 |
697 | * Efficient Marginalization of Discrete and Structured Latent Variables via Sparsity. NeurIPS 2020
698 | * Gonçalo M. Correia, Vlad Niculae, Wilker Aziz, André F. T. Martins
699 |
700 |
701 | #### Posterior Regularization
702 |
703 | * Posterior Regularization for Structured Latent Variable Models. JMLR 2010
704 | * Kuzman Ganchev, João Graça, Jennifer Gillenwater, Ben Taskar.
705 |
706 | * Posterior Control of Blackbox Generation. 2019
707 | * Xiang Lisa Li and Alexander M. Rush.
708 |
709 | * Dependency Grammar Induction with a Neural Variational Transition-based Parser. AAAI 2019
710 | * Bowen Li, Jianpeng Cheng, Yang Liu, Frank Keller
711 |
712 |
713 | ### Geometry
714 |
715 | * (In Chinese) 微分几何与拓扑学简明教程 (A Short Course in Differential Geometry and Topology)
716 | * Mishchenko and Fomenko (米先珂, 福明珂)
717 |
718 | * Only Bayes Should Learn a Manifold (On the Estimation of Differential Geometric Structure from Data). Arxiv 2018
719 | * Soren Hauberg
720 |
721 | * The Riemannian Geometry of Deep Generative Models. CVPRW 2018
722 | * Hang Shao, Abhishek Kumar, P. Thomas Fletcher
723 |
724 | * The Geometry of Deep Generative Image Models and Its Applications. ICLR 2021
725 | * Binxu Wang and Carlos R. Ponce
726 |
727 | * Metrics for Deep Generative Models. AISTATS 2017
728 | * Nutan Chen, Alexej Klushyn, Richard Kurle, Xueyan Jiang, Justin Bayer, Patrick van der Smagt
729 |
730 | * First-Order Algorithms for Min-Max Optimization in Geodesic Metric Spaces. 2022
731 | * Michael I. Jordan, Tianyi Lin, Emmanouil V. Vlatakis-Gkaragkounis
732 |
733 | ### Randomization
734 |
735 | * Random Features for Large-Scale Kernel Machines. NeurIPS 2007
736 | * Ali Rahimi, Benjamin Recht
737 |
738 | * Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM 2011
739 | * Nathan Halko, Per-Gunnar Martinsson, Joel A. Tropp
740 |
741 | * Efficient optimization of loops and limits with randomized telescoping sums. ICML 2019
742 | * Alex Beatson, Ryan P Adams
743 |
744 | * Telescoping Density-Ratio Estimation. NeurIPS 2020
745 | * Benjamin Rhodes, Kai Xu, Michael U. Gutmann
746 |
747 | * Bias-Free Scalable Gaussian Processes via Randomized Truncations. ICML 2021
748 | * Andres Potapczynski, Luhuan Wu, Dan Biderman, Geoff Pleiss, John P Cunningham
749 |
750 | * Randomized Automatic Differentiation. ICLR 2021
751 | * Deniz Oktay, Nick McGreivy, Joshua Aduol, Alex Beatson, Ryan P. Adams
752 |
753 | * Scaling Structured Inference with Randomization. 2021
754 | * Yao Fu, John Cunningham, Mirella Lapata
755 |
756 |
757 |
758 | ### Generalization Theory
759 |
760 | * CS229T. Statistical Learning Theory. 2016
761 | * Percy Liang
762 |
763 |
764 | ### Representation
765 |
766 | #### Information Theory
767 |
768 | * Elements of Information Theory. Cover and Thomas. 1991
769 |
770 | * On Variational Bounds of Mutual Information. ICML 2019
771 | * Ben Poole, Sherjil Ozair, Aaron van den Oord, Alexander A. Alemi, George Tucker
772 | * A comprehensive discussion of all these MI variational bounds
773 |
774 | * Learning Deep Representations By Mutual Information Estimation And Maximization. ICLR 2019
775 | * R Devon Hjelm, Alex Fedorov, Samuel Lavoie-Marchildon, Karan Grewal, Phil Bachman, Adam Trischler, and Yoshua Bengio
776 | * A detailed comparison between different MI estimators, section 3.2.
777 |
778 | * MINE: Mutual Information Neural Estimation
779 | * R Devon Hjelm, Alex Fedorov, Samuel Lavoie-Marchildon, Karan Grewal, Phil Bachman, Adam Trischler, Yoshua Bengio
780 |
781 | * Deep Variational Information Bottleneck. ICLR 2017
782 | * Alexander A. Alemi, Ian Fischer, Joshua V. Dillon, Kevin Murphy. Google Research
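
One of the bounds compared in Poole et al. above, the InfoNCE lower bound, in a minimal sketch (PyTorch; `scores` is an (N, N) critic matrix with positive pairs on the diagonal):

```python
import math
import torch
import torch.nn.functional as F

def infonce_lower_bound(scores):
    """I(X; Y) >= log N - cross_entropy(row-wise softmax of the critic, diagonal labels)."""
    n = scores.size(0)
    return math.log(n) - F.cross_entropy(scores, torch.arange(n))
```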
783 |
784 |
785 |
786 | #### Disentanglement and Interpretability
787 |
788 | * Identifying Bayesian Mixture Models
789 | * Michael Betancourt
790 |
791 | * Disentangling Disentanglement in Variational Autoencoders. ICML 2019
792 | * Emile Mathieu, Tom Rainforth, N. Siddharth, Yee Whye Teh
793 |
794 | * Challenging Common Assumptions in the Unsupervised Learning of Disentangled Representations. ICML 2019
795 | * Francesco Locatello, Stefan Bauer, Mario Lucic, Gunnar Rätsch, Sylvain Gelly, Bernhard Schölkopf, Olivier Bachem
796 |
797 |
798 |
799 |
800 |
801 | #### Invariance
802 |
803 | * Emergence of Invariance and Disentanglement in Deep Representations
804 | * Alessandro Achille and Stefano Soatto. UCLA. JMLR 2018
805 |
806 | * Invariant Risk Minimization
807 | * Martin Arjovsky, Leon Bottou, Ishaan Gulrajani, David Lopez-Paz. 2019.
808 |
809 |
810 |
811 |
812 |
813 |
814 |
815 |
816 |
817 |
818 | ### Analysis and Critiques
819 |
820 | * Fixing a Broken ELBO. ICML 2018.
821 | * Alexander A. Alemi, Ben Poole, Ian Fischer, Joshua V. Dillon, Rif A. Saurous, Kevin Murphy
822 |
823 | * Tighter Variational Bounds are Not Necessarily Better. ICML 2018
824 | * Tom Rainforth, Adam R. Kosiorek, Tuan Anh Le, Chris J. Maddison, Maximilian Igl, Frank Wood, Yee Whye Teh
825 |
826 | * The continuous Bernoulli: fixing a pervasive error in variational autoencoders. NeurIPS 2019
827 | * Gabriel Loaiza-Ganem and John P. Cunningham. Columbia.
828 |
829 | * Do Deep Generative Models Know What They Don't Know? ICLR 2019
830 | * Eric Nalisnick, Akihiro Matsukawa, Yee Whye Teh, Dilan Gorur, Balaji Lakshminarayanan
831 |
832 | * Effective Estimation of Deep Generative Language Models. ACL 2020
833 | * Tom Pelsmaeker and Wilker Aziz. University of Edinburgh and University of Amsterdam
834 |
835 | * How Good is the Bayes Posterior in Deep Neural Networks Really? ICML 2020
836 | * Florian Wenzel, Kevin Roth, Bastiaan S. Veeling, Jakub Świątkowski, Linh Tran, Stephan Mandt, Jasper Snoek, Tim Salimans, Rodolphe Jenatton, Sebastian Nowozin
837 |
838 | * A statistical theory of cold posteriors in deep neural networks. ICLR 2021
839 | * Laurence Aitchison
840 |
841 | * Limitations of Autoregressive Models and Their Alternatives. NAACL 2021
842 | * Chu-Cheng Lin, Aaron Jaech, Xin Li, Matthew R. Gormley, Jason Eisner
843 |
844 |
--------------------------------------------------------------------------------
/related_paper_queue.md:
--------------------------------------------------------------------------------
1 | * Latent Variable Model for Multi-modal Translation. ACL 19
2 | * Iacer Calixto, Miguel Rios and Wilker Aziz
3 |
4 | * Interpretable Neural Predictions with Differentiable Binary Variables. ACL 2019
5 | * Joost Bastings, Wilker Aziz and Ivan Titov.
6 |
7 | * Lagging Inference Networks and Posterior Collapse in Variational Autoencoders, ICLR 19
8 | * Junxian He, Daniel Spokoyny, Graham Neubig, Taylor Berg-Kirkpatrick
9 |
10 | * Spherical Latent Spaces for Stable Variational Autoencoders, EMNLP 18
11 | * Jiacheng Xu and Greg Durrett, UT Austin
12 |
13 | * Avoiding Latent Variable Collapse with Generative Skip Models, AISTATS 19
14 | * Adji B. Dieng, Yoon Kim, Alexander M. Rush, David M. Blei
15 |
16 | * The Annotated Gumbel-softmax. Yao Fu. 2020 ([link](https://github.com/FranxYao/Annotated-Gumbel-Softmax-and-Score-Function))
17 |
18 | * Continuous Hierarchical Representations with Poincaré Variational Auto-Encoders
19 | * Emile Mathieu, Charline Le Lan, Chris J. Maddison, Ryota Tomioka, Yee Whye Teh
20 |
21 |
22 | * Direct Optimization through arg max for Discrete Variational Auto-Encoder
23 | * Guy Lorberbom, Andreea Gane, Tommi Jaakkola, Tamir Hazan
24 |
25 |
26 |
27 | * My [notes on mutual information](src/MINotes.md). Yao Fu, 2019. [pdf](src/MINotes.pdf)
28 | * Basics of information theory
29 |
30 | * Generating Informative and Diverse Conversational Responses via Adversarial Information Maximization, NIPS 18
31 | * Yizhe Zhang, Michel Galley, Jianfeng Gao, Zhe Gan, Xiujun Li, Chris Brockett, Bill Dolan
32 |
33 | * Discovering Discrete Latent Topics with Neural Variational Inference, ICML 17
34 | * Yishu Miao, Edward Grefenstette, Phil Blunsom. Oxford
35 |
36 | * TopicRNN: A Recurrent Neural Network with Long-Range Semantic Dependency, ICLR 17
37 | * Adji B. Dieng, Chong Wang, Jianfeng Gao, John William Paisley
38 |
39 | * Topic Aware Neural Response Generation, AAAI 17
40 | * Chen Xing, Wei Wu, Yu Wu, Jie Liu, Yalou Huang, Ming Zhou, Wei-Ying Ma
--------------------------------------------------------------------------------
/src/DGM4NLP.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/FranxYao/Deep-Generative-Models-for-Natural-Language-Processing/2f9f98fcf1da5a81dea9f2f796e8e640457f591f/src/DGM4NLP.jpg
--------------------------------------------------------------------------------
/src/MI1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/FranxYao/Deep-Generative-Models-for-Natural-Language-Processing/2f9f98fcf1da5a81dea9f2f796e8e640457f591f/src/MI1.png
--------------------------------------------------------------------------------
/src/MI2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/FranxYao/Deep-Generative-Models-for-Natural-Language-Processing/2f9f98fcf1da5a81dea9f2f796e8e640457f591f/src/MI2.png
--------------------------------------------------------------------------------
/src/MI3.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/FranxYao/Deep-Generative-Models-for-Natural-Language-Processing/2f9f98fcf1da5a81dea9f2f796e8e640457f591f/src/MI3.png
--------------------------------------------------------------------------------
/src/MI4.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/FranxYao/Deep-Generative-Models-for-Natural-Language-Processing/2f9f98fcf1da5a81dea9f2f796e8e640457f591f/src/MI4.png
--------------------------------------------------------------------------------
/src/MINotes.md:
--------------------------------------------------------------------------------
1 | # Mutual Information Estimation and Representation Learning
2 |
3 |
5 |
6 |
8 |
9 |
11 |
12 |
14 |
--------------------------------------------------------------------------------
/src/MINotes.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/FranxYao/Deep-Generative-Models-for-Natural-Language-Processing/2f9f98fcf1da5a81dea9f2f796e8e640457f591f/src/MINotes.pdf
--------------------------------------------------------------------------------
/src/VI4NLP_Recipe.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/FranxYao/Deep-Generative-Models-for-Natural-Language-Processing/2f9f98fcf1da5a81dea9f2f796e8e640457f591f/src/VI4NLP_Recipe.pdf
--------------------------------------------------------------------------------
/src/annotated_arae.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/FranxYao/Deep-Generative-Models-for-Natural-Language-Processing/2f9f98fcf1da5a81dea9f2f796e8e640457f591f/src/annotated_arae.pdf
--------------------------------------------------------------------------------
/src/roadmap.01.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/FranxYao/Deep-Generative-Models-for-Natural-Language-Processing/2f9f98fcf1da5a81dea9f2f796e8e640457f591f/src/roadmap.01.png
--------------------------------------------------------------------------------
/src/titlepage.jpeg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/FranxYao/Deep-Generative-Models-for-Natural-Language-Processing/2f9f98fcf1da5a81dea9f2f796e8e640457f591f/src/titlepage.jpeg
--------------------------------------------------------------------------------