├── .gitignore
├── MISC.md
├── PAPERS.md
├── PAPERS2017.md
├── PAPERS2018.md
├── PAPERS2019.md
├── README.md
├── formatter.py
└── requirements.txt


/.gitignore:
--------------------------------------------------------------------------------
 1 | # Created by .ignore support plugin (hsz.mobi)
 2 | __pycache__/
 3 | *.py[cod]
 4 | *$py.class
 5 | *.so
 6 | .Python
 7 | env/
 8 | build/
 9 | .idea/
10 | develop-eggs/
11 | dist/
12 | downloads/
13 | eggs/
14 | .eggs/
15 | lib/
16 | lib64/
17 | parts/
18 | sdist/
19 | var/
20 | *.egg-info/
21 | .installed.cfg
22 | *.egg
23 | *.manifest
24 | *.spec
25 | pip-log.txt
26 | pip-delete-this-directory.txt
27 | htmlcov/
28 | .tox/
29 | .coverage
30 | .coverage.*
31 | .cache
32 | nosetests.xml
33 | coverage.xml
34 | *,cover
35 | .hypothesis/
36 | *.mo
37 | *.pot
38 | *.log
39 | local_settings.py
40 | instance/
41 | .webassets-cache
42 | .scrapy
43 | docs/_build/
44 | target/
45 | .ipynb_checkpoints
46 | .python-version
47 | celerybeat-schedule
48 | .env
49 | venv/
50 | ENV/
51 | .spyderproject
52 | .ropeproject
53 | gh-md-toc
54 | 


--------------------------------------------------------------------------------
/MISC.md:
--------------------------------------------------------------------------------
  1 | 
  2 | Table of Contents
  3 | =================
  4 | 
  5 |   * [Miscellaneous](#miscellaneous)
  6 |     * [Lecture notes](#lecture-notes)
  7 |       * [Monte Carlo Methods and Importance Sampling](#monte-carlo-methods-and-importance-sampling)
  8 |       * [Kernel Canonical Correlation Analysis](#kernel-canonical-correlation-analysis)
  9 |       * [The Matrix Calculus You Need For Deep Learning](#the-matrix-calculus-you-need-for-deep-learning)
 10 |     * [Blueprints](#blueprints)
 11 |       * [In\-Datacenter Performance Analysis of a Tensor Processing Unit​](#in-datacenter-performance-analysis-of-a-tensor-processing-unit)
 12 |       * [TensorFlow: Large\-Scale Machine Learning on Heterogeneous Distributed Systems](#tensorflow-large-scale-machine-learning-on-heterogeneous-distributed-systems)
 13 |       * [DyNet: The Dynamic Neural Network Toolkit](#dynet-the-dynamic-neural-network-toolkit)
 14 |       * [AllenNLP: A Deep Semantic Natural Language Processing Platform](#allennlp-a-deep-semantic-natural-language-processing-platform)
 15 |     * [Reports/Surveys](#reportssurveys)
 16 |       * [Best Practices for Applying Deep Learning to Novel Applications](#best-practices-for-applying-deep-learning-to-novel-applications)
 17 |       * [Automatic Keyword Extraction for Text Summarization: A Survey](#automatic-keyword-extraction-for-text-summarization-a-survey)
 18 |       * [Factorization tricks for LSTM networks](#factorization-tricks-for-lstm-networks)
 19 |       * [Symbolic, Distributed and Distributional Representations for Natural Language Processing in the Era of Deep Learning: a Survey](#symbolic-distributed-and-distributional-representations-for-natural-language-processing-in-the-era-of-deep-learning-a-survey)
 20 |       * [Deep Reinforcement Learning: An Overview](#deep-reinforcement-learning-an-overview)
 21 |       * [Algorithms for multi\-armed bandit problems](#algorithms-for-multi-armed-bandit-problems)
 22 |       * [A comparison of Extrinsic Clustering Evaluation Metrics based on Formal Constraints](#a-comparison-of-extrinsic-clustering-evaluation-metrics-based-on-formal-constraints)
 23 |       * [A Survey on Dialogue Systems: Recent Advances and New Frontiers](#a-survey-on-dialogue-systems-recent-advances-and-new-frontiers)
 24 |     * [Other](#other)
 25 |       * [Living Together: Mind and Machine Intelligence](#living-together-mind-and-machine-intelligence)
 26 |       * [Machine Teaching: A New Paradigm for Building Machine Learning Systems](#machine-teaching-a-new-paradigm-for-building-machine-learning-systems)
 27 |     * [Tutorial](#tutorial)
 28 |       * [NIPS 2016 Tutorial: Generative Adversarial Networks](#nips-2016-tutorial-generative-adversarial-networks)
 29 | 
 30 | Miscellaneous
 31 | =============
 32 | ## Lecture notes
 33 | ### Monte Carlo Methods and Importance Sampling
 34 | 
 35 | **Authors:** Eric C. Anderson, E. A. Thompson
 36 | 
 37 | **Abstract:** Lecture Notes for Stat 578c Statistical Genetics
 38 | 
 39 | **URL:** http://ib.berkeley.edu/labs/slatkin/eriq/classes/guest_lect/mc_lecture_notes.pdf
 40 | 
 41 | **Notes:** simple explanation for importance sampling in stats, IS in softmax is coming from here
 42 | 
 43 | ### Kernel Canonical Correlation Analysis
 44 | 
 45 | **Authors:** Max Welling
 46 | 
 47 | **Abstract:** Kernel Canonical Correlation Analysis
 48 | 
 49 | **URL:** http://www.ics.uci.edu/~welling/classnotes/papers_class/kCCA.pdf
 50 | 
 51 | **Notes:** explanation of kernel Canonical Correlation Analysis from Max Welling
 52 | 
 53 | ### The Matrix Calculus You Need For Deep Learning
 54 | 
 55 | **Authors:** Terence Parr, Jeremy Howard
 56 | 
 57 | **Abstract:** This paper is an attempt to explain all the matrix calculus you need in order to understand the training of deep neural networks. We assume no math knowledge beyond what you learned in calculus 1, and provide links to help you refresh the necessary math where needed. Note that you do not need to understand this material before you start learning to train and use deep learning in practice; rather, this material is for those who are already familiar with the basics of neural networks, and wish to deepen their understanding of the underlying math. Don't worry if you get stuck at some point along the way---just go back and reread the previous section, and try writing down and working through some examples. And if you're still stuck, we're happy to answer your questions in the Theory category at forums.fast.ai. Note: There is a reference section at the end of the paper summarizing all the key matrix calculus rules and terminology discussed here.
 58 | 
 59 | **URL:** https://arxiv.org/abs/1802.01528
 60 | 
 61 | **Notes:** self-explaining title: The Matrix Calculus You Need For Deep Learning
 62 | 
 63 | ## Blueprints
 64 | ### In-Datacenter Performance Analysis of a Tensor Processing Unit​
 65 | 
 66 | **Authors:** Norman P. Jouppi et al.
 67 | 
 68 | **Abstract:** Many architects believe that major improvements in cost-energy-performance must now come from domain-specific hardware. This paper evaluates a custom ASIC—called a ​Tensor Pro​cessing Unit (TPU)— deployed in datacenters since 2015 that accelerates the inference phase of neural networks (NN). The heart of the TPU is a 65,536 8-bit MAC matrix multiply unit that offers a peak throughput of 92 TeraOps/second (TOPS) and a large (28 MiB) software-managed on-chip memory. The TPU’s deterministic execution model is a better match to the 99th-percentile response-time requirement of our NN applications than are the time-varying optimizations of CPUs and GPUs (caches, out-of-order execution, multithreading, multiprocessing, prefetching, ...) that help average throughput more than guaranteed latency. The lack of such features helps explain why, despite having myriad MACs and a big memory, the TPU is relatively small and low power. We compare the TPU to a server-class Intel Haswell CPU and an Nvidia K80 GPU, which are contemporaries deployed in the same datacenters. Our workload, written in the high-level TensorFlow framework, uses production NN applications (MLPs, CNNs, and LSTMs) that represent 95% of our datacenters’ NN inference demand. Despite low utilization for some applications, the TPU is on average about 15X - 30X faster than its contemporary GPU or CPU, with TOPS/Watt about 30X - 80X higher. Moreover, using the GPU’s GDDR5 memory in the TPU would triple achieved TOPS and raise TOPS/Watt to nearly 70X the GPU and 200X the CPU.
 69 | 
 70 | **URL:** https://drive.google.com/file/d/0Bx4hafXDDq2EMzRNcy1vSUxtcEk/view
 71 | 
 72 | **Notes:** a blueprint about new Google TPUs; fascinating future of Deep Learning
 73 | 
 74 | ### TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems
 75 | 
 76 | **Authors:** Martín Abadi et al.
 77 | 
 78 | **Abstract:** TensorFlow is an interface for expressing machine learning algorithms, and an implementation for executing such algorithms. A computation expressed using TensorFlow can be executed with little or no change on a wide variety of heterogeneous systems, ranging from mobile devices such as phones and tablets up to large-scale distributed systems of hundreds of machines and thousands of computational devices such as GPU cards. The system is flexible and can be used to express a wide variety of algorithms, including training and inference algorithms for deep neural network models, and it has been used for conducting research and for deploying machine learning systems into production across more than a dozen areas of computer science and other fields, including speech recognition, computer vision, robotics, information retrieval, natural language processing, geographic information extraction, and computational drug discovery. This paper describes the TensorFlow interface and an implementation of that interface that we have built at Google. The TensorFlow API and a reference implementation were released as an open-source package under the Apache 2.0 license in November, 2015 and are available at www.tensorflow.org.
 79 | 
 80 | **URL:** http://download.tensorflow.org/paper/whitepaper2015.pdf
 81 | 
 82 | **Notes:** long time missed here blueprint on Tensorflow
 83 | 
 84 | ### DyNet: The Dynamic Neural Network Toolkit
 85 | 
 86 | **Authors:** Graham Neubig, Chris Dyer, Yoav Goldberg, Austin Matthews, Waleed Ammar, Antonios Anastasopoulos, Miguel Ballesteros, David Chiang, Daniel Clothiaux, Trevor Cohn, Kevin Duh, Manaal Faruqui, Cynthia Gan, Dan Garrette, Yangfeng Ji, Lingpeng Kong, Adhiguna Kuncoro, Gaurav Kumar, Chaitanya Malaviya, Paul Michel, Yusuke Oda, Matthew Richardson, Naomi Saphra, Swabha Swayamdipta, Pengcheng Yin
 87 | 
 88 | **Abstract:** We describe DyNet, a toolkit for implementing neural network models based on dynamic declaration of network structure. In the static declaration strategy that is used in toolkits like Theano, CNTK, and TensorFlow, the user first defines a computation graph (a symbolic representation of the computation), and then examples are fed into an engine that executes this computation and computes its derivatives. In DyNet's dynamic declaration strategy, computation graph construction is mostly transparent, being implicitly constructed by executing procedural code that computes the network outputs, and the user is free to use different network structures for each input. Dynamic declaration thus facilitates the implementation of more complicated network architectures, and DyNet is specifically designed to allow users to implement their models in a way that is idiomatic in their preferred programming language (C++ or Python). One challenge with dynamic declaration is that because the symbolic computation graph is defined anew for every training example, its construction must have low overhead. To achieve this, DyNet has an optimized C++ backend and lightweight graph representation. Experiments show that DyNet's speeds are faster than or comparable with static declaration toolkits, and significantly faster than Chainer, another dynamic declaration toolkit. DyNet is released open-source under the Apache 2.0 license and available at this http URL
 89 | 
 90 | **URL:** https://arxiv.org/abs/1701.03980
 91 | 
 92 | **Notes:** The paper has remarkable list of authors - DeepMind, Google, IBM Watson, CMU, AI2 & MSR... New DNN framework.
 93 | 
 94 | ### AllenNLP: A Deep Semantic Natural Language Processing Platform
 95 | 
 96 | **Authors:** Matt Gardner, Joel Grus, Mark Neumann, Oyvind Tafjord, Pradeep Dasigi, Nelson Liu, Matthew Peters, Michael Schmitz, Luke Zettlemoyer
 97 | 
 98 | **Abstract:** This paper describes AllenNLP, a platform for research on deep learning methods in natural language understanding. AllenNLP is designed to support researchers who want to build novel language understanding models quickly and easily. It is built on top of PyTorch, allowing for dynamic computation graphs, and provides (1) a flexible data API that handles intelligent batching and padding, (2) highlevel abstractions for common operations in working with text, and (3) a modular and extensible experiment framework that makes doing good science easy. It also includes reference implementations of high quality approaches for both core semantic problems (e.g. semantic role labeling (Palmer et al., 2005)) and language understanding applications (e.g. machine comprehension (Rajpurkar et al., 2016)). AllenNLP is an ongoing open-source effort maintained by engineers and researchers at the Allen Institute for Artificial Intelligence.
 99 | 
100 | **URL:** http://allennlp.org/papers/AllenNLP_white_paper.pdf
101 | 
102 | **Notes:** white paper for freshly presented AllenNLP - DL platform for NLP tasks; made on PyTorch
103 | 
104 | ## Reports/Surveys
105 | ### Best Practices for Applying Deep Learning to Novel Applications
106 | 
107 | **Authors:** Leslie N. Smith
108 | 
109 | **Abstract:** This report is targeted to groups who are subject matter experts in their application but deep learning novices. It contains practical advice for those interested in testing the use of deep neural networks on applications that are novel for deep learning. We suggest making your project more manageable by dividing it into phases. For each phase this report contains numerous recommendations and insights to assist novice practitioners.
110 | 
111 | **URL:** https://arxiv.org/abs/1704.01568
112 | 
113 | **Notes:** some notes on applying DL to new areas
114 | 
115 | ### Automatic Keyword Extraction for Text Summarization: A Survey
116 | 
117 | **Authors:** Santosh Kumar Bharti, Korra Sathya Babu
118 | 
119 | **Abstract:** In recent times, data is growing rapidly in every domain such as news, social media, banking, education, etc. Due to the excessiveness of data, there is a need of automatic summarizer which will be capable to summarize the data especially textual data in original document without losing any critical purposes. Text summarization is emerged as an important research area in recent past. In this regard, review of existing work on text summarization process is useful for carrying out further research. In this paper, recent literature on automatic keyword extraction and text summarization are presented since text summarization process is highly depend on keyword extraction. This literature includes the discussion about different methodology used for keyword extraction and text summarization. It also discusses about different databases used for text summarization in several domains along with evaluation matrices. Finally, it discusses briefly about issues and research challenges faced by researchers along with future direction.
120 | 
121 | **URL:** https://arxiv.org/abs/1704.03242
122 | 
123 | **Notes:** useful list of works in keyword extraction
124 | 
125 | ### Factorization tricks for LSTM networks
126 | 
127 | **Authors:** Oleksii Kuchaiev, Boris Ginsburg
128 | 
129 | **Abstract:** We present two simple ways of reducing the number of parameters and accelerating the training of large Long Short-Term Memory (LSTM) networks: the first one is "matrix factorization by design" of LSTM matrix into the product of two smaller matrices, and the second one is partitioning of LSTM matrix, its inputs and states into the independent groups. Both approaches allow us to train large LSTM networks significantly faster to the state-of the art perplexity. On the One Billion Word Benchmark we improve single model perplexity down to 23.36.
130 | 
131 | **URL:** https://arxiv.org/abs/1703.10722
132 | 
133 | **Notes:** could be useful bunch of tricks for LSTM from NVIDIA engineers
134 | 
135 | ### Symbolic, Distributed and Distributional Representations for Natural Language Processing in the Era of Deep Learning: a Survey
136 | 
137 | **Authors:** Lorenzo Ferrone, Fabio Massimo Zanzotto
138 | 
139 | **Abstract:** Natural language and symbols are intimately correlated. Recent advances in machine learning (ML) and in natural language processing (NLP) seem to contradict the above intuition: symbols are fading away, erased by vectors or tensors called distributed and distributional representations. However, there is a strict link between distributed/distributional representations and symbols, being the first an approximation of the second. A clearer understanding of the strict link between distributed/distributional representations and symbols will certainly lead to radically new deep learning networks. In this paper we make a survey that aims to draw the link between symbolic representations and distributed/distributional representations. This is the right time to revitalize the area of interpreting how symbols are represented inside neural networks.
140 | 
141 | **URL:** https://arxiv.org/abs/1702.00764
142 | 
143 | **Notes:** review of nlp representations
144 | 
145 | ### Deep Reinforcement Learning: An Overview
146 | 
147 | **Authors:** Yuxi Li
148 | 
149 | **Abstract:** We give an overview of recent exciting achievements of deep reinforcement learning (RL). We start with background of deep learning and reinforcement learning, as well as introduction of testbeds. Next we discuss Deep Q-Network (DQN) and its extensions, asynchronous methods, policy optimization, reward, and planning. After that, we talk about attention and memory, unsupervised learning, and learning to learn. Then we discuss various applications of RL, including games, in particular, AlphaGo, robotics, spoken dialogue systems (a.k.a. chatbot), machine translation, text sequence prediction, neural architecture design, personalized web services, healthcare, finance, and music generation. We mention topics/papers not reviewed yet. After listing a collection of RL resources, we close with discussions.
150 | 
151 | **URL:** https://arxiv.org/abs/1701.07274
152 | 
153 | **Notes:** RL overview, including dialog systems
154 | 
155 | ### Algorithms for multi-armed bandit problems
156 | 
157 | **Authors:** Volodymyr Kuleshov, Doina Precup
158 | 
159 | **Abstract:** Although many algorithms for the multi-armed bandit problem are well-understood theoretically, empirical confirmation of their effectiveness is generally scarce. This paper presents a thorough empirical study of the most popular multi-armed bandit algorithms. Three important observations can be made from our results. Firstly, simple heuristics such as epsilon-greedy and Boltzmann exploration outperform theoretically sound algorithms on most settings by a significant margin. Secondly, the performance of most algorithms varies dramatically with the parameters of the bandit problem. Our study identifies for each algorithm the settings where it performs well, and the settings where it performs poorly. Thirdly, the algorithms' performance relative each to other is affected only by the number of bandit arms and the variance of the rewards. This finding may guide the design of subsequent empirical evaluations. In the second part of the paper, we turn our attention to an important area of application of bandit algorithms: clinical trials. Although the design of clinical trials has been one of the principal practical problems motivating research on multi-armed bandits, bandit algorithms have never been evaluated as potential treatment allocation strategies. Using data from a real study, we simulate the outcome that a 2001-2002 clinical trial would have had if bandit algorithms had been used to allocate patients to treatments. We find that an adaptive trial would have successfully treated at least 50% more patients, while significantly reducing the number of adverse effects and increasing patient retention. At the end of the trial, the best treatment could have still been identified with a high level of statistical confidence. Our findings demonstrate that bandit algorithms are attractive alternatives to current adaptive treatment allocation strategies.
160 | 
161 | **URL:** https://arxiv.org/abs/1402.6028
162 | 
163 | **Notes:** an pretty old (2014) but seems to useful review of k-armed bandits algos
164 | 
165 | ### A comparison of Extrinsic Clustering Evaluation Metrics based on Formal Constraints
166 | 
167 | **Authors:** Enrique Amigo, Julio Gonzalo, Javier Artiles, Felisa Verdejo
168 | 
169 | **Abstract:** There is a wide set of evaluation metrics available to compare the quality of text clustering algorithms. In this article, we define a few intuitive formal constraints on such metrics which shed light on which aspects of the quality of a clustering are captured by different metric families. These formal constraints are validated in an experiment involving human assessments, and compared with other constraints proposed in the literature. Our analysis of a wide range of metrics shows that only BCubed satisfies all formal constraints. We also extend the analysis to the problem of overlapping clustering, where items can simultaneously belong to more than one cluster. As BCubed cannot be directly applied to this task, we propose a modifiedversion of Bcubed that avoids the problems found with other metrics.
170 | 
171 | **URL:** http://nlp.uned.es/docs/amigo2007a.pdf
172 | 
173 | **Notes:** comparison of clustering metrics for texts
174 | 
175 | ### A Survey on Dialogue Systems: Recent Advances and New Frontiers
176 | 
177 | **Authors:** Hongshen Chen, Xiaorui Liu, Dawei Yin, Jiliang Tang
178 | 
179 | **Abstract:** Dialogue systems have attracted more and more attention. Recent advances on dialogue systems are overwhelmingly contributed by deep learning techniques, which have been employed to enhance a wide range of big data applications such as computer vision, natural language processing, and recommender systems. For dialogue systems, deep learning can leverage a massive amount of data to learn meaningful feature representations and response generation strategies, while requiring a minimum amount of hand-crafting. In this article, we give an overview to these recent advances on dialogue systems from various perspectives and discuss some possible research directions. In particular, we generally di- vide existing dialogue systems into task-oriented and non- task-oriented models, then detail how deep learning techniques help them with representative algorithms and finally discuss some appealing research directions that can bring the dialogue system research into a new frontier.
180 | 
181 | **URL:** https://arxiv.org/abs/1711.01731
182 | 
183 | **Notes:** fresh survey on dialog systems
184 | 
185 | ## Other
186 | ### Living Together: Mind and Machine Intelligence
187 | 
188 | **Authors:** Neil D. Lawrence
189 | 
190 | **Abstract:** In this paper we consider the nature of the machine intelligences we have created in the context of our human intelligence. We suggest that the fundamental difference between human and machine intelligence comes down to \emph{embodiment factors}. We define embodiment factors as the ratio between an entity's ability to communicate information vs compute information. We speculate on the role of embodiment factors in driving our own intelligence and consciousness. We briefly review dual process models of cognition and cast machine intelligence within that framework, characterising it as a dominant System Zero, which can drive behaviour through interfacing with us subconsciously. Driven by concerns about the consequence of such a system we suggest prophylactic courses of action that could be considered. Our main conclusion is that it is \emph{not} sentient intelligence we should fear but \emph{non-sentient} intelligence.
191 | 
192 | **URL:** https://arxiv.org/abs/1705.07996
193 | 
194 | **Notes:** Jack Clark recommends this philosophical paper; human brain still better than computer in compression of input data by a several orders of magnitude; it is nice since we (the computer scientists to whom I have the courage to attribute myself) have a lot of stuff to do before the singularity
195 | 
196 | ### Machine Teaching: A New Paradigm for Building Machine Learning Systems
197 | 
198 | **Authors:** Patrice Y. Simard, Saleema Amershi, David M. Chickering, Alicia Edelman Pelton, Soroush Ghorashi, Christopher Meek, Gonzalo Ramos, Jina Suh, Johan Verwey, Mo Wang, John Wernsing
199 | 
200 | **Abstract:** The current processes for building machine learning systems require practitioners with deep knowledge of machine learning. This significantly limits the number of machine learning systems that can be created and has led to a mismatch between the demand for machine learning systems and the ability for organizations to build them. We believe that in order to meet this growing demand for machine learning systems we must significantly increase the number of individuals that can teach machines. We postulate that we can achieve this goal by making the process of teaching machines easy, fast and above all, universally accessible. While machine learning focuses on creating new algorithms and improving the accuracy of "learners", the machine teaching discipline focuses on the efficacy of the "teachers". Machine teaching as a discipline is a paradigm shift that follows and extends principles of software engineering and programming languages. We put a strong emphasis on the teacher and the teacher's interaction with data, as well as crucial components such as techniques and design principles of interaction and visualization. In this paper, we present our position regarding the discipline of machine teaching and articulate fundamental machine teaching principles. We also describe how, by decoupling knowledge about machine learning algorithms from the process of teaching, we can accelerate innovation and empower millions of new uses for machine learning models.
201 | 
202 | **URL:** https://arxiv.org/abs/1707.06742
203 | 
204 | **Notes:** Microsoft proposes a way to handle Machine Learning as Software Development, for me it is not that obvious why do we need to state a new field of study, apart of Software Development, but the stated problem does exist in the my day-to-day live too
205 | 
206 | ## Tutorial
207 | ### NIPS 2016 Tutorial: Generative Adversarial Networks
208 | 
209 | **Authors:** Ian Goodfellow
210 | 
211 | **Abstract:** This report summarizes the tutorial presented by the author at NIPS 2016 on generative adversarial networks (GANs). The tutorial describes: (1) Why generative modeling is a topic worth studying, (2) how generative models work, and how GANs compare to other generative models, (3) the details of how GANs work, (4) research frontiers in GANs, and (5) state-of-the-art image models that combine GANs with other methods. Finally, the tutorial contains three exercises for readers to complete, and the solutions to these exercises.
212 | 
213 | **URL:** https://arxiv.org/abs/1701.00160
214 | 
215 | **Notes:** Goodfellow's tutorial couldn't hurt
216 | 
217 | 


--------------------------------------------------------------------------------
/PAPERS2018.md:
--------------------------------------------------------------------------------
  1 | 
  2 | Table of Contents
  3 | =================
  4 | 
  5 | * [Articles](#articles)
  6 |   * [2018\-01](#2018-01)
  7 |     * [Unsupervised Low\-Dimensional Vector Representations for Words, Phrases  and Text that are Transparent, Scalable, and produce Similarity Metrics that  are Complementary to Neural Embeddings](#unsupervised-low-dimensional-vector-representations-for-words-phrases--and-text-that-are-transparent-scalable-and-produce-similarity-metrics-that--are-complementary-to-neural-embeddings)
  8 |     * [Knowledge\-based Word Sense Disambiguation using Topic Models](#knowledge-based-word-sense-disambiguation-using-topic-models)
  9 |     * [Unsupervised Part\-of\-Speech Induction](#unsupervised-part-of-speech-induction)
 10 |     * [MaskGAN: Better Text Generation via Filling in the \_\_\_\_\_\_](#maskgan-better-text-generation-via-filling-in-the-______)
 11 |   * [2018\-02](#2018-02)
 12 |     * [Improving Variational Encoder\-Decoders in Dialogue Generation](#improving-variational-encoder-decoders-in-dialogue-generation)
 13 |     * [TextZoo, a New Benchmark for Reconsidering Text Classification](#textzoo-a-new-benchmark-for-reconsidering-text-classification)
 14 |     * [Tensor Comprehensions: Framework\-Agnostic High\-Performance Machine  Learning Abstractions](#tensor-comprehensions-framework-agnostic-high-performance-machine--learning-abstractions)
 15 |     * [Ranking Sentences for Extractive Summarization with Reinforcement  Learning](#ranking-sentences-for-extractive-summarization-with-reinforcement--learning)
 16 |     * [Deep contextualized word representations](#deep-contextualized-word-representations)
 17 |     * [Latent Topic Conversational Models](#latent-topic-conversational-models)
 18 |     * [Disentangling Aspect and Opinion Words in Target\-based Sentiment  Analysis using Lifelong Learning](#disentangling-aspect-and-opinion-words-in-target-based-sentiment--analysis-using-lifelong-learning)
 19 |   * [2018\-03](#2018-03)
 20 |     * [Simple random search provides a competitive approach to reinforcement  learning](#simple-random-search-provides-a-competitive-approach-to-reinforcement--learning)
 21 |     * [Achieving Human Parity on AutomaticChinese to English News Translation](#achieving-human-parity-on-automaticchinese-to-english-news-translation)
 22 |     * [Fast Decoding in Sequence Models using Discrete Latent Variables](#fast-decoding-in-sequence-models-using-discrete-latent-variables)
 23 |     * [An Analysis of Neural Language Modeling at Multiple Scales](#an-analysis-of-neural-language-modeling-at-multiple-scales)
 24 |   * [2018\-04](#2018-04)
 25 |     * [Large scale distributed neural network training through online  distillation](#large-scale-distributed-neural-network-training-through-online--distillation)
 26 |     * [Frustratingly Easy Meta\-Embedding \-\- Computing Meta\-Embeddings by  Averaging Source Word Embeddings](#frustratingly-easy-meta-embedding----computing-meta-embeddings-by--averaging-source-word-embeddings)
 27 |     * [The Best of Both Worlds: Combining Recent Advances in Neural Machine  Translation](#the-best-of-both-worlds-combining-recent-advances-in-neural-machine--translation)
 28 |   * [2018\-05](#2018-05)
 29 |     * [Paper Abstract Writing through Editing Mechanism](#paper-abstract-writing-through-editing-mechanism)
 30 |     * [Zero\-Shot Dual Machine Translation](#zero-shot-dual-machine-translation)
 31 |   * [2018\-06](#2018-06)
 32 |     * [TextWorld: A Learning Environment for Text\-based Games](#textworld-a-learning-environment-for-text-based-games)
 33 |   * [2018\-07](#2018-07)
 34 |     * [Talk the Walk: Navigating New York City through Grounded Dialogue](#talk-the-walk-navigating-new-york-city-through-grounded-dialogue)
 35 |   * [2018\-08](#2018-08)
 36 |     * [Fake Sentence Detection as a Training Task for Sentence Encoding](#fake-sentence-detection-as-a-training-task-for-sentence-encoding)
 37 |     * [Dynamic Self\-Attention : Computing Attention over Words Dynamically for  Sentence Embedding](#dynamic-self-attention--computing-attention-over-words-dynamically-for--sentence-embedding)
 38 |     * [TaxoGen: Unsupervised Topic Taxonomy Construction by Adaptive Term Embedding and Clustering](#taxogen-unsupervised-topic-taxonomy-construction-by-adaptive-term-embedding-and-clustering)
 39 |     * [Summarizing Opinions: Aspect Extraction Meets Sentiment Prediction and They Are Both Weakly Supervised](#summarizing-opinions-aspect-extraction-meets-sentiment-prediction-and-they-are-both-weakly-supervised)
 40 |   * [2018\-09](#2018-09)
 41 |     * [Adaptive Input Representations for Neural Language Modeling](#adaptive-input-representations-for-neural-language-modeling)
 42 |     * [Weakly\-Supervised Neural Text Classification](#weakly-supervised-neural-text-classification)
 43 |     * [Accelerated Reinforcement Learning for Sentence Generation by Vocabulary Prediction](#accelerated-reinforcement-learning-for-sentence-generation-by-vocabulary-prediction)
 44 |   * [2018\-10](#2018-10)
 45 |     * [Phrase\-Based Attentions](#phrase-based-attentions)
 46 |     * [BERT: Pre\-training of Deep Bidirectional Transformers for Language  Understanding](#bert-pre-training-of-deep-bidirectional-transformers-for-language--understanding)
 47 |   * [2018\-11](#2018-11)
 48 |     * [CALCS: Continuously Approximating Longest Common Subsequence for Sequence Level Optimization](#calcs-continuously-approximating-longest-common-subsequence-for-sequence-level-optimization)
 49 |   * [2018\-12](#2018-12)
 50 |     * [Von Mises\-Fisher Loss for Training Sequence to Sequence Models with Continuous Outputs](#von-mises-fisher-loss-for-training-sequence-to-sequence-models-with-continuous-outputs)
 51 | 
 52 | Articles
 53 | ========
 54 | ## 2018-01
 55 | ### Unsupervised Low-Dimensional Vector Representations for Words, Phrases  and Text that are Transparent, Scalable, and produce Similarity Metrics that  are Complementary to Neural Embeddings
 56 | 
 57 | **Authors:** Neil R. Smalheiser, Gary Bonifield
 58 | 
 59 | **Abstract:** Neural embeddings are a popular set of methods for representing words, phrases or text as a low dimensional vector (typically 50-500 dimensions). However, it is difficult to interpret these dimensions in a meaningful manner, and creating neural embeddings requires extensive training and tuning of multiple parameters and hyperparameters. We present here a simple unsupervised method for representing words, phrases or text as a low dimensional vector, in which the meaning and relative importance of dimensions is transparent to inspection. We have created a near-comprehensive vector representation of words, and selected bigrams, trigrams and abbreviations, using the set of titles and abstracts in PubMed as a corpus. This vector is used to create several novel implicit word-word and text-text similarity metrics. The implicit word-word similarity metrics correlate well with human judgement of word pair similarity and relatedness, and outperform or equal all other reported methods on a variety of biomedical benchmarks, including several implementations of neural embeddings trained on PubMed corpora. Our implicit word-word metrics capture different aspects of word-word relatedness than word2vec-based metrics and are only partially correlated (rho = ~0.5-0.8 depending on task and corpus). The vector representations of words, bigrams, trigrams, abbreviations, and PubMed title+abstracts are all publicly available from [URL](http://arrowsmith.psych.uic.edu) for release under CC-BY-NC license. Several public web query interfaces are also available at the same site, including one which allows the user to specify a given word and view its most closely related terms according to direct co-occurrence as well as different implicit similarity metrics.
 60 | 
 61 | **URL:** https://arxiv.org/abs/1801.01884
 62 | 
 63 | **Notes:** medical-related paper on word embeddings; guys used few tricks over word2vec, like weighted score for 1- & 2-grams or list of important terms; results show their embedding actually improve relatedness of terms for humans; with code!
 64 | 
 65 | ### Knowledge-based Word Sense Disambiguation using Topic Models
 66 | 
 67 | **Authors:** Devendra Singh Chaplot, Ruslan Salakhutdinov
 68 | 
 69 | **Abstract:** Word Sense Disambiguation is an open problem in Natural Language Processing which is particularly challenging and useful in the unsupervised setting where all the words in any given text need to be disambiguated without using any labeled data. Typically WSD systems use the sentence or a small window of words around the target word as the context for disambiguation because their computational complexity scales exponentially with the size of the context. In this paper, we leverage the formalism of topic model to design a WSD system that scales linearly with the number of words in the context. As a result, our system is able to utilize the whole document as the context for a word to be disambiguated. The proposed method is a variant of Latent Dirichlet Allocation in which the topic proportions for a document are replaced by synset proportions. We further utilize the information in the WordNet by assigning a non-uniform prior to synset distribution over words and a logistic-normal prior for document distribution over synsets. We evaluate the proposed method on Senseval-2, Senseval-3, SemEval-2007, SemEval-2013 and SemEval-2015 English All-Word WSD datasets and show that it outperforms the state-of-the-art unsupervised knowledge-based WSD system by a significant margin.
 70 | 
 71 | **URL:** https://arxiv.org/abs/1801.01900
 72 | 
 73 | **Notes:** word sense disambiguation with wordnet, assigning prior as normal distribution; the parameters of normal distribution are determined from corpus at hand; the topics are being modelled by synset disrtibution instead of word themselves
 74 | 
 75 | ### Unsupervised Part-of-Speech Induction
 76 | 
 77 | **Authors:** Omid Kashefi
 78 | 
 79 | **Abstract:** Part-of-Speech (POS) tagging is an old and fundamental task in natural language processing. While supervised POS taggers have shown promising accuracy, it is not always feasible to use supervised methods due to lack of labeled data. In this project, we attempt to unsurprisingly induce POS tags by iteratively looking for a recurring pattern of words through a hierarchical agglomerative clustering process. Our approach shows promising results when compared to the tagging results of the state-of-the-art unsupervised POS taggers.
 80 | 
 81 | **URL:** https://arxiv.org/abs/1801.03564
 82 | 
 83 | **Notes:** unsupervised PoS-tagging; the author use classic backward-forward algorithm to cluster tags produced by HMM; it shows promicing results - only 10% worse that SotA
 84 | 
 85 | ### MaskGAN: Better Text Generation via Filling in the ______
 86 | 
 87 | **Authors:** William Fedus, Ian Goodfellow, Andrew M. Dai
 88 | 
 89 | **Abstract:** Neural text generation models are often autoregressive language models or seq2seq models. These models generate text by sampling words sequentially, with each word conditioned on the previous word, and are state-of-the-art for several machine translation and summarization benchmarks. These benchmarks are often defined by validation perplexity even though this is not a direct measure of the quality of the generated text. Additionally, these models are typically trained via maxi- mum likelihood and teacher forcing. These methods are well-suited to optimizing perplexity but can result in poor sample quality since generating text requires conditioning on sequences of words that may have never been observed at training time. We propose to improve sample quality using Generative Adversarial Networks (GANs), which explicitly train the generator to produce high quality samples and have shown a lot of success in image generation. GANs were originally designed to output differentiable values, so discrete language generation is challenging for them. We claim that validation perplexity alone is not indicative of the quality of text generated by a model. We introduce an actor-critic conditional GAN that fills in missing text conditioned on the surrounding context. We show qualitatively and quantitatively, evidence that this produces more realistic conditional and unconditional text samples compared to a maximum likelihood trained model.
 90 | 
 91 | **URL:** https://arxiv.org/abs/1801.07736
 92 | 
 93 | **Notes:** GAN on texts which actually makes sense; gererator is standard seq2seq; discriminator has the same architecture as generator, but it has two outputs: probability for a word to be real and value function, which is used as baseline in REINFORCE for generator
 94 | 
 95 | ## 2018-02
 96 | ### Improving Variational Encoder-Decoders in Dialogue Generation
 97 | 
 98 | **Authors:** Xiaoyu Shen, Hui Su, Shuzi Niu, Vera Demberg
 99 | 
100 | **Abstract:** Variational encoder-decoders (VEDs) have shown promising results in dialogue generation. However, the latent variable distributions are usually approximated by a much simpler model than the powerful RNN structure used for encoding and decoding, yielding the KL-vanishing problem and inconsistent training objective. In this paper, we separate the training step into two phases: The first phase learns to autoencode discrete texts into continuous embeddings, from which the second phase learns to generalize latent representations by reconstructing the encoded embedding. In this case, latent variables are sampled by transforming Gaussian noise through multi-layer perceptrons and are trained with a separate VED model, which has the potential of realizing a much more flexible distribution. We compare our model with current popular models and the experiment demonstrates substantial improvement in both metric-based and human evaluations.
101 | 
102 | **URL:** https://arxiv.org/abs/1802.02032
103 | 
104 | **Notes:** combined arch for better dialog gen: auto-encoder entangled with conditional VAE; variational HRED for CVAE; CVAE is trained with scheduled sampling; training of the whole model is resembling of GANs: AE or CVAE is freezed while the other has being trained
105 | 
106 | ### TextZoo, a New Benchmark for Reconsidering Text Classification
107 | 
108 | **Authors:** Benyou Wang, Li Wang, Qikang Wei
109 | 
110 | **Abstract:** Text representation is a fundamental concern in Natural Language Processing, especially in text classification. Recently, many neural network approaches with delicate representation model (e.g. FASTTEXT, CNN, RNN and many hybrid models with attention mechanisms) claimed that they achieved state-of-art in specific text classification datasets. However, it lacks an unified benchmark to compare these models and reveals the advantage of each sub-components for various settings. We re-implement more than 20 popular text representation models for classification in more than 10 datasets. In this paper, we reconsider the text classification task in the perspective of neural network and get serval effects with analysis of the above results.
111 | 
112 | **URL:** https://arxiv.org/abs/1802.03656
113 | 
114 | **Notes:** conceptually simple paper, but the code for it is really useful: guys reimplemented SotA text classification architectures in one manner
115 | 
116 | ### Tensor Comprehensions: Framework-Agnostic High-Performance Machine  Learning Abstractions
117 | 
118 | **Authors:** Nicolas Vasilache, Oleksandr Zinenko, Theodoros Theodoridis, Priya Goyal, Zachary DeVito, William S. Moses, Sven Verdoolaege, Andrew Adams, Albert Cohen
119 | 
120 | **Abstract:** Deep learning models with convolutional and recurrent networks are now ubiquitous and analyze massive amounts of audio, image, video, text and graph data, with applications in automatic translation, speech-to-text, scene understanding, ranking user preferences, ad placement, etc. Competing frameworks for building these networks such as TensorFlow, Chainer, CNTK, Torch/PyTorch, Caffe1/2, MXNet and Theano, explore different tradeoffs between usability and expressiveness, research or production orientation and supported hardware. They operate on a DAG of computational operators, wrapping high-performance libraries such as CUDNN (for NVIDIA GPUs) or NNPACK (for various CPUs), and automate memory allocation, synchronization, distribution. Custom operators are needed where the computation does not fit existing high-performance library calls, usually at a high engineering cost. This is frequently required when new operators are invented by researchers: such operators suffer a severe performance penalty, which limits the pace of innovation. Furthermore, even if there is an existing runtime call these frameworks can use, it often doesn't offer optimal performance for a user's particular network architecture and dataset, missing optimizations between operators as well as optimizations that can be done knowing the size and shape of data. Our contributions include (1) a language close to the mathematics of deep learning called Tensor Comprehensions offering both imperative and declarative styles, (2) a polyhedral Just-In-Time compiler to convert a mathematical description of a deep learning DAG into a CUDA kernel with delegated memory management and synchronization, also providing optimizations such as operator fusion and specialization for specific sizes, (3) a compilation cache populated by an autotuner. [Abstract cutoff]
121 | 
122 | **URL:** https://arxiv.org/abs/1802.04730
123 | 
124 | **Notes:** really hot engineering work from Facebook: DSL which is really close to mathematic notation, so a researcher could write in it directly, from this DSL an algorithm generates code in CUDA a few times, the best generated code is used for production
125 | 
126 | ### Ranking Sentences for Extractive Summarization with Reinforcement  Learning
127 | 
128 | **Authors:** Shashi Narayan, Shay B. Cohen, Mirella Lapata
129 | 
130 | **Abstract:** Single document summarization is the task of producing a shorter version of a document while preserving its principal information content. In this paper we conceptualize extractive summarization as a sentence ranking task and propose a novel training algorithm which globally optimizes the ROUGE evaluation metric through a reinforcement learning objective. We use our algorithm to train a neural summarization model on the CNN and DailyMail datasets and demonstrate experimentally that it outperforms state-of-the-art extractive and abstractive systems when evaluated automatically and by humans.
131 | 
132 | **URL:** https://arxiv.org/abs/1802.08636
133 | 
134 | **Notes:** new SotA in summarization; CNN for feature extraction from sentences; LSTM for document embedding; embedding based ranking of sentences which are used for summarization; ranking trained by REINFORCE; beam search analog for RL training
135 | 
136 | ### Deep contextualized word representations
137 | 
138 | **Authors:** Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, Luke Zettlemoyer
139 | 
140 | **Abstract:** We introduce a new type of deep contextualized word representation that models both (1) complex characteristics of word use (e.g., syntax and semantics), and (2) how these uses vary across linguistic contexts (i.e., to model polysemy). Our word vectors are learned functions of the internal states of a deep bidirectional language model (biLM), which is pre-trained on a large text corpus. We show that these representations can be easily added to existing models and significantly improve the state of the art across six challenging NLP problems, including question answering, textual entailment and sentiment analysis. We also present an analysis showing that exposing the deep internals of the pre-trained network is crucial, allowing downstream models to mix different types of semi-supervision signals.
141 | 
142 | **URL:** https://arxiv.org/abs/1802.05365
143 | 
144 | **Notes:** new SotA in NER & other tasks for embeddings; ELMo - n layers of bidirectional language model (4096 LSTM units each direction) and attn over layers, i.e embedding from each layer for each token is taken with softmax weight and summarized
145 | 
146 | ### Latent Topic Conversational Models
147 | 
148 | **Authors:** Tsung-Hsien Wen, Minh-Thang Luong
149 | 
150 | **Abstract:** Despite much success in many large-scale language tasks, sequence-to-sequence (seq2seq) models have not been an ideal choice for conversational modeling as they tend to generate generic and repetitive responses. In this paper, we propose a Latent Topic Conversational Model (LTCM) that augments the seq2seq model with a neural topic component to better model human-human conversations. The neural topic component encodes information from the source sentence to build a global “topic” distribution over words, which is then consulted by the seq2seq model to improve generation at each time step. The experimental results show that the proposed LTCM can generate more diverse and interesting responses by sampling from its learnt latent representations. In a subjective human evaluation, the judges also confirm that LTCM is the preferred option comparing to competitive baseline models.
151 | 
152 | **URL:** https://openreview.net/forum?id=S1GUgxgCW
153 | 
154 | **Notes:** generating utterances in dialogues with VAE and topic modelling: before generating a sentense we draw a topic proportion and generate phrase according to it
155 | 
156 | ### Disentangling Aspect and Opinion Words in Target-based Sentiment  Analysis using Lifelong Learning
157 | 
158 | **Authors:** Shuai Wang, Mianwei Zhou, Sahisnu Mazumder, Bing Liu, Yi Chang
159 | 
160 | **Abstract:** Given a target name, which can be a product aspect or entity, identifying its aspect words and opinion words in a given corpus is a fine-grained task in target-based sentiment analysis (TSA). This task is challenging, especially when we have no labeled data and we want to perform it for any given domain. To address it, we propose a general two-stage approach. Stage one extracts/groups the target-related words (call t-words) for a given target. This is relatively easy as we can apply an existing semantics-based learning technique. Stage two separates the aspect and opinion words from the grouped t-words, which is challenging because we often do not have enough word-level aspect and opinion labels. In this work, we formulate this problem in a PU learning setting and incorporate the idea of lifelong learning to solve it. Experimental results show the effectiveness of our approach.
161 | 
162 | **URL:** https://arxiv.org/abs/1802.05818
163 | 
164 | **Notes:** Authors are proposing nouvelle technique to extract opinion words from general vocabulary words. They use Amazon review dataset. The key idea here is to found opinion words as being close in multiple domains to base lexicon.
165 | 
166 | ## 2018-03
167 | ### Simple random search provides a competitive approach to reinforcement  learning
168 | 
169 | **Authors:** Horia Mania, Aurelia Guy, Benjamin Recht
170 | 
171 | **Abstract:** A common belief in model-free reinforcement learning is that methods based on random search in the parameter space of policies exhibit significantly worse sample complexity than those that explore the space of actions. We dispel such beliefs by introducing a random search method for training static, linear policies for continuous control problems, matching state-of-the-art sample efficiency on the benchmark MuJoCo locomotion tasks. Our method also finds a nearly optimal controller for a challenging instance of the Linear Quadratic Regulator, a classical problem in control theory, when the dynamics are not known. Computationally, our random search algorithm is at least 15 times more efficient than the fastest competing model-free methods on these benchmarks. We take advantage of this computational efficiency to evaluate the performance of our method over hundreds of random seeds and many different hyperparameter configurations for each benchmark task. Our simulations highlight a high variability in performance in these benchmark tasks, suggesting that commonly used estimations of sample efficiency do not adequately evaluate the performance of RL algorithms.
172 | 
173 | **URL:** https://arxiv.org/abs/1803.07055
174 | 
175 | **Notes:** random search with a few optimizations achieve SotA on continous control tasks! essentially is is random search with scaling of update by reward std. deviation; much more sample efficient then Evolution Strategy; with code!
176 | 
177 | ### Achieving Human Parity on AutomaticChinese to English News Translation
178 | 
179 | **Authors:** Hany Hassan, Anthony Aue, Chang Chen, Vishal Chowdhary, Jonathan Clark, Christian Federmann, Xuedong Huang, Marcin Junczys-Dowmunt, William Lewis, Mu Li, Shujie Liu, Tie-Yan Liu, Renqian Luo, Arul Menezes, Tao Qin, Frank Seide, Xu Tan, Fei Tian, Lijun Wu, Shuangzhi Wu, Yingce Xia, Dongdong Zhang, Zhirui Zhang, and Ming Zhou
180 | 
181 | **Abstract:** Machine translation has made rapid advances in recent years. Millions of people are using it today in online translation systems and mobile applications in order to communicate across language barriers. The question naturally arises whether such systems can approach or achieve parity with human translations. In this paper, we first address the problem of how to define and accurately measure human parity in translation. We then describe Microsoft’s machine translation system and measure the quality of its translations on the widely used WMT 2017 news translation task from Chinese to English. We find that our latest neural machine translation system has reached a new state-of-the-art, and that the translation quality is at human parity when compared to professional human translations. We also find that it significantly exceeds the quality of crowd-sourced non-professional translations.
182 | 
183 | **URL:** https://www.microsoft.com/en-us/research/uploads/prod/2018/03/final-achieving-human.pdf
184 | 
185 | **Notes:** human level performance in MT from MSFT! really strong claim proved in reliefingly narrow domain of English/Chinese news; not sure if the reference-matching measures are good for MT; arch is Transformer with joint unservised training on monolingual corpora
186 | 
187 | ### Fast Decoding in Sequence Models using Discrete Latent Variables
188 | 
189 | **Authors:** Łukasz Kaiser, Aurko Roy, Ashish Vaswani, Niki Parmar, Samy Bengio, Jakob Uszkoreit, Noam Shazeer
190 | 
191 | **Abstract:** Autoregressive sequence models based on deep neural networks, such as RNNs, Wavenet and the Transformer attain state-of-the-art results on many tasks. However, they are difficult to parallelize and are thus slow at processing long sequences. RNNs lack parallelism both during training and decoding, while architectures like WaveNet and Transformer are much more parallelizable during training, yet still operate sequentially during decoding. Inspired by [arxiv:[URL](/abs/1711.00937)], we present a method to extend sequence models using discrete latent variables that makes decoding much more parallelizable. We first auto-encode the target sequence into a shorter sequence of discrete latent variables, which at inference time is generated autoregressively, and finally decode the output sequence from this shorter latent sequence in parallel. To this end, we introduce a novel method for constructing a sequence of discrete latent variables and compare it with previously introduced methods. Finally, we evaluate our model end-to-end on the task of neural machine translation, where it is an order of magnitude faster at decoding than comparable autoregressive models. While lower in BLEU than purely autoregressive models, our model achieves higher scores than previously proposed non-autogregressive translation models.
192 | 
193 | **URL:** https://arxiv.org/abs/1803.03382
194 | 
195 | **Notes:** great work from Google; like Non-Autoregressive Trasformer; vector quantization as projection on subspaces; the reconstruction loss for latent space and input with AE; convolutions instead of fully-connecteds; latent variables instead of fertility rate in NAT
196 | 
197 | ### An Analysis of Neural Language Modeling at Multiple Scales
198 | 
199 | **Authors:** Stephen Merity, Nitish Shirish Keskar, Richard Socher
200 | 
201 | **Abstract:** Many of the leading approaches in language modeling introduce novel, complex and specialized architectures. We take existing state-of-the-art word level language models based on LSTMs and QRNNs and extend them to both larger vocabularies as well as character-level granularity. When properly tuned, LSTMs and QRNNs achieve state-of-the-art results on character-level (Penn Treebank, enwik8) and word-level (WikiText-103) datasets, respectively. Results are obtained in only 12 hours (WikiText-103) to 2 days (enwik8) using a single modern GPU.
202 | 
203 | **URL:** https://arxiv.org/abs/1803.08240
204 | 
205 | **Notes:** new paper from Salesforce: larger well-tuned LSTM (and QRNN) are giving SotA results in language modelling; in addition they use longer BPTT (150) and tied adaptive softmax; interestingly they found that QRNN need to be deeper than LSTM for these tasks
206 | 
207 | ## 2018-04
208 | ### Large scale distributed neural network training through online  distillation
209 | 
210 | **Authors:** Rohan Anil, Gabriel Pereyra, Alexandre Passos, Robert Ormandi, George E. Dahl, Geoffrey E. Hinton
211 | 
212 | **Abstract:** Techniques such as ensembling and distillation promise model quality improvements when paired with almost any base model. However, due to increased test-time cost (for ensembles) and increased complexity of the training pipeline (for distillation), these techniques are challenging to use in industrial settings. In this paper we explore a variant of distillation which is relatively straightforward to use as it does not require a complicated multi-stage setup or many new hyperparameters. Our first claim is that online distillation enables us to use extra parallelism to fit very large datasets about twice as fast. Crucially, we can still speed up training even after we have already reached the point at which additional parallelism provides no benefit for synchronous or asynchronous stochastic gradient descent. Two neural networks trained on disjoint subsets of the data can share knowledge by encouraging each model to agree with the predictions the other model would have made. These predictions can come from a stale version of the other model so they can be safely computed using weights that only rarely get transmitted. Our second claim is that online distillation is a cost-effective way to make the exact predictions of a model dramatically more reproducible. We support our claims using experiments on the Criteo Display Ad Challenge dataset, ImageNet, and the largest to-date dataset used for neural language modeling, containing $6\times 10^{11}$ tokens and based on the Common Crawl repository of web data.
213 | 
214 | **URL:** https://arxiv.org/abs/1804.03235
215 | 
216 | **Notes:** Google large scale language modelling. It is 20 _Terabytes_ of texts with 673 billion tokens. The algorithm they propose called codistillation: distributed learning of similar models with additional loss for each model to predist close to others results.
217 | 
218 | ### Frustratingly Easy Meta-Embedding -- Computing Meta-Embeddings by  Averaging Source Word Embeddings
219 | 
220 | **Authors:** Joshua Coates, Danushka Bollegala
221 | 
222 | **Abstract:** Creating accurate meta-embeddings from pre-trained source embeddings has received attention lately. Methods based on global and locally-linear transformation and concatenation have shown to produce accurate meta-embeddings. In this paper, we show that the arithmetic mean of two distinct word embedding sets yields a performant meta-embedding that is comparable or better than more complex meta-embedding learning methods. The result seems counter-intuitive given that vector spaces in different source embeddings are not comparable and cannot be simply averaged. We give insight into why averaging can still produce accurate meta-embedding despite the incomparability of the source vector spaces.
223 | 
224 | **URL:** https://arxiv.org/abs/1804.05262
225 | 
226 | **Notes:** really simple claim - averaging of different embedding makes them better; it is proved on datasets for semantic properties of words; interestingly averaging is slighly worse then concatenation but don't requires extra storage space
227 | 
228 | ### The Best of Both Worlds: Combining Recent Advances in Neural Machine  Translation
229 | 
230 | **Authors:** Mia Xu Chen, Orhan Firat, Ankur Bapna, Melvin Johnson, Wolfgang Macherey, George Foster, Llion Jones, Niki Parmar, Mike Schuster, Zhifeng Chen, Yonghui Wu, Macduff Hughes
231 | 
232 | **Abstract:** The past year has witnessed rapid advances in sequence-to-sequence (seq2seq) modeling for Machine Translation (MT). The classic RNN-based approaches to MT were first out-performed by the convolutional seq2seq model, which was then out-performed by the more recent Transformer model. Each of these new approaches consists of a fundamental architecture accompanied by a set of modeling and training techniques that are in principle applicable to other seq2seq architectures. In this paper, we tease apart the new architectures and their accompanying techniques in two ways. First, we identify several key modeling and training techniques, and apply them to the RNN architecture, yielding a new RNMT+ model that outperforms all of the three fundamental architectures on the benchmark WMT'14 English to French and English to German tasks. Second, we analyze the properties of each fundamental seq2seq architecture and devise new hybrid architectures intended to combine their strengths. Our hybrid models obtain further improvements, outperforming the RNMT+ model on both benchmark datasets.
233 | 
234 | **URL:** https://arxiv.org/abs/1804.09849
235 | 
236 | **Notes:** Google's paper on combination of multi-head and ol' good RNNs; it's somewhat surprisingly better in NMT task; as reguralizers authors use Label Smoothing, Dropout and Weight Decay; an ablation study shows that LS is even more important for NMT than MH attn
237 | 
238 | ## 2018-05
239 | ### Paper Abstract Writing through Editing Mechanism
240 | 
241 | **Authors:** Qingyun Wang, Zhihao Zhou, Lifu Huang, Spencer Whitehead, Boliang Zhang, Heng Ji, Kevin Knight
242 | 
243 | **Abstract:** We present a paper abstract writing system based on an attentive neural sequence-to-sequence model that can take a title as input and automatically generate an abstract. We design a novel Writing-editing Network that can attend to both the title and the previously generated abstract drafts and then iteratively revise and polish the abstract. With two series of Turing tests, where the human judges are asked to distinguish the system-generated abstracts from human-written ones, our system passes Turing tests by junior domain experts at a rate up to 30% and by non-expert at a rate up to 80%.
244 | 
245 | **URL:** https://arxiv.org/abs/1805.06064
246 | 
247 | **Notes:** generate abstract from title; dataset id published; two networks: writing and editing ones, which relates to back-translation; Attentinve Revision Gate; ROUGE, METEOR and Turing tests (METEOR correlates with the latter, surprisingly)
248 | 
249 | ### Zero-Shot Dual Machine Translation
250 | 
251 | **Authors:** Lierni Sestorain, Massimiliano Ciaramita, Christian Buck, Thomas Hofmann
252 | 
253 | **Abstract:** Neural Machine Translation (NMT) systems rely on large amounts of parallel data. This is a major challenge for low-resource languages. Building on recent work on unsupervised and semi-supervised methods, we present an approach that combines zero-shot and dual learning. The latter relies on reinforcement learning, to exploit the duality of the machine translation task, and requires only monolingual data for the target language pair. Experiments show that a zero-shot dual system, trained on English-French and English-Spanish, outperforms by large margins a standard NMT system in zero-shot translation performance on Spanish-French (both directions). The zero-shot dual method approaches the performance, within 2.2 BLEU points, of a comparable supervised setting. Our method can obtain improvements also on the setting where a small amount of parallel data for the zero-shot language pair is available. Adding Russian, to extend our experiments to jointly modeling 6 zero-shot translation directions, all directions improve between 4 and 15 BLEU points, again, reaching performance near that of the supervised setting.
254 | 
255 | **URL:** https://arxiv.org/abs/1805.10338
256 | 
257 | **Notes:** Google's fresh paper on semi-supervised NMT; zero-shot & back-translation inside; it is still pretty far from supervised models, but that approach adds 2-5 point to every language they tried (UN offial languages); and they opensourcing the model!
258 | 
259 | ## 2018-06
260 | ### TextWorld: A Learning Environment for Text-based Games
261 | 
262 | **Authors:** Marc-Alexandre Côté, Ákos Kádár, Xingdi Yuan, Ben Kybartas, Tavian Barnes, Emery Fine, James Moore, Matthew Hausknecht, Layla El Asri, Mahmoud Adada, Wendy Tay, Adam Trischler
263 | 
264 | **Abstract:** We introduce TextWorld, a sandbox learning environment for the training and evaluation of RL agents on text-based games. TextWorld is a Python library that handles interactive play-through of text games, as well as backend functions like state tracking and reward assignment. It comes with a curated list of games whose features and challenges we have analyzed. More significantly, it enables users to handcraft or automatically generate new games. Its generative mechanisms give precise control over the difficulty, scope, and language of constructed games, and can be used to relax challenges inherent to commercial text games like partial observability and sparse rewards. By generating sets of varied but similar games, TextWorld can also be used to study generalization and transfer learning. We cast text-based games in the Reinforcement Learning formalism, use our framework to develop a set of benchmark games, and evaluate several baseline agents on this set and the curated list.
265 | 
266 | **URL:** https://arxiv.org/abs/1806.11532
267 | 
268 | **Notes:** oh, yes! I've been waiting for that: text quests environment for RL agents! Thanks to MSR and colleagues from McGill! They use classic text adventures from the 80s as a Pierian spring, they introduce fwd and bkwd generation using generative grammars
269 | 
270 | ## 2018-07
271 | ### Talk the Walk: Navigating New York City through Grounded Dialogue
272 | 
273 | **Authors:** Harm de Vries, Kurt Shuster, Dhruv Batra, Devi Parikh, Jason Weston, Douwe Kiela
274 | 
275 | **Abstract:** We introduce "Talk The Walk", the first large-scale dialogue dataset grounded in action and perception. The task involves two agents (a "guide" and a "tourist") that communicate via natural language in order to achieve a common goal: having the tourist navigate to a given target location. The task and dataset, which are described in detail, are challenging and their full solution is an open problem that we pose to the community. We (i) focus on the task of tourist localization and develop the novel Masked Attention for Spatial Convolutions (MASC) mechanism that allows for grounding tourist utterances into the guide's map, (ii) show it yields significant improvements for both emergent and natural language communication, and (iii) using this method, we establish non-trivial baselines on the full task.
276 | 
277 | **URL:** https://arxiv.org/abs/1807.03367
278 | 
279 | **Notes:** the next step in grounding natural language: now it's dialog-based orientation in virtual environment (Google StreetView); NLP, CV and RL mix; innovative MASC - transformation of landmark embeddings to directions for a "tourist"-agent
280 | 
281 | ## 2018-08
282 | ### Fake Sentence Detection as a Training Task for Sentence Encoding
283 | 
284 | **Authors:** Viresh Ranjan, Heeyoung Kwon, Niranjan Balasubramanian, Minh Hoai
285 | 
286 | **Abstract:** Sentence encoders are typically trained on language modeling tasks which enable them to use large unlabeled datasets. While these models achieve state-of-the-art results on many sentence-level tasks, they are difficult to train with long training cycles. We introduce fake sentence detection as a new training task for learning sentence encodings. We automatically generate fake sentences by corrupting some original sentence and train the encoders to produce representations that are effective at detecting fake sentences. This binary classification task allows for efficient training and forces the encoder to learn the distinctions introduced by a small edit to sentences. We train a basic BiLSTM encoder to produce sentence representations and find that it outperforms a strong sentence encoding model trained on language modeling tasks, while also training much faster on smaller amount of data (20 hours instead of weeks). Further analysis shows the learned representations capture many syntactic and semantic properties expected from good sentence representations.
287 | 
288 | **URL:** https://arxiv.org/abs/1808.03840
289 | 
290 | **Notes:** really nice work on transfer learning in NLP; authors improve SotA results on 5 classification and 1 retrieve tasks with usage of additional goal for a learning - the detection of fake sentences produced by word drop and word shuffle
291 | 
292 | ### Dynamic Self-Attention : Computing Attention over Words Dynamically for  Sentence Embedding
293 | 
294 | **Authors:** Deunsol Yoon, Dongbok Lee, SangKeun Lee
295 | 
296 | **Abstract:** In this paper, we propose Dynamic Self-Attention (DSA), a new self-attention mechanism for sentence embedding. We design DSA by modifying dynamic routing in capsule network (Sabouretal.,2017) for natural language processing. DSA attends to informative words with a dynamic weight vector. We achieve new state-of-the-art results among sentence encoding methods in Stanford Natural Language Inference (SNLI) dataset with the least number of parameters, while showing comparative results in Stanford Sentiment Treebank (SST) dataset.
297 | 
298 | **URL:** https://arxiv.org/abs/1808.07383
299 | 
300 | **Notes:** dynamic sentence embeddings with usage of dense CNN and (interesting!) simplified capsules; dynamic attention is an algorithm of iterative recomputation of simple attention with addition nonlinearity applied; new SotA on SNLI
301 | 
302 | ### TaxoGen: Unsupervised Topic Taxonomy Construction by Adaptive Term Embedding and Clustering
303 | 
304 | **Authors:** Chao Zhang, Fangbo Tao, Xiusi Chen, Jiaming Shen, Meng Jiang, Brian Sadler, Michelle Vanni, and Jiawei Han
305 | 
306 | **Abstract:** Taxonomy construction is not only a fundamental task for semantic analysis of text corpora, but also an important step for applications such as information filtering, recommendation, and Web search. Existing pattern-based methods extract hypernym-hyponym term pairs and then organize these pairs into a taxonomy. However, by considering each term as an independent concept node, they overlook the topical proximity and the semantic correlations among terms. In this paper, we propose a method for constructing topic taxonomies, wherein every node represents a conceptual topic and is defined as a cluster of semantically coherent concept terms. Our method, TaxoGen, uses term embeddings and hierarchical clustering to construct a topic taxonomy in a recursive fashion. To ensure the quality of the recursive process, it consists of: (1) an adaptive spherical clustering module for allocating terms to proper levels when splitting a coarse topic into fine-grained ones; (2) a local embedding module for learning term embeddings that maintain strong discriminative power at different levels of the taxonomy. Our experiments on two real datasets demonstrate the effectiveness of TaxoGen compared with baseline methods.
307 | 
308 | **URL:** https://research.fb.com/wp-content/uploads/2018/08/TaxoGen-Unsupervised-Topic-Taxonomy-Construction-by-Adaptive-Term-Embedding-and-Clustering.pdf
309 | 
310 | **Notes:** taxonomy generation w/o supervision; authors use spherical K-means, a relevance and the local embeddings for sub-topic construction; the relevance is more sophisticated TF-IDF; the local embs are constructed from subcorpora from (again) clustering
311 | 
312 | ### Summarizing Opinions: Aspect Extraction Meets Sentiment Prediction and They Are Both Weakly Supervised
313 | 
314 | **Authors:** Stefanos Angelidis, Mirella Lapata
315 | 
316 | **Abstract:** We present a neural framework for opinion summarization from online product reviews which is knowledge-lean and only requires light supervision (e.g., in the form of product domain labels and user-provided ratings). Our method combines two weakly supervised components to identify salient opinions and form extractive summaries from multiple reviews: an aspect extractor trained under a multi-task objective, and a sentiment predictor based on multiple instance learning. We introduce an opinion summarization dataset that includes a training set of product reviews from six diverse domains and human-annotated development and test sets with gold standard aspect annotations, salience labels, and opinion summaries. Automatic evaluation shows significant improvements over baselines, and a large-scale study indicates that our opinion summaries are preferred by human judges according to multiple criteria.
317 | 
318 | **URL:** https://arxiv.org/abs/1808.08858
319 | 
320 | **Notes:** interesting approach to summarization of user reviews: authors use aspect extraction, sentiment analysis and semantic duplication removal to produce a summary of user opinions on one item; with code & data!
321 | 
322 | ## 2018-09
323 | ### Adaptive Input Representations for Neural Language Modeling
324 | 
325 | **Authors:** Alexei Baevski, Michael Auli
326 | 
327 | **Abstract:** We introduce adaptive input representations for neural language modeling which extend the adaptive softmax of Grave et al. (2017) to input representations of variable capacity. There are several choices on how to factorize the input and output layers, and whether to model words, characters or sub-word units. We perform a systematic comparison of popular choices for a self-attentional architecture. Our experiments show that models equipped with adaptive embeddings are more than twice as fast to train than the popular character input CNN while having a lower number of parameters. On the WikiText-103 benchmark we achieve 18.7 perplexity, an improvement of 10.5 perplexity compared to the previously best published result and on the Billion Word benchmark, we achieve 23.02 perplexity.
328 | 
329 | **URL:** https://arxiv.org/abs/1809.10853
330 | 
331 | **Notes:** Transformer is definitely ground breaking, next step in word embeddings - different number of dimensions for different frequency bins; a Transformer decoder with a few tweaks projects its hidden state to different capacity embeddings
332 | 
333 | ### Weakly-Supervised Neural Text Classification
334 | 
335 | **Authors:** Yu Meng, Jiaming Shen, Chao Zhang, Jiawei Han
336 | 
337 | **Abstract:** Deep neural networks are gaining increasing popularity for the classic text classification task, due to their strong expressive power and less requirement for feature engineering. Despite such attractiveness, neural text classification models suffer from the lack of training data in many real-world applications. Although many semi-supervised and weakly-supervised text classification models exist, they cannot be easily applied to deep neural models and meanwhile support limited supervision types. In this paper, we propose a weakly-supervised method that addresses the lack of training data in neural text classification. Our method consists of two modules: (1) a pseudo-document generator that leverages seed information to generate pseudo-labeled documents for model pre-training, and (2) a self-training module that bootstraps on real unlabeled data for model refinement. Our method has the flexibility to handle different types of weak supervision and can be easily integrated into existing deep neural models for text classification. We have performed extensive experiments on three real-world datasets from different domains. The results demonstrate that our proposed method achieves inspiring performance without requiring excessive training data and outperforms baseline methods significantly.
338 | 
339 | **URL:** https://arxiv.org/abs/1809.01478
340 | 
341 | **Notes:** authors construct vMF distribution to sample word vectors for keywords and generate pseudo-docs using these keywords for clf; the procedure itself is closely related to LDA motivation but its samples whole embeddings instead; with code!
342 | 
343 | ### Accelerated Reinforcement Learning for Sentence Generation by Vocabulary Prediction
344 | 
345 | **Authors:** Kazuma Hashimoto, Yoshimasa Tsuruoka
346 | 
347 | **Abstract:** A major obstacle in reinforcement learning-based sentence generation is the large action space whose size is equal to the vocabulary size of the target-side language. To improve the efficiency of reinforcement learning, we present a novel approach for reducing the action space based on dynamic vocabulary prediction. Our method first predicts a fixed-size small vocabulary for each input to generate its target sentence. The input-specific vocabularies are then used at supervised and reinforcement learning steps, and also at test time. In our experiments on six machine translation and two image captioning datasets, our method achieves faster reinforcement learning ($\sim$2.7x faster) with less GPU memory ($\sim$2.3x less) than the full-vocabulary counterpart. The reinforcement learning with our method consistently leads to significant improvement of BLEU scores, and the scores are equal to or better than those of baselines using the full vocabularies, with faster decoding time ($\sim$3x faster) on CPUs.
348 | 
349 | **URL:** https://arxiv.org/abs/1809.01694
350 | 
351 | **Notes:** a technique to select important words from a dictionary for sentence generation task; authors use this cherry-picked dictionary to form an action space for reinforcement algorithms and get faster training, inference and less memory consumption; with code!
352 | 
353 | ## 2018-10
354 | ### Phrase-Based Attentions
355 | 
356 | **Authors:** Phi Xuan Nguyen, Shafiq Joty
357 | 
358 | **Abstract:** Most state-of-the-art neural machine translation systems, despite being different in architectural skeletons (e.g. recurrence, convolutional), share an indispensable feature: the Attention. However, most existing attention methods are token-based and ignore the importance of phrasal alignments, the key ingredient for the success of phrase-based statistical machine translation. In this paper, we propose novel phrase-based attention methods to model n-grams of tokens as attention entities. We incorporate our phrase-based attentions into the recently proposed Transformer network, and demonstrate that our approach yields improvements of 1.3 BLEU for English-to-German and 0.5 BLEU for German-to-English translation tasks on WMT newstest2014 using WMT'16 training data.
359 | 
360 | **URL:** https://arxiv.org/abs/1810.03444
361 | 
362 | **Notes:** convolutional attention prings back multi-word expressions to machine translation; the core idea is to produce one vector from each n-gram and combine it with traditional [multi-head] attention; new SotA on En-De & De-En
363 | 
364 | ### BERT: Pre-training of Deep Bidirectional Transformers for Language  Understanding
365 | 
366 | **Authors:** Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova
367 | 
368 | **Abstract:** We introduce a new language representation model called BERT, which stands for Bidirectional Encoder Representations from Transformers. Unlike recent language representation models, BERT is designed to pre-train deep bidirectional representations by jointly conditioning on both left and right context in all layers. As a result, the pre-trained BERT representations can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial task-specific architecture modifications. BERT is conceptually simple and empirically powerful. It obtains new state-of-the-art results on eleven natural language processing tasks, including pushing the GLUE benchmark to 80.4% (7.6% absolute improvement), MultiNLI accuracy to 86.7 (5.6% absolute improvement) and the SQuAD v1.1 question answering Test F1 to 93.2 (1.5% absolute improvement), outperforming human performance by 2.0%.
369 | 
370 | **URL:** https://arxiv.org/abs/1810.04805
371 | 
372 | **Notes:** new SotA on many tasks from Google; bidirectional transformer language model, which is trained for 1M steps with 128k words/batch; for downstream tasks they train additional output layer with small lr; closely resembles OpenAI GPT, but bigger & better
373 | 
374 | ## 2018-11
375 | ### CALCS: Continuously Approximating Longest Common Subsequence for Sequence Level Optimization
376 | 
377 | **Authors:** Semih Yavuz, Chung-Cheng Chiu, Patrick Nguyen, Yonghui Wu
378 | 
379 | **Abstract:** Maximum-likelihood estimation (MLE) is one of the most widely used approaches for training structured prediction models for textgeneration based natural language processing applications. However, besides exposure bias, models trained with MLE suffer from wrong objective problem where they are trained to maximize the word-level correct next step prediction, but are evaluated with respect to sequence-level discrete metrics such as ROUGE and BLEU. Several variants of policy-gradient methods address some of these problems by optimizing for final discrete evaluation metrics and showing improvements over MLE training for downstream tasks like text summarization and machine translation. However, policy-gradient methods suffers from high sample variance, making the training process very difficult and unstable. In this paper, we present an alternative direction towards mitigating this problem by introducing a new objective (CALCS) based on a differentiable surrogate of longest common subsequence (LCS) measure that captures sequence-level structure similarity. Experimental results on abstractive summarization and machine translation validate the effectiveness of the proposed approach.
380 | 
381 | **URL:** http://aclweb.org/anthology/D18-1406
382 | 
383 | **Notes:** continuous approximation of longest common subsequence (LCS) from Google; nice math proof as for me; shows improvement in machine translation and abstractive summarization on transformer and pointer-nets
384 | 
385 | ## 2018-12
386 | ### Von Mises-Fisher Loss for Training Sequence to Sequence Models with Continuous Outputs
387 | 
388 | **Authors:** Sachin Kumar, Yulia Tsvetkov
389 | 
390 | **Abstract:** The Softmax function is used in the final layer of nearly all existing sequence-to-sequence models for language generation. However, it is usually the slowest layer to compute which limits the vocabulary size to a subset of most frequent types; and it has a large memory footprint. We propose a general technique for replacing the softmax layer with a continuous embedding layer. Our primary innovations are a novel probabilistic loss, and a training and inference procedure in which we generate a probability distribution over pre-trained word embeddings, instead of a multinomial distribution over the vocabulary obtained via softmax. We evaluate this new class of sequence-to-sequence models with continuous outputs on the task of neural machine translation. We show that our models obtain upto 2.5x speed-up in training time while performing on par with the state-of-the-art models in terms of translation quality. These models are capable of handling very large vocabularies without compromising on translation quality. They also produce more meaningful errors than in the softmax-based models, as these errors typically lie in a subspace of the vector space of the reference translations.
391 | 
392 | **URL:** https://arxiv.org/abs/1812.04616
393 | 
394 | **Notes:** long awaited SoftMax replacement: predicting a word emb instead of a vocab index; computation of NLLvMF loss is 2x faster than SM; interestingly, this loss became possible only two years back when the tight lower bound for Bessel's functions was proven
395 | 
396 | 


--------------------------------------------------------------------------------
/PAPERS2019.md:
--------------------------------------------------------------------------------
  1 | 
  2 | Table of Contents
  3 | =================
  4 | 
  5 |   * [Articles](#articles)
  6 |     * [2019\-01](#2019-01)
  7 |       * [Pull out all the stops: Textual analysis via punctuation sequences](#pull-out-all-the-stops-textual-analysis-via-punctuation-sequences)
  8 |       * [Assessing BERT’s Syntactic Abilities](#assessing-berts-syntactic-abilities)
  9 |       * [Human few\-shot learning of compositional instructions](#human-few-shot-learning-of-compositional-instructions)
 10 |       * [No Training Required: Exploring Random Encoders for Sentence Classification](#no-training-required-exploring-random-encoders-for-sentence-classification)
 11 |       * [Pay Less Attention with Lightweight and Dynamic Convolutions](#pay-less-attention-with-lightweight-and-dynamic-convolutions)
 12 |     * [2019\-02](#2019-02)
 13 |       * [Rethinking Action Spaces for Reinforcement Learning in End\-to\-end Dialog Agents with Latent Variable Models](#rethinking-action-spaces-for-reinforcement-learning-in-end-to-end-dialog-agents-with-latent-variable-models)
 14 |     * [2019\-04](#2019-04)
 15 |       * [Unsupervised Data Augmentation](#unsupervised-data-augmentation)
 16 |     * [2019\-05](#2019-05)
 17 |       * [Controlled CNN\-based Sequence Labeling for Aspect Extraction](#controlled-cnn-based-sequence-labeling-for-aspect-extraction)
 18 |       * [Behavior Sequence Transformer for E\-commerce Recommendation in Alibaba](#behavior-sequence-transformer-for-e-commerce-recommendation-in-alibaba)
 19 |     * [2019\-06](#2019-06)
 20 |       * [Hierarchical Decision Making by Generating and Following Natural Language Instructions](#hierarchical-decision-making-by-generating-and-following-natural-language-instructions)
 21 |     * [2019\-07](#2019-07)
 22 |       * [R\-Transformer: Recurrent Neural Network Enhanced Transformer](#r-transformer-recurrent-neural-network-enhanced-transformer)
 23 |     * [2019\-08](#2019-08)
 24 |       * [Neural Code Search Evaluation Dataset](#neural-code-search-evaluation-dataset)
 25 | 
 26 | Articles
 27 | ========
 28 | ## 2019-01
 29 | ### Pull out all the stops: Textual analysis via punctuation sequences
 30 | 
 31 | **Authors:** Alexandra N. M. Darmon, Marya Bazzi, Sam D. Howison, Mason A. Porter
 32 | 
 33 | **Abstract:** Whether enjoying the lucid prose of a favorite author or slogging through some other writer's cumbersome, heavy-set prattle (full of parentheses, em-dashes, compound adjectives, and Oxford commas), readers will notice stylistic signatures not only in word choice and grammar, but also in punctuation itself. Indeed, visual sequences of punctuation from different authors produce marvelously different (and visually striking) sequences. Punctuation is a largely overlooked stylistic feature in "stylometry'', the quantitative analysis of written text. In this paper, we examine punctuation sequences in a corpus of literary documents and ask the following questions: Are the properties of such sequences a distinctive feature of different authors? Is it possible to distinguish literary genres based on their punctuation sequences? Do the punctuation styles of authors evolve over time? Are we on to something interesting in trying to do stylometry without words, or are we full of sound and fury (signifying nothing)?
 34 | 
 35 | **URL:** https://arxiv.org/abs/1901.00519
 36 | 
 37 | **Notes:** really nice idea - analyze the punctuation itself, it is shown to be enough to distinct authorship; I think that some other tasks could be formulated, like punctuation style transfer or punctuation improvement only just from big corpora
 38 | 
 39 | ### Assessing BERT’s Syntactic Abilities
 40 | 
 41 | **Authors:** Yoav Goldberg
 42 | 
 43 | **Abstract:** I assess the extent to which the recently introduced BERT model captures English syntactic phenomena, using (1) naturally-occurring subject-verb agreement stimuli; (2) “coloreless green ideas” subject-verb agreement stimuli, in which content words in natural sentences are randomly replaced with words sharing the same part-of-speech and inflection; and (3) manually crafted stimuli for subject-verb agreement and reflexive anaphora phenomena. The BERT model performs remarkably well on all cases.
 44 | 
 45 | **URL:** http://u.cs.biu.ac.il/~yogo/bert-syntax.pdf
 46 | 
 47 | **Notes:** I like the idea of this small and concise research, it answers clear question clearly; I think more research could be done in this direction
 48 | 
 49 | ### Human few-shot learning of compositional instructions
 50 | 
 51 | **Authors:** Brenden M. Lake, Tal Linzen, Marco Baroni
 52 | 
 53 | **Abstract:** People learn in fast and flexible ways that have not been emulated by machines. Once a person learns a new verb "dax," he or she can effortlessly understand how to "dax twice," "walk and dax," or "dax vigorously." There have been striking recent improvements in machine learning for natural language processing, yet the best algorithms require vast amounts of experience and struggle to generalize new concepts in compositional ways. To better understand these distinctively human abilities, we study the compositional skills of people through language-like instruction learning tasks. Our results show that people can learn and use novel functional concepts from very few examples (few-shot learning), successfully applying familiar functions to novel inputs. People can also compose concepts in complex ways that go beyond the provided demonstrations. Two additional experiments examined the assumptions and inductive biases that people make when solving these tasks, revealing three biases: mutual exclusivity, one-to-one mappings, and iconic concatenation. We discuss the implications for cognitive modeling and the potential for building machines with more human-like language learning capabilities.
 54 | 
 55 | **URL:** https://arxiv.org/abs/1901.04587
 56 | 
 57 | **Notes:** interesting work on few shot learning in language; a person should "translate" from unknown constructed language to visual language; some flaws: there are only United State residents (so English-speaking) and proposed tasks could influence each other
 58 | 
 59 | ### No Training Required: Exploring Random Encoders for Sentence Classification
 60 | 
 61 | **Authors:** John Wieting, Douwe Kiela
 62 | 
 63 | **Abstract:** We explore various methods for computing sentence representations from pre-trained word embeddings without any training, i.e., using nothing but random parameterizations. Our aim is to put sentence embeddings on more solid footing by 1) looking at how much modern sentence embeddings gain over random methods---as it turns out, surprisingly little; and by 2) providing the field with more appropriate baselines going forward---which are, as it turns out, quite strong. We also make important observations about proper experimental protocol for sentence classification evaluation, together with recommendations for future research.
 64 | 
 65 | **URL:** https://arxiv.org/abs/1901.10444
 66 | 
 67 | **Notes:** new work from FAIR about random encoders for text clf; pooling over random projection of word emb, randomly init'ed (and never updated) LSTMs, and analog of simple RNN, also random; LSTM even reach a SotA on TREC, and they all are really good in all tasks
 68 | 
 69 | ### Pay Less Attention with Lightweight and Dynamic Convolutions
 70 | 
 71 | **Authors:** Felix Wu, Angela Fan, Alexei Baevski, Yann N. Dauphin, Michael Auli
 72 | 
 73 | **Abstract:** Self-attention is a useful mechanism to build generative models for language and images. It determines the importance of context elements by comparing each element to the current time step. In this paper, we show that a very lightweight convolution can perform competitively to the best reported self-attention results. Next, we introduce dynamic convolutions which are simpler and more efficient than self-attention. We predict separate convolution kernels based solely on the current time-step in order to determine the importance of context elements. The number of operations required by this approach scales linearly in the input length, whereas self-attention is quadratic. Experiments on large-scale machine translation, language modeling and abstractive summarization show that dynamic convolutions improve over strong self-attention models. On the WMT'14 English-German test set dynamic convolutions achieve a new state of the art of 29.7 BLEU.
 74 | 
 75 | **URL:** https://arxiv.org/abs/1901.10430
 76 | 
 77 | **Notes:** Facebook takes a next step in quasi-RNNs: lightweight convs are using softmax pooling over time, and dynamic convs use position encoding to shift weights for particular timestep; this work achieves new SotA on En-De MT and also they're close in other tasks
 78 | 
 79 | ## 2019-02
 80 | ### Rethinking Action Spaces for Reinforcement Learning in End-to-end Dialog Agents with Latent Variable Models
 81 | 
 82 | **Authors:** Tiancheng Zhao, Kaige Xie, Maxine Eskenazi
 83 | 
 84 | **Abstract:** Defining action spaces for conversational agents and optimizing their decision-making process with reinforcement learning is an enduring challenge. Common practice has been to use handcrafted dialog acts, or the output vocabulary, e.g. in neural encoder decoders, as the action spaces. Both have their own limitations. This paper proposes a novel latent action framework that treats the action spaces of an end-to-end dialog agent as latent variables and develops unsupervised methods in order to induce its own action space from the data. Comprehensive experiments are conducted examining both continuous and discrete action types and two different optimization methods based on stochastic variational inference. Results show that the proposed latent actions achieve superior empirical performance improvement over previous word-level policy gradient methods on both DealOrNoDeal and MultiWoz dialogs. Our detailed analysis also provides insights about various latent variable approaches for policy learning and can serve as a foundation for developing better latent actions in future research.
 85 | 
 86 | **URL:** https://arxiv.org/abs/1902.08858
 87 | 
 88 | **Notes:** reinforcement learning applied to dialog systems: the RL is applied only to latent action space, while leaving text decoder be fixed (SL pre-trained); authors propose lite ELBo with a constraint on latent variable to be close to a prior only; with code!
 89 | 
 90 | ## 2019-04
 91 | ### Unsupervised Data Augmentation
 92 | 
 93 | **Authors:** Qizhe Xie, Zihang Dai, Eduard Hovy, Minh-Thang Luong, Quoc V. Le
 94 | 
 95 | **Abstract:** Despite its success, deep learning still needs large labeled datasets to succeed. Data augmentation has shown much promise in alleviating the need for more labeled data, but it so far has mostly been applied in supervised settings and achieved limited gains. In this work, we propose to apply data augmentation to unlabeled data in a semi-supervised learning setting. Our method, named Unsupervised Data Augmentation or UDA, encourages the model predictions to be consistent between an unlabeled example and an augmented unlabeled example. Unlike previous methods that use random noise such as Gaussian noise or dropout noise, UDA has a small twist in that it makes use of harder and more realistic noise generated by state-of-the-art data augmentation methods. This small twist leads to substantial improvements on six language tasks and three vision tasks even when the labeled set is extremely small. For example, on the IMDb text classification dataset, with only 20 labeled examples, UDA outperforms the state-of-the-art model trained on 25,000 labeled examples. On standard semi-supervised learning benchmarks, CIFAR-10 with 4,000 examples and SVHN with 1,000 examples, UDA outperforms all previous approaches and reduces more than $30\%$ of the error rates of state-of-the-art methods: going from 7.66% to 5.27% and from 3.53% to 2.46% respectively. UDA also works well on datasets that have a lot of labeled data. For example, on ImageNet, with 1.3M extra unlabeled data, UDA improves the top-1/top-5 accuracy from 78.28/94.36% to 79.04/94.45% when compared to AutoAugment.
 96 | 
 97 | **URL:** https://arxiv.org/abs/1904.12848
 98 | 
 99 | **Notes:** a nice work from Google on [unsupervised] data augmentation, the key idea is to add specific smoothness loss on perturbed data; new SotA on IMDB using only 20 (sic!) labelled examples; authors introduce TF-IDF based word replacement for augmentation
100 | 
101 | ## 2019-05
102 | ### Controlled CNN-based Sequence Labeling for Aspect Extraction
103 | 
104 | **Authors:** Lei Shu, Hu Xu, Bing Liu
105 | 
106 | **Abstract:** One key task of fine-grained sentiment analysis on reviews is to extract aspects or features that users have expressed opinions on. This paper focuses on supervised aspect extraction using a modified CNN called controlled CNN (Ctrl). The modified CNN has two types of control modules. Through asynchronous parameter updating, it prevents over-fitting and boosts CNN's performance significantly. This model achieves state-of-the-art results on standard aspect extraction datasets. To the best of our knowledge, this is the first paper to apply control modules to aspect extraction.
107 | 
108 | **URL:** https://arxiv.org/abs/1905.06407
109 | 
110 | **Notes:** CNN with gates proves its effectiveness for aspect extraction task (sequence labelling); interestingly, BERT out of the box gives result better than in original works on the corpora (SemEval-2014&2016)
111 | 
112 | ### Behavior Sequence Transformer for E-commerce Recommendation in Alibaba
113 | 
114 | **Authors:** Qiwei Chen, Huan Zhao, Wei Li, Pipei Huang, Wenwu Ou
115 | 
116 | **Abstract:** Deep learning based methods have been widely used in industrial recommendation systems (RSs). Previous works adopt an Embedding&MLP paradigm: raw features are embedded into low-dimensional vectors, which are then fed on to MLP for final recommendations. However, most of these works just concatenate different features, ignoring the sequential nature of users' behaviors. In this paper, we propose to use the powerful Transformer model to capture the sequential signals underlying users' behavior sequences for recommendation in Alibaba. Experimental results demonstrate the superiority of the proposed model, which is then deployed online at Taobao and obtain significant improvements in online Click-Through-Rate (CTR) comparing to two baselines.
117 | 
118 | **URL:** https://arxiv.org/abs/1905.06874
119 | 
120 | **Notes:** Alibaba's successor to famous word2vec introduction to a RecSys field; the Transformer is adopted to item recommendations, authors modified embedding & positional encoding to comply with a setting, but the transformer block is the same;
121 | 
122 | ## 2019-06
123 | ### Hierarchical Decision Making by Generating and Following Natural Language Instructions
124 | 
125 | **Authors:** Hengyuan Hu, Denis Yarats, Qucheng Gong, Yuandong Tian, Mike Lewis
126 | 
127 | **Abstract:** We explore using latent natural language instructions as an expressive and compositional representation of complex actions for hierarchical decision making. Rather than directly selecting micro-actions, our agent first generates a latent plan in natural language, which is then executed by a separate model. We introduce a challenging real-time strategy game environment in which the actions of a large number of units must be coordinated across long time scales. We gather a dataset of 76 thousand pairs of instructions and executions from human play, and train instructor and executor models. Experiments show that models using natural language as a latent variable significantly outperform models that directly imitate human actions. The compositional structure of language proves crucial to its effectiveness for action representation. We also release our code, models and data.
128 | 
129 | **URL:** https://arxiv.org/abs/1906.00744
130 | 
131 | **Notes:** great effort from Facebook: two networks, one generates an order, another is following it to achieve a goal in RTS setting; authors explore simple RNNs with softmax choice; I think this is the first step in a wide field
132 | 
133 | ## 2019-07
134 | ### R-Transformer: Recurrent Neural Network Enhanced Transformer
135 | 
136 | **Authors:** Zhiwei Wang, Yao Ma, Zitao Liu, Jiliang Tang
137 | 
138 | **Abstract:** Recurrent Neural Networks have long been the dominating choice for sequence modeling. However, it severely suffers from two issues: impotent in capturing very long-term dependencies and unable to parallelize the sequential computation procedure. Therefore, many non-recurrent sequence models that are built on convolution and attention operations have been proposed recently. Notably, models with multi-head attention such as Transformer have demonstrated extreme effectiveness in capturing long-term dependencies in a variety of sequence modeling tasks. Despite their success, however, these models lack necessary components to model local structures in sequences and heavily rely on position embeddings that have limited effects and require a considerable amount of design efforts. In this paper, we propose the R-Transformer which enjoys the advantages of both RNNs and the multi-head attention mechanism while avoids their respective drawbacks. The proposed model can effectively capture both local structures and global long-term dependencies in sequences without any use of position embeddings. We evaluate R-Transformer through extensive experiments with data from a wide range of domains and the empirical results show that R-Transformer outperforms the state-of-the-art methods by a large margin in most of the tasks. We have made the code publicly available at \url{this https URL}.
139 | 
140 | **URL:** https://arxiv.org/abs/1907.05572
141 | 
142 | **Notes:** RNNs kinda strike back: authors use RNN to read local context similarly to TCN; these representations then are fed to transformer blocks; the results are good for various sequence tasks, not only NLP; with code!
143 | 
144 | ## 2019-08
145 | ### Neural Code Search Evaluation Dataset
146 | 
147 | **Authors:** Hongyu Li, Seohyun Kim, Satish Chandra
148 | 
149 | **Abstract:** There has been an increase of interest in code search using natural language. Assessing the performance of such code search models can be difficult without a readily available evaluation suite. In this paper, we present an evaluation dataset consisting of natural language query and code snippet pairs, with the hope that future work in this area can use this dataset as a common benchmark. We also provide the results of two code search models ([1] and [6]) from recent work. The evaluation dataset is available at this https URL
150 | 
151 | **URL:** https://arxiv.org/abs/1908.09804
152 | 
153 | **Notes:** Facebook's benchmark on code search; alongside with GitHub one: https://github.com/github/CodeSearchNet ; these two have been released a few days apart; there are interesting new opportunities for research in NLP-on-code field https://is.gd/y8J1QL
154 | 
155 | 


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
 1 | # Recent Deep Learning papers in NLU and RL
 2 | 
 3 | I think that other people's notes are rarely useful, so I'm listing the interesting for me papers with a few words about the main idea for me to make references in memory.
 4 | 
 5 | If you're in such stuff, welcome: [__papers list__](./PAPERS2019.md).
 6 | 
 7 | To keep list size feasible the each year papers are separated to specific file:
 8 | * [2019](./PAPERS2019.md)
 9 | * [2018](./PAPERS2018.md)
10 | * [2017](./PAPERS2017.md)
11 | * [2016 & earlier](./PAPERS.md)
12 | 
13 | I'm not only adding recent papers, but also update lists for previous periods, if I find interesting "old" paper. To keep up with updates, you could follow me on [twitter](https://twitter.com/madrugad0).
14 | 
15 | Also I've found that, there are some additional materials which are helpful for me or my students, so I've added a new list of miscellaneous [articles](./MISC.md). It is not sorted in chronological order, only by content type.
16 | 
17 | To make the tool work you need Table-of-Content creation tool, which is integrated with this software; you only need `wget` available for download. This tool has an option of twitter integration, which need to be installed and configured before usage. The tool was tested to work on Ubuntu Linux 16.04 & MacOS X 10.6; it should work on newer versions of these operating systems.
18 | 
19 | ## License
20 | ![CC BY](https://licensebuttons.net/l/by/3.0/88x31.png)
21 | 
22 | ## Table of Contents
23 | I'm using [toc maker](https://github.com/ekalinin/github-markdown-toc.go) from Eugeny Kalinin, it is very useful, thank to him for this tool.
24 | 
25 | ## Twitter integration
26 | For twitting I'm using the perfect ["t"](https://github.com/sferik/t). It is really elegant and simple to use.
27 | 


--------------------------------------------------------------------------------
/formatter.py:
--------------------------------------------------------------------------------
  1 | from __future__ import print_function
  2 | 
  3 | import argparse
  4 | import stat
  5 | import tempfile
  6 | import os
  7 | import sys
  8 | import re
  9 | import requests
 10 | 
 11 | # tricky imports for versions compatibility
 12 | if sys.version_info[0] == 2: 
 13 |     import commands as cmd
 14 |     from urllib import urlencode
 15 |     if sys.version_info[1] > 5:
 16 |         from HTMLParser import HTMLParser
 17 |         html = HTMLParser()
 18 |         from urllib import urlopen
 19 |     else:
 20 |         raise ImportError("Cannot import HTMLParser")
 21 |     to_str = str
 22 | else:  # assuming that it is python 3
 23 |     import subprocess as cmd
 24 |     from urllib.parse import urlencode
 25 |     from urllib.request import urlopen
 26 |     if sys.version_info[1] > 3:
 27 |         import html
 28 |     else:
 29 |         from html.parser import HTMLParser
 30 |         html = HTMLParser()
 31 |     to_str = lambda x: str(x, encoding="utf-8")
 32 | 
 33 | 
 34 | TWEET_LIMIT = 280
 35 | 
 36 | 
 37 | class ArticleFormatter:
 38 |     def __init__(self, twitter_command=""):
 39 |         self.twitter_command = twitter_command
 40 |         self._init()
 41 | 
 42 |     def _init(self):
 43 |         """
 44 |         Clean the object
 45 |         """
 46 |         self.index = None
 47 |         self.buf = []
 48 |         self.has_title = False
 49 |         self.has_authors = False
 50 |         self.has_abstract = False
 51 |         self.has_URL = False
 52 |         self.has_notes = False
 53 | 
 54 |         self.new_article = False
 55 | 
 56 |     def __call__(self, s):
 57 |         self.buf.append(s)
 58 |         self._analyze()
 59 |         if self.has_title and self.has_authors and self.has_abstract and self.has_URL and self.has_notes:
 60 |             printed = self._print()
 61 |             if self.new_article and self.twitter_command:
 62 |                 twit = self._twitting()
 63 |                 print("twitting: " + twit)
 64 |             self._init()
 65 |             return printed
 66 |         else:
 67 |             return ""
 68 | 
 69 |     def _twitting(self):
 70 |         url = shorten_url(self.buf[3] if self.buf[3][:8] != "**URL:**" else self.buf[3][9:])
 71 |         text = self.buf[4] if self.buf[4][:10] != "**Notes:**" else self.buf[4][11:]
 72 |         if len(text) > TWEET_LIMIT - len(url) - 1 - 3:  # one symbol for space, three symbols more
 73 |             premature_ending = "... "
 74 |             # FIXME: for some reason twitter counts for three symbols more, than len()
 75 |             while len(text) > TWEET_LIMIT - len(premature_ending) - len(url) - 3:
 76 |                 text = str.rsplit(text, " ", 1)[0]
 77 | 
 78 |             twit = "\"" + text + premature_ending + url + "\""
 79 |         else:
 80 |             twit = "\"" + text + " " + url + "\""
 81 | 
 82 |         cmd.getstatusoutput(self.twitter_command + " " + twit)
 83 |         return twit
 84 | 
 85 |     def _analyze(self):
 86 |         # special case of arxiv link
 87 |         if not self.has_title and not self.has_authors \
 88 |                 and self.buf[-1].startswith("http") and "arxiv.org" in self.buf[-1]:
 89 |             self.buf = list(parse_arxiv(self.buf[-1])) + [self.buf[-1]]
 90 |             self.has_title = True
 91 |             self.has_authors = True
 92 |             self.has_abstract = True
 93 |             self.has_URL = True
 94 |             return
 95 | 
 96 |         # special case of openreview link
 97 |         if not self.has_title and not self.has_authors \
 98 |                 and self.buf[-1].startswith("http") and "openreview.net" in self.buf[-1]:
 99 |             self.buf = list(parse_openreview(self.buf[-1])) + [self.buf[-1]]
100 |             self.has_title = True
101 |             self.has_authors = True
102 |             self.has_abstract = True
103 |             self.has_URL = True
104 |             return
105 | 
106 |         if not self.has_title:
107 |             if 0 < len(self.buf[-1]) < 150:
108 |                 self.has_title = True
109 |         elif not self.has_authors:
110 |             if 0 < len(self.buf[-1]):
111 |                 self.has_authors = True
112 |                 self.index = len(self.buf)
113 |         elif not self.has_URL:
114 |             if self.buf[-1] and (self.buf[-1].startswith("http") or self.buf[-1].startswith("**URL:**")):
115 |                 self.has_URL = True
116 |                 self.buf = self.buf[:self.index] + [" ".join(filter(lambda x: x, self.buf[self.index:-1])),
117 |                                                     self.buf[-1]]
118 |                 self.has_abstract = True
119 |         elif self.buf[-1]:
120 |             self.has_notes = True
121 | 
122 |     def _print(self):
123 |         self.buf = list(filter(lambda x: x, self.buf))
124 |         if len(self.buf) != 5:
125 |             print("\n".join(self.buf) + "\n")
126 |             raise ValueError("Wrong article description format!")
127 |         printed = ""
128 |         if not self.buf[0].startswith("###"):
129 |             printed += "### "
130 |         printed += self.buf[0] + "\n\n"
131 |         if not self.buf[1].startswith("**Authors:**"):
132 |             printed += "**Authors:** "
133 |         printed += self.buf[1] + "\n\n"
134 |         if not self.buf[2].startswith("**Abstract:**"):
135 |             printed += "**Abstract:** "
136 |         printed += self.buf[2] + "\n\n"
137 |         if not self.buf[3].startswith("**URL:**"):
138 |             printed += "**URL:** "
139 |         printed += self.buf[3] + "\n\n"
140 | 
141 |         if not self.buf[4].startswith("**Notes:**"):
142 |             printed += "**Notes:** "
143 |             self.new_article = True
144 |         printed += self.buf[4] + "\n\n"
145 | 
146 |         return printed
147 | 
148 | 
149 | def shorten_url(url):
150 |     resp = requests.get('https://is.gd/create.php?' + urlencode({'url': url, 'format': 'simple'}))
151 |     return resp.text
152 | 
153 | 
154 | def parse_arxiv(url):
155 |     resp = html.unescape(to_str(urlopen(url).read()))
156 | 
157 |     # title
158 |     title_start = resp.find("Title:") + 6
159 |     title = resp[title_start:resp.find("</h1>", title_start)].strip()
160 |     title = re.sub("<[^>]*>", "", title).strip()
161 | 
162 |     # authors
163 |     authors_start = resp.find("Authors:") + 8
164 |     authors = resp[authors_start:resp.find("</div>", authors_start)]
165 |     authors = re.sub("<[^>]*>", "", authors)
166 |     authors = re.sub("\n", "", authors).strip()
167 | 
168 |     # abstract
169 |     abstract_start = resp.find("Abstract:")
170 |     abstract_start = resp.find("</span>", abstract_start) + 7
171 |     abstract = resp[abstract_start:resp.find("</blockquote>", abstract_start)]
172 |     abstract = " ".join(abstract.split("\n")).strip()
173 | 
174 |     # URL in abstract
175 |     url_position_start = abstract.find("<a href=\"")
176 |     if url_position_start >= 0:
177 |         url_position_stop = abstract.find("\"", url_position_start + 9)
178 |         link_stop = abstract.find("</a>", url_position_stop) + 4
179 |         abstract_url = abstract[url_position_start + 9:url_position_stop]
180 |         # FIXME: use actual text in <a>
181 |         abstract = abstract[:url_position_start] + "[URL](" + abstract_url + ")" + abstract[link_stop:]
182 |     abstract = re.sub("<[^>]*>", "", abstract)
183 | 
184 |     return title, authors, abstract
185 | 
186 | 
187 | def parse_openreview(url):
188 |     resp = html.unescape(to_str(urlopen(url).read()))
189 | 
190 |     # title
191 |     title_start = resp.find('class="note_content_title citation_title">')
192 |     title_start = resp.find("\n", title_start) + 1
193 |     title = resp[title_start:resp.find("\n", title_start)].strip()
194 | 
195 |     # authors
196 |     authors_start = resp.find('author">') + 8
197 |     authors = resp[authors_start:resp.find("</h3>", authors_start)]
198 | 
199 |     # abstract
200 |     abstract_start = resp.find("Abstract:")
201 |     abstract_start = resp.find('note-content-value">', abstract_start) + 20
202 |     abstract = resp[abstract_start:resp.find("</span>", abstract_start)]
203 |     abstract = " ".join(abstract.split("\n")).strip()
204 | 
205 |     return title, authors, abstract
206 | 
207 | 
208 | def parse_args():
209 |     parser = argparse.ArgumentParser()
210 |     parser.add_argument("--toc-maker", help="path to ToC making tool")
211 |     parser.add_argument("--twitter-poster", default="t update", help="twitter poster command")
212 |     parser.add_argument("-t", "--use-twitter", action="store_true")
213 | 
214 |     known_args, unknown_args = parser.parse_known_args()
215 | 
216 |     if not known_args.toc_maker:
217 |         known_args.toc_maker = "./gh-md-toc"
218 |         if not os.path.isfile(known_args.toc_maker):
219 |             s = cmd.getoutput("uname -s").lower()
220 |             f = "gh-md-toc.%s.amd64.tgz" % s
221 |             URL = "https://github.com/ekalinin/github-markdown-toc.go/releases/download/0.8.0/%s" % f
222 |             if not os.path.isfile(f):
223 |                 if cmd.getstatusoutput("wget %s" % URL)[0] != 0:
224 |                     raise EnvironmentError("Cannot download toc maker from URL: %s" % URL)
225 |             if cmd.getstatusoutput("tar xzf %s" % f)[0] != 0:
226 |                     raise EnvironmentError("Cannot untar toc maker from file %s" % f)
227 |             os.remove(f)
228 | 
229 |             current_permissions = stat.S_IMODE(os.lstat(known_args.toc_maker).st_mode)
230 |             os.chmod(known_args.toc_maker, current_permissions & stat.S_IXUSR)
231 | 
232 |     if unknown_args:
233 |         filepath = unknown_args[0]
234 |     else:
235 |         print("You should specify the path for file to work with!")
236 |         quit(1)
237 | 
238 |     return known_args, filepath
239 | 
240 | 
241 | def main():
242 |     known_args, filepath = parse_args()
243 |     formatted = ""
244 | 
245 |     with open(filepath) as f:
246 |         pass_lines = False
247 |         if known_args.use_twitter:
248 |             formatter = ArticleFormatter(known_args.twitter_poster)
249 |         else:
250 |             formatter = ArticleFormatter()
251 | 
252 |         for line in f:
253 |             l = line.strip()
254 |             if l == "Table of Contents":
255 |                 pass_lines = True
256 |             elif l in ["Articles", "Miscellaneous"]:
257 |                 pass_lines = False
258 |                 formatted += l + "\n"
259 |             elif l in ["========", "============="]:
260 |                 formatted += l + "\n"
261 |             elif l[:3] == "## ":
262 |                 formatted += l + "\n"
263 |             elif not pass_lines:
264 |                 formatted += formatter(l)
265 | 
266 |     temp = tempfile.NamedTemporaryFile(delete=False, mode='wt')
267 |     temp.write(formatted)
268 |     temp.close()
269 |     toc = cmd.getoutput("%s %s" % (known_args.toc_maker, temp.name))
270 |     os.remove(temp.name)
271 | 
272 |     with open(filepath, "wt") as f:
273 |         f.write(toc[:-74] + formatted)
274 | 
275 | 
276 | if __name__ == "__main__":
277 |     main()
278 | 


--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
1 | requests
2 | 


--------------------------------------------------------------------------------