├── README.md
├── files
│   └── Chap06_Dynamic_Programming_in_Algorithm_Design_Kleinberg_Tardos.pdf
└── notes
    ├── 01-rl-basic.md
    └── 02-mdp.md

/README.md:
--------------------------------------------------------------------------------
# RL4NLP Reading Group (Spring 2017)

- Location: CSE 203

## Schedule

### 1. RL Basics and MDPs

- Yangfeng
- Time: April 17, Monday, 4:30 - 5:30 PM
- Reading: [Reinforcement Learning: An Introduction](http://incompleteideas.net/book/the-book-2nd.html), Chap 01 and 03
- Notes: [Chap 01](notes/01-rl-basic.md) and [Chap 03](notes/02-mdp.md)

### 2. Dynamic Programming and Monte Carlo Methods

- Chenhao
- Time: April 24, Monday, 4:30 - 5:30 PM
- Reading: [Reinforcement Learning: An Introduction](http://incompleteideas.net/book/the-book-2nd.html), Chap 04 and 05

### 3. Policy Gradient Methods

- Ji
- Time: May 1, Monday, 4:30 - 5:30 PM
- Reading: [Reinforcement Learning: An Introduction](http://incompleteideas.net/book/the-book-2nd.html), Chap 13

### 4. POS Tagging and Syntactic Parsing

- Yijia
- Time: May 8, Monday, 4:30 - 5:30 PM
- Suggested reading:
  * [EACL imitation learning tutorial](https://sheffieldnlp.github.io/ImitationLearningTutorialEACL2017/)
  * [DAGGER](https://www.cs.cmu.edu/~sross1/publications/Ross-AIStats11-NoRegret.pdf)
  * [LOLS](https://arxiv.org/pdf/1502.02206.pdf)
  * [A Dynamic Oracle for Arc-Eager Dependency Parsing](http://www.aclweb.org/anthology/C12-1059)
  * [Noise Reduction and Targeted Exploration in Imitation Learning for Abstract Meaning Representation Parsing](http://aclweb.org/anthology/P16-1001)

### 5. Information Extraction

- Colin
- Time: May 15, Monday, 4:30 - 5:30 PM
- Suggested reading: some papers from Regina Barzilay's group
  - [Learning to Win by Reading Manuals in a Monte-Carlo Framework](http://people.csail.mit.edu/regina/my_papers/civ11.pdf)
  - [Improving Information Extraction by Acquiring External Evidence with Reinforcement Learning](http://people.csail.mit.edu/karthikn/assets/pdf/rlie16.pdf)

### 6. Machine Translation and Language Modeling

- Max
- Time: May 22, Monday, 4:30 - 5:30 PM
- Suggested reading:
  - [Don’t Until the Final Verb Wait: Reinforcement Learning for Simultaneous Machine Translation](https://www.umiacs.umd.edu/~jbg/docs/2014_emnlp_simtrans.pdf)
  - [Dual Learning for Machine Translation](https://papers.nips.cc/paper/6469-dual-learning-for-machine-translation.pdf)

### 7. Summarization and Question Answering

- Mandar
- Time: June 5, Monday, 4:30 - 5:30 PM
- Suggested reading:
  - [A Deep Reinforced Model for Abstractive Summarization](https://arxiv.org/pdf/1705.04304.pdf)
  - [Coarse-to-Fine Question Answering for Long Documents](http://homes.cs.washington.edu/~eunsol/papers/acl17eunsol.pdf)

--------------------------------------------------------------------------------
/files/Chap06_Dynamic_Programming_in_Algorithm_Design_Kleinberg_Tardos.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jiyfeng/rl4nlp/782a3d8f3c88bcf41512c105ffa5c326dada3616/files/Chap06_Dynamic_Programming_in_Algorithm_Design_Kleinberg_Tardos.pdf

--------------------------------------------------------------------------------
/notes/01-rl-basic.md:
--------------------------------------------------------------------------------
# Chap 01: The Reinforcement Learning (RL) Problem

## Introduction

- Three characteristics of RL problems:
  - being closed-loop in an essential way
  - not having direct instructions as to what actions to take
  - having consequences of actions, including reward signals, that play out over extended time periods
- The difference between RL and supervised learning (RL has no external supervisor providing labeled examples of correct actions)
- The difference between RL and unsupervised learning:
  - RL: maximize rewards
  - unsupervised learning: find hidden structure in data
- The special challenge of RL: the trade-off between **exploration** and **exploitation**. An agent must both
  - exploit what it already knows in order to obtain reward, and
  - explore in order to make better action selections in the future

## Elements of RL

- A policy
  - defines the learning agent's way of behaving at a given time
  - the **core** of an agent, in the sense that it alone is sufficient to determine behavior
- A reward signal
  - defines the goal of the RL problem by determining which events are good and which are bad for the agent
  - the agent's sole objective is to maximize the **total** reward it receives over the long run
  - the process that generates the reward signal must be unalterable by the agent
- A value function
  - specifies what is good in the long run
  - the *value* of a state is the total amount of reward an agent can **expect** to accumulate over the future, starting from that state
  - in other words, a value is a prediction of long-run reward given the current state
- A model of the environment (optional)

## Tic-Tac-Toe

- The difference between evolutionary methods and methods that learn value functions:
  - learning a value function takes advantage of information available during the course of play (see the sketch below)
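
The tic-tac-toe example in Chap 01 makes this concrete with a temporal-difference-style update: after each greedy move, the value estimate of the earlier position is nudged toward the value estimate of the later one, so information from actual play flows backward through the value table. Below is a minimal sketch of just that update rule, assuming a table of values keyed by hashable state encodings; the class name, the hyperparameters (`step_size`, `explore_rate`), and the defaults are illustrative assumptions, and the board representation and game loop are omitted.

```python
import random

# Minimal sketch of the tic-tac-toe value-table update:
# after a greedy move from state s to state s', nudge V(s) toward V(s').

class ValueTableAgent:
    def __init__(self, step_size=0.1, explore_rate=0.1, default_value=0.5):
        self.values = {}              # state -> estimated probability of winning
        self.step_size = step_size    # alpha in V(s) <- V(s) + alpha * (V(s') - V(s))
        self.explore_rate = explore_rate
        self.default_value = default_value

    def value(self, state):
        return self.values.get(state, self.default_value)

    def select(self, candidate_states):
        """Pick the next state: usually greedy, occasionally exploratory."""
        if random.random() < self.explore_rate:
            return random.choice(candidate_states), False
        return max(candidate_states, key=self.value), True

    def update(self, state, next_state):
        """Back up the value of the later state to the earlier one."""
        v, v_next = self.value(state), self.value(next_state)
        self.values[state] = v + self.step_size * (v_next - v)

# Usage sketch: during self-play, after each greedy move call
#   agent.update(previous_state, current_state)
# so that values propagate back from terminal states (1 for a win, 0 for a loss)
# toward earlier positions.
```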

--------------------------------------------------------------------------------
/notes/02-mdp.md:
--------------------------------------------------------------------------------
# Chap 03: Finite Markov Decision Processes

Notation

- $S_t\in\mathcal{S}$: the environment state at step $t$, where $\mathcal{S}$ is the set of possible states
- $A_t\in\mathcal{A}(S_t)$: the action taken in state $S_t$, where $\mathcal{A}(S_t)$ is the set of actions available in state $S_t$
- $R_{t+1}\in\mathcal{R}\subset\mathbb{R}$: the reward received after taking action $A_t$ in state $S_t$
- $\pi_t$: the agent's policy, where $\pi_t(a|s)$ is the probability that $A_t=a$ if $S_t=s$

## Returns

The expected discounted return is
$$G_t=\sum_{k=0}^{\infty}\gamma^{k}R_{t+k+1}$$
where $\gamma$ is a parameter, $0\leq \gamma\leq 1$, called the discount rate.

## MDP

A reinforcement learning task that satisfies the Markov property is called a **Markov Decision Process** (MDP).

A finite MDP is specified by its state and action sets ($\mathcal{S}$ and $\mathcal{A}$) and by the one-step dynamics of the environment:
$$p(s',r|s,a)=\text{Pr}(S_{t+1}=s',R_{t+1}=r|S_t=s,A_t=a)$$

Everything else can be computed from these dynamics, including

- the expected reward for a state-action pair: $r(s,a)=\mathbb{E}(R_{t+1}|S_t=s,A_t=a)$
- the state-transition probability: $p(s'|s,a)=\text{Pr}(S_{t+1}=s'|S_t=s,A_t=a)$
- the expected reward for a state-action-next-state triple: $r(s,a,s')=\mathbb{E}(R_{t+1}|S_t=s,A_t=a,S_{t+1}=s')$

**An alternative definition** [1]: an MDP is defined by

- a set of states $\mathcal{S}$
- a start (initial) state $s_0\in\mathcal{S}$
- a set of actions $\mathcal{A}$
- a transition probability $P(S_{t+1}=s'|S_t=s,A_t=a)$
- a reward probability $P(R_{t+1}=r|S_t=s,A_t=a)$

## Value functions

For an MDP, the **state-value function** for a policy $\pi$ is defined as
$$v_{\pi}(s)=\mathbb{E}_{\pi}(G_t|S_t=s)=\mathbb{E}_{\pi}\Big(\sum_{k=0}^{\infty}\gamma^kR_{t+k+1}\Big|S_t=s\Big)$$

The **action-value function** for policy $\pi$, $q_{\pi}(s,a)$, is defined analogously:
$$q_{\pi}(s,a)=\mathbb{E}_{\pi}(G_t|S_t=s,A_t=a)$$

- Q-learning works with action values of this form (it estimates $q_{\ast}$, defined below)

### Bellman equation

The Bellman equation for $\pi$ is defined as follows: for every $s\in\mathcal{S}$,
$$v_{\pi}(s)=\sum_{a}\pi(a|s)\sum_{s',r}p(s',r|s,a)[r+\gamma v_{\pi}(s')]$$
which relates the value of $S_t$ to the value of $S_{t+1}$ under a given policy $\pi$.
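
To make the Bellman equation concrete, here is a minimal iterative policy evaluation sketch (the dynamic-programming method from Chap 04, the next session's reading): it treats the equation as an update rule and sweeps all states until the value estimates stop changing. The tiny two-state MDP, the dictionary encoding of $p(s',r|s,a)$, the example policy, and the tolerance are made-up illustrative assumptions, not part of these notes.

```python
# Iterative policy evaluation: repeatedly apply the Bellman equation as an
# update until v converges. The two-state MDP below is made up for illustration.

# dynamics[(s, a)] is a list of (next_state, reward, probability) triples,
# i.e. an explicit tabular encoding of p(s', r | s, a).
dynamics = {
    ("s0", "stay"): [("s0", 0.0, 1.0)],
    ("s0", "go"):   [("s1", 1.0, 0.9), ("s0", 0.0, 0.1)],
    ("s1", "stay"): [("s1", 0.0, 1.0)],
    ("s1", "go"):   [("s0", 2.0, 1.0)],
}

# policy[s][a] = pi(a | s); here a uniform random policy over both actions.
policy = {
    "s0": {"stay": 0.5, "go": 0.5},
    "s1": {"stay": 0.5, "go": 0.5},
}

def policy_evaluation(dynamics, policy, gamma=0.9, tol=1e-8):
    """Return v_pi as a dict state -> value, by sweeping the Bellman equation."""
    v = {s: 0.0 for s in policy}
    while True:
        delta = 0.0
        for s in policy:
            # v(s) = sum_a pi(a|s) sum_{s',r} p(s',r|s,a) [r + gamma * v(s')]
            new_v = sum(
                prob_a * sum(p * (r + gamma * v[s2]) for s2, r, p in dynamics[(s, a)])
                for a, prob_a in policy[s].items()
            )
            delta = max(delta, abs(new_v - v[s]))
            v[s] = new_v
        if delta < tol:
            return v

print(policy_evaluation(dynamics, policy))
```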

Reinforcement learning methods can also solve Markov decision processes without an explicit specification of the transition probabilities.

## Optimal Value Functions

In finite MDPs, value functions define a partial ordering over policies:

- a policy $\pi$ is defined to be better than or equal to a policy $\pi'$ if its expected return is greater than or equal to that of $\pi'$ for all states.

The optimal state-value function, denoted $v_{\ast}$, is defined for all $s\in\mathcal{S}$ as
$$v_{\ast}(s)=\max_{\pi}v_{\pi}(s)$$

For an MDP, it satisfies
$$v_{\ast}(s)=\max_{a\in\mathcal{A}(s)}\sum_{s',r}p(s',r|s,a)[r+\gamma v_{\ast}(s')]$$

This equation implies that any policy that is greedy with respect to the optimal value function $v_{\ast}$ is an optimal policy. Optimal policies also share the same *optimal action-value function*, denoted $q_{\ast}$:
$$q_{\ast}(s,a) = \sum_{s',r}p(s',r|s,a)[r+\gamma\max_{a'}q_{\ast}(s',a')]$$

- This equation is also a formulation for which dynamic programming can compute the optimal solution [2]; a value-iteration sketch based on it appears after the references below.
- Q-learning estimates $q_{\ast}$ directly from experience, without a model of $p(s',r|s,a)$.

An [example](https://aclweb.org/anthology/D/D16/D16-1261.pdf) of using an MDP formulation for information extraction.

## References

1. Mohri, Rostamizadeh, and Talwalkar. *Foundations of Machine Learning*. 2012.
2. Kleinberg and Tardos. [Chap 06: Dynamic Programming](../files/Chap06_Dynamic_Programming_in_Algorithm_Design_Kleinberg_Tardos.pdf), *Algorithm Design*. 2005.
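
As a companion to the Bellman optimality equation above, here is a minimal value-iteration sketch, the dynamic-programming counterpart of the policy-evaluation sketch earlier: it applies the $v_{\ast}$ equation as an update rule and then extracts a greedy (hence optimal) policy. The two-state MDP and its dictionary encoding of $p(s',r|s,a)$ are the same made-up example as before and are purely illustrative.

```python
# Value iteration: repeatedly apply the Bellman optimality update
#   v(s) <- max_a sum_{s',r} p(s',r|s,a) [r + gamma * v(s')]
# then read off a greedy policy. dynamics[(s, a)] lists
# (next_state, reward, probability) triples for a made-up two-state MDP.

dynamics = {
    ("s0", "stay"): [("s0", 0.0, 1.0)],
    ("s0", "go"):   [("s1", 1.0, 0.9), ("s0", 0.0, 0.1)],
    ("s1", "stay"): [("s1", 0.0, 1.0)],
    ("s1", "go"):   [("s0", 2.0, 1.0)],
}

def value_iteration(dynamics, gamma=0.9, tol=1e-8):
    """Return (v_star, greedy_policy) for a tabular MDP given as p(s',r|s,a)."""
    states = {s for s, _ in dynamics}
    actions = {s: [a for s2, a in dynamics if s2 == s] for s in states}
    v = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            best = max(
                sum(p * (r + gamma * v[s2]) for s2, r, p in dynamics[(s, a)])
                for a in actions[s]
            )
            delta = max(delta, abs(best - v[s]))
            v[s] = best
        if delta < tol:
            break
    # Greedy policy with respect to v_star.
    policy = {
        s: max(actions[s],
               key=lambda a: sum(p * (r + gamma * v[s2])
                                 for s2, r, p in dynamics[(s, a)]))
        for s in states
    }
    return v, policy

v_star, pi_star = value_iteration(dynamics)
print(v_star, pi_star)
```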