├── README.md
├── files
│   └── Chap06_Dynamic_Programming_in_Algorithm_Design_Kleinberg_Tardos.pdf
└── notes
    ├── 01-rl-basic.md
    └── 02-mdp.md

/README.md:
--------------------------------------------------------------------------------
# RL4NLP Reading Group (Spring 2017)

- Location: CSE 203

## Schedule

### 1. RL Basics and MDPs

- Yangfeng
- Time: April 17, Monday, 4:30 - 5:30 PM
- Reading: [Reinforcement Learning: An Introduction](http://incompleteideas.net/book/the-book-2nd.html), Chap 01 and 03
- Notes: [Chap 01](notes/01-rl-basic.md) and [Chap 03](notes/02-mdp.md)

### 2. Dynamic Programming and Monte Carlo Methods

- Chenhao
- Time: April 24, Monday, 4:30 - 5:30 PM
- Reading: [Reinforcement Learning: An Introduction](http://incompleteideas.net/book/the-book-2nd.html), Chap 04 and 05

### 3. Policy Gradient Methods

- Ji
- Time: May 1, Monday, 4:30 - 5:30 PM
- Reading: [Reinforcement Learning: An Introduction](http://incompleteideas.net/book/the-book-2nd.html), Chap 13

### 4. POS Tagging and Syntactic Parsing

- Yijia
- Time: May 8, Monday, 4:30 - 5:30 PM
- Suggested reading:
  * [EACL imitation learning tutorial](https://sheffieldnlp.github.io/ImitationLearningTutorialEACL2017/)
  * [DAGGER](https://www.cs.cmu.edu/~sross1/publications/Ross-AIStats11-NoRegret.pdf)
  * [LOLS](https://arxiv.org/pdf/1502.02206.pdf)
  * [A Dynamic Oracle for Arc-Eager Dependency Parsing](http://www.aclweb.org/anthology/C12-1059)
  * [Noise Reduction and Targeted Exploration in Imitation Learning for Abstract Meaning Representation Parsing](http://aclweb.org/anthology/P16-1001)

### 5. Information Extraction

- Colin
- Time: May 15, Monday, 4:30 - 5:30 PM
- Suggested reading: some papers from Regina Barzilay's group
  - [Learning to Win by Reading Manuals in a Monte-Carlo Framework](http://people.csail.mit.edu/regina/my_papers/civ11.pdf)
  - [Improving Information Extraction by Acquiring External Evidence with Reinforcement Learning](http://people.csail.mit.edu/karthikn/assets/pdf/rlie16.pdf)

### 6. Machine Translation and Language Modeling

- Max
- Time: May 22, Monday, 4:30 - 5:30 PM
- Suggested reading:
  - [Don’t Until the Final Verb Wait: Reinforcement Learning for Simultaneous Machine Translation](https://www.umiacs.umd.edu/~jbg/docs/2014_emnlp_simtrans.pdf)
  - [Dual Learning for Machine Translation](https://papers.nips.cc/paper/6469-dual-learning-for-machine-translation.pdf)

### 7. Summarization and Question Answering

- Mandar
- Time: June 5, Monday, 4:30 - 5:30 PM
- Suggested reading:
  - [A Deep Reinforced Model for Abstractive Summarization](https://arxiv.org/pdf/1705.04304.pdf)
  - [Coarse-to-Fine Question Answering for Long Documents](http://homes.cs.washington.edu/~eunsol/papers/acl17eunsol.pdf)

--------------------------------------------------------------------------------
/files/Chap06_Dynamic_Programming_in_Algorithm_Design_Kleinberg_Tardos.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jiyfeng/rl4nlp/782a3d8f3c88bcf41512c105ffa5c326dada3616/files/Chap06_Dynamic_Programming_in_Algorithm_Design_Kleinberg_Tardos.pdf

--------------------------------------------------------------------------------
/notes/01-rl-basic.md:
--------------------------------------------------------------------------------
# Chap 01: The Reinforcement Learning (RL) Problem

## Introduction

- Three characteristics of RL problems:
  - being closed-loop in an essential way
  - not having direct instructions as to what actions to take
  - having consequences of actions, including reward signals, that play out over extended time periods
- The difference between RL and supervised learning (RL has no external supervisor providing labeled examples of correct actions)
- The difference between RL and unsupervised learning:
  - RL: maximize rewards
  - unsupervised learning: find hidden structure in data
- The special challenge of RL: the trade-off between **exploration** and **exploitation**. An agent must both
  - exploit what it already knows in order to obtain reward, and
  - explore in order to make better action selections in the future

## Elements of RL

- A policy
  - defines the learning agent's way of behaving at a given time
  - the **core** of an agent, in the sense that it alone is sufficient to determine behavior
- A reward signal
  - defines the goal of the RL problem by determining which events are good and which are bad for the agent
  - the agent's sole objective is to maximize the **total** reward it receives over the long run
  - the process that generates the reward signal must be unalterable by the agent
- A value function
  - specifies what is good in the long run
  - the *value* of a state is the total amount of reward an agent can **expect** to accumulate over the future, starting from that state
  - in other words, a value is a prediction of long-run reward given the current state
- A model of the environment (optional)

## Tic-Tac-Toe

- The difference between evolutionary methods and methods that learn value functions:
  - learning a value function takes advantage of information available during the course of play (see the sketch below)
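
The tic-tac-toe example in Chap 01 makes this concrete with a temporal-difference-style update: after each greedy move, the value estimate of the earlier position is nudged toward the value estimate of the later one, so information from actual play flows backward through the value table. Below is a minimal sketch of just that update rule, assuming a table of values keyed by hashable state encodings; the class name, the hyperparameters (`step_size`, `explore_rate`), and the defaults are illustrative assumptions, and the board representation and game loop are omitted.

```python
import random

# Minimal sketch of the tic-tac-toe value-table update:
# after a greedy move from state s to state s', nudge V(s) toward V(s').

class ValueTableAgent:
    def __init__(self, step_size=0.1, explore_rate=0.1, default_value=0.5):
        self.values = {}              # state -> estimated probability of winning
        self.step_size = step_size    # alpha in V(s) <- V(s) + alpha * (V(s') - V(s))
        self.explore_rate = explore_rate
        self.default_value = default_value

    def value(self, state):
        return self.values.get(state, self.default_value)

    def select(self, candidate_states):
        """Pick the next state: usually greedy, occasionally exploratory."""
        if random.random() < self.explore_rate:
            return random.choice(candidate_states), False
        return max(candidate_states, key=self.value), True

    def update(self, state, next_state):
        """Back up the value of the later state to the earlier one."""
        v, v_next = self.value(state), self.value(next_state)
        self.values[state] = v + self.step_size * (v_next - v)

# Usage sketch: during self-play, after each greedy move call
#   agent.update(previous_state, current_state)
# so that values propagate back from terminal states (1 for a win, 0 for a loss)
# toward earlier positions.
```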

--------------------------------------------------------------------------------
/notes/02-mdp.md:
--------------------------------------------------------------------------------
# Chap 03: Finite Markov Decision Processes

Notation

- $S_t\in\mathcal{S}$: the environment state at step $t$, where $\mathcal{S}$ is the set of possible states
- $A_t\in\mathcal{A}(S_t)$: the action taken in state $S_t$, where $\mathcal{A}(S_t)$ is the set of actions available in state $S_t$
- $R_{t+1}\in\mathcal{R}\subset\mathbb{R}$: the reward received after taking action $A_t$ in state $S_t$
- $\pi_t$: the agent's policy, where $\pi_t(a|s)$ is the probability that $A_t=a$ if $S_t=s$

## Returns

The expected discounted return is
$$G_t=\sum_{k=0}^{\infty}\gamma^{k}R_{t+k+1}$$
where $\gamma$ is a parameter, $0\leq \gamma\leq 1$, called the discount rate.

## MDP

A reinforcement learning task that satisfies the Markov property is called a **Markov Decision Process** (MDP).

A finite MDP is specified by its state and action sets ($\mathcal{S}$ and $\mathcal{A}$) and by the one-step dynamics of the environment:
$$p(s',r|s,a)=\text{Pr}(S_{t+1}=s',R_{t+1}=r|S_t=s,A_t=a)$$

Everything else can be computed from these dynamics, including

- the expected reward for a state-action pair: $r(s,a)=\mathbb{E}(R_{t+1}|S_t=s,A_t=a)$
- the state-transition probability: $p(s'|s,a)=\text{Pr}(S_{t+1}=s'|S_t=s,A_t=a)$
- the expected reward for a state-action-next-state triple: $r(s,a,s')=\mathbb{E}(R_{t+1}|S_t=s,A_t=a,S_{t+1}=s')$

**An alternative definition** [1]: an MDP is defined by

- a set of states $\mathcal{S}$
- a start (initial) state $s_0\in\mathcal{S}$
- a set of actions $\mathcal{A}$
- a transition probability $P(S_{t+1}=s'|S_t=s,A_t=a)$
- a reward probability $P(R_{t+1}=r|S_t=s,A_t=a)$

## Value functions

For an MDP, the **state-value function** for a policy $\pi$ is defined as
$$v_{\pi}(s)=\mathbb{E}_{\pi}(G_t|S_t=s)=\mathbb{E}_{\pi}\Big(\sum_{k=0}^{\infty}\gamma^kR_{t+k+1}\Big|S_t=s\Big)$$

The **action-value function** for policy $\pi$, $q_{\pi}(s,a)$, is defined analogously:
$$q_{\pi}(s,a)=\mathbb{E}_{\pi}(G_t|S_t=s,A_t=a)$$

- Q-learning works with action values of this form (it estimates $q_{\ast}$, defined below)

### Bellman equation

The Bellman equation for $\pi$ is defined as follows: for every $s\in\mathcal{S}$,
$$v_{\pi}(s)=\sum_{a}\pi(a|s)\sum_{s',r}p(s',r|s,a)[r+\gamma v_{\pi}(s')]$$
which relates the value of $S_t$ to the value of $S_{t+1}$ under a given policy $\pi$.
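
To make the Bellman equation concrete, here is a minimal iterative policy evaluation sketch (the dynamic-programming method from Chap 04, the next session's reading): it treats the equation as an update rule and sweeps all states until the value estimates stop changing. The tiny two-state MDP, the dictionary encoding of $p(s',r|s,a)$, the example policy, and the tolerance are made-up illustrative assumptions, not part of these notes.

```python
# Iterative policy evaluation: repeatedly apply the Bellman equation as an
# update until v converges. The two-state MDP below is made up for illustration.

# dynamics[(s, a)] is a list of (next_state, reward, probability) triples,
# i.e. an explicit tabular encoding of p(s', r | s, a).
dynamics = {
    ("s0", "stay"): [("s0", 0.0, 1.0)],
    ("s0", "go"):   [("s1", 1.0, 0.9), ("s0", 0.0, 0.1)],
    ("s1", "stay"): [("s1", 0.0, 1.0)],
    ("s1", "go"):   [("s0", 2.0, 1.0)],
}

# policy[s][a] = pi(a | s); here a uniform random policy over both actions.
policy = {
    "s0": {"stay": 0.5, "go": 0.5},
    "s1": {"stay": 0.5, "go": 0.5},
}

def policy_evaluation(dynamics, policy, gamma=0.9, tol=1e-8):
    """Return v_pi as a dict state -> value, by sweeping the Bellman equation."""
    v = {s: 0.0 for s in policy}
    while True:
        delta = 0.0
        for s in policy:
            # v(s) = sum_a pi(a|s) sum_{s',r} p(s',r|s,a) [r + gamma * v(s')]
            new_v = sum(
                prob_a * sum(p * (r + gamma * v[s2]) for s2, r, p in dynamics[(s, a)])
                for a, prob_a in policy[s].items()
            )
            delta = max(delta, abs(new_v - v[s]))
            v[s] = new_v
        if delta < tol:
            return v

print(policy_evaluation(dynamics, policy))
```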

Reinforcement learning methods can also solve Markov decision processes without an explicit specification of the transition probabilities.

## Optimal Value Functions

In finite MDPs, value functions define a partial ordering over policies:

- a policy $\pi$ is defined to be better than or equal to a policy $\pi'$ if its expected return is greater than or equal to that of $\pi'$ for all states.

The optimal state-value function, denoted $v_{\ast}$, is defined for all $s\in\mathcal{S}$ as
$$v_{\ast}(s)=\max_{\pi}v_{\pi}(s)$$

For an MDP, it satisfies
$$v_{\ast}(s)=\max_{a\in\mathcal{A}(s)}\sum_{s',r}p(s',r|s,a)[r+\gamma v_{\ast}(s')]$$

This equation implies that any policy that is greedy with respect to the optimal value function $v_{\ast}$ is an optimal policy. Optimal policies also share the same *optimal action-value function*, denoted $q_{\ast}$:
$$q_{\ast}(s,a) = \sum_{s',r}p(s',r|s,a)[r+\gamma\max_{a'}q_{\ast}(s',a')]$$

- This equation is also a formulation for which dynamic programming can compute the optimal solution [2]; a value-iteration sketch based on it appears after the references below.
- Q-learning estimates $q_{\ast}$ directly from experience, without a model of $p(s',r|s,a)$.

An [example](https://aclweb.org/anthology/D/D16/D16-1261.pdf) of using an MDP formulation for information extraction.

## References

1. Mohri, Rostamizadeh, and Talwalkar. *Foundations of Machine Learning*. 2012.
2. Kleinberg and Tardos. [Chap 06: Dynamic Programming](../files/Chap06_Dynamic_Programming_in_Algorithm_Design_Kleinberg_Tardos.pdf), *Algorithm Design*. 2005.
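
As a companion to the Bellman optimality equation above, here is a minimal value-iteration sketch, the dynamic-programming counterpart of the policy-evaluation sketch earlier: it applies the $v_{\ast}$ equation as an update rule and then extracts a greedy (hence optimal) policy. The two-state MDP and its dictionary encoding of $p(s',r|s,a)$ are the same made-up example as before and are purely illustrative.

```python
# Value iteration: repeatedly apply the Bellman optimality update
#   v(s) <- max_a sum_{s',r} p(s',r|s,a) [r + gamma * v(s')]
# then read off a greedy policy. dynamics[(s, a)] lists
# (next_state, reward, probability) triples for a made-up two-state MDP.

dynamics = {
    ("s0", "stay"): [("s0", 0.0, 1.0)],
    ("s0", "go"):   [("s1", 1.0, 0.9), ("s0", 0.0, 0.1)],
    ("s1", "stay"): [("s1", 0.0, 1.0)],
    ("s1", "go"):   [("s0", 2.0, 1.0)],
}

def value_iteration(dynamics, gamma=0.9, tol=1e-8):
    """Return (v_star, greedy_policy) for a tabular MDP given as p(s',r|s,a)."""
    states = {s for s, _ in dynamics}
    actions = {s: [a for s2, a in dynamics if s2 == s] for s in states}
    v = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            best = max(
                sum(p * (r + gamma * v[s2]) for s2, r, p in dynamics[(s, a)])
                for a in actions[s]
            )
            delta = max(delta, abs(best - v[s]))
            v[s] = best
        if delta < tol:
            break
    # Greedy policy with respect to v_star.
    policy = {
        s: max(actions[s],
               key=lambda a: sum(p * (r + gamma * v[s2])
                                 for s2, r, p in dynamics[(s, a)]))
        for s in states
    }
    return v, policy

v_star, pi_star = value_iteration(dynamics)
print(v_star, pi_star)
```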