├── Deep_RL.pdf
├── QwRB.tex
├── README.md
├── REINFORCE.tex
├── actorcritic.tex
├── batchac.tex
├── batchacwdf.tex
├── cem.tex
├── ctrasinf.tex
├── dagger.tex
├── ddpg.tex
├── dqn.tex
├── dyna.tex
├── dynagen.tex
├── exploration.tex
├── figures
│   ├── bellmanbackup.png
│   ├── dqn.png
│   ├── dynarollout.png
│   ├── fitV.png
│   ├── im_RNN.png
│   ├── imitation_div.png
│   ├── latent.png
│   ├── localmodel.png
│   ├── marginal.png
│   ├── markov.png
│   ├── modelnn.png
│   ├── multimodal.png
│   ├── opt.png
│   ├── overfit.png
│   ├── parallelsim.png
│   ├── poliback.png
│   ├── qwrb.png
│   ├── rlanatomy.png
│   ├── trajheat.png
│   ├── vae.png
│   └── varinf.png
├── fittedQ.tex
├── fittedvaliter.tex
├── guided.tex
├── ilqr.tex
├── imitation.tex
├── intro.tex
├── inverse.tex
├── lqr.tex
├── main.tex
├── maxent.tex
├── mb05.tex
├── mb10.tex
├── mb15.tex
├── mb20.tex
├── mblatent.tex
├── mbpolicy.tex
├── mcts.tex
├── modelbased.tex
├── offline.tex
├── onlineQiter.tex
├── onlineac.tex
├── pgtheory.tex
├── policyiter1.tex
├── policyiter2.tex
├── poligrad.tex
├── preface.tex
├── pretrain.tex
├── pseudocount.tex
├── qfunc.tex
├── qwrb_tn.tex
├── ref.bib
├── transfer.tex
├── value.tex
└── varinfer.tex
/Deep_RL.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/harryzhangOG/Deep-RL-Notes/a2280cf46ec66326300a3272485189931562fefc/Deep_RL.pdf
--------------------------------------------------------------------------------
/QwRB.tex:
--------------------------------------------------------------------------------
1 | \begin{algorithm}[t!]
2 | \caption{Q-Learning with Replay Buffer}
3 | \begin{algorithmic}[1]
4 | \label{alg:QwRB}
5 | \REQUIRE Some base policy for data collection; hyperparameter $K$
6 | \WHILE{true}
7 | \STATE Collect dataset $\{(s_i,a_i,s'_i,r_i)\}$ using some policy, add it to replay buffer $\mathcal{B}$
8 | \FOR{$K$ times}
9 | \STATE Sample a batch $(s_i,a_i,s'_i,r_i)$ from $\mathcal{B}$
10 | \STATE Set $y_i\leftarrow r(s_i,a_i) + \gamma \max_{a'_i}Q_\phi(s'_i,a'_i)$
11 | \STATE Set $\phi \leftarrow \phi-\alpha\sum_i\frac{dQ_\phi}{d\phi}(s_i,a_i)(Q_\phi(s_i,a_i) - y_i)$
12 | \ENDFOR
13 | \ENDWHILE
14 | \end{algorithmic}
15 | \end{algorithm}
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Deep Reinforcement Learning Textbook
2 | ## A collection of comprehensive notes on Deep Reinforcement Learning, based on UC Berkeley's CS 285 (prev. CS 294-112) taught by Professor Sergey Levine.
3 | * Compile the LaTeX source code into a PDF locally.
4 | * Alternatively, you could download this repo as a zip file and upload the zip file to Overleaf and start editing online.
5 | * This repo is linked to my Overleaf editor so it is regularly updated.
6 | * Please let me know if you have any questions or suggestions. Reach me via
7 |
8 | ## Introduction
9 | In recent years, deep reinforcement learning (DRL) has emerged as a transformative paradigm, bridging the domains of artificial intelligence, machine learning, and robotics to enable the creation of intelligent, adaptive, and autonomous systems. This textbook is designed to provide a comprehensive, in-depth introduction to the principles, techniques, and applications of deep reinforcement learning, empowering students, researchers, and practitioners to advance the state of the art in this rapidly evolving field. As the first DRL class I took was Prof.
Levine's CS 294-112, this book's organization and materials are based closely on the slides and syllabus of CS 294-112 (now CS 285).
10 |
11 | The primary objective of this textbook is to offer a systematic and rigorous treatment of DRL, from foundational concepts and mathematical formulations to cutting-edge algorithms and practical implementations. We strive to strike a balance between theoretical clarity and practical relevance, providing readers with the knowledge and tools needed to develop novel DRL solutions for a wide array of real-world problems.
12 |
13 | The textbook is organized into several parts, each dedicated to a specific aspect of DRL:
14 |
15 | 1. Fundamentals: This part covers the essential background material in reinforcement learning, including Markov decision processes, value functions, and fundamental algorithms such as Q-learning and policy gradients.
16 | 2. Deep Learning for Reinforcement Learning: Here, we delve into the integration of deep learning techniques with reinforcement learning, discussing topics such as function approximation, representation learning, and the use of deep neural networks as function approximators.
17 | 3. Advanced Techniques and Algorithms: This part presents state-of-the-art DRL algorithms, such as Deep Q-Networks (DQN), Proximal Policy Optimization (PPO), and Soft Actor-Critic (SAC), along with their theoretical underpinnings and practical considerations.
18 | 4. Exploration and Exploitation: We explore strategies for balancing exploration and exploitation in DRL, examining methods such as intrinsic motivation, curiosity-driven learning, and Bayesian optimization.
19 | 5. Real-World Applications: This section showcases the application of DRL to various domains, including robotics, computer vision, natural language processing, and healthcare, highlighting the challenges and opportunities in each area.
20 | Throughout the textbook, we supplement the theoretical exposition with practical examples, case studies, and programming exercises, allowing readers to gain hands-on experience in implementing DRL algorithms and applying them to diverse problems. We also provide references to relevant literature, guiding the reader towards further resources for deepening their understanding and pursuing advanced topics.
21 |
22 | We envision this textbook as a valuable resource for students, researchers, and practitioners seeking a solid grounding in deep reinforcement learning, as well as a springboard for future innovation and discovery in this exciting and dynamic field. It is our hope that this work will contribute to the ongoing growth and development of DRL, facilitating the creation of intelligent systems that can learn, adapt, and thrive in complex, ever-changing environments.
23 |
24 | We extend our deepest gratitude to our colleagues, reviewers, and students, whose invaluable feedback and insights have helped shape this textbook. We also wish to acknowledge the pioneering researchers whose contributions have laid the foundation for DRL and inspired us to embark on this journey.
25 |
26 | ## Update Log
27 | * Aug 26, 2020: Started adding Fall 2020 materials
28 | * Aug 28, 2020: Fixed typos in Intro. Credit: Warren Deng.
29 | * Aug 30, 2020: Added more explanation to the imitation learning chapter.
30 | * Sep 13, 2020: Added advanced PG material to the PG chapter and fixed typos in PG.
31 | * Sep 14, 2020: AC chapter formatting, typo fixes, more analysis on A2C
32 | * Sep 16, 2020: Chapter 10.1 KL div typo fix. Credit: Cong Wang.
33 | * Sep 19, 2020: Chapter 3.7.1 parenthesis typo fix. Credit: Yunkai Zhang.
34 | * Sep 23, 2020: Q-learning chapter fix
35 | * Sep 26, 2020: More explanation and fixes to the advanced PG chapter (specifically the intuition behind TRPO).
36 | * Sep 28, 2020: Typos fixed and more explanation in Optimal Control. The typos were pointed out in Professor Levine's lecture.
37 | * Oct 6, 2021: Model-based RL chapter fixed. Added Distillation subsection.
38 | * Nov. 20, 2021: Fixed typos in DDPG, Online Actor Critic, and PG theory. Credit: Javier Leguina.
39 | * Apr. 2, 2023: Fixed typos in VAE and PG theory. Credit: wangcongrobot
40 |
--------------------------------------------------------------------------------
/REINFORCE.tex:
--------------------------------------------------------------------------------
1 | \begin{algorithm}[t!]
2 | \caption{REINFORCE Algorithm}
3 | \begin{algorithmic}[1]
4 | \label{alg:reinforce}
5 | \REQUIRE Base policy $\pi_\theta(a_t|s_t)$, sample trajectories $\tau^i$
6 |
7 | \WHILE{true}
8 | \STATE Sample $\{\tau^i\}$ from $\pi_\theta(a_t|s_t)$ (run it on a robot).
9 | \STATE $\nabla_\theta J(\theta) \simeq \frac{1}{N}\sum_i\left(\sum_t\nabla_\theta \log\pi_\theta(a_{i,t}|s_{i,t})\right)\left(\sum_t r(s_{i,t},a_{i,t})\right)$
10 | \STATE Improve policy by $\theta \leftarrow \theta + \alpha\nabla_\theta J(\theta)$
11 | \ENDWHILE
12 | \RETURN optimal policy from gradient ascent as $\pi^{return}$
13 | \end{algorithmic}
14 | \end{algorithm}
--------------------------------------------------------------------------------
/actorcritic.tex:
--------------------------------------------------------------------------------
1 | \chapter{Actor-Critic Algorithms}
2 | Recall that in the last chapter, we derived the following policy gradient estimator:
3 | $$\nabla_\theta J(\theta) \simeq \frac{1}{N}\sum_{i=1}^N\sum_{t=1}^T\nabla_\theta \log\pi_\theta(a_{i,t}|s_{i,t})\left(\sum_{t'=t}^T r(s_{i,t'},a_{i,t'})\right)$$
4 | where we defined the summed reward as the ``reward-to-go'' $\hat{Q}_{i,t}$, which estimates the expected reward if we take action $a_{i,t}$ in state $s_{i,t}$. We have shown that this estimate has very high variance, and in this chapter we shall see how we can improve policy gradients by using better estimates of the reward-to-go function.
5 |
6 | \section{Reward-to-Go}
7 | Let us take a closer look at the reward-to-go. One way to improve the estimate is to get closer to the precise value of the reward-to-go, which we can define using an expectation:
8 | $$Q(s_t,a_t) = \sum_{t'=t}^T \mathbb{E}_{\pi_\theta}\left[r(s_{t'},a_{t'})|s_t,a_t\right]$$
9 | this is the \textbf{true, expected} value of the reward-to-go.
10 |
11 | Therefore, one could imagine using this true expected value, combined with our original Monte Carlo approximation, to yield a better estimate:
12 | $$\nabla_\theta J(\theta) \simeq \frac{1}{N}\sum_{i=1}^N\sum_{t=1}^T\nabla_{\theta}\log \pi_\theta (a_{i,t}|s_{i,t})Q(s_{i,t},a_{i,t})$$.
13 |
14 | \section{Using Baselines}
15 | As we saw in the last chapter, one can reduce the high variance of the policy gradient by using baselines. We have also seen that it is possible to calculate the optimal baseline value that yields the minimum variance, although people often use the average reward for the sake of simplicity.
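To make this concrete, here is a minimal sketch (in Python with NumPy) of the Monte Carlo policy-gradient estimator above with the average reward-to-go used as a baseline. The array names and shapes are illustrative rather than part of any library, and we assume the per-step score-function gradients $\nabla_\theta\log\pi_\theta(a_{i,t}|s_{i,t})$ have already been computed and stacked into an array.
\begin{verbatim}
import numpy as np

def reward_to_go(rewards):
    # rewards: (N, T) array; returns Q_hat[i, t] = sum_{t' >= t} r[i, t']
    return np.flip(np.cumsum(np.flip(rewards, axis=1), axis=1), axis=1)

def policy_gradient_estimate(grad_log_pi, rewards):
    # grad_log_pi: (N, T, D) array of per-step score-function gradients
    # rewards:     (N, T) array of per-step rewards
    q_hat = reward_to_go(rewards)                 # reward-to-go estimates
    baseline = q_hat.mean(axis=0, keepdims=True)  # average over the N samples
    advantage = q_hat - baseline                  # centered reward-to-go
    # (1/N) * sum_i sum_t grad_log_pi[i, t] * (Q_hat[i, t] - b_t)
    return np.einsum('nt,ntd->d', advantage, grad_log_pi) / rewards.shape[0]
\end{verbatim}
Here \texttt{q\_hat} plays the role of $\hat{Q}_{i,t}$, and \texttt{baseline} is the per-time-step average used as $b_t$.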
16 |
17 | Motivated by this, let us recall the definition of the value function (defined in the introduction section):
18 | $$V(s_t) = \mathbb{E}_{a_t\sim \pi_\theta(a_t|s_t)}\left[Q(s_t,a_t)\right]$$
19 | By definition, the value function is the average of the Q-function values over the actions.
20 |
21 | Similarly, we can use the \textbf{average} reward-to-go as a baseline to reduce the variance. Specifically, we could use the value function $V(s_t)$ as the baseline, thus improving the estimate of the gradient in the following way:
22 | $$\nabla_\theta J(\theta) \simeq \frac{1}{N}\sum_{i=1}^N\sum_{t=1}^T\nabla_{\theta}\log \pi_\theta (a_{i,t}|s_{i,t})\left(Q(s_{i,t},a_{i,t}) - V(s_{i,t})\right)$$
23 | and the value function serves as a better, state-dependent version of the average baseline $b_t = \frac{1}{N}\sum_i Q(s_{i,t},a_{i,t})$.
24 |
25 | What have we done here? What is the intuition behind subtracting the value function from the Q-function? Essentially, we are quantifying how much better an action $a_{i,t}$ is than the average action. In some sense, it measures the \textbf{advantage} of applying a particular action over the average action. Therefore, to formalize this intuition, let us define the advantage as follows:
26 | $$A^\pi(s_t,a_t) = Q^\pi(s_t,a_t) - V^\pi(s_t)$$
27 | which quantitatively measures how much better action $a_t$ is.
28 |
29 | Putting it all together, a better, baseline-backed Monte Carlo policy gradient estimate can be written as:
30 | $$\nabla_\theta J(\theta) \simeq \frac{1}{N}\sum_{i=1}^N\sum_{t=1}^T\nabla_{\theta}\log \pi_\theta (a_{i,t}|s_{i,t})A^\pi(s_{i,t},a_{i,t})$$.
31 |
32 | \section{Value Function Fitting}
33 | The better the estimate of the advantage function, the lower the variance, and the better the policy gradient. Let us massage the definition of the Q-function a little in order to find some useful mathematical relations between $Q$ and $V$:
34 | \begin{align*}
35 | Q^\pi(s_t,a_t) &= \sum_{t'=t}^T \mathbb{E}_{\pi_\theta}\left[r(s_{t'},a_{t'})|s_t,a_t\right]\\
36 | &= r(s_t,a_t)+\sum_{t'=t+1}^T \mathbb{E}_{\pi_\theta}\left[r(s_{t'},a_{t'})|s_t,a_t\right]\\
37 | &= r(s_t,a_t) + \mathbb{E}_{s_{t+1}\sim p(s_{t+1}|s_t,a_t)}\left[V^\pi(s_{t+1})\right]\\
38 | &\simeq r(s_t,a_t) + V^\pi(s_{t+1})
39 | \end{align*}
40 | The expectation over the next state appears because we do not know which next state will actually occur. In the last step, we are a little crude with respect to that expectation: we simply evaluate the full value function $V^\pi(\cdot)$ on the single sampled next state and use that value in place of the expectation, ignoring the fact that other next states are possible. With this estimate, we can plug into the advantage function:
41 | $$A^\pi(s_t,a_t) \simeq r(s_t,a_t) + V^\pi(s_{t+1}) - V^\pi(s_t)$$
42 |
43 | Therefore, it is almost enough to approximate just the value function, which depends solely on the state, in order to generate approximations of the other quantities. To achieve this, we can use a neural network to fit our value function $V(s)$, and use the fitted value function to approximate our policy gradient, as illustrated in Fig. \ref{fig:fitV}.
44 | \begin{figure}
45 | \centering
46 | \includegraphics[scale=0.5]{figures/fitV.png}
47 | \caption{Fitting the value function}
48 | \label{fig:fitV}
49 | \end{figure}
50 |
51 | \section{Policy Evaluation}
52 | In this section, we discuss the process and purpose of fitting the value function.
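Concretely, the fitted value function of Fig.~\ref{fig:fitV} can be represented by a small network that maps a state to a scalar value. Below is a minimal sketch, assuming PyTorch; the class name, layer sizes, and activations are illustrative choices rather than anything prescribed by the method. The rest of this section discusses how to generate training targets for a network like this.
\begin{verbatim}
import torch
import torch.nn as nn

class ValueNet(nn.Module):
    """Small MLP mapping a state s to a scalar estimate of V(s)."""
    def __init__(self, state_dim, hidden_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden_dim), nn.Tanh(),
            nn.Linear(hidden_dim, hidden_dim), nn.Tanh(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, s):               # s: (batch, state_dim)
        return self.net(s).squeeze(-1)  # (batch,) value estimates
\end{verbatim}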
53 |
54 | \subsection{Why Do We Evaluate a Policy}
55 | Policy evaluation is the process of figuring out how good a given, fixed policy $\pi$ is by fitting its value function $V^\pi(\cdot)$, defined by the expectation:
56 | $$V^\pi(s_t) = \sum_{t'=t}^T\mathbb{E}_{\pi_\theta}\left[r(s_{t'},a_{t'})|s_t\right]$$
57 | Having the value function allows us to figure out how good the policy is because the reinforcement learning objective can be equivalently written as $J(\theta) = \mathbb{E}_{s_1\sim p(s_1)}\left[V^\pi(s_1)\right]$, where we take the expectation of the value of the initial state over all possible initial states.
58 | \subsection{How to Evaluate a Policy}
59 | To evaluate a policy, we can use an approach similar to the one we used for the policy gradient: Monte Carlo approximation. Specifically, we can estimate the value function by summing up the rewards collected from time step $t$ onward:
60 | $$V^\pi(s_t) \simeq \sum_{t'=t}^T r(s_{t'},a_{t'})$$
61 | and if we are able to reset the simulator, we could improve this estimate by taking multiple ($N$) samples as follows:
62 | $$V^\pi(s_t) \simeq \frac{1}{N}\sum_{i=1}^N\sum_{t'=t}^T r(s_{i,t'},a_{i,t'})$$
63 | In practice, we can just use the single-sample approximation.
64 |
65 | Here is a question: if our original objective is to use $V^\pi$ to reduce the variance, but we end up using a single-sample estimate of $V^\pi$, does it actually help? The answer is yes, because we are using a neural net to fit the Monte Carlo targets from a variety of different states, so even though each target is a single-sample estimate, the fitted value function generalizes when we visit similar states.
66 |
67 | \subsection{Monte Carlo Evaluation with Function Approximation}
68 | To fit our value function, we can use a supervised learning approach. Essentially, we use our single-sample estimates of the value function as the regression targets and fit a function that maps states to these values. Therefore, our training data will be $\left\{\left(s_{i,t}, \sum_{t'=t}^Tr(s_{i,t'},a_{i,t'})\right)\right\}$, where we denote the labels as $y_{i,t}$, and we minimize a typical supervised regression loss $\mathcal{L}(\phi) = \frac{1}{2}\sum_i\left\|\hat{V}_\phi^\pi(s_i)-y_i\right\|^2$.
69 |
70 | \subsection{Improving the Estimate Using Bootstrap}
71 | In fact, we can improve our training process because the Monte Carlo target $y_{i,t}$ is not perfect. We can use a technique called \textbf{bootstrapping}. Recall the definition of the ideal target in the supervised regression:
72 | $$\begin{aligned}
73 | y_{i,t} &= \sum_{t'=t}^T\mathbb{E}_{\pi_\theta}\left[r(s_{t'},a_{t'})|s_{i,t}\right]\\
74 | & \simeq r(s_{i,t},a_{i,t})+\sum_{t'=t+1}^T\mathbb{E}_{\pi_\theta}\left[r(s_{t'},a_{t'})|s_{i,t+1}\right]\\
75 | & \simeq r(s_{i,t},a_{i,t}) + V^\pi(s_{i,t+1})
76 | \end{aligned}$$
77 | Compare this with our Monte Carlo targets: $y_{i,t} = \sum_{t'=t}^T r(s_{i,t'},a_{i,t'})$.
78 |
79 | Bootstrapping means plugging our current estimate of the value function into the construction of its own training targets. In the ideal targets above, the last expression would be exact if we knew the actual $V^\pi$. Since the actual value function is not known, we apply bootstrapping by using the current fitted estimate $\hat{V}^\pi_\phi$ to estimate the next state's value: $\hat{V}^\pi_\phi(s_{i,t+1})$. Such an estimate is biased, but it has low variance.
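As a rough sketch of this regression (assuming PyTorch and a value network like the \texttt{ValueNet} sketch above; the function name, the optimizer, and the number of gradient steps are all illustrative), one round of fitting on bootstrapped targets might look like the following, where \texttt{s}, \texttt{r}, and \texttt{s\_next} are tensors of sampled transitions:
\begin{verbatim}
import torch

def fit_value_function(value_net, optimizer, s, r, s_next, num_steps=50):
    """Supervised regression on bootstrapped targets y = r + V_phi(s').
    (Undiscounted here; a discount factor is introduced in the next section.)"""
    for _ in range(num_steps):
        with torch.no_grad():              # treat the targets as constants
            y = r + value_net(s_next)      # bootstrapped target
        loss = 0.5 * ((value_net(s) - y) ** 2).mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
\end{verbatim}
Replacing the bootstrapped target with the Monte Carlo return recovers the pure Monte Carlo evaluation described above.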
80 |
81 | Consequently, our training data using bootstrapping becomes: \[\left\{(s_{i,t}, r(s_{i,t},a_{i,t}) +\hat{V}^\pi_\phi(s_{i,t+1})) \right\}\]. Such bootstrapped targets work well in highly stochastic environments.
82 |
83 | \section{Batch Actor-Critic Algorithm}
84 | Now we are ready to devise our first actor-critic algorithm. The reason we call it actor-critic is that we use a critic (the value function) to decrease the high variance of the actor (the policy). The full algorithm is shown in Alg. \ref{alg:batchac}, and we call it a batch algorithm because it is not online; we shall see the online version later.
85 | \input{batchac.tex}
86 | In Algorithm \ref{alg:batchac}, we fit $\hat{V}_\phi$ by minimizing the supervised regression loss $\mathcal{L}(\phi) = \frac{1}{2}\sum_i\left\|\hat{V}_\phi^\pi(s_i)-y_i\right\|^2$.
87 | \section{Aside: Discount Factors}
88 | Imagine we had an infinite-horizon environment ($T\rightarrow\infty$); then our estimated value function $\hat{V}^\pi_\phi(s)$ can become infinitely large in many cases. One possible way to address this issue is to say that it is better to get rewards sooner rather than later. Therefore, instead of labeling our values as $y_{i,t} \simeq r(s_{i,t}, a_{i,t}) + \hat{V}^\pi_\phi(s_{i,t+1})$, we shrink the value function's contribution as we progress to the next time step. To achieve this, we introduce a hyperparameter called the \textbf{discount factor}, denoted $\gamma$, where $\gamma \in [0,1]$:
89 | $$y_{i,t} \simeq r(s_{i,t}, a_{i,t}) + \gamma\cdot\hat{V}^\pi_\phi(s_{i,t+1})$$
90 | In most cases, $\gamma = 0.99$ works well.
91 |
92 | Let us apply the discount factor to policy gradients. Basically, we have two options for where to impose the discount factor. The first option is:
93 | $$\nabla_\theta J(\theta) \simeq \frac{1}{N}\sum_{i=1}^N\sum_{t=1}^T\nabla_\theta \log\pi_\theta(a_{i,t}|s_{i,t})\left(\sum_{t'=t}^T \gamma^{t'-t}r(s_{i,t'},a_{i,t'})\right)$$
94 | and the second option is:
95 | $$\begin{aligned}
96 | \nabla_\theta J(\theta) &\simeq \frac{1}{N}\sum_{i=1}^N\left(\sum_{t=1}^T\nabla_\theta \log\pi_\theta(a_{i,t}|s_{i,t})\right)\left(\sum_{t'=1}^T \gamma^{t'-1}r(s_{i,t'},a_{i,t'})\right)\\
97 | &\simeq \frac{1}{N}\sum_{i=1}^N\sum_{t=1}^T\nabla_\theta \log\pi_\theta(a_{i,t}|s_{i,t})\left(\sum_{t'=t}^T \gamma^{t'-1}r(s_{i,t'},a_{i,t'})\right) \;\mathrm{(causality)}\\
98 | &\simeq \frac{1}{N}\sum_{i=1}^N\sum_{t=1}^T\gamma^{t-1}\nabla_\theta \log\pi_\theta(a_{i,t}|s_{i,t})\left(\sum_{t'=t}^T \gamma^{t'-t}r(s_{i,t'},a_{i,t'})\right)
99 | \end{aligned}$$
100 | Intuitively, the second option assigns smaller weights to the gradients at later steps, so it essentially says that decisions made at later steps matter less under the discount.
101 |
102 | In practice, one can show that option 1 gives us better variance, so it is what we actually use. The full derivation can be found in \cite{thomas2014bias}. Now, after we impose the discount factor in our actor-critic algorithm, we have the following gradient:
103 | $$\nabla_\theta J(\theta) \simeq \frac{1}{N}\sum_{i=1}^N\sum_{t=1}^T\nabla_\theta \log\pi_\theta(a_{i,t}|s_{i,t})\left(r(s_{i,t},a_{i,t}) + \gamma\hat{V}_\phi^\pi(s_{i,t+1})-\hat{V}_\phi^\pi(s_{i,t})\right)$$
104 |
105 | Now we can incorporate the discount factor into our actor-critic algorithm, shown in Algorithm \ref{alg:batchacwdf}.
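As a rough sketch of what one iteration of that algorithm computes (assuming PyTorch; \texttt{log\_prob}, \texttt{value\_net}, and \texttt{actor\_opt} are illustrative placeholders rather than prescribed interfaces), the discounted advantage evaluation and the policy gradient step might look like:
\begin{verbatim}
import torch

def actor_critic_step(log_prob, value_net, actor_opt, s, a, r, s_next, gamma=0.99):
    """One policy update using the discounted advantage
    A(s, a) ~ r + gamma * V(s') - V(s)."""
    with torch.no_grad():                   # the critic is held fixed here
        advantage = r + gamma * value_net(s_next) - value_net(s)
    # log_prob(s, a) should return log pi_theta(a|s) for each sampled pair
    actor_loss = -(log_prob(s, a) * advantage).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()
\end{verbatim}
The critic itself would be refit to the discounted targets $r + \gamma\hat{V}^\pi_\phi(s')$, as in the value-fitting sketch of the previous section.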
106 | \input{batchacwdf.tex}
107 |
108 | \section{Online Actor-Critic Algorithm}
109 | Now that we have seen actor-critic algorithms that operate on a batch of samples, we can further improve efficiency by making the algorithm fully online. Namely, we take a gradient step based on the current sample, so we do not need to store a large number of samples. In the online version of actor-critic, we essentially use two neural networks: one for the policy and one for the value function. This is simple and stable, but as the state dimension becomes higher, the actor and the critic do not share any features. Therefore, we can also share a network between the policy and the value function. For example, with image-based observations, we could share the convolutional layers' weights between the two networks and only let them differ in the final fully connected layers.
110 |
111 | In each step, we take only one sample and gradually improve our value function using that sample. A sketch of the online version of the actor-critic algorithm is given in Algorithm \ref{alg:onlineac}.
112 | \input{onlineac.tex}
113 | Note that in steps 3-5, we are taking a gradient step from only one sample. In practice, this works best if we use a batch of samples instead of just one, and one can use parallel workers (simulators), either synchronously or asynchronously, to achieve this, as illustrated in Fig. \ref{fig:parallelsim}.
114 | \begin{figure}
115 | \centering
116 | \includegraphics[scale=0.5]{figures/parallelsim.png}
117 | \caption{Parallel simulations for online actor-critic}
118 | \label{fig:parallelsim}
119 | \end{figure}
120 |
121 | One caveat about the asynchronous version is that while the parameter server gets updated, the data-collection policy might not be, which means the newly collected data might not come from the latest policy; the acting policy is thus slightly outdated. This is less of an issue in practice because the policy only changes by a tiny bit in each update.
122 |
123 | \section{Critics as State-Dependent Baselines}
124 | Now let us further discuss the connection between a baseline and a critic. Recall that in the Monte Carlo version of the policy gradient, the gradient is estimated as:
125 | $$\nabla_\theta J(\theta) \simeq \frac{1}{N}\sum_{i=1}^N\sum_{t=1}^T\nabla_\theta \log\pi_\theta(a_{i,t}|s_{i,t})\left(\sum_{t'=t}^T r(s_{i,t'},a_{i,t'}) - b\right)$$
126 | and in the actor-critic algorithm, we estimate the gradient by estimating the advantage function:
127 | $$\nabla_\theta J(\theta) \simeq \frac{1}{N}\sum_{i=1}^N\sum_{t=1}^T\nabla_\theta \log\pi_\theta(a_{i,t}|s_{i,t})\left(r(s_{i,t},a_{i,t}) + \gamma\hat{V}_\phi^\pi(s_{i,t+1})-\hat{V}_\phi^\pi(s_{i,t})\right)$$
128 |
129 | So what are the pros and cons of the two approaches? In the policy gradient with a baseline, we have shown that there is no bias in our estimate, but there may be high variance due to the single-sample estimation of the reward-to-go. On the other hand, in the actor-critic algorithm, we have lower variance thanks to the critic, but we end up with a biased estimate because the bootstrapped critic may be imperfect. So can we somehow keep the estimator unbiased while lowering the variance with the critic $\hat{V}^\pi_\phi$?
130 |
131 | The solution is straightforward: we can just use $\hat{V}^\pi_\phi(s_{i,t})$ in place of $b$:
132 | $$\nabla_\theta J(\theta) \simeq \frac{1}{N}\sum_{i=1}^N\sum_{t=1}^T\nabla_\theta \log\pi_\theta(a_{i,t}|s_{i,t})\left(\sum_{t'=t}^T r(s_{i,t'},a_{i,t'}) - \hat{V}^\pi_\phi(s_{i,t})\right)$$
133 | In this way, we obtain an unbiased estimator with lower variance.
134 |
135 | \section{Eligibility Traces and n-Step Returns}
136 | In the above comparison of the two methods, we saw that the actor-critic advantage estimate has lower variance but higher bias, while the Monte Carlo advantage estimate has lower bias but higher variance. The reason this tradeoff exists is that as we go further into the future along a trajectory, the variance increases, because a single-sample approximation becomes less and less representative of the future. Therefore, the Monte Carlo advantage is good for getting accurate values in the near term, but not in the long term. In contrast, in the actor-critic advantage, the bias potentially skews the values in the near term, but the fact that the critic averages over many visited states likely makes it a better approximator in the long run. Therefore, it would be better if we could use the actor-critic based advantage further into the future, and use the Monte Carlo based one for the near term, in order to control the bias-variance tradeoff.
137 |
138 | As a result, we can cut the trajectory before the variance gets too big. Mathematically, we can estimate the advantage function by combining the two approaches, using the Monte Carlo approach only for the first $n$ steps:
139 | $$\hat{A}^\pi_n(s_t,a_t) = \sum_{t'=t}^{t+n}\gamma^{t'-t}r(s_{t'},a_{t'}) - \hat{V}^\pi_\phi(s_t)+\gamma^n\hat{V}^\pi_\phi(s_{t+n})$$
140 | Here we applied an $n$-step estimator, which sums the rewards from the current step up to $n$ steps into the future; $n>1$ often gives us better performance.
141 |
142 | Furthermore, if we do not want to commit to just one $n$, we can use a weighted combination of different $n$-step returns, which is known as Generalized Advantage Estimation (GAE):
143 | $$ \hat{A}_{GAE}(s_t,a_t) = \sum_{n=1}^\infty w_n \hat{A}^\pi_n(s_t,a_t)$$
144 | To choose the weights, we should mostly prefer cutting earlier (to keep the variance low), so we can assign exponentially decaying weights $w_n\propto \lambda^{n-1}$, where $\lambda\in[0,1]$ controls how quickly the weights decay: a smaller $\lambda$ effectively cuts the trajectory earlier.
145 |
146 |
--------------------------------------------------------------------------------
/batchac.tex:
--------------------------------------------------------------------------------
1 | \begin{algorithm}[t!]
2 | \caption{Batch Actor-Critic Algorithm}
3 | \begin{algorithmic}[1]
4 | \label{alg:batchac}
5 | \REQUIRE Base policy $\pi_\theta(a_t|s_t)$
6 |
7 | \WHILE{true}
8 | \STATE Sample $\{s_i,a_i\}$ from $\pi_\theta(a|s)$ (run it on a robot)
9 | \STATE Fit $\hat{V}_\phi(s)$ to sampled reward sums
10 | \STATE Evaluate $\hat{A}^\pi(s_i,a_i) = r(s_i,a_i)+\hat{V}_\phi(s'_i)-\hat{V}_\phi(s_i)$
11 | \STATE $\nabla_\theta J(\theta) \simeq \sum_i\nabla_\theta\log \pi_\theta(a_i|s_i)\hat{A}^\pi(s_i,a_i)$
12 | \STATE Improve policy by $\theta \leftarrow \theta + \alpha\nabla_\theta J(\theta)$
13 | \ENDWHILE
14 | \RETURN optimal policy from gradient ascent as $\pi^{return}$
15 | \end{algorithmic}
16 | \end{algorithm}
--------------------------------------------------------------------------------
/batchacwdf.tex:
--------------------------------------------------------------------------------
1 | \begin{algorithm}[t!]
2 | \caption{Batch Actor-Critic Algorithm with Discount Factor}
3 | \begin{algorithmic}[1]
4 | \label{alg:batchacwdf}
5 | \REQUIRE Base policy $\pi_\theta(a_t|s_t)$, hyperparameter $\gamma$
6 |
7 | \WHILE{true}
8 | \STATE Sample $\{s_i,a_i\}$ from $\pi_\theta(a|s)$ (run it on a robot)
9 | \STATE Fit $\hat{V}_\phi(s)$ to sampled reward sums
10 | \STATE Evaluate $\hat{A}^\pi(s_i,a_i) = r(s_i,a_i)+\gamma\hat{V}_\phi(s'_i)-\hat{V}_\phi(s_i)$
11 | \STATE $\nabla_\theta J(\theta) \simeq \sum_i\nabla_\theta\log \pi_\theta(a_i|s_i)\hat{A}^\pi(s_i,a_i)$
12 | \STATE Improve policy by $\theta \leftarrow \theta + \alpha\nabla_\theta J(\theta)$
13 | \ENDWHILE
14 | \RETURN optimal policy from gradient ascent as $\pi^{return}$
15 | \end{algorithmic}
16 | \end{algorithm}
--------------------------------------------------------------------------------
/cem.tex:
--------------------------------------------------------------------------------
1 | \begin{algorithm}[t!]
2 | \caption{Cross Entropy Method with Continuous-valued Input}
3 | \begin{algorithmic}[1]
4 | \label{alg:cem}
5 | \REQUIRE Some base distribution for action sequence $p(A)$
6 | \WHILE{true}
7 | \STATE Sample $A_1,...,A_N$ from $p(A)$
8 | \STATE Evaluate $J(A_1),...,J(A_N)$
9 | \STATE Pick elites $A_{i_1},...,A_{i_M}$ with the highest value, where $M