├── Deep_RL.pdf
├── QwRB.tex
├── README.md
├── REINFORCE.tex
├── actorcritic.tex
├── batchac.tex
├── batchacwdf.tex
├── cem.tex
├── ctrasinf.tex
├── dagger.tex
├── ddpg.tex
├── dqn.tex
├── dyna.tex
├── dynagen.tex
├── exploration.tex
├── figures
│   ├── bellmanbackup.png
│   ├── dqn.png
│   ├── dynarollout.png
│   ├── fitV.png
│   ├── im_RNN.png
│   ├── imitation_div.png
│   ├── latent.png
│   ├── localmodel.png
│   ├── marginal.png
│   ├── markov.png
│   ├── modelnn.png
│   ├── multimodal.png
│   ├── opt.png
│   ├── overfit.png
│   ├── parallelsim.png
│   ├── poliback.png
│   ├── qwrb.png
│   ├── rlanatomy.png
│   ├── trajheat.png
│   ├── vae.png
│   └── varinf.png
├── fittedQ.tex
├── fittedvaliter.tex
├── guided.tex
├── ilqr.tex
├── imitation.tex
├── intro.tex
├── inverse.tex
├── lqr.tex
├── main.tex
├── maxent.tex
├── mb05.tex
├── mb10.tex
├── mb15.tex
├── mb20.tex
├── mblatent.tex
├── mbpolicy.tex
├── mcts.tex
├── modelbased.tex
├── offline.tex
├── onlineQiter.tex
├── onlineac.tex
├── pgtheory.tex
├── policyiter1.tex
├── policyiter2.tex
├── poligrad.tex
├── preface.tex
├── pretrain.tex
├── pseudocount.tex
├── qfunc.tex
├── qwrb_tn.tex
├── ref.bib
├── transfer.tex
├── value.tex
└── varinfer.tex
/Deep_RL.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/harryzhangOG/Deep-RL-Notes/a2280cf46ec66326300a3272485189931562fefc/Deep_RL.pdf
--------------------------------------------------------------------------------
/QwRB.tex:
--------------------------------------------------------------------------------
1 | \begin{algorithm}[t!]
2 | \caption{Q-Learning with Replay Buffer}
3 | \begin{algorithmic}[1]
4 | \label{alg:QwRB}
5 | \REQUIRE Some base policy for data collection; hyperparameter $K$
6 | \WHILE{true}
7 | \STATE Collect dataset $\{(s_i,a_i,s'_i,r_i)\}$ using some policy, add it to replay buffer $\mathcal{B}$
8 | \FOR{$K$ times}
9 | \STATE Sample a batch $(s_i,a_i,s'_i,r_i)$ from $\mathcal{B}$
10 | \STATE Set $y_i\leftarrow r(s_i,a_i) + \gamma \max_{a'_i}Q_\phi(s'_i,a'_i)$
11 | \STATE Set $\phi \leftarrow \phi-\alpha\sum_i\frac{dQ_\phi}{d\phi}(s_i,a_i)(Q_\phi(s_i,a_i) - y_i)$
12 | \ENDFOR
13 | \ENDWHILE
14 | \end{algorithmic}
15 | \end{algorithm}
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Deep Reinforcement Learning Textbook
2 | ## A collection of comprehensive notes on Deep Reinforcement Learning, based on UC Berkeley's CS 285 (prev. CS 294-112) taught by Professor Sergey Levine.
3 | * Compile the LaTeX source code into a PDF locally.
4 | * Alternatively, you could download this repo as a zip file and upload the zip file to Overleaf and start editing online.
5 | * This repo is linked to my Overleaf editor so it is regularly updated.
6 | * Please let me know if you have any questions or suggestions. Reach me via
7 |
8 | ## Introduction
9 | In recent years, deep reinforcement learning (DRL) has emerged as a transformative paradigm, bridging the domains of artificial intelligence, machine learning, and robotics to enable the creation of intelligent, adaptive, and autonomous systems. This textbook is designed to provide a comprehensive, in-depth introduction to the principles, techniques, and applications of deep reinforcement learning, empowering students, researchers, and practitioners to advance the state of the art in this rapidly evolving field. As the first DRL class I took was Prof.
Levine's CS 294-112, this book's organization and materials are based closely on the slides and syllabus of CS 294-112 (now CS 285).
10 |
11 | The primary objective of this textbook is to offer a systematic and rigorous treatment of DRL, from foundational concepts and mathematical formulations to cutting-edge algorithms and practical implementations. We strive to strike a balance between theoretical clarity and practical relevance, providing readers with the knowledge and tools needed to develop novel DRL solutions for a wide array of real-world problems.
12 |
13 | The textbook is organized into several parts, each dedicated to a specific aspect of DRL:
14 |
15 | 1. Fundamentals: This part covers the essential background material in reinforcement learning, including Markov decision processes, value functions, and fundamental algorithms such as Q-learning and policy gradients.
16 | 2. Deep Learning for Reinforcement Learning: Here, we delve into the integration of deep learning techniques with reinforcement learning, discussing topics such as function approximation, representation learning, and the use of deep neural networks as function approximators.
17 | 3. Advanced Techniques and Algorithms: This part presents state-of-the-art DRL algorithms, such as Deep Q-Networks (DQN), Proximal Policy Optimization (PPO), and Soft Actor-Critic (SAC), along with their theoretical underpinnings and practical considerations.
18 | 4. Exploration and Exploitation: We explore strategies for balancing exploration and exploitation in DRL, examining methods such as intrinsic motivation, curiosity-driven learning, and Bayesian optimization.
19 | 5. Real-World Applications: This section showcases the application of DRL to various domains, including robotics, computer vision, natural language processing, and healthcare, highlighting the challenges and opportunities in each area.
20 | Throughout the textbook, we supplement the theoretical exposition with practical examples, case studies, and programming exercises, allowing readers to gain hands-on experience in implementing DRL algorithms and applying them to diverse problems. We also provide references to relevant literature, guiding the reader towards further resources for deepening their understanding and pursuing advanced topics.
21 |
22 | We envision this textbook as a valuable resource for students, researchers, and practitioners seeking a solid grounding in deep reinforcement learning, as well as a springboard for future innovation and discovery in this exciting and dynamic field. It is our hope that this work will contribute to the ongoing growth and development of DRL, facilitating the creation of intelligent systems that can learn, adapt, and thrive in complex, ever-changing environments.
23 |
24 | We extend our deepest gratitude to our colleagues, reviewers, and students, whose invaluable feedback and insights have helped shape this textbook. We also wish to acknowledge the pioneering researchers whose contributions have laid the foundation for DRL and inspired us to embark on this journey.
25 |
26 | ## Update Log
27 | * Aug 26, 2020: Started adding Fall 2020 materials
28 | * Aug 28, 2020: Fixed typos in Intro. Credit: Warren Deng.
29 | * Aug 30, 2020: Added more explanation to the imitation learning chapter.
30 | * Sep 13, 2020: Added advanced PG material to the PG chapter and fixed typos in PG.
31 | * Sep 14, 2020: AC chapter formatting, typo fixes, more analysis on A2C
32 | * Sep 16, 2020: Chapter 10.1 KL div typo fix. Credit: Cong Wang.
33 | * Sep 19, 2020: Chapter 3.7.1 parenthesis typo fix. Credit: Yunkai Zhang.
34 | * Sep 23, 2020: Q-learning chapter fix
35 | * Sep 26, 2020: More explanation and fixes to the advanced PG chapter (specifically the intuition behind TRPO).
36 | * Sep 28, 2020: Typos fixed and more explanation in Optimal Control. The typos were pointed out in Professor Levine's lecture.
37 | * Oct 6, 2021: Model-based RL chapter fixed. Added Distillation subsection.
38 | * Nov. 20, 2021: Fixed typos in DDPG, Online Actor Critic, and PG theory. Credit: Javier Leguina.
39 | * Apr. 2, 2023: Fixed typos in VAE and PG theory. Credit: wangcongrobot
40 |
--------------------------------------------------------------------------------
/REINFORCE.tex:
--------------------------------------------------------------------------------
1 | \begin{algorithm}[t!]
2 | \caption{REINFORCE Algorithm}
3 | \begin{algorithmic}[1]
4 | \label{alg:reinforce}
5 | \REQUIRE Base policy $\pi_\theta(a_t|s_t)$, sample trajectories $\tau^i$
6 |
7 | \WHILE{true}
8 | \STATE Sample $\{\tau^i\}$ from $\pi_\theta(a_t|s_t)$ (run it on a robot).
9 | \STATE $\nabla_\theta J(\theta) \simeq \frac{1}{N}\sum_i\left(\sum_t\nabla_\theta \log\pi_\theta(a_{i,t}|s_{i,t})\right)\left(\sum_t r(s_{i,t},a_{i,t})\right)$
10 | \STATE Improve policy by $\theta \leftarrow \theta + \alpha\nabla_\theta J(\theta)$
11 | \ENDWHILE
12 | \RETURN optimal policy from gradient ascent as $\pi^{return}$
13 | \end{algorithmic}
14 | \end{algorithm}
--------------------------------------------------------------------------------
/actorcritic.tex:
--------------------------------------------------------------------------------
1 | \chapter{Actor-Critic Algorithms}
2 | Recall that in the last chapter, we derived the following policy gradient estimator:
3 | $$\nabla_\theta J(\theta) \simeq \frac{1}{N}\sum_{i=1}^N\sum_{t=1}^T\nabla_\theta \log\pi_\theta(a_{i,t}|s_{i,t})\left(\sum_{t'=t}^T r(s_{i,t'},a_{i,t'})\right)$$
4 | where we defined the summed reward as the ``reward-to-go'' $\hat{Q}_{i,t}$, which estimates the expected reward if we take action $a_{i,t}$ in state $s_{i,t}$. We have shown that this estimate has very high variance, and in this chapter we shall see how we can improve policy gradients by using better estimates of the reward-to-go function.
5 |
6 | \section{Reward-to-Go}
7 | Let us take a closer look at the reward-to-go. One way to improve the estimate is to get closer to the precise value of the reward-to-go, which we can define using an expectation:
8 | $$Q(s_t,a_t) = \sum_{t'=t}^T \mathbb{E}_{\pi_\theta}\left[r(s_{t'},a_{t'})|s_t,a_t\right]$$
9 | this is the \textbf{true, expected} value of the reward-to-go.
10 |
11 | Therefore, one could imagine using this true expected value, combined with our original Monte Carlo approximation, to yield a better estimate:
12 | $$\nabla_\theta J(\theta) \simeq \frac{1}{N}\sum_{i=1}^N\sum_{t=1}^T\nabla_{\theta}\log \pi_\theta (a_{i,t}|s_{i,t})Q(s_{i,t},a_{i,t})$$.
13 |
14 | \section{Using Baselines}
15 | As we saw in the last chapter, one can reduce the high variance of the policy gradient by using baselines. We have also seen that it is possible to calculate the optimal baseline value that yields the minimum variance, although people often use the average reward for the sake of simplicity.
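To make this concrete, here is a minimal sketch (in Python with NumPy) of the Monte Carlo policy-gradient estimator above with the average reward-to-go used as a baseline. The array names and shapes are illustrative rather than part of any library, and we assume the per-step score-function gradients $\nabla_\theta\log\pi_\theta(a_{i,t}|s_{i,t})$ have already been computed and stacked into an array.
\begin{verbatim}
import numpy as np

def reward_to_go(rewards):
    # rewards: (N, T) array; returns Q_hat[i, t] = sum_{t' >= t} r[i, t']
    return np.flip(np.cumsum(np.flip(rewards, axis=1), axis=1), axis=1)

def policy_gradient_estimate(grad_log_pi, rewards):
    # grad_log_pi: (N, T, D) array of per-step score-function gradients
    # rewards:     (N, T) array of per-step rewards
    q_hat = reward_to_go(rewards)                 # reward-to-go estimates
    baseline = q_hat.mean(axis=0, keepdims=True)  # average over the N samples
    advantage = q_hat - baseline                  # centered reward-to-go
    # (1/N) * sum_i sum_t grad_log_pi[i, t] * (Q_hat[i, t] - b_t)
    return np.einsum('nt,ntd->d', advantage, grad_log_pi) / rewards.shape[0]
\end{verbatim}
Here \texttt{q\_hat} plays the role of $\hat{Q}_{i,t}$, and \texttt{baseline} is the per-time-step average used as $b_t$.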
16 |
17 | Motivated by this, let us recall the definition of the value function (defined in the introduction section):
18 | $$V(s_t) = \mathbb{E}_{a_t\sim \pi_\theta(a_t|s_t)}\left[Q(s_t,a_t)\right]$$
19 | By definition, the value function is the average of the Q-function values over the actions.
20 |
21 | Similarly, we can use the \textbf{average} reward-to-go as a baseline to reduce the variance. Specifically, we could use the value function $V(s_t)$ as the baseline, thus improving the estimate of the gradient in the following way:
22 | $$\nabla_\theta J(\theta) \simeq \frac{1}{N}\sum_{i=1}^N\sum_{t=1}^T\nabla_{\theta}\log \pi_\theta (a_{i,t}|s_{i,t})\left(Q(s_{i,t},a_{i,t}) - V(s_{i,t})\right)$$
23 | and the value function serves as a better, state-dependent version of the average baseline $b_t = \frac{1}{N}\sum_i Q(s_{i,t},a_{i,t})$.
24 |
25 | What have we done here? What is the intuition behind subtracting the value function from the Q-function? Essentially, we are quantifying how much better an action $a_{i,t}$ is than the average action. In some sense, it measures the \textbf{advantage} of applying a particular action over the average action. Therefore, to formalize this intuition, let us define the advantage as follows:
26 | $$A^\pi(s_t,a_t) = Q^\pi(s_t,a_t) - V^\pi(s_t)$$
27 | which quantitatively measures how much better action $a_t$ is.
28 |
29 | Putting it all together, a better, baseline-backed Monte Carlo policy gradient estimate can be written as:
30 | $$\nabla_\theta J(\theta) \simeq \frac{1}{N}\sum_{i=1}^N\sum_{t=1}^T\nabla_{\theta}\log \pi_\theta (a_{i,t}|s_{i,t})A^\pi(s_{i,t},a_{i,t})$$.
31 |
32 | \section{Value Function Fitting}
33 | The better the estimate of the advantage function, the lower the variance, and the better the policy gradient. Let us massage the definition of the Q-function a little in order to find some useful mathematical relations between $Q$ and $V$:
34 | \begin{align*}
35 | Q^\pi(s_t,a_t) &= \sum_{t'=t}^T \mathbb{E}_{\pi_\theta}\left[r(s_{t'},a_{t'})|s_t,a_t\right]\\
36 | &= r(s_t,a_t)+\sum_{t'=t+1}^T \mathbb{E}_{\pi_\theta}\left[r(s_{t'},a_{t'})|s_t,a_t\right]\\
37 | &= r(s_t,a_t) + \mathbb{E}_{s_{t+1}\sim p(s_{t+1}|s_t,a_t)}\left[V^\pi(s_{t+1})\right]\\
38 | &\simeq r(s_t,a_t) + V^\pi(s_{t+1})
39 | \end{align*}
40 | The expectation over the next state appears because we do not know which next state will actually occur. In the last step, we are a little crude with respect to that expectation: we simply evaluate the full value function $V^\pi(\cdot)$ on the single sampled next state and use that value in place of the expectation, ignoring the fact that other next states are possible. With this estimate, we can plug into the advantage function:
41 | $$A^\pi(s_t,a_t) \simeq r(s_t,a_t) + V^\pi(s_{t+1}) - V^\pi(s_t)$$
42 |
43 | Therefore, it is almost enough to approximate just the value function, which depends solely on the state, in order to generate approximations of the other quantities. To achieve this, we can use a neural network to fit our value function $V(s)$, and use the fitted value function to approximate our policy gradient, as illustrated in Fig. \ref{fig:fitV}.
44 | \begin{figure}
45 | \centering
46 | \includegraphics[scale=0.5]{figures/fitV.png}
47 | \caption{Fitting the value function}
48 | \label{fig:fitV}
49 | \end{figure}
50 |
51 | \section{Policy Evaluation}
52 | In this section, we discuss the process and purpose of fitting the value function.
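Concretely, the fitted value function of Fig.~\ref{fig:fitV} can be represented by a small network that maps a state to a scalar value. Below is a minimal sketch, assuming PyTorch; the class name, layer sizes, and activations are illustrative choices rather than anything prescribed by the method. The rest of this section discusses how to generate training targets for a network like this.
\begin{verbatim}
import torch
import torch.nn as nn

class ValueNet(nn.Module):
    """Small MLP mapping a state s to a scalar estimate of V(s)."""
    def __init__(self, state_dim, hidden_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden_dim), nn.Tanh(),
            nn.Linear(hidden_dim, hidden_dim), nn.Tanh(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, s):               # s: (batch, state_dim)
        return self.net(s).squeeze(-1)  # (batch,) value estimates
\end{verbatim}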
53 |
54 | \subsection{Why Do We Evaluate a Policy}
55 | Policy evaluation is the process of figuring out how good a given, fixed policy $\pi$ is by fitting its value function $V^\pi(\cdot)$, defined by the expectation:
56 | $$V^\pi(s_t) = \sum_{t'=t}^T\mathbb{E}_{\pi_\theta}\left[r(s_{t'},a_{t'})|s_t\right]$$
57 | Having the value function allows us to figure out how good the policy is because the reinforcement learning objective can be equivalently written as $J(\theta) = \mathbb{E}_{s_1\sim p(s_1)}\left[V^\pi(s_1)\right]$, where we take the expectation of the value of the initial state over all possible initial states.
58 | \subsection{How to Evaluate a Policy}
59 | To evaluate a policy, we can use an approach similar to the one we used for the policy gradient: Monte Carlo approximation. Specifically, we can estimate the value function by summing up the rewards collected from time step $t$ onward:
60 | $$V^\pi(s_t) \simeq \sum_{t'=t}^T r(s_{t'},a_{t'})$$
61 | and if we are able to reset the simulator, we could improve this estimate by taking multiple ($N$) samples as follows:
62 | $$V^\pi(s_t) \simeq \frac{1}{N}\sum_{i=1}^N\sum_{t'=t}^T r(s_{i,t'},a_{i,t'})$$
63 | In practice, we can just use the single-sample approximation.
64 |
65 | Here is a question: if our original objective is to use $V^\pi$ to reduce the variance, but we end up using a single-sample estimate of $V^\pi$, does it actually help? The answer is yes, because we are using a neural net to fit the Monte Carlo targets from a variety of different states, so even though each target is a single-sample estimate, the fitted value function generalizes when we visit similar states.
66 |
67 | \subsection{Monte Carlo Evaluation with Function Approximation}
68 | To fit our value function, we can use a supervised learning approach. Essentially, we use our single-sample estimates of the value function as the regression targets and fit a function that maps states to these values. Therefore, our training data will be $\left\{\left(s_{i,t}, \sum_{t'=t}^Tr(s_{i,t'},a_{i,t'})\right)\right\}$, where we denote the labels as $y_{i,t}$, and we minimize a typical supervised regression loss $\mathcal{L}(\phi) = \frac{1}{2}\sum_i\left\|\hat{V}_\phi^\pi(s_i)-y_i\right\|^2$.
69 |
70 | \subsection{Improving the Estimate Using Bootstrap}
71 | In fact, we can improve our training process because the Monte Carlo target $y_{i,t}$ is not perfect. We can use a technique called \textbf{bootstrapping}. Recall the definition of the ideal target in the supervised regression:
72 | $$\begin{aligned}
73 | y_{i,t} &= \sum_{t'=t}^T\mathbb{E}_{\pi_\theta}\left[r(s_{t'},a_{t'})|s_{i,t}\right]\\
74 | & \simeq r(s_{i,t},a_{i,t})+\sum_{t'=t+1}^T\mathbb{E}_{\pi_\theta}\left[r(s_{t'},a_{t'})|s_{i,t+1}\right]\\
75 | & \simeq r(s_{i,t},a_{i,t}) + V^\pi(s_{i,t+1})
76 | \end{aligned}$$
77 | Compare this with our Monte Carlo targets: $y_{i,t} = \sum_{t'=t}^T r(s_{i,t'},a_{i,t'})$.
78 |
79 | Bootstrapping means plugging our current estimate of the value function into the construction of its own training targets. In the ideal targets above, the last expression would be exact if we knew the actual $V^\pi$. Since the actual value function is not known, we apply bootstrapping by using the current fitted estimate $\hat{V}^\pi_\phi$ to estimate the next state's value: $\hat{V}^\pi_\phi(s_{i,t+1})$. Such an estimate is biased, but it has low variance.
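As a rough sketch of this regression (assuming PyTorch and a value network like the \texttt{ValueNet} sketch above; the function name, the optimizer, and the number of gradient steps are all illustrative), one round of fitting on bootstrapped targets might look like the following, where \texttt{s}, \texttt{r}, and \texttt{s\_next} are tensors of sampled transitions:
\begin{verbatim}
import torch

def fit_value_function(value_net, optimizer, s, r, s_next, num_steps=50):
    """Supervised regression on bootstrapped targets y = r + V_phi(s').
    (Undiscounted here; a discount factor is introduced in the next section.)"""
    for _ in range(num_steps):
        with torch.no_grad():              # treat the targets as constants
            y = r + value_net(s_next)      # bootstrapped target
        loss = 0.5 * ((value_net(s) - y) ** 2).mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
\end{verbatim}
Replacing the bootstrapped target with the Monte Carlo return recovers the pure Monte Carlo evaluation described above.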
80 |
81 | Consequently, our training data using bootstrapping becomes: \[\left\{(s_{i,t}, r(s_{i,t},a_{i,t}) +\hat{V}^\pi_\phi(s_{i,t+1})) \right\}\]. Such bootstrapped targets work well in highly stochastic environments.
82 |
83 | \section{Batch Actor-Critic Algorithm}
84 | Now we are ready to devise our first actor-critic algorithm. The reason we call it actor-critic is that we use a critic (the value function) to decrease the high variance of the actor (the policy). The full algorithm is shown in Alg. \ref{alg:batchac}, and we call it a batch algorithm because it is not online; we shall see the online version later.
85 | \input{batchac.tex}
86 | In Algorithm \ref{alg:batchac}, we fit $\hat{V}_\phi$ by minimizing the supervised regression loss $\mathcal{L}(\phi) = \frac{1}{2}\sum_i\left\|\hat{V}_\phi^\pi(s_i)-y_i\right\|^2$.
87 | \section{Aside: Discount Factors}
88 | Imagine we had an infinite-horizon environment ($T\rightarrow\infty$); then our estimated value function $\hat{V}^\pi_\phi(s)$ can become infinitely large in many cases. One possible way to address this issue is to say that it is better to get rewards sooner rather than later. Therefore, instead of labeling our values as $y_{i,t} \simeq r(s_{i,t}, a_{i,t}) + \hat{V}^\pi_\phi(s_{i,t+1})$, we shrink the value function's contribution as we progress to the next time step. To achieve this, we introduce a hyperparameter called the \textbf{discount factor}, denoted $\gamma$, where $\gamma \in [0,1]$:
89 | $$y_{i,t} \simeq r(s_{i,t}, a_{i,t}) + \gamma\cdot\hat{V}^\pi_\phi(s_{i,t+1})$$
90 | In most cases, $\gamma = 0.99$ works well.
91 |
92 | Let us apply the discount factor to policy gradients. Basically, we have two options for where to impose the discount factor. The first option is:
93 | $$\nabla_\theta J(\theta) \simeq \frac{1}{N}\sum_{i=1}^N\sum_{t=1}^T\nabla_\theta \log\pi_\theta(a_{i,t}|s_{i,t})\left(\sum_{t'=t}^T \gamma^{t'-t}r(s_{i,t'},a_{i,t'})\right)$$
94 | and the second option is:
95 | $$\begin{aligned}
96 | \nabla_\theta J(\theta) &\simeq \frac{1}{N}\sum_{i=1}^N\left(\sum_{t=1}^T\nabla_\theta \log\pi_\theta(a_{i,t}|s_{i,t})\right)\left(\sum_{t'=1}^T \gamma^{t'-1}r(s_{i,t'},a_{i,t'})\right)\\
97 | &\simeq \frac{1}{N}\sum_{i=1}^N\sum_{t=1}^T\nabla_\theta \log\pi_\theta(a_{i,t}|s_{i,t})\left(\sum_{t'=t}^T \gamma^{t'-1}r(s_{i,t'},a_{i,t'})\right) \;\mathrm{(causality)}\\
98 | &\simeq \frac{1}{N}\sum_{i=1}^N\sum_{t=1}^T\gamma^{t-1}\nabla_\theta \log\pi_\theta(a_{i,t}|s_{i,t})\left(\sum_{t'=t}^T \gamma^{t'-t}r(s_{i,t'},a_{i,t'})\right)
99 | \end{aligned}$$
100 | Intuitively, the second option assigns smaller weights to the gradients at later steps, so it essentially says that decisions made at later steps matter less under the discount.
101 |
102 | In practice, one can show that option 1 gives us better variance, so it is what we actually use. The full derivation can be found in \cite{thomas2014bias}. Now, after we impose the discount factor in our actor-critic algorithm, we have the following gradient:
103 | $$\nabla_\theta J(\theta) \simeq \frac{1}{N}\sum_{i=1}^N\sum_{t=1}^T\nabla_\theta \log\pi_\theta(a_{i,t}|s_{i,t})\left(r(s_{i,t},a_{i,t}) + \gamma\hat{V}_\phi^\pi(s_{i,t+1})-\hat{V}_\phi^\pi(s_{i,t})\right)$$
104 |
105 | Now we can incorporate the discount factor into our actor-critic algorithm, shown in Algorithm \ref{alg:batchacwdf}.
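As a rough sketch of what one iteration of that algorithm computes (assuming PyTorch; \texttt{log\_prob}, \texttt{value\_net}, and \texttt{actor\_opt} are illustrative placeholders rather than prescribed interfaces), the discounted advantage evaluation and the policy gradient step might look like:
\begin{verbatim}
import torch

def actor_critic_step(log_prob, value_net, actor_opt, s, a, r, s_next, gamma=0.99):
    """One policy update using the discounted advantage
    A(s, a) ~ r + gamma * V(s') - V(s)."""
    with torch.no_grad():                   # the critic is held fixed here
        advantage = r + gamma * value_net(s_next) - value_net(s)
    # log_prob(s, a) should return log pi_theta(a|s) for each sampled pair
    actor_loss = -(log_prob(s, a) * advantage).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()
\end{verbatim}
The critic itself would be refit to the discounted targets $r + \gamma\hat{V}^\pi_\phi(s')$, as in the value-fitting sketch of the previous section.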
106 | \input{batchacwdf.tex}
107 |
108 | \section{Online Actor-Critic Algorithm}
109 | Now that we have seen actor-critic algorithms that operate on a batch of samples, we can further improve efficiency by making the algorithm fully online. Namely, we take a gradient step based on the current sample, so we do not need to store a large number of samples. In the online version of actor-critic, we essentially use two neural networks: one for the policy and one for the value function. This is simple and stable, but as the state dimension becomes higher, the actor and the critic do not share any features. Therefore, we can also share a network between the policy and the value function. For example, with image-based observations, we could share the convolutional layers' weights between the two networks and only let them differ in the final fully connected layers.
110 |
111 | In each step, we take only one sample and gradually improve our value function using that sample. A sketch of the online version of the actor-critic algorithm is given in Algorithm \ref{alg:onlineac}.
112 | \input{onlineac.tex}
113 | Note that in steps 3-5, we are taking a gradient step from only one sample. In practice, this works best if we use a batch of samples instead of just one, and one can use parallel workers (simulators), either synchronously or asynchronously, to achieve this, as illustrated in Fig. \ref{fig:parallelsim}.
114 | \begin{figure}
115 | \centering
116 | \includegraphics[scale=0.5]{figures/parallelsim.png}
117 | \caption{Parallel simulations for online actor-critic}
118 | \label{fig:parallelsim}
119 | \end{figure}
120 |
121 | One caveat about the asynchronous version is that while the parameter server gets updated, the data-collection policy might not be, which means the newly collected data might not come from the latest policy; the acting policy is thus slightly outdated. This is less of an issue in practice because the policy only changes by a tiny bit in each update.
122 |
123 | \section{Critics as State-Dependent Baselines}
124 | Now let us further discuss the connection between a baseline and a critic. Recall that in the Monte Carlo version of the policy gradient, the gradient is estimated as:
125 | $$\nabla_\theta J(\theta) \simeq \frac{1}{N}\sum_{i=1}^N\sum_{t=1}^T\nabla_\theta \log\pi_\theta(a_{i,t}|s_{i,t})\left(\sum_{t'=t}^T r(s_{i,t'},a_{i,t'}) - b\right)$$
126 | and in the actor-critic algorithm, we estimate the gradient by estimating the advantage function:
127 | $$\nabla_\theta J(\theta) \simeq \frac{1}{N}\sum_{i=1}^N\sum_{t=1}^T\nabla_\theta \log\pi_\theta(a_{i,t}|s_{i,t})\left(r(s_{i,t},a_{i,t}) + \gamma\hat{V}_\phi^\pi(s_{i,t+1})-\hat{V}_\phi^\pi(s_{i,t})\right)$$
128 |
129 | So what are the pros and cons of the two approaches? In the policy gradient with a baseline, we have shown that there is no bias in our estimate, but there may be high variance due to the single-sample estimation of the reward-to-go. On the other hand, in the actor-critic algorithm, we have lower variance thanks to the critic, but we end up with a biased estimate because the bootstrapped critic may be imperfect. So can we somehow keep the estimator unbiased while lowering the variance with the critic $\hat{V}^\pi_\phi$?
130 |
131 | The solution is straightforward: we can just use $\hat{V}^\pi_\phi(s_{i,t})$ in place of $b$:
132 | $$\nabla_\theta J(\theta) \simeq \frac{1}{N}\sum_{i=1}^N\sum_{t=1}^T\nabla_\theta \log\pi_\theta(a_{i,t}|s_{i,t})\left(\sum_{t'=t}^T r(s_{i,t'},a_{i,t'}) - \hat{V}^\pi_\phi(s_{i,t})\right)$$
133 | In this way, we obtain an unbiased estimator with lower variance.
134 |
135 | \section{Eligibility Traces and n-Step Returns}
136 | In the above comparison of the two methods, we saw that the actor-critic advantage estimate has lower variance but higher bias, while the Monte Carlo advantage estimate has lower bias but higher variance. The reason this tradeoff exists is that as we go further into the future along a trajectory, the variance increases, because a single-sample approximation becomes less and less representative of the future. Therefore, the Monte Carlo advantage is good for getting accurate values in the near term, but not in the long term. In contrast, in the actor-critic advantage, the bias potentially skews the values in the near term, but the fact that the critic averages over many visited states likely makes it a better approximator in the long run. Therefore, it would be better if we could use the actor-critic based advantage further into the future, and use the Monte Carlo based one for the near term, in order to control the bias-variance tradeoff.
137 |
138 | As a result, we can cut the trajectory before the variance gets too big. Mathematically, we can estimate the advantage function by combining the two approaches, using the Monte Carlo approach only for the first $n$ steps:
139 | $$\hat{A}^\pi_n(s_t,a_t) = \sum_{t'=t}^{t+n}\gamma^{t'-t}r(s_{t'},a_{t'}) - \hat{V}^\pi_\phi(s_t)+\gamma^n\hat{V}^\pi_\phi(s_{t+n})$$
140 | Here we applied an $n$-step estimator, which sums the rewards from the current step up to $n$ steps into the future; $n>1$ often gives us better performance.
141 |
142 | Furthermore, if we do not want to commit to just one $n$, we can use a weighted combination of different $n$-step returns, which is known as Generalized Advantage Estimation (GAE):
143 | $$ \hat{A}_{GAE}(s_t,a_t) = \sum_{n=1}^\infty w_n \hat{A}^\pi_n(s_t,a_t)$$
144 | To choose the weights, we should mostly prefer cutting earlier (to keep the variance low), so we can assign exponentially decaying weights $w_n\propto \lambda^{n-1}$, where $\lambda\in[0,1]$ controls how quickly the weights decay: a smaller $\lambda$ effectively cuts the trajectory earlier.
145 |
146 |
--------------------------------------------------------------------------------
/batchac.tex:
--------------------------------------------------------------------------------
1 | \begin{algorithm}[t!]
2 | \caption{Batch Actor-Critic Algorithm}
3 | \begin{algorithmic}[1]
4 | \label{alg:batchac}
5 | \REQUIRE Base policy $\pi_\theta(a_t|s_t)$
6 |
7 | \WHILE{true}
8 | \STATE Sample $\{s_i,a_i\}$ from $\pi_\theta(a|s)$ (run it on a robot)
9 | \STATE Fit $\hat{V}_\phi(s)$ to sampled reward sums
10 | \STATE Evaluate $\hat{A}^\pi(s_i,a_i) = r(s_i,a_i)+\hat{V}_\phi(s'_i)-\hat{V}_\phi(s_i)$
11 | \STATE $\nabla_\theta J(\theta) \simeq \sum_i\nabla_\theta\log \pi_\theta(a_i|s_i)\hat{A}^\pi(s_i,a_i)$
12 | \STATE Improve policy by $\theta \leftarrow \theta + \alpha\nabla_\theta J(\theta)$
13 | \ENDWHILE
14 | \RETURN optimal policy from gradient ascent as $\pi^{return}$
15 | \end{algorithmic}
16 | \end{algorithm}
--------------------------------------------------------------------------------
/batchacwdf.tex:
--------------------------------------------------------------------------------
1 | \begin{algorithm}[t!]
2 | \caption{Batch Actor-Critic Algorithm with Discount Factor}
3 | \begin{algorithmic}[1]
4 | \label{alg:batchacwdf}
5 | \REQUIRE Base policy $\pi_\theta(a_t|s_t)$, hyperparameter $\gamma$
6 |
7 | \WHILE{true}
8 | \STATE Sample $\{s_i,a_i\}$ from $\pi_\theta(a|s)$ (run it on a robot)
9 | \STATE Fit $\hat{V}_\phi(s)$ to sampled reward sums
10 | \STATE Evaluate $\hat{A}^\pi(s_i,a_i) = r(s_i,a_i)+\gamma\hat{V}_\phi(s'_i)-\hat{V}_\phi(s_i)$
11 | \STATE $\nabla_\theta J(\theta) \simeq \sum_i\nabla_\theta\log \pi_\theta(a_i|s_i)\hat{A}^\pi(s_i,a_i)$
12 | \STATE Improve policy by $\theta \leftarrow \theta + \alpha\nabla_\theta J(\theta)$
13 | \ENDWHILE
14 | \RETURN optimal policy from gradient ascent as $\pi^{return}$
15 | \end{algorithmic}
16 | \end{algorithm}
--------------------------------------------------------------------------------
/cem.tex:
--------------------------------------------------------------------------------
1 | \begin{algorithm}[t!]
2 | \caption{Cross Entropy Method with Continuous-valued Input}
3 | \begin{algorithmic}[1]
4 | \label{alg:cem}
5 | \REQUIRE Some base distribution for action sequence $p(A)$
6 | \WHILE{true}
7 | \STATE Sample $A_1,...,A_N$ from $p(A)$
8 | \STATE Evaluate $J(A_1),...,J(A_N)$
9 | \STATE Pick elites $A_{i_1},...,A_{i_M}$ with the highest value, where $M