├── .DS_Store ├── .gitattributes ├── RLResourceGuide.md └── images ├── Q_values_by_trial.png ├── Q_values_by_trial2.png ├── Softmax_Udacity.png ├── SuttonBartoRL.png ├── alpha_post_distribution.png ├── beta_post_distribution.png ├── choice_probabilities_by_trial.png └── choice_probabilities_by_trial2.png /.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rmgeddert/Reinforcement-Learning-Resource-Guide/d848a453c5aa14a9bc255a5a456f4f4064d11aa0/.DS_Store -------------------------------------------------------------------------------- /.gitattributes: -------------------------------------------------------------------------------- 1 | # Auto detect text files and perform LF normalization 2 | * text=auto 3 | -------------------------------------------------------------------------------- /RLResourceGuide.md: -------------------------------------------------------------------------------- 1 | 2 | # Reinforcement Learning Resource Guide & Tutorial 3 | 4 | This is a resource guide/tutorial for those interested in reinforcement learning and modeling. While not comprehensive, it will hopefully give you a basic understanding of how RL works (and modeling more generally). If you have suggestions for improvements or corrections, please email me at raphael.geddert@duke.edu! 5 | 6 | #### Outline 7 | 8 | - Reinforcement Learning Introduction 9 | - Background Resources (Math, statistics, modeling, programing) 10 | - Tutorial - Modeling an RL agent performing a k-armed bandit task 11 | 12 | ------ 13 | 14 | ## Introduction to Reinforcement Learning 15 | 16 | Reinforcement learning is the process by which someone learns which of several actions to take in any given situation, by trying these actions numerous times and learning whether these actions are good or bad based on feedback. 17 | 18 | ![Sutton & Barto 2018 Reinforcement Learning](images/SuttonBartoRL.png) 19 |
Sutton & Barto 2018 agent-environment interaction diagram (Figure 3.1) 20 | 21 | Put in more concrete terms, reinforcement learning considers an **agent** that exists in an **environment**. The environment is the world that the agent interacts with. At each time step *t*, the agent is shown a state of the world *s*. This state can be "partially observed", in the sense that the agent might not know everything about how the world has changed, or even if it has changed at all. Regardless, given this state, the agent now chooses one of several **actions**. Learning which action to perform in any given state is the problem reinforcement learning is attempting to solve. After performing an action, the agent receives **reward** feedback that lets it know whether the action it performed was a good one or not. The agent's objective is to maximize reward in the long run, known as the **return**. 22 | 23 | There is of course a lot more to RL, such as **policies**, **value functions**, and various **optimization strategies**. A good place to start is this excellent [summary of key concepts in RL](https://spinningup.openai.com/en/latest/spinningup/rl_intro.html). Some of these will be discussed below, so stay tuned! 24 | 25 | You may also like: 26 | - [Reinforcement Learning Demystified](https://towardsdatascience.com/reinforcement-learning-demystified-36c39c11ec14) - towardsdatascience.com 27 | - [Basics of Reinforcement Learning](https://medium.com/@zsalloum/basics-of-reinforcement-learning-the-easy-way-fb3a0a44f30e) - Medium.com 28 | - https://github.com/aikorea/awesome-rl 29 | - https://www.quora.com/What-are-the-best-resources-to-learn-Reinforcement-Learning 30 | - https://medium.com/@yuxili/resources-for-deep-reinforcement-learning-a5fdf2dc730f 31 | 32 | Finally, it is critical to mention perhaps the most famous of RL resources - the RL bible of sorts - the [2018 Reinforcement Learning Textbook](http://www.incompleteideas.net/book/the-book-2nd.html) by Sutton and Barto. For perspective, the textbook has been cited over 35,000 times in just the last two years. The book is incredibly thorough and comprehensive, but people who are new to RL might find it a bit overwhelming or overly detailed. Still, it is an excellent resource and worth checking out once you feel more comfortable with the basic concepts of RL. 33 | 34 | ------ 35 | 36 | ## Background Resources 37 | 38 | While it is possible to understand and model reinforcement learning without learning complicated math or programming topics, these are required if you want a deep understanding of RL. Below are great math/statistics/programming resources to get you started. You can also keep going with the tutorial and come back later if you realize some more background knowledge would be useful. 39 | 40 | **Note:** The Pearson lab already has excellent resources for much of this. Check out their lab website [here](https://pearsonlab.github.io/learning.html). 41 | 42 | ### Math 43 | 44 | - **Linear Algebra:** 45 | - 3Blue1Brown's [Essence of Linear Algebra](https://www.youtube.com/watch?v=fNk_zzaMoSs&list=PLZHQObOWTQDPD3MizzM2xVFitgF8hE_ab). 46 | - You can practice linear algebra on [Khan Academy](https://www.khanacademy.org/math/linear-algebra/vectors-and-spaces/vectors/v/vector-introduction-linear-algebra).
47 | 48 | 49 | - **Statistics:** 50 | - [Statistical Rethinking with examples in R and Stan](https://xcelab.net/rm/statistical-rethinking/) by Richard McElreath 51 | - Russ Poldrack's [Statistical Thinking for the 21st Century](https://statsthinking21.github.io/statsthinking21-core-site/). 52 | - [Introduction to probability and statistics](https://seeing-theory.brown.edu/#firstPage). 53 | 54 | 55 | - **Calculus** 56 | - 3Blue1Brown's [Essence of Calculus](https://www.youtube.com/watch?v=WUvTyaaNkzM&list=PLZHQObOWTQDMsr9K-rj53DwVRMYO3t5Yr) 57 | - Gilbert Strang's [calculus textbook](https://ocw.mit.edu/ans7870/resources/Strang/Edited/Calculus/Calculus.pdf) 58 | 59 | ### Programming 60 | 61 | - **Python:** 62 | - You can't go wrong with Jake Vanderplas’s [A Whirlwind Tour of Python](https://jakevdp.github.io/WhirlwindTourOfPython/) and [Python Data Science Handbook](https://jakevdp.github.io/PythonDataScienceHandbook/). 63 | - Software carpentry 1-day [Python tutorial](https://swcarpentry.github.io/python-novice-inflammation/) 64 | - See also the Egner Lab's [Python resource page](https://github.com/egnerlab/Lab-Manual/tree/master/Programming/Python) 65 | 66 | 67 | - **R:** There are some tutorials below that use R (because of the RStan package for modeling) so a good understanding would be very useful. 68 | - [R for Data Science](https://r4ds.had.co.nz/) by Hadley Wickham. 69 | - [Advanced R](https://adv-r.hadley.nz/), again by Hadley Wickham. 70 | - This great [youtube tutorial](https://www.youtube.com/watch?v=jWjqLW-u3hc&app=desktop) on dplyr. 71 | 72 | 73 | - **Misc:** 74 | - [Stan](https://mc-stan.org/) and specifically [RStan](https://mc-stan.org/users/interfaces/rstan) for modeling. These will be covered in more detail below. 75 | 76 | ------ 77 | 78 | ## Your First Reinforcement Learning Model 79 | 80 | This next section goes through a tutorial on fitting a model of a multi-armed bandit task. First, you will simulate an agent completing the task. This involves determining both how the agent chooses which of several actions to perform (i.e., its policy) as well as how it learns which of these actions to prioritize based on feedback from the environment (i.e., rewards). Next, you will perform parameter estimation using the negative log likelihood to estimate the parameters that determined the agents behavior. 81 | 82 | ### A Multi-Armed Bandit task 83 | 84 | Consider the following learning problem. You are at a casino, and in front of you are three slot machines. You have 100 spin tokens that you got for your birthday from your parents, where one token will let you spin one slot machine one time. You know from a friend that works at the casino that the slot machines have different average reward payouts, but you don't know which of the three has the best payout. To complicate things further, the slot machines have some randomness in their payouts. That is, sometimes they might pay out more than their average and sometimes less. Given these circumstances, what is the best strategy for maximizing the overall reward you receive from using the 100 tokens? 85 | 86 | This is the original conceptualization of the **k-armed bandit task**. The bandit task involves an agent choosing from among several actions (pulling the arm of one of several slot machines). Each arm has some probability of giving the agent a reward. The agent's objective is to figure out which arms to pull in order to maximize the long-term expected reward. 
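To make the casino problem concrete before formalizing anything, here is a minimal illustrative R sketch of spending 100 tokens on randomly chosen machines. The average payouts and the noise level are invented purely for illustration and are not the values used later in the tutorial.

``` R
set.seed(1) #for reproducibility

armMeans <- c(8, 0.5, 95) #hypothetical average payout (in $) of each slot machine
nTokens <- 100            #number of spins we can afford

#spend every token on a randomly chosen machine and record the (noisy) payout
pulls <- sample(1:3, size = nTokens, replace = TRUE)      #which machine is played on each spin
payouts <- rnorm(nTokens, mean = armMeans[pulls], sd = 5) #payout = that machine's average plus noise

#average payout actually experienced from each machine
tapply(payouts, pulls, mean)
```

Even in this toy version the core problem is visible: the experienced averages are only trustworthy for machines that were pulled many times, which is exactly the tradeoff discussed next.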
87 | 88 | ##### Exploitation versus Exploration 89 | 90 | The principal problem of the bandit task is the **exploration/exploitation tradeoff**. Consider yourself at the casino again. At first, you have no idea which of the three slot machines has the highest average reward payout. Suppose you try out Slot Machine #1 first. After 5 attempts, you get a $10 reward 4 times and a $0 reward once. Hey, that's pretty good! But, maybe one of the other slot machines has an even higher payout. You move on to Slot Machine #2, and to your dismay, you get a $0 reward 5 times in a row. Shoot, if only you had used those tokens on Slot Machine #1! Next, you try Slot Machine #3 and immediately get $100 on the first attempt, and then another $100 on the second attempt. Shoot again, if only you had explored more instead of wasting all those tokens on Slot Machines #1 and #2! 91 | 92 | This is the essence of the exploration/exploitation dilemma. On the one hand, exploring helps you learn about which of the slot machines is best. On the other hand, any time you spend on a bad slot machine is a wasted opportunity to play a slot machine that has a higher payout. 93 | 94 | How we deal with the exploration/exploitation dilemma is called our **policy**. How do we simultaneously learn about all of the bandit arms while also maximizing reward? We could always pick randomly, but this doesn't seem like it will maximize reward. We would be just as likely to choose Slot Machine #2 as Slot Machine #3, even though the former is obviously inferior. Alternatively, we could pick whichever arm we currently think is best. This is known as a **greedy** policy. While this strategy makes intuitive sense, it turns out that a purely greedy policy often gets stuck on a suboptimal arm. This is because there is no requirement to explore all the arms before sticking with one. As soon as a single arm's value rises above whatever starting value the other arms are set to, the policy will stick with it, even if another arm (potentially still unexplored) is in fact better. A greedy strategy might have stuck with Slot Machine #1 after seeing that it got a fairly decent reward compared to initial expectations and never tried Slot Machine #3. 95 | 96 | >See this [towardsdatascience.com article](https://towardsdatascience.com/reinforcement-learning-demystified-exploration-vs-exploitation-in-multi-armed-bandit-setting-be950d2ee9f6) about exploration/exploitation in the multi-armed bandit task. 97 | 98 | There are countless alternative policy options, such as **epsilon-greedy**, a policy which is greedy most of the time and random (i.e., exploratory) with probability epsilon. Another is **optimistic greedy**, a completely greedy policy which sets the initial expectations for each candidate action ludicrously high, so that as each action fails to meet these high expectations and its reward expectation drops accordingly, each arm takes a turn being the "best action". Eventually, the highest valued action arrives at its actual expected reward (and stops decreasing) and the policy picks this option for the remainder of trials. 99 | 100 | > As mentioned above but worth repeating, see [Sutton and Barto 2018](http://www.incompleteideas.net/book/the-book-2nd.html) **Chapter 2** for an in-depth discussion of various policies to solve the bandit task. 101 | 102 | In this tutorial we will actually use a more complicated action policy known as the **softmax greedy** policy. Similar to the epsilon greedy policy, it is greedy the majority of the time and random otherwise.
However, the probability of choosing each arm *changes dynamically* based on the current value expectations for each action, meaning that if one action is obviously better than the rest the policy is mostly greedy, whereas if the action expectations are similar in value (as they are initially) the policy is quite random. We will also introduce a parameter known as the **inverse temperature** that determines how often the agent chooses the best option. 103 | 104 | ---- 105 | 106 | ## Our Scenario 107 | 108 | In this tutorial we will model an agent faced with two slot machines, AKA a 2-armed bandit task. Each arm can either give a reward (`reward = 1`) or not (`reward = 0`). **Arm 1** will give a reward **70%** of the time and **Arm 2** will give a reward **30%** of the time, but our agent doesn't know that. By the end of the exercise, our agent will hopefully choose **Arm 1** most of the time, and you will understand the logic and computations that drive this learning. 109 | 110 | #### Coding: 111 | We will be programming our agent in **R**. We will start by initializing a few parameters and objects. 112 | 113 | ``` R 114 | data_out <- "C:/Users/..." #the folder where you want the data to be outputted 115 | 116 | #variable declaration 117 | nTrials <- 1000 #number of trials to model 118 | nArms <- 2 #number of bandit arms 119 | banditArms <- c(1:nArms) #vector of arm indices (1 to nArms) 120 | armRewardProbabilities <- c(0.7, 0.3) #probability of returning reward for each arm 121 | ``` 122 | 123 | ### Step 1: Representing each Arm's *Expected Value* 124 | 125 | In order for our agent to complete this task, it first needs to represent how valuable it thinks each action is. We operationalize this with something known as a Q-value. A Q-value is a numerical representation of the expected average reward of an action. If an action gives a reward of `$0` half of the time and `$100` half of the time, its Q-value is `$50`. If an action gives a reward of `0` 20% of the time and `1` 80% of the time, its Q-value is `0.8`. For now, we will initialize our Q-values for each arm at `0`. With time (and rewards), these will be updated to approximate the correct expected rewards (i.e., the Q-values should approach `0.7` and `0.3`). 126 | 127 | #### Coding: 128 | Let's initialize our Q-values for each arm at 0. We'll make a variable `currentQs` that stores the Q-value for the current trial only (since these are needed to determine which arm our agent will choose), as well as a `trialQs` variable that stores the Q-values at each time step for later visualization. 129 | 130 | ``` R 131 | Qi <- 0 #initial Q value 132 | currentQs <- vector(length = length(banditArms)) #vector that contains the most recent Q-value for each arm 133 | trialQs <- matrix(data = NA, nrow = nTrials, ncol = nArms) #stores Q at each time step for visualization later 134 | 135 | 136 | #assign initial Q value 137 | for (arm in banditArms) { 138 | currentQs[arm] <- Qi 139 | } 140 | ``` 141 | 142 | ### Step 2: Choosing an action 143 | 144 | Next, we need to determine what our **action policy** is. Given a set of Q-values, what action do we choose? For this tutorial we are going to implement something known as a **softmax greedy policy**, which has a parameter known as **inverse temperature**. 145 | 146 | ##### Softmax Function 147 | 148 | On any given trial, which arm should the agent choose? As described above, we could be entirely random (a random policy), or we could always pick the action with the highest Q-value (a greedy policy).
Neither of these is optimal, so we will use something a little more nuanced that considers the Q-values of the various actions to determine the probabilities of choosing them. 149 | 150 | Enter the [softmax function](https://en.wikipedia.org/wiki/Softmax_function). The softmax function takes a set of numbers as an input (in our case the Q-values of the actions) and returns a set of probabilities, one for each value, where higher values receive higher probabilities. That is, the higher the Q-value compared to the other values, the higher the probability associated with that action. More specifically, the probability *p* of performing action *a* is equal to the exponential of its corresponding Q-value `e^(Q-value)`, divided by the sum of the exponentials of all the Q-values `sum over all options(e^Q-value)`. **Here is a great [medium article](https://medium.com/data-science-bootcamp/understand-the-softmax-function-in-minutes-f3a59641e86d) explaining it.** 151 | 152 | ![alt text](images/Softmax_Udacity.png) 153 |
Udacity Slide on the Softmax Function 154 | 155 | Linguistically, the softmax function is a **"soft"** maximum in the sense that it picks the maximum but is "soft" about it; sometimes it will pick another option instead. A "hard" max function, on the other hand, would always choose the best option (highest Q-value), without considering *how much* higher this Q-value was than the other options. The equation for the softmax function is: `e^(Q-value) / sum over all Q-values(e^(Q-value))`. 156 | 157 | > Softmax function: the probability of choosing a given Q-value is `e^(Q-value)` divided by `the sum of e^(Q-value) for all arms/Q-values in the set`. 158 | 159 | The softmax function is great because it weighs action probabilities by their associated Q-values, and normalizes each Q-value by how it compares to those of the other actions. If our Q-value estimates for two actions are very different - say we think `Arm 1` has an expected reward of 1000 and `Arm 2` has an expected reward of 5 - then we want to be very likely to choose `Arm 1`. Alternatively, if our Q-value estimates are very close together, say we think `Arm 1` has an expected reward of 12 and `Arm 2` has an expected reward of 10, then we might want to be a bit more exploratory, since it is quite possible that `Arm 2` is in fact just as good if not better than `Arm 1`. 160 | 161 | Additionally, the softmax function has no problems handling Q-values that are 0 or even negative, allowing it to flexibly adapt to a variety of situations. 162 | 163 | ##### Inverse Temperature 164 | 165 | While the softmax function is great, one problem it has is that it assumes that all people who complete the bandit task perform the probability calculation in the same way. Consider Person A and Person B each performing a 3-armed bandit task. After 100 trials, their Q-values for each arm are `Arm 1 = 2.0`, `Arm 2 = 1.0`, and `Arm 3 = 0.0`. Given these Q-values, the vanilla softmax function will always return action probabilities of approximately `0.7`, `0.2`, and `0.1`, respectively. But what if Person A is very risk averse, and is in fact much more likely to prefer a greedy policy that chooses `Arm 1` the most, or what if Person B is an extremely curious and optimistic person and therefore wants to choose `Arm 3` frequently even though its current Q-value is the lowest? 166 | 167 | Enter the **inverse temperature** parameter `beta`. `beta` is a parameter that scales Q-values, thereby tweaking probabilities in a way that can either make the agent very greedy or very exploratory. A very large `beta` means the agent will be very greedy and careful, almost exclusively choosing the action with the highest Q-value. A very small `beta` (near 0) means the agent will choose more or less randomly despite the Q-values. 168 | 169 | >A great intuitive way to think about `beta` is in terms of the temperature of a molecule. A molecule that is very **cold** is very still - this is akin to always choosing the best option (highest Q-value). A molecule that is very **hot** has a lot of energy and bounces around a lot - this is akin to randomly choosing from options regardless of their Q-values. 170 | > 171 | >**IMPORTANT NOTE:** `beta` is the **inverse temperature**, not the temperature. So, a **cold** strategy of always staying with the best Q-value corresponds to a high inverse temperature, and vice versa.
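To see this scaling in action before the formula is written out below, here is a small illustrative R sketch that applies the softmax to the Q-values from the Person A / Person B example above under a few different `beta` values:

``` R
#softmax with inverse temperature: p(arm) = exp(beta * Q) / sum(exp(beta * Q))
softmax <- function(Q, beta) exp(beta * Q) / sum(exp(beta * Q))

Q <- c(2.0, 1.0, 0.0) #Q-values for Arms 1-3 from the example above

round(softmax(Q, beta = 1), 2)   #vanilla softmax: roughly 0.67, 0.24, 0.09
round(softmax(Q, beta = 0.1), 2) #near-random: each arm close to 1/3
round(softmax(Q, beta = 10), 2)  #very greedy: nearly all probability on Arm 1
```

The same Q-values produce very different choice behavior depending on `beta`, which is exactly the kind of individual difference (Person A versus Person B) the parameter is meant to capture.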
172 | 173 | Our new softmax function with an inverse temperature parameter looks like this: 174 | 175 | `e^(beta * Q-value)` divided by `the sum of e^(beta * Q-value) for all arms/Q-values in the set` 176 | 177 | `beta` changes the probabilities of selecting various options by **scaling Q-values**. For example: 178 | 179 | - `beta = 1`: Given Q-values 0.6 for Arm 1 and 0.5 for Arm 2, if `beta = 1`, then `Q-value * beta` = 0.6 and 0.5 respectively. In effect we leave the Q-values exactly as is and the probability calculation uses the vanilla softmax function. 180 | 181 | - `beta -> 0`: As `beta` approaches 0, `Q-value * beta` approaches 0, and crucially the differences between the various Q-values also approaches 0. In effect, we make all the Q-values the same and therefore we choose randomly. The lower the `beta`, the more random the action probabilities. 182 | 183 | - `beta -> inf`: As `beta` approaches infinity, `Q-value * beta` also approaches infinity, and crucially so does the difference between the Q-values. Thus, we have increasingly more reason to choose the option with the highest Q-value. The policy therefore becomes increasingly greedy with increasing `beta`. 184 | 185 | To summarize, a small `beta` (< 1) approximates more exploratory/random behavior. The difference between Q-values is minimized and our agent has less reason to choose the best option. Conversely, a large `beta` approximates more conservative/exploitative behavior. With the best option becoming increasingly better than the others, the agent is more and more likely to only choose that option. 186 | 187 | >See this towardsdatascience [article](https://towardsdatascience.com/softmax-function-simplified-714068bf8156) for more information on the softmax function. 188 | 189 | #### Coding: 190 | We will initialize a beta value (let's pick 5, a slightly greedier policy), as well as a vector that contains the probabilities of choosing each arm (probabilities add up to 1). We will also initialize a vector that contains which action we picked. Once we have our action probabilities, we will choose one of those action stochastically (based on the probabilities) and save our choice in the `choices` vector. `choiceProbs` will contain the probabilities of choosing each arm for the current time step. We'll also make a `trialChoiceProbs` variable which will let us visualize the choice probabilities at each trial later. 191 | 192 | ``` R 193 | beta <- 5 #inverse temperature 194 | choices <- vector(length = nTrials) #stores the choice made on each trial (which arm was picked) 195 | choiceProbs <- vector(length = length(banditArms)) #contains the probabilities associated with choosing each action 196 | trialChoiceProbs <- matrix(data = NA, nrow = nTrials, ncol = nArms) #stores probabilities at each time step 197 | ``` 198 | 199 | Later on in the code we will also have to calculate the choice probabilities given their Q-values for each trial. 
This will look like this: 200 | 201 | ```R 202 | for (trial in 1:nTrials) { 203 | #calculate sumExp for softmax function 204 | #sumExp is the sum of exponentials, i.e., what we divide by in our softmax equation 205 | sumExp <- 0 206 | for (arm in banditArms) { 207 | sumExp <- sumExp + exp(beta * currentQs[arm]) 208 | } 209 | 210 | #calculate choice probabilities 211 | for (arm in banditArms) { 212 | choiceProbs[arm] = exp(beta * currentQs[arm]) / sumExp 213 | } 214 | 215 | #save choice probabilities in matrix for later visualization 216 | trialChoiceProbs[trial,] <- choiceProbs 217 | ``` 218 | 219 | Next we choose an action based on those probabilities. 220 | 221 | ``` R 222 | # choose action given choice probabilities, save in choices vector 223 | choices[trial] <- sample(banditArms, size = 1, replace = FALSE, prob = choiceProbs) 224 | ``` 225 | 226 | ### Step 3: Learning (Updating Q-values) 227 | 228 | Now that we have a policy for choosing actions, how do we learn from those actions? In other words, how do we update our Q-values from their initial values (in this case 0) to match the real values (0.7 for `Arm 1` and 0.3 for `Arm 2`)? 229 | 230 | #### Prediction Errors 231 | 232 | Learning happens whenever our expectations don't match our experience. So, if our Q-value for Arm 1 is 0.5 (suggesting a 50% chance of reward) and yet we notice that we are receiving reward more than 50% of the time, then we should increase our Q-value until it matches reality. 233 | 234 | Practically, all we need to do is increase the Q-value every time we receive a reward and decrease the Q-value every time we don't receive a reward. By doing this we will eventually approximate the correct reward rate (but see below). We calculate the **prediction error** as the **difference between the actual reward outcome and our Q-value** (reward minus expectation). In our example the reward outcome will either be 0 or 1 depending on the lever pull. 235 | 236 | >Say our Q-value is equal to $10. We expect a $10 reward on the next trial. After pulling the lever, we get a $25 reward. This reward is $15 greater than our expectation, so our prediction error is +$15. 237 | > 238 | > If instead we receive a reward of $5, this reward is $5 less than we expected so our prediction error would equal -$5. 239 | 240 | Notice that our prediction error is **positive** when the result is greater than our expectation and **negative** when the result is less than our expectation. 241 | 242 | ##### Learning Rate 243 | 244 | We raise our Q-value every time we get a reward and decrease it when we don't. **By how much do we change our Q value?** Enter the **learning rate** (`alpha`), which determines how much we update our Q-values based on a prediction error. 245 | 246 | The learning rate tells us how much each new piece of information should impact our expectations. It is a value between 0 and 1 that the prediction error is multiplied by. The formula we will use to update our Q-values is `Updated Q-value = Old Q-value + learning rate alpha * (Reward - Old Q-value)`, where `(Reward - Old Q-Value)` is the prediction error described above. To better understand what the learning rate is doing, consider applying some extreme learning rate values to the example below. 247 | 248 | ##### Learning Rate Example 249 | 250 | Let's imagine that we have a single bandit arm, `Arm 1`. We don't know anything about this arm yet, except that it sometimes gives us `reward = 0` and sometimes `reward = 1`.
Since we don't know anything about the reward likelihood, let's set our initial Q-value (our guess of the expected reward) at `0.5`. We pull the arm, and we receive a reward! So, `Q-value = 0.5` and `reward = 1`. Let's see how our Q-value gets updated using various learning rates. 251 | 252 | --- 253 | 254 | `learning rate = 0`: 255 | 256 | If the learning rate `alpha = 0`, we can plug all our values into the Q-value updating formula above. Our new Q-value is: `Updated Q-value = 0.5 + 0 * (1 - 0.5)`, where `0.5` is our old Q-value, `0` is our learning rate, and `(1 - 0.5)` is the prediction error. Solving this we get `0.5 + 0 * (0.5) = 0.5 + 0 = 0.5`. Our new Q-value is identical to our old Q-value! 257 | 258 | --- 259 | 260 | `learning rate = 1`: 261 | 262 | Let's try `alpha = 1` instead. `Updated Q-value = 0.5 + 1 * (1 - 0.5)` => `0.5 + 1(0.5)` => `0.5 + 0.5` => `1`. Our new Q-value is 1! 263 | 264 | Let's try another trial, this time with `reward = 0`. The new Q-value that we just calculated becomes our `old Q-value`. `Updated Q-value = 1 + 1 * (0 - 1)` => `1 + 1(-1)` => `1 - 1` => `0`. Now our Q-value is 0! 265 | 266 | > It turns out that if `learning rate = 1` then we will update our Q-value to exactly match the reward on that trial. In effect it will bounce back and forth between 0 and 1 forever and never *converge* on the real reward rate. This is what happens when the learning rate is too high. 267 | 268 | --- 269 | `learning rate = 0.1`: 270 | 271 | Finally, let's try a more reasonable `alpha = 0.1`. 272 | 273 | `Updated Q-value = 0.5 + 0.1 * (1 - 0.5)` => `0.5 + 0.1(0.5)` => `0.5 + 0.05` => `0.55`. Our Q-value adjusted slightly upwards from `0.5` to `0.55`. This is what we want! Over time, we'll eventually approximate the correct reward rate. 274 | 275 | --- 276 | 277 | To summarize, if the learning rate is low, we will make small changes to our Q-value. The change approaches 0 as the learning rate approaches 0. If the learning rate = 0 we don't move at all. **If our learning rate is too low, it can take many, many trials to converge to the right value.** 278 | 279 | If our learning rate is large (close to 1) then we will make big changes to our Q-value. If the learning rate = 1, we will update our Q-value to exactly match the reward outcome of the previous trial. **If our learning rate is too large, we will never converge on the right value because we will always jump past it**. 280 | 281 | A reasonable value for the learning rate is `0.01`, though there is a lot of variation here and many different techniques for changing it dynamically. 282 | 283 | ##### Note: 284 | An alternative way to think about the learning rate is *how many trials in the past should I consider when setting my Q-value?* 285 | - If learning rate = 1, we only consider the most recent trial: whatever its result, that becomes our new Q-value. 286 | - As the learning rate approaches 0, each new trial is less informative, so in effect we consider more and more previous trials in determining what our Q-value should be. 287 | 288 | >For more information on learning rates, check out this towardsdatascience [article](https://towardsdatascience.com/https-medium-com-dashingaditya-rakhecha-understanding-learning-rate-dd5da26bb6de). 289 | 290 | #### Coding: 291 | 292 | Steps 1 and 2 explained how we made our decision about which arm to pull. Now that we have made our decision, we will get a reward.
Remember that this reward is stochastic, so even if we pick the better arm, `Arm 1`, there is still only a 70% chance that we will get a reward. 293 | 294 | Let's initialize our learning rate and a vector to store rewards. 295 | 296 | ``` R 297 | alpha <- .01 #learning rate 298 | rewards <- vector(length = nTrials) 299 | ``` 300 | 301 | In our trial loop, we now add code that gets a reward based on the choice we made, and stores it in the `rewards` vector. Then, we update our Q-value for the arm that we chose based on this reward. We'll also save these Q-values in the `trialQs` matrix so we can visualize the Q-values afterwards. 302 | 303 | ``` R 304 | #given bandit arm choice, get reward outcome (based on armRewardProbabilities) 305 | rewards[trial] <- rbinom(1, size = 1, prob = armRewardProbabilities[choices[trial]]) 306 | 307 | #given reward outcome, update Q values 308 | currentQs[choices[trial]] <- currentQs[choices[trial]] + alpha * (rewards[trial] - currentQs[choices[trial]]) 309 | 310 | #save Q values in matrix of all Q-values for later visualization 311 | trialQs[trial,] <- currentQs 312 | ``` 313 | 314 | ----- 315 | 316 | ### Putting it all together 317 | 318 | We now have everything we need to help our agent learn. Here are the steps: 319 | 320 | 1. **Action:** The agent will choose one of the two arms to pull. Instead of choosing deterministically, the agent will choose stochastically using the softmax function. Since our agent's initial Q-values are both set to 0 (i.e., they are the same), at first the agent will be equally likely to choose either arm. Later, the agent will tend to choose the option with the highest Q-value. The probability of this is influenced by the `beta` (or inverse temperature) parameter, which determines how greedy versus exploratory our agent is. 321 | 2. **Reward:** If the agent chooses Arm 1, it will have a 70% chance of receiving a reward. If Arm 2, it will have a 30% chance of receiving a reward. This reward will either be a 0 or a 1. 322 | 3. **Update Q-values:** Finally, given the reward outcome, our agent will update its Q-value (only for the arm it chose), slightly changing the Q-value upwards or downwards based on the outcome. How much the Q-value is changed is determined by the learning rate `alpha`. 323 | 324 | The agent will continue to do this for however many trials we specify. Hopefully by the end of the trials, the agent will have correctly approximated the Q-values and will be choosing the best arm most of the time. 325 | 326 | Below is a complete R script for generating bandit task data for an agent. Specified initially are the **number of trials the agent completes**, **how many arms there are**, **the reward probabilities for each arm**, **the learning rate**, **the inverse temperature value**, and **the initial Q values for each arm**. 327 | 328 | Please play around with this script! Change the learning rate, change the temperature parameter, add as many arms as you like. 329 | 330 | ``` R 331 | data_out <- "C:/Users/..."
#the folder where you want the data to be outputted 332 | 333 | #data generation specifications 334 | nTrials <- 1000 335 | nArms <- 2 #try a different here instead 336 | banditArms <- c(1:nArms) 337 | armRewardProbabilities <- c(0.7, 0.3) #each arm needs its own reward probability 338 | alpha <- .01 #learning rate, play around with this 339 | beta <- 5 #inverse temperature, and with this 340 | Qi <- 0 #initial Q values 341 | currentQs <- vector(length = length(banditArms)) 342 | trialQs <- matrix(data = NA, nrow = nTrials, ncol = nArms) 343 | choiceProbs <- vector(length = length(banditArms)) 344 | trialChoiceProbs <- matrix(data = NA, nrow = nTrials, ncol = nArms) 345 | choices <- vector(length = nTrials) 346 | rewards <- vector(length = nTrials) 347 | 348 | #assign initial Q value 349 | for (arm in banditArms) { 350 | currentQs[arm] <- Qi 351 | } 352 | 353 | for (trial in 1:nTrials) { 354 | 355 | #calculate sumExp for softmax function 356 | sumExp <- 0 357 | for (arm in banditArms) { 358 | sumExp <- sumExp + exp(beta * currentQs[arm]) 359 | } 360 | 361 | #calculate choice probabilities 362 | for (arm in banditArms) { 363 | choiceProbs[arm] = exp(beta * currentQs[arm]) / sumExp 364 | } 365 | 366 | #save choice probabilities in matrix for later visualization 367 | trialChoiceProbs[trial,] <- choiceProbs 368 | 369 | # choose action given choice probabilities, save in choices vector 370 | choices[trial] <- sample(banditArms, size = 1, replace = FALSE, prob = choiceProbs) 371 | 372 | #given bandit arm choice, get reward outcome (based on armRewardProbabilities) 373 | rewards[trial] <- rbinom(1,size = 1,prob = armRewardProbabilities[choices[trial]]) 374 | 375 | #given reward outcome, update Q values 376 | currentQs[choices[trial]] <- currentQs[choices[trial]] + alpha * (rewards[trial] - currentQs[choices[trial]]) 377 | 378 | #save Q values in matrix of all Q-values 379 | trialQs[trial,] <- currentQs 380 | } 381 | 382 | #combine choices and rewards into dataframe 383 | df <- data.frame(choices, rewards) 384 | 385 | #save out data df as csv 386 | fileName <- paste(data_out, "Generated_Data.csv",sep = "/") 387 | write.csv(df,fileName, row.names = FALSE) 388 | ``` 389 | 390 | ### Looking under the hood 391 | 392 | Hopefully the script ran without issues. Still, other than the out putted data file of trial choices and rewards our script doesn't give us much information about how our agent actually learned to perform the task. 393 | 394 | Let's first visualize the development of the Q values for each arm over time. 395 | 396 | ``` R 397 | library(ggplot2) 398 | library(reshape2) 399 | 400 | #turn trialQs matrix into dataframe 401 | Qvalues_df <- as.data.frame(trialQs) 402 | 403 | #add column names 404 | for (i in 1:length(Qvalues_df)){ 405 | colnames(Qvalues_df)[i] <- paste("Arm", i, sep="") 406 | } 407 | 408 | #add column of trial counts 409 | Qvalues_df$trialCount <- as.numeric(row.names(Qvalues_df)) 410 | 411 | #turn df into long format for plotting 412 | Qvalues_long <- melt(Qvalues_df, id = "trialCount") 413 | 414 | #plot Q values over time 415 | ggplot(data=Qvalues_long, aes(x = trialCount, y = value, color = variable)) + 416 | geom_point(size = 0.5) + 417 | ggtitle("Q values by Trial") 418 | 419 | ``` 420 | ![alt text](images/Q_values_by_trial.png) 421 | 422 | As you can see, our Q values begin at our initial value of `0`. As the agent chooses actions over time, it updates the Q-values until they eventually approximate the correct Q-values of `0.7` for **Arm 1** and `0.3` for **Arm 2**. 
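If you want a quick numerical check to go along with these plots, a few lines like the following (run right after the simulation script above, which leaves `choices`, `rewards`, and `nTrials` in the workspace) summarize how the agent actually behaved:

``` R
#proportion of trials on which each arm was chosen
table(choices) / nTrials

#average reward per trial; the more the agent favors Arm 1, the closer this gets to 0.7
mean(rewards)

#compare early and late trials to see the effect of learning
mean(rewards[1:100])
mean(rewards[(nTrials - 99):nTrials])
```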
423 | 424 | One important thing to notice is that the Q-values for **Arm 1** both better approximate the correct value of `0.7` and are noticeably more variable from trial to trial. This is a result of our inverse temperature parameter. Since our agent is fairly greedy (`beta = 5`, which is greater than 1), it chooses **Arm 1** significantly more often than **Arm 2**, so **Arm 1**'s Q-value gets updated much more frequently: it is learned more accurately, but it also jitters more from update to update. 425 | 426 | One way to visualize this greediness is by plotting the choice probabilities for each arm as they evolve over time: 427 | 428 | ``` R 429 | #turn trial choice probs into dataframe 430 | ChoiceProbs_df <- as.data.frame(trialChoiceProbs) 431 | 432 | #add column names 433 | for (i in 1:length(ChoiceProbs_df)){ 434 | colnames(ChoiceProbs_df)[i] <- paste("Arm", i, sep="") 435 | } 436 | 437 | #add column of trial counts 438 | ChoiceProbs_df$trialCount <- as.numeric(row.names(ChoiceProbs_df)) 439 | 440 | #turn df into long format for plotting 441 | ChoiceProbs_long <- melt(ChoiceProbs_df, id = "trialCount") 442 | 443 | #plot choice probabilities over time 444 | ggplot(data=ChoiceProbs_long, aes(x = trialCount, y = value, color = variable)) + 445 | geom_point(size = 0.5) + 446 | ggtitle("Probability of Choosing Arm by Trial") 447 | ``` 448 | 449 | ![alt text](images/choice_probabilities_by_trial.png) 450 | 451 | Initially, the probability of choosing each arm is `0.5`, since the Q-values initialize at the same value of `0`. As the Q-values get updated, however, the agent increasingly chooses **Arm 1** because of its higher Q-value. 452 | 453 | Remember from before that the extent to which our agent prefers the best arm is parameterized by the `beta` parameter. Run the simulation again but change the `beta` parameter to `0.5`. If we again visualize the Q values and choice probabilities we see the following: 454 | 455 | ![alt text](images/Q_values_by_trial2.png) 456 | 457 | ![alt text](images/choice_probabilities_by_trial2.png) 458 | 459 | In this case, our agent is very exploratory - choosing nearly randomly, with little regard to the Q-values of the arms. Despite learning that **Arm 1** has a higher Q-value, the agent continues to choose each arm about half of the time. In this case the agent is clearly not maximizing its return, but interestingly it does a much better job of approximating the correct Q-value for **Arm 2** of `0.3`. This is because an agent needs to sample from an arm repeatedly in order to approximate it correctly, something it wasn't doing in the previous simulation, where it sampled predominantly from **Arm 1**. 460 | 461 | And that's it! Congrats on successfully simulating your first RL learning process. Be sure to play around with the learning rate and inverse temperature and vary the number of bandit arms to see how these parameters affect the decisions the agent makes and how well it can learn the arm probabilities. Hopefully you now have some sense of how an agent can use reinforcement learning to learn which lever arms are best and maximize reward. Given a set of parameters, the agent learned over time and made a series of choices based on this knowledge (the **Q-values** it had learned) and based on its **action policy**. 462 | 463 | 464 | ----- 465 | 466 | ## Modeling our agent 467 | 468 | While the above exercise (hopefully!) was useful in demonstrating the key concepts of reinforcement learning, in real life we don't have access to the parameters that give rise to the data. In fact, that is what modeling is all about!
Say we have a human participant perform our 2-armed bandit task. Which lever arms would they pull? Well, as we know, that depends a lot on their personality. Are they very risk-averse or more optimistic? Are they a fast learner or do they need more time? In other words, if our participant were an RL agent, what would their `learning rate` and `inverse temperature` parameters be equal to? We could ask them, of course, but this isn't very scientific, and it would be hard to draw conclusions from their answers. Instead, it would be better if we could estimate this participant's "parameters" using their actual task data. 469 | 470 | What we are getting at here is that we need a way to infer parameter values given data. We are essentially performing our earlier data simulation in reverse. Rather than specifying parameters and observing the resultant actions, we are instead observing actions and inferring the parameters that gave rise to that data. 471 | 472 | **Below we perform something known as** ***Parameter Estimation*** **to approximate the `learning rate` and `inverse temperature` parameters.** 473 | 474 | > Nathaniel Daw's 2009 [*Trial-by-trial data analysis using computational models*](http://www.princeton.edu/~ndaw/d10.pdf) is an awesome reference that covers parameter estimation. See also a [paper by Wilson & Collins](https://elifesciences.org/articles/49547) that covers best practices for modeling behavioral data. 475 | 476 | #### Parameter Estimation 477 | 478 | In the second part of this tutorial we are going to perform **parameter estimation** on our simulated data. Given our data, we would like to estimate the learning rate `alpha` and inverse temperature `beta` that gave rise to that data. It is important to note that since both our agent and our bandit arms were stochastic (that is, probabilistic instead of deterministic), there is necessarily some noise, so our estimation cannot be perfect. Still, as the number of trials increases we will increasingly be able to approximate our learning rate and inverse temperature. 479 | 480 | >The goal of the next section is to create a model that will return the alpha (learning rate) and beta (inverse temperature) parameter values we used in generating the bandit data we are feeding into the model. 481 | 482 | Parameter Estimation is an algorithmic technique that tries out a series of parameter values (called **candidate values**) and then decides which of those parameters are **most likely**. In other words, our model will make a best guess about what the parameters are and keep updating these parameters until it finds some set of parameter values that **maximize** the **likelihood** of the data. 483 | 484 | >Steps of parameter estimation: 485 | > 486 | >1. Specify a set of parameters at some initial value (e.g., `alpha = 0.5` and `beta = 1`). 487 | >2. Calculate how **likely** the data is given those parameters. 488 | >3. Update the parameters to a new set of values. 489 | >4. Repeat steps 2 - 3. 490 | >5. Return the parameter values that make the data the **most likely**. 491 | 492 | To clarify, let's refer back to our figure of **Probability of Choosing Arm by Trial** when `alpha = 0.01` and `beta = 5` from earlier. 493 | 494 | ![alt text](images/choice_probabilities_by_trial.png) 495 | 496 | Suppose we had a human subject come into the lab and perform the task. When looking at our data, we see that from trials 750 - 1000, our human subject chose `Arm 2` 43% of the time.
If that is the case, clearly our human subject does **not** have a `beta = 5`. In fact, their behavior sounds much more like it corresponds to our second simulation with `beta = 0.5`. 497 | 498 | ![alt text](images/choice_probabilities_by_trial2.png) 499 | 500 | **This next point cannot be overstated.** It is entirely **possible** that despite our agent having the action probabilities corresponding to `beta = 5`, they still chose `Arm 2` 43% of the time. Sure, the probability of choosing `Arm 2` was only about 10% during that stretch, so choosing it that often is astronomically unlikely, but it is still **possible**. Critically, though, it isn't very **likely**, and in fact the data makes much more sense (i.e., the data is much more **likely**) if `beta` instead equals `0.5`. It turns out that we can quantify just *how* likely the data is in each case. By trying out many different candidate parameter values we can determine what the most likely parameters are given the data. 501 | 502 | #### Bayes Rule 503 | 504 | Before we get into the specifics of maximizing our data likelihood, a brief aside on [Bayes rule](https://towardsdatascience.com/what-is-bayes-rule-bb6598d8a2fd). Bayes rule is what allows us to estimate parameter likelihoods. It states that the **probability of our parameters given our data** (given this data, what are the most likely parameters?) is proportional to the **probability of our data given the parameters** multiplied by the prior probability of the parameters. In order to figure out the most likely parameters we actually need to start with the likelihood of the data first. 505 | 506 | >To quote Nathaniel Daw 2009, "This equation famously shows how to start with a theory of how parameters (noisily) produce data, and invert it into a theory by which data (noisily) reveal the parameters that produced it". 507 | 508 | Resources: 509 | - towardsdatascience.com [Intuitive Derivation of Bayes Rule](https://towardsdatascience.com/bayes-theorem-the-holy-grail-of-data-science-55d93315defb) article. 510 | - Another towardsdatascience [article](https://towardsdatascience.com/bayes-rule-applied-75965e4482ff) 511 | 512 | #### Calculating Data Likelihood 513 | 514 | The likelihood of the data given the parameters is equal to the probability of each individual data point (i.e., each action) multiplied together. So, referring back to the probability graph (with `beta = 5`), suppose our agent picked `Arm 2` on the first trial. As we can see, the probability of choosing `Arm 2` was 50%, or `0.5`. Suppose that on Trial 750 our agent picked `Arm 2` again. By then, a `beta` of 5 puts the probability of choosing `Arm 2` at about 10%, or `0.1`. We can get the likelihood of every action just by looking at the choice probability of the arm the agent chose at each trial, which was calculated using the softmax equation. To get the overall likelihood of our data we then simply multiply each of these probabilities together. 515 | 516 | > **IMPORTANT NOTE:** This isn't quite the full story, because it turns out that if you multiply all of the probabilities together (0.5 * 0.3 * 0.9 * 0.1 * ...), especially if you are finding the probability of 1000 trials, you get what is called **arithmetic underflow**. That is, the overall probability becomes so incredibly small that most computers lack the numerical precision to represent it. The solution is to sum the **log** of the probabilities rather than multiplying the probabilities directly.
**log** values are considerably more stable and more immune to underflow. This is called **maximizing the log likelihood**. 517 | 518 | As you might be noticing, the **probability of each action** is just the `choice probability` calculated using a softmax function from our simulation. At each given time point, our agent used the softmax function to calculate probabilities for each action. Thus, we the observers can now do the exact same thing to figure out how likely the actions were. What we are doing here is the following: 519 | 520 | 1. Again, start by specifying parameters. 521 | 2. At trial n, our participant had done A actions and received R rewards. 522 | 3. Given the `learning rate` we specified, our agent should have updated their Q values to correspond to these values, `Q1` and `Q2`. 523 | 4. Given the `inverse temperature` parameter we specified and `Q1` and `Q2`, our agent should have calculated the choice probabilities as `P1` and `P2`. 524 | 5. We already know that our participant chose `Arm 1` on this trial, so the likelihood of that action was `P1` (and vice versa if `Arm 2`). 525 | 6. Repeating this for all trials, we get the probability of all the actions. 526 | 7. Next we change the parameters to new values. 527 | 8. We repeat this until some arbitrary stopping point and then decide in which iteration the likelihood of the data was the greatest. The parameters that correspond to this iteration become our parameter estimates. 528 | 529 | #### Stan (and Rstan) 530 | 531 | There are many different software packages for doing modeling. We will be using [Stan](https://mc-stan.org/) which has versions in most major programming languages. We will be using the RStan package in R. 532 | 533 | - See [here](https://mc-stan.org/users/interfaces/rstan) to learn more about RStan 534 | - Here is an RStan [getting started guide](https://github.com/stan-dev/rstan/wiki/RStan-Getting-Started) 535 | 536 | > Check out the [RStan Documentation](https://mc-stan.org/docs/2_23/stan-users-guide-2_23.pdf)! Be sure to search for any functions that are used below that are confusing, such as `target` or `log_softmax`. 537 | 538 | ## Code for Parameter estimation using RStan 539 | 540 | Next we will actually perform the **parameter estimation** explained above. 541 | 542 | > Remember to first install RStan! `install.packages("rstan", repos = "https://cloud.r-project.org/", dependencies = TRUE)` 543 | 544 | Our Stan model consists of the following sections: 545 | 546 | ##### Data 547 | 548 | ``` 549 | data { 550 | int nArms; //number of bandit arms 551 | int nTrials; //number of trials 552 | int armChoice[nTrials]; //index of which arm was pulled 553 | int result[nTrials]; //outcome of bandit arm pull 554 | } 555 | ``` 556 | 557 | Here we specify what inputs our model needs. When calling the model, we will need to specify how many arms the bandit task has `nArms`, how many trials there are `nTrials`, a vector of arm choices `armChoice` (which arm was pulled on each trial) of size `nTrials`, and another vector of what the results of those arm pulls were `result`, also of size `nTrials`. 558 | 559 | ##### Parameters 560 | 561 | ``` 562 | parameters { 563 | real alpha; //learning rate 564 | real beta; //softmax parameter - inverse temperature 565 | } 566 | ``` 567 | 568 | Here we specify what parameters our model is estimating. We add constraints to the learning rate to keep it between 0 and 1 (other values don't make sense) and make sure that beta is a real number (though it is unbounded). 
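To make the 0-to-1 constraint on the learning rate explicit in the Stan code itself (which also matches the `beta(1, 1)` prior used in the model block further below), the declaration can include lower and upper bounds, for example:

```
parameters {
  real<lower=0, upper=1> alpha; //learning rate, constrained to [0, 1]
  real beta; //softmax parameter - inverse temperature (left unbounded)
}
```

The complete `RL_model.stan` listing below uses the same unconstrained declarations, so if you add the bounds here you will want to add them there as well.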
569 | 570 | ##### Transformed Parameters 571 | 572 | The below code specifies two other "parameters", namely a vector containing Q-values and a prediction error parameter. Note that these are not the same as the parameters being estimated, hence they are called "transformed parameters". The Q-value and the prediction error are initialized, and then below that the model specifies how the Q-value should have changed given the reward feedback on each trial, and given the parameters that we are currently estimating. The transformed parameters are important for estimating the probability of each trial, since we will need to calculate what the Q-values were at each trial (given our candidate `alpha`) and then use these Q-values and our candidate `beta` to calculate the probability of action n at trial n. 573 | 574 | ``` 575 | transformed parameters { 576 | vector[nArms] Q[nTrials]; // value function for each arm 577 | real delta[nTrials]; // prediction error 578 | 579 | for (trial in 1:nTrials) { 580 | 581 | //set initial Q and delta for each trial 582 | if (trial == 1) { 583 | 584 | //if first trial, initialize Q values as specified 585 | for (a in 1:nArms) { 586 | Q[trial, a] = 0; 587 | } 588 | 589 | } else { 590 | 591 | //otherwise, carry forward Q from last trial to serve as initial value 592 | for (a in 1:nArms) { 593 | Q[trial, a] = Q[trial - 1, a]; 594 | } 595 | 596 | } 597 | 598 | //calculate prediction error and update Q (based on specified beta) 599 | delta[trial] = result[trial] - Q[trial, armChoice[trial]]; 600 | 601 | //update Q value based on prediction error (delta) and learning rate (alpha) 602 | Q[trial, armChoice[trial]] = Q[trial, armChoice[trial]] + alpha * delta[trial]; 603 | } 604 | } 605 | ``` 606 | 607 | As you look through this, you will notice that it is remarkably similar to the data generation script we ran earlier. That is because it is! The transformed parameters models the exact same process as before - updating Q-values based on actions, results and the associated prediction errors. The difference here is that we don't stochastically choose the next action based on the softmax function - the actions have already been made! Instead we are calculating what the action probabilities would have been at that trial (given the candidate parameters we are currently testing) in order to determine what the probability of the action that was picked was. 608 | 609 | > Notice that this script is deterministic, not stochastic. Since we already know the actions and results, we are instead (arbitrarily at first) choosing some alpha and beta and then seeing what the Q-values overtime would look like given those choices and actions. These Q-values then let us calculate how likely those actions would have been. The more likely, the better our parameter guesses must be. 610 | 611 | ##### Model 612 | 613 | First we specify some priors for our parameters. Because we have no information yet, we are choosing uninformative priors. Next, our model iterates over hundreds of parameter estimates. For each parameter estimate, it loops through all the trials and calculates the probability of the arm choice that was made, using the same softmax function our agent used when we simulated our data earlier. 
614 | 615 | ``` 616 | model { 617 | // priors 618 | beta ~ normal(0, 5); 619 | alpha ~ beta(1, 1); 620 | 621 | for (trial in 1:nTrials) { 622 | //returns the probability of having made the choice you made, given your beta and your Q's 623 | target += log_softmax(Q[trial] * beta)[armChoice[trial]]; 624 | } 625 | ``` 626 | 627 | ----- 628 | 629 | Below is the completed code as well as an R script that runs the model and spits out parameter estimates. Hopefully these approximately match our actual parameters! 630 | 631 | ##### RL_model.stan: 632 | 633 | Create a new file in RStudio. Add the code below and then save it with the name RL_model.stan. 634 | 635 | ``` 636 | data { 637 | int nArms; //number of bandit arms 638 | int nTrials; //number of trials 639 | int armChoice[nTrials]; //index of which arm was pulled 640 | int result[nTrials]; //outcome of bandit arm pull 641 | } 642 | 643 | parameters { 644 | real alpha; //learning rate 645 | real beta; //softmax parameter - inverse temperature 646 | } 647 | 648 | transformed parameters { 649 | vector[nArms] Q[nTrials]; // value function for each arm 650 | real delta[nTrials]; // prediction error 651 | 652 | for (trial in 1:nTrials) { 653 | 654 | //set initial Q and delta for each trial 655 | if (trial == 1) { 656 | 657 | //if first trial, initialize Q values as specified 658 | for (a in 1:nArms) { 659 | Q[1, a] = 0; 660 | } 661 | 662 | } else { 663 | 664 | //otherwise, carry forward Q from last trial to serve as initial value 665 | for (a in 1:nArms) { 666 | Q[trial, a] = Q[trial - 1, a]; 667 | } 668 | 669 | } 670 | 671 | //calculate prediction error and update Q (based on specified beta) 672 | delta[trial] = result[trial] - Q[trial, armChoice[trial]]; 673 | 674 | //update Q value based on prediction error (delta) and learning rate (alpha) 675 | Q[trial, armChoice[trial]] = Q[trial, armChoice[trial]] + alpha * delta[trial]; 676 | } 677 | } 678 | 679 | model { 680 | // priors 681 | beta ~ normal(0, 5); 682 | alpha ~ beta(1, 1); 683 | 684 | for (trial in 1:nTrials) { 685 | //returns the probability of having made the choice you made, given your beta and your Q's 686 | target += log_softmax(Q[trial] * beta)[armChoice[trial]]; 687 | } 688 | } 689 | ``` 690 | 691 | #### Script for running RL_model.stan using RStan: 692 | 693 | Next, run this script in a new RScript. be sure to check out [RStan documentation](https://mc-stan.org/docs/2_23/stan-users-guide-2_23.pdf) for clarification of the various functions. 694 | 695 | ``` 696 | library("rstan") # observe startup messages 697 | library("tidyverse") 698 | 699 | setwd("C:/Users/... directory with RL_model.stan") 700 | 701 | df <- read_csv("Generated_Data.csv") 702 | model_data <- list( nArms = length(unique(df$choices)), 703 | nTrials = nrow(df), 704 | armChoice = df$choices, 705 | result = df$rewards) 706 | my_model <- stan_model(file = "RL_model.stan") 707 | 708 | fit <- optimizing(object = my_model, data = model_data) 709 | 710 | #get alpha and beta estimates 711 | fit$par[1] 712 | fit$par[2] 713 | ``` 714 | 715 | If everything goes well, you should get parameter estimates for `alpha` and `beta`. For example, I was able to get an `alpha` estimate of `0.0119504` and a `beta` estimate of `7.666186`. While not exactly correct to our correct parameters of `alpha = 0.01` and `beta = 5`, they are remarkably close. 716 | 717 | >**Note:** Since the agent simulation is stochastic, you will certainly get different parameter estimates since the inputted data will be different. 
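One way to build extra confidence in the fit (in the spirit of the Wilson & Collins best-practices paper linked earlier) is a quick parameter-recovery check: simulate data with several known parameter values and see whether the fitting procedure recovers them. Below is a rough sketch of that idea; it assumes you have wrapped the data-generation script above into a hypothetical helper `simulate_bandit(alpha, beta, nTrials)` that returns the `df` of choices and rewards, and that `my_model` has already been compiled as in the script above.

``` R
true_alphas <- c(0.01, 0.05, 0.1) #generating values to test (arbitrary examples)
recovery <- data.frame()

for (a in true_alphas) {
  #simulate_bandit() is a hypothetical wrapper around the data-generation script above
  df <- simulate_bandit(alpha = a, beta = 5, nTrials = 1000)

  model_data <- list(nArms = length(unique(df$choices)),
                     nTrials = nrow(df),
                     armChoice = df$choices,
                     result = df$rewards)

  fit <- optimizing(object = my_model, data = model_data)

  recovery <- rbind(recovery,
                    data.frame(true_alpha = a,
                               est_alpha = fit$par["alpha"],
                               est_beta = fit$par["beta"]))
}

recovery #the estimated values should track the true ones, within noise
```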
718 | 719 | #### Looking inside the model fit 720 | 721 | The `optimizing` Stan function returns point estimates of the parameters by maximizing the joint posterior (log) density. While these are great for getting approximate values, they give us very little sense of how close these estimates are to the correct values. To get that, we will need to perform a more intensive model fit using the [sampling](https://mc-stan.org/rstan/reference/stanmodel-method-sampling.html) function. The `sampling` function allows us to see distributions of possible parameter values, which will hopefully give us a sense of how well our model is able to estimate the correct parameters. 722 | 723 | ``` R 724 | library("rstan") # observe startup messages 725 | library("tidyverse") 726 | 727 | setwd("~/Documents/Programming/RL Modeling") 728 | 729 | df <- read_csv("Generated_Data.csv") 730 | model_data <- list( nArms = length(unique(df$choices)), 731 | nTrials = nrow(df), 732 | armChoice = df$choices, 733 | result = df$rewards) 734 | my_model <- stan_model(file = "RL_model.stan") 735 | 736 | sample <- sampling(object = my_model, data = model_data) 737 | 738 | plot(sample, plotfun = "hist", pars = "alpha") 739 | plot(sample, plotfun = "hist", pars = "beta") 740 | ``` 741 | ![alt text](images/alpha_post_distribution.png) 742 | ![alt text](images/beta_post_distribution.png) 743 | 744 | As you can see, the posterior distribution for `alpha` nicely contains our true value of `0.01`, whereas the distribution for `beta` does not contain the true value of `5`. Your estimates might be different - again, since our data is quite noisy, there is necessarily a lot of noise in estimating our parameters. 745 | 746 | To get a sense of just how close our estimates are we can use the RStan `summary` function, `summary(sample)`, which returns the mean, standard error, standard deviation, etc. of these distributions for each parameter. 747 | 748 | We can further visualize our model fits using the [ShinyStan](https://mc-stan.org/users/interfaces/shinystan) package. Simply install the package with `install.packages("shinystan")` before proceeding. 749 | 750 | ``` R 751 | library("shinystan") 752 | 753 | launch_shinystan(sample) 754 | ``` 755 | 756 | While the features of ShinyStan won't be covered here, feel free to explore them to see the intricacies of the model fit we just performed. 757 | 758 | ----- 759 | 760 | There is a lot more to RStan model fitting and reinforcement learning generally, but those concepts are outside the scope of this tutorial. Hopefully you can now appreciate the math underlying reinforcement learning, and have learned some basics about how we might estimate parameters using the likelihood of the data. There are links throughout this tutorial to other resources that are much more comprehensive if you find these topics interesting. 761 | 762 | Thanks for reading, and again, please let me know if you have any suggestions for improvements (or just want to say hi!)
at raphael.geddert@duke.edu 763 | -------------------------------------------------------------------------------- /images/Q_values_by_trial.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rmgeddert/Reinforcement-Learning-Resource-Guide/d848a453c5aa14a9bc255a5a456f4f4064d11aa0/images/Q_values_by_trial.png -------------------------------------------------------------------------------- /images/Q_values_by_trial2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rmgeddert/Reinforcement-Learning-Resource-Guide/d848a453c5aa14a9bc255a5a456f4f4064d11aa0/images/Q_values_by_trial2.png -------------------------------------------------------------------------------- /images/Softmax_Udacity.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rmgeddert/Reinforcement-Learning-Resource-Guide/d848a453c5aa14a9bc255a5a456f4f4064d11aa0/images/Softmax_Udacity.png -------------------------------------------------------------------------------- /images/SuttonBartoRL.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rmgeddert/Reinforcement-Learning-Resource-Guide/d848a453c5aa14a9bc255a5a456f4f4064d11aa0/images/SuttonBartoRL.png -------------------------------------------------------------------------------- /images/alpha_post_distribution.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rmgeddert/Reinforcement-Learning-Resource-Guide/d848a453c5aa14a9bc255a5a456f4f4064d11aa0/images/alpha_post_distribution.png -------------------------------------------------------------------------------- /images/beta_post_distribution.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rmgeddert/Reinforcement-Learning-Resource-Guide/d848a453c5aa14a9bc255a5a456f4f4064d11aa0/images/beta_post_distribution.png -------------------------------------------------------------------------------- /images/choice_probabilities_by_trial.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rmgeddert/Reinforcement-Learning-Resource-Guide/d848a453c5aa14a9bc255a5a456f4f4064d11aa0/images/choice_probabilities_by_trial.png -------------------------------------------------------------------------------- /images/choice_probabilities_by_trial2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rmgeddert/Reinforcement-Learning-Resource-Guide/d848a453c5aa14a9bc255a5a456f4f4064d11aa0/images/choice_probabilities_by_trial2.png --------------------------------------------------------------------------------