# DeepRLHacks
From a talk given by [John Schulman](http://joschu.net/) titled "The Nuts and Bolts of Deep RL Research" (Aug 2017).
These are tricks written down while attending the summer [Deep RL Bootcamp at UC Berkeley](https://www.deepbootcamp.io/).

**Update**: The RL bootcamp just released the [video](https://www.youtube.com/watch?v=8EcdaCk9KaQ&feature=youtu.be) of this talk and the rest of the [lectures](https://sites.google.com/view/deep-rl-bootcamp/lectures).

## Tips to debug a new algorithm
1. Simplify the problem by using a low-dimensional state space environment.
    - John suggested using the [Pendulum problem](https://gym.openai.com/envs/Pendulum-v0) because it has a 2-D state space (pendulum angle and velocity).
    - It's easy to visualize what the value function looks like, what state the algorithm should be in, and how they evolve over time.
    - It's easy to visually spot why something isn't working (e.g., is the value function smooth enough?).

2. To test whether your algorithm is reasonable, construct a problem you know it should work on.
    - Example: for hierarchical reinforcement learning, you'd construct a problem with an OBVIOUS hierarchy it should learn.
    - You can easily see whether it's doing the right thing.
    - WARNING: don't overfit the method to your toy problem (remember it's a toy problem).

3. Familiarize yourself with certain environments you know well.
    - Over time, you'll learn how long training should take.
    - You'll know how the rewards evolve, etc.
    - This lets you set a benchmark to see how well you're doing against your past trials.
    - John uses the hopper robot, where he knows how fast learning should be, so he can easily spot odd behaviors.

## Tips to debug a new task
1. Simplify the task.
    - Start simple until you see signs of life.
    - Approach 1: simplify the feature space.
        - For example, if you're learning from images (a huge-dimensional space), consider hand-engineering features first. Example: if you think your function is trying to approximate the location of something, use the x, y location as features as step 1.
        - Once it starts working, make the problem harder until you solve the full problem.
    - Approach 2: simplify the reward function.
        - Formulate it so it gives you FAST feedback on whether you're doing the right thing or not.
        - Example: the robot gets a reward of +1 when it hits the target. This is hard to learn because too much may happen between the start and the reward. Reformulate the reward as the distance to the target instead, which will speed up learning and let you iterate faster (see the sketch after this section).
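The shaped-reward idea above can be prototyped with a small environment wrapper. This is a minimal sketch, assuming the classic `gym` 4-tuple `step` API; the wrapper name and the observation layout (agent position in `obs[0:2]`, target position in `obs[2:4]`) are made-up assumptions for illustration, not part of any particular environment.

```python
import gym
import numpy as np


class ShapedRewardWrapper(gym.Wrapper):
    """Replaces a sparse +1-at-the-target reward with a dense negative-distance reward."""

    def step(self, action):
        obs, sparse_reward, done, info = self.env.step(action)  # classic gym API (4-tuple)
        # Hypothetical layout: agent position in obs[0:2], target position in obs[2:4].
        agent_pos, target_pos = obs[0:2], obs[2:4]
        # Dense reward: the agent gets feedback on every step, not only on success.
        shaped_reward = -float(np.linalg.norm(agent_pos - target_pos))
        info["sparse_reward"] = sparse_reward  # keep the original signal around for logging
        return obs, shaped_reward, done, info
```

Once the agent reliably reaches the target with the dense reward, move back toward the original sparse reward to confirm you're still solving the task you actually care about.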
## Tips to frame a problem in RL
Maybe it's unclear what the features are and what the reward should be, or whether the problem is feasible at all.

1. First step: visualize a random policy acting on the problem.
    - See where it takes you.
    - If the random policy occasionally does the right thing, there's a high chance RL will do the right thing.
        - Policy gradient will find this behavior and make it more likely.
    - If the random policy never does the right thing, RL likely won't either.

2. Make sure the observations are usable:
    - Check whether YOU could control the system using the same observations you give the agent.
    - Example: look at the preprocessed images yourself to make sure you aren't removing necessary details or hindering the algorithm in some way.

3. Make sure everything is reasonably scaled.
    - Rule of thumb:
        - Observations: make everything mean 0, standard deviation 1.
        - Reward: if you control it, scale it to a reasonable value.
    - Do it across ALL your data so far.
    - Look at all observations and rewards and make sure there aren't crazy outliers.

4. Have a good baseline whenever you see a new problem.
    - It's unclear which algorithm will work, so have a set of baselines (from other methods):
        - Cross-entropy method
        - Policy gradient methods
        - Some kind of Q-learning method (check out [OpenAI Baselines](https://github.com/openai/baselines) or [RLLab](https://github.com/rll/rllab) as a starting point)

## Reproducing papers
Sometimes (often), it's hard to reproduce results from papers. Some tricks for doing so:

1. Use more samples than needed.
2. Get the policy roughly right... but not exactly.
    - Try to make it work a little bit.
    - Then tweak the hyperparameters to get up to the published performance.
    - If you want to get it to work at ALL, use bigger batch sizes.
        - If the batch size is too small, the noise will overpower the signal.
    - Example: for TRPO, John was using too small a batch size and had to use 100k time steps.
    - For DQN, the best hyperparameters: 10k time steps, 1 million frames in the replay buffer.

## Guidelines for the ongoing training process
Sanity check that your training is going well.

1. Look at the sensitivity of EVERY hyperparameter.
    - If the algorithm is too sensitive, it's NOT robust and you should NOT be happy with it.
    - Sometimes a method works one way because of funny dynamics, but NOT in general.

2. Look for indicators that the optimization process is healthy.
    - These vary.
    - Look at whether the value function is accurate.
        - Is it predicting well?
        - Is it predicting returns well?
        - How big are the updates?
    - Use the standard diagnostics from deep networks.

3. Have a system for continuously benchmarking your code.
    - This needs DISCIPLINE.
    - Look at performance across ALL the problems you tried previously.
        - Sometimes it'll start working on one problem but mess up performance on others.
        - It's easy to overfit on a single problem.
    - Have a battery of benchmarks you run occasionally.

4. You might think your algorithm is working when you're actually seeing random noise.
    - Example: a graph of 7 tasks with 3 algorithms, where it looks like one algorithm might be doing best on all problems, but it turns out they're all the same algorithm with DIFFERENT random seeds.

5. Try different random seeds!!
    - Run multiple times and average.
    - Run multiple tasks on multiple seeds.
        - If you don't, you're likely to overfit.

6. Additional algorithm modifications might be unnecessary.
    - Most tricks are ACTUALLY normalizing something in some way or improving your optimization.
    - A lot of tricks also have the same effect, so you can remove some of them and SIMPLIFY your algorithm (VERY KEY).

7. Simplify your algorithm.
    - It will generalize better.

8. Automate your experiments.
    - Don't spend your whole day watching your code spit out numbers.
    - Launch experiments on cloud services and analyze the results.
    - Frameworks for tracking experiments and results:
        - Mostly use iPython notebooks.
        - Databases seem unnecessary for storing results.

## General training strategies
1. Whiten and standardize data (using ALL data seen since the beginning).
    - Observations:
        - Compute a running mean and standard deviation, then z-transform everything (see the sketch after this section).
        - Do it over ALL data seen (not just the recent data).
            - At least the scaling will then change more slowly over time.
            - If you keep changing the objective, you might trip up the optimizer.
            - Rescaling using only recent data means the optimizer probably didn't know about the change, and performance will collapse.
    - Rewards:
        - Scale them, but DON'T shift them.
        - Shifting affects the agent's will to live.
        - It changes the problem (i.e., how long you want the agent to survive).
    - Standardize targets:
        - The same way as rewards.
    - PCA whitening?
        - Could help.
        - People are starting to see whether it actually helps with neural nets.
        - Huge scales (-1000, 1000) or (-0.001, 0.001) certainly make learning slow.

2. Parameters that inform discount factors.
    - The discount factor determines how far back you're assigning credit.
        - Example: if the factor is 0.99, you're ignoring what happened 100 steps ago, which means you're being shortsighted.
        - It's better to look at how that corresponds to real time.
            - Intuition: in RL we're usually discretizing time.
            - i.e., are those 100 steps 3 seconds of actual time?
            - What happens during that time?
    - If you use TD methods for policy gradient or value function estimation, gamma can be close to 1 (like 0.999).
        - The algorithm becomes very stable.

3. Check that the problem can actually be solved at the chosen discretization level.
    - Example: frame skip in a game.
        - As a human, can you control it or is it impossible?
        - Look at what random exploration looks like.
            - The discretization determines how far your Brownian motion goes.
            - If you take many actions in a row, you tend to explore further.
        - Choose your time discretization in a way that works.

4. Look at episode returns closely.
    - Not just the mean; look at the min and max.
        - The max return is something your policy can hone in on pretty well.
        - Is your policy ever doing the right thing?
    - Look at episode length (sometimes it's more informative than episode reward).
        - In a game you might be losing every time and never win, but the episode length can tell you whether you're losing more SLOWLY.
        - You might see an improvement in episode length at the beginning even before the reward improves.
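A minimal sketch of the running z-transform from point 1 above, keeping the mean and variance over ALL observations seen so far (Welford-style updates). The class name `RunningNormalizer` is made up for illustration; the random data at the bottom just stands in for a stream of environment observations.

```python
import numpy as np


class RunningNormalizer:
    """Tracks the mean/std over every observation ever seen and z-transforms new ones."""

    def __init__(self, shape, eps=1e-8):
        self.mean = np.zeros(shape)
        self.m2 = np.zeros(shape)  # running sum of squared deviations (Welford's algorithm)
        self.count = 0
        self.eps = eps

    def update(self, obs):
        self.count += 1
        delta = obs - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (obs - self.mean)

    def normalize(self, obs):
        std = np.sqrt(self.m2 / max(self.count, 1)) + self.eps
        return (obs - self.mean) / std


# Usage: update the statistics with every observation, normalize before feeding the policy.
normalizer = RunningNormalizer(shape=(3,))
for obs in np.random.randn(1000, 3) * 50.0 + 10.0:  # stand-in for environment observations
    normalizer.update(obs)
    scaled = normalizer.normalize(obs)  # roughly mean 0, std 1 once enough data has been seen
```

Because the statistics accumulate over all data ever seen, the effective objective drifts more and more slowly, which is exactly the property the bullet above is after.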
## Policy gradient diagnostics
1. Look at the entropy really carefully.
    - This is the entropy in ACTION space.
        - You care more about the entropy in state space, but there are no good methods for calculating that.
    - If it's going down too fast, the policy is becoming deterministic and will stop exploring.
    - If it's NOT going down, the policy won't be good because it's essentially random.
    - You can fix this with:
        - A KL penalty
            - Keeps the entropy from decreasing too quickly.
        - An entropy bonus.
    - How to measure entropy:
        - For most policies you can compute the entropy analytically.
        - For continuous actions it's usually a Gaussian policy, so you can compute the differential entropy.

2. Look at the KL divergence.
    - Look at the size of the updates in terms of the KL divergence between the old and new policy.
    - Example:
        - A KL of 0.01 is very small.
        - A KL of 10 is too much.

3. Look at the explained variance of the baseline.
    - It tells you whether the value function is actually a good predictor of the returns.
        - If it's negative, the value function might be overfitting or the returns might be noisy.
        - You'll likely need to tune the hyperparameters.
    - (Entropy, KL, and explained variance are computed in the sketch after this list.)

4. Initialize the policy well.
    - This is very important (more so than in supervised learning).
    - Make the final layer zero or tiny to maximize the entropy.
        - This maximizes random exploration in the beginning.
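The three diagnostics above are cheap to compute from per-batch statistics. Here is a sketch assuming a diagonal Gaussian policy; `old_mean`/`old_log_std` and `new_mean`/`new_log_std` are hypothetical arrays of the policy's outputs before and after an update, and `values`/`returns` are the baseline's predictions and the empirical returns.

```python
import numpy as np


def gaussian_entropy(log_std):
    """Differential entropy of a diagonal Gaussian policy, summed over action dimensions."""
    return np.sum(log_std + 0.5 * np.log(2.0 * np.pi * np.e), axis=-1).mean()


def gaussian_kl(old_mean, old_log_std, new_mean, new_log_std):
    """Mean KL(old || new) between diagonal Gaussians -- a measure of how big the update was."""
    old_var, new_var = np.exp(2.0 * old_log_std), np.exp(2.0 * new_log_std)
    kl = new_log_std - old_log_std + (old_var + (old_mean - new_mean) ** 2) / (2.0 * new_var) - 0.5
    return np.sum(kl, axis=-1).mean()


def explained_variance(values, returns):
    """1.0 means a perfect baseline; values <= 0 mean the value function is useless or worse."""
    var_returns = np.var(returns)
    return np.nan if var_returns == 0 else 1.0 - np.var(returns - values) / var_returns
```

Logging these three numbers once per update covers points 1-3: the entropy should decay slowly, the per-update KL should stay in a sensible range (around 0.01 rather than 10), and the explained variance should climb toward 1.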
## Q-Learning Strategies
1. Be careful about replay buffer memory usage.
    - You might need a huge buffer, so adapt your code accordingly.

2. Play with the learning rate schedule.

3. If it converges slowly or has a slow warm-up period in the beginning:
    - Be patient... DQN converges VERY slowly.

## Bonus from [Andrej Karpathy](http://cs.stanford.edu/people/karpathy/)
1. A good feature can be the difference between two frames.
    - This delta vector can highlight slight state changes that are otherwise difficult to distinguish (see the sketch below).
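A minimal sketch of that frame-difference feature, assuming the frames are already grayscale `numpy` arrays scaled to [0, 1]; stacking the current frame with its delta is just one illustrative way to feed both to the network.

```python
import numpy as np


def frame_delta_feature(prev_frame, curr_frame):
    """Stacks the current frame with its temporal difference from the previous frame.

    The delta channel highlights small state changes (e.g., a slightly moved object)
    that are hard to pick out from a single static frame.
    """
    delta = curr_frame - prev_frame
    return np.stack([curr_frame, delta], axis=0)  # shape: (2, height, width)
```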