# DeepRLHacks
From a talk given by [John Schulman](http://joschu.net/) titled "The Nuts and Bolts of Deep RL Research" (Aug 2017).
These are tricks written down while attending the summer [Deep RL Bootcamp at UC Berkeley](https://www.deepbootcamp.io/).

**Update**: The RL bootcamp just released the [video](https://www.youtube.com/watch?v=8EcdaCk9KaQ&feature=youtu.be) of this talk and the rest of the [lectures](https://sites.google.com/view/deep-rl-bootcamp/lectures).

## Tips to debug a new algorithm
1. Simplify the problem by using a low-dimensional state space environment.
    - John suggested using the [Pendulum problem](https://gym.openai.com/envs/Pendulum-v0) because it has a 2-D state space (pendulum angle and velocity).
    - It's easy to visualize what the value function looks like, what state the algorithm should be in, and how they evolve over time.
    - It's easy to visually spot why something isn't working (e.g., is the value function smooth enough?).

2. To test whether your algorithm is reasonable, construct a problem you know it should work on.
    - Example: for hierarchical reinforcement learning, you'd construct a problem with an OBVIOUS hierarchy it should learn.
    - You can easily see whether it's doing the right thing.
    - WARNING: don't overfit the method to your toy problem (remember it's a toy problem).

3. Familiarize yourself with certain environments you know well.
    - Over time, you'll learn how long training should take.
    - You'll know how the rewards evolve, etc.
    - This lets you set a benchmark to see how well you're doing against your past trials.
    - John uses the hopper robot, where he knows how fast learning should be, so he can easily spot odd behaviors.

## Tips to debug a new task
1. Simplify the task.
    - Start simple until you see signs of life.
    - Approach 1: simplify the feature space.
        - For example, if you're learning from images (a huge-dimensional space), consider hand-engineering features first. Example: if you think your function is trying to approximate the location of something, use the x, y location as features as step 1.
        - Once it starts working, make the problem harder until you solve the full problem.
    - Approach 2: simplify the reward function.
        - Formulate it so it gives you FAST feedback on whether you're doing the right thing or not.
        - Example: the robot gets a reward of +1 when it hits the target. This is hard to learn because too much may happen between the start and the reward. Reformulate the reward as the distance to the target instead, which will speed up learning and let you iterate faster (see the sketch after this section).
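The shaped-reward idea above can be prototyped with a small environment wrapper. This is a minimal sketch, assuming the classic `gym` 4-tuple `step` API; the wrapper name and the observation layout (agent position in `obs[0:2]`, target position in `obs[2:4]`) are made-up assumptions for illustration, not part of any particular environment.

```python
import gym
import numpy as np


class ShapedRewardWrapper(gym.Wrapper):
    """Replaces a sparse +1-at-the-target reward with a dense negative-distance reward."""

    def step(self, action):
        obs, sparse_reward, done, info = self.env.step(action)  # classic gym API (4-tuple)
        # Hypothetical layout: agent position in obs[0:2], target position in obs[2:4].
        agent_pos, target_pos = obs[0:2], obs[2:4]
        # Dense reward: the agent gets feedback on every step, not only on success.
        shaped_reward = -float(np.linalg.norm(agent_pos - target_pos))
        info["sparse_reward"] = sparse_reward  # keep the original signal around for logging
        return obs, shaped_reward, done, info
```

Once the agent reliably reaches the target with the dense reward, move back toward the original sparse reward to confirm you're still solving the task you actually care about.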
## Tips to frame a problem in RL
Maybe it's unclear what the features are and what the reward should be, or whether the problem is feasible at all.

1. First step: visualize a random policy acting on the problem.
    - See where it takes you.
    - If the random policy occasionally does the right thing, there's a high chance RL will do the right thing.
        - Policy gradient will find this behavior and make it more likely.
    - If the random policy never does the right thing, RL likely won't either.

2. Make sure the observations are usable:
    - Check whether YOU could control the system using the same observations you give the agent.
    - Example: look at the preprocessed images yourself to make sure you aren't removing necessary details or hindering the algorithm in some way.

3. Make sure everything is reasonably scaled.
    - Rule of thumb:
        - Observations: make everything mean 0, standard deviation 1.
        - Reward: if you control it, scale it to a reasonable value.
    - Do it across ALL your data so far.
    - Look at all observations and rewards and make sure there aren't crazy outliers.

4. Have a good baseline whenever you see a new problem.
    - It's unclear which algorithm will work, so have a set of baselines (from other methods):
        - Cross-entropy method
        - Policy gradient methods
        - Some kind of Q-learning method (check out [OpenAI Baselines](https://github.com/openai/baselines) or [RLLab](https://github.com/rll/rllab) as a starting point)

## Reproducing papers
Sometimes (often), it's hard to reproduce results from papers. Some tricks for doing so:

1. Use more samples than needed.
2. Get the policy roughly right... but not exactly.
    - Try to make it work a little bit.
    - Then tweak the hyperparameters to get up to the published performance.
    - If you want to get it to work at ALL, use bigger batch sizes.
        - If the batch size is too small, the noise will overpower the signal.
    - Example: for TRPO, John was using too small a batch size and had to use 100k time steps.
    - For DQN, the best hyperparameters: 10k time steps, 1 million frames in the replay buffer.

## Guidelines for the ongoing training process
Sanity check that your training is going well.

1. Look at the sensitivity of EVERY hyperparameter.
    - If the algorithm is too sensitive, it's NOT robust and you should NOT be happy with it.
    - Sometimes a method works one way because of funny dynamics, but NOT in general.

2. Look for indicators that the optimization process is healthy.
    - These vary.
    - Look at whether the value function is accurate.
        - Is it predicting well?
        - Is it predicting returns well?
        - How big are the updates?
    - Use the standard diagnostics from deep networks.

3. Have a system for continuously benchmarking your code.
    - This needs DISCIPLINE.
    - Look at performance across ALL the problems you tried previously.
        - Sometimes it'll start working on one problem but mess up performance on others.
        - It's easy to overfit on a single problem.
    - Have a battery of benchmarks you run occasionally.

4. You might think your algorithm is working when you're actually seeing random noise.
    - Example: a graph of 7 tasks with 3 algorithms, where it looks like one algorithm might be doing best on all problems, but it turns out they're all the same algorithm with DIFFERENT random seeds.

5. Try different random seeds!!
    - Run multiple times and average.
    - Run multiple tasks on multiple seeds.
        - If you don't, you're likely to overfit.

6. Additional algorithm modifications might be unnecessary.
    - Most tricks are ACTUALLY normalizing something in some way or improving your optimization.
    - A lot of tricks also have the same effect, so you can remove some of them and SIMPLIFY your algorithm (VERY KEY).

7. Simplify your algorithm.
    - It will generalize better.

8. Automate your experiments.
    - Don't spend your whole day watching your code spit out numbers.
    - Launch experiments on cloud services and analyze the results.
    - Frameworks for tracking experiments and results:
        - Mostly use iPython notebooks.
        - Databases seem unnecessary for storing results.

## General training strategies
1. Whiten and standardize data (using ALL data seen since the beginning).
    - Observations:
        - Compute a running mean and standard deviation, then z-transform everything (see the sketch after this section).
        - Do it over ALL data seen (not just the recent data).
            - At least the scaling will then change more slowly over time.
            - If you keep changing the objective, you might trip up the optimizer.
            - Rescaling using only recent data means the optimizer probably didn't know about the change, and performance will collapse.
    - Rewards:
        - Scale them, but DON'T shift them.
        - Shifting affects the agent's will to live.
        - It changes the problem (i.e., how long you want the agent to survive).
    - Standardize targets:
        - The same way as rewards.
    - PCA whitening?
        - Could help.
        - People are starting to see whether it actually helps with neural nets.
        - Huge scales (-1000, 1000) or (-0.001, 0.001) certainly make learning slow.

2. Parameters that inform discount factors.
    - The discount factor determines how far back you're assigning credit.
        - Example: if the factor is 0.99, you're ignoring what happened 100 steps ago, which means you're being shortsighted.
        - It's better to look at how that corresponds to real time.
            - Intuition: in RL we're usually discretizing time.
            - i.e., are those 100 steps 3 seconds of actual time?
            - What happens during that time?
    - If you use TD methods for policy gradient or value function estimation, gamma can be close to 1 (like 0.999).
        - The algorithm becomes very stable.

3. Check that the problem can actually be solved at the chosen discretization level.
    - Example: frame skip in a game.
        - As a human, can you control it or is it impossible?
        - Look at what random exploration looks like.
            - The discretization determines how far your Brownian motion goes.
            - If you take many actions in a row, you tend to explore further.
        - Choose your time discretization in a way that works.

4. Look at episode returns closely.
    - Not just the mean; look at the min and max.
        - The max return is something your policy can hone in on pretty well.
        - Is your policy ever doing the right thing?
    - Look at episode length (sometimes it's more informative than episode reward).
        - In a game you might be losing every time and never win, but the episode length can tell you whether you're losing more SLOWLY.
        - You might see an improvement in episode length at the beginning even before the reward improves.
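A minimal sketch of the running z-transform from point 1 above, keeping the mean and variance over ALL observations seen so far (Welford-style updates). The class name `RunningNormalizer` is made up for illustration; the random data at the bottom just stands in for a stream of environment observations.

```python
import numpy as np


class RunningNormalizer:
    """Tracks the mean/std over every observation ever seen and z-transforms new ones."""

    def __init__(self, shape, eps=1e-8):
        self.mean = np.zeros(shape)
        self.m2 = np.zeros(shape)  # running sum of squared deviations (Welford's algorithm)
        self.count = 0
        self.eps = eps

    def update(self, obs):
        self.count += 1
        delta = obs - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (obs - self.mean)

    def normalize(self, obs):
        std = np.sqrt(self.m2 / max(self.count, 1)) + self.eps
        return (obs - self.mean) / std


# Usage: update the statistics with every observation, normalize before feeding the policy.
normalizer = RunningNormalizer(shape=(3,))
for obs in np.random.randn(1000, 3) * 50.0 + 10.0:  # stand-in for environment observations
    normalizer.update(obs)
    scaled = normalizer.normalize(obs)  # roughly mean 0, std 1 once enough data has been seen
```

Because the statistics accumulate over all data ever seen, the effective objective drifts more and more slowly, which is exactly the property the bullet above is after.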
## Policy gradient diagnostics
1. Look at the entropy really carefully.
    - This is the entropy in ACTION space.
        - You care more about the entropy in state space, but there are no good methods for calculating that.
    - If it's going down too fast, the policy is becoming deterministic and will stop exploring.
    - If it's NOT going down, the policy won't be good because it's essentially random.
    - You can fix this with:
        - A KL penalty
            - Keeps the entropy from decreasing too quickly.
        - An entropy bonus.
    - How to measure entropy:
        - For most policies you can compute the entropy analytically.
        - For continuous actions it's usually a Gaussian policy, so you can compute the differential entropy.

2. Look at the KL divergence.
    - Look at the size of the updates in terms of the KL divergence between the old and new policy.
    - Example:
        - A KL of 0.01 is very small.
        - A KL of 10 is too much.

3. Look at the explained variance of the baseline.
    - It tells you whether the value function is actually a good predictor of the returns.
        - If it's negative, the value function might be overfitting or the returns might be noisy.
        - You'll likely need to tune the hyperparameters.
    - (Entropy, KL, and explained variance are computed in the sketch after this list.)

4. Initialize the policy well.
    - This is very important (more so than in supervised learning).
    - Make the final layer zero or tiny to maximize the entropy.
        - This maximizes random exploration in the beginning.
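The three diagnostics above are cheap to compute from per-batch statistics. Here is a sketch assuming a diagonal Gaussian policy; `old_mean`/`old_log_std` and `new_mean`/`new_log_std` are hypothetical arrays of the policy's outputs before and after an update, and `values`/`returns` are the baseline's predictions and the empirical returns.

```python
import numpy as np


def gaussian_entropy(log_std):
    """Differential entropy of a diagonal Gaussian policy, summed over action dimensions."""
    return np.sum(log_std + 0.5 * np.log(2.0 * np.pi * np.e), axis=-1).mean()


def gaussian_kl(old_mean, old_log_std, new_mean, new_log_std):
    """Mean KL(old || new) between diagonal Gaussians -- a measure of how big the update was."""
    old_var, new_var = np.exp(2.0 * old_log_std), np.exp(2.0 * new_log_std)
    kl = new_log_std - old_log_std + (old_var + (old_mean - new_mean) ** 2) / (2.0 * new_var) - 0.5
    return np.sum(kl, axis=-1).mean()


def explained_variance(values, returns):
    """1.0 means a perfect baseline; values <= 0 mean the value function is useless or worse."""
    var_returns = np.var(returns)
    return np.nan if var_returns == 0 else 1.0 - np.var(returns - values) / var_returns
```

Logging these three numbers once per update covers points 1-3: the entropy should decay slowly, the per-update KL should stay in a sensible range (around 0.01 rather than 10), and the explained variance should climb toward 1.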
## Q-Learning Strategies
1. Be careful about replay buffer memory usage.
    - You might need a huge buffer, so adapt your code accordingly.

2. Play with the learning rate schedule.

3. If it converges slowly or has a slow warm-up period in the beginning:
    - Be patient... DQN converges VERY slowly.

## Bonus from [Andrej Karpathy](http://cs.stanford.edu/people/karpathy/)
1. A good feature can be the difference between two frames.
    - This delta vector can highlight slight state changes that are otherwise difficult to distinguish (see the sketch below).
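A minimal sketch of that frame-difference feature, assuming the frames are already grayscale `numpy` arrays scaled to [0, 1]; stacking the current frame with its delta is just one illustrative way to feed both to the network.

```python
import numpy as np


def frame_delta_feature(prev_frame, curr_frame):
    """Stacks the current frame with its temporal difference from the previous frame.

    The delta channel highlights small state changes (e.g., a slightly moved object)
    that are hard to pick out from a single static frame.
    """
    delta = curr_frame - prev_frame
    return np.stack([curr_frame, delta], axis=0)  # shape: (2, height, width)
```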