├── README.md
├── code
│   ├── Arm.py
│   ├── BlockyVector.py
│   ├── ReinforceAgent.py
│   ├── distributions.py
│   ├── environments.py
│   ├── logger.py
│   ├── policies.py
│   └── regressors.py
└── pdfs
    ├── A_Few_Observations_About_Policy_Gradient_Approximations.pdf
    ├── A_Minimal_Working_Example_of_Empirical_Gradient_Ascent.pdf
    ├── Is_Randomization_Necessary.pdf
    ├── Natural Gradients, Mahalanobis Distances, and Distances between Distributions.pdf
    ├── Policy_Exploration_in_a_Cold_Universe.pdf
    ├── Policy_Exploration_without_Back-Looking_Terms.pdf
    └── slides.pdf

/README.md:
--------------------------------------------------------------------------------

REINFORCE tutorial
==================

This repository contains a collection of [scripts](code/) and [notes](pdfs/) that explain the basics of the so-called REINFORCE algorithm, a method for estimating the derivative of an expected value with respect to the parameters of a distribution.

The method was introduced into the reinforcement learning literature by Ronald J. Williams in ["Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning"](http://www-anw.cs.umass.edu/~barto/courses/cs687/williams92simple.pdf) (_Machine Learning_, 1992), but it has earlier precedents.

This repository was created to provide some background material for a talk I gave on 6 March 2017 at the Berlin machine learning meet-up. The [slides](pdfs/slides.pdf) from the talk are also available here, although they are not completely self-explanatory.

I have also included a few theoretical notes that explain various aspects of REINFORCE, Trust Region Policy Optimization, and other policy gradient methods:

* ["A Few Observations About Policy Gradient Approximations"](pdfs/A_Few_Observations_About_Policy_Gradient_Approximations.pdf) contains an introductory description of the REINFORCE method;
* ["Policy Exploration without Back-Looking Terms"](pdfs/Policy_Exploration_without_Back-Looking_Terms.pdf) explains a term-dropping trick that reduces the variance of the gradient estimate without changing its mean;
* ["A Minimal Working Example of Empirical Gradient Ascent"](pdfs/A_Minimal_Working_Example_of_Empirical_Gradient_Ascent.pdf) explicitly computes the distribution and mean of the gradient estimate in a simple example;
* ["Policy Exploration in a Cold Universe"](pdfs/Policy_Exploration_in_a_Cold_Universe.pdf) illustrates how the REINFORCE algorithm deals with the exploration/exploitation trade-off in a particularly malicious case;
* ["Is Randomization Necessary?"](pdfs/Is_Randomization_Necessary.pdf) explains why stochastic policies may be better than deterministic ones when the policy class isn't convex.

These notes were originally written for internal use at my company, the robot software firm [micropsi industries](http://www.micropsi-industries.com/), but are now freely available.
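As a rough orientation, here is a hedged sketch of how the pieces under [code/](code/) fit together: an environment, a policy, and a `ReinforceAgent` that improves the policy by empirical gradient ascent. It assumes the script is run from inside the `code/` directory with Theano and OpenAI gym installed; the hyperparameter values are illustrative rather than tuned.

```python
import environments
import policies
from ReinforceAgent import ReinforceAgent

# a simple environment: the goal is to output one fixed target action
env = environments.TargetPractice(sdim=3, udim=2)

# a degree-1 polynomial policy whose dimensions are read off the environment
policy = policies.PolynomialPolicy(degree=1, environment=env)

# the agent collects rollouts and follows the REINFORCE gradient estimate
agent = ReinforceAgent(policy)
agent.train(env, I=10, N=20, T=50, gamma=0.9, learning_rate=0.01,
            save_args=False, save_weights=False,
            plot_progress=False, imshow_weights=False)
```

The `ReachingGame` environment in `Arm.py` exposes the same interface (`reset`, `step`, `render`, `observation_space`, `action_space`) and should be usable in place of `TargetPractice` in the same way.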
-------------------------------------------------------------------------------- /code/Arm.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import matplotlib.pyplot as plt 3 | import time 4 | 5 | import theano 6 | from theano import tensor as tns 7 | 8 | 9 | class Arm(object): 10 | 11 | def __init__(self, lengths): 12 | 13 | self.lengths = np.array(lengths) # lengths of the arm segments 14 | self.n = len(self.lengths) # number of movable joints 15 | 16 | self.lastaction = np.zeros_like(lengths) 17 | 18 | self.friction = 0.10 # resistance: determines how fast speed decays 19 | self.inertia = 0.01 # unresponsiveness: the effect of actions on speed 20 | 21 | self.reset() 22 | self.compile_dynamics() 23 | 24 | def reset(self): 25 | """ Reset the angles and the angle velocities of the arm. """ 26 | 27 | angles = (2*np.pi) * np.random.random(size=self.n) 28 | velocities = np.zeros(self.n) 29 | placeholder = np.nan * np.zeros(2) # dummy tip position 30 | 31 | # first set the angles and velocities to their correct values: 32 | self.state = np.concatenate([angles, velocities, placeholder]) 33 | 34 | # now that the angles have been set, recompute the tip position: 35 | self.state[-2:] = self.x[-1], self.y[-1] 36 | 37 | def DYNAMICS(self, STATE, ACTION): 38 | 39 | OLD_ANGLES = STATE[0 : self.n] 40 | OLD_VELOCITY = STATE[self.n : -2] 41 | 42 | FRICTIONLESS = self.inertia*OLD_VELOCITY + (1 - self.inertia)*ACTION 43 | NEW_VELOCITY = (1 - self.friction) * FRICTIONLESS 44 | 45 | # NEW_ANGLES = OLD_ANGLES + NEW_VELOCITY 46 | NEW_ANGLES = OLD_ANGLES + OLD_VELOCITY 47 | 48 | ABSOLUTE_ANGLES = tns.cumsum(NEW_ANGLES) 49 | 50 | X = tns.sum(self.lengths * np.cos(ABSOLUTE_ANGLES)) 51 | Y = tns.sum(self.lengths * np.sin(ABSOLUTE_ANGLES)) 52 | 53 | return tns.concatenate([NEW_ANGLES, NEW_VELOCITY, [X, Y]]) 54 | 55 | def compile_dynamics(self): 56 | 57 | S = tns.dvector("S") 58 | U = tns.dvector("U") 59 | 60 | F = self.DYNAMICS(S, U) 61 | 62 | Fs = theano.gradient.jacobian(F, wrt=S) 63 | Fu = theano.gradient.jacobian(F, wrt=U) 64 | 65 | F_PARAMS = [F, Fs, Fu] 66 | 67 | print("Compiling dynamics . . 
.") 68 | self.dynamics = theano.function(inputs=[S, U], outputs=F) 69 | self.dynamics_params = theano.function(inputs=[S, U], outputs=F_PARAMS) 70 | print("Done.\n") 71 | 72 | @property 73 | def angles(self): 74 | 75 | return self.state[0 : self.n] 76 | 77 | @property 78 | def x(self): 79 | 80 | absolute_angles = np.cumsum(self.angles) 81 | relative_positions = self.lengths * np.cos(absolute_angles) 82 | absolute_positions = np.cumsum(relative_positions) 83 | 84 | return np.concatenate([[0.0], absolute_positions]) 85 | 86 | @property 87 | def y(self): 88 | 89 | absolute_angles = np.cumsum(self.angles) 90 | relative_positions = self.lengths * np.sin(absolute_angles) 91 | absolute_positions = np.cumsum(relative_positions) 92 | 93 | return np.concatenate([[0.0], absolute_positions]) 94 | 95 | @property 96 | def tipx(self): 97 | 98 | return self.state[-2] 99 | 100 | @property 101 | def tipy(self): 102 | 103 | return self.state[-1] 104 | 105 | def move(self, action): 106 | 107 | self.lastaction = action 108 | self.state = self.dynamics(self.state, action) 109 | 110 | 111 | class Box(object): 112 | 113 | def __init__(self, low, high): 114 | 115 | self.low = low 116 | self.high = high 117 | 118 | @property 119 | def shape(self): 120 | 121 | return self.low.shape 122 | 123 | def sample(self): 124 | 125 | return self.low + (self.high - self.low)*np.random.random(size=self.shape) 126 | 127 | 128 | class ReachingGame(object): 129 | 130 | def __init__(self, n=3, lengths=None, figsize=(10, 10)): 131 | 132 | if n is None and lengths is None: 133 | n = 3 134 | 135 | if lengths is None: 136 | lengths = np.ones(n) 137 | 138 | if n is None: 139 | n = len(lengths) 140 | 141 | # parameter initialization 142 | self.arm = Arm(lengths) 143 | self.reset_goal() 144 | 145 | stateones = np.ones(2*self.arm.n + 4) 146 | actionones = np.ones(self.arm.n) 147 | 148 | self.observation_space = Box(-np.inf*stateones, np.inf*stateones) 149 | self.action_space = Box(-0.5*actionones, 0.5*actionones) 150 | 151 | # loss function parameters 152 | self.threshold = 0.1 153 | self.sharpness = 5.0 154 | self.regweight = 0.9 155 | self.offset = self.threshold**2 * (self.sharpness - 1.0) 156 | 157 | # create and compile loss expression in Theano 158 | self.compile_loss() 159 | 160 | # some plotting-relevant parameters 161 | self.figsize = figsize 162 | self.lastreward = 0 163 | self.isvisible = False 164 | 165 | def reset(self): 166 | 167 | self.arm.reset() 168 | self.reset_goal() 169 | 170 | return self.observe() 171 | 172 | def reset_goal(self): 173 | 174 | angle = (2*np.pi) * np.random.random() 175 | radius = np.random.random() 176 | 177 | x = radius * np.cos(angle) 178 | y = radius * np.sin(angle) 179 | 180 | self.set_goal(x, y) 181 | 182 | def step(self, action): 183 | 184 | self.arm.move(action) 185 | 186 | state = self.observe() 187 | self.lastreward = -self.loss(self.goal, self.arm.state, action) 188 | 189 | return state, self.lastreward, False, {} 190 | 191 | def compile_loss(self): 192 | 193 | S = tns.dvector("STATE") 194 | U = tns.dvector("ACTION") 195 | GOAL = tns.dvector("[X*, Y*]") 196 | 197 | # loss, part 1: distance-based loss 198 | 199 | TIP = S[-2:] 200 | DIST2 = tns.sum((GOAL - TIP)**2) 201 | 202 | SHARP = self.sharpness*DIST2 203 | BLUNT = DIST2 + self.offset 204 | SMALL = (DIST2 < self.threshold**2) 205 | 206 | PROPER_LOSS = theano.ifelse.ifelse(SMALL, SHARP, BLUNT) 207 | 208 | # loss, part 2: action-based loss 209 | 210 | REGULARIZER = tns.sum(U**2) 211 | 212 | # part 1 + part 2 213 | 214 | L = PROPER_LOSS + 
self.regweight*REGULARIZER 215 | 216 | Ls = theano.grad(L, wrt=S) 217 | Lu = theano.grad(L, wrt=U) 218 | 219 | Lss = theano.gradient.jacobian(Ls, wrt=S) 220 | Lus = theano.gradient.jacobian(Lu, wrt=S, disconnected_inputs='ignore') 221 | Luu = theano.gradient.jacobian(Lu, wrt=U, disconnected_inputs='ignore') 222 | 223 | LOSS_PARAMS = [L, Ls, Lu, Lss, Lus, Luu] 224 | 225 | print("Compiling loss . . .") 226 | self.loss = theano.function(inputs=[GOAL, S, U], outputs=L) 227 | self.loss_params = theano.function(inputs=[GOAL, S, U], outputs=LOSS_PARAMS) 228 | print("Done.\n") 229 | 230 | def makeplot(self): 231 | """ Initialize the plot of the arm and goal. """ 232 | 233 | plt.ion() # don't stop and wait after plotting 234 | 235 | # Plotting 236 | self.figure, self.axes = plt.subplots(figsize=self.figsize) 237 | 238 | armlength = np.sum(self.arm.lengths) 239 | windowsize = [-1.1*armlength, 1.1*armlength] 240 | 241 | plt.xlim(windowsize) 242 | plt.ylim(windowsize) 243 | 244 | self.armlines, = self.axes.plot(self.arm.x, self.arm.y, 'bo-', ms=10, lw=5, alpha=0.5) 245 | self.dot, = self.axes.plot([self.goalx], [self.goaly], 'ro', ms=15, alpha=0.5) 246 | self.losstext = self.axes.text(-armlength, 0.95*armlength, "", fontsize=18) 247 | 248 | self.axes.set_aspect('equal') 249 | 250 | self.isvisible = True 251 | plt.show() 252 | 253 | def close(self): 254 | """ Close the plot of the arm and goal. """ 255 | 256 | self.reset() 257 | 258 | if self.isvisible: 259 | plt.close(self.figure) 260 | 261 | @property 262 | def goal(self): 263 | 264 | return np.array([self.goalx, self.goaly]) 265 | 266 | def set_goal(self, x, y): 267 | 268 | self.goalx = x 269 | self.goaly = y 270 | 271 | def update(self): 272 | 273 | self.armlines.set_xdata(self.arm.x) 274 | self.armlines.set_ydata(self.arm.y) 275 | 276 | self.dot.set_xdata([self.goalx]) 277 | self.dot.set_ydata([self.goaly]) 278 | 279 | self.losstext.set_text("reward = %.2f" % self.lastreward) 280 | 281 | if self.isvisible: 282 | self.figure.canvas.draw_idle() 283 | plt.pause(1e-8) # matplotlib witchcraft 284 | 285 | def render(self): 286 | 287 | if not self.isvisible: 288 | self.makeplot() 289 | 290 | self.update() 291 | 292 | def observe(self): 293 | 294 | return np.concatenate([self.arm.state, self.goal]) 295 | 296 | 297 | if __name__ == '__main__': 298 | 299 | game = ReachingGame(lengths=np.ones(7)) 300 | 301 | import time 302 | 303 | for i in range(320): 304 | 305 | forces = game.action_space.sample() 306 | 307 | game.step(forces) 308 | game.render() 309 | time.sleep(1./25) 310 | 311 | -------------------------------------------------------------------------------- /code/BlockyVector.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | 3 | 4 | class BlockyVector(list): 5 | 6 | def __init__(self, elements): 7 | """ Create a list of arrays on which various operations can be done. 
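
        Addition, subtraction, multiplication, and exponentiation act
        block-wise; multiplying by a scalar scales every block. The .sum()
        method adds up all entries across all blocks, and .shape lists the
        shape of each block.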
""" 8 | 9 | self.extend(elements) 10 | 11 | def __repr__(self): 12 | 13 | return "BlockyVector(%s)" % list(self) 14 | 15 | def __add__(self, other): 16 | 17 | return BlockyVector([a + b for a, b in zip(self, other)]) 18 | 19 | def __iadd__(self, other): 20 | 21 | blocky_sum = self + other # note: not list concatenation 22 | 23 | # assert self.shape == other.shape == blocky_sum.shape 24 | 25 | return blocky_sum 26 | 27 | def __sub__(self, other): 28 | 29 | return BlockyVector([a - b for a, b in zip(self, other)]) 30 | 31 | def __sum__(self, other): 32 | 33 | return BlockyVector([a + b for a, b in zip(self, other)]) 34 | 35 | def __mul__(self, other): 36 | 37 | if type(other) == BlockyVector: 38 | return BlockyVector([a * b for a, b in zip(self, other)]) 39 | 40 | else: 41 | return BlockyVector([other * block for block in self]) 42 | 43 | def __rmul__(self, scalar): 44 | 45 | return self.__mul__(scalar) 46 | 47 | def __pow__(self, exponent): 48 | 49 | return BlockyVector([block ** exponent for block in self]) 50 | 51 | def sum(self): 52 | 53 | return sum(np.sum(block) for block in self) 54 | 55 | @property 56 | def shape(self): 57 | 58 | return [block.shape for block in self] 59 | 60 | def __eq__(self, other): 61 | 62 | return BlockyVector([a == b for a, b in zip(self, other)]) 63 | 64 | def all(self): 65 | 66 | return np.all([np.all(block) for block in self]) 67 | 68 | def any(self): 69 | 70 | return np.any([np.any(block) for block in self]) 71 | 72 | 73 | if __name__ == '__main__': 74 | 75 | one22 = np.ones((2, 2)) 76 | one13 = np.ones((1, 3)) 77 | 78 | ones = BlockyVector([one22, one13]) 79 | twos = BlockyVector([2*one22, 2*one13]) 80 | 81 | assert ones.shape == [(2, 2), (1, 3)] 82 | assert twos.shape == [(2, 2), (1, 3)] 83 | assert 2*ones == twos 84 | 85 | ones += ones 86 | 87 | assert ones.shape == [(2, 2), (1, 3)] # blocky addition != concatenation 88 | assert twos.shape == [(2, 2), (1, 3)] # no reason this should fail 89 | assert ones == twos # true if __iadd__ worked 90 | 91 | Ax = np.array([2., 3., 4., 6.]) 92 | Ay = np.array([0., 0., 1., 1.]) 93 | 94 | Bx = BlockyVector(Ax) 95 | By = BlockyVector(Ay) 96 | 97 | As = np.array(Ax) + np.array(Ay) 98 | Ad = np.array(Ax) - np.array(Ay) 99 | Bm = np.mean([Bx, By], axis=0) 100 | 101 | assert np.allclose(0.5*As, Bm) 102 | assert np.allclose(As, Bx + By) 103 | assert np.allclose(Ad, Bx - By) 104 | 105 | Ax = np.array([2., 6.]) 106 | Ay = np.array([0., 0., 1., 0.]) 107 | 108 | b = BlockyVector([Ax, Ay]) 109 | 110 | bb = b * b 111 | b2 = b ** 2 112 | 113 | blocky_equality = (bb == b2) 114 | 115 | assert blocky_equality.all() 116 | assert blocky_equality.any() 117 | 118 | list_of_blocky_vectors = [] 119 | 120 | for i in range(7): 121 | 122 | ones23 = np.ones((2, 3)) 123 | ones14 = np.ones((1, 4)) 124 | ones14 = np.ones((5, 7)) 125 | 126 | new_blocky_vector = BlockyVector([ones23, ones14]) 127 | list_of_blocky_vectors.append(new_blocky_vector) 128 | 129 | mean_vector = np.mean(list_of_blocky_vectors, axis=0) 130 | mean_vector = BlockyVector(mean_vector) 131 | 132 | assert mean_vector.shape == new_blocky_vector.shape 133 | assert mean_vector == new_blocky_vector # since they're all identical 134 | -------------------------------------------------------------------------------- /code/ReinforceAgent.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import os 3 | import time 4 | 5 | import logger 6 | from BlockyVector import BlockyVector 7 | from regressors import PolynomialRegressor, 
PolynomialTemporalRegressor 8 | 9 | 10 | class ReinforceAgent(object): 11 | 12 | def __init__(self, policy, baseline=None): 13 | 14 | self.policy = policy 15 | 16 | if baseline is not None: 17 | self.baseline = baseline 18 | else: 19 | self.baseline = PolynomialRegressor(sdim=self.policy.sdim, degree=0) 20 | 21 | self.rsumlists = [] # progress tracker; saves N rewardsums per epoch 22 | 23 | def rollout(self, environment, T=100, render=False, fps=24): 24 | """ Peform a single rollout in the given environment. """ 25 | 26 | smemory = 5 27 | umemory = 5 28 | 29 | environment.reset() # to ensure the existence of a first state 30 | 31 | states = [environment.reset() for t in range(smemory)] 32 | actions = [environment.action_space.sample() for t in range(umemory)] 33 | rewards = [] 34 | scores = [] 35 | 36 | for t in range(T): 37 | 38 | if render: 39 | environment.render() 40 | time.sleep(1.0/fps) 41 | 42 | # The agent responds to the environment: 43 | action = self.policy.sample(states, actions, self.policy.weights) 44 | score = self.policy.score(action, states, actions, self.policy.weights) 45 | 46 | actions.append(action) 47 | scores.append(score) 48 | 49 | # The environment responds to the agent: 50 | state, reward, done, info = environment.step(action) 51 | 52 | states.append(state) 53 | rewards.append(reward) 54 | 55 | if done: 56 | break 57 | 58 | if render: 59 | environment.close() 60 | 61 | # Because of the state yielded from the initial environment.reset(), 62 | # we end up with one state which the agent never gets to respond to: 63 | states = states[smemory - 1 : T + smemory - 1] 64 | actions = actions[umemory:] 65 | 66 | assert len(states) == len(actions) == T 67 | 68 | return states, actions, rewards, scores 69 | 70 | def reinforce(self, states, rewards, scores, gamma=None): 71 | """ Compute (the REINFORCE gradient estimate, the array of advantages). """ 72 | 73 | returns = self.smear(rewards, gamma=gamma) 74 | advantages = returns - self.baseline.predict(states) 75 | 76 | assert not np.any(np.isnan(returns)) 77 | assert not np.any(np.isnan(advantages)) 78 | 79 | # Note: the * in `adv * score` triggers score.__rmul__(adv). 80 | 81 | terms = [adv * score for adv, score in zip(advantages, scores)] 82 | gradient = BlockyVector(np.sum(terms, axis=0)) 83 | 84 | assert gradient.shape == self.policy.weights.shape 85 | 86 | return gradient, advantages 87 | 88 | def collect(self, environment, N=20, T=100, gamma=None, verbose=True): 89 | """ Collect learning-relevant stats over N rollouts of length <= T. """ 90 | 91 | gradients = [] 92 | rewardsums = [] 93 | 94 | allstates = [] 95 | alladvantages = [] 96 | 97 | for n in range(N): 98 | 99 | states, actions, rewards, scores = self.rollout(environment, T=T) 100 | gradient, advantages = self.reinforce(states, rewards, scores, gamma) 101 | 102 | gradients.append(gradient) 103 | rewardsums.append(np.sum(rewards)) 104 | 105 | allstates.extend(states) 106 | alladvantages.extend(advantages) 107 | 108 | meangradient = BlockyVector(np.mean(gradients, axis=0)) 109 | 110 | if verbose: 111 | 112 | length = (meangradient ** 2).sum() ** 0.5 113 | print("Length of the mean gradient: %.2f." % length) 114 | 115 | sqs = [((BlockyVector(g) - meangradient) ** 2).sum() for g in gradients] 116 | std = np.mean(sqs, axis=0) ** 0.5 117 | print("Standard deviation of the sample gradients: %.2f." 
% std) 118 | 119 | print() 120 | 121 | return meangradient, rewardsums, allstates, alladvantages 122 | 123 | def train(self, environment, I=np.inf, N=100, T=1000, gamma=0.90, learning_rate=0.1, 124 | verbose=True, dirpath=None, save_args=True, save_weights=True, 125 | plot_progress=True, imshow_weights=False): 126 | """ Collect empirical information and update parameters I times. """ 127 | 128 | if save_args or save_weights or plot_progress or imshow_weights: 129 | dirpath = logger.makedir(dirpath) 130 | 131 | if save_args: 132 | logger.save_args(dirpath, policy=self.policy, baseline=self.baseline, 133 | environment=environment, I=I, N=N, T=T, gamma=gamma, 134 | learning_rate=learning_rate, dist=self.policy.dist) 135 | 136 | if plot_progress: 137 | rsumlists = [] 138 | 139 | if imshow_weights: 140 | filename = os.path.join(dirpath, "weights_0000.png") 141 | self.policy.imshow_weights(show=False, filename=filename) 142 | 143 | if save_weights: 144 | new_named_file = os.path.join(dirpath, "weights_0000.npz") 145 | old_most_recent = os.path.join(dirpath, "weights.npz") 146 | self.policy.saveas(new_named_file) 147 | self.policy.saveas(old_most_recent) 148 | 149 | i = 0 150 | 151 | while True: 152 | 153 | if verbose: 154 | print("\n", ("TRAINING EPOCH %s:" % i).center(60), "\n") 155 | 156 | # obtain learning-relevant statistics through experimentation: 157 | gradient, rsums, states, advans = self.collect(environment, N, T, gamma) 158 | 159 | if verbose: 160 | logger.print_stats(rsums) 161 | 162 | # update the policy parameters as suggested by the gradient: 163 | self.policy.update(gradient, learning_rate, verbose=verbose) 164 | 165 | # Re-estimate the parameters of the advantage-predictor: 166 | self.baseline.fit(states, advans, verbose=verbose) 167 | 168 | numi = str(i + 1).rjust(4, '0') 169 | 170 | if save_weights: 171 | new_named_file = os.path.join(dirpath, "weights_%s.npz" % numi) 172 | old_most_recent = os.path.join(dirpath, "weights.npz") 173 | self.policy.saveas(new_named_file) 174 | self.policy.saveas(old_most_recent) 175 | 176 | if imshow_weights: 177 | filename = os.path.join(dirpath, "weights_%s.png" % numi) 178 | self.policy.imshow_weights(show=False, filename=filename) 179 | 180 | if plot_progress: 181 | rsumlists.append(rsums) 182 | filename = os.path.join(dirpath, "progress.png") 183 | logger.plot_progress(rsumlists, show=False, filename=filename) 184 | 185 | i += 1 186 | if i >= I: 187 | break 188 | 189 | if verbose: 190 | print(" Finished training. ".center(72, "="), "\n") 191 | 192 | def smear(self, rewards, gamma=None): 193 | """ Form returns from the rewards. """ 194 | 195 | # In the plan REINFORCE algorithm, every action in the rollout is 196 | # held responsible for everything that happened at any time during 197 | # the rollout, whether past, present, or futre. We express this by 198 | # multiplying each score by the total sum-of-rewards. 199 | 200 | if gamma is None: 201 | return np.sum(rewards) * np.ones_like(rewards) 202 | 203 | # In order to reduce variance, however, it is safe to not hold any 204 | # action accountable for a past event (gamma == 1). If the environ- 205 | # ment has few long-term dependencies, it can also be advisable to 206 | # not hold any actions responsible for much later events (gamma < 1). 
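        # Concretely, the loop below computes the backward recursion
        #
        #     returns[t] = rewards[t] + gamma * returns[t + 1]
        #
        # so that, for example, rewards [1.0, 0.0, 2.0] with gamma = 0.5
        # yield returns [1.5, 1.0, 2.0] (illustrative values, not from the code).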
207 | 208 | returns = np.zeros_like(rewards) 209 | all_later_returns = 0.0 210 | 211 | for t, reward in reversed(list(enumerate(rewards))): 212 | returns[t] = reward + (gamma * all_later_returns) 213 | all_later_returns = returns[t] 214 | 215 | return returns 216 | 217 | 218 | if __name__ == '__main__': 219 | 220 | import policies 221 | 222 | agent = ReinforceAgent(policies.PolynomialPolicy(sdim=2, udim=3, degree=0)) -------------------------------------------------------------------------------- /code/distributions.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | from theano import tensor as tns 3 | from scipy.special import gammaln, betaln 4 | 5 | 6 | def BETALN(A, B): 7 | """ Symbolically compute the value of the Beta function at (A, B). """ 8 | 9 | return tns.gammaln(A) + tns.gammaln(B) - tns.gammaln(A + B) 10 | 11 | 12 | class Beta(object): 13 | 14 | def __init__(self): 15 | 16 | self.low = 0.0 17 | self.high = 1.0 18 | 19 | self.nparams = 2 20 | 21 | def __repr__(self): 22 | 23 | return "Beta()" 24 | 25 | def sample(self, params, size=None): 26 | """ Sample from a beta distribution with the given parameters. """ 27 | 28 | a = params[0] 29 | b = params[1] 30 | 31 | return np.random.beta(a, b, size=size) 32 | 33 | def LOGP(self, x, params): 34 | """ Symbolic log-density according to a Beta distribution. """ 35 | 36 | a = params[0] 37 | b = params[1] 38 | 39 | return tns.sum((a - 1)*tns.log(x) + (b - 1)*tns.log(1 - x) - BETALN(a, b)) 40 | 41 | def logp(self, x, params): 42 | """ Numeric log-density according to a Beta distribution. """ 43 | 44 | a = params[0] 45 | b = params[1] 46 | 47 | return np.sum((a - 1) * np.log(x) + (b - 1) * np.log(1 - x) - betaln(a, b)) 48 | 49 | 50 | class ArctanGaussian(object): 51 | 52 | def __init__(self, low=None, high=None): 53 | 54 | self.low = 0.0 if low is None else low 55 | self.high = 1.0 if high is None else high 56 | 57 | self.nparams = 2 58 | 59 | def __repr__(self): 60 | 61 | return "ArctanGaussian()" 62 | 63 | def sample(self, params, size=None): 64 | """ Arctan of a sample from a Gaussian distribution. """ 65 | 66 | mu = params[0] 67 | sigma = params[1] + 1e-20 68 | 69 | if size is None: 70 | assert len(set(len(param) for param in params)) == 1 # all == size 71 | 72 | gaussian = np.random.normal(loc=mu, scale=np.abs(sigma), size=size) 73 | 74 | return self.squash(gaussian) 75 | 76 | def LOGP(self, x, params): 77 | """ Symbolic log-density of the arctan of a Gassian distribution. """ 78 | 79 | mu = params[0] 80 | sigma = params[1] 81 | 82 | g = self.UNSQUASH(x) 83 | square = (g - mu)**2 / sigma**2 84 | norm = tns.log(2 * np.pi * sigma**2) 85 | 86 | return -0.5 * tns.sum(square + norm) 87 | 88 | def logp(self, x, params): 89 | """ Numeric log-density of the arctan of a Gassian distribution. """ 90 | 91 | mu = params[0] 92 | sigma = params[1] + 1e-20 93 | 94 | g = self.unsquash(x) 95 | square = (g - mu)**2 / sigma**2 96 | norm = np.log(2 * np.pi * sigma**2) 97 | 98 | return -0.5 * np.sum(square + norm) 99 | 100 | def squash(self, sample): 101 | """ Force a sample from the native sample space into the unit box. """ 102 | 103 | return 0.5 + np.arctan(sample)/np.pi 104 | 105 | def unsquash(self, unit_box_sample): 106 | """ Perform the inverse of the boxing operation. """ 107 | 108 | return np.tan(np.pi * (unit_box_sample - 0.5)) 109 | 110 | def SQUASH(self, sample): 111 | """ Perform the boxing operation symbolically (see .squash). 
""" 112 | 113 | return 0.5 + tns.arctan(sample)/np.pi 114 | 115 | def UNSQUASH(self, unit_box_sample): 116 | """ Perform the unboxing operation symbolically (see .unsquash). """ 117 | 118 | return tns.tan(np.pi * (unit_box_sample - 0.5)) 119 | 120 | 121 | class NoisyArctan(ArctanGaussian): 122 | 123 | def __init__(self, sigma=None): 124 | 125 | self.nparams = 1 126 | self.sigma = 0.1 if sigma is None else sigma 127 | 128 | def __repr__(self): 129 | 130 | return "NoisyArctan(sigma=%s)" % str(self.sigma) 131 | 132 | def sample(self, params, size=None): 133 | """ Arctan of a sample from a Gaussian distribution. """ 134 | 135 | gaussian = np.random.normal(loc=params[0], scale=self.sigma, size=size) 136 | 137 | return self.squash(gaussian) 138 | 139 | def LOGP(self, x, params): 140 | """ Symbolic log-density of the arctan of a Gaussian distribution. """ 141 | 142 | square = (self.UNSQUASH(x) - params[0])**2 / self.sigma**2 143 | norm = tns.log(2 * np.pi * self.sigma**2) 144 | 145 | return -0.5 * tns.sum(square + norm) 146 | 147 | def logp(self, x, params): 148 | """ Numeric log-density of the arctan of a Gaussian distribution. """ 149 | 150 | square = (self.unsquash(x) - params[0])**2 / self.sigma**2 151 | norm = np.log(2 * np.pi * sigma**2) 152 | 153 | return -0.5 * np.sum(square + norm) 154 | 155 | 156 | class Gaussian(object): 157 | 158 | def __init__(self, sigma=None): 159 | 160 | if sigma is None: 161 | self.sigma = None 162 | self.nparams = 2 163 | else: 164 | self.sigma = sigma 165 | self.nparams = 1 166 | 167 | def __repr__(self): 168 | 169 | return "Gaussian(sigma=%s)" % str(self.sigma) 170 | 171 | def sample(self, params, size=None): 172 | """ Sample from a Gaussian distribution with the given parameters. """ 173 | 174 | mu = params[0] 175 | sigma = params[1] if self.sigma is None else self.sigma 176 | 177 | return np.random.normal(loc=mu, scale=np.abs(sigma), size=size) 178 | 179 | def LOGP(self, x, params): 180 | """ Symbolic log-density according to a Gaussian distribution. """ 181 | 182 | mu = params[0] 183 | sigma = params[1] if self.sigma is None else self.sigma 184 | 185 | square = (x - mu)**2 / sigma**2 186 | norm = tns.log(2 * np.pi * sigma**2) 187 | 188 | return -0.5 * tns.sum(square + norm) 189 | 190 | def logp(self, x, params): 191 | """ Numeric log-density according to a Gaussian distribution. 
""" 192 | 193 | mu = params[0] 194 | sigma = params[1] if self.sigma is None else self.sigma 195 | 196 | square = (g - mu)**2 / sigma**2 197 | norm = np.log(2 * np.pi * sigma**2) 198 | 199 | return -0.5 * np.sum(square + norm) 200 | 201 | 202 | if __name__ == '__main__': 203 | 204 | from matplotlib import pyplot as plt 205 | 206 | arctangauss = ArctanGaussian() 207 | n = 10000 208 | 209 | y = arctangauss.sample([0, .01], size=n) 210 | 211 | plt.hist(y, bins=40) 212 | plt.show() 213 | 214 | x = np.random.normal(loc=0, scale=0.01, size=n) 215 | 216 | plt.hist(0.5 + np.arctan(x)/np.pi, bins=40) 217 | plt.show() 218 | 219 | plt.hist(arctangauss.squash(x), bins=40) 220 | plt.show() 221 | 222 | # check that the normalization operation actually does what it says: 223 | 224 | arctangauss = ArctanGaussian() 225 | mu, sigma = np.zeros(5), np.ones(5) 226 | 227 | for i in range(100): 228 | 229 | x = np.random.normal(loc=mu, scale=sigma) 230 | Tx = arctangauss.squash(x) 231 | x_reconstructed = arctangauss.unsquash(Tx) 232 | 233 | assert np.allclose(x, x_reconstructed) 234 | assert np.all(0 <= Tx) and np.all(Tx <= 1) 235 | 236 | for i in range(100): 237 | 238 | Tx = arctangauss.sample(mu, sigma) 239 | 240 | assert 0 < Tx and Tx < 1 241 | 242 | # assert density integrates to 1.0: 243 | 244 | beta = Beta() 245 | params = np.random.gamma(1, 1), np.random.gamma(5, 1) 246 | 247 | x = np.linspace(0, 1, 1000) 248 | y = np.exp([beta.logp(xn, params) for xn in x]) 249 | 250 | deltax = x[1:] - x[:-1] 251 | midx = 0.5*(x[1:] + x[:-1]) 252 | 253 | miny = np.min([y[1:], y[:-1]], axis=0) 254 | maxy = np.max([y[1:], y[:-1]], axis=0) 255 | 256 | maxerror = np.sum(deltax * (maxy - miny)) 257 | 258 | undersum = np.sum(deltax * miny) 259 | oversum = np.sum(deltax * maxy) 260 | 261 | assert undersum < 1 262 | assert 1 - undersum < maxerror 263 | 264 | if not np.any(np.isinf(y)): 265 | 266 | assert oversum > 1 267 | assert oversum - 1 < maxerror 268 | 269 | assert oversum > undersum 270 | assert oversum - undersum < 2*maxerror 271 | 272 | # assert empirical frequencies ~= numerical Glivenko-Cantelli integrals: 273 | 274 | samplesize = 10000 275 | sample = beta.sample(params, size=samplesize) 276 | 277 | empirical = [] 278 | numerical = [] 279 | 280 | if not np.any(np.isinf(y)): 281 | midy = 0.5*(maxy + miny) 282 | else: 283 | midy = miny 284 | 285 | for fraction in np.linspace(.05, .95, 20): 286 | 287 | emp = np.sum(sample < fraction) / samplesize 288 | num = np.sum((midx < fraction) * deltax * midy) 289 | 290 | empirical.append(emp) 291 | numerical.append(num) 292 | 293 | tol = 0.05 / min(params) 294 | 295 | assert np.allclose(empirical, numerical, atol=tol, rtol=tol) 296 | 297 | # if you want, do some plotting: 298 | 299 | plt.figure(figsize=(16, 9)) 300 | plt.title("$a = %.2f$, $b = %.2f$" % params, fontsize=24) 301 | plt.hist(sample, bins=50, normed=True) 302 | plt.plot(x, y, lw=5, alpha=0.5, color="red") 303 | plt.show() -------------------------------------------------------------------------------- /code/environments.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | from matplotlib import pyplot as plt 3 | 4 | from gym import spaces 5 | 6 | 7 | class TargetPractice(object): 8 | 9 | def __init__(self, sdim=3, udim=2, goal=None): 10 | """ A game in which the goal is to perform a specific fixed action. 
""" 11 | 12 | self.sdim = sdim 13 | self.udim = udim 14 | 15 | self.observation_space = spaces.box.Box(np.zeros(sdim), np.ones(sdim)) 16 | self.action_space = spaces.box.Box(np.zeros(udim), np.ones(udim)) 17 | 18 | self.goal = np.random.random(size=self.udim) if goal is None else goal 19 | 20 | def __repr__(self): 21 | 22 | return "TargetPractice(sdim=%s, udim=%s, goal=%r)" % (self.sdim, self.udim, self.goal) 23 | 24 | def reset(self): 25 | 26 | self.state = np.random.random(size=self.sdim) 27 | 28 | return self.state 29 | 30 | def dynamics(self, state, action): 31 | 32 | return np.random.random(size=self.sdim) 33 | 34 | def step(self, action): 35 | 36 | # the reward depends on the *old* action, not the new one: 37 | distance = np.sum((action - self.goal) ** 2) 38 | reward = -distance 39 | 40 | # after having thus computed the reward, select the next state: 41 | self.state = self.dynamics(self.state, action) 42 | 43 | return self.state, reward, False, {} # obs, reward, done, oddities 44 | 45 | def render(self): 46 | 47 | raise NotImplementedError 48 | 49 | def close(self): 50 | 51 | raise NotImplementedError 52 | 53 | 54 | class EchoGame(object): 55 | 56 | def __init__(self, sdim=3, udim=3): 57 | """ A game in which the goal is to repeat back the state as an action. """ 58 | 59 | assert sdim == udim 60 | 61 | self.dim = self.sdim = self.udim = sdim 62 | 63 | self.observation_space = spaces.box.Box(np.zeros(self.dim), np.ones(self.dim)) 64 | self.action_space = spaces.box.Box(np.zeros(self.dim), np.ones(self.dim)) 65 | 66 | def __repr__(self): 67 | 68 | return "EchoGame(sdim=%s, udim=%s)" % (self.sdim, self.udim) 69 | 70 | def reset(self): 71 | 72 | self.state = np.random.random(size=self.sdim) 73 | 74 | return self.state 75 | 76 | def dynamics(self, state, action): 77 | 78 | return np.random.random(size=self.sdim) 79 | 80 | def step(self, action): 81 | 82 | # the reward depends on the *old* action, not the new one: 83 | distance = np.sum((self.state - action) ** 2) 84 | reward = -distance 85 | 86 | # after having thus computed the reward, select the next state: 87 | self.state = self.dynamics(self.state, action) 88 | 89 | return self.state, reward, False, {} # obs, reward, done, oddities 90 | 91 | def render(self): 92 | 93 | raise NotImplementedError 94 | 95 | def close(self): 96 | 97 | raise NotImplementedError 98 | 99 | 100 | -------------------------------------------------------------------------------- /code/logger.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | from matplotlib import pyplot as plt 3 | 4 | import datetime 5 | import os 6 | 7 | 8 | def makedir(dirpath=None): 9 | """ Create a directory at results// or `dirpath`; return the path. """ 10 | 11 | # if no path is given, pick one: 12 | if dirpath is None: 13 | now = datetime.datetime.now() 14 | dirname = now.strftime("%Y_%b_%d_%Hh%M") 15 | dirpath = os.path.join("results", dirname) 16 | 17 | # if no such directory exists, create: 18 | if not os.path.exists(dirpath): 19 | os.makedirs(dirpath) 20 | 21 | return dirpath 22 | 23 | 24 | def save_args(dirpath, **kwargs): 25 | """ Save a text file documenting the values of the kwargs. """ 26 | 27 | logpath = os.path.join(dirpath, "call.txt") 28 | logfile = open(logpath, "w") 29 | 30 | for item in kwargs.items(): 31 | logfile.write("%s=%s\n" % item) 32 | 33 | logfile.close() 34 | 35 | 36 | def print_stats(rewardsums): 37 | """ Pretty-print some statistical information about the data. 
""" 38 | 39 | mean = np.mean(rewardsums) 40 | std = np.std(rewardsums) 41 | 42 | meantext = "Mean sum-of-rewards per rollout: %.3f ± %.3f." % (mean, std) 43 | 44 | print(meantext) 45 | print("‾" * len(meantext)) 46 | print() 47 | 48 | print("Percentiles of the sums-of-rewards:") 49 | print() 50 | 51 | percents = np.linspace(0, 100, 5 + 1) 52 | percentiles = [np.percentile(rewardsums, p) for p in percents] 53 | 54 | print(" | ".join(("%.0f" % p).center(9) for p in percents)) 55 | print(" | ".join(("%.5g" % p).center(9) for p in percentiles)) 56 | print() 57 | 58 | 59 | def plot_progress(samples, show=True, filename=None): 60 | """ Plot the temporal development of a list of lists of numbers. """ 61 | 62 | plt.figure(figsize=(20, 10)) 63 | 64 | numepochs = len(samples) 65 | epochs = np.arange(numepochs) 66 | 67 | means = [np.mean(sample) for sample in samples] 68 | medians = [np.median(sample) for sample in samples] 69 | 70 | for p in np.linspace(5, 50, 10): 71 | 72 | top = [np.percentile(sample, 50 + p) for sample in samples] 73 | bot = [np.percentile(sample, 50 - p) for sample in samples] 74 | 75 | plt.fill_between(epochs, bot, top, color="gold", alpha=0.1) 76 | 77 | plt.plot(epochs, medians, color="orange", alpha=0.5, lw=5) 78 | plt.plot(epochs, means, color="blue", alpha=0.5, lw=4) 79 | 80 | plt.xlim(-1, numepochs) 81 | 82 | plt.xlabel("Training epoch", fontsize=24) 83 | plt.ylabel("Sum-of-rewards per episode", fontsize=24) 84 | 85 | if filename: 86 | plt.savefig(filename) 87 | 88 | if show: 89 | plt.show() 90 | 91 | plt.close('all') 92 | -------------------------------------------------------------------------------- /code/policies.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | from matplotlib import pyplot as plt 3 | 4 | import theano 5 | from theano import tensor as tns 6 | 7 | import distributions 8 | from BlockyVector import BlockyVector 9 | 10 | 11 | class Policy(object): 12 | 13 | def __init__(self, *args, **kwargs): 14 | 15 | pass 16 | 17 | def __repr__(self): 18 | 19 | classname = self.__class__.__name__ 20 | keyvalues = ["%s=%s" % item for item in self.__dict__.items()] 21 | 22 | return "%s(%s)" % (classname, "{%s}" % ", ".join(keyvalues)) 23 | 24 | def saveas(self, filename=None): 25 | """ Save the parameters of the policy for later loading. """ 26 | 27 | np.savez(filename, **self.__dict__) 28 | 29 | def load(self, filename=None): 30 | """ Initialize a policy from saved parameters. """ 31 | 32 | for key, value in np.load(filename).items(): 33 | self.__dict__[key] = value if value.ndim > 0 else value.item() 34 | 35 | def imshow_weights(self, blocks=None, show=True, filename=None): 36 | """ imshow the parameter matrix. 
""" 37 | 38 | if blocks is None: 39 | blocks = self.weights 40 | 41 | nblocks = len(blocks) 42 | width, height = 16, 9 43 | 44 | assert nblocks > 0 45 | 46 | ncols = int(np.ceil(np.sqrt(width/height * nblocks))) 47 | nrows = int(np.ceil(nblocks / ncols)) 48 | ncols = nblocks if nrows == 1 else ncols 49 | 50 | figure = plt.figure(figsize=(width, height)) 51 | 52 | for i, block in enumerate(blocks): 53 | 54 | plt.subplot(nrows, ncols, i + 1) 55 | plt.imshow(block, interpolation='nearest', aspect='auto') 56 | plt.colorbar() 57 | 58 | plt.tight_layout() 59 | 60 | if filename is not None: 61 | plt.savefig(filename) 62 | 63 | if show: 64 | plt.show() 65 | 66 | plt.close(figure) 67 | 68 | def update(self, direction, length, verbose=False): 69 | """ Take a step in the direction, peforming certain sanity checks. """ 70 | 71 | assert direction.shape == self.weights.shape 72 | 73 | if verbose: 74 | L2 = (direction ** 2).sum() ** 0.5 75 | print("Length of direction vector: %.5g." % L2) 76 | print("Length of the step taken: %.5g." % (length * L2)) 77 | print() 78 | 79 | shape_before = self.weights.shape 80 | self.weights += length * direction 81 | 82 | assert shape_before == self.weights.shape 83 | 84 | 85 | class GaussianPolicy(Policy): 86 | 87 | def __init__(self, sdim=None, udim=None, weights=None, sigma=None, 88 | filename=None, *args, **kwargs): 89 | 90 | if filename is not None: 91 | 92 | self.load(filename) 93 | self.compile() 94 | return 95 | 96 | self.sdim = sdim 97 | self.udim = udim 98 | 99 | self.dist = distributions.Gaussian(sigma=sigma) 100 | self.weights = self.random_weights() if weights is None else BlockyVector(weights) 101 | 102 | self.compile() 103 | 104 | def LOGP(self, ACTION, PARAMS): 105 | 106 | return self.dist.LOGP(ACTION, PARAMS) 107 | 108 | def compile(self): 109 | 110 | SHIST = tns.dmatrix("STATE HISTORY") 111 | UHIST = tns.dmatrix("ACTION HISTORY") 112 | 113 | WEIGHTS = [tns.dmatrix("WEIGHT_%s" % i) for i in range(len(self.weights))] 114 | PARAMS = self.PARAMS(SHIST, UHIST, WEIGHTS) 115 | 116 | ACTION = tns.dvector("ACTION") 117 | SAMPLE = self.UNWRAP(ACTION) 118 | 119 | LOGP = self.LOGP(SAMPLE, PARAMS) 120 | SCORE = theano.grad(LOGP, wrt=WEIGHTS) 121 | 122 | print("Compiling policy functions . . 
.") 123 | self.paramlist = theano.function(inputs=[SHIST, UHIST] + WEIGHTS, outputs=PARAMS, on_unused_input='ignore') 124 | self.logp = theano.function(inputs=[ACTION, SHIST, UHIST] + WEIGHTS, outputs=LOGP, on_unused_input='ignore') 125 | self.dlogp = theano.function(inputs=[ACTION, SHIST, UHIST] + WEIGHTS, outputs=SCORE, on_unused_input='ignore') 126 | print("Done.\n") 127 | 128 | def params(self, shist, uhist, weights=None): 129 | 130 | if weights is None: 131 | weights = self.weights 132 | 133 | shist = np.atleast_2d(shist) # [] ==> array([[]]) 134 | uhist = np.atleast_2d(uhist) # [] ==> array([[]]) 135 | 136 | return self.paramlist(shist, uhist, *weights) 137 | 138 | def sample(self, shist, uhist, weights=None): 139 | 140 | params = self.params(shist, uhist, weights) 141 | 142 | assert len(params) == self.dist.nparams 143 | 144 | for param in params: 145 | assert len(param) == self.udim 146 | 147 | return self.dist.sample(params) 148 | 149 | def score(self, action, shist=[], uhist=[], weights=None): 150 | 151 | if weights is None: 152 | weights = self.weights 153 | 154 | shist = np.atleast_2d(shist) 155 | uhist = np.atleast_2d(uhist) 156 | 157 | return BlockyVector(self.dlogp(action, shist, uhist, *weights)) 158 | 159 | 160 | 161 | class BoxPolicy(Policy): 162 | 163 | def __init__(self, sdim=None, udim=None, low=None, high=None, 164 | environment=None, weights=None, dist=None, 165 | filename=None, *args, **kwargs): 166 | 167 | if filename is not None: 168 | self.load(filename) 169 | self.compile() 170 | return 171 | 172 | if environment is None: 173 | 174 | self.sdim = sdim 175 | self.udim = udim 176 | 177 | self.low = low 178 | self.high = high 179 | 180 | else: 181 | 182 | self.sdim = environment.observation_space.shape[0] 183 | self.udim = environment.action_space.shape[0] 184 | 185 | self.low = environment.action_space.low 186 | self.high = environment.action_space.high 187 | 188 | self.dist = distributions.ArctanGaussian() if dist is None else dist 189 | self.weights = self.random_weights() if weights is None else BlockyVector(weights) 190 | 191 | self.compile() 192 | 193 | def wrap(self, action): 194 | """ Convert an unnormalized action into a box element. """ 195 | 196 | return self.low + (self.high - self.low)*action 197 | 198 | def unwrap(self, box_elm): 199 | """ Convert a box element into a unit cube element. """ 200 | 201 | return (box_elm - self.low) / (self.high - self.low) 202 | 203 | def WRAP(self, action): 204 | """ Convert an unnormalized action into a box element. """ 205 | 206 | return self.low + (self.high - self.low)*action 207 | 208 | def UNWRAP(self, box_elm): 209 | """ Convert a box element into a unit cube element. """ 210 | 211 | return (box_elm - self.low) / (self.high - self.low) 212 | 213 | def LOGP(self, ACTION, PARAMS): 214 | 215 | return self.dist.LOGP(ACTION, PARAMS) 216 | 217 | def compile(self): 218 | 219 | SHIST = tns.dmatrix("STATE HISTORY") 220 | UHIST = tns.dmatrix("ACTION HISTORY") 221 | 222 | WEIGHTS = [tns.dmatrix("WEIGHT_%s" % i) for i in range(len(self.weights))] 223 | PARAMS = self.PARAMS(SHIST, UHIST, WEIGHTS) 224 | 225 | ACTION = tns.dvector("ACTION") 226 | SAMPLE = self.UNWRAP(ACTION) 227 | 228 | LOGP = self.LOGP(SAMPLE, PARAMS) 229 | SCORE = theano.grad(LOGP, wrt=WEIGHTS) 230 | 231 | print("Compiling policy functions . . 
.") 232 | self.paramlist = theano.function(inputs=[SHIST, UHIST] + WEIGHTS, outputs=PARAMS, on_unused_input='ignore') 233 | self.logp = theano.function(inputs=[ACTION, SHIST, UHIST] + WEIGHTS, outputs=LOGP, on_unused_input='ignore') 234 | self.dlogp = theano.function(inputs=[ACTION, SHIST, UHIST] + WEIGHTS, outputs=SCORE, on_unused_input='ignore') 235 | print("Done.\n") 236 | 237 | def params(self, shist, uhist, weights=None): 238 | 239 | if weights is None: 240 | weights = self.weights 241 | 242 | shist = np.atleast_2d(shist) 243 | uhist = np.atleast_2d(uhist) 244 | 245 | return self.paramlist(shist, uhist, *weights) 246 | 247 | def sample(self, shist, uhist, weights=None): 248 | 249 | params = self.params(shist, uhist, weights) 250 | 251 | assert len(params) == self.dist.nparams 252 | 253 | for param in params: 254 | assert len(param) == self.udim 255 | 256 | unitboxed = self.dist.sample(params) 257 | actionboxed = self.wrap(unitboxed) 258 | 259 | assert not np.any(np.isnan(unitboxed)) 260 | 261 | assert np.all(np.zeros_like(unitboxed) <= unitboxed) 262 | assert np.all(unitboxed <= np.ones_like(unitboxed)) 263 | 264 | assert np.all(self.low <= actionboxed) 265 | assert np.all(actionboxed <= self.high) 266 | 267 | return actionboxed 268 | 269 | def score(self, action, shist=[], uhist=[], weights=None): 270 | 271 | if weights is None: 272 | weights = self.weights 273 | 274 | shist = np.atleast_2d(shist) 275 | uhist = np.atleast_2d(uhist) 276 | 277 | return BlockyVector(self.dlogp(action, shist, uhist, *weights)) 278 | 279 | 280 | class PolynomialPolicy(BoxPolicy): 281 | 282 | def __init__(self, degree=3, *args, **kwargs): 283 | 284 | self.degree = degree 285 | 286 | super().__init__(*args, **{key: val for key, val in kwargs.items() if key != 'degree'}) 287 | 288 | def random_weights(self): 289 | 290 | inputdim = 1 + (self.sdim * self.degree) # [1] + concatenated powers 291 | outputdim = self.udim # each parameter has the same dim as the action 292 | 293 | pick_matrix = lambda: np.random.normal(size=(outputdim, inputdim)) 294 | matrix_list = [pick_matrix() for _ in range(self.dist.nparams)] 295 | 296 | return BlockyVector(matrix_list) 297 | 298 | def PARAMS(self, SHIST, UHIST, WEIGHTS): 299 | 300 | STATE = SHIST[-1, :] 301 | 302 | POWERS = [STATE ** (n + 1) for n in range(self.degree)] 303 | INPUTS = tns.concatenate([[1]] + POWERS) 304 | 305 | return [tns.dot(WEIGHT, INPUTS) for WEIGHT in WEIGHTS] 306 | 307 | 308 | class FeedForwardPolicy(BoxPolicy): 309 | 310 | def __init__(self, hidden=[], degree=None, *args, **kwargs): 311 | 312 | self.hidden = hidden 313 | self.degree = 1 if degree is None else degree 314 | 315 | super().__init__(*args, **{key: val for key, val in kwargs.items() 316 | if key not in ['hidden', 'degree']}) 317 | 318 | def random_weights(self): 319 | 320 | smemory = 2 321 | umemory = 2 322 | 323 | assert self.dist.nparams == 1 324 | 325 | firstsize = (smemory*self.sdim + umemory*self.udim)*self.degree 326 | lastsize = self.udim # note that we only allow a single parameter 327 | 328 | indims = [firstsize] + [self.degree*w for w in self.hidden] 329 | outdims = self.hidden + [lastsize] 330 | 331 | weights = [np.random.normal(size=(outdim, indim)) 332 | for indim, outdim in zip(indims, outdims)] 333 | 334 | return BlockyVector(weights) 335 | 336 | def PARAMS(self, SHIST, UHIST, WEIGHTS): 337 | 338 | smemory = 2 339 | umemory = 2 340 | 341 | SLIST = [SHIST[-(t + 1), :] for t in range(smemory)] 342 | ULIST = [UHIST[-(t + 1), :] for t in range(umemory)] 343 | 344 | INPUT = 
tns.concatenate(SLIST + ULIST) 345 | 346 | X = [INPUT] 347 | 348 | for WEIGHT_D in WEIGHTS[:-1]: 349 | LAYER = tns.concatenate([X[-1] ** n for n in range(self.degree)]) 350 | LINEAR = tns.dot(WEIGHT_D, LAYER) 351 | X.append(tns.tanh(LINEAR)) 352 | 353 | LAYER = tns.concatenate([X[-1] ** n for n in range(self.degree)]) 354 | LINEAR = tns.dot(WEIGHTS[-1], LAYER) 355 | X.append(LINEAR) # no squashing 356 | 357 | return [X[-1]] # list containing only one parameter vector 358 | 359 | 360 | if __name__ == '__main__': 361 | 362 | pass -------------------------------------------------------------------------------- /code/regressors.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | 3 | 4 | class Regressor(object): 5 | 6 | def __init__(self, *args, **kwargs): 7 | """ Initialize a function approximator. """ 8 | 9 | self.params = 0 10 | 11 | def __repr__(self): 12 | 13 | return "Regressor()" 14 | 15 | def design(self, states): 16 | """ Convert an array of states into an input array of the right shape. """ 17 | 18 | return states 19 | 20 | def predict(self, states): 21 | """ Predict an array of values based on an array of states. """ 22 | 23 | raise NotImplementedError 24 | 25 | def error(self, states, values): 26 | 27 | return np.mean((self.predict(states) - values) ** 2) 28 | 29 | def MLE(self, states, values): 30 | """ Compute the maximum-likelihood parameter settings given the data. """ 31 | 32 | raise NotImplementedError 33 | 34 | def fit(self, states, values, caution=0.01, verbose=False): 35 | """ Re-estimate the parameters of the Regressor to fit empirical data. """ 36 | 37 | if verbose: 38 | error = self.error(states, values) 39 | print("Fitting baseline (prior error: %.3f) . . ." % error) 40 | 41 | solution = self.MLE(states, values) 42 | self.params = caution*self.params + (1 - caution)*solution 43 | 44 | if verbose: 45 | error = self.error(states, values) 46 | print("Done (posterior error: %.3f).\n" % error) 47 | 48 | 49 | class PolynomialRegressor(Regressor): 50 | 51 | def __init__(self, sdim, degree=3): 52 | 53 | self.sdim = sdim 54 | self.degree = degree 55 | self.params = np.zeros(1 + self.degree*self.sdim) 56 | 57 | def __repr__(self): 58 | 59 | return "<%s: degree=%s>" % (self.__class__.__name__, self.degree) 60 | 61 | def design(self, states): 62 | """ For the design matrix (matrix of input vectors) from the states. 
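
        Each row of the returned matrix is the constant 1 followed by the
        element-wise powers of the corresponding state, up to self.degree.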
""" 63 | 64 | states = np.array(states) 65 | samples, sdim = states.shape 66 | 67 | assert self.sdim == sdim 68 | 69 | ones = np.ones((samples, 1)) 70 | 71 | if self.degree < 1: 72 | return ones 73 | 74 | statepowers = [states ** (n + 1) for n in range(self.degree)] 75 | powermatrix = np.concatenate(statepowers, axis=1) 76 | 77 | return np.concatenate([ones, powermatrix], axis=1) 78 | 79 | def predict(self, states): 80 | 81 | return self.design(states).dot(self.params) 82 | 83 | def MLE(self, states, values): 84 | 85 | inputs = self.design(states) 86 | solution, residuals, rank, sngrts = np.linalg.lstsq(inputs, values) 87 | 88 | return solution 89 | 90 | 91 | class PolynomialTemporalRegressor(Regressor): 92 | 93 | def __init__(self, sdim, degree=2, timedegree=1): 94 | 95 | self.sdim = sdim 96 | self.degree = degree 97 | self.timedegree = timedegree 98 | self.params = np.zeros(1 + self.degree*self.sdim + self.timedegree) 99 | 100 | def __repr__(self): 101 | 102 | return "<%s: degree=%s>" % (self.__class__.__name__, self.degree) 103 | 104 | def design(self, states): 105 | """ For the design matrix (matrix of input vectors) from the states. """ 106 | 107 | states = np.array(states) 108 | samples, sdim = states.shape 109 | 110 | assert self.sdim == sdim 111 | 112 | ones = [np.ones((samples, 1))] 113 | 114 | timecolumn = np.arange(samples).reshape((samples, 1)) 115 | timepowers = [timecolumn ** (n + 1) for n in range(self.timedegree)] 116 | 117 | statepowers = [states ** (n + 1) for n in range(self.degree)] 118 | 119 | if self.degree == 0 and self.timedegree == 0: 120 | return np.concatenate(ones, axis=1) 121 | 122 | elif self.degree == 0 and self.timedegree > 0: 123 | return np.concatenate(ones + timepowers, axis=1) 124 | 125 | elif self.degree > 0 and self.timedegree == 0: 126 | return np.concatenate(ones + statepowers, axis=1) 127 | 128 | else: 129 | return np.concatenate(ones + statepowers + timepowers, axis=1) 130 | 131 | def predict(self, states): 132 | 133 | return self.design(states).dot(self.params) 134 | 135 | def MLE(self, states, values): 136 | 137 | inputs = self.design(states) 138 | solution, residuals, rank, sngrts = np.linalg.lstsq(inputs, values) 139 | 140 | return solution 141 | 142 | 143 | if __name__ == '__main__': 144 | 145 | sdim = 3 146 | samples = 100 147 | maxdegree = 3 148 | maxtimedegree = 3 149 | 150 | A = np.random.normal(size=sdim) 151 | B = np.random.normal() 152 | 153 | states = np.random.normal(size=(samples, sdim)) 154 | 155 | linear = states.dot(A.T) + B 156 | nonlinear = np.sin(states.dot(A.T)) + np.exp(states).dot(A.T) 157 | 158 | for values, datatype in zip([linear, nonlinear], ['linear', 'nonlinear']): 159 | 160 | print((" Fitting to %s data: " % datatype).center(46, "=")) 161 | print("") 162 | print("") 163 | 164 | for d in range(maxdegree): 165 | 166 | print("Polynomial function approximator of degree %s):" % d) 167 | print("‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾") 168 | 169 | approximator = PolynomialRegressor(sdim, degree=d) 170 | approximator.fit(states, values, verbose=True) 171 | 172 | for t in range(maxtimedegree): 173 | 174 | print("Temporal-polynomial function approximator of degree (%s, %s):" % (d, t)) 175 | print("‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾") 176 | 177 | approximator = PolynomialTemporalRegressor(sdim, degree=d, timedegree=t) 178 | approximator.fit(states, values, verbose=True) 179 | 180 | print() 181 | 182 | print() 183 | -------------------------------------------------------------------------------- 
/pdfs/A_Few_Observations_About_Policy_Gradient_Approximations.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mathias-madsen/reinforce_tutorial/3ead4b6f4fd741ecc86a40f04269c2f9210ee62c/pdfs/A_Few_Observations_About_Policy_Gradient_Approximations.pdf -------------------------------------------------------------------------------- /pdfs/A_Minimal_Working_Example_of_Empirical_Gradient_Ascent.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mathias-madsen/reinforce_tutorial/3ead4b6f4fd741ecc86a40f04269c2f9210ee62c/pdfs/A_Minimal_Working_Example_of_Empirical_Gradient_Ascent.pdf -------------------------------------------------------------------------------- /pdfs/Is_Randomization_Necessary.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mathias-madsen/reinforce_tutorial/3ead4b6f4fd741ecc86a40f04269c2f9210ee62c/pdfs/Is_Randomization_Necessary.pdf -------------------------------------------------------------------------------- /pdfs/Natural Gradients, Mahalanobis Distances, and Distances between Distributions.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mathias-madsen/reinforce_tutorial/3ead4b6f4fd741ecc86a40f04269c2f9210ee62c/pdfs/Natural Gradients, Mahalanobis Distances, and Distances between Distributions.pdf -------------------------------------------------------------------------------- /pdfs/Policy_Exploration_in_a_Cold_Universe.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mathias-madsen/reinforce_tutorial/3ead4b6f4fd741ecc86a40f04269c2f9210ee62c/pdfs/Policy_Exploration_in_a_Cold_Universe.pdf -------------------------------------------------------------------------------- /pdfs/Policy_Exploration_without_Back-Looking_Terms.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mathias-madsen/reinforce_tutorial/3ead4b6f4fd741ecc86a40f04269c2f9210ee62c/pdfs/Policy_Exploration_without_Back-Looking_Terms.pdf -------------------------------------------------------------------------------- /pdfs/slides.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mathias-madsen/reinforce_tutorial/3ead4b6f4fd741ecc86a40f04269c2f9210ee62c/pdfs/slides.pdf --------------------------------------------------------------------------------
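As a standalone numerical companion to the code and notes above, the sketch below (plain NumPy, independent of the rest of the repository; all constants are illustrative) uses the score-function identity behind REINFORCE to estimate the derivative of an expected reward with respect to the mean of a Gaussian sampling distribution, subtracts a baseline to reduce variance, and follows the estimate uphill.

```python
import numpy as np

def reward(x):
    # an arbitrary illustrative reward function, maximized at x = 2
    return -(x - 2.0) ** 2

mu, sigma = 0.0, 1.0            # sample x ~ Normal(mu, sigma); only mu is learned
learning_rate, nsamples = 0.05, 500

for step in range(200):
    x = np.random.normal(mu, sigma, size=nsamples)
    score = (x - mu) / sigma ** 2                   # d/dmu of log N(x | mu, sigma)
    baseline = reward(x).mean()                     # variance-reducing baseline
    gradient = np.mean((reward(x) - baseline) * score)
    mu += learning_rate * gradient                  # empirical gradient ascent

print("final mu: %.3f (should be close to 2)" % mu)
```

After a couple of hundred steps the mean should settle near the maximizer of the reward, which is the same mechanism the agent uses, with the fitted regressor playing the role of the baseline.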