├── README.md
├── code
│   ├── Arm.py
│   ├── BlockyVector.py
│   ├── ReinforceAgent.py
│   ├── distributions.py
│   ├── environments.py
│   ├── logger.py
│   ├── policies.py
│   └── regressors.py
└── pdfs
    ├── A_Few_Observations_About_Policy_Gradient_Approximations.pdf
    ├── A_Minimal_Working_Example_of_Empirical_Gradient_Ascent.pdf
    ├── Is_Randomization_Necessary.pdf
    ├── Natural Gradients, Mahalanobis Distances, and Distances between Distributions.pdf
    ├── Policy_Exploration_in_a_Cold_Universe.pdf
    ├── Policy_Exploration_without_Back-Looking_Terms.pdf
    └── slides.pdf

/README.md:
--------------------------------------------------------------------------------

REINFORCE tutorial
==================

This repository contains a collection of [scripts](code/) and [notes](pdfs/) that explain the basics of the so-called REINFORCE algorithm, a method for estimating the derivative of an expected value with respect to the parameters of a distribution.

The method was introduced into the reinforcement learning literature by Ronald J. Williams in ["Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning"](http://www-anw.cs.umass.edu/~barto/courses/cs687/williams92simple.pdf) (_Machine Learning_, 1992), but it has earlier precedents.

This repository was created to provide some background material for a talk I gave on 6 March 2017 at the Berlin machine learning meet-up. The [slides](pdfs/slides.pdf) from the talk are also available here, although they are not completely self-explanatory.

I have also included a few theoretical notes that explain various aspects of REINFORCE, Trust Region Policy Optimization, and other policy gradient methods:

* ["A Few Observations About Policy Gradient Approximations"](pdfs/A_Few_Observations_About_Policy_Gradient_Approximations.pdf) contains an introductory description of the REINFORCE method;
* ["Policy Exploration without Back-Looking Terms"](pdfs/Policy_Exploration_without_Back-Looking_Terms.pdf) explains a term-dropping trick that reduces the variance of the gradient estimate without changing its mean;
* ["A Minimal Working Example of Empirical Gradient Ascent"](pdfs/A_Minimal_Working_Example_of_Empirical_Gradient_Ascent.pdf) explicitly computes the distribution and mean of the gradient estimate in a simple example;
* ["Policy Exploration in a Cold Universe"](pdfs/Policy_Exploration_in_a_Cold_Universe.pdf) illustrates how the REINFORCE algorithm deals with the exploration/exploitation trade-off in a particularly malicious case;
* ["Is Randomization Necessary?"](pdfs/Is_Randomization_Necessary.pdf) explains why stochastic policies may be better than deterministic ones when the policy class isn't convex.

These notes were originally written for internal use at my company, the robot software firm [micropsi industries](http://www.micropsi-industries.com/), but are now freely available.
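As a rough orientation, here is a hedged sketch of how the pieces under [code/](code/) fit together: an environment, a policy, and a `ReinforceAgent` that improves the policy by empirical gradient ascent. It assumes the script is run from inside the `code/` directory with Theano and OpenAI gym installed; the hyperparameter values are illustrative rather than tuned.

```python
import environments
import policies
from ReinforceAgent import ReinforceAgent

# a simple environment: the goal is to output one fixed target action
env = environments.TargetPractice(sdim=3, udim=2)

# a degree-1 polynomial policy whose dimensions are read off the environment
policy = policies.PolynomialPolicy(degree=1, environment=env)

# the agent collects rollouts and follows the REINFORCE gradient estimate
agent = ReinforceAgent(policy)
agent.train(env, I=10, N=20, T=50, gamma=0.9, learning_rate=0.01,
            save_args=False, save_weights=False,
            plot_progress=False, imshow_weights=False)
```

The `ReachingGame` environment in `Arm.py` exposes the same interface (`reset`, `step`, `render`, `observation_space`, `action_space`) and should be usable in place of `TargetPractice` in the same way.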
-------------------------------------------------------------------------------- /code/Arm.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import matplotlib.pyplot as plt 3 | import time 4 | 5 | import theano 6 | from theano import tensor as tns 7 | 8 | 9 | class Arm(object): 10 | 11 | def __init__(self, lengths): 12 | 13 | self.lengths = np.array(lengths) # lengths of the arm segments 14 | self.n = len(self.lengths) # number of movable joints 15 | 16 | self.lastaction = np.zeros_like(lengths) 17 | 18 | self.friction = 0.10 # resistance: determines how fast speed decays 19 | self.inertia = 0.01 # unresponsiveness: the effect of actions on speed 20 | 21 | self.reset() 22 | self.compile_dynamics() 23 | 24 | def reset(self): 25 | """ Reset the angles and the angle velocities of the arm. """ 26 | 27 | angles = (2*np.pi) * np.random.random(size=self.n) 28 | velocities = np.zeros(self.n) 29 | placeholder = np.nan * np.zeros(2) # dummy tip position 30 | 31 | # first set the angles and velocities to their correct values: 32 | self.state = np.concatenate([angles, velocities, placeholder]) 33 | 34 | # now that the angles have been set, recompute the tip position: 35 | self.state[-2:] = self.x[-1], self.y[-1] 36 | 37 | def DYNAMICS(self, STATE, ACTION): 38 | 39 | OLD_ANGLES = STATE[0 : self.n] 40 | OLD_VELOCITY = STATE[self.n : -2] 41 | 42 | FRICTIONLESS = self.inertia*OLD_VELOCITY + (1 - self.inertia)*ACTION 43 | NEW_VELOCITY = (1 - self.friction) * FRICTIONLESS 44 | 45 | # NEW_ANGLES = OLD_ANGLES + NEW_VELOCITY 46 | NEW_ANGLES = OLD_ANGLES + OLD_VELOCITY 47 | 48 | ABSOLUTE_ANGLES = tns.cumsum(NEW_ANGLES) 49 | 50 | X = tns.sum(self.lengths * np.cos(ABSOLUTE_ANGLES)) 51 | Y = tns.sum(self.lengths * np.sin(ABSOLUTE_ANGLES)) 52 | 53 | return tns.concatenate([NEW_ANGLES, NEW_VELOCITY, [X, Y]]) 54 | 55 | def compile_dynamics(self): 56 | 57 | S = tns.dvector("S") 58 | U = tns.dvector("U") 59 | 60 | F = self.DYNAMICS(S, U) 61 | 62 | Fs = theano.gradient.jacobian(F, wrt=S) 63 | Fu = theano.gradient.jacobian(F, wrt=U) 64 | 65 | F_PARAMS = [F, Fs, Fu] 66 | 67 | print("Compiling dynamics . . 
.") 68 | self.dynamics = theano.function(inputs=[S, U], outputs=F) 69 | self.dynamics_params = theano.function(inputs=[S, U], outputs=F_PARAMS) 70 | print("Done.\n") 71 | 72 | @property 73 | def angles(self): 74 | 75 | return self.state[0 : self.n] 76 | 77 | @property 78 | def x(self): 79 | 80 | absolute_angles = np.cumsum(self.angles) 81 | relative_positions = self.lengths * np.cos(absolute_angles) 82 | absolute_positions = np.cumsum(relative_positions) 83 | 84 | return np.concatenate([[0.0], absolute_positions]) 85 | 86 | @property 87 | def y(self): 88 | 89 | absolute_angles = np.cumsum(self.angles) 90 | relative_positions = self.lengths * np.sin(absolute_angles) 91 | absolute_positions = np.cumsum(relative_positions) 92 | 93 | return np.concatenate([[0.0], absolute_positions]) 94 | 95 | @property 96 | def tipx(self): 97 | 98 | return self.state[-2] 99 | 100 | @property 101 | def tipy(self): 102 | 103 | return self.state[-1] 104 | 105 | def move(self, action): 106 | 107 | self.lastaction = action 108 | self.state = self.dynamics(self.state, action) 109 | 110 | 111 | class Box(object): 112 | 113 | def __init__(self, low, high): 114 | 115 | self.low = low 116 | self.high = high 117 | 118 | @property 119 | def shape(self): 120 | 121 | return self.low.shape 122 | 123 | def sample(self): 124 | 125 | return self.low + (self.high - self.low)*np.random.random(size=self.shape) 126 | 127 | 128 | class ReachingGame(object): 129 | 130 | def __init__(self, n=3, lengths=None, figsize=(10, 10)): 131 | 132 | if n is None and lengths is None: 133 | n = 3 134 | 135 | if lengths is None: 136 | lengths = np.ones(n) 137 | 138 | if n is None: 139 | n = len(lengths) 140 | 141 | # parameter initialization 142 | self.arm = Arm(lengths) 143 | self.reset_goal() 144 | 145 | stateones = np.ones(2*self.arm.n + 4) 146 | actionones = np.ones(self.arm.n) 147 | 148 | self.observation_space = Box(-np.inf*stateones, np.inf*stateones) 149 | self.action_space = Box(-0.5*actionones, 0.5*actionones) 150 | 151 | # loss function parameters 152 | self.threshold = 0.1 153 | self.sharpness = 5.0 154 | self.regweight = 0.9 155 | self.offset = self.threshold**2 * (self.sharpness - 1.0) 156 | 157 | # create and compile loss expression in Theano 158 | self.compile_loss() 159 | 160 | # some plotting-relevant parameters 161 | self.figsize = figsize 162 | self.lastreward = 0 163 | self.isvisible = False 164 | 165 | def reset(self): 166 | 167 | self.arm.reset() 168 | self.reset_goal() 169 | 170 | return self.observe() 171 | 172 | def reset_goal(self): 173 | 174 | angle = (2*np.pi) * np.random.random() 175 | radius = np.random.random() 176 | 177 | x = radius * np.cos(angle) 178 | y = radius * np.sin(angle) 179 | 180 | self.set_goal(x, y) 181 | 182 | def step(self, action): 183 | 184 | self.arm.move(action) 185 | 186 | state = self.observe() 187 | self.lastreward = -self.loss(self.goal, self.arm.state, action) 188 | 189 | return state, self.lastreward, False, {} 190 | 191 | def compile_loss(self): 192 | 193 | S = tns.dvector("STATE") 194 | U = tns.dvector("ACTION") 195 | GOAL = tns.dvector("[X*, Y*]") 196 | 197 | # loss, part 1: distance-based loss 198 | 199 | TIP = S[-2:] 200 | DIST2 = tns.sum((GOAL - TIP)**2) 201 | 202 | SHARP = self.sharpness*DIST2 203 | BLUNT = DIST2 + self.offset 204 | SMALL = (DIST2 < self.threshold**2) 205 | 206 | PROPER_LOSS = theano.ifelse.ifelse(SMALL, SHARP, BLUNT) 207 | 208 | # loss, part 2: action-based loss 209 | 210 | REGULARIZER = tns.sum(U**2) 211 | 212 | # part 1 + part 2 213 | 214 | L = PROPER_LOSS + 
self.regweight*REGULARIZER 215 | 216 | Ls = theano.grad(L, wrt=S) 217 | Lu = theano.grad(L, wrt=U) 218 | 219 | Lss = theano.gradient.jacobian(Ls, wrt=S) 220 | Lus = theano.gradient.jacobian(Lu, wrt=S, disconnected_inputs='ignore') 221 | Luu = theano.gradient.jacobian(Lu, wrt=U, disconnected_inputs='ignore') 222 | 223 | LOSS_PARAMS = [L, Ls, Lu, Lss, Lus, Luu] 224 | 225 | print("Compiling loss . . .") 226 | self.loss = theano.function(inputs=[GOAL, S, U], outputs=L) 227 | self.loss_params = theano.function(inputs=[GOAL, S, U], outputs=LOSS_PARAMS) 228 | print("Done.\n") 229 | 230 | def makeplot(self): 231 | """ Initialize the plot of the arm and goal. """ 232 | 233 | plt.ion() # don't stop and wait after plotting 234 | 235 | # Plotting 236 | self.figure, self.axes = plt.subplots(figsize=self.figsize) 237 | 238 | armlength = np.sum(self.arm.lengths) 239 | windowsize = [-1.1*armlength, 1.1*armlength] 240 | 241 | plt.xlim(windowsize) 242 | plt.ylim(windowsize) 243 | 244 | self.armlines, = self.axes.plot(self.arm.x, self.arm.y, 'bo-', ms=10, lw=5, alpha=0.5) 245 | self.dot, = self.axes.plot([self.goalx], [self.goaly], 'ro', ms=15, alpha=0.5) 246 | self.losstext = self.axes.text(-armlength, 0.95*armlength, "", fontsize=18) 247 | 248 | self.axes.set_aspect('equal') 249 | 250 | self.isvisible = True 251 | plt.show() 252 | 253 | def close(self): 254 | """ Close the plot of the arm and goal. """ 255 | 256 | self.reset() 257 | 258 | if self.isvisible: 259 | plt.close(self.figure) 260 | 261 | @property 262 | def goal(self): 263 | 264 | return np.array([self.goalx, self.goaly]) 265 | 266 | def set_goal(self, x, y): 267 | 268 | self.goalx = x 269 | self.goaly = y 270 | 271 | def update(self): 272 | 273 | self.armlines.set_xdata(self.arm.x) 274 | self.armlines.set_ydata(self.arm.y) 275 | 276 | self.dot.set_xdata([self.goalx]) 277 | self.dot.set_ydata([self.goaly]) 278 | 279 | self.losstext.set_text("reward = %.2f" % self.lastreward) 280 | 281 | if self.isvisible: 282 | self.figure.canvas.draw_idle() 283 | plt.pause(1e-8) # matplotlib witchcraft 284 | 285 | def render(self): 286 | 287 | if not self.isvisible: 288 | self.makeplot() 289 | 290 | self.update() 291 | 292 | def observe(self): 293 | 294 | return np.concatenate([self.arm.state, self.goal]) 295 | 296 | 297 | if __name__ == '__main__': 298 | 299 | game = ReachingGame(lengths=np.ones(7)) 300 | 301 | import time 302 | 303 | for i in range(320): 304 | 305 | forces = game.action_space.sample() 306 | 307 | game.step(forces) 308 | game.render() 309 | time.sleep(1./25) 310 | 311 | -------------------------------------------------------------------------------- /code/BlockyVector.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | 3 | 4 | class BlockyVector(list): 5 | 6 | def __init__(self, elements): 7 | """ Create a list of arrays on which various operations can be done. 
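
        Addition, subtraction, multiplication, and exponentiation act
        block-wise; multiplying by a scalar scales every block. The .sum()
        method adds up all entries across all blocks, and .shape lists the
        shape of each block.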
""" 8 | 9 | self.extend(elements) 10 | 11 | def __repr__(self): 12 | 13 | return "BlockyVector(%s)" % list(self) 14 | 15 | def __add__(self, other): 16 | 17 | return BlockyVector([a + b for a, b in zip(self, other)]) 18 | 19 | def __iadd__(self, other): 20 | 21 | blocky_sum = self + other # note: not list concatenation 22 | 23 | # assert self.shape == other.shape == blocky_sum.shape 24 | 25 | return blocky_sum 26 | 27 | def __sub__(self, other): 28 | 29 | return BlockyVector([a - b for a, b in zip(self, other)]) 30 | 31 | def __sum__(self, other): 32 | 33 | return BlockyVector([a + b for a, b in zip(self, other)]) 34 | 35 | def __mul__(self, other): 36 | 37 | if type(other) == BlockyVector: 38 | return BlockyVector([a * b for a, b in zip(self, other)]) 39 | 40 | else: 41 | return BlockyVector([other * block for block in self]) 42 | 43 | def __rmul__(self, scalar): 44 | 45 | return self.__mul__(scalar) 46 | 47 | def __pow__(self, exponent): 48 | 49 | return BlockyVector([block ** exponent for block in self]) 50 | 51 | def sum(self): 52 | 53 | return sum(np.sum(block) for block in self) 54 | 55 | @property 56 | def shape(self): 57 | 58 | return [block.shape for block in self] 59 | 60 | def __eq__(self, other): 61 | 62 | return BlockyVector([a == b for a, b in zip(self, other)]) 63 | 64 | def all(self): 65 | 66 | return np.all([np.all(block) for block in self]) 67 | 68 | def any(self): 69 | 70 | return np.any([np.any(block) for block in self]) 71 | 72 | 73 | if __name__ == '__main__': 74 | 75 | one22 = np.ones((2, 2)) 76 | one13 = np.ones((1, 3)) 77 | 78 | ones = BlockyVector([one22, one13]) 79 | twos = BlockyVector([2*one22, 2*one13]) 80 | 81 | assert ones.shape == [(2, 2), (1, 3)] 82 | assert twos.shape == [(2, 2), (1, 3)] 83 | assert 2*ones == twos 84 | 85 | ones += ones 86 | 87 | assert ones.shape == [(2, 2), (1, 3)] # blocky addition != concatenation 88 | assert twos.shape == [(2, 2), (1, 3)] # no reason this should fail 89 | assert ones == twos # true if __iadd__ worked 90 | 91 | Ax = np.array([2., 3., 4., 6.]) 92 | Ay = np.array([0., 0., 1., 1.]) 93 | 94 | Bx = BlockyVector(Ax) 95 | By = BlockyVector(Ay) 96 | 97 | As = np.array(Ax) + np.array(Ay) 98 | Ad = np.array(Ax) - np.array(Ay) 99 | Bm = np.mean([Bx, By], axis=0) 100 | 101 | assert np.allclose(0.5*As, Bm) 102 | assert np.allclose(As, Bx + By) 103 | assert np.allclose(Ad, Bx - By) 104 | 105 | Ax = np.array([2., 6.]) 106 | Ay = np.array([0., 0., 1., 0.]) 107 | 108 | b = BlockyVector([Ax, Ay]) 109 | 110 | bb = b * b 111 | b2 = b ** 2 112 | 113 | blocky_equality = (bb == b2) 114 | 115 | assert blocky_equality.all() 116 | assert blocky_equality.any() 117 | 118 | list_of_blocky_vectors = [] 119 | 120 | for i in range(7): 121 | 122 | ones23 = np.ones((2, 3)) 123 | ones14 = np.ones((1, 4)) 124 | ones14 = np.ones((5, 7)) 125 | 126 | new_blocky_vector = BlockyVector([ones23, ones14]) 127 | list_of_blocky_vectors.append(new_blocky_vector) 128 | 129 | mean_vector = np.mean(list_of_blocky_vectors, axis=0) 130 | mean_vector = BlockyVector(mean_vector) 131 | 132 | assert mean_vector.shape == new_blocky_vector.shape 133 | assert mean_vector == new_blocky_vector # since they're all identical 134 | -------------------------------------------------------------------------------- /code/ReinforceAgent.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import os 3 | import time 4 | 5 | import logger 6 | from BlockyVector import BlockyVector 7 | from regressors import PolynomialRegressor, 
PolynomialTemporalRegressor 8 | 9 | 10 | class ReinforceAgent(object): 11 | 12 | def __init__(self, policy, baseline=None): 13 | 14 | self.policy = policy 15 | 16 | if baseline is not None: 17 | self.baseline = baseline 18 | else: 19 | self.baseline = PolynomialRegressor(sdim=self.policy.sdim, degree=0) 20 | 21 | self.rsumlists = [] # progress tracker; saves N rewardsums per epoch 22 | 23 | def rollout(self, environment, T=100, render=False, fps=24): 24 | """ Peform a single rollout in the given environment. """ 25 | 26 | smemory = 5 27 | umemory = 5 28 | 29 | environment.reset() # to ensure the existence of a first state 30 | 31 | states = [environment.reset() for t in range(smemory)] 32 | actions = [environment.action_space.sample() for t in range(umemory)] 33 | rewards = [] 34 | scores = [] 35 | 36 | for t in range(T): 37 | 38 | if render: 39 | environment.render() 40 | time.sleep(1.0/fps) 41 | 42 | # The agent responds to the environment: 43 | action = self.policy.sample(states, actions, self.policy.weights) 44 | score = self.policy.score(action, states, actions, self.policy.weights) 45 | 46 | actions.append(action) 47 | scores.append(score) 48 | 49 | # The environment responds to the agent: 50 | state, reward, done, info = environment.step(action) 51 | 52 | states.append(state) 53 | rewards.append(reward) 54 | 55 | if done: 56 | break 57 | 58 | if render: 59 | environment.close() 60 | 61 | # Because of the state yielded from the initial environment.reset(), 62 | # we end up with one state which the agent never gets to respond to: 63 | states = states[smemory - 1 : T + smemory - 1] 64 | actions = actions[umemory:] 65 | 66 | assert len(states) == len(actions) == T 67 | 68 | return states, actions, rewards, scores 69 | 70 | def reinforce(self, states, rewards, scores, gamma=None): 71 | """ Compute (the REINFORCE gradient estimate, the array of advantages). """ 72 | 73 | returns = self.smear(rewards, gamma=gamma) 74 | advantages = returns - self.baseline.predict(states) 75 | 76 | assert not np.any(np.isnan(returns)) 77 | assert not np.any(np.isnan(advantages)) 78 | 79 | # Note: the * in `adv * score` triggers score.__rmul__(adv). 80 | 81 | terms = [adv * score for adv, score in zip(advantages, scores)] 82 | gradient = BlockyVector(np.sum(terms, axis=0)) 83 | 84 | assert gradient.shape == self.policy.weights.shape 85 | 86 | return gradient, advantages 87 | 88 | def collect(self, environment, N=20, T=100, gamma=None, verbose=True): 89 | """ Collect learning-relevant stats over N rollouts of length <= T. """ 90 | 91 | gradients = [] 92 | rewardsums = [] 93 | 94 | allstates = [] 95 | alladvantages = [] 96 | 97 | for n in range(N): 98 | 99 | states, actions, rewards, scores = self.rollout(environment, T=T) 100 | gradient, advantages = self.reinforce(states, rewards, scores, gamma) 101 | 102 | gradients.append(gradient) 103 | rewardsums.append(np.sum(rewards)) 104 | 105 | allstates.extend(states) 106 | alladvantages.extend(advantages) 107 | 108 | meangradient = BlockyVector(np.mean(gradients, axis=0)) 109 | 110 | if verbose: 111 | 112 | length = (meangradient ** 2).sum() ** 0.5 113 | print("Length of the mean gradient: %.2f." % length) 114 | 115 | sqs = [((BlockyVector(g) - meangradient) ** 2).sum() for g in gradients] 116 | std = np.mean(sqs, axis=0) ** 0.5 117 | print("Standard deviation of the sample gradients: %.2f." 
% std) 118 | 119 | print() 120 | 121 | return meangradient, rewardsums, allstates, alladvantages 122 | 123 | def train(self, environment, I=np.inf, N=100, T=1000, gamma=0.90, learning_rate=0.1, 124 | verbose=True, dirpath=None, save_args=True, save_weights=True, 125 | plot_progress=True, imshow_weights=False): 126 | """ Collect empirical information and update parameters I times. """ 127 | 128 | if save_args or save_weights or plot_progress or imshow_weights: 129 | dirpath = logger.makedir(dirpath) 130 | 131 | if save_args: 132 | logger.save_args(dirpath, policy=self.policy, baseline=self.baseline, 133 | environment=environment, I=I, N=N, T=T, gamma=gamma, 134 | learning_rate=learning_rate, dist=self.policy.dist) 135 | 136 | if plot_progress: 137 | rsumlists = [] 138 | 139 | if imshow_weights: 140 | filename = os.path.join(dirpath, "weights_0000.png") 141 | self.policy.imshow_weights(show=False, filename=filename) 142 | 143 | if save_weights: 144 | new_named_file = os.path.join(dirpath, "weights_0000.npz") 145 | old_most_recent = os.path.join(dirpath, "weights.npz") 146 | self.policy.saveas(new_named_file) 147 | self.policy.saveas(old_most_recent) 148 | 149 | i = 0 150 | 151 | while True: 152 | 153 | if verbose: 154 | print("\n", ("TRAINING EPOCH %s:" % i).center(60), "\n") 155 | 156 | # obtain learning-relevant statistics through experimentation: 157 | gradient, rsums, states, advans = self.collect(environment, N, T, gamma) 158 | 159 | if verbose: 160 | logger.print_stats(rsums) 161 | 162 | # update the policy parameters as suggested by the gradient: 163 | self.policy.update(gradient, learning_rate, verbose=verbose) 164 | 165 | # Re-estimate the parameters of the advantage-predictor: 166 | self.baseline.fit(states, advans, verbose=verbose) 167 | 168 | numi = str(i + 1).rjust(4, '0') 169 | 170 | if save_weights: 171 | new_named_file = os.path.join(dirpath, "weights_%s.npz" % numi) 172 | old_most_recent = os.path.join(dirpath, "weights.npz") 173 | self.policy.saveas(new_named_file) 174 | self.policy.saveas(old_most_recent) 175 | 176 | if imshow_weights: 177 | filename = os.path.join(dirpath, "weights_%s.png" % numi) 178 | self.policy.imshow_weights(show=False, filename=filename) 179 | 180 | if plot_progress: 181 | rsumlists.append(rsums) 182 | filename = os.path.join(dirpath, "progress.png") 183 | logger.plot_progress(rsumlists, show=False, filename=filename) 184 | 185 | i += 1 186 | if i >= I: 187 | break 188 | 189 | if verbose: 190 | print(" Finished training. ".center(72, "="), "\n") 191 | 192 | def smear(self, rewards, gamma=None): 193 | """ Form returns from the rewards. """ 194 | 195 | # In the plan REINFORCE algorithm, every action in the rollout is 196 | # held responsible for everything that happened at any time during 197 | # the rollout, whether past, present, or futre. We express this by 198 | # multiplying each score by the total sum-of-rewards. 199 | 200 | if gamma is None: 201 | return np.sum(rewards) * np.ones_like(rewards) 202 | 203 | # In order to reduce variance, however, it is safe to not hold any 204 | # action accountable for a past event (gamma == 1). If the environ- 205 | # ment has few long-term dependencies, it can also be advisable to 206 | # not hold any actions responsible for much later events (gamma < 1). 
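        # Concretely, the loop below computes the backward recursion
        #
        #     returns[t] = rewards[t] + gamma * returns[t + 1]
        #
        # so that, for example, rewards [1.0, 0.0, 2.0] with gamma = 0.5
        # yield returns [1.5, 1.0, 2.0] (illustrative values, not from the code).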
207 | 208 | returns = np.zeros_like(rewards) 209 | all_later_returns = 0.0 210 | 211 | for t, reward in reversed(list(enumerate(rewards))): 212 | returns[t] = reward + (gamma * all_later_returns) 213 | all_later_returns = returns[t] 214 | 215 | return returns 216 | 217 | 218 | if __name__ == '__main__': 219 | 220 | import policies 221 | 222 | agent = ReinforceAgent(policies.PolynomialPolicy(sdim=2, udim=3, degree=0)) -------------------------------------------------------------------------------- /code/distributions.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | from theano import tensor as tns 3 | from scipy.special import gammaln, betaln 4 | 5 | 6 | def BETALN(A, B): 7 | """ Symbolically compute the value of the Beta function at (A, B). """ 8 | 9 | return tns.gammaln(A) + tns.gammaln(B) - tns.gammaln(A + B) 10 | 11 | 12 | class Beta(object): 13 | 14 | def __init__(self): 15 | 16 | self.low = 0.0 17 | self.high = 1.0 18 | 19 | self.nparams = 2 20 | 21 | def __repr__(self): 22 | 23 | return "Beta()" 24 | 25 | def sample(self, params, size=None): 26 | """ Sample from a beta distribution with the given parameters. """ 27 | 28 | a = params[0] 29 | b = params[1] 30 | 31 | return np.random.beta(a, b, size=size) 32 | 33 | def LOGP(self, x, params): 34 | """ Symbolic log-density according to a Beta distribution. """ 35 | 36 | a = params[0] 37 | b = params[1] 38 | 39 | return tns.sum((a - 1)*tns.log(x) + (b - 1)*tns.log(1 - x) - BETALN(a, b)) 40 | 41 | def logp(self, x, params): 42 | """ Numeric log-density according to a Beta distribution. """ 43 | 44 | a = params[0] 45 | b = params[1] 46 | 47 | return np.sum((a - 1) * np.log(x) + (b - 1) * np.log(1 - x) - betaln(a, b)) 48 | 49 | 50 | class ArctanGaussian(object): 51 | 52 | def __init__(self, low=None, high=None): 53 | 54 | self.low = 0.0 if low is None else low 55 | self.high = 1.0 if high is None else high 56 | 57 | self.nparams = 2 58 | 59 | def __repr__(self): 60 | 61 | return "ArctanGaussian()" 62 | 63 | def sample(self, params, size=None): 64 | """ Arctan of a sample from a Gaussian distribution. """ 65 | 66 | mu = params[0] 67 | sigma = params[1] + 1e-20 68 | 69 | if size is None: 70 | assert len(set(len(param) for param in params)) == 1 # all == size 71 | 72 | gaussian = np.random.normal(loc=mu, scale=np.abs(sigma), size=size) 73 | 74 | return self.squash(gaussian) 75 | 76 | def LOGP(self, x, params): 77 | """ Symbolic log-density of the arctan of a Gassian distribution. """ 78 | 79 | mu = params[0] 80 | sigma = params[1] 81 | 82 | g = self.UNSQUASH(x) 83 | square = (g - mu)**2 / sigma**2 84 | norm = tns.log(2 * np.pi * sigma**2) 85 | 86 | return -0.5 * tns.sum(square + norm) 87 | 88 | def logp(self, x, params): 89 | """ Numeric log-density of the arctan of a Gassian distribution. """ 90 | 91 | mu = params[0] 92 | sigma = params[1] + 1e-20 93 | 94 | g = self.unsquash(x) 95 | square = (g - mu)**2 / sigma**2 96 | norm = np.log(2 * np.pi * sigma**2) 97 | 98 | return -0.5 * np.sum(square + norm) 99 | 100 | def squash(self, sample): 101 | """ Force a sample from the native sample space into the unit box. """ 102 | 103 | return 0.5 + np.arctan(sample)/np.pi 104 | 105 | def unsquash(self, unit_box_sample): 106 | """ Perform the inverse of the boxing operation. """ 107 | 108 | return np.tan(np.pi * (unit_box_sample - 0.5)) 109 | 110 | def SQUASH(self, sample): 111 | """ Perform the boxing operation symbolically (see .squash). 
""" 112 | 113 | return 0.5 + tns.arctan(sample)/np.pi 114 | 115 | def UNSQUASH(self, unit_box_sample): 116 | """ Perform the unboxing operation symbolically (see .unsquash). """ 117 | 118 | return tns.tan(np.pi * (unit_box_sample - 0.5)) 119 | 120 | 121 | class NoisyArctan(ArctanGaussian): 122 | 123 | def __init__(self, sigma=None): 124 | 125 | self.nparams = 1 126 | self.sigma = 0.1 if sigma is None else sigma 127 | 128 | def __repr__(self): 129 | 130 | return "NoisyArctan(sigma=%s)" % str(self.sigma) 131 | 132 | def sample(self, params, size=None): 133 | """ Arctan of a sample from a Gaussian distribution. """ 134 | 135 | gaussian = np.random.normal(loc=params[0], scale=self.sigma, size=size) 136 | 137 | return self.squash(gaussian) 138 | 139 | def LOGP(self, x, params): 140 | """ Symbolic log-density of the arctan of a Gaussian distribution. """ 141 | 142 | square = (self.UNSQUASH(x) - params[0])**2 / self.sigma**2 143 | norm = tns.log(2 * np.pi * self.sigma**2) 144 | 145 | return -0.5 * tns.sum(square + norm) 146 | 147 | def logp(self, x, params): 148 | """ Numeric log-density of the arctan of a Gaussian distribution. """ 149 | 150 | square = (self.unsquash(x) - params[0])**2 / self.sigma**2 151 | norm = np.log(2 * np.pi * sigma**2) 152 | 153 | return -0.5 * np.sum(square + norm) 154 | 155 | 156 | class Gaussian(object): 157 | 158 | def __init__(self, sigma=None): 159 | 160 | if sigma is None: 161 | self.sigma = None 162 | self.nparams = 2 163 | else: 164 | self.sigma = sigma 165 | self.nparams = 1 166 | 167 | def __repr__(self): 168 | 169 | return "Gaussian(sigma=%s)" % str(self.sigma) 170 | 171 | def sample(self, params, size=None): 172 | """ Sample from a Gaussian distribution with the given parameters. """ 173 | 174 | mu = params[0] 175 | sigma = params[1] if self.sigma is None else self.sigma 176 | 177 | return np.random.normal(loc=mu, scale=np.abs(sigma), size=size) 178 | 179 | def LOGP(self, x, params): 180 | """ Symbolic log-density according to a Gaussian distribution. """ 181 | 182 | mu = params[0] 183 | sigma = params[1] if self.sigma is None else self.sigma 184 | 185 | square = (x - mu)**2 / sigma**2 186 | norm = tns.log(2 * np.pi * sigma**2) 187 | 188 | return -0.5 * tns.sum(square + norm) 189 | 190 | def logp(self, x, params): 191 | """ Numeric log-density according to a Gaussian distribution. 
""" 192 | 193 | mu = params[0] 194 | sigma = params[1] if self.sigma is None else self.sigma 195 | 196 | square = (g - mu)**2 / sigma**2 197 | norm = np.log(2 * np.pi * sigma**2) 198 | 199 | return -0.5 * np.sum(square + norm) 200 | 201 | 202 | if __name__ == '__main__': 203 | 204 | from matplotlib import pyplot as plt 205 | 206 | arctangauss = ArctanGaussian() 207 | n = 10000 208 | 209 | y = arctangauss.sample([0, .01], size=n) 210 | 211 | plt.hist(y, bins=40) 212 | plt.show() 213 | 214 | x = np.random.normal(loc=0, scale=0.01, size=n) 215 | 216 | plt.hist(0.5 + np.arctan(x)/np.pi, bins=40) 217 | plt.show() 218 | 219 | plt.hist(arctangauss.squash(x), bins=40) 220 | plt.show() 221 | 222 | # check that the normalization operation actually does what it says: 223 | 224 | arctangauss = ArctanGaussian() 225 | mu, sigma = np.zeros(5), np.ones(5) 226 | 227 | for i in range(100): 228 | 229 | x = np.random.normal(loc=mu, scale=sigma) 230 | Tx = arctangauss.squash(x) 231 | x_reconstructed = arctangauss.unsquash(Tx) 232 | 233 | assert np.allclose(x, x_reconstructed) 234 | assert np.all(0 <= Tx) and np.all(Tx <= 1) 235 | 236 | for i in range(100): 237 | 238 | Tx = arctangauss.sample(mu, sigma) 239 | 240 | assert 0 < Tx and Tx < 1 241 | 242 | # assert density integrates to 1.0: 243 | 244 | beta = Beta() 245 | params = np.random.gamma(1, 1), np.random.gamma(5, 1) 246 | 247 | x = np.linspace(0, 1, 1000) 248 | y = np.exp([beta.logp(xn, params) for xn in x]) 249 | 250 | deltax = x[1:] - x[:-1] 251 | midx = 0.5*(x[1:] + x[:-1]) 252 | 253 | miny = np.min([y[1:], y[:-1]], axis=0) 254 | maxy = np.max([y[1:], y[:-1]], axis=0) 255 | 256 | maxerror = np.sum(deltax * (maxy - miny)) 257 | 258 | undersum = np.sum(deltax * miny) 259 | oversum = np.sum(deltax * maxy) 260 | 261 | assert undersum < 1 262 | assert 1 - undersum < maxerror 263 | 264 | if not np.any(np.isinf(y)): 265 | 266 | assert oversum > 1 267 | assert oversum - 1 < maxerror 268 | 269 | assert oversum > undersum 270 | assert oversum - undersum < 2*maxerror 271 | 272 | # assert empirical frequencies ~= numerical Glivenko-Cantelli integrals: 273 | 274 | samplesize = 10000 275 | sample = beta.sample(params, size=samplesize) 276 | 277 | empirical = [] 278 | numerical = [] 279 | 280 | if not np.any(np.isinf(y)): 281 | midy = 0.5*(maxy + miny) 282 | else: 283 | midy = miny 284 | 285 | for fraction in np.linspace(.05, .95, 20): 286 | 287 | emp = np.sum(sample < fraction) / samplesize 288 | num = np.sum((midx < fraction) * deltax * midy) 289 | 290 | empirical.append(emp) 291 | numerical.append(num) 292 | 293 | tol = 0.05 / min(params) 294 | 295 | assert np.allclose(empirical, numerical, atol=tol, rtol=tol) 296 | 297 | # if you want, do some plotting: 298 | 299 | plt.figure(figsize=(16, 9)) 300 | plt.title("$a = %.2f$, $b = %.2f$" % params, fontsize=24) 301 | plt.hist(sample, bins=50, normed=True) 302 | plt.plot(x, y, lw=5, alpha=0.5, color="red") 303 | plt.show() -------------------------------------------------------------------------------- /code/environments.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | from matplotlib import pyplot as plt 3 | 4 | from gym import spaces 5 | 6 | 7 | class TargetPractice(object): 8 | 9 | def __init__(self, sdim=3, udim=2, goal=None): 10 | """ A game in which the goal is to perform a specific fixed action. 
""" 11 | 12 | self.sdim = sdim 13 | self.udim = udim 14 | 15 | self.observation_space = spaces.box.Box(np.zeros(sdim), np.ones(sdim)) 16 | self.action_space = spaces.box.Box(np.zeros(udim), np.ones(udim)) 17 | 18 | self.goal = np.random.random(size=self.udim) if goal is None else goal 19 | 20 | def __repr__(self): 21 | 22 | return "TargetPractice(sdim=%s, udim=%s, goal=%r)" % (self.sdim, self.udim, self.goal) 23 | 24 | def reset(self): 25 | 26 | self.state = np.random.random(size=self.sdim) 27 | 28 | return self.state 29 | 30 | def dynamics(self, state, action): 31 | 32 | return np.random.random(size=self.sdim) 33 | 34 | def step(self, action): 35 | 36 | # the reward depends on the *old* action, not the new one: 37 | distance = np.sum((action - self.goal) ** 2) 38 | reward = -distance 39 | 40 | # after having thus computed the reward, select the next state: 41 | self.state = self.dynamics(self.state, action) 42 | 43 | return self.state, reward, False, {} # obs, reward, done, oddities 44 | 45 | def render(self): 46 | 47 | raise NotImplementedError 48 | 49 | def close(self): 50 | 51 | raise NotImplementedError 52 | 53 | 54 | class EchoGame(object): 55 | 56 | def __init__(self, sdim=3, udim=3): 57 | """ A game in which the goal is to repeat back the state as an action. """ 58 | 59 | assert sdim == udim 60 | 61 | self.dim = self.sdim = self.udim = sdim 62 | 63 | self.observation_space = spaces.box.Box(np.zeros(self.dim), np.ones(self.dim)) 64 | self.action_space = spaces.box.Box(np.zeros(self.dim), np.ones(self.dim)) 65 | 66 | def __repr__(self): 67 | 68 | return "EchoGame(sdim=%s, udim=%s)" % (self.sdim, self.udim) 69 | 70 | def reset(self): 71 | 72 | self.state = np.random.random(size=self.sdim) 73 | 74 | return self.state 75 | 76 | def dynamics(self, state, action): 77 | 78 | return np.random.random(size=self.sdim) 79 | 80 | def step(self, action): 81 | 82 | # the reward depends on the *old* action, not the new one: 83 | distance = np.sum((self.state - action) ** 2) 84 | reward = -distance 85 | 86 | # after having thus computed the reward, select the next state: 87 | self.state = self.dynamics(self.state, action) 88 | 89 | return self.state, reward, False, {} # obs, reward, done, oddities 90 | 91 | def render(self): 92 | 93 | raise NotImplementedError 94 | 95 | def close(self): 96 | 97 | raise NotImplementedError 98 | 99 | 100 | -------------------------------------------------------------------------------- /code/logger.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | from matplotlib import pyplot as plt 3 | 4 | import datetime 5 | import os 6 | 7 | 8 | def makedir(dirpath=None): 9 | """ Create a directory at results// or `dirpath`; return the path. """ 10 | 11 | # if no path is given, pick one: 12 | if dirpath is None: 13 | now = datetime.datetime.now() 14 | dirname = now.strftime("%Y_%b_%d_%Hh%M") 15 | dirpath = os.path.join("results", dirname) 16 | 17 | # if no such directory exists, create: 18 | if not os.path.exists(dirpath): 19 | os.makedirs(dirpath) 20 | 21 | return dirpath 22 | 23 | 24 | def save_args(dirpath, **kwargs): 25 | """ Save a text file documenting the values of the kwargs. """ 26 | 27 | logpath = os.path.join(dirpath, "call.txt") 28 | logfile = open(logpath, "w") 29 | 30 | for item in kwargs.items(): 31 | logfile.write("%s=%s\n" % item) 32 | 33 | logfile.close() 34 | 35 | 36 | def print_stats(rewardsums): 37 | """ Pretty-print some statistical information about the data. 
""" 38 | 39 | mean = np.mean(rewardsums) 40 | std = np.std(rewardsums) 41 | 42 | meantext = "Mean sum-of-rewards per rollout: %.3f ± %.3f." % (mean, std) 43 | 44 | print(meantext) 45 | print("‾" * len(meantext)) 46 | print() 47 | 48 | print("Percentiles of the sums-of-rewards:") 49 | print() 50 | 51 | percents = np.linspace(0, 100, 5 + 1) 52 | percentiles = [np.percentile(rewardsums, p) for p in percents] 53 | 54 | print(" | ".join(("%.0f" % p).center(9) for p in percents)) 55 | print(" | ".join(("%.5g" % p).center(9) for p in percentiles)) 56 | print() 57 | 58 | 59 | def plot_progress(samples, show=True, filename=None): 60 | """ Plot the temporal development of a list of lists of numbers. """ 61 | 62 | plt.figure(figsize=(20, 10)) 63 | 64 | numepochs = len(samples) 65 | epochs = np.arange(numepochs) 66 | 67 | means = [np.mean(sample) for sample in samples] 68 | medians = [np.median(sample) for sample in samples] 69 | 70 | for p in np.linspace(5, 50, 10): 71 | 72 | top = [np.percentile(sample, 50 + p) for sample in samples] 73 | bot = [np.percentile(sample, 50 - p) for sample in samples] 74 | 75 | plt.fill_between(epochs, bot, top, color="gold", alpha=0.1) 76 | 77 | plt.plot(epochs, medians, color="orange", alpha=0.5, lw=5) 78 | plt.plot(epochs, means, color="blue", alpha=0.5, lw=4) 79 | 80 | plt.xlim(-1, numepochs) 81 | 82 | plt.xlabel("Training epoch", fontsize=24) 83 | plt.ylabel("Sum-of-rewards per episode", fontsize=24) 84 | 85 | if filename: 86 | plt.savefig(filename) 87 | 88 | if show: 89 | plt.show() 90 | 91 | plt.close('all') 92 | -------------------------------------------------------------------------------- /code/policies.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | from matplotlib import pyplot as plt 3 | 4 | import theano 5 | from theano import tensor as tns 6 | 7 | import distributions 8 | from BlockyVector import BlockyVector 9 | 10 | 11 | class Policy(object): 12 | 13 | def __init__(self, *args, **kwargs): 14 | 15 | pass 16 | 17 | def __repr__(self): 18 | 19 | classname = self.__class__.__name__ 20 | keyvalues = ["%s=%s" % item for item in self.__dict__.items()] 21 | 22 | return "%s(%s)" % (classname, "{%s}" % ", ".join(keyvalues)) 23 | 24 | def saveas(self, filename=None): 25 | """ Save the parameters of the policy for later loading. """ 26 | 27 | np.savez(filename, **self.__dict__) 28 | 29 | def load(self, filename=None): 30 | """ Initialize a policy from saved parameters. """ 31 | 32 | for key, value in np.load(filename).items(): 33 | self.__dict__[key] = value if value.ndim > 0 else value.item() 34 | 35 | def imshow_weights(self, blocks=None, show=True, filename=None): 36 | """ imshow the parameter matrix. 
""" 37 | 38 | if blocks is None: 39 | blocks = self.weights 40 | 41 | nblocks = len(blocks) 42 | width, height = 16, 9 43 | 44 | assert nblocks > 0 45 | 46 | ncols = int(np.ceil(np.sqrt(width/height * nblocks))) 47 | nrows = int(np.ceil(nblocks / ncols)) 48 | ncols = nblocks if nrows == 1 else ncols 49 | 50 | figure = plt.figure(figsize=(width, height)) 51 | 52 | for i, block in enumerate(blocks): 53 | 54 | plt.subplot(nrows, ncols, i + 1) 55 | plt.imshow(block, interpolation='nearest', aspect='auto') 56 | plt.colorbar() 57 | 58 | plt.tight_layout() 59 | 60 | if filename is not None: 61 | plt.savefig(filename) 62 | 63 | if show: 64 | plt.show() 65 | 66 | plt.close(figure) 67 | 68 | def update(self, direction, length, verbose=False): 69 | """ Take a step in the direction, peforming certain sanity checks. """ 70 | 71 | assert direction.shape == self.weights.shape 72 | 73 | if verbose: 74 | L2 = (direction ** 2).sum() ** 0.5 75 | print("Length of direction vector: %.5g." % L2) 76 | print("Length of the step taken: %.5g." % (length * L2)) 77 | print() 78 | 79 | shape_before = self.weights.shape 80 | self.weights += length * direction 81 | 82 | assert shape_before == self.weights.shape 83 | 84 | 85 | class GaussianPolicy(Policy): 86 | 87 | def __init__(self, sdim=None, udim=None, weights=None, sigma=None, 88 | filename=None, *args, **kwargs): 89 | 90 | if filename is not None: 91 | 92 | self.load(filename) 93 | self.compile() 94 | return 95 | 96 | self.sdim = sdim 97 | self.udim = udim 98 | 99 | self.dist = distributions.Gaussian(sigma=sigma) 100 | self.weights = self.random_weights() if weights is None else BlockyVector(weights) 101 | 102 | self.compile() 103 | 104 | def LOGP(self, ACTION, PARAMS): 105 | 106 | return self.dist.LOGP(ACTION, PARAMS) 107 | 108 | def compile(self): 109 | 110 | SHIST = tns.dmatrix("STATE HISTORY") 111 | UHIST = tns.dmatrix("ACTION HISTORY") 112 | 113 | WEIGHTS = [tns.dmatrix("WEIGHT_%s" % i) for i in range(len(self.weights))] 114 | PARAMS = self.PARAMS(SHIST, UHIST, WEIGHTS) 115 | 116 | ACTION = tns.dvector("ACTION") 117 | SAMPLE = self.UNWRAP(ACTION) 118 | 119 | LOGP = self.LOGP(SAMPLE, PARAMS) 120 | SCORE = theano.grad(LOGP, wrt=WEIGHTS) 121 | 122 | print("Compiling policy functions . . 
.") 123 | self.paramlist = theano.function(inputs=[SHIST, UHIST] + WEIGHTS, outputs=PARAMS, on_unused_input='ignore') 124 | self.logp = theano.function(inputs=[ACTION, SHIST, UHIST] + WEIGHTS, outputs=LOGP, on_unused_input='ignore') 125 | self.dlogp = theano.function(inputs=[ACTION, SHIST, UHIST] + WEIGHTS, outputs=SCORE, on_unused_input='ignore') 126 | print("Done.\n") 127 | 128 | def params(self, shist, uhist, weights=None): 129 | 130 | if weights is None: 131 | weights = self.weights 132 | 133 | shist = np.atleast_2d(shist) # [] ==> array([[]]) 134 | uhist = np.atleast_2d(uhist) # [] ==> array([[]]) 135 | 136 | return self.paramlist(shist, uhist, *weights) 137 | 138 | def sample(self, shist, uhist, weights=None): 139 | 140 | params = self.params(shist, uhist, weights) 141 | 142 | assert len(params) == self.dist.nparams 143 | 144 | for param in params: 145 | assert len(param) == self.udim 146 | 147 | return self.dist.sample(params) 148 | 149 | def score(self, action, shist=[], uhist=[], weights=None): 150 | 151 | if weights is None: 152 | weights = self.weights 153 | 154 | shist = np.atleast_2d(shist) 155 | uhist = np.atleast_2d(uhist) 156 | 157 | return BlockyVector(self.dlogp(action, shist, uhist, *weights)) 158 | 159 | 160 | 161 | class BoxPolicy(Policy): 162 | 163 | def __init__(self, sdim=None, udim=None, low=None, high=None, 164 | environment=None, weights=None, dist=None, 165 | filename=None, *args, **kwargs): 166 | 167 | if filename is not None: 168 | self.load(filename) 169 | self.compile() 170 | return 171 | 172 | if environment is None: 173 | 174 | self.sdim = sdim 175 | self.udim = udim 176 | 177 | self.low = low 178 | self.high = high 179 | 180 | else: 181 | 182 | self.sdim = environment.observation_space.shape[0] 183 | self.udim = environment.action_space.shape[0] 184 | 185 | self.low = environment.action_space.low 186 | self.high = environment.action_space.high 187 | 188 | self.dist = distributions.ArctanGaussian() if dist is None else dist 189 | self.weights = self.random_weights() if weights is None else BlockyVector(weights) 190 | 191 | self.compile() 192 | 193 | def wrap(self, action): 194 | """ Convert an unnormalized action into a box element. """ 195 | 196 | return self.low + (self.high - self.low)*action 197 | 198 | def unwrap(self, box_elm): 199 | """ Convert a box element into a unit cube element. """ 200 | 201 | return (box_elm - self.low) / (self.high - self.low) 202 | 203 | def WRAP(self, action): 204 | """ Convert an unnormalized action into a box element. """ 205 | 206 | return self.low + (self.high - self.low)*action 207 | 208 | def UNWRAP(self, box_elm): 209 | """ Convert a box element into a unit cube element. """ 210 | 211 | return (box_elm - self.low) / (self.high - self.low) 212 | 213 | def LOGP(self, ACTION, PARAMS): 214 | 215 | return self.dist.LOGP(ACTION, PARAMS) 216 | 217 | def compile(self): 218 | 219 | SHIST = tns.dmatrix("STATE HISTORY") 220 | UHIST = tns.dmatrix("ACTION HISTORY") 221 | 222 | WEIGHTS = [tns.dmatrix("WEIGHT_%s" % i) for i in range(len(self.weights))] 223 | PARAMS = self.PARAMS(SHIST, UHIST, WEIGHTS) 224 | 225 | ACTION = tns.dvector("ACTION") 226 | SAMPLE = self.UNWRAP(ACTION) 227 | 228 | LOGP = self.LOGP(SAMPLE, PARAMS) 229 | SCORE = theano.grad(LOGP, wrt=WEIGHTS) 230 | 231 | print("Compiling policy functions . . 
.") 232 | self.paramlist = theano.function(inputs=[SHIST, UHIST] + WEIGHTS, outputs=PARAMS, on_unused_input='ignore') 233 | self.logp = theano.function(inputs=[ACTION, SHIST, UHIST] + WEIGHTS, outputs=LOGP, on_unused_input='ignore') 234 | self.dlogp = theano.function(inputs=[ACTION, SHIST, UHIST] + WEIGHTS, outputs=SCORE, on_unused_input='ignore') 235 | print("Done.\n") 236 | 237 | def params(self, shist, uhist, weights=None): 238 | 239 | if weights is None: 240 | weights = self.weights 241 | 242 | shist = np.atleast_2d(shist) 243 | uhist = np.atleast_2d(uhist) 244 | 245 | return self.paramlist(shist, uhist, *weights) 246 | 247 | def sample(self, shist, uhist, weights=None): 248 | 249 | params = self.params(shist, uhist, weights) 250 | 251 | assert len(params) == self.dist.nparams 252 | 253 | for param in params: 254 | assert len(param) == self.udim 255 | 256 | unitboxed = self.dist.sample(params) 257 | actionboxed = self.wrap(unitboxed) 258 | 259 | assert not np.any(np.isnan(unitboxed)) 260 | 261 | assert np.all(np.zeros_like(unitboxed) <= unitboxed) 262 | assert np.all(unitboxed <= np.ones_like(unitboxed)) 263 | 264 | assert np.all(self.low <= actionboxed) 265 | assert np.all(actionboxed <= self.high) 266 | 267 | return actionboxed 268 | 269 | def score(self, action, shist=[], uhist=[], weights=None): 270 | 271 | if weights is None: 272 | weights = self.weights 273 | 274 | shist = np.atleast_2d(shist) 275 | uhist = np.atleast_2d(uhist) 276 | 277 | return BlockyVector(self.dlogp(action, shist, uhist, *weights)) 278 | 279 | 280 | class PolynomialPolicy(BoxPolicy): 281 | 282 | def __init__(self, degree=3, *args, **kwargs): 283 | 284 | self.degree = degree 285 | 286 | super().__init__(*args, **{key: val for key, val in kwargs.items() if key != 'degree'}) 287 | 288 | def random_weights(self): 289 | 290 | inputdim = 1 + (self.sdim * self.degree) # [1] + concatenated powers 291 | outputdim = self.udim # each parameter has the same dim as the action 292 | 293 | pick_matrix = lambda: np.random.normal(size=(outputdim, inputdim)) 294 | matrix_list = [pick_matrix() for _ in range(self.dist.nparams)] 295 | 296 | return BlockyVector(matrix_list) 297 | 298 | def PARAMS(self, SHIST, UHIST, WEIGHTS): 299 | 300 | STATE = SHIST[-1, :] 301 | 302 | POWERS = [STATE ** (n + 1) for n in range(self.degree)] 303 | INPUTS = tns.concatenate([[1]] + POWERS) 304 | 305 | return [tns.dot(WEIGHT, INPUTS) for WEIGHT in WEIGHTS] 306 | 307 | 308 | class FeedForwardPolicy(BoxPolicy): 309 | 310 | def __init__(self, hidden=[], degree=None, *args, **kwargs): 311 | 312 | self.hidden = hidden 313 | self.degree = 1 if degree is None else degree 314 | 315 | super().__init__(*args, **{key: val for key, val in kwargs.items() 316 | if key not in ['hidden', 'degree']}) 317 | 318 | def random_weights(self): 319 | 320 | smemory = 2 321 | umemory = 2 322 | 323 | assert self.dist.nparams == 1 324 | 325 | firstsize = (smemory*self.sdim + umemory*self.udim)*self.degree 326 | lastsize = self.udim # note that we only allow a single parameter 327 | 328 | indims = [firstsize] + [self.degree*w for w in self.hidden] 329 | outdims = self.hidden + [lastsize] 330 | 331 | weights = [np.random.normal(size=(outdim, indim)) 332 | for indim, outdim in zip(indims, outdims)] 333 | 334 | return BlockyVector(weights) 335 | 336 | def PARAMS(self, SHIST, UHIST, WEIGHTS): 337 | 338 | smemory = 2 339 | umemory = 2 340 | 341 | SLIST = [SHIST[-(t + 1), :] for t in range(smemory)] 342 | ULIST = [UHIST[-(t + 1), :] for t in range(umemory)] 343 | 344 | INPUT = 
tns.concatenate(SLIST + ULIST) 345 | 346 | X = [INPUT] 347 | 348 | for WEIGHT_D in WEIGHTS[:-1]: 349 | LAYER = tns.concatenate([X[-1] ** n for n in range(self.degree)]) 350 | LINEAR = tns.dot(WEIGHT_D, LAYER) 351 | X.append(tns.tanh(LINEAR)) 352 | 353 | LAYER = tns.concatenate([X[-1] ** n for n in range(self.degree)]) 354 | LINEAR = tns.dot(WEIGHTS[-1], LAYER) 355 | X.append(LINEAR) # no squashing 356 | 357 | return [X[-1]] # list containing only one parameter vector 358 | 359 | 360 | if __name__ == '__main__': 361 | 362 | pass -------------------------------------------------------------------------------- /code/regressors.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | 3 | 4 | class Regressor(object): 5 | 6 | def __init__(self, *args, **kwargs): 7 | """ Initialize a function approximator. """ 8 | 9 | self.params = 0 10 | 11 | def __repr__(self): 12 | 13 | return "Regressor()" 14 | 15 | def design(self, states): 16 | """ Convert an array of states into an input array of the right shape. """ 17 | 18 | return states 19 | 20 | def predict(self, states): 21 | """ Predict an array of values based on an array of states. """ 22 | 23 | raise NotImplementedError 24 | 25 | def error(self, states, values): 26 | 27 | return np.mean((self.predict(states) - values) ** 2) 28 | 29 | def MLE(self, states, values): 30 | """ Compute the maximum-likelihood parameter settings given the data. """ 31 | 32 | raise NotImplementedError 33 | 34 | def fit(self, states, values, caution=0.01, verbose=False): 35 | """ Re-estimate the parameters of the Regressor to fit empirical data. """ 36 | 37 | if verbose: 38 | error = self.error(states, values) 39 | print("Fitting baseline (prior error: %.3f) . . ." % error) 40 | 41 | solution = self.MLE(states, values) 42 | self.params = caution*self.params + (1 - caution)*solution 43 | 44 | if verbose: 45 | error = self.error(states, values) 46 | print("Done (posterior error: %.3f).\n" % error) 47 | 48 | 49 | class PolynomialRegressor(Regressor): 50 | 51 | def __init__(self, sdim, degree=3): 52 | 53 | self.sdim = sdim 54 | self.degree = degree 55 | self.params = np.zeros(1 + self.degree*self.sdim) 56 | 57 | def __repr__(self): 58 | 59 | return "<%s: degree=%s>" % (self.__class__.__name__, self.degree) 60 | 61 | def design(self, states): 62 | """ For the design matrix (matrix of input vectors) from the states. 
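
        Each row of the returned matrix is the constant 1 followed by the
        element-wise powers of the corresponding state, up to self.degree.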
""" 63 | 64 | states = np.array(states) 65 | samples, sdim = states.shape 66 | 67 | assert self.sdim == sdim 68 | 69 | ones = np.ones((samples, 1)) 70 | 71 | if self.degree < 1: 72 | return ones 73 | 74 | statepowers = [states ** (n + 1) for n in range(self.degree)] 75 | powermatrix = np.concatenate(statepowers, axis=1) 76 | 77 | return np.concatenate([ones, powermatrix], axis=1) 78 | 79 | def predict(self, states): 80 | 81 | return self.design(states).dot(self.params) 82 | 83 | def MLE(self, states, values): 84 | 85 | inputs = self.design(states) 86 | solution, residuals, rank, sngrts = np.linalg.lstsq(inputs, values) 87 | 88 | return solution 89 | 90 | 91 | class PolynomialTemporalRegressor(Regressor): 92 | 93 | def __init__(self, sdim, degree=2, timedegree=1): 94 | 95 | self.sdim = sdim 96 | self.degree = degree 97 | self.timedegree = timedegree 98 | self.params = np.zeros(1 + self.degree*self.sdim + self.timedegree) 99 | 100 | def __repr__(self): 101 | 102 | return "<%s: degree=%s>" % (self.__class__.__name__, self.degree) 103 | 104 | def design(self, states): 105 | """ For the design matrix (matrix of input vectors) from the states. """ 106 | 107 | states = np.array(states) 108 | samples, sdim = states.shape 109 | 110 | assert self.sdim == sdim 111 | 112 | ones = [np.ones((samples, 1))] 113 | 114 | timecolumn = np.arange(samples).reshape((samples, 1)) 115 | timepowers = [timecolumn ** (n + 1) for n in range(self.timedegree)] 116 | 117 | statepowers = [states ** (n + 1) for n in range(self.degree)] 118 | 119 | if self.degree == 0 and self.timedegree == 0: 120 | return np.concatenate(ones, axis=1) 121 | 122 | elif self.degree == 0 and self.timedegree > 0: 123 | return np.concatenate(ones + timepowers, axis=1) 124 | 125 | elif self.degree > 0 and self.timedegree == 0: 126 | return np.concatenate(ones + statepowers, axis=1) 127 | 128 | else: 129 | return np.concatenate(ones + statepowers + timepowers, axis=1) 130 | 131 | def predict(self, states): 132 | 133 | return self.design(states).dot(self.params) 134 | 135 | def MLE(self, states, values): 136 | 137 | inputs = self.design(states) 138 | solution, residuals, rank, sngrts = np.linalg.lstsq(inputs, values) 139 | 140 | return solution 141 | 142 | 143 | if __name__ == '__main__': 144 | 145 | sdim = 3 146 | samples = 100 147 | maxdegree = 3 148 | maxtimedegree = 3 149 | 150 | A = np.random.normal(size=sdim) 151 | B = np.random.normal() 152 | 153 | states = np.random.normal(size=(samples, sdim)) 154 | 155 | linear = states.dot(A.T) + B 156 | nonlinear = np.sin(states.dot(A.T)) + np.exp(states).dot(A.T) 157 | 158 | for values, datatype in zip([linear, nonlinear], ['linear', 'nonlinear']): 159 | 160 | print((" Fitting to %s data: " % datatype).center(46, "=")) 161 | print("") 162 | print("") 163 | 164 | for d in range(maxdegree): 165 | 166 | print("Polynomial function approximator of degree %s):" % d) 167 | print("‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾") 168 | 169 | approximator = PolynomialRegressor(sdim, degree=d) 170 | approximator.fit(states, values, verbose=True) 171 | 172 | for t in range(maxtimedegree): 173 | 174 | print("Temporal-polynomial function approximator of degree (%s, %s):" % (d, t)) 175 | print("‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾") 176 | 177 | approximator = PolynomialTemporalRegressor(sdim, degree=d, timedegree=t) 178 | approximator.fit(states, values, verbose=True) 179 | 180 | print() 181 | 182 | print() 183 | -------------------------------------------------------------------------------- 
/pdfs/A_Few_Observations_About_Policy_Gradient_Approximations.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mathias-madsen/reinforce_tutorial/3ead4b6f4fd741ecc86a40f04269c2f9210ee62c/pdfs/A_Few_Observations_About_Policy_Gradient_Approximations.pdf -------------------------------------------------------------------------------- /pdfs/A_Minimal_Working_Example_of_Empirical_Gradient_Ascent.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mathias-madsen/reinforce_tutorial/3ead4b6f4fd741ecc86a40f04269c2f9210ee62c/pdfs/A_Minimal_Working_Example_of_Empirical_Gradient_Ascent.pdf -------------------------------------------------------------------------------- /pdfs/Is_Randomization_Necessary.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mathias-madsen/reinforce_tutorial/3ead4b6f4fd741ecc86a40f04269c2f9210ee62c/pdfs/Is_Randomization_Necessary.pdf -------------------------------------------------------------------------------- /pdfs/Natural Gradients, Mahalanobis Distances, and Distances between Distributions.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mathias-madsen/reinforce_tutorial/3ead4b6f4fd741ecc86a40f04269c2f9210ee62c/pdfs/Natural Gradients, Mahalanobis Distances, and Distances between Distributions.pdf -------------------------------------------------------------------------------- /pdfs/Policy_Exploration_in_a_Cold_Universe.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mathias-madsen/reinforce_tutorial/3ead4b6f4fd741ecc86a40f04269c2f9210ee62c/pdfs/Policy_Exploration_in_a_Cold_Universe.pdf -------------------------------------------------------------------------------- /pdfs/Policy_Exploration_without_Back-Looking_Terms.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mathias-madsen/reinforce_tutorial/3ead4b6f4fd741ecc86a40f04269c2f9210ee62c/pdfs/Policy_Exploration_without_Back-Looking_Terms.pdf -------------------------------------------------------------------------------- /pdfs/slides.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mathias-madsen/reinforce_tutorial/3ead4b6f4fd741ecc86a40f04269c2f9210ee62c/pdfs/slides.pdf --------------------------------------------------------------------------------
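As a standalone numerical companion to the code and notes above, the sketch below (plain NumPy, independent of the rest of the repository; all constants are illustrative) uses the score-function identity behind REINFORCE to estimate the derivative of an expected reward with respect to the mean of a Gaussian sampling distribution, subtracts a baseline to reduce variance, and follows the estimate uphill.

```python
import numpy as np

def reward(x):
    # an arbitrary illustrative reward function, maximized at x = 2
    return -(x - 2.0) ** 2

mu, sigma = 0.0, 1.0            # sample x ~ Normal(mu, sigma); only mu is learned
learning_rate, nsamples = 0.05, 500

for step in range(200):
    x = np.random.normal(mu, sigma, size=nsamples)
    score = (x - mu) / sigma ** 2                   # d/dmu of log N(x | mu, sigma)
    baseline = reward(x).mean()                     # variance-reducing baseline
    gradient = np.mean((reward(x) - baseline) * score)
    mu += learning_rate * gradient                  # empirical gradient ascent

print("final mu: %.3f (should be close to 2)" % mu)
```

After a couple of hundred steps the mean should settle near the maximizer of the reward, which is the same mechanism the agent uses, with the fitted regressor playing the role of the baseline.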