├── .gitignore ├── README.md ├── climbing.py ├── encoding.py ├── learn.py ├── policy.py ├── rewards.py ├── search.png ├── search.py └── transitions.py /.gitignore: -------------------------------------------------------------------------------- 1 | *.pyc 2 | simulate_search.py 3 | *.html 4 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | title: Reinforcement Learning in Python 2 | author: 3 | name: Nathan Epstein 4 | twitter: epstein_n 5 | url: http://nepste.in 6 | email: _@nepste.in 7 | 8 | -- 9 | 10 | # Reinforcement Learning in Python 11 | 12 | -- 13 | 14 | ### Roadmap 15 | 16 | - Concepts 17 | - Using Data 18 | - Python Implementation 19 | - Sample Application 20 | 21 | -- 22 | 23 | # Concepts 24 | 25 | -- 26 | 27 | ### Reinforcement Learning 28 | 29 | Reinforcement learning is a type of machine learning in which software agents are trained to take actions in a given environment to maximize a cumulative reward. 30 | 31 | -- 32 | 33 | ### Markov Decision Process - Components 34 | 35 | A Markov Decision Process is a mathematical formalism that we will use to implement reinforcement learning. The relevant components of this formalism are 36 | the __state space__, __action space__, __transition probabilities__, and __rewards__. 37 | 38 | -- 39 | 40 | ### State Space 41 | 42 | The exhaustive set of possible states that a process can be in. Generally known a priori. 43 | 44 | -- 45 | 46 | ### Action Space 47 | 48 | The exhaustive set of possible actions that can be taken (to influence the likelihood of transitioning between states). Generally known a priori. 49 | 50 | -- 51 | 52 | ### Transition Probabilities 53 | 54 | The probabilities of transitioning between the various states given the actions taken (specifically, a tensor P such that P_ijk = the probability of going from state i to state k given action j). Generally not known a priori. 55 | 56 | -- 57 | 58 | ### Rewards 59 | 60 | The rewards associated with occupying each state. Generally not known a priori. 61 | 62 | -- 63 | 64 | ### Markov Decision Process - Objectives 65 | 66 | We are interested in understanding these elements in order to develop a __policy__: the set of actions we will take in each state. 67 | 68 | Our goal is to determine a policy which produces the greatest possible cumulative reward. 69 | 70 | -- 71 | 72 | # Using Data 73 | 74 | -- 75 | 76 | ### Objectives 77 | 78 | Our goal is to learn the transition probabilities and the rewards (and to build a policy based on these estimates). We will estimate these values using observed data. 79 | 80 | -- 81 | 82 | ### Estimating Rewards 83 | 84 | R(s) = (total reward we got in state s) / (#times we visited state s) 85 | 86 | -- 87 | 88 | ### Estimating Transition Probability 89 | 90 | P(s, a, s') = (#times we took action a in state s and went to s') / (#times we took action a in state s) 91 | 92 | -- 93 | 94 | ### Value Iteration 95 | 96 | Value Iteration is an algorithm by which we can use the estimated rewards and transition probabilities to determine an optimal policy. 
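For intuition, here is a minimal NumPy sketch of value iteration on a hypothetical two-state, two-action MDP (the transition probabilities and rewards below are made up for illustration; the presentation's actual implementation appears in the Policy Parser slides):

```python
import numpy as np

# hypothetical MDP: P[s][a][s'] = probability of moving from state s to s' under action a
P = np.array([
    [[0.9, 0.1], [0.2, 0.8]],  # transitions out of state 0
    [[0.0, 1.0], [0.5, 0.5]],  # transitions out of state 1
])
R = np.array([0.0, 1.0])       # reward for occupying each state
GAMMA = 0.9                    # discount factor

V = np.zeros(2)
for _ in range(50):
    # V(s) = R(s) + GAMMA * max_a SUM_s' P(s, a, s') * V(s')
    V = R + GAMMA * np.max(P.dot(V), axis=1)

best_action = np.argmax(P.dot(V), axis=1)  # greedy policy: best action in each state
```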
97 | 98 | -- 99 | 100 | ### Value Iteration - algorithm 101 | 102 | 1) For each state s, initialize V(s) = 0 103 | 104 | 2) Repeat until convergence: 105 | 106 | For each state, update V(s) = R(s) + γ * max_a∈A SUM_s'∈S(P(s, a, s') * V(s')), where γ ∈ (0, 1) is a discount factor (the implementation below uses γ = 0.9) 107 | 108 | 3) The policy in state s is the a ∈ A which maximizes SUM_s'∈S(P(s, a, s') * V(s')) 109 | 110 | -- 111 | 112 | # Python Implementation 113 | 114 | -- 115 | 116 | ### Following Along 117 | 118 | If you want to look at the code examples on your own, you can find all code for this presentation at https://github.com/NathanEpstein/pydata-reinforce 119 | 120 | -- 121 | 122 | ### Reward Parser 123 | 124 | ```python 125 | 126 | import numpy as np 127 | 128 | class RewardParser: 129 | def __init__(self, observations, dimensions): 130 | self.observations = observations 131 | self.state_count = dimensions['state_count'] 132 | 133 | def rewards(self): 134 | total_state_rewards = np.zeros(self.state_count) 135 | total_state_visits = np.zeros(self.state_count) 136 | 137 | for observation in self.observations: 138 | visits = float(len(observation['state_transitions'])) 139 | reward_per_visit = observation['reward'] / visits 140 | 141 | for state_transition in observation['state_transitions']: 142 | state = state_transition['state'] 143 | total_state_rewards[state] += reward_per_visit 144 | total_state_visits[state] += 1 145 | 146 | average_state_rewards = total_state_rewards / total_state_visits 147 | average_state_rewards = np.nan_to_num(average_state_rewards) 148 | 149 | return average_state_rewards 150 | 151 | ``` 152 | 153 | -- 154 | 155 | ### Transition Parser (part 1) 156 | 157 | ```python 158 | 159 | import numpy as np 160 | 161 | class TransitionParser: 162 | def __init__(self, observations, dimensions): 163 | self.observations = observations 164 | self.state_count = dimensions['state_count'] 165 | self.action_count = dimensions['action_count'] 166 | 167 | def transition_probabilities(self): 168 | transition_count = self._count_transitions() 169 | return self._parse_probabilities(transition_count) 170 | 171 | ``` 172 | 173 | -- 174 | 175 | ### Transition Parser (part 2) 176 | 177 | ```python 178 | 179 | def _count_transitions(self): 180 | transition_count = np.zeros((self.state_count, self.action_count, self.state_count)) 181 | 182 | for observation in self.observations: 183 | for state_transition in observation['state_transitions']: 184 | state = state_transition['state'] 185 | action = state_transition['action'] 186 | state_ = state_transition['state_'] 187 | 188 | transition_count[state][action][state_] += 1 189 | 190 | return transition_count 191 | 192 | def _parse_probabilities(self, transition_count): 193 | P = np.zeros((self.state_count, self.action_count, self.state_count)) 194 | 195 | for state in range(0, self.state_count): 196 | for action in range(0, self.action_count): 197 | 198 | total_transitions = float(sum(transition_count[state][action])) 199 | 200 | if (total_transitions > 0): 201 | P[state][action] = transition_count[state][action] / total_transitions 202 | else: 203 | P[state][action] = 1.0 / self.state_count 204 | 205 | return P 206 | 207 | ``` 208 | 209 | -- 210 | 211 | ### Policy Parser 212 | 213 | ```python 214 | 215 | import numpy as np 216 | 217 | class PolicyParser: 218 | def __init__(self, dimensions): 219 | self.state_count = dimensions['state_count'] 220 | self.action_count = dimensions['action_count'] 221 | 222 | def policy(self, P, rewards): 223 | best_policy = np.zeros(self.state_count) 224 | state_values = np.zeros(self.state_count) 225 | 226 | 
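# value-iteration hyperparameters: GAMMA is the discount factor γ in
# V(s) = R(s) + γ * max_a SUM_s' P(s, a, s') * V(s'); ITERATIONS is a fixed
# number of sweeps over the state space, used here in place of an explicit convergence check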
GAMMA, ITERATIONS = 0.9, 50 227 | for i in range(ITERATIONS): 228 | for state in range(0, self.state_count): 229 | state_value = -float('Inf') 230 | for action in range(0, self.action_count): 231 | action_value = 0 232 | for state_ in range(0, self.state_count): 233 | action_value += (P[state][action][state_] * state_values[state_] * GAMMA) 234 | if (action_value >= state_value): 235 | state_value = action_value 236 | best_policy[state] = action 237 | state_values[state] = rewards[state] + state_value 238 | return best_policy 239 | 240 | ``` 241 | 242 | -- 243 | 244 | ### Putting It Together (Markov Agent) 245 | 246 | ```python 247 | 248 | from rewards import RewardParser 249 | from transitions import TransitionParser 250 | from policy import PolicyParser 251 | 252 | class MarkovAgent: 253 | def __init__(self, observations, dimensions): 254 | # create reward, transition, and policy parsers 255 | self.reward_parser = RewardParser(observations, dimensions) 256 | self.transition_parser = TransitionParser(observations, dimensions) 257 | self.policy_parser = PolicyParser(dimensions) 258 | 259 | def learn(self): 260 | R = self.reward_parser.rewards() 261 | P = self.transition_parser.transition_probabilities() 262 | 263 | self.policy = self.policy_parser.policy(P, R) 264 | ``` 265 | -- 266 | 267 | # Sample Application 268 | 269 | 270 | ##"I believe the robots are our future, teach them well and let them lead the way." 271 | 272 | -- 273 | ### Climbing 274 | 275 | - 5 states: bottom, low, middle, high, top. 276 | - No leaving bottom and top. 277 | - We get a reward at the top, nothing at the bottom. 278 | 279 | -- 280 | 281 | ### Data - observations 282 | 283 | ```python 284 | 285 | observations = [ 286 | { 'state_transitions': [ 287 | { 'state': 'low', 'action': 'climb', 'state_': 'mid' }, 288 | { 'state': 'mid', 'action': 'climb', 'state_': 'high' }, 289 | { 'state': 'high', 'action': 'sink', 'state_': 'mid' }, 290 | { 'state': 'mid', 'action': 'sink', 'state_': 'low' }, 291 | { 'state': 'low', 'action': 'sink', 'state_': 'bottom' } 292 | ], 293 | 'reward': 0 294 | }, 295 | { 'state_transitions': [ 296 | { 'state': 'low', 'action': 'climb', 'state_': 'mid' }, 297 | { 'state': 'mid', 'action': 'climb', 'state_': 'high' }, 298 | { 'state': 'high', 'action': 'climb', 'state_': 'top' }, 299 | ], 300 | 'reward': 0 301 | } 302 | ] 303 | 304 | ``` 305 | 306 | -- 307 | 308 | ### Data - trap states 309 | 310 | ```python 311 | 312 | trap_states = [ 313 | { 314 | 'state_transitions': [ 315 | { 'state': 'bottom', 'action': 'sink', 'state_': 'bottom' }, 316 | { 'state': 'bottom', 'action': 'climb', 'state_': 'bottom' } 317 | ], 318 | 'reward': 0 319 | }, 320 | { 321 | 'state_transitions': [ 322 | { 'state': 'top', 'action': 'sink', 'state_': 'top' }, 323 | { 'state': 'top', 'action': 'climb', 'state_': 'top' }, 324 | ], 325 | 'reward': 1 326 | }, 327 | ] 328 | 329 | ``` 330 | 331 | -- 332 | 333 | ### Training 334 | 335 | ```python 336 | from learn import MarkovAgent 337 | mark = MarkovAgent(observations + trap_states) 338 | mark.learn() 339 | 340 | print(mark.policy) 341 | # {'high': 'climb', 'top': 'sink', 'bottom': 'sink', 'low': 'climb', 'mid': 'climb'} 342 | # NOTE: policy in top and bottom states is chosen randomly (doesn't affect state) 343 | 344 | ``` 345 | 346 | -- 347 | 348 | ### Search 349 | 350 | - Given an array of sorted numbers, find a target value as quickly as possible. 351 | 352 | - Our random numbers will be distributed exponentially. 
Can we use this information to do better than binary search? 353 | 354 | -- 355 | 356 | ### Approach 357 | 358 | - Create many example searches (random, linear, and binary). 359 | - The reward for each search is -1 * (#steps taken to find the target). 360 | - Append trap states with positive rewards to these observations. 361 | 362 | -- 363 | 364 | ### "Feature Selection" 365 | 366 | Our model can only learn what we show it (i.e. what's encoded in the state). Our state will include: 367 | 368 | - current location 369 | - whether the current location is above or below the target 370 | - known index range 371 | - whether our target is above or below the distribution mean 372 | 373 | Example state: "12:up:0:50:True" 374 | 375 | -- 376 | 377 | ### Training 378 | 379 | ```python 380 | from learn import MarkovAgent 381 | from search import * 382 | import numpy as np 383 | 384 | simulator = SearchSimulation() 385 | observations = simulator.observations(50000, 15) 386 | mark = MarkovAgent(observations) 387 | mark.learn() 388 | 389 | 390 | class AISearch(Search): 391 | def update_location(self): 392 | self.location = mark.policy[self.state()] 393 | 394 | ``` 395 | 396 | -- 397 | 398 | ### Comparison 399 | 400 | ```python 401 | binary_results = [] 402 | linear_results = [] 403 | random_results = [] 404 | ai_results = [] 405 | 406 | for i in range(10000): 407 | # create array and target value 408 | array = simulator._random_sorted_array(15) 409 | target = random.choice(array) 410 | 411 | # generate an observation for a search of each type 412 | binary = simulator.observation(15, BinarySearch(array, target)) 413 | linear = simulator.observation(15, LinearSearch(array, target)) 414 | rando = simulator.observation(15, RandomSearch(array, target)) 415 | ai = simulator.observation(15, AISearch(array, target)) 416 | 417 | # append result 418 | binary_results.append(len(binary['state_transitions'])) 419 | linear_results.append(len(linear['state_transitions'])) 420 | random_results.append(len(rando['state_transitions'])) 421 | ai_results.append(len(ai['state_transitions'])) 422 | 423 | # display average results 424 | print("Average binary search length: {0}".format(np.mean(binary_results))) # 3.6469 425 | print("Average linear search length: {0}".format(np.mean(linear_results))) # 5.5242 426 | print("Average random search length: {0}".format(np.mean(random_results))) # 14.2132 427 | print("Average AI search length: {0}".format(np.mean(ai_results))) # 3.1095 428 | 429 | ``` 430 | 431 | -- 432 | 433 | ### Results 434 | 435 | ![search length comparison](search.png) 436 | 437 | 438 | 439 | 440 | 441 | 442 | 443 | 444 | -------------------------------------------------------------------------------- /climbing.py: -------------------------------------------------------------------------------- 1 | from learn import MarkovAgent 2 | 3 | observations = [ 4 | { 'state_transitions': [ 5 | { 'state': 'low', 'action': 'climb', 'state_': 'mid' }, 6 | { 'state': 'mid', 'action': 'climb', 'state_': 'high' }, 7 | { 'state': 'high', 'action': 'sink', 'state_': 'mid' }, 8 | { 'state': 'mid', 'action': 'sink', 'state_': 'low' }, 9 | { 'state': 'low', 'action': 'sink', 'state_': 'bottom' } 10 | ], 11 | 'reward': 0 12 | }, 13 | { 'state_transitions': [ 14 | { 'state': 'low', 'action': 'climb', 'state_': 'mid' }, 15 | { 'state': 'mid', 'action': 'climb', 'state_': 'high' }, 16 | { 'state': 'high', 'action': 'climb', 'state_': 'top' }, 17 | ], 18 | 'reward': 0 19 | } 20 | ] 21 | 22 | trap_states = [ 23 | { 24 | 'state_transitions': [ 25 | { 'state': 'bottom', 'action': 'sink', 'state_': 'bottom' }, 26 | { 'state': 
'bottom', 'action': 'climb', 'state_': 'bottom' } 27 | ], 28 | 'reward': 0 29 | }, 30 | { 31 | 'state_transitions': [ 32 | { 'state': 'top', 'action': 'sink', 'state_': 'top' }, 33 | { 'state': 'top', 'action': 'climb', 'state_': 'top' }, 34 | ], 35 | 'reward': 1 36 | }, 37 | ] 38 | 39 | observations += trap_states 40 | 41 | mark = MarkovAgent(observations) 42 | mark.learn() 43 | 44 | # mark correctly learns that the optimal strategy is to always go up 45 | print(mark.policy) 46 | -------------------------------------------------------------------------------- /encoding.py: -------------------------------------------------------------------------------- 1 | class StateActionEncoder: 2 | def __init__(self, observations): 3 | self.observations = observations 4 | self._parse_states_and_actions() 5 | 6 | def parse_dimensions(self): 7 | return { 8 | 'state_count': len(self.int_to_state), 9 | 'action_count': len(self.int_to_action) 10 | } 11 | 12 | def observations_to_int(self): 13 | for observation in self.observations: 14 | for transition in observation['state_transitions']: 15 | transition['state'] = self.state_to_int[transition['state']] 16 | transition['state_'] = self.state_to_int[transition['state_']] 17 | transition['action'] = self.action_to_int[transition['action']] 18 | 19 | def parse_encoded_policy(self, encoded_policy): 20 | policy = {} 21 | for index, encoded_action in enumerate(encoded_policy): 22 | state = self.int_to_state[index] 23 | action = self.int_to_action[int(encoded_action)] 24 | policy[state] = action 25 | 26 | return policy 27 | 28 | def _parse_states_and_actions(self): 29 | state_dict, action_dict = {}, {} 30 | state_array, action_array = [], [] 31 | state_index, action_index = 0, 0 32 | 33 | for observation in self.observations: 34 | for transition in observation['state_transitions']: 35 | state = transition['state'] 36 | action = transition['action'] 37 | 38 | if state not in state_dict.keys(): 39 | state_dict[state] = state_index 40 | state_array.append(state) 41 | state_index += 1 42 | 43 | if action not in action_dict.keys(): 44 | action_dict[action] = action_index 45 | action_array.append(action) 46 | action_index += 1 47 | 48 | self.state_to_int = state_dict 49 | self.action_to_int = action_dict 50 | self.int_to_state = state_array 51 | self.int_to_action = action_array 52 | 53 | -------------------------------------------------------------------------------- /learn.py: -------------------------------------------------------------------------------- 1 | from encoding import StateActionEncoder 2 | from rewards import RewardParser 3 | from transitions import TransitionParser 4 | from policy import PolicyParser 5 | 6 | class MarkovAgent: 7 | def __init__(self, observations): 8 | # encode observation data as int values 9 | self.state_action_encoder = StateActionEncoder(observations) 10 | self.state_action_encoder.observations_to_int() 11 | dimensions = self.state_action_encoder.parse_dimensions() 12 | 13 | # create reward, transition, and policy parsers 14 | self.reward_parser = RewardParser(observations, dimensions) 15 | self.transition_parser = TransitionParser(observations, dimensions) 16 | self.policy_parser = PolicyParser(dimensions) 17 | 18 | def learn(self): 19 | R = self.reward_parser.rewards() 20 | P = self.transition_parser.transition_probabilities() 21 | 22 | # learn int-encoded policy and convert to readable dictionary 23 | encoded_policy = self.policy_parser.policy(P, R) 24 | self.policy = self.state_action_encoder.parse_encoded_policy(encoded_policy) 
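# self.policy is a plain dict mapping each observed state to its learned action,
# e.g. {'low': 'climb', 'mid': 'climb', ...} for the climbing example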
25 | 26 | -------------------------------------------------------------------------------- /policy.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | 3 | class PolicyParser: 4 | def __init__(self, dimensions): 5 | self.state_count = dimensions['state_count'] 6 | self.action_count = dimensions['action_count'] 7 | 8 | def policy(self, P, rewards): 9 | print('COMPUTING POLICY') 10 | 11 | best_policy = np.zeros(self.state_count) 12 | state_values = np.zeros(self.state_count) 13 | 14 | GAMMA = 0.9 15 | ITERATIONS = 125 16 | for i in range(ITERATIONS): 17 | print("iteration: {0} / {1}".format(i + 1, ITERATIONS)) 18 | 19 | for state in range(0, self.state_count): 20 | state_value = -float('Inf') 21 | 22 | for action in range(0, self.action_count): 23 | action_value = 0 24 | 25 | for state_ in range(0, self.state_count): 26 | action_value += (P[state][action][state_] * state_values[state_] * GAMMA) 27 | 28 | if (action_value >= state_value): 29 | state_value = action_value 30 | best_policy[state] = action 31 | 32 | state_values[state] = rewards[state] + state_value 33 | 34 | return best_policy -------------------------------------------------------------------------------- /rewards.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | 3 | class RewardParser: 4 | def __init__(self, observations, dimensions): 5 | self.observations = observations 6 | self.state_count = dimensions['state_count'] 7 | 8 | def rewards(self): 9 | print('COMPUTING REWARDS') 10 | total_state_rewards = np.zeros(self.state_count) 11 | total_state_visits = np.zeros(self.state_count) 12 | 13 | for observation in self.observations: 14 | visits = float(len(observation['state_transitions'])) 15 | reward_per_visit = observation['reward'] / visits 16 | 17 | for state_transition in observation['state_transitions']: 18 | state = state_transition['state'] 19 | total_state_rewards[state] += reward_per_visit 20 | total_state_visits[state] += 1 21 | 22 | average_state_rewards = total_state_rewards / total_state_visits 23 | average_state_rewards = np.nan_to_num(average_state_rewards) 24 | 25 | return average_state_rewards -------------------------------------------------------------------------------- /search.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NathanEpstein/pydata-reinforce/10c2d75b51258794d60fa99e2eba23eddc103b0d/search.png -------------------------------------------------------------------------------- /search.py: -------------------------------------------------------------------------------- 1 | import random 2 | 3 | # parent search class, methods for updating relevant features as the search progresses 4 | class Search: 5 | def __init__(self, array, target): 6 | self.path = [] 7 | self.array = array 8 | self.target = target 9 | self.initialize_features() 10 | 11 | def initialize_features(self): 12 | self.location = random.choice(range(len(self.array))) 13 | self._update_direction() 14 | self.floor = 0 15 | self.ceil = len(self.array) 16 | self.high_target = self.target > 1 # values are drawn from expovariate(1), which has mean 1 17 | 18 | def state(self): 19 | if (self.array[self.location] == self.target): 20 | return 'TARGET_FOUND' 21 | features = [ 22 | str(self.location), 23 | self.direction, 24 | str(self.floor), 25 | str(self.ceil), 26 | str(self.high_target) 27 | ] 28 | return ':'.join(features) 29 | 30 | def update(self): 31 | self.update_location() # supplied by child 
class 32 | self.path.append(self.location) 33 | self._update_direction_and_bounds() 34 | return self.location 35 | 36 | def _update_direction_and_bounds(self): 37 | self._update_direction() 38 | self._update_bounds() 39 | 40 | def _update_direction(self): 41 | if self.array[self.location] < self.target: 42 | self.direction = 'up' 43 | else: 44 | self.direction = 'down' 45 | 46 | def _update_bounds(self): 47 | if self.direction == 'up': 48 | self.floor = self.location 49 | else: 50 | self.ceil = self.location 51 | 52 | 53 | # specific search classes inherit from Search and supply an update_location method 54 | class BinarySearch(Search): 55 | def update_location(self): 56 | self.location = (self.ceil + self.floor) // 2 # integer midpoint of the known index range 57 | 58 | class LinearSearch(Search): 59 | def update_location(self): 60 | if self.direction == 'up': 61 | self.location += 1 62 | else: 63 | self.location -= 1 64 | 65 | class RandomSearch(Search): 66 | def update_location(self): 67 | self.location = random.choice(range(len(self.array))) 68 | 69 | # simulation class which uses searches to create training data 70 | class SearchSimulation: 71 | def observation(self, array_length, supplied_search = None): 72 | if supplied_search is None: 73 | search = self._search_of_random_type(array_length) 74 | else: 75 | search = supplied_search 76 | 77 | transitions = [] 78 | 79 | while (search.state() != 'TARGET_FOUND'): 80 | state = search.state() 81 | action = search.update() 82 | state_ = search.state() 83 | 84 | transitions.append({ 85 | 'state': state, 86 | 'action': action, 87 | 'state_': state_ 88 | }) 89 | 90 | if len(transitions) == 0: 91 | return self.observation(array_length) 92 | else: 93 | return { 94 | 'reward': -len(transitions), 95 | 'state_transitions': transitions 96 | } 97 | 98 | def observations(self, n, array_length): 99 | observations = [] 100 | for i in range(n): 101 | observations.append(self.observation(array_length)) 102 | observations.append(self._trap_state(array_length)) 103 | return observations 104 | 105 | def _trap_state(self, array_length): 106 | state_transitions = [] 107 | for i in range(array_length): 108 | state_transitions.append({ 109 | 'state': 'TARGET_FOUND', 110 | 'action': i, 111 | 'state_': 'TARGET_FOUND' 112 | }) 113 | return { 114 | 'reward': 1, 115 | 'state_transitions': state_transitions 116 | } 117 | 118 | def _search_of_random_type(self, array_length): 119 | sorted_array = self._random_sorted_array(array_length) 120 | target_value = random.choice(sorted_array) 121 | 122 | search_type = random.choice(['binary', 'linear', 'random']) 123 | if search_type == 'binary': 124 | search = BinarySearch(sorted_array, target_value) 125 | elif search_type == 'random': 126 | search = RandomSearch(sorted_array, target_value) 127 | elif search_type == 'linear': 128 | search = LinearSearch(sorted_array, target_value) 129 | 130 | return search 131 | 132 | def _random_sorted_array(self, length): 133 | random_values = [] 134 | for i in range(length): 135 | value = round(random.expovariate(1), 5) 136 | random_values.append(value) 137 | 138 | random_values.sort() 139 | return random_values 140 | -------------------------------------------------------------------------------- /transitions.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | 3 | class TransitionParser: 4 | def __init__(self, observations, dimensions): 5 | self.observations = observations 6 | self.state_count = dimensions['state_count'] 7 | self.action_count = dimensions['action_count'] 8 | 9 | def 
transition_probabilities(self): 10 | print('COMPUTING TRANSITIONS') 11 | transition_count = self._count_transitions() 12 | return self._parse_probabilities(transition_count) 13 | 14 | def _count_transitions(self): 15 | transition_count = np.zeros((self.state_count, self.action_count, self.state_count)) 16 | 17 | for observation in self.observations: 18 | for state_transition in observation['state_transitions']: 19 | state = state_transition['state'] 20 | action = state_transition['action'] 21 | state_ = state_transition['state_'] 22 | 23 | transition_count[state][action][state_] += 1 24 | 25 | return transition_count 26 | 27 | def _parse_probabilities(self, transition_count): 28 | P = np.zeros((self.state_count, self.action_count, self.state_count)) 29 | 30 | for state in range(0, self.state_count): 31 | for action in range(0, self.action_count): 32 | 33 | total_transitions = float(sum(transition_count[state][action])) 34 | 35 | if (total_transitions > 0): 36 | P[state][action] = transition_count[state][action] / total_transitions 37 | else: 38 | P[state][action] = 1.0 / self.state_count 39 | 40 | return P --------------------------------------------------------------------------------