├── .gitignore ├── README.md ├── climbing.py ├── encoding.py ├── learn.py ├── policy.py ├── rewards.py ├── search.png ├── search.py └── transitions.py /.gitignore: -------------------------------------------------------------------------------- 1 | *.pyc 2 | simulate_search.py 3 | *.html 4 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | title: Reinforcement Learning in Python 2 | author: 3 | name: Nathan Epstein 4 | twitter: epstein_n 5 | url: http://nepste.in 6 | email: _@nepste.in 7 | 8 | -- 9 | 10 | # Reinforcement Learning in Python 11 | 12 | -- 13 | 14 | ### Roadmap 15 | 16 | - Concepts 17 | - Using Data 18 | - Python Implementation 19 | - Sample Application 20 | 21 | -- 22 | 23 | # Concepts 24 | 25 | -- 26 | 27 | ### Reinforcement Learning 28 | 29 | Reinforcement learning is a type of machine learning in which software agents are trained to take actions in a given environment to maximize a cumulative reward. 30 | 31 | -- 32 | 33 | ### Markov Decision Process - Components 34 | 35 | A Markov Decision Process is a mathematical formalism that we will use to implement reinforcement learning. The relevant components of this formalism are 36 | the __state space__, __action space__, __transition probabilities__, and __rewards__. 37 | 38 | -- 39 | 40 | ### State Space 41 | 42 | The exhaustive set of possible states that a process can be in. Generally known a priori. 43 | 44 | -- 45 | 46 | ### Action Space 47 | 48 | The exhaustive set of possible actions that can be taken (to influence the likelihood of transitioning between states). Generally known a priori. 49 | 50 | -- 51 | 52 | ### Transition Probabilities 53 | 54 | The probabilities of transitioning between the various states given the actions taken (specifically, a tensor P such that P_ijk = the probability of going from state i to state k given action j). Generally not known a priori. 55 | 56 | -- 57 | 58 | ### Rewards 59 | 60 | The rewards associated with occupying each state. Generally not known a priori. 61 | 62 | -- 63 | 64 | ### Markov Decision Process - Objectives 65 | 66 | We are interested in understanding these elements in order to develop a __policy__: the set of actions we will take in each state. 67 | 68 | Our goal is to determine a policy which produces the greatest possible cumulative reward. 69 | 70 | -- 71 | 72 | # Using Data 73 | 74 | -- 75 | 76 | ### Objectives 77 | 78 | Our goal is to learn the transition probabilities and the rewards (and to build a policy based on these estimates). We will estimate these values using observed data. 79 | 80 | -- 81 | 82 | ### Estimating Rewards 83 | 84 | R(s) = (total reward we got in state s) / (#times we visited state s) 85 | 86 | -- 87 | 88 | ### Estimating Transition Probability 89 | 90 | P(s, a, s') = (#times we took action a in state s and went to s') / (#times we took action a in state s) 91 | 92 | -- 93 | 94 | ### Value Iteration 95 | 96 | Value Iteration is an algorithm by which we can use the estimated rewards and transition probabilities to determine an optimal policy. 
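For intuition, here is a minimal NumPy sketch of value iteration on a hypothetical two-state, two-action MDP (the transition probabilities and rewards below are made up for illustration; the presentation's actual implementation appears in the Policy Parser slides):

```python
import numpy as np

# hypothetical MDP: P[s][a][s'] = probability of moving from state s to s' under action a
P = np.array([
    [[0.9, 0.1], [0.2, 0.8]],  # transitions out of state 0
    [[0.0, 1.0], [0.5, 0.5]],  # transitions out of state 1
])
R = np.array([0.0, 1.0])       # reward for occupying each state
GAMMA = 0.9                    # discount factor

V = np.zeros(2)
for _ in range(50):
    # V(s) = R(s) + GAMMA * max_a SUM_s' P(s, a, s') * V(s')
    V = R + GAMMA * np.max(P.dot(V), axis=1)

best_action = np.argmax(P.dot(V), axis=1)  # greedy policy: best action in each state
```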
97 | 98 | -- 99 | 100 | ### Value Iteration - algorithm 101 | 102 | 1) For each state s, initialize V(s) = 0 103 | 104 | 2) Repeat until convergence: 105 | 106 | For each state, update V(s) = R(s) + γ * max_a∈A SUM_s'∈S(P(s, a, s') * V(s')), where γ ∈ (0, 1) is a discount factor (the implementation below uses γ = 0.9) 107 | 108 | 3) The policy in state s is the a ∈ A which maximizes SUM_s'∈S(P(s, a, s') * V(s')) 109 | 110 | -- 111 | 112 | # Python Implementation 113 | 114 | -- 115 | 116 | ### Following Along 117 | 118 | If you want to look at the code examples on your own, you can find all code for this presentation at https://github.com/NathanEpstein/pydata-reinforce 119 | 120 | -- 121 | 122 | ### Reward Parser 123 | 124 | ```python 125 | 126 | import numpy as np 127 | 128 | class RewardParser: 129 | def __init__(self, observations, dimensions): 130 | self.observations = observations 131 | self.state_count = dimensions['state_count'] 132 | 133 | def rewards(self): 134 | total_state_rewards = np.zeros(self.state_count) 135 | total_state_visits = np.zeros(self.state_count) 136 | 137 | for observation in self.observations: 138 | visits = float(len(observation['state_transitions'])) 139 | reward_per_visit = observation['reward'] / visits 140 | 141 | for state_transition in observation['state_transitions']: 142 | state = state_transition['state'] 143 | total_state_rewards[state] += reward_per_visit 144 | total_state_visits[state] += 1 145 | 146 | average_state_rewards = total_state_rewards / total_state_visits 147 | average_state_rewards = np.nan_to_num(average_state_rewards) 148 | 149 | return average_state_rewards 150 | 151 | ``` 152 | 153 | -- 154 | 155 | ### Transition Parser (part 1) 156 | 157 | ```python 158 | 159 | import numpy as np 160 | 161 | class TransitionParser: 162 | def __init__(self, observations, dimensions): 163 | self.observations = observations 164 | self.state_count = dimensions['state_count'] 165 | self.action_count = dimensions['action_count'] 166 | 167 | def transition_probabilities(self): 168 | transition_count = self._count_transitions() 169 | return self._parse_probabilities(transition_count) 170 | 171 | ``` 172 | 173 | -- 174 | 175 | ### Transition Parser (part 2) 176 | 177 | ```python 178 | 179 | def _count_transitions(self): 180 | transition_count = np.zeros((self.state_count, self.action_count, self.state_count)) 181 | 182 | for observation in self.observations: 183 | for state_transition in observation['state_transitions']: 184 | state = state_transition['state'] 185 | action = state_transition['action'] 186 | state_ = state_transition['state_'] 187 | 188 | transition_count[state][action][state_] += 1 189 | 190 | return transition_count 191 | 192 | def _parse_probabilities(self, transition_count): 193 | P = np.zeros((self.state_count, self.action_count, self.state_count)) 194 | 195 | for state in range(0, self.state_count): 196 | for action in range(0, self.action_count): 197 | 198 | total_transitions = float(sum(transition_count[state][action])) 199 | 200 | if (total_transitions > 0): 201 | P[state][action] = transition_count[state][action] / total_transitions 202 | else: 203 | P[state][action] = 1.0 / self.state_count 204 | 205 | return P 206 | 207 | ``` 208 | 209 | -- 210 | 211 | ### Policy Parser 212 | 213 | ```python 214 | 215 | import numpy as np 216 | 217 | class PolicyParser: 218 | def __init__(self, dimensions): 219 | self.state_count = dimensions['state_count'] 220 | self.action_count = dimensions['action_count'] 221 | 222 | def policy(self, P, rewards): 223 | best_policy = np.zeros(self.state_count) 224 | state_values = np.zeros(self.state_count) 225 | 226 | 
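# value-iteration hyperparameters: GAMMA is the discount factor γ in
# V(s) = R(s) + γ * max_a SUM_s' P(s, a, s') * V(s'); ITERATIONS is a fixed
# number of sweeps over the state space, used here in place of an explicit convergence check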
GAMMA, ITERATIONS = 0.9, 50 227 | for i in range(ITERATIONS): 228 | for state in range(0, self.state_count): 229 | state_value = -float('Inf') 230 | for action in range(0, self.action_count): 231 | action_value = 0 232 | for state_ in range(0, self.state_count): 233 | action_value += (P[state][action][state_] * state_values[state_] * GAMMA) 234 | if (action_value >= state_value): 235 | state_value = action_value 236 | best_policy[state] = action 237 | state_values[state] = rewards[state] + state_value 238 | return best_policy 239 | 240 | ``` 241 | 242 | -- 243 | 244 | ### Putting It Together (Markov Agent) 245 | 246 | ```python 247 | 248 | from rewards import RewardParser 249 | from transitions import TransitionParser 250 | from policy import PolicyParser 251 | 252 | class MarkovAgent: 253 | def __init__(self, observations, dimensions): 254 | # create reward, transition, and policy parsers 255 | self.reward_parser = RewardParser(observations, dimensions) 256 | self.transition_parser = TransitionParser(observations, dimensions) 257 | self.policy_parser = PolicyParser(dimensions) 258 | 259 | def learn(self): 260 | R = self.reward_parser.rewards() 261 | P = self.transition_parser.transition_probabilities() 262 | 263 | self.policy = self.policy_parser.policy(P, R) 264 | ``` 265 | -- 266 | 267 | # Sample Application 268 | 269 | 270 | ##"I believe the robots are our future, teach them well and let them lead the way." 271 | 272 | -- 273 | ### Climbing 274 | 275 | - 5 states: bottom, low, middle, high, top. 276 | - No leaving bottom and top. 277 | - We get a reward at the top, nothing at the bottom. 278 | 279 | -- 280 | 281 | ### Data - observations 282 | 283 | ```python 284 | 285 | observations = [ 286 | { 'state_transitions': [ 287 | { 'state': 'low', 'action': 'climb', 'state_': 'mid' }, 288 | { 'state': 'mid', 'action': 'climb', 'state_': 'high' }, 289 | { 'state': 'high', 'action': 'sink', 'state_': 'mid' }, 290 | { 'state': 'mid', 'action': 'sink', 'state_': 'low' }, 291 | { 'state': 'low', 'action': 'sink', 'state_': 'bottom' } 292 | ], 293 | 'reward': 0 294 | }, 295 | { 'state_transitions': [ 296 | { 'state': 'low', 'action': 'climb', 'state_': 'mid' }, 297 | { 'state': 'mid', 'action': 'climb', 'state_': 'high' }, 298 | { 'state': 'high', 'action': 'climb', 'state_': 'top' }, 299 | ], 300 | 'reward': 0 301 | } 302 | ] 303 | 304 | ``` 305 | 306 | -- 307 | 308 | ### Data - trap states 309 | 310 | ```python 311 | 312 | trap_states = [ 313 | { 314 | 'state_transitions': [ 315 | { 'state': 'bottom', 'action': 'sink', 'state_': 'bottom' }, 316 | { 'state': 'bottom', 'action': 'climb', 'state_': 'bottom' } 317 | ], 318 | 'reward': 0 319 | }, 320 | { 321 | 'state_transitions': [ 322 | { 'state': 'top', 'action': 'sink', 'state_': 'top' }, 323 | { 'state': 'top', 'action': 'climb', 'state_': 'top' }, 324 | ], 325 | 'reward': 1 326 | }, 327 | ] 328 | 329 | ``` 330 | 331 | -- 332 | 333 | ### Training 334 | 335 | ```python 336 | from learn import MarkovAgent 337 | mark = MarkovAgent(observations + trap_states) 338 | mark.learn() 339 | 340 | print(mark.policy) 341 | # {'high': 'climb', 'top': 'sink', 'bottom': 'sink', 'low': 'climb', 'mid': 'climb'} 342 | # NOTE: policy in top and bottom states is chosen randomly (doesn't affect state) 343 | 344 | ``` 345 | 346 | -- 347 | 348 | ### Search 349 | 350 | - Given an array of sorted numbers, find a target value as quickly as possible. 351 | 352 | - Our random numbers will be distributed exponentially. 
Can we use this information to do better than binary search? 353 | 354 | -- 355 | 356 | ### Approach 357 | 358 | - Create many example searches (random, linear, and binary). 359 | - The reward for each search is -1 * (#steps taken to find the target). 360 | - Append trap states with positive rewards to these observations. 361 | 362 | -- 363 | 364 | ### "Feature Selection" 365 | 366 | Our model can only learn what we show it (i.e. what's encoded in the state). Our state will include: 367 | 368 | - current location 369 | - whether the current location is above or below the target 370 | - known index range 371 | - whether our target is above or below the distribution mean 372 | 373 | Example state: "12:up:0:50:True" 374 | 375 | -- 376 | 377 | ### Training 378 | 379 | ```python 380 | from learn import MarkovAgent 381 | from search import * 382 | import numpy as np 383 | 384 | simulator = SearchSimulation() 385 | observations = simulator.observations(50000, 15) 386 | mark = MarkovAgent(observations) 387 | mark.learn() 388 | 389 | 390 | class AISearch(Search): 391 | def update_location(self): 392 | self.location = mark.policy[self.state()] 393 | 394 | ``` 395 | 396 | -- 397 | 398 | ### Comparison 399 | 400 | ```python 401 | binary_results = [] 402 | linear_results = [] 403 | random_results = [] 404 | ai_results = [] 405 | 406 | for i in range(10000): 407 | # create array and target value 408 | array = simulator._random_sorted_array(15) 409 | target = random.choice(array) 410 | 411 | # generate an observation for a search of each type 412 | binary = simulator.observation(15, BinarySearch(array, target)) 413 | linear = simulator.observation(15, LinearSearch(array, target)) 414 | rando = simulator.observation(15, RandomSearch(array, target)) 415 | ai = simulator.observation(15, AISearch(array, target)) 416 | 417 | # append result 418 | binary_results.append(len(binary['state_transitions'])) 419 | linear_results.append(len(linear['state_transitions'])) 420 | random_results.append(len(rando['state_transitions'])) 421 | ai_results.append(len(ai['state_transitions'])) 422 | 423 | # display average results 424 | print("Average binary search length: {0}".format(np.mean(binary_results))) # 3.6469 425 | print("Average linear search length: {0}".format(np.mean(linear_results))) # 5.5242 426 | print("Average random search length: {0}".format(np.mean(random_results))) # 14.2132 427 | print("Average AI search length: {0}".format(np.mean(ai_results))) # 3.1095 428 | 429 | ``` 430 | 431 | -- 432 | 433 | ### Results 434 | 435 | ![search length comparison](search.png) 436 | 437 | 438 | 439 | 440 | 441 | 442 | 443 | 444 | -------------------------------------------------------------------------------- /climbing.py: -------------------------------------------------------------------------------- 1 | from learn import MarkovAgent 2 | 3 | observations = [ 4 | { 'state_transitions': [ 5 | { 'state': 'low', 'action': 'climb', 'state_': 'mid' }, 6 | { 'state': 'mid', 'action': 'climb', 'state_': 'high' }, 7 | { 'state': 'high', 'action': 'sink', 'state_': 'mid' }, 8 | { 'state': 'mid', 'action': 'sink', 'state_': 'low' }, 9 | { 'state': 'low', 'action': 'sink', 'state_': 'bottom' } 10 | ], 11 | 'reward': 0 12 | }, 13 | { 'state_transitions': [ 14 | { 'state': 'low', 'action': 'climb', 'state_': 'mid' }, 15 | { 'state': 'mid', 'action': 'climb', 'state_': 'high' }, 16 | { 'state': 'high', 'action': 'climb', 'state_': 'top' }, 17 | ], 18 | 'reward': 0 19 | } 20 | ] 21 | 22 | trap_states = [ 23 | { 24 | 'state_transitions': [ 25 | { 'state': 'bottom', 'action': 'sink', 'state_': 'bottom' }, 26 | { 'state': 
'bottom', 'action': 'climb', 'state_': 'bottom' } 27 | ], 28 | 'reward': 0 29 | }, 30 | { 31 | 'state_transitions': [ 32 | { 'state': 'top', 'action': 'sink', 'state_': 'top' }, 33 | { 'state': 'top', 'action': 'climb', 'state_': 'top' }, 34 | ], 35 | 'reward': 1 36 | }, 37 | ] 38 | 39 | observations += trap_states 40 | 41 | mark = MarkovAgent(observations) 42 | mark.learn() 43 | 44 | # mark correctly learns that the optimal strategy is to always go up 45 | print(mark.policy) 46 | -------------------------------------------------------------------------------- /encoding.py: -------------------------------------------------------------------------------- 1 | class StateActionEncoder: 2 | def __init__(self, observations): 3 | self.observations = observations 4 | self._parse_states_and_actions() 5 | 6 | def parse_dimensions(self): 7 | return { 8 | 'state_count': len(self.int_to_state), 9 | 'action_count': len(self.int_to_action) 10 | } 11 | 12 | def observations_to_int(self): 13 | for observation in self.observations: 14 | for transition in observation['state_transitions']: 15 | transition['state'] = self.state_to_int[transition['state']] 16 | transition['state_'] = self.state_to_int[transition['state_']] 17 | transition['action'] = self.action_to_int[transition['action']] 18 | 19 | def parse_encoded_policy(self, encoded_policy): 20 | policy = {} 21 | for index, encoded_action in enumerate(encoded_policy): 22 | state = self.int_to_state[index] 23 | action = self.int_to_action[int(encoded_action)] 24 | policy[state] = action 25 | 26 | return policy 27 | 28 | def _parse_states_and_actions(self): 29 | state_dict, action_dict = {}, {} 30 | state_array, action_array = [], [] 31 | state_index, action_index = 0, 0 32 | 33 | for observation in self.observations: 34 | for transition in observation['state_transitions']: 35 | state = transition['state'] 36 | action = transition['action'] 37 | 38 | if state not in state_dict.keys(): 39 | state_dict[state] = state_index 40 | state_array.append(state) 41 | state_index += 1 42 | 43 | if action not in action_dict.keys(): 44 | action_dict[action] = action_index 45 | action_array.append(action) 46 | action_index += 1 47 | 48 | self.state_to_int = state_dict 49 | self.action_to_int = action_dict 50 | self.int_to_state = state_array 51 | self.int_to_action = action_array 52 | 53 | -------------------------------------------------------------------------------- /learn.py: -------------------------------------------------------------------------------- 1 | from encoding import StateActionEncoder 2 | from rewards import RewardParser 3 | from transitions import TransitionParser 4 | from policy import PolicyParser 5 | 6 | class MarkovAgent: 7 | def __init__(self, observations): 8 | # encode observation data as int values 9 | self.state_action_encoder = StateActionEncoder(observations) 10 | self.state_action_encoder.observations_to_int() 11 | dimensions = self.state_action_encoder.parse_dimensions() 12 | 13 | # create reward, transition, and policy parsers 14 | self.reward_parser = RewardParser(observations, dimensions) 15 | self.transition_parser = TransitionParser(observations, dimensions) 16 | self.policy_parser = PolicyParser(dimensions) 17 | 18 | def learn(self): 19 | R = self.reward_parser.rewards() 20 | P = self.transition_parser.transition_probabilities() 21 | 22 | # learn int-encoded policy and convert to readable dictionary 23 | encoded_policy = self.policy_parser.policy(P, R) 24 | self.policy = self.state_action_encoder.parse_encoded_policy(encoded_policy) 
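# self.policy is a plain dict mapping each observed state to its learned action,
# e.g. {'low': 'climb', 'mid': 'climb', ...} for the climbing example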
25 | 26 | -------------------------------------------------------------------------------- /policy.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | 3 | class PolicyParser: 4 | def __init__(self, dimensions): 5 | self.state_count = dimensions['state_count'] 6 | self.action_count = dimensions['action_count'] 7 | 8 | def policy(self, P, rewards): 9 | print('COMPUTING POLICY') 10 | 11 | best_policy = np.zeros(self.state_count) 12 | state_values = np.zeros(self.state_count) 13 | 14 | GAMMA = 0.9 15 | ITERATIONS = 125 16 | for i in range(ITERATIONS): 17 | print("iteration: {0} / {1}".format(i + 1, ITERATIONS)) 18 | 19 | for state in range(0, self.state_count): 20 | state_value = -float('Inf') 21 | 22 | for action in range(0, self.action_count): 23 | action_value = 0 24 | 25 | for state_ in range(0, self.state_count): 26 | action_value += (P[state][action][state_] * state_values[state_] * GAMMA) 27 | 28 | if (action_value >= state_value): 29 | state_value = action_value 30 | best_policy[state] = action 31 | 32 | state_values[state] = rewards[state] + state_value 33 | 34 | return best_policy -------------------------------------------------------------------------------- /rewards.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | 3 | class RewardParser: 4 | def __init__(self, observations, dimensions): 5 | self.observations = observations 6 | self.state_count = dimensions['state_count'] 7 | 8 | def rewards(self): 9 | print('COMPUTING REWARDS') 10 | total_state_rewards = np.zeros(self.state_count) 11 | total_state_visits = np.zeros(self.state_count) 12 | 13 | for observation in self.observations: 14 | visits = float(len(observation['state_transitions'])) 15 | reward_per_visit = observation['reward'] / visits 16 | 17 | for state_transition in observation['state_transitions']: 18 | state = state_transition['state'] 19 | total_state_rewards[state] += reward_per_visit 20 | total_state_visits[state] += 1 21 | 22 | average_state_rewards = total_state_rewards / total_state_visits 23 | average_state_rewards = np.nan_to_num(average_state_rewards) 24 | 25 | return average_state_rewards -------------------------------------------------------------------------------- /search.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NathanEpstein/pydata-reinforce/10c2d75b51258794d60fa99e2eba23eddc103b0d/search.png -------------------------------------------------------------------------------- /search.py: -------------------------------------------------------------------------------- 1 | import random 2 | 3 | # parent search class, methods for updating relevant features as the search progresses 4 | class Search: 5 | def __init__(self, array, target): 6 | self.path = [] 7 | self.array = array 8 | self.target = target 9 | self.initialize_features() 10 | 11 | def initialize_features(self): 12 | self.location = random.choice(range(len(self.array))) 13 | self._update_direction() 14 | self.floor = 0 15 | self.ceil = len(self.array) 16 | self.high_target = self.target > 1 # values are drawn from expovariate(1), which has mean 1 17 | 18 | def state(self): 19 | if (self.array[self.location] == self.target): 20 | return 'TARGET_FOUND' 21 | features = [ 22 | str(self.location), 23 | self.direction, 24 | str(self.floor), 25 | str(self.ceil), 26 | str(self.high_target) 27 | ] 28 | return ':'.join(features) 29 | 30 | def update(self): 31 | self.update_location() # supplied by child 
class 32 | self.path.append(self.location) 33 | self._update_direction_and_bounds() 34 | return self.location 35 | 36 | def _update_direction_and_bounds(self): 37 | self._update_direction() 38 | self._update_bounds() 39 | 40 | def _update_direction(self): 41 | if self.array[self.location] < self.target: 42 | self.direction = 'up' 43 | else: 44 | self.direction = 'down' 45 | 46 | def _update_bounds(self): 47 | if self.direction == 'up': 48 | self.floor = self.location 49 | else: 50 | self.ceil = self.location 51 | 52 | 53 | # specific search classes inherit from Search and supply an update_location method 54 | class BinarySearch(Search): 55 | def update_location(self): 56 | self.location = (self.ceil + self.floor) // 2 # integer midpoint of the known index range 57 | 58 | class LinearSearch(Search): 59 | def update_location(self): 60 | if self.direction == 'up': 61 | self.location += 1 62 | else: 63 | self.location -= 1 64 | 65 | class RandomSearch(Search): 66 | def update_location(self): 67 | self.location = random.choice(range(len(self.array))) 68 | 69 | # simulation class which uses searches to create training data 70 | class SearchSimulation: 71 | def observation(self, array_length, supplied_search = None): 72 | if supplied_search is None: 73 | search = self._search_of_random_type(array_length) 74 | else: 75 | search = supplied_search 76 | 77 | transitions = [] 78 | 79 | while (search.state() != 'TARGET_FOUND'): 80 | state = search.state() 81 | action = search.update() 82 | state_ = search.state() 83 | 84 | transitions.append({ 85 | 'state': state, 86 | 'action': action, 87 | 'state_': state_ 88 | }) 89 | 90 | if len(transitions) == 0: 91 | return self.observation(array_length) 92 | else: 93 | return { 94 | 'reward': -len(transitions), 95 | 'state_transitions': transitions 96 | } 97 | 98 | def observations(self, n, array_length): 99 | observations = [] 100 | for i in range(n): 101 | observations.append(self.observation(array_length)) 102 | observations.append(self._trap_state(array_length)) 103 | return observations 104 | 105 | def _trap_state(self, array_length): 106 | state_transitions = [] 107 | for i in range(array_length): 108 | state_transitions.append({ 109 | 'state': 'TARGET_FOUND', 110 | 'action': i, 111 | 'state_': 'TARGET_FOUND' 112 | }) 113 | return { 114 | 'reward': 1, 115 | 'state_transitions': state_transitions 116 | } 117 | 118 | def _search_of_random_type(self, array_length): 119 | sorted_array = self._random_sorted_array(array_length) 120 | target_value = random.choice(sorted_array) 121 | 122 | search_type = random.choice(['binary', 'linear', 'random']) 123 | if search_type == 'binary': 124 | search = BinarySearch(sorted_array, target_value) 125 | elif search_type == 'random': 126 | search = RandomSearch(sorted_array, target_value) 127 | elif search_type == 'linear': 128 | search = LinearSearch(sorted_array, target_value) 129 | 130 | return search 131 | 132 | def _random_sorted_array(self, length): 133 | random_values = [] 134 | for i in range(length): 135 | value = round(random.expovariate(1), 5) 136 | random_values.append(value) 137 | 138 | random_values.sort() 139 | return random_values 140 | -------------------------------------------------------------------------------- /transitions.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | 3 | class TransitionParser: 4 | def __init__(self, observations, dimensions): 5 | self.observations = observations 6 | self.state_count = dimensions['state_count'] 7 | self.action_count = dimensions['action_count'] 8 | 9 | def 
transition_probabilities(self): 10 | print('COMPUTING TRANSITIONS') 11 | transition_count = self._count_transitions() 12 | return self._parse_probabilities(transition_count) 13 | 14 | def _count_transitions(self): 15 | transition_count = np.zeros((self.state_count, self.action_count, self.state_count)) 16 | 17 | for observation in self.observations: 18 | for state_transition in observation['state_transitions']: 19 | state = state_transition['state'] 20 | action = state_transition['action'] 21 | state_ = state_transition['state_'] 22 | 23 | transition_count[state][action][state_] += 1 24 | 25 | return transition_count 26 | 27 | def _parse_probabilities(self, transition_count): 28 | P = np.zeros((self.state_count, self.action_count, self.state_count)) 29 | 30 | for state in range(0, self.state_count): 31 | for action in range(0, self.action_count): 32 | 33 | total_transitions = float(sum(transition_count[state][action])) 34 | 35 | if (total_transitions > 0): 36 | P[state][action] = transition_count[state][action] / total_transitions 37 | else: 38 | P[state][action] = 1.0 / self.state_count 39 | 40 | return P --------------------------------------------------------------------------------