├── LICENSE.md
├── Markov Chains
│   ├── dispenser_mc.png
│   └── robot_mc.png
├── README.md
└── wolf_phc.py

/LICENSE.md:
--------------------------------------------------------------------------------
MIT License

Copyright (c) 2019 Yatharth Garg

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

--------------------------------------------------------------------------------
/Markov Chains/dispenser_mc.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/yatharthgarg/Reinforcement-Learning/c8685c773e035474e39e59318bdadbd6bc075742/Markov Chains/dispenser_mc.png

--------------------------------------------------------------------------------
/Markov Chains/robot_mc.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/yatharthgarg/Reinforcement-Learning/c8685c773e035474e39e59318bdadbd6bc075742/Markov Chains/robot_mc.png

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# Reinforcement-Learning
This project implements WoLF-based (Win or Learn Fast) learning agents, specifically **WoLF Policy Hill Climbing (WoLF-PHC).**

The basic idea is to vary the agents' learning rates to support convergence of the algorithm: learn quickly while losing and slowly while winning. An agent determines whether it is winning by comparing its current policy's expected payoff with that of its average policy over time.
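
A minimal sketch of this winning test, for illustration only (the names below are hypothetical and not the ones used in `wolf_phc.py`, which performs the same comparison inside its `delta` function):

```python
import numpy as np

def is_winning(Q, policy, mean_policy, state):
    """WoLF criterion: the agent counts as winning in `state` when the expected
    payoff of its current policy exceeds that of its time-averaged policy."""
    current = np.dot(policy[state], Q[state])       # expected payoff of current policy
    average = np.dot(mean_policy[state], Q[state])  # expected payoff of average policy
    return current > average

# Winning -> update the policy with the small step delta_w.
# Losing  -> update the policy with the larger step delta_l (delta_l > delta_w).
```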

### The problem:
```
Two robots are operating in a factory to bring metal bars to two different production halls. The metal
bars are dispensed in one place, only one bar can be picked up at a time, and each robot can only
carry one bar at a time. Once a metal bar is picked up, a new one will appear at the dispenser with
probability 0.5 every time step (every action taken corresponds to one time step). Each robot has 3
action choices. It can either try to pick up a metal bar, deliver it to the production hall, or wait.
If it tries to pick up a metal bar, it will succeed with probability 0.5 (due to imprecisions in its
programming) if there is a metal bar available and fail if there is none available. If it tries to
deliver a metal bar to the production hall, it will succeed with probability 1 if it is holding a metal
bar and fail otherwise. If it decides to wait, it will stay in place. If both robots try to pick up a
metal bar at the same time, they will both fail. Each robot receives a payoff of 4 if it successfully
delivers a metal bar to the production hall and incurs a cost of 1 if it tries to pick up a metal bar
or if it tries to deliver one to the production hall (reflecting the energy it uses up). The wait
action does not incur a cost.
```

### Modelling the problem:
```
• Each robot has two possible states: it either has a bar (S1) or it doesn't (S0). The dispenser
  has two possible states: it either makes a bar (S1) or it doesn't (S0). Each robot has three
  possible actions: pick up a bar (P), deliver the bar (D) or wait (W).
• The states can be summarized as lists of 3 elements (x,y,z) – where x is 1 if Robot1 has a bar
  and 0 if it doesn't, y is 1 if Robot2 has a bar and 0 if it doesn't, and z is 1 if the dispenser
  has a bar and 0 if it doesn't. The corresponding decision-making Markov processes are shown
  below; one possible index mapping for the joint states follows the diagrams.
```
![alt text](https://github.com/yatharth1908/Reinforcement-Learning/blob/master/Markov%20Chains/dispenser_mc.png "Dispenser Decision Markov Chain")
```
N is the single action for the dispenser, meaning it is making a new bar.
```

![alt text](https://github.com/yatharth1908/Reinforcement-Learning/blob/master/Markov%20Chains/robot_mc.png "Robot Decision Markov Chain")
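
Since each of x, y and z is binary, there are 2³ = 8 joint states, which is why the Q tables, policies and reward matrices in `wolf_phc.py` all have 8 rows. One natural enumeration is shown below purely for illustration; the exact ordering used in the script is an assumption, as it is not documented there.

```python
# Enumerate the 8 joint states (x, y, z):
#   x = 1 if Robot1 holds a bar, y = 1 if Robot2 holds a bar,
#   z = 1 if the dispenser has a bar available.
states = [(x, y, z) for x in (0, 1) for y in (0, 1) for z in (0, 1)]
for index, state in enumerate(states):
    print(index, state)   # 0 -> (0, 0, 0), 1 -> (0, 0, 1), ..., 7 -> (1, 1, 1)
```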

### Algorithm implementation in the problem
```
• The algorithm requires two learning rates δl and δw with δl > δw; they are used to update the
  agent's policy depending on whether the agent is winning or losing. If the agent is losing, the
  larger value of delta is used, and vice versa. I used δw = 0.0025 and δl = 0.01, so the ratio
  δl/δw is equal to 4. The discount factor is 0.8 and the algorithm runs for 15000 iterations.
• The mean policy is the other key idea of the algorithm: the current policy is compared against
  it to determine whether the agent is winning or losing.
• First, an action was selected by sampling with probability π(s,a), which provides the random
  exploration. Then the Q values, mean policy and policy for each robot were updated according to
  the algorithm from the reference (see the sketch of the policy step below), and the final
  policies converged.
```
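
The policy step itself is a hill-climbing move toward the greedy action. The condensed sketch below mirrors, but is not a copy of, the `update_pi` routine in `wolf_phc.py`; the function name is hypothetical, and `delta` is δw when the agent is winning and δl when it is losing, as chosen by the test above.

```python
import numpy as np

def hill_climb(Policy, Q, state, delta, n_actions=3):
    """Shift probability mass toward the greedy action by `delta`, taking it
    evenly from the remaining actions. Values are clamped to [0, 1];
    wolf_phc.py renormalises the row before sampling an action."""
    best = np.argmax(Q[state])
    step_down = delta / (n_actions - 1)
    for a in range(n_actions):
        if a == best:
            Policy[state, a] = min(1.0, Policy[state, a] + delta)
        else:
            Policy[state, a] = max(0.0, Policy[state, a] - step_down)
    return Policy
```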

### References

For this project I followed the algorithm derived in the paper [Rational and Convergent Learning in Stochastic Games](http://www.cs.cmu.edu/~mmv/papers/01ijcai-mike.pdf) by Dr. M. Bowling and Dr. M. Veloso.

--------------------------------------------------------------------------------
/wolf_phc.py:
--------------------------------------------------------------------------------
import numpy as np

np.set_printoptions(threshold=np.inf)   # print full arrays instead of truncating

total_states = 8         # joint states (Robot1 bar, Robot2 bar, dispenser bar)
total_actions = 3        # 0 = pick up, 1 = deliver, 2 = wait
iterations = 15000
alpha = 0.2              # Q-learning rate
df = 0.8                 # discount factor
d_win = 0.0025           # WoLF step size while winning (delta_w)
d_lose = 0.01            # WoLF step size while losing (delta_l), delta_l > delta_w
end = total_actions - 1  # index of the wait action; an episode ends after waiting

# Returns the actions available in the current state (a row of the reward matrix).
# R is an np.matrix, so its rows are 1 x 3, hence the [0, i] indexing.
def actions_present(state):
    available_actions = []
    for i in range(total_actions):
        if state[0, i] >= 0:
            available_actions.append(i)
    return available_actions

def Qmax(Q, next_state):
    return np.max(Q[next_state])

# Standard Q-learning update; the chosen action index is treated as the next state.
def updateQ(state, action, Q, R):
    next_state = action
    q = (1 - alpha) * Q[state, action] + alpha * (R[state, action] + df * Qmax(Q, next_state))
    return q

# Sample an action from the current policy. p1 is a view into Policy, so
# renormalising it also renormalises the stored policy row in place.
def actions_select(state, Policy):
    p1 = Policy[state, :]
    if np.sum(p1) != 1.0:
        p1 /= p1.sum()
    return np.random.choice(3, 1, p=p1)

# Choose the WoLF step size: d_win if the current policy's expected payoff
# beats that of the mean policy (the agent is winning), d_lose otherwise.
def delta(state, Q, Policy, MeanPolicy, d_win, d_lose):
    sumPolicy = 0.0
    sumMeanPolicy = 0.0
    for i in range(total_actions):
        sumPolicy = sumPolicy + (Policy[state, i] * Q[state, i])
        sumMeanPolicy = sumMeanPolicy + (MeanPolicy[state, i] * Q[state, i])
    if sumPolicy > sumMeanPolicy:
        return d_win
    else:
        return d_lose

# Hill-climb the policy toward the greedy action by the chosen step size,
# taking the probability mass evenly from the other actions.
def update_pi(state, Policy, MeanPolicy, Q, d_win, d_lose):
    maxQValueIndex = np.argmax(Q[state])
    for i in range(total_actions):
        d_plus = delta(state, Q, Policy, MeanPolicy, d_win, d_lose)
        d_minus = ((-1.0) * d_plus) / (total_actions - 1.0)
        if i == maxQValueIndex:
            Policy[state, i] = min(1.0, Policy[state, i] + d_plus)
        else:
            Policy[state, i] = max(0.0, Policy[state, i] + d_minus)
    return Policy

# Incremental average of the policies played so far in this state.
def update_meanpi(state, C, MeanPolicy, Policy):
    for i in range(total_actions):
        MeanPolicy[state, i] = MeanPolicy[state, i] + ((1.0 / C[state]) * (Policy[state, i] - MeanPolicy[state, i]))
    return MeanPolicy


def agent1():
    Q = np.zeros((total_states, total_actions))
    C = np.zeros(total_states)                                  # visit counts per state
    Policy = np.full((total_states, total_actions), 1.0 / 3)    # start from a uniform policy
    MeanPolicy = np.zeros((total_states, total_actions))

    # Reward matrix: one row per joint state, one column per action (pick, deliver, wait).
    # A successful delivery nets 3 (payoff 4 minus cost 1), picking up or a failed
    # delivery costs 1, and waiting is free.
    R = np.matrix([
        [-1, -1, 0],
        [-1, -1, 0],
        [-1,  3, 0],
        [-1, -1, 0],
        [-1,  3, 0],
        [-1, -1, 0],
        [-1,  3, 0],
        [-1,  3, 0]])

    for i in range(iterations):
        # Start each episode in a random state.
        state = np.random.randint(0, total_states)
        available_actions = actions_present(R[state])   # (not used below)

        # States are indexed 0..7, so this loop only exits via the break below.
        while state != 8:
            # actions_select returns a length-1 array, so after the first step
            # `state` is an array and must be unwrapped before indexing.
            if isinstance(state, (np.ndarray, np.generic)):
                action = actions_select(state[0], Policy)
            else:
                action = actions_select(state, Policy)
            Q[state, action] = updateQ(state, action, Q, R)

            C[state] = C[state] + 1
            MeanPolicy = update_meanpi(state, C, MeanPolicy, Policy)
            Policy = update_pi(state, Policy, MeanPolicy, Q, d_win, d_lose)

            next_state = action
            if next_state == end:
                break
            state = next_state
            available_actions = actions_present(R[state])

    print("Final Q values for agent1: \n {}\n".format(Q))
    print("Final Policy for agent1: \n {}\n".format(Policy))
    print("Final Mean Policy for agent1: \n {}\n".format(MeanPolicy))
    return Policy


def agent2():
    Q = np.zeros((total_states, total_actions))
    C = np.zeros(total_states)                                  # visit counts per state
    Policy = np.full((total_states, total_actions), 1.0 / 3)    # start from a uniform policy
    MeanPolicy = np.zeros((total_states, total_actions))

    # Same payoff structure as agent1; the delivery reward sits in different rows
    # because this matrix belongs to Robot2.
    R = np.matrix([
        [-1, -1, 0],
        [-1, -1, 0],
        [-1, -1, 0],
        [-1,  3, 0],
        [-1, -1, 0],
        [-1,  3, 0],
        [-1,  3, 0],
        [-1,  3, 0]])

    for i in range(iterations):
        # Start each episode in a random state.
        state = np.random.randint(0, total_states)
        available_actions = actions_present(R[state])   # (not used below)

        while state != 8:
            if isinstance(state, (np.ndarray, np.generic)):
                action = actions_select(state[0], Policy)
            else:
                action = actions_select(state, Policy)
            Q[state, action] = updateQ(state, action, Q, R)

            C[state] = C[state] + 1
            MeanPolicy = update_meanpi(state, C, MeanPolicy, Policy)
            Policy = update_pi(state, Policy, MeanPolicy, Q, d_win, d_lose)

            next_state = action
            if next_state == end:
                break
            state = next_state
            available_actions = actions_present(R[state])

    print("Final Q values for agent2: \n {}\n".format(Q))
    print("Final Policy for agent2: \n {}\n".format(Policy))
    print("Final Mean Policy for agent2: \n {}\n".format(MeanPolicy))
    return Policy


if __name__ == "__main__":
    policy1 = agent1()
    policy2 = agent2()

    # Greedy action of each agent in every state.
    ne = []
    eq1 = policy1.argmax(axis=1)
    eq2 = policy2.argmax(axis=1)
    eq = np.concatenate((eq1, eq2), axis=0)
    for i in range(len(eq)):
        if eq[i] == 0:
            ne.append('Pick')
        elif eq[i] == 1:
            ne.append('Deliver')
        else:
            ne.append('Wait')

    # Pair up the two agents' greedy actions state by state.
    for i, j in zip(range(0, 8), range(8, len(ne))):
        state_ne = '(' + ne[i] + ',' + ne[j] + ')'
        print("Nash Equilibrium for state {}: {}\n".format(i, state_ne))

--------------------------------------------------------------------------------