├── LICENSE.md
├── Markov Chains
│   ├── dispenser_mc.png
│   └── robot_mc.png
├── README.md
└── wolf_phc.py

/LICENSE.md:
--------------------------------------------------------------------------------
MIT License

Copyright (c) 2019 Yatharth Garg

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

--------------------------------------------------------------------------------
/Markov Chains/dispenser_mc.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/yatharthgarg/Reinforcement-Learning/c8685c773e035474e39e59318bdadbd6bc075742/Markov Chains/dispenser_mc.png

--------------------------------------------------------------------------------
/Markov Chains/robot_mc.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/yatharthgarg/Reinforcement-Learning/c8685c773e035474e39e59318bdadbd6bc075742/Markov Chains/robot_mc.png

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# Reinforcement-Learning
This project implements WoLF-based (Win or Learn Fast) learning agents, specifically **WoLF Policy Hill Climbing (WoLF-PHC).**

The basic idea is to vary the agents' learning rates to support convergence of the algorithm: learn quickly while losing and slowly while winning. An agent determines whether it is winning by comparing its current policy's expected payoff with that of its average policy over time.
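
A minimal sketch of this winning test, for illustration only (the names below are hypothetical and not the ones used in `wolf_phc.py`, which performs the same comparison inside its `delta` function):

```python
import numpy as np

def is_winning(Q, policy, mean_policy, state):
    """WoLF criterion: the agent counts as winning in `state` when the expected
    payoff of its current policy exceeds that of its time-averaged policy."""
    current = np.dot(policy[state], Q[state])       # expected payoff of current policy
    average = np.dot(mean_policy[state], Q[state])  # expected payoff of average policy
    return current > average

# Winning -> update the policy with the small step delta_w.
# Losing  -> update the policy with the larger step delta_l (delta_l > delta_w).
```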

### The problem:
```
Two robots are operating in a factory to bring metal bars to two different production halls. The metal
bars are dispensed in one place, only one bar can be picked up at a time, and each robot can only
carry one bar at a time. Once a metal bar is picked up, a new one will appear at the dispenser with
probability 0.5 every time step (every action taken corresponds to one time step). Each robot has 3
action choices. It can either try to pick up a metal bar, deliver it to the production hall, or wait.
If it tries to pick up a metal bar, it will succeed with probability 0.5 (due to imprecisions in its
programming) if there is a metal bar available and fail if there is none available. If it tries to
deliver a metal bar to the production hall, it will succeed with probability 1 if it is holding a metal
bar and fail otherwise. If it decides to wait, it will stay in place. If both robots try to pick up a
metal bar at the same time, they will both fail. Each robot receives a payoff of 4 if it successfully
delivers a metal bar to the production hall and incurs a cost of 1 if it tries to pick up a metal bar
or if it tries to deliver one to the production hall (reflecting the energy it uses up). The wait
action does not incur a cost.
```

### Modelling the problem:
```
• Each robot has two possible states: it either has a bar (S1) or it doesn't (S0). The dispenser
  has two possible states: it either makes a bar (S1) or it doesn't (S0). Each robot has three
  possible actions: pick up a bar (P), deliver the bar (D) or wait (W).
• The states can be summarized as lists of 3 elements (x,y,z) – where x is 1 if Robot1 has a bar
  and 0 if it doesn't, y is 1 if Robot2 has a bar and 0 if it doesn't, and z is 1 if the dispenser
  has a bar and 0 if it doesn't. The corresponding decision-making Markov processes are shown
  below; one possible index mapping for the joint states follows the diagrams.
```
![alt text](https://github.com/yatharth1908/Reinforcement-Learning/blob/master/Markov%20Chains/dispenser_mc.png "Dispenser Decision Markov Chain")
```
N is the single action for the dispenser, meaning it is making a new bar.
```

![alt text](https://github.com/yatharth1908/Reinforcement-Learning/blob/master/Markov%20Chains/robot_mc.png "Robot Decision Markov Chain")
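
Since each of x, y and z is binary, there are 2³ = 8 joint states, which is why the Q tables, policies and reward matrices in `wolf_phc.py` all have 8 rows. One natural enumeration is shown below purely for illustration; the exact ordering used in the script is an assumption, as it is not documented there.

```python
# Enumerate the 8 joint states (x, y, z):
#   x = 1 if Robot1 holds a bar, y = 1 if Robot2 holds a bar,
#   z = 1 if the dispenser has a bar available.
states = [(x, y, z) for x in (0, 1) for y in (0, 1) for z in (0, 1)]
for index, state in enumerate(states):
    print(index, state)   # 0 -> (0, 0, 0), 1 -> (0, 0, 1), ..., 7 -> (1, 1, 1)
```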

### Algorithm implementation in the problem
```
• The algorithm requires two learning rates δl and δw with δl > δw; they are used to update the
  agent's policy depending on whether the agent is winning or losing. If the agent is losing, the
  larger value of delta is used, and vice versa. I used δw = 0.0025 and δl = 0.01, so the ratio
  δl/δw is equal to 4. The discount factor is 0.8 and the algorithm runs for 15000 iterations.
• The mean policy is the other key idea of the algorithm: the current policy is compared against
  it to determine whether the agent is winning or losing.
• First, an action was selected by sampling with probability π(s,a), which provides the random
  exploration. Then the Q values, mean policy and policy for each robot were updated according to
  the algorithm from the reference (see the sketch of the policy step below), and the final
  policies converged.
```
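
The policy step itself is a hill-climbing move toward the greedy action. The condensed sketch below mirrors, but is not a copy of, the `update_pi` routine in `wolf_phc.py`; the function name is hypothetical, and `delta` is δw when the agent is winning and δl when it is losing, as chosen by the test above.

```python
import numpy as np

def hill_climb(Policy, Q, state, delta, n_actions=3):
    """Shift probability mass toward the greedy action by `delta`, taking it
    evenly from the remaining actions. Values are clamped to [0, 1];
    wolf_phc.py renormalises the row before sampling an action."""
    best = np.argmax(Q[state])
    step_down = delta / (n_actions - 1)
    for a in range(n_actions):
        if a == best:
            Policy[state, a] = min(1.0, Policy[state, a] + delta)
        else:
            Policy[state, a] = max(0.0, Policy[state, a] - step_down)
    return Policy
```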

### References

For this project I followed the algorithm derived in the paper [Rational and Convergent Learning in Stochastic Games](http://www.cs.cmu.edu/~mmv/papers/01ijcai-mike.pdf) by Dr. M. Bowling and Dr. M. Veloso.

--------------------------------------------------------------------------------
/wolf_phc.py:
--------------------------------------------------------------------------------
import numpy as np

np.set_printoptions(threshold=np.inf)   # print full arrays instead of truncating

total_states = 8         # joint states (Robot1 bar, Robot2 bar, dispenser bar)
total_actions = 3        # 0 = pick up, 1 = deliver, 2 = wait
iterations = 15000
alpha = 0.2              # Q-learning rate
df = 0.8                 # discount factor
d_win = 0.0025           # WoLF step size while winning (delta_w)
d_lose = 0.01            # WoLF step size while losing (delta_l), delta_l > delta_w
end = total_actions - 1  # index of the wait action; an episode ends after waiting

# Returns the actions available in the current state (a row of the reward matrix).
# R is an np.matrix, so its rows are 1 x 3, hence the [0, i] indexing.
def actions_present(state):
    available_actions = []
    for i in range(total_actions):
        if state[0, i] >= 0:
            available_actions.append(i)
    return available_actions

def Qmax(Q, next_state):
    return np.max(Q[next_state])

# Standard Q-learning update; the chosen action index is treated as the next state.
def updateQ(state, action, Q, R):
    next_state = action
    q = (1 - alpha) * Q[state, action] + alpha * (R[state, action] + df * Qmax(Q, next_state))
    return q

# Sample an action from the current policy. p1 is a view into Policy, so
# renormalising it also renormalises the stored policy row in place.
def actions_select(state, Policy):
    p1 = Policy[state, :]
    if np.sum(p1) != 1.0:
        p1 /= p1.sum()
    return np.random.choice(3, 1, p=p1)

# Choose the WoLF step size: d_win if the current policy's expected payoff
# beats that of the mean policy (the agent is winning), d_lose otherwise.
def delta(state, Q, Policy, MeanPolicy, d_win, d_lose):
    sumPolicy = 0.0
    sumMeanPolicy = 0.0
    for i in range(total_actions):
        sumPolicy = sumPolicy + (Policy[state, i] * Q[state, i])
        sumMeanPolicy = sumMeanPolicy + (MeanPolicy[state, i] * Q[state, i])
    if sumPolicy > sumMeanPolicy:
        return d_win
    else:
        return d_lose

# Hill-climb the policy toward the greedy action by the chosen step size,
# taking the probability mass evenly from the other actions.
def update_pi(state, Policy, MeanPolicy, Q, d_win, d_lose):
    maxQValueIndex = np.argmax(Q[state])
    for i in range(total_actions):
        d_plus = delta(state, Q, Policy, MeanPolicy, d_win, d_lose)
        d_minus = ((-1.0) * d_plus) / (total_actions - 1.0)
        if i == maxQValueIndex:
            Policy[state, i] = min(1.0, Policy[state, i] + d_plus)
        else:
            Policy[state, i] = max(0.0, Policy[state, i] + d_minus)
    return Policy

# Incremental average of the policies played so far in this state.
def update_meanpi(state, C, MeanPolicy, Policy):
    for i in range(total_actions):
        MeanPolicy[state, i] = MeanPolicy[state, i] + ((1.0 / C[state]) * (Policy[state, i] - MeanPolicy[state, i]))
    return MeanPolicy


def agent1():
    Q = np.zeros((total_states, total_actions))
    C = np.zeros(total_states)                                  # visit counts per state
    Policy = np.full((total_states, total_actions), 1.0 / 3)    # start from a uniform policy
    MeanPolicy = np.zeros((total_states, total_actions))

    # Reward matrix: one row per joint state, one column per action (pick, deliver, wait).
    # A successful delivery nets 3 (payoff 4 minus cost 1), picking up or a failed
    # delivery costs 1, and waiting is free.
    R = np.matrix([
        [-1, -1, 0],
        [-1, -1, 0],
        [-1,  3, 0],
        [-1, -1, 0],
        [-1,  3, 0],
        [-1, -1, 0],
        [-1,  3, 0],
        [-1,  3, 0]])

    for i in range(iterations):
        # Start each episode in a random state.
        state = np.random.randint(0, total_states)
        available_actions = actions_present(R[state])   # (not used below)

        # States are indexed 0..7, so this loop only exits via the break below.
        while state != 8:
            # actions_select returns a length-1 array, so after the first step
            # `state` is an array and must be unwrapped before indexing.
            if isinstance(state, (np.ndarray, np.generic)):
                action = actions_select(state[0], Policy)
            else:
                action = actions_select(state, Policy)
            Q[state, action] = updateQ(state, action, Q, R)

            C[state] = C[state] + 1
            MeanPolicy = update_meanpi(state, C, MeanPolicy, Policy)
            Policy = update_pi(state, Policy, MeanPolicy, Q, d_win, d_lose)

            next_state = action
            if next_state == end:
                break
            state = next_state
            available_actions = actions_present(R[state])

    print("Final Q values for agent1: \n {}\n".format(Q))
    print("Final Policy for agent1: \n {}\n".format(Policy))
    print("Final Mean Policy for agent1: \n {}\n".format(MeanPolicy))
    return Policy


def agent2():
    Q = np.zeros((total_states, total_actions))
    C = np.zeros(total_states)                                  # visit counts per state
    Policy = np.full((total_states, total_actions), 1.0 / 3)    # start from a uniform policy
    MeanPolicy = np.zeros((total_states, total_actions))

    # Same payoff structure as agent1; the delivery reward sits in different rows
    # because this matrix belongs to Robot2.
    R = np.matrix([
        [-1, -1, 0],
        [-1, -1, 0],
        [-1, -1, 0],
        [-1,  3, 0],
        [-1, -1, 0],
        [-1,  3, 0],
        [-1,  3, 0],
        [-1,  3, 0]])

    for i in range(iterations):
        # Start each episode in a random state.
        state = np.random.randint(0, total_states)
        available_actions = actions_present(R[state])   # (not used below)

        while state != 8:
            if isinstance(state, (np.ndarray, np.generic)):
                action = actions_select(state[0], Policy)
            else:
                action = actions_select(state, Policy)
            Q[state, action] = updateQ(state, action, Q, R)

            C[state] = C[state] + 1
            MeanPolicy = update_meanpi(state, C, MeanPolicy, Policy)
            Policy = update_pi(state, Policy, MeanPolicy, Q, d_win, d_lose)

            next_state = action
            if next_state == end:
                break
            state = next_state
            available_actions = actions_present(R[state])

    print("Final Q values for agent2: \n {}\n".format(Q))
    print("Final Policy for agent2: \n {}\n".format(Policy))
    print("Final Mean Policy for agent2: \n {}\n".format(MeanPolicy))
    return Policy


if __name__ == "__main__":
    policy1 = agent1()
    policy2 = agent2()

    # Greedy action of each agent in every state.
    ne = []
    eq1 = policy1.argmax(axis=1)
    eq2 = policy2.argmax(axis=1)
    eq = np.concatenate((eq1, eq2), axis=0)
    for i in range(len(eq)):
        if eq[i] == 0:
            ne.append('Pick')
        elif eq[i] == 1:
            ne.append('Deliver')
        else:
            ne.append('Wait')

    # Pair up the two agents' greedy actions state by state.
    for i, j in zip(range(0, 8), range(8, len(ne))):
        state_ne = '(' + ne[i] + ',' + ne[j] + ')'
        print("Nash Equilibrium for state {}: {}\n".format(i, state_ne))

--------------------------------------------------------------------------------