├── Images
│   ├── RewardPerEpisode.png
│   ├── grid.PNG
│   └── optimal_solution.PNG
├── Q-Learning Algorithm.py
└── README.md

/Images/RewardPerEpisode.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ronanmmurphy/Q-Learning-Algorithm/3618d6421b8489848db691602848b5e2c1325983/Images/RewardPerEpisode.png
--------------------------------------------------------------------------------
/Images/grid.PNG:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ronanmmurphy/Q-Learning-Algorithm/3618d6421b8489848db691602848b5e2c1325983/Images/grid.PNG
--------------------------------------------------------------------------------
/Images/optimal_solution.PNG:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ronanmmurphy/Q-Learning-Algorithm/3618d6421b8489848db691602848b5e2c1325983/Images/optimal_solution.PNG
--------------------------------------------------------------------------------
/Q-Learning Algorithm.py:
--------------------------------------------------------------------------------
# -*- coding: utf-8 -*-
"""
Created on Sat Mar 28 12:59:23 2020

Assignment 2 - Agents and Reinforcement Learning

@author: Ronan Murphy - 15397831
"""

import numpy as np
import random
import matplotlib.pyplot as plt


# set the number of rows and columns in the grid
BOARD_ROWS = 5
BOARD_COLS = 5

# initialise the start, win and hole (loss) states
START = (0, 0)
WIN_STATE = (4, 4)
HOLE_STATE = [(1, 0), (3, 1), (4, 2), (1, 3)]


# the State class defines the board and determines the reward, the terminal check and the next position
class State:
    def __init__(self, state=START):
        # initialise the state to the start position and the terminal flag to False
        self.state = state
        self.isEnd = False

    def getReward(self):
        # reward for each state: -5 for a hole, +1 for the win state, -1 for all others
        for i in HOLE_STATE:
            if self.state == i:
                return -5
        if self.state == WIN_STATE:
            return 1
        else:
            return -1

    def isEndFunc(self):
        # mark the state as terminal if it is the win state or a hole
        if self.state == WIN_STATE:
            self.isEnd = True

        for i in HOLE_STATE:
            if self.state == i:
                self.isEnd = True

    def nxtPosition(self, action):
        # compute the position reached from the current state by the action - up, down, left, right
        if action == 0:
            nxtState = (self.state[0] - 1, self.state[1])  # up
        elif action == 1:
            nxtState = (self.state[0] + 1, self.state[1])  # down
        elif action == 2:
            nxtState = (self.state[0], self.state[1] - 1)  # left
        else:
            nxtState = (self.state[0], self.state[1] + 1)  # right

        # check that the next state lies inside the grid
        if (nxtState[0] >= 0) and (nxtState[0] <= BOARD_ROWS - 1):
            if (nxtState[1] >= 0) and (nxtState[1] <= BOARD_COLS - 1):
                # if it does, move to the next state
                return nxtState
        # return the current state if the move would leave the grid
        return self.state


# the Agent class implements Q-learning over the grid
class Agent:

    def __init__(self):
        # initialise states and actions
        self.states = []
        self.actions = [0, 1, 2, 3]  # up, down, left, right
        self.State = State()
        # set the learning rate, discount factor and exploration rate
        self.alpha = 0.5
        self.gamma = 0.9
        self.epsilon = 0.1
        self.isEnd = self.State.isEnd

        # array to retain reward values for the plot
        self.plot_reward = []

        # initialise the Q-values as dictionaries for the current and new tables
        self.Q = {}
        self.new_Q = {}
        # initialise the accumulated reward to 0
        self.rewards = 0

        # initialise all Q-values across the board to 0 and print them
        for i in range(BOARD_ROWS):
            for j in range(BOARD_COLS):
                for k in range(len(self.actions)):
                    self.Q[(i, j, k)] = 0
                    self.new_Q[(i, j, k)] = 0

        print(self.Q)

    # choose an action with the epsilon-greedy policy and return the next position
    def Action(self):
        # random value to compare against epsilon
        rnd = random.random()
        # arbitrary low value to compare with Q-values when finding the max
        mx_nxt_reward = -10
        action = None

        # with probability 1 - epsilon (9 times out of 10) exploit: take the action with the max Q-value
        if rnd > self.epsilon:
            # iterate through the actions, look up each Q-value and choose the best
            for k in self.actions:
                i, j = self.State.state
                nxt_reward = self.Q[(i, j, k)]
                if nxt_reward >= mx_nxt_reward:
                    action = k
                    mx_nxt_reward = nxt_reward
        # otherwise explore: choose a random action
        else:
            action = np.random.choice(self.actions)

        # compute the next state for the chosen action
        position = self.State.nxtPosition(action)
        return position, action

    # Q-learning algorithm
    def Q_Learning(self, episodes):
        x = 0
        # run the algorithm for the given number of episodes
        while x < episodes:
            # check whether the current state is terminal
            if self.isEnd:
                # get the terminal reward and add it to the array for the plot
                reward = self.State.getReward()
                self.rewards += reward
                self.plot_reward.append(self.rewards)

                # assign the terminal reward to every Q-value of this state
                i, j = self.State.state
                for a in self.actions:
                    self.new_Q[(i, j, a)] = round(reward, 3)

                # reset the state to the start position
                self.State = State()
                self.isEnd = self.State.isEnd

                # reset the accumulated reward and move on to the next episode
                self.rewards = 0
                x += 1
            else:
                # arbitrary low value to compare against next-state action values
                mx_nxt_value = -10
                # get the current state, next state, chosen action and current reward
                next_state, action = self.Action()
                i, j = self.State.state
                reward = self.State.getReward()
                # add the reward to the running total for the plot
                self.rewards += reward

                # iterate through the actions to find the max updated Q-value over next-state actions
                for a in self.actions:
                    nxtStateAction = (next_state[0], next_state[1], a)
                    q_value = (1 - self.alpha) * self.Q[(i, j, action)] + self.alpha * (reward + self.gamma * self.Q[nxtStateAction])

                    # keep the largest Q-value
                    if q_value >= mx_nxt_value:
                        mx_nxt_value = q_value

                # the next state becomes the current state; check whether it is terminal
                self.State = State(state=next_state)
                self.State.isEndFunc()
                self.isEnd = self.State.isEnd

                # update the Q-value of the state-action pair with the max value found
                self.new_Q[(i, j, action)] = round(mx_nxt_value, 3)

            # copy the new Q-values into the Q-table
            self.Q = self.new_Q.copy()
        # print the final Q-table
        print(self.Q)

    # plot the cumulative reward per episode
    def plot(self, episodes):
        plt.plot(self.plot_reward)
        plt.show()

    # iterate through the board, find the largest Q-value in each cell and print the grid
    def showValues(self):
        for i in range(0, BOARD_ROWS):
            print('-----------------------------------------------')
            out = '| '
            for j in range(0, BOARD_COLS):
                mx_nxt_value = -10
                for a in self.actions:
                    nxt_value = self.Q[(i, j, a)]
                    if nxt_value >= mx_nxt_value:
                        mx_nxt_value = nxt_value
                out += str(mx_nxt_value).ljust(6) + ' | '
            print(out)
        print('-----------------------------------------------')


if __name__ == "__main__":
    # create an agent, run the Q-learning algorithm for 10,000 episodes, then plot the rewards and show the values
    ag = Agent()
    episodes = 10000
    ag.Q_Learning(episodes)
    ag.plot(episodes)
    ag.showValues()
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# Q-Learning-Algorithm
Reinforcement learning implementation of the deterministic FrozenLake ‘grid world’ problem, in which a Q-learning agent learns a policy to navigate the lake optimally. Python is used to program two classes, which set up the state and the agent respectively. Q-values are assigned to state-action pairs, and the algorithm chooses an action for the current state based on its estimate of this value. The reward and next state produced by that action are then observed, which allows the Q-value to be updated. Over many episodes the algorithm can learn the best path for this problem, as long as the strategy balances exploration and exploitation correctly.

Grid:

![Grid](https://github.com/ronanmmurphy/Q-Learning-Algorithm/blob/main/Images/grid.PNG?raw=true)

Method:
The Q-learning algorithm implements an epsilon-greedy policy, choosing a random action 10% of the time and the action with the highest estimated Q-value otherwise. After each step the Q-value is updated with the formula:

Q(s, a) = (1 - α) * Q(s, a) + α * (Reward + γ * max Q(s', a'))

where α = 0.5 is the learning rate, γ = 0.9 is the discount factor, and the max is taken over the actions a' available in the next state s'. The update is computed for each step in an episode and written back to the Q-table. If the state is an end state, the Q-values of that state are set to the terminal reward (-5 for a loss, +1 for a win) and the state is reset to (0, 0). A minimal sketch of this selection and update step, separate from the class structure used in the script, is given at the end of this README.

Optimal Solution:

![Optimal](https://github.com/ronanmmurphy/Q-Learning-Algorithm/blob/main/Images/optimal_solution.PNG?raw=true)


Rewards Per Episode: 10,000 episodes were run to observe how the total reward per episode changes over time. Although the agent starts off poorly, it quickly learns the optimal path. Because there is still a 10% chance of taking a random action, the agent never stays permanently on the optimal solution; an additional measure would be to reduce the percentage of exploration over time so that the learned policy is exploited more fully (a sketch of such a decay schedule is given below).

![Rewards Per Episode](https://github.com/ronanmmurphy/Q-Learning-Algorithm/blob/main/Images/RewardPerEpisode.png?raw=true)
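
Q-update sketch:
The following is a minimal, self-contained sketch of the epsilon-greedy selection and tabular Q-update described in the Method section, written separately from the `State`/`Agent` classes in `Q-Learning Algorithm.py`. The grid size, action encoding and hyperparameters mirror the script, but the helper names (`choose_action`, `q_update`) and the example transition are purely illustrative.

```python
import numpy as np

# hyperparameters matching those used in the script
ALPHA, GAMMA, EPSILON = 0.5, 0.9, 0.1
ACTIONS = [0, 1, 2, 3]  # up, down, left, right
BOARD_ROWS, BOARD_COLS = 5, 5


def choose_action(Q, state, epsilon=EPSILON):
    """Epsilon-greedy selection: a random action with probability epsilon,
    otherwise the action with the highest estimated Q-value."""
    i, j = state
    if np.random.random() < epsilon:
        return int(np.random.choice(ACTIONS))
    return int(np.argmax([Q[(i, j, a)] for a in ACTIONS]))


def q_update(Q, state, action, reward, next_state, alpha=ALPHA, gamma=GAMMA):
    """Tabular Q-learning update:
    Q(s, a) <- (1 - alpha) * Q(s, a) + alpha * (reward + gamma * max_a' Q(s', a'))."""
    i, j = state
    ni, nj = next_state
    best_next = max(Q[(ni, nj, a)] for a in ACTIONS)
    Q[(i, j, action)] = (1 - alpha) * Q[(i, j, action)] + alpha * (reward + gamma * best_next)


# initialise a 5x5 Q-table to zero, using the same (row, col, action) keys as the script
Q = {(i, j, a): 0.0 for i in range(BOARD_ROWS) for j in range(BOARD_COLS) for a in ACTIONS}

# example: one hypothetical transition from (0, 0) moving right (action 3) into (0, 1) with a step reward of -1
q_update(Q, (0, 0), action=3, reward=-1, next_state=(0, 1))
print(choose_action(Q, (0, 0)))  # epsilon-greedy choice for the start state after the update
```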
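
Epsilon decay sketch:
One way to change the exploration percentage over time, as suggested above, is to decay epsilon after each episode towards a small floor, so that early episodes explore widely and later episodes mostly exploit the learned policy. This is not implemented in the script; the start value, floor and decay rate below are illustrative assumptions.

```python
# exponential decay of the exploration rate towards a minimum value (illustrative numbers)
EPSILON_START = 1.0   # explore heavily in early episodes
EPSILON_MIN = 0.01    # never stop exploring entirely
DECAY_RATE = 0.999    # multiplicative decay applied once per episode

epsilon = EPSILON_START
for episode in range(10000):
    # ... run one episode, selecting actions epsilon-greedily with the current epsilon ...
    epsilon = max(EPSILON_MIN, epsilon * DECAY_RATE)
```
--------------------------------------------------------------------------------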