├── Images
│   ├── RewardPerEpisode.png
│   ├── grid.PNG
│   └── optimal_solution.PNG
├── Q-Learning Algorithm.py
└── README.md

/Images/RewardPerEpisode.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ronanmmurphy/Q-Learning-Algorithm/3618d6421b8489848db691602848b5e2c1325983/Images/RewardPerEpisode.png
--------------------------------------------------------------------------------
/Images/grid.PNG:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ronanmmurphy/Q-Learning-Algorithm/3618d6421b8489848db691602848b5e2c1325983/Images/grid.PNG
--------------------------------------------------------------------------------
/Images/optimal_solution.PNG:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ronanmmurphy/Q-Learning-Algorithm/3618d6421b8489848db691602848b5e2c1325983/Images/optimal_solution.PNG
--------------------------------------------------------------------------------
/Q-Learning Algorithm.py:
--------------------------------------------------------------------------------
# -*- coding: utf-8 -*-
"""
Created on Sat Mar 28 12:59:23 2020

Assignment 2 - Agents and Reinforcement Learning

@author: Ronan Murphy - 15397831
"""

import numpy as np
import random
import matplotlib.pyplot as plt


# set the number of rows and columns in the grid
BOARD_ROWS = 5
BOARD_COLS = 5

# initialise the start, win and hole (loss) states
START = (0, 0)
WIN_STATE = (4, 4)
HOLE_STATE = [(1, 0), (3, 1), (4, 2), (1, 3)]


# the State class defines the board and determines the reward, the terminal check and the next position
class State:
    def __init__(self, state=START):
        # initialise the state to the start position and the terminal flag to False
        self.state = state
        self.isEnd = False

    def getReward(self):
        # reward for each state: -5 for a hole, +1 for the win state, -1 for all others
        for i in HOLE_STATE:
            if self.state == i:
                return -5
        if self.state == WIN_STATE:
            return 1
        else:
            return -1

    def isEndFunc(self):
        # mark the state as terminal if it is the win state or a hole
        if self.state == WIN_STATE:
            self.isEnd = True

        for i in HOLE_STATE:
            if self.state == i:
                self.isEnd = True

    def nxtPosition(self, action):
        # compute the position reached from the current state by the action - up, down, left, right
        if action == 0:
            nxtState = (self.state[0] - 1, self.state[1])  # up
        elif action == 1:
            nxtState = (self.state[0] + 1, self.state[1])  # down
        elif action == 2:
            nxtState = (self.state[0], self.state[1] - 1)  # left
        else:
            nxtState = (self.state[0], self.state[1] + 1)  # right

        # check that the next state lies inside the grid
        if (nxtState[0] >= 0) and (nxtState[0] <= BOARD_ROWS - 1):
            if (nxtState[1] >= 0) and (nxtState[1] <= BOARD_COLS - 1):
                # if it does, move to the next state
                return nxtState
        # return the current state if the move would leave the grid
        return self.state


# the Agent class implements Q-learning over the grid
class Agent:

    def __init__(self):
        # initialise states and actions
        self.states = []
        self.actions = [0, 1, 2, 3]  # up, down, left, right
        self.State = State()
        # set the learning rate, discount factor and exploration rate
        self.alpha = 0.5
        self.gamma = 0.9
        self.epsilon = 0.1
        self.isEnd = self.State.isEnd

        # array to retain reward values for the plot
        self.plot_reward = []

        # initialise the Q-values as dictionaries for the current and new tables
        self.Q = {}
        self.new_Q = {}
        # initialise the accumulated reward to 0
        self.rewards = 0

        # initialise all Q-values across the board to 0 and print them
        for i in range(BOARD_ROWS):
            for j in range(BOARD_COLS):
                for k in range(len(self.actions)):
                    self.Q[(i, j, k)] = 0
                    self.new_Q[(i, j, k)] = 0

        print(self.Q)

    # choose an action with the epsilon-greedy policy and return the next position
    def Action(self):
        # random value to compare against epsilon
        rnd = random.random()
        # arbitrary low value to compare with Q-values when finding the max
        mx_nxt_reward = -10
        action = None

        # with probability 1 - epsilon (9 times out of 10) exploit: take the action with the max Q-value
        if rnd > self.epsilon:
            # iterate through the actions, look up each Q-value and choose the best
            for k in self.actions:
                i, j = self.State.state
                nxt_reward = self.Q[(i, j, k)]
                if nxt_reward >= mx_nxt_reward:
                    action = k
                    mx_nxt_reward = nxt_reward
        # otherwise explore: choose a random action
        else:
            action = np.random.choice(self.actions)

        # compute the next state for the chosen action
        position = self.State.nxtPosition(action)
        return position, action

    # Q-learning algorithm
    def Q_Learning(self, episodes):
        x = 0
        # run the algorithm for the given number of episodes
        while x < episodes:
            # check whether the current state is terminal
            if self.isEnd:
                # get the terminal reward and add it to the array for the plot
                reward = self.State.getReward()
                self.rewards += reward
                self.plot_reward.append(self.rewards)

                # assign the terminal reward to every Q-value of this state
                i, j = self.State.state
                for a in self.actions:
                    self.new_Q[(i, j, a)] = round(reward, 3)

                # reset the state to the start position
                self.State = State()
                self.isEnd = self.State.isEnd

                # reset the accumulated reward and move on to the next episode
                self.rewards = 0
                x += 1
            else:
                # arbitrary low value to compare against next-state action values
                mx_nxt_value = -10
                # get the current state, next state, chosen action and current reward
                next_state, action = self.Action()
                i, j = self.State.state
                reward = self.State.getReward()
                # add the reward to the running total for the plot
                self.rewards += reward

                # iterate through the actions to find the max updated Q-value over next-state actions
                for a in self.actions:
                    nxtStateAction = (next_state[0], next_state[1], a)
                    q_value = (1 - self.alpha) * self.Q[(i, j, action)] + self.alpha * (reward + self.gamma * self.Q[nxtStateAction])

                    # keep the largest Q-value
                    if q_value >= mx_nxt_value:
                        mx_nxt_value = q_value

                # the next state becomes the current state; check whether it is terminal
                self.State = State(state=next_state)
                self.State.isEndFunc()
                self.isEnd = self.State.isEnd

                # update the Q-value of the state-action pair with the max value found
                self.new_Q[(i, j, action)] = round(mx_nxt_value, 3)

            # copy the new Q-values into the Q-table
            self.Q = self.new_Q.copy()
        # print the final Q-table
        print(self.Q)

    # plot the cumulative reward per episode
    def plot(self, episodes):
        plt.plot(self.plot_reward)
        plt.show()

    # iterate through the board, find the largest Q-value in each cell and print the grid
    def showValues(self):
        for i in range(0, BOARD_ROWS):
            print('-----------------------------------------------')
            out = '| '
            for j in range(0, BOARD_COLS):
                mx_nxt_value = -10
                for a in self.actions:
                    nxt_value = self.Q[(i, j, a)]
                    if nxt_value >= mx_nxt_value:
                        mx_nxt_value = nxt_value
                out += str(mx_nxt_value).ljust(6) + ' | '
            print(out)
        print('-----------------------------------------------')


if __name__ == "__main__":
    # create an agent, run the Q-learning algorithm for 10,000 episodes, then plot the rewards and show the values
    ag = Agent()
    episodes = 10000
    ag.Q_Learning(episodes)
    ag.plot(episodes)
    ag.showValues()
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# Q-Learning-Algorithm
Reinforcement learning implementation of the deterministic FrozenLake ‘grid world’ problem, in which a Q-learning agent learns a policy to navigate the lake optimally. Python is used to program two classes, which set up the state and the agent respectively. Q-values are assigned to state-action pairs, and the algorithm chooses an action for the current state based on its estimate of this value. The reward and next state produced by that action are then observed, which allows the Q-value to be updated. Over many episodes the algorithm can learn the best path for this problem, as long as the strategy balances exploration and exploitation correctly.

Grid:

![Grid](https://github.com/ronanmmurphy/Q-Learning-Algorithm/blob/main/Images/grid.PNG?raw=true)

Method:
The Q-learning algorithm implements an epsilon-greedy policy, choosing a random action 10% of the time and the action with the highest estimated Q-value otherwise. After each step the Q-value is updated with the formula:

Q(s, a) = (1 - α) * Q(s, a) + α * (Reward + γ * max Q(s', a'))

where α = 0.5 is the learning rate, γ = 0.9 is the discount factor, and the max is taken over the actions a' available in the next state s'. The update is computed for each step in an episode and written back to the Q-table. If the state is an end state, the Q-values of that state are set to the terminal reward (-5 for a loss, +1 for a win) and the state is reset to (0, 0). A minimal sketch of this selection and update step, separate from the class structure used in the script, is given at the end of this README.

Optimal Solution:

![Optimal](https://github.com/ronanmmurphy/Q-Learning-Algorithm/blob/main/Images/optimal_solution.PNG?raw=true)


Rewards Per Episode: 10,000 episodes were run to observe how the total reward per episode changes over time. Although the agent starts off poorly, it quickly learns the optimal path. Because there is still a 10% chance of taking a random action, the agent never stays permanently on the optimal solution; an additional measure would be to reduce the percentage of exploration over time so that the learned policy is exploited more fully (a sketch of such a decay schedule is given below).

![Rewards Per Episode](https://github.com/ronanmmurphy/Q-Learning-Algorithm/blob/main/Images/RewardPerEpisode.png?raw=true)
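
Q-update sketch:
The following is a minimal, self-contained sketch of the epsilon-greedy selection and tabular Q-update described in the Method section, written separately from the `State`/`Agent` classes in `Q-Learning Algorithm.py`. The grid size, action encoding and hyperparameters mirror the script, but the helper names (`choose_action`, `q_update`) and the example transition are purely illustrative.

```python
import numpy as np

# hyperparameters matching those used in the script
ALPHA, GAMMA, EPSILON = 0.5, 0.9, 0.1
ACTIONS = [0, 1, 2, 3]  # up, down, left, right
BOARD_ROWS, BOARD_COLS = 5, 5


def choose_action(Q, state, epsilon=EPSILON):
    """Epsilon-greedy selection: a random action with probability epsilon,
    otherwise the action with the highest estimated Q-value."""
    i, j = state
    if np.random.random() < epsilon:
        return int(np.random.choice(ACTIONS))
    return int(np.argmax([Q[(i, j, a)] for a in ACTIONS]))


def q_update(Q, state, action, reward, next_state, alpha=ALPHA, gamma=GAMMA):
    """Tabular Q-learning update:
    Q(s, a) <- (1 - alpha) * Q(s, a) + alpha * (reward + gamma * max_a' Q(s', a'))."""
    i, j = state
    ni, nj = next_state
    best_next = max(Q[(ni, nj, a)] for a in ACTIONS)
    Q[(i, j, action)] = (1 - alpha) * Q[(i, j, action)] + alpha * (reward + gamma * best_next)


# initialise a 5x5 Q-table to zero, using the same (row, col, action) keys as the script
Q = {(i, j, a): 0.0 for i in range(BOARD_ROWS) for j in range(BOARD_COLS) for a in ACTIONS}

# example: one hypothetical transition from (0, 0) moving right (action 3) into (0, 1) with a step reward of -1
q_update(Q, (0, 0), action=3, reward=-1, next_state=(0, 1))
print(choose_action(Q, (0, 0)))  # epsilon-greedy choice for the start state after the update
```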
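
Epsilon decay sketch:
One way to change the exploration percentage over time, as suggested above, is to decay epsilon after each episode towards a small floor, so that early episodes explore widely and later episodes mostly exploit the learned policy. This is not implemented in the script; the start value, floor and decay rate below are illustrative assumptions.

```python
# exponential decay of the exploration rate towards a minimum value (illustrative numbers)
EPSILON_START = 1.0   # explore heavily in early episodes
EPSILON_MIN = 0.01    # never stop exploring entirely
DECAY_RATE = 0.999    # multiplicative decay applied once per episode

epsilon = EPSILON_START
for episode in range(10000):
    # ... run one episode, selecting actions epsilon-greedily with the current epsilon ...
    epsilon = max(EPSILON_MIN, epsilon * DECAY_RATE)
```
--------------------------------------------------------------------------------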