├── Unit6-REINFORCE ├── images │ ├── CartPole-v0.png │ └── LunarLander-v2.png ├── utils.py ├── README.md ├── main.py └── REINFORCE.py ├── Unit5-Deep-Q-Networks ├── MountainCar_success.pt ├── play.py ├── readme.md ├── utils.py ├── main_dqn.py └── dqn.py ├── Blog.md ├── README.md ├── Unit4-Temporal-Difference-Methods ├── readme.md ├── QLearning.ipynb └── SARSA.ipynb ├── Unit3-Monte-Carlo ├── readme.md ├── utils.py └── BlackJack.py ├── Sutton_and_Barton.md ├── Unit2-Bellman-Equations ├── readme.md ├── main.py └── BlobEnvironment.py └── Unit1-Multi-Armed-Bandits ├── utils.py ├── readme.md ├── bandits.py ├── main.py └── agents.py /Unit6-REINFORCE/images/CartPole-v0.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/RajGhugare19/Classical-RL/HEAD/Unit6-REINFORCE/images/CartPole-v0.png -------------------------------------------------------------------------------- /Unit6-REINFORCE/images/LunarLander-v2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/RajGhugare19/Classical-RL/HEAD/Unit6-REINFORCE/images/LunarLander-v2.png -------------------------------------------------------------------------------- /Unit5-Deep-Q-Networks/MountainCar_success.pt: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/RajGhugare19/Classical-RL/HEAD/Unit5-Deep-Q-Networks/MountainCar_success.pt -------------------------------------------------------------------------------- /Blog.md: -------------------------------------------------------------------------------- 1 | 2 | ## My Blog 3 | 4 | * [Nuts and Bolts of RL](https://hackmd.io/@Raj-Ghugare/r1ttq0PNw) 5 | * [Multi-Armed Bandits and how to solve them](https://hackmd.io/@Raj-Ghugare/rkkk1XCVw) 6 | * [Policy Gradient theorem](https://hackmd.io/@Raj-Ghugare/rygKPUD08) 7 | * [REINFORCE what you've learnt](https://hackmd.io/@Raj-Ghugare/BJGFOdmCL) 8 | -------------------------------------------------------------------------------- /Unit6-REINFORCE/utils.py: -------------------------------------------------------------------------------- 1 | import matplotlib.pyplot as plt 2 | import time 3 | import numpy as np 4 | 5 | def plot_score(score_history,exp,save=False): 6 | score = np.array(score_history) 7 | iters = np.arange(len(score_history)) 8 | plt.plot(iters,score,label=exp) 9 | plt.xlabel('training iterations') 10 | plt.ylabel('Total scores obtained') 11 | plt.title('REINFORCE ' + exp) 12 | if(save): 13 | plt.savefig('./images/'+exp+'.png') 14 | plt.legend() 15 | plt.show() 16 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Classical reinforcement learning 2 | After a year of exploring the field as a beginner, I have settled on a good path for starting to learn reinforcement learning from a researcher's point of view. This repository shares my RL journey and the resources I used, so that you can get an insight into how to structure your own learning. 3 | 4 | This repo has: 5 | 6 | - [x] Python implementations of various topics in sequential order. 7 | - [x] Free resources that I found to be helpful (video lecture links, online articles and blogs).
8 | 9 | Please feel free to open up a PR if you think you have more sources or notes to share :) 10 | 11 | Note: 12 | 13 | The Python implementations of the algorithms are not optimized and are meant for learning purposes only. 14 | -------------------------------------------------------------------------------- /Unit6-REINFORCE/README.md: -------------------------------------------------------------------------------- 1 | 2 | # REINFORCE 3 | 4 | REINFORCE is a vanilla policy gradient approach to RL problems. This algorithm is implemented successfully on the following problems from OpenAI Gym. 5 | 6 | ### Results: 7 | 8 | #### CartPole-v0 9 | 10 | ![](./images/CartPole-v0.png) 11 | 12 | #### LunarLander-v2 13 | 14 | ![](./images/LunarLander-v2.png) 15 | 16 | ### Observations: 17 | In expectation this method behaves as the theory predicts, but individual training runs occasionally settle on sub-optimal policies. Because the data generated at each iteration depends on the current policy, results can vary noticeably between runs, so this technique should not be relied on in high-stakes situations. 18 | 19 | ### Dependencies: 20 | 21 | * [OpenAI Gym](https://gym.openai.com/) 22 | * [PyTorch](https://pytorch.org/) 23 | 24 | -------------------------------------------------------------------------------- /Unit4-Temporal-Difference-Methods/readme.md: -------------------------------------------------------------------------------- 1 | # Temporal-difference model-free methods 2 | Temporal-difference methods are the heart of reinforcement learning. 3 | 4 | 1. [CS234 Lecture 4](https://www.youtube.com/watch?v=j080VBVGkfQ&list=PLoROMvodv4rOSOPzutgyCTapiGlY2Nd8u&index=4) 5 | 2. [Professor Balaraman Ravindran's RL until the end of week 6](https://nptel.ac.in/courses/106106143/) 6 | 3. [Sutton and Barto, chapter 6](https://web.stanford.edu/class/psych209/Readings/SuttonBartoIPRLBook2ndEd.pdf) 7 | 8 | I have implemented the following algorithms on a custom-made blob environment (with the help of [sentdex's tutorial on how to make a blob env](https://www.youtube.com/watch?v=G92TF4xYQcU&list=PLQVvvaa0QuDezJFIOU5wDdfy4e9vdnx-7&index=4)): 9 | 10 | - [x] SARSA 11 | - [x] Q-Learning 12 | -------------------------------------------------------------------------------- /Unit3-Monte-Carlo/readme.md: -------------------------------------------------------------------------------- 1 | # Monte-Carlo model-free methods 2 | Monte-Carlo methods are the starting point for model-free learning algorithms. 3 | 1. [CS234 Lecture 3 (leave the TD-learning half for now)](https://www.youtube.com/watch?v=dRIhrn8cc9w&list=PLoROMvodv4rOSOPzutgyCTapiGlY2Nd8u&index=3) 4 | 2. [Professor Balaraman Ravindran's RL until week 6, second lecture](https://nptel.ac.in/courses/106106143/) 5 | 3. [Sutton and Barto, chapter 5](https://web.stanford.edu/class/psych209/Readings/SuttonBartoIPRLBook2ndEd.pdf) 6 | 7 | You could also check out the David Silver lecture series, although I found CS234 much better.
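To make the core idea concrete before looking at the code in this unit, below is a minimal first-visit Monte-Carlo prediction sketch. It is not the off-policy control code from BlackJack.py; the fixed "stick on 20 or 21" policy and the episode count are illustrative assumptions, and it uses the classic gym API (`env.reset()` returns an observation, `env.step()` returns a 4-tuple) that the rest of this repo relies on.

```python
# Minimal first-visit Monte-Carlo prediction sketch (illustrative only).
# It estimates V(s) for a fixed policy on Blackjack-v0 by averaging sampled returns.
import gym
from collections import defaultdict

env = gym.make('Blackjack-v0')
GAMMA = 1.0

def fixed_policy(state):
    player_sum, dealer_card, usable_ace = state
    return 0 if player_sum >= 20 else 1          # 0 = stick, 1 = hit

V = defaultdict(float)                           # state-value estimates
visits = defaultdict(int)                        # number of first visits per state

for episode in range(50000):
    states, rewards = [], []
    state = env.reset()
    done = False
    while not done:
        states.append(state)
        state, reward, done, _ = env.step(fixed_policy(state))
        rewards.append(reward)

    G = 0.0
    # walk the episode backwards, accumulating the return
    for t in reversed(range(len(states))):
        G = GAMMA * G + rewards[t]
        if states[t] not in states[:t]:          # first-visit check
            visits[states[t]] += 1
            # incremental average of the returns observed from this state
            V[states[t]] += (G - V[states[t]]) / visits[states[t]]
```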
8 | 9 | I have used the [blackjack environment](https://github.com/openai/gym/blob/master/gym/envs/toy_text/blackjack.py) of OpenAI Gym 10 | to implement the following algorithms: 11 | 12 | - [ ] On-policy Monte-Carlo (with exploring starts) 13 | - [ ] On-policy Monte-Carlo (without exploring starts) 14 | - [x] Off-policy Monte-Carlo 15 | -------------------------------------------------------------------------------- /Sutton_and_Barton.md: -------------------------------------------------------------------------------- 1 | ## Reinforcement-Learning-An-Introduction-second-edition 2 | 3 | ### Solutions 4 | My solutions to the exercises of [Sutton and Barto's book](https://web.stanford.edu/class/psych209/Readings/SuttonBartoIPRLBook2ndEd.pdf) 5 | * [Chapter 2](https://hackmd.io/@Raj-Ghugare/HkfDBDlfv) 6 | * [Chapter 3](https://hackmd.io/@Raj-Ghugare/HkFPsyXtU) 7 | * [Chapter 4](https://hackmd.io/@Raj-Ghugare/H1IooEiyw) 8 | * [Chapter 5](https://hackmd.io/@Raj-Ghugare/SkaSu3HxD) 9 | * [Chapter 6](https://hackmd.io/@Raj-Ghugare/BkZZ3PaKL) 10 | 11 | ### How to contribute 12 | 13 | #### Corrections 14 | These solutions could contain errors. If you find something wrong, please open a pull request that edits the readme and specifies the mistake. 15 | 16 | #### Solutions to problems which are not included 17 | If you want to provide your own solutions, write them in markdown using [hackmd](https://hackmd.io/?nav=overview) and append them to the readme in your pull request. I will add the solutions if they are appropriate. 18 | -------------------------------------------------------------------------------- /Unit6-REINFORCE/main.py: -------------------------------------------------------------------------------- 1 | import gym 2 | from REINFORCE import Agent 3 | from utils import plot_score 4 | import numpy as np 5 | import torch 6 | from gym import wrappers 7 | 8 | 9 | NAME = "LunarLander-v2" 10 | INPUT_DIMS = [8] 11 | GAMMA = 0.99 12 | N_ACTIONS = 4 13 | N_GAMES = 200 14 | 15 | if __name__ == '__main__': 16 | env = gym.make(NAME) 17 | agent = Agent(lr=0.001, input_dims=INPUT_DIMS, gamma=GAMMA, n_actions=N_ACTIONS, 18 | h1=64, h2=32) 19 | score_history = [] 20 | score = 0 21 | best_score = -1000 22 | 23 | for i in range(N_GAMES): 24 | print('episode: ', i, 'score %.3f' % score) 25 | done = False 26 | score = 0 27 | state = env.reset() 28 | while not done: 29 | action = agent.choose_action(state) 30 | next_state, reward, done, _ = env.step(action) 31 | agent.store_rewards(reward) 32 | state = next_state 33 | score += reward 34 | if(np.mean(score_history[-20:])>best_score and i>20): 35 | torch.save(agent.policy.state_dict(),'./params/'+NAME+'.pt') 36 | best_score = np.mean(score_history[-20:]) 37 | score_history.append(score) 38 | agent.improve() 39 | 40 | plot_score(score_history,NAME,save=True) 41 | -------------------------------------------------------------------------------- /Unit3-Monte-Carlo/utils.py: -------------------------------------------------------------------------------- 1 | import seaborn as sns 2 | import numpy as np 3 | import matplotlib.pyplot as plt 4 | 5 | def plot(target_policy): 6 | usable = np.zeros([11,10]) 7 | non_usable = np.zeros([11,10]) 8 | for i in target_policy: 9 | if i[0]>10: 10 | if i[2]: 11 | usable[i[0]-11,i[1]-1]=target_policy[i] 12 | else: 13 | non_usable[i[0]-11,i[1]-1]=target_policy[i] 14 | usable = np.flip(usable,0) 15 | non_usable = np.flip(non_usable,0) 16 | ax = sns.heatmap(usable, linewidth=0, cbar=False) 17 | plt.xlabel('dealer showing')
18 | plt.ylabel('player sum') 19 | plt.title("Usable ace [Black=Stick,Off-white=Hit]") 20 | plt.yticks(np.arange(11),['21','20','19','18','17','16','15','14','13','12','11']) 21 | plt.xticks(np.arange(0,10)+0.5,['1','2','3','4','5','6','7','8','9','10']) 22 | ax.yaxis.tick_right() 23 | plt.show() 24 | ax = sns.heatmap(non_usable, linewidth=0, cbar=False) 25 | plt.ylabel('player sum') 26 | plt.xlabel('dealer showing') 27 | plt.title("Non Usable ace [Black=Stick,Off-white=Hit]") 28 | plt.yticks(np.arange(11),['21','20','19','18','17','16','15','14','13','12','11']) 29 | plt.xticks(np.arange(0,10)+0.5,['1','2','3','4','5','6','7','8','9','10']) 30 | ax.yaxis.tick_right() 31 | plt.show() 32 | -------------------------------------------------------------------------------- /Unit5-Deep-Q-Networks/play.py: -------------------------------------------------------------------------------- 1 | from dqn import cart_agent 2 | import torch 3 | import gym 4 | import time 5 | import numpy as np 6 | from utils import plot_learning_curve 7 | 8 | device = torch.device('cuda' if torch.cuda.is_available() else 'cpu') 9 | env = gym.make('MountainCar-v0').unwrapped 10 | 11 | if __name__ == '__main__': 12 | player = cart_agent(epsilon=0,eps_decay=0,epsilon_min=0,gamma=0,l_r=0,n_actions=3, 13 | memory=0,batch_size=0,target_update=0,env = env,save = True) 14 | n_games = 3 15 | scores = [] 16 | player.policy_net.load_state_dict(torch.load('/home/raj/My_projects/DQN/MountanCar.pt')) 17 | 18 | for i in range(n_games): 19 | env.reset() 20 | last_screen = player.get_state() 21 | current_screen = player.get_state() 22 | state = current_screen-last_screen 23 | 24 | done = False 25 | score = 0 26 | while not done: 27 | action = player.choose_action(state) 28 | time.sleep(0.05) 29 | _, reward, done, _ = player.env.step(action) 30 | 31 | last_screen = current_screen 32 | current_screen = player.get_state() 33 | 34 | next_state = current_screen - last_screen 35 | score += reward 36 | state = next_state 37 | 38 | 39 | scores.append(score) 40 | print(np.mean(scores)) 41 | plot_learning_curve(i, scores,0) 42 | -------------------------------------------------------------------------------- /Unit5-Deep-Q-Networks/readme.md: -------------------------------------------------------------------------------- 1 | 2 | # Deep Q learning using fixed Q-targets and experience replay 3 | 4 | ## Results 5 | 6 | ### Trained Mountain Car : 7 | ![](https://media.giphy.com/media/dZopKlQbCgEBTPBy8n/giphy.gif) 8 | 9 | ### Trained Cart Pole : 10 | ![](https://media.giphy.com/media/J5Yh1aY9WhlJc4TZFR/giphy.gif) 11 | 12 | ## Abstract: 13 | 14 | Function approximators like neural networks have successfully been combined with reinforcement learning because they can learn useful estimates of the environment from high-dimensional inputs such as audio and images. This is an implementation of "Human-level control through deep reinforcement learning" with some crunch-time tweaks. My implementation was first tested using the low-dimensional state inputs of the CartPole environment from OpenAI Gym. It was then successfully applied to other OpenAI Gym environments, without any major hyper-parameter tuning, using only the high-dimensional sensory (pixel) inputs.
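Below is a condensed sketch of the two ideas named in the title and abstract: an experience-replay buffer and a periodically synced target network. The network architecture and names (`QNet`, `policy_net`, `target_net`, `learn_step`) are assumptions made for this illustration and do not mirror `dqn.py` exactly; the real training loop lives in `dqn.py` and `main_dqn.py`.

```python
# Illustrative sketch of DQN's two stabilisers: experience replay + fixed Q-targets.
import random
from collections import deque

import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F

class QNet(nn.Module):
    def __init__(self, n_inputs, n_actions):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(n_inputs, 64), nn.ReLU(),
                                 nn.Linear(64, n_actions))
    def forward(self, x):
        return self.net(x)

policy_net = QNet(4, 2)                       # e.g. CartPole: 4 state inputs, 2 actions
target_net = QNet(4, 2)
target_net.load_state_dict(policy_net.state_dict())
optimizer = torch.optim.Adam(policy_net.parameters(), lr=1e-3)
replay = deque(maxlen=20000)                  # stores (s, a, r, s', done) tuples
GAMMA, BATCH_SIZE = 0.99, 32

def learn_step():
    if len(replay) < BATCH_SIZE:
        return
    batch = random.sample(replay, BATCH_SIZE)          # break temporal correlations
    s, a, r, s2, d = zip(*batch)
    s = torch.tensor(np.array(s), dtype=torch.float32)
    a = torch.tensor(a, dtype=torch.int64).unsqueeze(1)
    r = torch.tensor(r, dtype=torch.float32)
    s2 = torch.tensor(np.array(s2), dtype=torch.float32)
    d = torch.tensor(d, dtype=torch.float32)

    q_sa = policy_net(s).gather(1, a).squeeze(1)       # Q(s,a) from the online net
    with torch.no_grad():
        q_next = target_net(s2).max(1)[0]              # bootstrap from the frozen net
    target = r + GAMMA * q_next * (1 - d)

    loss = F.smooth_l1_loss(q_sa, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Every few episodes the target network is re-synced with the online network:
# target_net.load_state_dict(policy_net.state_dict())
```

Sampling uniformly from the buffer breaks the correlation between consecutive transitions, and computing the bootstrap target with a frozen copy of the network keeps the regression target from chasing its own updates.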
15 | 16 | 17 | ## Environments: 18 | 19 | - **CartPole** - [https://gym.openai.com/envs/CartPole-v1/] 20 | - **MountainCar** - [https://gym.openai.com/envs/MountainCar-v0/] 21 | 22 | ## Instruction: 23 | 24 | ``` Hyper-parameters tuning for new problems should be done accordingly ``` 25 | ``` The path to save pytorch model checkpoints should be changed ``` 26 | 27 | ## Dependencies: 28 | 29 | - Anaconda: [link](https://docs.anaconda.com/anaconda/install/linux/) 30 | - OpenAi gym: [link](https://gym.openai.com/) 31 | - pytorch: [link](https://pytorch.org/) 32 | 33 | ## References: 34 | 35 | - [Playing Atari with Deep Reinforcement Learning](https://www.cs.toronto.edu/~vmnih/docs/dqn.pdf) 36 | 37 | -------------------------------------------------------------------------------- /Unit5-Deep-Q-Networks/utils.py: -------------------------------------------------------------------------------- 1 | import matplotlib.pyplot as plt 2 | import numpy as np 3 | from IPython.display import clear_output 4 | import matplotlib 5 | import torch 6 | import matplotlib.pyplot as plt 7 | 8 | is_ipython = 'inline' in matplotlib.get_backend() 9 | if is_ipython: 10 | from IPython import display 11 | 12 | def plot_learning_curve(episode, scores, epsilon): 13 | clear_output(True) 14 | plt.figure(figsize=(20,5)) 15 | plt.subplot(131) 16 | plt.title('episode %s. average_reward: %s' % (episode, np.mean(scores[-10:]))) 17 | plt.plot(scores) 18 | plt.subplot(132) 19 | plt.title('epsilon') 20 | plt.plot(epsilon) 21 | plt.show() 22 | 23 | def plot_playing_curve(episode, scores): 24 | clear_output(True) 25 | plt.figure(figsize=(5,5)) 26 | plt.title('episode %s. average_reward: %s' % (episode, np.mean(scores[-10:]))) 27 | plt.plot(scores) 28 | plt.show() 29 | 30 | def plot_durations(scores,pause): 31 | plt.ion() 32 | plt.figure(2) 33 | plt.clf() 34 | 35 | durations_t = torch.tensor(scores, dtype=torch.float) 36 | plt.title('Training...') 37 | plt.xlabel('Episode') 38 | plt.ylabel('Scores') 39 | plt.plot(durations_t.numpy()) 40 | # Take 20 episode averages and plot them too 41 | if len(durations_t) >= 20: 42 | means = durations_t.unfold(0, 20, 1).mean(1).view(-1) 43 | means = torch.cat((torch.zeros(19), means)) 44 | plt.plot(means.numpy()) 45 | 46 | plt.pause(pause) # pause a bit so that plots are updated 47 | if is_ipython: 48 | display.clear_output(wait=True) 49 | display.display(plt.gcf()) 50 | -------------------------------------------------------------------------------- /Unit2-Bellman-Equations/readme.md: -------------------------------------------------------------------------------- 1 | # Bellman equations and solving MDPs 2 | Markov Decision Processes bring in the sequential decision making and delayed reward aspects of RL. 3 | 4 | 1. [Stanford CS234 lecture 2](https://www.youtube.com/watch?v=E3f2Camj0Is&list=PLoROMvodv4rOSOPzutgyCTapiGlY2Nd8u&index=2) 5 | 2. [Professor Balaraman Ravindran's RL week 3,4 and 5th only till policy iteration](https://nptel.ac.in/courses/106106143/) 6 | 3. [Sutton and Barton chapter 3 and 4](https://web.stanford.edu/class/psych209/Readings/SuttonBartoIPRLBook2ndEd.pdf) 7 | 8 | Balaram's lectures are more of classical RL and math intensive, So it is better to watch CS234 first. 9 | 10 | You should implement the algorithms once you understand the proofs. 
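As a warm-up before reading `main.py`, here is a minimal value-iteration sketch on a toy finite MDP. The dictionary-based transition table `P` and reward table `R` are invented for the example; the implementation in this unit instead backs values up directly through `BlobEnvironment`, but the Bellman optimality backup and the stopping rule are the same.

```python
# Minimal value-iteration sketch for a toy 4-state chain MDP (illustrative only).
import numpy as np

n_states, n_actions = 4, 2
GAMMA, EPSILON = 0.9, 1e-6

# P[s][a] = list of (probability, next_state); R[s][a] = expected immediate reward.
# Action 1 moves one state to the right, action 0 stays put; the last state pays 1.
P = {s: {0: [(1.0, s)],
         1: [(1.0, min(s + 1, n_states - 1))]} for s in range(n_states)}
R = {s: {a: (1.0 if s == n_states - 1 else 0.0) for a in range(n_actions)}
     for s in range(n_states)}

V = np.zeros(n_states)
while True:
    V_new = np.zeros(n_states)
    for s in range(n_states):
        # Bellman optimality backup: V(s) = max_a [ R(s,a) + gamma * sum_s' P(s'|s,a) V(s') ]
        V_new[s] = max(R[s][a] + GAMMA * sum(p * V[s2] for p, s2 in P[s][a])
                       for a in range(n_actions))
    # same epsilon-optimality stopping rule that main.py uses
    if np.max(np.abs(V_new - V)) < EPSILON * (1 - GAMMA) / (2 * GAMMA):
        break
    V = V_new

# greedy policy extracted from the converged values
policy = [int(np.argmax([R[s][a] + GAMMA * sum(p * V[s2] for p, s2 in P[s][a])
                         for a in range(n_actions)])) for s in range(n_states)]
print(V, policy)
```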
11 | I have implemented the following : 12 | 13 | - [x] Value iteration 14 | - [x] Policy iteration 15 | - [x] Asynchronous value iteration 16 | - [ ] Real time dynamic programming 17 | 18 | The MDP which I have used is from the example 3.5 - Gridworld from Sutton and Barton.After running the code we can obtain figure 3.15 from the textbook 19 | 20 | ![](https://i.imgur.com/uwnhUyi.png) 21 | 22 | These are my notes on these topics. 23 | * [Bellman Equation](https://hackmd.io/Fuhp2hwyR4GknchLGBGTWw) 24 | * [Bellman Optimality Equation](https://hackmd.io/wqQyQAvlTVeGzLsaVLUswg) 25 | * [Value Iteration](https://hackmd.io/3o8W1o4rS6ikMs42PVXPAw) 26 | * [Policy Iteration](https://hackmd.io/8F3m-j59TB-RaxP3ysVsCg?both) 27 | 28 | 29 | References: 30 | 1. If you want to make your own blob environment then you can watch this [sentdex tutorial](https://www.youtube.com/watch?v=G92TF4xYQcU&list=PLQVvvaa0QuDezJFIOU5wDdfy4e9vdnx-7&index=4). 31 | -------------------------------------------------------------------------------- /Unit1-Multi-Armed-Bandits/utils.py: -------------------------------------------------------------------------------- 1 | ####################################################################### 2 | # Copyright (C) # 3 | # 2020(rajghugare.vnit@gmail.com) # 4 | # Permission given to modify the code as long as you keep this # 5 | # declaration at the top # 6 | ####################################################################### 7 | 8 | import numpy as np 9 | import matplotlib.pyplot as plt 10 | 11 | 12 | def plot_ArmCount(data, num_iters, bandit, k): 13 | x_index = [] 14 | for i in range(k): 15 | x_index.append(str(i)) 16 | location = np.arange(k) 17 | (e_Q, e_regret, e_arm_history, _) = data[bandit]["epsilon_greedy"] 18 | (s_Q, s_regret, s_arm_history, _) = data[bandit]["softmax"] 19 | (u_Q, u_regret, u_arm_history, _) = data[bandit]["UCB"] 20 | (fig, ax) = plt.subplots(1,1) 21 | bar1 = ax.bar(location, e_arm_history, label="epsilon_greedy", fill=False, edgecolor='green') 22 | bar2 = ax.bar(location, s_arm_history, label="softmax", fill=False, edgecolor='red') 23 | bar3 = ax.bar(location, u_arm_history, label="UCB1", fill=False, edgecolor='purple') 24 | ax.set_ylabel('Arm pull history') 25 | ax.set_title('Number of times arm pulled') 26 | ax.set_xticks(location) 27 | ax.set_xticklabels(x_index) 28 | ax.legend() 29 | fig.tight_layout() 30 | plt.show() 31 | 32 | 33 | 34 | def plot_regret(data, num_iters, bandit, player, k): 35 | #Plots regret of any one bandit at a time 36 | (Q, regret, arm_history, _) = data[bandit][player] 37 | t = np.arange(num_iters) 38 | plt.plot(t, regret, color='green', label=player) 39 | plt.xlabel("Time steps") 40 | plt.ylabel("Regret") 41 | plt.legend() 42 | plt.show() 43 | -------------------------------------------------------------------------------- /Unit1-Multi-Armed-Bandits/readme.md: -------------------------------------------------------------------------------- 1 | # Multi-Armed bandits(Immediate RL problems) 2 | Multi-armed bandits are ignored by a lot of people who begin studying RL,but I think that it is the best place to gain a strong mathematical foothold and get an idea of how things would work in a RL problem. 3 | 4 | ### My notes 5 | 6 | * [Overview of the Multi-armed bandit problem](https://hackmd.io/CZQq2azUTMCjt2FF_TQNfQ?view) 7 | * [Regret optimality with UCB1](https://hackmd.io/-DkQQy8DRYezVXDqUaPsYQ) 8 | * [PAC bounds with median elimination](https://hackmd.io/saK7DdqCRnyBfN3HykLhlA) 9 | 10 | ### Hello world of Reinforcement learning. 
11 | 12 | I would strongly advise you to go through the resources listed below. These will be enough for a theoretical study of bandits (at least enough to get a basic understanding of immediate RL). 13 | 14 | 1. [Just go through the introduction from Wikipedia.](https://en.wikipedia.org/wiki/Multi-armed_bandit) 15 | 2. [Professor Balaraman Ravindran's RL week 1 and week 2](https://nptel.ac.in/courses/106106143/) 16 | 3. [Sutton and Barto, chapter 2](https://web.stanford.edu/class/psych209/Readings/SuttonBartoIPRLBook2ndEd.pdf) 17 | 18 | As you watch the lectures, it is a good idea to code the algorithms as you learn them. 19 | 20 | I have implemented the following algorithms in agents.py: 21 | - [X] epsilon-greedy 22 | - [x] softmax 23 | - [x] UCB1 24 | - [x] Median elimination 25 | - [ ] Other variants of UCB 26 | - [ ] Thompson sampling 27 | - [ ] Policy gradient methods 28 | 29 | Results: 30 | ![](https://i.imgur.com/H4u6UaE.png) 31 | 32 | This shows the number of times each algorithm pulled each arm (arms are in ascending order of expected value). 33 | 34 | The unchecked ones I haven't implemented yet (you could/should if you want to). 35 | 36 | After you are done with this, I would recommend going through the notes I made. They summarize the topics briefly; I would suggest you make similar notes of your own. 37 | 38 | 39 | 40 | -------------------------------------------------------------------------------- /Unit5-Deep-Q-Networks/main_dqn.py: -------------------------------------------------------------------------------- 1 | import gym 2 | from dqn import cart_agent 3 | import numpy as np 4 | import torch 5 | from utils import plot_durations 6 | from utils import plot_learning_curve 7 | 8 | device = torch.device('cuda' if torch.cuda.is_available() else 'cpu') 9 | 10 | if __name__ == '__main__': 11 | 12 | env = gym.make('MountainCar-v0') 13 | 14 | A = cart_agent(epsilon=1,eps_decay=0.005,epsilon_min=0.01,gamma=0.99,l_r=0.0001,n_actions=3, 15 | memory=20000,batch_size=32,target_update=7,env=env,save=True) 16 | 17 | scores, avg_score, epsilon_history = [], [], [] 18 | best_score = -np.inf 19 | n_games = 1000 20 | score = 0 21 | 22 | print("Save is currently 
", A.save) 23 | 24 | for i in range(n_games): 25 | A.env.reset() 26 | last_screen = A.get_state() 27 | current_screen = A.get_state() 28 | state = current_screen-last_screen 29 | 30 | done = False 31 | score = 0 32 | 33 | if i%20==0 and i>0: 34 | plot_durations(scores, 0.001) 35 | print('----------------- training --------------------') 36 | print('epsiode number', i) 37 | print("Average score ",avg_score[-1]) 38 | print('----------------- training --------------------') 39 | 40 | while not done: 41 | action = A.choose_action(state) 42 | 43 | _, reward, done, _ = A.env.step(action) 44 | 45 | last_screen = current_screen 46 | current_screen = A.get_state() 47 | 48 | next_state = current_screen - last_screen 49 | 50 | A.store_experience(state,action,reward,done,next_state) 51 | A.learn_with_experience_replay() 52 | 53 | score += reward 54 | state = next_state 55 | 56 | scores.append(score) 57 | if i>30: 58 | avg_score.append(np.mean(scores[-30:])) 59 | else: 60 | avg_score.append(np.mean(scores)) 61 | 62 | if avg_score[-1] > best_score: 63 | torch.save(A.policy_net.state_dict(),'/home/raj/My_projects/DQN/MountanCar.pt') 64 | best_score = avg_score[-1] 65 | print("***************\ncurrent best average score is "+ str(best_score) +"\n***************") 66 | 67 | if i%A.target_update == 0: 68 | A.target_net.load_state_dict(A.policy_net.state_dict()) 69 | 70 | A.epsilon_decay() 71 | 72 | plot_durations(scores,5) 73 | -------------------------------------------------------------------------------- /Unit6-REINFORCE/REINFORCE.py: -------------------------------------------------------------------------------- 1 | import torch as T 2 | import torch.nn as nn 3 | import torch.nn.functional as F 4 | import torch.optim as optim 5 | import numpy as np 6 | 7 | class Policy(nn.Module): 8 | def __init__(self, lr, input_dims, h1, h2, n_actions): 9 | super(Policy,self).__init__() 10 | self.input_dims = input_dims 11 | self.lr = lr 12 | self.h1 = h1 13 | self.h2 = h2 14 | self.n_actions = n_actions 15 | self.linear1 = nn.Linear(*self.input_dims, self.h1) 16 | self.linear2 = nn.Linear(self.h1, self.h2) 17 | self.linear3 = nn.Linear(self.h2, self.n_actions) 18 | self.optimizer = optim.Adam(self.parameters(), lr=lr) 19 | 20 | self.device = T.device('cuda:0' if T.cuda.is_available() else 'cpu:0') 21 | self.to(self.device) 22 | 23 | def forward(self,obs): 24 | x = T.tensor(obs,dtype=T.float).to(self.device) 25 | x = F.relu(self.linear1(x)) 26 | x = F.relu(self.linear2(x)) 27 | x = self.linear3(x) 28 | 29 | return x 30 | 31 | class Agent(object): 32 | def __init__(self, lr, input_dims, gamma=0.99, n_actions=2, h1=128, h2=128): 33 | self.gamma = gamma 34 | self.reward_memory = [] 35 | self.action_memory = [] 36 | self.policy = Policy(lr, input_dims, h1, h2, n_actions) 37 | 38 | def choose_action(self, observation): 39 | probs = F.softmax(self.policy(observation),dim=0) 40 | action_probs = T.distributions.Categorical(probs) 41 | action = action_probs.sample() 42 | log_probs = T.log(probs[action]) 43 | self.action_memory.append(log_probs) 44 | 45 | return action.item() 46 | 47 | def store_rewards(self, reward): 48 | self.reward_memory.append(reward) 49 | 50 | def improve(self): 51 | self.policy.optimizer.zero_grad() 52 | G = np.zeros_like(self.reward_memory, dtype=np.float64) 53 | for t in range(len(self.reward_memory)): 54 | g_sum = 0 55 | disc = 1 56 | for i in range(t, len(self.reward_memory)): 57 | g_sum += self.reward_memory[i]*disc 58 | disc *= self.gamma 59 | G[t] = g_sum 60 | G = (G - np.mean(G))/(np.std(G) if 
np.std(G) > 0 else 1) 61 | 62 | G = T.tensor(G, dtype=T.float).to(self.policy.device) 63 | 64 | loss = 0 65 | for g,log_prob in zip(G, self.action_memory): 66 | loss += -g * log_prob 67 | 68 | loss.backward() 69 | self.policy.optimizer.step() 70 | 71 | self.action_memory = [] 72 | self.reward_memory = [] 73 | -------------------------------------------------------------------------------- /Unit1-Multi-Armed-Bandits/bandits.py: -------------------------------------------------------------------------------- 1 | 2 | ####################################################################### 3 | # Copyright (C) # 4 | # 2020(rajghugare.vnit@gmail.com) # 5 | # Permission given to modify the code as long as you keep this # 6 | # declaration at the top # 7 | ####################################################################### 8 | import numpy as np 9 | import random 10 | 11 | 12 | class GaussianStationaryBandit(object): 13 | def __init__(self, k, mu, sigma): 14 | self.qstar = np.array(mu) #ndarray expected payoff 15 | self.sigma = np.array(sigma) #ndarray standard deviation of payoff 16 | self.arm_history = np.zeros(k) #initializing history of all arms taken 17 | self.regret = [] #initializing history of regret after every action 18 | self.best_payoff = np.max(self.qstar) 19 | self.best_arm = np.argmax(self.qstar) 20 | self.num_arms = k 21 | self.rew = 0 22 | 23 | def pull(self, arm): 24 | self.arm_history[arm] += 1 25 | self.regret.append(self.best_payoff - self.qstar[arm]) 26 | reward = random.gauss(self.qstar[arm], self.sigma[arm]) 27 | self.reward += reward 28 | return reward 29 | 30 | def get_ArmHistory(self): 31 | return self.arm_history 32 | 33 | def get_regret(self): 34 | return self.regret 35 | 36 | def get_BestArm(self): 37 | return self.best_arm 38 | 39 | def reset(self): 40 | self.arm_history = np.zeros(self.num_arms) 41 | self.regret = [] 42 | self.re = 0 43 | 44 | def get_total_reward(self): 45 | return self.reward 46 | 47 | class BernoulliStationaryBandit(object): 48 | def __init__(self, k, mu): 49 | self.qstar = np.array(mu) #ndarray expected payoff 50 | self.arm_history = np.zeros(k) #initializing history of all arms taken 51 | self.regret = [] #initializing history of regret after every action 52 | self.best_payoff = np.max(self.qstar) 53 | self.best_arm = np.argmax(self.qstar) 54 | self.num_arms = k 55 | self.reward = 0 56 | 57 | def pull(self, arm): 58 | self.arm_history[arm] += 1 59 | self.regret.append(self.best_payoff - self.qstar[arm]) 60 | reward = np.random.choice([1,0], p = [self.qstar[arm], 1-self.qstar[arm]]) 61 | self.reward += reward 62 | return reward 63 | 64 | def get_ArmHistory(self): 65 | return self.arm_history 66 | 67 | def get_regret(self): 68 | return self.regret 69 | 70 | def get_BestArm(self): 71 | return self.best_arm 72 | 73 | def reset(self): 74 | self.arm_history = np.zeros(self.num_arms) 75 | self.regret = [] 76 | self.reward = 0 77 | 78 | def get_total_reward(self): 79 | return self.reward 80 | -------------------------------------------------------------------------------- /Unit1-Multi-Armed-Bandits/main.py: -------------------------------------------------------------------------------- 1 | ####################################################################### 2 | # Copyright (C) # 3 | # 2020(rajghugare.vnit@gmail.com) # 4 | # Permission given to modify the code as long as you keep this # 5 | # declaration at the top # 6 | ####################################################################### 7 | 8 | from agents import epsilon_greedy_agent 9 | 
from agents import softmax_agent 10 | from agents import Median_elimination_agent 11 | from agents import UCB 12 | from bandits import GaussianStationaryBandit 13 | from bandits import BernoulliStationaryBandit 14 | from utils import plot_ArmCount 15 | from utils import plot_regret 16 | import numpy as np 17 | 18 | 19 | #gauss_bandit = GaussianStationaryBandit(k, mu, sigma) 20 | #sigma = [1.5, 3.1, 4.1, 1, 0.1 ,2.1 ,1.1 ,0.61 ,.71 ,1] 21 | #epsilon_greedy_player_gaussian = epsilon_greedy_agent(gauss_bandit, 1, num_iters) 22 | #softmax_player_gaussian = softmax_agent(gauss_bandit, 1, num_iters) 23 | 24 | 25 | num_iters = 5000 26 | 27 | k = 10 28 | mu = np.array([0.1,0.5,0.7,0.73,0.756,0.789,0.81,0.83,0.855,0.865]) 29 | mu = np.arange(10)*0.1 30 | bernoulli_bandit = BernoulliStationaryBandit(k , mu) 31 | 32 | #initializing all the players 33 | epsilon_greedy_player_bernoulli = epsilon_greedy_agent(bernoulli_bandit, 1, num_iters) 34 | softmax_player_bernoulli = softmax_agent(bernoulli_bandit, 0.1, num_iters) 35 | median_elimination_player = Median_elimination_agent(bernoulli_bandit, epsilon=0.1, delta=0.1) 36 | UCB_player = UCB(bernoulli_bandit,num_iters) 37 | 38 | def play_UCB(): 39 | data["bernoulli_bandit"]["UCB"] = UCB_player.play() 40 | data["bernoulli_bandit"]["UCB"] = UCB_player.play() 41 | plot_regret(data, num_iters, "bernoulli_bandit", "UCB", k) 42 | plot_ArmCount(data, num_iters, "bernoulli_bandit", "UCB", k) 43 | 44 | 45 | def play_median_elimination(): 46 | data["bernoulli_bandit"]["median_elimination"] = median_elimination_player.play() 47 | 48 | 49 | def play_epsilon_greedy(): 50 | data["bernoulli_bandit"]["epsilon_greedy"] = epsilon_greedy_player_bernoulli.play() 51 | plot_regret(data, num_iters, "bernoulli_bandit", "epsilon_greedy", k) 52 | plot_ArmCount(data, num_iters, "bernoulli_bandit", "epsilon_greedy", k) 53 | 54 | def play_softmax(): 55 | data["bernoulli_bandit"]["softmax"] = softmax_player_bernoulli.play() 56 | plot_regret(data, num_iters, "bernoulli_bandit", "softmax", k) 57 | plot_ArmCount(data, num_iters, "bernoulli_bandit", "softmax", k) 58 | 59 | 60 | 61 | if __name__ == "__main__" : 62 | 63 | data = {"bernoulli_bandit":{},"gauss_bandit":{}} 64 | data["bernoulli_bandit"]["UCB"] = UCB_player.play() 65 | data["bernoulli_bandit"]["median_elimination"] = median_elimination_player.play() 66 | data["bernoulli_bandit"]["epsilon_greedy"] = epsilon_greedy_player_bernoulli.play() 67 | data["bernoulli_bandit"]["softmax"] = softmax_player_bernoulli.play() 68 | plot_ArmCount(data, num_iters, "bernoulli_bandit", k) 69 | -------------------------------------------------------------------------------- /Unit3-Monte-Carlo/BlackJack.py: -------------------------------------------------------------------------------- 1 | ####################################################################### 2 | # Copyright (C) # 3 | # 2020(rajghugare.vnit@gmail.com) # 4 | # Permission given to modify the code as long as you keep this # 5 | # declaration at the top # 6 | ####################################################################### 7 | 8 | import gym 9 | import numpy as np 10 | import matplotlib.pyplot as plt 11 | import random 12 | from utils import plot 13 | 14 | env= gym.make('Blackjack-v0') 15 | GAMMA = 1 16 | 17 | playerSum = list(np.arange(4,22)) 18 | agentCard = list(np.arange(1,11)) 19 | playerAce = [False,True] 20 | actionSpace = [0,1] 21 | stateSpace = [] 22 | 23 | target_policy = {} 24 | Q = {} 25 | C = {} 26 | for p in playerSum: 27 | for a in agentCard: 28 | for ace in 
playerAce: 29 | stateSpace.append((p,a,ace)) 30 | m = -1 31 | for action in actionSpace: 32 | Q[(p,a,ace),action] = random.random() 33 | if Q[(p,a,ace),action] > m: 34 | m = Q[(p,a,ace),action] 35 | argmax = action 36 | C[(p,a,ace),action] = 0 37 | target_policy[(p,a,ace)] = argmax 38 | 39 | def behaviour_policy(): 40 | r = random.uniform(0,1) 41 | if r<0.5: 42 | return 0 43 | else: 44 | return 1 45 | 46 | 47 | # Monte Carlo Off policy control to find optimal policy 48 | for i in range(1000000): 49 | states = [] 50 | actions = [] 51 | rewards = [] 52 | done = False 53 | states.append(env.reset()) 54 | a = behaviour_policy() 55 | actions.append(a) 56 | while True: 57 | (s,r,done,_) = env.step(a) 58 | rewards.append(r) 59 | if done: 60 | break 61 | a = behaviour_policy() 62 | actions.append(a) 63 | states.append(s) 64 | G = 0 65 | W = 1 #Importance sampling ratio 66 | for i in range(len(states)): 67 | G = G + GAMMA*rewards[-1-i] 68 | C[states[-i-1],actions[-i-1]] += W 69 | Q[states[-i-1],actions[-i-1]] = Q[states[-i-1],actions[-i-1]] + W*(G-Q[states[-i-1],actions[-i-1]])/C[states[-i-1],actions[-i-1]] 70 | m = -1 71 | for action in actionSpace: 72 | if Q[states[-1-i],action] > m: 73 | m = Q[states[-1-i],action] 74 | argmax = action 75 | target_policy[states[-1-i]] = argmax 76 | if actions[-i-1] != argmax: 77 | break 78 | W = W*(1/0.5) 79 | 80 | def play(n): 81 | win = 0 82 | loss = 0 83 | draw = 0 84 | for i in range(n): 85 | score = 0 86 | done = False 87 | s = env.reset() 88 | while not done: 89 | a = target_policy[s] 90 | (s,r,done,_) = env.step(a) 91 | score += r 92 | if score==0: 93 | draw += 1 94 | elif score==1: 95 | win += 1 96 | else: 97 | loss +=1 98 | print(win) 99 | print(loss) 100 | print(draw) 101 | 102 | plot(target_policy) 103 | -------------------------------------------------------------------------------- /Unit2-Bellman-Equations/main.py: -------------------------------------------------------------------------------- 1 | 2 | ####################################################################### 3 | # Copyright (C) # 4 | # 2020(rajghugare.vnit@gmail.com) # 5 | # Permission given to modify the code as long as you keep this # 6 | # declaration at the top # 7 | ####################################################################### 8 | 9 | import numpy as np 10 | from BlobEnvironment import BlobEnvironment 11 | 12 | env = BlobEnvironment() 13 | 14 | EPSILON = 0.01 #this is an optimality factor 15 | GAMMA = 0.9 #As defined in the problem itself 16 | 17 | def value_iteration(): 18 | n = 0 19 | v = np.zeros([5,5]) #This should still be considered as a 25 dimensional vector 20 | v_new = np.zeros([5,5]) #It is in the form of 5 by 5 matrix for better visual undertanding and slightly easier implementation 21 | while True: 22 | for y in range(5): 23 | for x in range(5): 24 | v_temp = np.zeros(4) 25 | for action in range(4): 26 | env.x = x 27 | env.y = y 28 | x_next,y_next,reward = env.step(action) 29 | v_temp[action] = reward + GAMMA*v[y_next,x_next] 30 | v_new[y,x] = np.max(v_temp) 31 | if np.max(np.abs(v - v_new)) < EPSILON*(1-GAMMA)/(2*GAMMA): 32 | env.plot_grid_values(np.round(v_new,decimals=2)) 33 | break 34 | v = np.copy(v_new) 35 | 36 | def policy_iteration(): 37 | n = 0 38 | policy = np.zeros([5,5],dtype = np.uint8) 39 | v = np.zeros([5,5]) 40 | v_new = np.zeros([5,5]) 41 | while True: 42 | #Policy evaluation 43 | while True: 44 | for y in range(5): 45 | for x in range(5): 46 | action = policy[y,x] 47 | env.x = x 48 | env.y = y 49 | x_next,y_next,reward = env.step(action) 50 
| v_new[y,x] = reward + GAMMA*v[y_next,x_next] 51 | if np.max(np.abs(v - v_new)) < EPSILON*(1-GAMMA)/(2*GAMMA): 52 | break 53 | v = np.copy(v_new) 54 | #Policy improvement 55 | new_policy = np.zeros([5,5],dtype=np.uint8) 56 | for y in range(5): 57 | for x in range(5): 58 | v_temp = np.zeros(4) 59 | for action in range(4): 60 | env.x = x 61 | env.y = y 62 | x_next,y_next,reward = env.step(action) 63 | v_temp[action] = reward + GAMMA*v[y_next,x_next] 64 | new_policy[y,x] = np.argmax(v_temp) 65 | if np.array_equal(policy,new_policy): 66 | break 67 | policy = np.copy(new_policy) 68 | env.plot_policy(policy) 69 | return policy 70 | 71 | 72 | def Play_optimally(policy): 73 | (x,y) = env.reset() 74 | for i in range(25): 75 | action = policy[y,x] 76 | (x,y,reward) = env.step(action) 77 | print(reward) 78 | env.render() 79 | 80 | 81 | value_iteration() 82 | 83 | policy = policy_iteration() 84 | Play_optimally(policy) 85 | -------------------------------------------------------------------------------- /Unit2-Bellman-Equations/BlobEnvironment.py: -------------------------------------------------------------------------------- 1 | 2 | ####################################################################### 3 | # Copyright (C) # 4 | # 2020(rajghugare.vnit@gmail.com) # 5 | # Permission given to modify the code as long as you keep this # 6 | # declaration at the top # 7 | ####################################################################### 8 | 9 | import numpy as np 10 | import matplotlib.pyplot as plt 11 | import time 12 | import random 13 | 14 | class Blob(): 15 | def __init__(self, SIZE): 16 | self.size = SIZE 17 | self.x = np.random.randint(0, self.size) 18 | self.y = np.random.randint(0, self.size) 19 | 20 | def __str__(self): 21 | return f"{self.x}, {self.y}" 22 | 23 | class BlobEnvironment(): 24 | def __init__(self): 25 | self.size = 5 26 | self.n_actions = 4 27 | self.player = Blob(self.size) 28 | self.x = self.player.x 29 | self.y = self.player.y 30 | self.color = {"player":(0,0,255)} 31 | self.reward = 0 32 | 33 | def reset(self): 34 | self.x = self.player.x 35 | self.y = self.player.y 36 | return (self.x, self.y) 37 | 38 | def step(self,action=-1): 39 | if action == -1: 40 | print('lolll') 41 | action = self.player.policy() 42 | 43 | if action == 0: #Right 44 | self.move(x=1, y=0) 45 | elif action == 1: #Down 46 | self.move(x=0, y=1) 47 | elif action == 2: #Left 48 | self.move(x=-1, y=0) 49 | elif action == 3: #up 50 | self.move(x=0, y=-1) 51 | return self.x,self.y,self.reward 52 | 53 | def move(self, x, y): 54 | self.reward = 0 55 | if self.x==1 and self.y==0: 56 | self.reward = 10 57 | self.x = 1 58 | self.y = 4 59 | elif self.x==3 and self.y==0: 60 | self.reward = 5 61 | self.x = 3 62 | self.y = 2 63 | else: 64 | self.x += x 65 | self.y += y 66 | if self.x < 0: 67 | self.x = 0 68 | self.reward = -1 69 | elif self.x >= self.size: 70 | self.x = self.size-1 71 | self.reward = -1 72 | 73 | if self.y < 0: 74 | self.y = 0 75 | self.reward = -1 76 | elif self.y >= self.size: 77 | self.y = self.size-1 78 | self.reward = -1 79 | 80 | def render(self, RenderTime = 100): 81 | env = np.ones((self.size,self.size,3), dtype = np.uint8)*255 82 | env[self.y][self.x] = self.color["player"] 83 | plt.xticks(np.arange(-0.5,4.5,1),np.arange(5)) 84 | plt.yticks(np.arange(-0.5,4.5,1),np.arange(5)) 85 | plt.grid('True') 86 | plt.imshow(np.array(env)) 87 | plt.pause(RenderTime/100) 88 | 89 | def sample_actions(self): 90 | return np.random.randint(0, self.n_actions) 91 | 92 | def plot_grid_values(self, values): 93 
| fig, axs = plt.subplots(1,1) 94 | axs.axis('off') 95 | the_table = axs.table(cellText=values,bbox=[0, 0, 1, 1],cellLoc="center") 96 | plt.show() 97 | 98 | def plot_policy(self, policy): 99 | P = [] 100 | for y in range(5): 101 | p = [] 102 | for x in range(5): 103 | if policy[y,x] == 0: 104 | p.append("right") 105 | elif policy[y,x] == 1: 106 | p.append("down") 107 | elif policy[y,x] == 2: 108 | p.append("left") 109 | else: 110 | p.append("up") 111 | P.append(p) 112 | fig, axs = plt.subplots(1,1) 113 | axs.axis('off') 114 | the_table = axs.table(cellText=P,bbox=[0, 0, 1, 1],cellLoc="center") 115 | plt.show() 116 | -------------------------------------------------------------------------------- /Unit1-Multi-Armed-Bandits/agents.py: -------------------------------------------------------------------------------- 1 | ####################################################################### 2 | # Copyright (C) # 3 | # 2020(rajghugare.vnit@gmail.com) # 4 | # Permission given to modify the code as long as you keep this # 5 | # declaration at the top # 6 | ####################################################################### 7 | 8 | import random 9 | import numpy as np 10 | from bandits import BernoulliStationaryBandit 11 | from bandits import GaussianStationaryBandit 12 | 13 | 14 | class epsilon_greedy_agent(BernoulliStationaryBandit): 15 | def __init__(self, bandit, epsilon, num_iters): 16 | self.bandit = bandit 17 | self.epsilon = epsilon 18 | self.num_iters = num_iters 19 | self.Q = np.ones(self.bandit.num_arms) 20 | 21 | def EpsilonGreedy_policy(self): 22 | if random.random()0: 37 | if t > self.num_iters/50: 38 | self.epsilon = 1/t 39 | arm_history = self.bandit.get_ArmHistory() 40 | regret = self.bandit.get_regret() 41 | total_reward = self.bandit.get_total_reward() 42 | return self.Q, regret, arm_history, total_reward 43 | 44 | 45 | class softmax_agent(BernoulliStationaryBandit): 46 | def __init__(self, bandit, beta, num_iters): 47 | self.bandit = bandit 48 | self.beta = beta 49 | self.num_iters = num_iters 50 | self.Q = np.ones(self.bandit.num_arms) 51 | 52 | def Softmax_policy(self): 53 | prob = np.copy(np.exp(self.Q/self.beta)/np.sum(np.exp(self.Q/self.beta))) 54 | arm = np.random.choice(np.arange(self.bandit.num_arms), p = prob) 55 | return arm 56 | 57 | def play(self): 58 | self.bandit.reset() 59 | self.indicator = np.zeros(self.bandit.num_arms) 60 | for t in range(self.num_iters): 61 | arm = self.Softmax_policy() 62 | reward = self.bandit.pull(arm) 63 | self.indicator[arm] += 1 64 | self.Q[arm] = (self.Q[arm]*self.indicator[arm] + reward)/(self.indicator[arm] + 1) 65 | arm_history = self.bandit.get_ArmHistory() 66 | regret = self.bandit.get_regret() 67 | total_reward = self.bandit.get_total_reward() 68 | return self.Q, regret, arm_history, total_reward 69 | 70 | 71 | class UCB(): 72 | def __init__(self, bandit, num_iters): 73 | self.bandit = bandit 74 | self.time = 0 75 | self.Q = np.zeros(self.bandit.num_arms) 76 | self.confidence = np.zeros(self.bandit.num_arms) 77 | self.num_iters = num_iters 78 | 79 | def UCB_policy(self): 80 | arm = np.argmax(np.add(self.Q,self.confidence)) 81 | return arm 82 | 83 | def play(self): 84 | self.bandit.reset() 85 | for i in range(self.bandit.num_arms): 86 | self.Q[i] = self.bandit.pull(i) 87 | self.time +=1 88 | for i in range(self.num_iters-self.bandit.num_arms): 89 | a_h = self.bandit.get_ArmHistory() 90 | self.confidence = np.sqrt(2*np.log(self.time)/a_h) 91 | arm = self.UCB_policy() 92 | reward = self.bandit.pull(arm) 93 | self.Q[arm] = 
(self.Q[arm]*a_h[arm] + reward)/(a_h[arm] + 1) 94 | arm_history = self.bandit.get_ArmHistory() 95 | regret = self.bandit.get_regret() 96 | total_reward = self.bandit.get_total_reward() 97 | return self.Q, regret, arm_history, total_reward 98 | 99 | 100 | class Median_elimination_agent(): 101 | def __init__(self, bandit, epsilon, delta): 102 | self.bandit = bandit 103 | self.epsilon = epsilon/4 104 | self.delta = delta/2 105 | self.Q = np.ones(self.bandit.num_arms) 106 | self.S = np.arange(self.bandit.num_arms) 107 | def play(self): 108 | self.bandit.reset() 109 | self.indicator = np.zeros(self.bandit.num_arms) 110 | while len(self.S) != 1: 111 | for arm in self.S: 112 | count = int(2*np.log(3/self.delta)/np.square(self.epsilon)) 113 | for i in range(count): 114 | reward = self.bandit.pull(arm) 115 | self.indicator[arm] += 1 116 | self.Q[arm] = (self.Q[arm]*self.indicator[arm] + reward)/(self.indicator[arm] + 1) 117 | M = np.median(self.Q[self.S]) 118 | self.S = np.delete(self.S, np.where(self.Q[self.S]self.epsilon_min: 147 | self.epsilon = self.epsilon-self.eps_decay 148 | return self.epsilon 149 | 150 | 151 | print('Done') 152 | -------------------------------------------------------------------------------- /Unit4-Temporal-Difference-Methods/QLearning.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 1, 6 | "metadata": {}, 7 | "outputs": [], 8 | "source": [ 9 | "#Custom Environment\n", 10 | "import numpy as np\n", 11 | "from PIL import Image\n", 12 | "import cv2\n", 13 | "import matplotlib.pyplot as plt\n", 14 | "from matplotlib import style\n", 15 | "import time\n", 16 | "import numpy as np\n", 17 | "import random\n", 18 | "\n", 19 | "\n", 20 | "style.use(\"ggplot\")\n", 21 | "\n", 22 | "\n", 23 | "class Blob():\n", 24 | " def __init__(self, SIZE = 10):\n", 25 | " self.size = SIZE\n", 26 | " self.x = np.random.randint(0, SIZE)\n", 27 | " self.y = np.random.randint(0, SIZE)\n", 28 | "\n", 29 | " def __str__(self):\n", 30 | " return f\"{self.x}, {self.y}\"\n", 31 | "\n", 32 | " def __sub__(self, other):\n", 33 | " return (self.x-other.x, self.y-other.y)\n", 34 | "\n", 35 | " def act(self, choice, diagonal = False):\n", 36 | " '''\n", 37 | " Gives us 4 total movement options. 
(0,1,2,3)\n", 38 | " '''\n", 39 | " if diagonal:\n", 40 | "\n", 41 | " if choice == 0:\n", 42 | " self.move(x=1, y=1)\n", 43 | " elif choice == 1:\n", 44 | " self.move(x=-1, y=-1)\n", 45 | " elif choice == 2:\n", 46 | " self.move(x=-1, y=1)\n", 47 | " elif choice == 3:\n", 48 | " self.move(x=1, y=-1)\n", 49 | "\n", 50 | " else:\n", 51 | " if choice == 0:\n", 52 | " self.move(x=0, y=1)\n", 53 | " elif choice == 1:\n", 54 | " self.move(x=0, y=-1)\n", 55 | " elif choice == 2:\n", 56 | " self.move(x=-1, y=0)\n", 57 | " elif choice == 3:\n", 58 | " self.move(x=1, y=0)\n", 59 | "\n", 60 | "\n", 61 | " def move(self, x=-100, y=-100):\n", 62 | "\n", 63 | " if x == -100:\n", 64 | " self.x += np.random.randint(-1, 2)\n", 65 | " else:\n", 66 | " self.x += x\n", 67 | "\n", 68 | " if y == -100:\n", 69 | " self.y += np.random.randint(-1, 2)\n", 70 | " else:\n", 71 | " self.y += y\n", 72 | "\n", 73 | " if self.x < 0:\n", 74 | " self.x = 0\n", 75 | " elif self.x > self.size-1:\n", 76 | " self.x = self.size-1\n", 77 | " if self.y < 0:\n", 78 | " self.y = 0\n", 79 | " elif self.y > self.size-1:\n", 80 | " self.y = self.size-1\n", 81 | "\n", 82 | "\n", 83 | "class ENVIRONMENT():\n", 84 | "\n", 85 | "\n", 86 | "\n", 87 | " def __init__(self, num_player=1, num_enemy=1, num_food=1, size = 10, diagonal = False):\n", 88 | " self.size = size\n", 89 | " self.naction = 4\n", 90 | " self.diagonal = diagonal\n", 91 | " self.num_enemy = num_enemy\n", 92 | " self.num_food = num_food\n", 93 | " self.player = Blob(size)\n", 94 | " self.enemy = [Blob() for _ in range(self.num_enemy)]\n", 95 | " self.food = [Blob() for _ in range(self.num_food)]\n", 96 | " self.reward = 0\n", 97 | " self.colors = {1: (255, 0, 0),\n", 98 | " 2: (0, 255, 0),\n", 99 | " 3: (0, 0, 255)}\n", 100 | " self.px,self.py = self.player.x,self.player.y\n", 101 | " self.ex,self.ey = [self.enemy[iter].x for iter in range(self.num_enemy)], [self.enemy[iter].y for iter in range(self.num_enemy)]\n", 102 | " self.fx,self.fy = [self.food[iter].x for iter in range(self.num_food)], [self.food[iter].y for iter in range(self.num_food)]\n", 103 | "\n", 104 | "\n", 105 | " def startover(self, newpos=False):\n", 106 | "\n", 107 | " self.player.x, self.player.y = self.px, self.py\n", 108 | " for iter in range(self.num_enemy):\n", 109 | " self.enemy[iter].x, self.enemy[iter].y = self.ex[iter], self.ey[iter]\n", 110 | " for iter in range(self.num_food):\n", 111 | " self.food[iter].x, self.food[iter].y = self.fx[iter], self.fy[iter]\n", 112 | " if newpos == True:\n", 113 | " self.player = Blob(self.size)\n", 114 | " self.reward = 0\n", 115 | "\n", 116 | " return (self.player.x, self.player.y), self.reward, False\n", 117 | "\n", 118 | " def step(self, action):\n", 119 | "\n", 120 | " self.player.act(action, self.diagonal)\n", 121 | " self.reward = self.calculate_reward()\n", 122 | " return (self.player.x, self.player.y), self.reward\n", 123 | "\n", 124 | " def calculate_reward(self):\n", 125 | "\n", 126 | " if self.player.x in [self.enemy[iter].x for iter in range(self.num_enemy)] and self.player.y in [self.enemy[iter].y for iter in range(self.num_enemy)]:\n", 127 | " return -100, True\n", 128 | "\n", 129 | " if self.player.x in [self.food[iter].x for iter in range(self.num_food)] and self.player.y in [self.food[iter].y for iter in range(self.num_food)]:\n", 130 | " return 100, True\n", 131 | "\n", 132 | " else:\n", 133 | " return -1, False\n", 134 | "\n", 135 | "\n", 136 | " def render(self,renderTime=100):\n", 137 | "\n", 138 | " env = np.zeros((self.size, self.size, 
3), dtype=np.uint8)\n", 139 | " for iter in range(self.num_food):\n", 140 | " env[self.food[iter].x][self.food[iter].y] = self.colors[2]\n", 141 | " for iter in range(self.num_enemy):\n", 142 | " env[self.enemy[iter].x][self.enemy[iter].y] = self.colors[3]\n", 143 | " env[self.player.x][self.player.y] = self.colors[1]\n", 144 | " img = Image.fromarray(env, 'RGB')\n", 145 | " img = img.resize((300, 300))\n", 146 | " cv2.imshow(\"image\", np.array(img))\n", 147 | " cv2.waitKey(renderTime)\n", 148 | " # cv2.destroyAllWindows()\n", 149 | "\n", 150 | " def sample_action(self):\n", 151 | " return np.random.randint(0, self.naction)" 152 | ] 153 | }, 154 | { 155 | "cell_type": "code", 156 | "execution_count": 2, 157 | "metadata": {}, 158 | "outputs": [ 159 | { 160 | "data": { 161 | "text/plain": [ 162 | "\"\\nActions \\ndiagonal = True\\n0 = down_right\\n1 = up_left\\n2 = up_right\\n3 = down_left\\nWhen space is not available action = action.split('_')[0]\\n\\nEnvironment\\nplayer = Blue\\nenemy = red\\ngoal = green \\n\\nIf a player is on \\nan enemy reward at that time step = -100\\nthe goal reward at that time step = 100\\nfor every other time step reward is = -1\\n\"" 163 | ] 164 | }, 165 | "execution_count": 2, 166 | "metadata": {}, 167 | "output_type": "execute_result" 168 | } 169 | ], 170 | "source": [ 171 | "env = ENVIRONMENT(diagonal=True, size=10, num_enemy = 3, num_food = 1)\n", 172 | "nS = 100\n", 173 | "nA = 4\n", 174 | "episodes = 25000\n", 175 | "epsilon = 0.95\n", 176 | "gamma = 0.9\n", 177 | "learning_rate = 0.1\n", 178 | "\"\"\"\n", 179 | "Actions \n", 180 | "diagonal = True\n", 181 | "0 = down_right\n", 182 | "1 = up_left\n", 183 | "2 = up_right\n", 184 | "3 = down_left\n", 185 | "When space is not available action = action.split('_')[0]\n", 186 | "\n", 187 | "Environment\n", 188 | "player = Blue\n", 189 | "enemy = red\n", 190 | "goal = green \n", 191 | "\n", 192 | "If a player is on \n", 193 | "an enemy reward at that time step = -100\n", 194 | "the goal reward at that time step = 100\n", 195 | "for every other time step reward is = -1\n", 196 | "\"\"\"" 197 | ] 198 | }, 199 | { 200 | "cell_type": "code", 201 | "execution_count": 3, 202 | "metadata": {}, 203 | "outputs": [], 204 | "source": [ 205 | "def E_policy(q,s,epsilon):\n", 206 | " r = random.random()\n", 207 | " if r self.size-1:\n", 76 | " self.x = self.size-1\n", 77 | " if self.y < 0:\n", 78 | " self.y = 0\n", 79 | " elif self.y > self.size-1:\n", 80 | " self.y = self.size-1\n", 81 | "\n", 82 | "\n", 83 | "class ENVIRONMENT():\n", 84 | "\n", 85 | "\n", 86 | "\n", 87 | " def __init__(self, num_player=1, num_enemy=1, num_food=1, size = 10, diagonal = False):\n", 88 | " self.size = size\n", 89 | " self.naction = 4\n", 90 | " self.diagonal = diagonal\n", 91 | " self.num_enemy = num_enemy\n", 92 | " self.num_food = num_food\n", 93 | " self.player = Blob(size)\n", 94 | " self.enemy = [Blob() for _ in range(self.num_enemy)]\n", 95 | " self.food = [Blob() for _ in range(self.num_food)]\n", 96 | " self.reward = 0\n", 97 | " self.colors = {1: (255, 0, 0),\n", 98 | " 2: (0, 255, 0),\n", 99 | " 3: (0, 0, 255)}\n", 100 | " self.px,self.py = self.player.x,self.player.y\n", 101 | " self.ex,self.ey = [self.enemy[iter].x for iter in range(self.num_enemy)], [self.enemy[iter].y for iter in range(self.num_enemy)]\n", 102 | " self.fx,self.fy = [self.food[iter].x for iter in range(self.num_food)], [self.food[iter].y for iter in range(self.num_food)]\n", 103 | "\n", 104 | "\n", 105 | " def startover(self, newpos=False):\n", 106 | "\n", 107 | 
" self.player.x, self.player.y = self.px, self.py\n", 108 | " for iter in range(self.num_enemy):\n", 109 | " self.enemy[iter].x, self.enemy[iter].y = self.ex[iter], self.ey[iter]\n", 110 | " for iter in range(self.num_food):\n", 111 | " self.food[iter].x, self.food[iter].y = self.fx[iter], self.fy[iter]\n", 112 | " if newpos == True:\n", 113 | " self.player = Blob(self.size)\n", 114 | " self.reward = 0\n", 115 | "\n", 116 | " return (self.player.x, self.player.y), self.reward, False\n", 117 | "\n", 118 | " def step(self, action):\n", 119 | "\n", 120 | " self.player.act(action, self.diagonal)\n", 121 | " self.reward = self.calculate_reward()\n", 122 | " return (self.player.x, self.player.y), self.reward\n", 123 | "\n", 124 | " def calculate_reward(self):\n", 125 | "\n", 126 | " if self.player.x in [self.enemy[iter].x for iter in range(self.num_enemy)] and self.player.y in [self.enemy[iter].y for iter in range(self.num_enemy)]:\n", 127 | " return -100, True\n", 128 | "\n", 129 | " if self.player.x in [self.food[iter].x for iter in range(self.num_food)] and self.player.y in [self.food[iter].y for iter in range(self.num_food)]:\n", 130 | " return 100, True\n", 131 | "\n", 132 | " else:\n", 133 | " return -1, False\n", 134 | "\n", 135 | "\n", 136 | " def render(self,renderTime=100):\n", 137 | "\n", 138 | " env = np.zeros((self.size, self.size, 3), dtype=np.uint8)\n", 139 | " for iter in range(self.num_food):\n", 140 | " env[self.food[iter].x][self.food[iter].y] = self.colors[2]\n", 141 | " for iter in range(self.num_enemy):\n", 142 | " env[self.enemy[iter].x][self.enemy[iter].y] = self.colors[3]\n", 143 | " env[self.player.x][self.player.y] = self.colors[1]\n", 144 | " img = Image.fromarray(env, 'RGB')\n", 145 | " img = img.resize((300, 300))\n", 146 | " cv2.imshow(\"image\", np.array(img))\n", 147 | " cv2.waitKey(renderTime)\n", 148 | " # cv2.destroyAllWindows()\n", 149 | "\n", 150 | " def sample_action(self):\n", 151 | " return np.random.randint(0, self.naction)" 152 | ] 153 | }, 154 | { 155 | "cell_type": "code", 156 | "execution_count": 10, 157 | "metadata": {}, 158 | "outputs": [ 159 | { 160 | "data": { 161 | "text/plain": [ 162 | "\"\\nActions \\ndiagonal = True\\n0 = down_right\\n1 = up_left\\n2 = up_right\\n3 = down_left\\nWhen space is not available action = action.split('_')[0]\\n\\nEnvironment\\nplayer = Blue\\nenemy = red\\ngoal = green\\n\\nIf a player is on \\nan enemy reward at that time step = -100\\nthe goal reward at that time step = 100\\nfor every other time step reward is = -1\\n\"" 163 | ] 164 | }, 165 | "execution_count": 10, 166 | "metadata": {}, 167 | "output_type": "execute_result" 168 | } 169 | ], 170 | "source": [ 171 | "env = ENVIRONMENT(diagonal=True, size=10, num_enemy = 3, num_food = 1)\n", 172 | "episodes = 25000\n", 173 | "nS = 100\n", 174 | "nA = 4\n", 175 | "learning_rate = 0.01\n", 176 | "gamma = 0.9\n", 177 | "epsilon = 0.95\n", 178 | "\"\"\"\n", 179 | "Actions \n", 180 | "diagonal = True\n", 181 | "0 = down_right\n", 182 | "1 = up_left\n", 183 | "2 = up_right\n", 184 | "3 = down_left\n", 185 | "When space is not available action = action.split('_')[0]\n", 186 | "\n", 187 | "Environment\n", 188 | "player = Blue\n", 189 | "enemy = red\n", 190 | "goal = green\n", 191 | "\n", 192 | "If a player is on \n", 193 | "an enemy reward at that time step = -100\n", 194 | "the goal reward at that time step = 100\n", 195 | "for every other time step reward is = -1\n", 196 | "\"\"\"" 197 | ] 198 | }, 199 | { 200 | "cell_type": "code", 201 | "execution_count": 11, 202 | 
"metadata": {}, 203 | "outputs": [], 204 | "source": [ 205 | "def E_policy(q,s,epsilon):\n", 206 | " r = random.random()\n", 207 | " if r