├── Unit6-REINFORCE
    ├── images
    │   ├── CartPole-v0.png
    │   └── LunarLander-v2.png
    ├── utils.py
    ├── README.md
    ├── main.py
    └── REINFORCE.py
├── Unit5-Deep-Q-Networks
    ├── MountainCar_success.pt
    ├── play.py
    ├── readme.md
    ├── utils.py
    ├── main_dqn.py
    └── dqn.py
├── Blog.md
├── README.md
├── Unit4-Temporal-Difference-Methods
    ├── readme.md
    ├── QLearning.ipynb
    └── SARSA.ipynb
├── Unit3-Monte-Carlo
    ├── readme.md
    ├── utils.py
    └── BlackJack.py
├── Sutton_and_Barton.md
├── Unit2-Bellman-Equations
    ├── readme.md
    ├── main.py
    └── BlobEnvironment.py
└── Unit1-Multi-Armed-Bandits
    ├── utils.py
    ├── readme.md
    ├── bandits.py
    ├── main.py
    └── agents.py


/Unit6-REINFORCE/images/CartPole-v0.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/RajGhugare19/Classical-RL/HEAD/Unit6-REINFORCE/images/CartPole-v0.png


--------------------------------------------------------------------------------
/Unit6-REINFORCE/images/LunarLander-v2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/RajGhugare19/Classical-RL/HEAD/Unit6-REINFORCE/images/LunarLander-v2.png


--------------------------------------------------------------------------------
/Unit5-Deep-Q-Networks/MountainCar_success.pt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/RajGhugare19/Classical-RL/HEAD/Unit5-Deep-Q-Networks/MountainCar_success.pt


--------------------------------------------------------------------------------
/Blog.md:
--------------------------------------------------------------------------------
1 | 
2 | ## My Blog
3 | 
4 | * [Nuts and Bolts of RL](https://hackmd.io/@Raj-Ghugare/r1ttq0PNw)                                
5 | * [Multi-Armed Bandits and how to solve them](https://hackmd.io/@Raj-Ghugare/rkkk1XCVw)                                  
6 | * [Policy Gradient theorem](https://hackmd.io/@Raj-Ghugare/rygKPUD08)                                                  
7 | * [REINFORCE what you've learnt](https://hackmd.io/@Raj-Ghugare/BJGFOdmCL)              
8 | 


--------------------------------------------------------------------------------
/Unit6-REINFORCE/utils.py:
--------------------------------------------------------------------------------
 1 | import matplotlib.pyplot as plt
 2 | import time
 3 | import numpy as np
 4 | 
 5 | def plot_score(score_history,exp,save=False):
 6 |     score = np.array(score_history)
 7 |     iters = np.arange(len(score_history))
 8 |     plt.plot(iters,score)
 9 |     plt.xlabel('training iterations')
10 |     plt.ylabel('Total scores obtained')
11 |     plt.title('REINFORCE ' + exp)
12 |     if(save):
13 |         plt.savefig('./images/'+exp+'.png')
14 |     plt.legend()
15 |     plt.show()
16 | 


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
 1 | # Classical reinforcement learning
 2 | After a year of exploring the field as a beginner, I found a good policy of how to start learning Reinforcement learning for a begginer from a researchers point of view. This repository is to share my RL journey and resources so that you can get an insight on how to learn how to learn.
 3 | 
 4 | this repo has:
 5 | 
 6 | - [x] Python implementations of various topics in sequential order.
 7 | - [x] Free Resources that I found to be helpful(Video lecture links, online articles and blogs).
 8 | 
 9 | Please feel free to open up a PR if you think you have more sources or notes to share :)
10 | 
11 | Note:
12 | 
13 | The python implementations of algorithms are not optimized and are meant to be for learning purposes only. 
14 | 


--------------------------------------------------------------------------------
/Unit6-REINFORCE/README.md:
--------------------------------------------------------------------------------
 1 | 
 2 | # REINFORCE
 3 | 
 4 | REINFORCE is a vanilla policy gradient approach towards RL problems. This algorithm is implemented succesfully on the followoing problem from OpenAi gym
 5 | 
 6 | ### Results:
 7 | 
 8 | #### CartPole-v0
 9 | 
10 | ![](./images/CartPole-v0.png)
11 | 
12 | #### LunarLander-v2
13 | 
14 | ![](./images/LunarLander-v2.png)
15 | 
16 | ### Observations:
17 | This method theoretically seems to work in expectation terms, but for individual trials occasionally gives sub-optimal results.The new data generated depends upon previous policy and hence this technique cannot be used in high stake situations.
18 | 
19 | ### dependencies:
20 | 
21 | * [openai gym](https://gym.openai.com/)           
22 | * [pytorch](https://pytorch.org/)
23 | 
24 | 


--------------------------------------------------------------------------------
/Unit4-Temporal-Difference-Methods/readme.md:
--------------------------------------------------------------------------------
 1 | # Temporal-difference model free methods
 2 | Temporal difference methods are the heart of Reinforcement learning.
 3 | 
 4 | 1. [CS234 Lecture 4](https://www.youtube.com/watch?v=j080VBVGkfQ&list=PLoROMvodv4rOSOPzutgyCTapiGlY2Nd8u&index=4)
 5 | 2. [Professor Balaraman Ravindran's RL untill the end of week 6](https://nptel.ac.in/courses/106106143/)
 6 | 3. [Sutton and Barton chapter 6](https://web.stanford.edu/class/psych209/Readings/SuttonBartoIPRLBook2ndEd.pdf)
 7 | 
 8 | I have implemented the following algorithms on a custom made blob environment(with the help of [sentdex's tutorial on how to make a blob env](https://www.youtube.com/watch?v=G92TF4xYQcU&list=PLQVvvaa0QuDezJFIOU5wDdfy4e9vdnx-7&index=4))
 9 | 
10 | - [x] SARSA
11 | - [x] Q-Learning
12 | 


--------------------------------------------------------------------------------
/Unit3-Monte-Carlo/readme.md:
--------------------------------------------------------------------------------
 1 | # Monte-Carlo model free methods
 2 | Monte-Carlo is the start of model free learning algorithms.
 3 | 1. [CS234 Lecture 3(Leave the TD-learning half for now)](https://www.youtube.com/watch?v=dRIhrn8cc9w&list=PLoROMvodv4rOSOPzutgyCTapiGlY2Nd8u&index=3)
 4 | 2. [Professor Balaraman Ravindran's RL until week 6 second lecture](https://nptel.ac.in/courses/106106143/)
 5 | 3. [Sutton and Barton chapter 5](https://web.stanford.edu/class/psych209/Readings/SuttonBartoIPRLBook2ndEd.pdf)
 6 | 
 7 | You could also checkout the David silver lecture series, although I found CS234 much better.
 8 | 
 9 | I have used the [blackjack environment](https://github.com/openai/gym/blob/master/gym/envs/toy_text/blackjack.py) of OpenAi gym
10 | to implement the following algorithms
11 | 
12 | - [ ] On policy Monte-Carlo(With exploring start)
13 | - [ ] On policy Monte-Carlo(without exploring start)
14 | - [x] Off policy Monte-Carlo
15 | 


--------------------------------------------------------------------------------
/Sutton_and_Barton.md:
--------------------------------------------------------------------------------
 1 | ## Reinforcement-Learning-An-Introduction-second-edition
 2 | 
 3 | ### Solutions
 4 | My solutions to the exercises of [Sutton and Barton's book](https://web.stanford.edu/class/psych209/Readings/SuttonBartoIPRLBook2ndEd.pdf)
 5 | * [Chapter 2](https://hackmd.io/@Raj-Ghugare/HkfDBDlfv)
 6 | * [Chapter 3](https://hackmd.io/@Raj-Ghugare/HkFPsyXtU)
 7 | * [Chapter 4](https://hackmd.io/@Raj-Ghugare/H1IooEiyw)
 8 | * [Chapter 5](https://hackmd.io/@Raj-Ghugare/SkaSu3HxD)                                      
 9 | * [Chapter 6](https://hackmd.io/@Raj-Ghugare/BkZZ3PaKL)
10 | 
11 | ### How to contribute
12 | 
13 | #### Corrections
14 | These solutions could have some errors and if you find something wrong please open a pull request by changing the readme and specifying the mistake in it.
15 | 
16 | #### Solutions to problems which are not included
17 | If you want to provide your solutions then write your solutions in markdown using [hackmd](https://hackmd.io/?nav=overview) and append it to the readme in your pull request.I will add the solutions if they are appropriate.
18 | 


--------------------------------------------------------------------------------
/Unit6-REINFORCE/main.py:
--------------------------------------------------------------------------------
 1 | import gym
 2 | from REINFORCE import Agent
 3 | from utils import plot_score
 4 | import numpy as np
 5 | import torch
 6 | from gym import wrappers
 7 | 
 8 | 
 9 | NAME = "LunarLander-v2"
10 | INPUT_DIMS = [8]
11 | GAMMA = 0.99
12 | N_ACTIONS = 4
13 | N_GAMES = 200
14 | 
15 | if __name__ == '__main__':
16 |     env = gym.make(NAME)
17 |     agent = Agent(lr=0.001, input_dims=INPUT_DIMS, gamma=GAMMA, n_actions=N_ACTIONS,
18 |                     h1=64, h2=32)
19 |     score_history = []
20 |     score = 0
21 |     best_score = -1000
22 | 
23 |     for i in range(N_GAMES):
24 |         print('episode: ', i, 'score %.3f' % score)
25 |         done = False
26 |         score = 0
27 |         state = env.reset()
28 |         while not done:
29 |             action = agent.choose_action(state)
30 |             next_state, reward, done, _ = env.step(action)
31 |             agent.store_rewards(reward)
32 |             state = next_state
33 |             score += reward
34 |         if(np.mean(score_history[-20:])>best_score and i>20):
35 |             torch.save(agent.policy.state_dict(),'./params/'+NAME+'pt.')
36 |             best_score = np.mean(score_history[-20])
37 |         score_history.append(score)
38 |         agent.improve()
39 | 
40 |     plot_score(score_history,NAME,save=True)
41 | 


--------------------------------------------------------------------------------
/Unit3-Monte-Carlo/utils.py:
--------------------------------------------------------------------------------
 1 | import seaborn as sns
 2 | import numpy as np
 3 | import matplotlib.pyplot as plt
 4 | 
 5 | def plot(target_policy):
 6 |     usable = np.zeros([11,10])
 7 |     non_usable = np.zeros([11,10])
 8 |     for i in target_policy:
 9 |         if i[0]>10:
10 |             if i[2]:
11 |                 usable[i[0]-11,i[1]-1]=target_policy[i]
12 |             else:
13 |                 non_usable[i[0]-11,i[1]-1]=target_policy[i]
14 |     usable = np.flip(usable,0)
15 |     non_usable = np.flip(non_usable,0)
16 |     ax = sns.heatmap(usable, linewidth=0, cbar=False)
17 |     plt.xlabel('dealer showing')
18 |     plt.title("Non Usable ace")
19 |     plt.title("Usable ace [Black=Stick,Off-white=Hit]")
20 |     plt.yticks(np.arange(11),['21','20','19','18','17','16','15','14','13','12','11'])
21 |     plt.xticks(np.arange(0,10)+0.5,['1','2','3','4','5','6','7','8','9','10'])
22 |     ax.yaxis.tick_right()
23 |     plt.show()
24 |     ax = sns.heatmap(non_usable, linewidth=0, cbar=False)
25 |     plt.ylabel('player sum')
26 |     plt.xlabel('dealer showing')
27 |     plt.title("Non Usable ace [[Black=Stick,Off-white=Hit]]")
28 |     plt.yticks(np.arange(11),['21','20','19','18','17','16','15','14','13','12','11'])
29 |     plt.xticks(np.arange(0,10)+0.5,['1','2','3','4','5','6','7','8','9','10'])
30 |     ax.yaxis.tick_right()
31 |     plt.show()
32 | 


--------------------------------------------------------------------------------
/Unit5-Deep-Q-Networks/play.py:
--------------------------------------------------------------------------------
 1 | from dqn import cart_agent
 2 | import torch
 3 | import gym
 4 | import time
 5 | import numpy as np
 6 | from utils import plot_learning_curve
 7 | 
 8 | device = torch.device('cuda' if torch.cuda.is_available else 'cpu')
 9 | env = gym.make('MountainCar-v0').unwrapped
10 | 
11 | if __name__ == '__main__':
12 |     player =  agent(epsilon=0,eps_decay=0,epsilon_min=0,gamma=0,l_r=0,n_actions=3,
13 |                 memory=0,batch_size=0,target_update=0,env = env,save = True)
14 |     n_games = 3
15 |     scores = []
16 |     player.policy_net.load_state_dict(torch.load('/home/raj/My_projects/DQN/MountanCar.pt'))
17 | 
18 |     for i in range(n_games):
19 |         env.reset()
20 |         last_screen = player.get_state()
21 |         current_screen = player.get_state()
22 |         state = current_screen-last_screen
23 | 
24 |         done = False
25 |         score = 0
26 |         while not done:
27 |             action = player.choose_action(state)
28 |             time.sleep(0.05)
29 |             _, reward, done, _ = player.env.step(action)
30 | 
31 |             last_screen = current_screen
32 |             current_screen = player.get_state()
33 | 
34 |             next_state = current_screen - last_screen
35 |             score += reward
36 |             state = next_state
37 | 
38 | 
39 |         scores.append(score)
40 |     print(np.mean(scores))
41 |     plot_learning_curve(i, scores,0)
42 | 


--------------------------------------------------------------------------------
/Unit5-Deep-Q-Networks/readme.md:
--------------------------------------------------------------------------------
 1 | 
 2 | # Deep Q learning using fixed q targets and experience replay
 3 | 
 4 | ## Results
 5 | 
 6 | ### Trained Mountain Car :
 7 | ![](https://media.giphy.com/media/dZopKlQbCgEBTPBy8n/giphy.gif)
 8 | 
 9 | ### Trained Cart Pole :
10 | ![](https://media.giphy.com/media/J5Yh1aY9WhlJc4TZFR/giphy.gif)
11 | 
12 | ## Abstract:
13 | 
14 | Function approximators like neural networks have succesfully been combined with reinforcement learning because of their ability to derive optimal estimations of the environment using higher order inputs like audios and images.This is an implementation of the Human-level control through deep reinforcement learning with some crunch time tweaks.My implementation was first tested using the low state inputs of the CartPole environment from OpenAi Gym.Then it was succesfully applied to different OpenAi gym  environments without any major hyper-parameter tuning using just the high-dimensional sensory inputs. 
15 | 
16 | 
17 | ## Environments:
18 | 
19 | - **CartPole** - [https://gym.openai.com/envs/CartPole-v1/]
20 | - **MountainCar** - [https://gym.openai.com/envs/MountainCar-v0/]
21 | 
22 | ## Instruction:
23 | 
24 | ``` Hyper-parameters tuning for new problems should be done accordingly ```
25 | ``` The path to save pytorch model checkpoints should be changed ```
26 | 
27 | ## Dependencies:
28 | 
29 | - Anaconda: [link](https://docs.anaconda.com/anaconda/install/linux/)
30 | - OpenAi gym: [link](https://gym.openai.com/)
31 | - pytorch: [link](https://pytorch.org/)
32 | 
33 | ## References:
34 | 
35 | - [Playing Atari with Deep Reinforcement Learning](https://www.cs.toronto.edu/~vmnih/docs/dqn.pdf)
36 | 
37 | 


--------------------------------------------------------------------------------
/Unit5-Deep-Q-Networks/utils.py:
--------------------------------------------------------------------------------
 1 | import matplotlib.pyplot as plt
 2 | import numpy as np
 3 | from IPython.display import clear_output
 4 | import matplotlib
 5 | import torch
 6 | import matplotlib.pyplot as plt
 7 | 
 8 | is_ipython = 'inline' in matplotlib.get_backend()
 9 | if is_ipython:
10 |     from IPython import display
11 | 
12 | def plot_learning_curve(episode, scores, epsilon):
13 |     clear_output(True)
14 |     plt.figure(figsize=(20,5))
15 |     plt.subplot(131)
16 |     plt.title('episode %s. average_reward: %s' % (episode, np.mean(scores[-10:])))
17 |     plt.plot(scores)
18 |     plt.subplot(132)
19 |     plt.title('epsilon')
20 |     plt.plot(epsilon)
21 |     plt.show()
22 | 
23 | def plot_playing_curve(episode, scores):
24 |     clear_output(True)
25 |     plt.figure(figsize=(5,5))
26 |     plt.title('episode %s. average_reward: %s' % (episode, np.mean(scores[-10:])))
27 |     plt.plot(scores)
28 |     plt.show()
29 | 
30 | def plot_durations(scores,pause):
31 |     plt.ion()
32 |     plt.figure(2)
33 |     plt.clf()
34 | 
35 |     durations_t = torch.tensor(scores, dtype=torch.float)
36 |     plt.title('Training...')
37 |     plt.xlabel('Episode')
38 |     plt.ylabel('Scores')
39 |     plt.plot(durations_t.numpy())
40 |     # Take 20 episode averages and plot them too
41 |     if len(durations_t) >= 20:
42 |         means = durations_t.unfold(0, 20, 1).mean(1).view(-1)
43 |         means = torch.cat((torch.zeros(19), means))
44 |         plt.plot(means.numpy())
45 | 
46 |     plt.pause(pause)  # pause a bit so that plots are updated
47 |     if is_ipython:
48 |         display.clear_output(wait=True)
49 |         display.display(plt.gcf())
50 | 


--------------------------------------------------------------------------------
/Unit2-Bellman-Equations/readme.md:
--------------------------------------------------------------------------------
 1 | # Bellman equations and solving MDPs
 2 | Markov Decision Processes bring in the sequential decision making and delayed reward aspects of RL.
 3 | 
 4 | 1. [Stanford CS234 lecture 2](https://www.youtube.com/watch?v=E3f2Camj0Is&list=PLoROMvodv4rOSOPzutgyCTapiGlY2Nd8u&index=2)
 5 | 2. [Professor Balaraman Ravindran's RL week 3,4 and 5th only till policy iteration](https://nptel.ac.in/courses/106106143/)
 6 | 3. [Sutton and Barton chapter 3 and 4](https://web.stanford.edu/class/psych209/Readings/SuttonBartoIPRLBook2ndEd.pdf)
 7 | 
 8 | Balaram's lectures are more of classical RL and math intensive, So it is better to watch CS234 first.
 9 | 
10 | You should implement the algorithms once you understand the proofs.   
11 | I have implemented the following :
12 | 
13 | - [x] Value iteration
14 | - [x] Policy iteration
15 | - [x] Asynchronous value iteration
16 | - [ ] Real time dynamic programming
17 | 
18 | The MDP which I have used is from the example 3.5 - Gridworld from Sutton and Barton.After running the code we can obtain figure 3.15 from the textbook
19 | 
20 | ![](https://i.imgur.com/uwnhUyi.png)
21 | 
22 | These are my notes on these topics.
23 | * [Bellman Equation](https://hackmd.io/Fuhp2hwyR4GknchLGBGTWw)
24 | * [Bellman Optimality Equation](https://hackmd.io/wqQyQAvlTVeGzLsaVLUswg)
25 | * [Value Iteration](https://hackmd.io/3o8W1o4rS6ikMs42PVXPAw)
26 | * [Policy Iteration](https://hackmd.io/8F3m-j59TB-RaxP3ysVsCg?both)
27 | 
28 | 
29 | References:
30 | 1. If you want to make your own blob environment then you can watch this [sentdex tutorial](https://www.youtube.com/watch?v=G92TF4xYQcU&list=PLQVvvaa0QuDezJFIOU5wDdfy4e9vdnx-7&index=4).
31 | 


--------------------------------------------------------------------------------
/Unit1-Multi-Armed-Bandits/utils.py:
--------------------------------------------------------------------------------
 1 | #######################################################################
 2 | # Copyright (C)                                                       #
 3 | # 2020(rajghugare.vnit@gmail.com)                                     #
 4 | # Permission given to modify the code as long as you keep this        #
 5 | # declaration at the top                                              #
 6 | #######################################################################
 7 | 
 8 | import numpy as np
 9 | import matplotlib.pyplot as plt
10 | 
11 | 
12 | def plot_ArmCount(data, num_iters, bandit, k):
13 |     x_index = []
14 |     for i in range(k):
15 |         x_index.append(str(i))
16 |     location = np.arange(k)
17 |     (e_Q, e_regret, e_arm_history, _) = data[bandit]["epsilon_greedy"]
18 |     (s_Q, s_regret, s_arm_history, _) = data[bandit]["softmax"]
19 |     (u_Q, u_regret, u_arm_history, _) = data[bandit]["UCB"]
20 |     (fig, ax) = plt.subplots(1,1)
21 |     bar1 = ax.bar(location, e_arm_history, label="epsilon_greedy", fill=False, edgecolor='green')
22 |     bar2 = ax.bar(location, s_arm_history, label="softmax", fill=False, edgecolor='red')
23 |     bar3 = ax.bar(location, u_arm_history, label="UCB1", fill=False, edgecolor='purple')
24 |     ax.set_ylabel('Arm pull history')
25 |     ax.set_title('Number of times arm pulled')
26 |     ax.set_xticks(location)
27 |     ax.set_xticklabels(x_index)
28 |     ax.legend()
29 |     fig.tight_layout()
30 |     plt.show()
31 | 
32 | 
33 | 
34 | def plot_regret(data, num_iters, bandit, player, k):
35 |     #Plots regret of any one bandit at a time
36 |     (Q, regret, arm_history, _) = data[bandit][player]
37 |     t = np.arange(num_iters)
38 |     plt.plot(t, regret, color='green', label=player)
39 |     plt.xlabel("Time steps")
40 |     plt.ylabel("Regret")
41 |     plt.legend()
42 |     plt.show()
43 | 


--------------------------------------------------------------------------------
/Unit1-Multi-Armed-Bandits/readme.md:
--------------------------------------------------------------------------------
 1 | # Multi-Armed bandits(Immediate RL problems)
 2 | Multi-armed bandits are ignored by a lot of people who begin studying RL,but I think that it is the best place to gain a strong mathematical foothold and get an idea of how things would work in a RL problem.
 3 | 
 4 | ### My notes
 5 | 
 6 | * [Overview of the Multi-armed bandit problem](https://hackmd.io/CZQq2azUTMCjt2FF_TQNfQ?view)
 7 | * [Regret optimality with UCB1](https://hackmd.io/-DkQQy8DRYezVXDqUaPsYQ)
 8 | * [PAC bounds with median elimination](https://hackmd.io/saK7DdqCRnyBfN3HykLhlA)
 9 | 
10 | ### Hello world of Reinforcement learning.
11 | 
12 | I would strongly advice you guys to go through the resources I am going to list down.These will be enough for theoretically studying bandits(atleast enough to get a basic understanding of immediate RL).
13 | 
14 | 1. [Just go through the introduction from wikipedia.](https://en.wikipedia.org/wiki/Multi-armed_bandit)  
15 | 2. [Professor Balaraman Ravindran's RL week 1 and week 2](https://nptel.ac.in/courses/106106143/)  
16 | 3. [Sutton and Barton chapter 2](https://web.stanford.edu/class/psych209/Readings/SuttonBartoIPRLBook2ndEd.pdf)
17 | 
18 | As you are watching the lectures it is a good idea that you code the algorithms as you learn them.
19 | 
20 | I have implemented the following algorithms in agents.py
21 | - [X] epsilon-greedy
22 | - [x] softmax
23 | - [x] UCB1
24 | - [x] Median elimination
25 | - [ ] Other variants of UCB
26 | - [ ] Thompson sampling
27 | - [ ] Policy gradient methods
28 | 
29 | Results:
30 | ![](https://i.imgur.com/H4u6UaE.png)
31 | 
32 | this shows the number of times different algorithms pulled different arms(arms are in ascending as per expected values)
33 | 
34 | The ones which I havent implemented(You could/should if you want to).
35 | 
36 | After you are done with this I would recommend to go through the notes that I made.These notes summarize the topics in a brief manner.I would suggest you to make similar notes
37 | 
38 | 
39 | 
40 | 


--------------------------------------------------------------------------------
/Unit5-Deep-Q-Networks/main_dqn.py:
--------------------------------------------------------------------------------
 1 | import gym
 2 | from dqn import cart_agent
 3 | import numpy as np
 4 | import torch
 5 | from utils import plot_durations
 6 | from utils import plot_learning_curve
 7 | 
 8 | device = torch.device('cuda' if torch.cuda.is_available else 'cpu')
 9 | 
10 | if __name__ == '__main__':
11 | 
12 |     env = gym.make('MountainCar-v0')
13 | 
14 |     A =  agent(epsilon=1,eps_decay=0.005,epsilon_min=0.01,gamma=0.99,l_r=0.0001,n_actions=3,
15 |                 memory=20000,batch_size=32,target_update=7,env=env,save=True)
16 | 
17 |     scores, avg_score, epsilon_history = [], [], []
18 |     best_score = -np.inf
19 |     n_games = 1000
20 |     score = 0
21 | 
22 |     print("Save is currently !!!!!!!!!!!!!!!!!! ", A.save)
23 | 
24 |     for i in range(n_games):
25 |         A.env.reset()
26 |         last_screen = A.get_state()
27 |         current_screen = A.get_state()
28 |         state = current_screen-last_screen
29 | 
30 |         done = False
31 |         score = 0
32 | 
33 |         if i%20==0 and i>0:
34 |             plot_durations(scores, 0.001)
35 |             print('----------------- training --------------------')
36 |             print('epsiode number', i)
37 |             print("Average score ",avg_score[-1])
38 |             print('----------------- training --------------------')
39 | 
40 |         while not done:
41 |             action = A.choose_action(state)
42 | 
43 |             _, reward, done, _ = A.env.step(action)
44 | 
45 |             last_screen = current_screen
46 |             current_screen = A.get_state()
47 | 
48 |             next_state = current_screen - last_screen
49 | 
50 |             A.store_experience(state,action,reward,done,next_state)
51 |             A.learn_with_experience_replay()
52 | 
53 |             score += reward
54 |             state = next_state
55 | 
56 |         scores.append(score)
57 |         if i>30:
58 |             avg_score.append(np.mean(scores[-30:]))
59 |         else:
60 |             avg_score.append(np.mean(scores))
61 | 
62 |         if avg_score[-1] > best_score:
63 |             torch.save(A.policy_net.state_dict(),'/home/raj/My_projects/DQN/MountanCar.pt')
64 |             best_score = avg_score[-1]
65 |             print("***************\ncurrent best average score is "+ str(best_score) +"\n***************")
66 | 
67 |         if i%A.target_update == 0:
68 |             A.target_net.load_state_dict(A.policy_net.state_dict())
69 | 
70 |         A.epsilon_decay()
71 | 
72 |     plot_durations(scores,5)
73 | 


--------------------------------------------------------------------------------
/Unit6-REINFORCE/REINFORCE.py:
--------------------------------------------------------------------------------
 1 | import torch as T
 2 | import torch.nn as nn
 3 | import torch.nn.functional as F
 4 | import torch.optim as optim
 5 | import numpy as np
 6 | 
 7 | class Policy(nn.Module):
 8 |     def __init__(self, lr, input_dims, h1, h2, n_actions):
 9 |         super(Policy,self).__init__()
10 |         self.input_dims = input_dims
11 |         self.lr = lr
12 |         self.h1 = h1
13 |         self.h2 = h2
14 |         self.n_actions = n_actions
15 |         self.linear1 = nn.Linear(*self.input_dims, self.h1)
16 |         self.linear2 = nn.Linear(self.h1, self.h2)
17 |         self.linear3 = nn.Linear(self.h2, self.n_actions)
18 |         self.optimizer = optim.Adam(self.parameters(), lr=lr)
19 | 
20 |         self.device = T.device('cuda:0' if T.cuda.is_available() else 'cpu:0')
21 |         self.to(self.device)
22 | 
23 |     def forward(self,obs):
24 |         x = T.tensor(obs,dtype=T.float).to(self.device)
25 |         x = F.relu(self.linear1(x))
26 |         x = F.relu(self.linear2(x))
27 |         x = self.linear3(x)
28 | 
29 |         return x
30 | 
31 | class Agent(object):
32 |     def __init__(self, lr, input_dims, gamma=0.99, n_actions=2, h1=128, h2=128):
33 |         self.gamma = gamma
34 |         self.reward_memory = []
35 |         self.action_memory = []
36 |         self.policy = Policy(lr, input_dims, h1, h2, n_actions)
37 | 
38 |     def choose_action(self, observation):
39 |         probs = F.softmax(self.policy(observation),dim=0)
40 |         action_probs = T.distributions.Categorical(probs)
41 |         action = action_probs.sample()
42 |         log_probs = T.log(probs[action])
43 |         self.action_memory.append(log_probs)
44 | 
45 |         return action.item()
46 | 
47 |     def store_rewards(self, reward):
48 |         self.reward_memory.append(reward)
49 | 
50 |     def improve(self):
51 |         self.policy.optimizer.zero_grad()
52 |         G = np.zeros_like(self.reward_memory, dtype=np.float64)
53 |         for t in range(len(self.reward_memory)):
54 |             g_sum = 0
55 |             disc = 1
56 |             for i in range(t, len(self.reward_memory)):
57 |                 g_sum += self.reward_memory[i]*disc
58 |                 disc *= self.gamma
59 |             G[t] = g_sum
60 |         G = (G - np.mean(G))/(np.std(G) if np.std(G) > 0 else 1)
61 | 
62 |         G = T.tensor(G, dtype=T.float).to(self.policy.device)
63 | 
64 |         loss = 0
65 |         for g,log_prob in zip(G, self.action_memory):
66 |             loss += -g * log_prob
67 | 
68 |         loss.backward()
69 |         self.policy.optimizer.step()
70 | 
71 |         self.action_memory = []
72 |         self.reward_memory = []
73 | 


--------------------------------------------------------------------------------
/Unit1-Multi-Armed-Bandits/bandits.py:
--------------------------------------------------------------------------------
 1 | 
 2 | #######################################################################
 3 | # Copyright (C)                                                       #
 4 | # 2020(rajghugare.vnit@gmail.com)                                     #
 5 | # Permission given to modify the code as long as you keep this        #
 6 | # declaration at the top                                              #
 7 | #######################################################################
 8 | import numpy as np
 9 | import random
10 | 
11 | 
12 | class GaussianStationaryBandit(object):
13 |     def __init__(self, k, mu, sigma):
14 |         self.qstar = np.array(mu)                     #ndarray expected payoff
15 |         self.sigma = np.array(sigma)                  #ndarray standard deviation of payoff
16 |         self.arm_history = np.zeros(k)                #initializing history of all arms taken
17 |         self.regret = []                              #initializing history of regret after every action
18 |         self.best_payoff = np.max(self.qstar)
19 |         self.best_arm = np.argmax(self.qstar)
20 |         self.num_arms = k
21 |         self.rew = 0
22 | 
23 |     def pull(self, arm):
24 |         self.arm_history[arm] += 1
25 |         self.regret.append(self.best_payoff - self.qstar[arm])
26 |         reward = random.gauss(self.qstar[arm], self.sigma[arm])
27 |         self.reward += reward
28 |         return reward
29 | 
30 |     def get_ArmHistory(self):
31 |         return self.arm_history
32 | 
33 |     def get_regret(self):
34 |         return self.regret
35 | 
36 |     def get_BestArm(self):
37 |         return self.best_arm
38 | 
39 |     def reset(self):
40 |         self.arm_history = np.zeros(self.num_arms)
41 |         self.regret = []
42 |         self.re = 0
43 | 
44 |     def get_total_reward(self):
45 |         return self.reward
46 | 
47 | class BernoulliStationaryBandit(object):
48 |     def __init__(self, k, mu):
49 |         self.qstar = np.array(mu)                     #ndarray expected payoff
50 |         self.arm_history = np.zeros(k)                #initializing history of all arms taken
51 |         self.regret = []                              #initializing history of regret after every action
52 |         self.best_payoff = np.max(self.qstar)
53 |         self.best_arm = np.argmax(self.qstar)
54 |         self.num_arms = k
55 |         self.reward = 0
56 | 
57 |     def pull(self, arm):
58 |         self.arm_history[arm] += 1
59 |         self.regret.append(self.best_payoff - self.qstar[arm])
60 |         reward =  np.random.choice([1,0], p = [self.qstar[arm], 1-self.qstar[arm]])
61 |         self.reward += reward
62 |         return reward
63 | 
64 |     def get_ArmHistory(self):
65 |         return self.arm_history
66 | 
67 |     def get_regret(self):
68 |         return self.regret
69 | 
70 |     def get_BestArm(self):
71 |         return self.best_arm
72 | 
73 |     def reset(self):
74 |         self.arm_history = np.zeros(self.num_arms)
75 |         self.regret = []
76 |         self.reward = 0
77 | 
78 |     def get_total_reward(self):
79 |         return self.reward
80 | 


--------------------------------------------------------------------------------
/Unit1-Multi-Armed-Bandits/main.py:
--------------------------------------------------------------------------------
 1 | #######################################################################
 2 | # Copyright (C)                                                       #
 3 | # 2020(rajghugare.vnit@gmail.com)                                     #
 4 | # Permission given to modify the code as long as you keep this        #
 5 | # declaration at the top                                              #
 6 | #######################################################################
 7 | 
 8 | from agents import epsilon_greedy_agent
 9 | from agents import softmax_agent
10 | from agents import Median_elimination_agent
11 | from agents import UCB
12 | from bandits import GaussianStationaryBandit
13 | from bandits import BernoulliStationaryBandit
14 | from utils import plot_ArmCount
15 | from utils import plot_regret
16 | import numpy as np
17 | 
18 | 
19 | #gauss_bandit = GaussianStationaryBandit(k, mu, sigma)
20 | #sigma = [1.5, 3.1, 4.1, 1, 0.1 ,2.1 ,1.1 ,0.61 ,.71 ,1]
21 | #epsilon_greedy_player_gaussian = epsilon_greedy_agent(gauss_bandit, 1, num_iters)
22 | #softmax_player_gaussian = softmax_agent(gauss_bandit, 1, num_iters)
23 | 
24 | 
25 | num_iters = 5000
26 | 
27 | k = 10
28 | mu = np.array([0.1,0.5,0.7,0.73,0.756,0.789,0.81,0.83,0.855,0.865])
29 | mu = np.arange(10)*0.1
30 | bernoulli_bandit = BernoulliStationaryBandit(k , mu)
31 | 
32 | #initializing all the players
33 | epsilon_greedy_player_bernoulli = epsilon_greedy_agent(bernoulli_bandit, 1, num_iters)
34 | softmax_player_bernoulli = softmax_agent(bernoulli_bandit, 0.1, num_iters)
35 | median_elimination_player = Median_elimination_agent(bernoulli_bandit, epsilon=0.1, delta=0.1)
36 | UCB_player = UCB(bernoulli_bandit,num_iters)
37 | 
38 | def play_UCB():
39 |     data["bernoulli_bandit"]["UCB"] = UCB_player.play()
40 |     data["bernoulli_bandit"]["UCB"] = UCB_player.play()
41 |     plot_regret(data, num_iters, "bernoulli_bandit", "UCB", k)
42 |     plot_ArmCount(data, num_iters, "bernoulli_bandit", "UCB", k)
43 | 
44 | 
45 | def play_median_elimination():
46 |     data["bernoulli_bandit"]["median_elimination"] = median_elimination_player.play()
47 | 
48 | 
49 | def play_epsilon_greedy():
50 |     data["bernoulli_bandit"]["epsilon_greedy"] = epsilon_greedy_player_bernoulli.play()
51 |     plot_regret(data, num_iters, "bernoulli_bandit", "epsilon_greedy", k)
52 |     plot_ArmCount(data, num_iters, "bernoulli_bandit", "epsilon_greedy", k)
53 | 
54 | def play_softmax():
55 |     data["bernoulli_bandit"]["softmax"] = softmax_player_bernoulli.play()
56 |     plot_regret(data, num_iters, "bernoulli_bandit", "softmax", k)
57 |     plot_ArmCount(data, num_iters, "bernoulli_bandit", "softmax", k)
58 | 
59 | 
60 | 
61 | if __name__ == "__main__" :
62 | 
63 |     data = {"bernoulli_bandit":{},"gauss_bandit":{}}
64 |     data["bernoulli_bandit"]["UCB"] = UCB_player.play()
65 |     data["bernoulli_bandit"]["median_elimination"] = median_elimination_player.play()
66 |     data["bernoulli_bandit"]["epsilon_greedy"] = epsilon_greedy_player_bernoulli.play()
67 |     data["bernoulli_bandit"]["softmax"] = softmax_player_bernoulli.play()
68 |     plot_ArmCount(data, num_iters, "bernoulli_bandit", k)
69 | 


--------------------------------------------------------------------------------
/Unit3-Monte-Carlo/BlackJack.py:
--------------------------------------------------------------------------------
  1 | #######################################################################
  2 | # Copyright (C)                                                       #
  3 | # 2020(rajghugare.vnit@gmail.com)                                     #
  4 | # Permission given to modify the code as long as you keep this        #
  5 | # declaration at the top                                              #
  6 | #######################################################################
  7 | 
  8 | import gym
  9 | import numpy as np
 10 | import matplotlib.pyplot as plt
 11 | import random
 12 | from utils import plot
 13 | 
 14 | env= gym.make('Blackjack-v0')
 15 | GAMMA = 1
 16 | 
 17 | playerSum = list(np.arange(4,22))
 18 | agentCard = list(np.arange(1,11))
 19 | playerAce = [False,True]
 20 | actionSpace = [0,1]
 21 | stateSpace = []
 22 | 
 23 | target_policy = {}
 24 | Q = {}
 25 | C = {}
 26 | for p in playerSum:
 27 |     for a in agentCard:
 28 |         for ace in playerAce:
 29 |             stateSpace.append((p,a,ace))
 30 |             m = -1
 31 |             for action in actionSpace:
 32 |                 Q[(p,a,ace),action] = random.random()
 33 |                 if Q[(p,a,ace),action] > m:
 34 |                     m = Q[(p,a,ace),action]
 35 |                     argmax = action
 36 |                 C[(p,a,ace),action] = 0
 37 |             target_policy[(p,a,ace)] = argmax
 38 | 
 39 | def behaviour_policy():
 40 |     r = random.uniform(0,1)
 41 |     if r<0.5:
 42 |         return 0
 43 |     else:
 44 |         return 1
 45 | 
 46 | 
 47 | # Monte Carlo Off policy control to find optimal policy
 48 | for i in range(1000000):
 49 |     states = []
 50 |     actions = []
 51 |     rewards = []
 52 |     done = False
 53 |     states.append(env.reset())
 54 |     a = behaviour_policy()
 55 |     actions.append(a)
 56 |     while True:
 57 |         (s,r,done,_) = env.step(a)
 58 |         rewards.append(r)
 59 |         if done:
 60 |             break
 61 |         a = behaviour_policy()
 62 |         actions.append(a)
 63 |         states.append(s)
 64 |     G = 0
 65 |     W = 1      #Importance sampling ratio
 66 |     for i in range(len(states)):
 67 |         G = G + GAMMA*rewards[-1-i]
 68 |         C[states[-i-1],actions[-i-1]] += W
 69 |         Q[states[-i-1],actions[-i-1]] = Q[states[-i-1],actions[-i-1]] + W*(G-Q[states[-i-1],actions[-i-1]])/C[states[-i-1],actions[-i-1]]
 70 |         m = -1
 71 |         for action in actionSpace:
 72 |             if Q[states[-1-i],action] > m:
 73 |                 m = Q[states[-1-i],action]
 74 |                 argmax = action
 75 |         target_policy[states[-1-i]] = argmax
 76 |         if actions[-i-1] != argmax:
 77 |             break
 78 |         W = W*(1/0.5)
 79 | 
 80 | def play(n):
 81 |     win = 0
 82 |     loss = 0
 83 |     draw = 0
 84 |     for i in range(n):
 85 |         score = 0
 86 |         done = False
 87 |         s = env.reset()
 88 |         while not done:
 89 |             a = target_policy[s]
 90 |             (s,r,done,_) = env.step(a)
 91 |             score += r
 92 |         if score==0:
 93 |             draw += 1
 94 |         elif score==1:
 95 |             win += 1
 96 |         else:
 97 |             loss +=1
 98 |     print(win)
 99 |     print(loss)
100 |     print(draw)
101 | 
102 | plot(target_policy)
103 | 


--------------------------------------------------------------------------------
/Unit2-Bellman-Equations/main.py:
--------------------------------------------------------------------------------
 1 | 
 2 | #######################################################################
 3 | # Copyright (C)                                                       #
 4 | # 2020(rajghugare.vnit@gmail.com)                                     #
 5 | # Permission given to modify the code as long as you keep this        #
 6 | # declaration at the top                                              #
 7 | #######################################################################
 8 | 
 9 | import numpy as np
10 | from BlobEnvironment import BlobEnvironment
11 | 
12 | env = BlobEnvironment()
13 | 
14 | EPSILON = 0.01  #this is an optimality factor
15 | GAMMA = 0.9    #As defined in the problem itself
16 | 
17 | def value_iteration():
18 |     n = 0
19 |     v = np.zeros([5,5])                         #This should still be considered as a 25 dimensional vector
20 |     v_new = np.zeros([5,5])                     #It is in the form of 5 by 5 matrix for better visual undertanding and slightly easier implementation
21 |     while True:
22 |         for y in range(5):
23 |             for x in range(5):
24 |                 v_temp = np.zeros(4)
25 |                 for action in range(4):
26 |                     env.x = x
27 |                     env.y = y
28 |                     x_next,y_next,reward = env.step(action)
29 |                     v_temp[action] = reward + GAMMA*v[y_next,x_next]
30 |                 v_new[y,x] = np.max(v_temp)
31 |         if np.max(np.abs(v - v_new)) < EPSILON*(1-GAMMA)/(2*GAMMA):
32 |             env.plot_grid_values(np.round(v_new,decimals=2))
33 |             break
34 |         v = np.copy(v_new)
35 | 
36 | def policy_iteration():
37 |     n = 0
38 |     policy = np.zeros([5,5],dtype = np.uint8)
39 |     v = np.zeros([5,5])
40 |     v_new = np.zeros([5,5])
41 |     while True:
42 |         #Policy evaluation
43 |         while True:
44 |             for y in range(5):
45 |                 for x in range(5):
46 |                     action = policy[y,x]
47 |                     env.x = x
48 |                     env.y = y
49 |                     x_next,y_next,reward = env.step(action)
50 |                     v_new[y,x] = reward + GAMMA*v[y_next,x_next]
51 |             if np.max(np.abs(v - v_new)) < EPSILON*(1-GAMMA)/(2*GAMMA):
52 |                 break
53 |             v = np.copy(v_new)
54 |         #Policy improvement
55 |         new_policy = np.zeros([5,5],dtype=np.uint8)
56 |         for y in range(5):
57 |             for x in range(5):
58 |                 v_temp = np.zeros(4)
59 |                 for action in range(4):
60 |                     env.x = x
61 |                     env.y = y
62 |                     x_next,y_next,reward = env.step(action)
63 |                     v_temp[action] = reward + GAMMA*v[y_next,x_next]
64 |                 new_policy[y,x] = np.argmax(v_temp)
65 |         if np.array_equal(policy,new_policy):
66 |             break
67 |         policy = np.copy(new_policy)
68 |     env.plot_policy(policy)
69 |     return policy
70 | 
71 | 
72 | def Play_optimally(policy):
73 |     (x,y) = env.reset()
74 |     for i in range(25):
75 |         action = policy[y,x]
76 |         (x,y,reward) = env.step(action)
77 |         print(reward)
78 |         env.render()
79 | 
80 | 
81 | value_iteration()
82 | 
83 | policy = policy_iteration()
84 | Play_optimally(policy)
85 | 


--------------------------------------------------------------------------------
/Unit2-Bellman-Equations/BlobEnvironment.py:
--------------------------------------------------------------------------------
  1 | 
  2 | #######################################################################
  3 | # Copyright (C)                                                       #
  4 | # 2020(rajghugare.vnit@gmail.com)                                     #
  5 | # Permission given to modify the code as long as you keep this        #
  6 | # declaration at the top                                              #
  7 | #######################################################################
  8 | 
  9 | import numpy as np
 10 | import matplotlib.pyplot as plt
 11 | import time
 12 | import random
 13 | 
 14 | class Blob():
 15 |     def __init__(self, SIZE):
 16 |         self.size = SIZE
 17 |         self.x = np.random.randint(0, self.size)
 18 |         self.y = np.random.randint(0, self.size)
 19 | 
 20 |     def __str__(self):
 21 |         return f"{self.x}, {self.y}"
 22 | 
 23 | class BlobEnvironment():
 24 |     def __init__(self):
 25 |         self.size = 5
 26 |         self.n_actions = 4
 27 |         self.player = Blob(self.size)
 28 |         self.x = self.player.x
 29 |         self.y = self.player.y
 30 |         self.color = {"player":(0,0,255)}
 31 |         self.reward = 0
 32 | 
 33 |     def reset(self):
 34 |         self.x = self.player.x
 35 |         self.y = self.player.y
 36 |         return (self.x, self.y)
 37 | 
 38 |     def step(self,action=-1):
 39 |         if action == -1:
 40 |             print('lolll')
 41 |             action = self.player.policy()
 42 | 
 43 |         if action == 0:                     #Right
 44 |             self.move(x=1, y=0)
 45 |         elif action == 1:                   #Down
 46 |             self.move(x=0, y=1)
 47 |         elif action == 2:                   #Left
 48 |             self.move(x=-1, y=0)
 49 |         elif action == 3:                   #up
 50 |             self.move(x=0, y=-1)
 51 |         return self.x,self.y,self.reward
 52 | 
 53 |     def move(self, x, y):
 54 |         self.reward = 0
 55 |         if self.x==1 and self.y==0:
 56 |             self.reward = 10
 57 |             self.x = 1
 58 |             self.y = 4
 59 |         elif self.x==3 and self.y==0:
 60 |             self.reward = 5
 61 |             self.x = 3
 62 |             self.y = 2
 63 |         else:
 64 |             self.x += x
 65 |             self.y += y
 66 |             if self.x < 0:
 67 |                 self.x = 0
 68 |                 self.reward = -1
 69 |             elif self.x >= self.size:
 70 |                 self.x = self.size-1
 71 |                 self.reward = -1
 72 | 
 73 |             if self.y < 0:
 74 |                 self.y = 0
 75 |                 self.reward = -1
 76 |             elif self.y >= self.size:
 77 |                 self.y = self.size-1
 78 |                 self.reward = -1
 79 | 
 80 |     def render(self, RenderTime = 100):
 81 |         env = np.ones((self.size,self.size,3), dtype = np.uint8)*255
 82 |         env[self.y][self.x] = self.color["player"]
 83 |         plt.xticks(np.arange(-0.5,4.5,1),np.arange(5))
 84 |         plt.yticks(np.arange(-0.5,4.5,1),np.arange(5))
 85 |         plt.grid('True')
 86 |         plt.imshow(np.array(env))
 87 |         plt.pause(RenderTime/100)
 88 | 
 89 |     def sample_actions(self):
 90 |         return np.random.randint(0, self.n_actions)
 91 | 
 92 |     def plot_grid_values(self, values):
 93 |         fig, axs = plt.subplots(1,1)
 94 |         axs.axis('off')
 95 |         the_table = axs.table(cellText=values,bbox=[0, 0, 1, 1],cellLoc="center")
 96 |         plt.show()
 97 | 
 98 |     def plot_policy(self, policy):
 99 |         P = []
100 |         for y in range(5):
101 |             p = []
102 |             for x in range(5):
103 |                 if policy[y,x] == 0:
104 |                     p.append("right")
105 |                 elif policy[y,x] == 1:
106 |                     p.append("down")
107 |                 elif policy[y,x] == 2:
108 |                     p.append("left")
109 |                 else:
110 |                     p.append("up")
111 |             P.append(p)
112 |         fig, axs = plt.subplots(1,1)
113 |         axs.axis('off')
114 |         the_table = axs.table(cellText=P,bbox=[0, 0, 1, 1],cellLoc="center")
115 |         plt.show()
116 | 


--------------------------------------------------------------------------------
/Unit1-Multi-Armed-Bandits/agents.py:
--------------------------------------------------------------------------------
  1 | #######################################################################
  2 | # Copyright (C)                                                       #
  3 | # 2020(rajghugare.vnit@gmail.com)                                     #
  4 | # Permission given to modify the code as long as you keep this        #
  5 | # declaration at the top                                              #
  6 | #######################################################################
  7 | 
  8 | import random
  9 | import numpy as np
 10 | from bandits import BernoulliStationaryBandit
 11 | from bandits import GaussianStationaryBandit
 12 | 
 13 | 
 14 | class epsilon_greedy_agent(BernoulliStationaryBandit):
 15 |     def __init__(self, bandit, epsilon, num_iters):
 16 |         self.bandit = bandit
 17 |         self.epsilon = epsilon
 18 |         self.num_iters = num_iters
 19 |         self.Q = np.ones(self.bandit.num_arms)
 20 | 
 21 |     def EpsilonGreedy_policy(self):
 22 |         if random.random()<self.epsilon:
 23 |             arm = random.randint(0, self.bandit.num_arms-1)
 24 |         else:
 25 |             arm = np.argmax(self.Q)
 26 |         return arm
 27 | 
 28 |     def play(self,decay = True):
 29 |         self.bandit.reset()
 30 |         self.indicator = np.zeros(self.bandit.num_arms)
 31 |         for t in range(self.num_iters):
 32 |             arm = self.EpsilonGreedy_policy()
 33 |             reward = self.bandit.pull(arm)
 34 |             self.indicator[arm] += 1
 35 |             self.Q[arm] = (self.Q[arm]*self.indicator[arm] + reward)/(self.indicator[arm] + 1)
 36 |             if decay and t>0:
 37 |                 if t > self.num_iters/50:
 38 |                     self.epsilon = 1/t
 39 |         arm_history = self.bandit.get_ArmHistory()
 40 |         regret = self.bandit.get_regret()
 41 |         total_reward = self.bandit.get_total_reward()
 42 |         return self.Q, regret, arm_history, total_reward
 43 | 
 44 | 
 45 | class softmax_agent(BernoulliStationaryBandit):
 46 |     def __init__(self, bandit, beta, num_iters):
 47 |         self.bandit = bandit
 48 |         self.beta = beta
 49 |         self.num_iters = num_iters
 50 |         self.Q = np.ones(self.bandit.num_arms)
 51 | 
 52 |     def Softmax_policy(self):
 53 |         prob = np.copy(np.exp(self.Q/self.beta)/np.sum(np.exp(self.Q/self.beta)))
 54 |         arm = np.random.choice(np.arange(self.bandit.num_arms), p = prob)
 55 |         return arm
 56 | 
 57 |     def play(self):
 58 |         self.bandit.reset()
 59 |         self.indicator = np.zeros(self.bandit.num_arms)
 60 |         for t in range(self.num_iters):
 61 |             arm = self.Softmax_policy()
 62 |             reward =  self.bandit.pull(arm)
 63 |             self.indicator[arm] += 1
 64 |             self.Q[arm] = (self.Q[arm]*self.indicator[arm] + reward)/(self.indicator[arm] + 1)
 65 |         arm_history = self.bandit.get_ArmHistory()
 66 |         regret = self.bandit.get_regret()
 67 |         total_reward = self.bandit.get_total_reward()
 68 |         return self.Q, regret, arm_history, total_reward
 69 | 
 70 | 
 71 | class UCB():
 72 |     def __init__(self, bandit, num_iters):
 73 |         self.bandit = bandit
 74 |         self.time = 0
 75 |         self.Q = np.zeros(self.bandit.num_arms)
 76 |         self.confidence = np.zeros(self.bandit.num_arms)
 77 |         self.num_iters = num_iters
 78 | 
 79 |     def UCB_policy(self):
 80 |         arm = np.argmax(np.add(self.Q,self.confidence))
 81 |         return arm
 82 | 
 83 |     def play(self):
 84 |         self.bandit.reset()
 85 |         for i in range(self.bandit.num_arms):
 86 |             self.Q[i] = self.bandit.pull(i)
 87 |             self.time +=1
 88 |         for i in range(self.num_iters-self.bandit.num_arms):
 89 |             a_h = self.bandit.get_ArmHistory()
 90 |             self.confidence = np.sqrt(2*np.log(self.time)/a_h)
 91 |             arm = self.UCB_policy()
 92 |             reward = self.bandit.pull(arm)
 93 |             self.Q[arm] = (self.Q[arm]*a_h[arm] + reward)/(a_h[arm] + 1)
 94 |         arm_history = self.bandit.get_ArmHistory()
 95 |         regret = self.bandit.get_regret()
 96 |         total_reward = self.bandit.get_total_reward()
 97 |         return self.Q, regret, arm_history, total_reward
 98 | 
 99 | 
100 | class Median_elimination_agent():
101 |     def __init__(self, bandit, epsilon, delta):
102 |         self.bandit = bandit
103 |         self.epsilon = epsilon/4
104 |         self.delta = delta/2
105 |         self.Q = np.ones(self.bandit.num_arms)
106 |         self.S = np.arange(self.bandit.num_arms)
107 |     def play(self):
108 |         self.bandit.reset()
109 |         self.indicator = np.zeros(self.bandit.num_arms)
110 |         while len(self.S) != 1:
111 |             for arm in self.S:
112 |                 count = int(2*np.log(3/self.delta)/np.square(self.epsilon))
113 |                 for i in range(count):
114 |                     reward = self.bandit.pull(arm)
115 |                     self.indicator[arm] += 1
116 |                     self.Q[arm] = (self.Q[arm]*self.indicator[arm] + reward)/(self.indicator[arm] + 1)
117 |             M = np.median(self.Q[self.S])
118 |             self.S = np.delete(self.S, np.where(self.Q[self.S]<M))
119 |             self.epsilon = self.epsilon*3/4
120 |             self.delta = self.delta/2
121 |             print(self.S)
122 |         arm_history = self.bandit.get_ArmHistory()
123 |         regret =  self.bandit.get_regret()
124 |         total_reward = self.bandit.get_total_reward()
125 |         #self.s will eventually contain arm
126 |         return self.S, regret, arm_history,total_reward
127 | 


--------------------------------------------------------------------------------
/Unit5-Deep-Q-Networks/dqn.py:
--------------------------------------------------------------------------------
  1 | import gym
  2 | import numpy as np
  3 | import torch
  4 | import torch.nn as nn
  5 | import matplotlib
  6 | import matplotlib.pyplot as plt
  7 | from torch import optim
  8 | import torchvision.transforms as T
  9 | import cv2
 10 | 
 11 | device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
 12 | 
 13 | class DeepQNetwork(nn.Module):
 14 | 
 15 |     def __init__(self,learning_rate,h,w,n_actions):
 16 |         super(DeepQNetwork, self).__init__()
 17 |         self.conv1 = nn.Conv2d(1, 32, kernel_size=8, stride=4)
 18 |         self.conv2 = nn.Conv2d(32, 64, kernel_size=4, stride=2)
 19 |         self.conv3 = nn.Conv2d(64, 64, kernel_size=3, stride=1)
 20 | 
 21 |         def conv2d_size(size, kernel_size = 3, stride = 1):
 22 |             return (size - kernel_size)// stride  + 1
 23 | 
 24 |         convw = conv2d_size(conv2d_size(conv2d_size(w,8,4),4,2))
 25 |         convh = conv2d_size(conv2d_size(conv2d_size(h,8,4),4,2))
 26 |         lin_1 = convw*convh
 27 | 
 28 | 
 29 |         self.linear1 = nn.Linear(lin_1*64,256)
 30 |         self.linear2 = nn.Linear(256, n_actions)
 31 | 
 32 |         self.criterion = nn.MSELoss()
 33 |         self.optimizer = optim.Adam(self.parameters(),lr = learning_rate)
 34 | 
 35 | 
 36 |     def forward(self, x):
 37 |         x = torch.relu(self.conv1(x))
 38 |         x = torch.relu(self.conv2(x))
 39 |         x = torch.relu(self.conv3(x))
 40 |         x = x.view(x.shape[0],-1)
 41 |         x = torch.relu(self.linear1(x))
 42 |         action_values = self.linear2(x)
 43 |         return action_values
 44 | 
 45 | class agent():
 46 | 
 47 |     def __init__(self,epsilon,eps_decay,epsilon_min,gamma,l_r,n_actions,memory,batch_size,target_update,env,save=False):
 48 |         self.epsilon = epsilon
 49 |         self.eps_decay = eps_decay
 50 |         self.epsilon_min = epsilon_min
 51 |         self.gamma = gamma
 52 |         self.env = env
 53 |         self.n_actions = n_actions
 54 |         self.batch_size = batch_size
 55 |         self.memory = memory
 56 |         self.memory_count = 0
 57 |         self.ROWS = 84
 58 |         self.COLS = 84
 59 |         self.state_memory = torch.zeros([self.memory,1,self.ROWS,self.COLS],dtype = torch.float32)
 60 |         self.next_state_memory = torch.zeros([self.memory,1,self.ROWS,self.COLS],dtype = torch.float32)
 61 |         self.action_memory = torch.zeros(self.memory,dtype=torch.int32)
 62 |         self.terminal_memory = torch.zeros(self.memory,dtype=torch.uint8)
 63 |         self.reward_memory = torch.zeros(self.memory)
 64 |         self.policy_net = DeepQNetwork(learning_rate = l_r,h=self.ROWS,w=self.COLS,n_actions=self.n_actions).to(device)
 65 |         self.target_net = DeepQNetwork(learning_rate = l_r,h=self.ROWS,w=self.COLS,n_actions=self.n_actions).to(device)
 66 |         self.target_update = target_update
 67 |         self.save = save
 68 | 
 69 | 
 70 | 
 71 |     def choose_action(self,state):
 72 |         r = np.random.random()
 73 |         if r<self.epsilon:
 74 |             action = self.env.action_space.sample()
 75 |         else:
 76 |             with torch.no_grad():
 77 |                 q_val = self.policy_net.forward(state)
 78 |                 action = torch.argmax(q_val).item()
 79 | 
 80 |         return action
 81 | 
 82 | 
 83 |     def store_experience(self,state,action,reward,terminal,next_state):
 84 |         index = self.memory_count%self.memory
 85 | 
 86 |         self.state_memory[index] = state
 87 |         self.action_memory[index] = action
 88 |         self.reward_memory[index] = reward
 89 |         self.terminal_memory[index] = 1-terminal
 90 |         self.next_state_memory[index] = next_state
 91 | 
 92 |         self.memory_count+=1
 93 | 
 94 |     def get_state(self):
 95 | 
 96 |         screen = self.env.render(mode='rgb_array')
 97 |         screen_1 = cv2.cvtColor(screen, cv2.COLOR_RGB2GRAY)
 98 |         r_screen = cv2.resize(screen_1, (self.ROWS,self.COLS), interpolation=cv2.INTER_AREA)
 99 |         r_screen = np.array(r_screen)
100 |         r_screen = np.expand_dims(r_screen,axis=0)
101 |         r_screen = torch.Tensor(r_screen)
102 |         return r_screen.unsqueeze(0).to(device)
103 | 
104 | 
105 |     def learn_with_experience_replay(self):
106 |         if self.memory_count < self.batch_size:
107 |             return
108 | 
109 |         if self.memory_count < self.memory:
110 |             mem = self.memory_count
111 |         else:
112 |             mem = self.memory
113 | 
114 |         self.policy_net.optimizer.zero_grad()
115 | 
116 | 
117 |         batch = np.random.choice(mem, self.batch_size, replace=False)
118 | 
119 |         state_batch = self.state_memory[batch].to(device)
120 |         action_batch = self.action_memory[batch]
121 |         new_state_batch = self.next_state_memory[batch].to(device)
122 |         reward_batch = self.reward_memory[batch].to(device)
123 |         terminal_batch = self.terminal_memory[batch].to(device)
124 | 
125 | 
126 |         q_val = self.policy_net.forward(state_batch).to(device)
127 | 
128 |         q_next = self.target_net.forward(new_state_batch).to(device).detach()
129 | 
130 |         q_target = self.policy_net.forward(state_batch).to(device).detach()
131 | 
132 |         batch_index = np.arange(self.batch_size)
133 |         action_values = torch.max(q_next,1)[0]
134 | 
135 |         q_target[batch_index, np.array(action_batch)] = reward_batch + self.gamma*action_values*terminal_batch
136 | 
137 |         loss = self.policy_net.criterion(q_val,q_target).to(device)
138 | 
139 |         self.policy_net.optimizer.zero_grad()
140 |         loss.backward()
141 |         self.policy_net.optimizer.step()
142 | 
143 | 
144 | 
145 |     def epsilon_decay(self):
146 |         if self.epsilon>self.epsilon_min:
147 |             self.epsilon = self.epsilon-self.eps_decay
148 |         return self.epsilon
149 | 
150 | 
151 | print('Done')
152 | 


--------------------------------------------------------------------------------
/Unit4-Temporal-Difference-Methods/QLearning.ipynb:
--------------------------------------------------------------------------------
  1 | {
  2 |  "cells": [
  3 |   {
  4 |    "cell_type": "code",
  5 |    "execution_count": 1,
  6 |    "metadata": {},
  7 |    "outputs": [],
  8 |    "source": [
  9 |     "#Custom Environment\n",
 10 |     "import numpy as np\n",
 11 |     "from PIL import Image\n",
 12 |     "import cv2\n",
 13 |     "import matplotlib.pyplot as plt\n",
 14 |     "from matplotlib import style\n",
 15 |     "import time\n",
 16 |     "import numpy as np\n",
 17 |     "import random\n",
 18 |     "\n",
 19 |     "\n",
 20 |     "style.use(\"ggplot\")\n",
 21 |     "\n",
 22 |     "\n",
 23 |     "class Blob():\n",
 24 |     "    def __init__(self, SIZE = 10):\n",
 25 |     "        self.size = SIZE\n",
 26 |     "        self.x = np.random.randint(0, SIZE)\n",
 27 |     "        self.y = np.random.randint(0, SIZE)\n",
 28 |     "\n",
 29 |     "    def __str__(self):\n",
 30 |     "        return f\"{self.x}, {self.y}\"\n",
 31 |     "\n",
 32 |     "    def __sub__(self, other):\n",
 33 |     "        return (self.x-other.x, self.y-other.y)\n",
 34 |     "\n",
 35 |     "    def act(self, choice, diagonal = False):\n",
 36 |     "        '''\n",
 37 |     "        Gives us 4 total movement options. (0,1,2,3)\n",
 38 |     "        '''\n",
 39 |     "        if diagonal:\n",
 40 |     "\n",
 41 |     "            if choice == 0:\n",
 42 |     "                self.move(x=1, y=1)\n",
 43 |     "            elif choice == 1:\n",
 44 |     "                self.move(x=-1, y=-1)\n",
 45 |     "            elif choice == 2:\n",
 46 |     "                self.move(x=-1, y=1)\n",
 47 |     "            elif choice == 3:\n",
 48 |     "                self.move(x=1, y=-1)\n",
 49 |     "\n",
 50 |     "        else:\n",
 51 |     "            if choice == 0:\n",
 52 |     "                self.move(x=0, y=1)\n",
 53 |     "            elif choice == 1:\n",
 54 |     "                self.move(x=0, y=-1)\n",
 55 |     "            elif choice == 2:\n",
 56 |     "                self.move(x=-1, y=0)\n",
 57 |     "            elif choice == 3:\n",
 58 |     "                self.move(x=1, y=0)\n",
 59 |     "\n",
 60 |     "\n",
 61 |     "    def move(self, x=-100, y=-100):\n",
 62 |     "\n",
 63 |     "        if x == -100:\n",
 64 |     "            self.x += np.random.randint(-1, 2)\n",
 65 |     "        else:\n",
 66 |     "            self.x += x\n",
 67 |     "\n",
 68 |     "        if y == -100:\n",
 69 |     "            self.y += np.random.randint(-1, 2)\n",
 70 |     "        else:\n",
 71 |     "            self.y += y\n",
 72 |     "\n",
 73 |     "        if self.x < 0:\n",
 74 |     "            self.x = 0\n",
 75 |     "        elif self.x > self.size-1:\n",
 76 |     "            self.x = self.size-1\n",
 77 |     "        if self.y < 0:\n",
 78 |     "            self.y = 0\n",
 79 |     "        elif self.y > self.size-1:\n",
 80 |     "            self.y = self.size-1\n",
 81 |     "\n",
 82 |     "\n",
 83 |     "class ENVIRONMENT():\n",
 84 |     "\n",
 85 |     "\n",
 86 |     "\n",
 87 |     "    def __init__(self, num_player=1, num_enemy=1, num_food=1, size = 10, diagonal = False):\n",
 88 |     "        self.size = size\n",
 89 |     "        self.naction = 4\n",
 90 |     "        self.diagonal = diagonal\n",
 91 |     "        self.num_enemy = num_enemy\n",
 92 |     "        self.num_food = num_food\n",
 93 |     "        self.player = Blob(size)\n",
 94 |     "        self.enemy = [Blob() for _ in range(self.num_enemy)]\n",
 95 |     "        self.food = [Blob() for _ in range(self.num_food)]\n",
 96 |     "        self.reward = 0\n",
 97 |     "        self.colors = {1: (255, 0, 0),\n",
 98 |     "         2: (0, 255, 0),\n",
 99 |     "         3: (0, 0, 255)}\n",
100 |     "        self.px,self.py = self.player.x,self.player.y\n",
101 |     "        self.ex,self.ey = [self.enemy[iter].x for iter in range(self.num_enemy)], [self.enemy[iter].y for iter in range(self.num_enemy)]\n",
102 |     "        self.fx,self.fy = [self.food[iter].x for iter in range(self.num_food)], [self.food[iter].y for iter in range(self.num_food)]\n",
103 |     "\n",
104 |     "\n",
105 |     "    def startover(self, newpos=False):\n",
106 |     "\n",
107 |     "        self.player.x, self.player.y = self.px, self.py\n",
108 |     "        for iter in range(self.num_enemy):\n",
109 |     "            self.enemy[iter].x, self.enemy[iter].y = self.ex[iter], self.ey[iter]\n",
110 |     "        for iter in range(self.num_food):\n",
111 |     "            self.food[iter].x, self.food[iter].y = self.fx[iter], self.fy[iter]\n",
112 |     "        if newpos == True:\n",
113 |     "            self.player = Blob(self.size)\n",
114 |     "        self.reward = 0\n",
115 |     "\n",
116 |     "        return (self.player.x, self.player.y), self.reward, False\n",
117 |     "\n",
118 |     "    def step(self, action):\n",
119 |     "\n",
120 |     "        self.player.act(action, self.diagonal)\n",
121 |     "        self.reward = self.calculate_reward()\n",
122 |     "        return (self.player.x, self.player.y), self.reward\n",
123 |     "\n",
124 |     "    def calculate_reward(self):\n",
125 |     "\n",
126 |     "        if self.player.x in [self.enemy[iter].x for iter in range(self.num_enemy)] and self.player.y in [self.enemy[iter].y for iter in range(self.num_enemy)]:\n",
127 |     "            return -100, True\n",
128 |     "\n",
129 |     "        if self.player.x in [self.food[iter].x for iter in range(self.num_food)] and self.player.y in [self.food[iter].y for iter in range(self.num_food)]:\n",
130 |     "            return 100, True\n",
131 |     "\n",
132 |     "        else:\n",
133 |     "            return -1, False\n",
134 |     "\n",
135 |     "\n",
136 |     "    def render(self,renderTime=100):\n",
137 |     "\n",
138 |     "        env = np.zeros((self.size, self.size, 3), dtype=np.uint8)\n",
139 |     "        for iter in range(self.num_food):\n",
140 |     "            env[self.food[iter].x][self.food[iter].y] = self.colors[2]\n",
141 |     "        for iter in range(self.num_enemy):\n",
142 |     "            env[self.enemy[iter].x][self.enemy[iter].y] = self.colors[3]\n",
143 |     "        env[self.player.x][self.player.y] = self.colors[1]\n",
144 |     "        img = Image.fromarray(env, 'RGB')\n",
145 |     "        img = img.resize((300, 300))\n",
146 |     "        cv2.imshow(\"image\", np.array(img))\n",
147 |     "        cv2.waitKey(renderTime)\n",
148 |     "        # cv2.destroyAllWindows()\n",
149 |     "\n",
150 |     "    def sample_action(self):\n",
151 |     "        return np.random.randint(0, self.naction)"
152 |    ]
153 |   },
154 |   {
155 |    "cell_type": "code",
156 |    "execution_count": 2,
157 |    "metadata": {},
158 |    "outputs": [
159 |     {
160 |      "data": {
161 |       "text/plain": [
162 |        "\"\\nActions \\ndiagonal = True\\n0 = down_right\\n1 = up_left\\n2 = up_right\\n3 = down_left\\nWhen space is not available action = action.split('_')[0]\\n\\nEnvironment\\nplayer = Blue\\nenemy = red\\ngoal = green \\n\\nIf a player is on \\nan enemy reward at that time step = -100\\nthe goal reward at that time step = 100\\nfor every other time step reward is = -1\\n\""
163 |       ]
164 |      },
165 |      "execution_count": 2,
166 |      "metadata": {},
167 |      "output_type": "execute_result"
168 |     }
169 |    ],
170 |    "source": [
171 |     "env = ENVIRONMENT(diagonal=True, size=10, num_enemy = 3, num_food = 1)\n",
172 |     "nS = 100\n",
173 |     "nA = 4\n",
174 |     "episodes = 25000\n",
175 |     "epsilon = 0.95\n",
176 |     "gamma = 0.9\n",
177 |     "learning_rate = 0.1\n",
178 |     "\"\"\"\n",
179 |     "Actions \n",
180 |     "diagonal = True\n",
181 |     "0 = down_right\n",
182 |     "1 = up_left\n",
183 |     "2 = up_right\n",
184 |     "3 = down_left\n",
185 |     "When space is not available action = action.split('_')[0]\n",
186 |     "\n",
187 |     "Environment\n",
188 |     "player = Blue\n",
189 |     "enemy = red\n",
190 |     "goal = green \n",
191 |     "\n",
192 |     "If a player is on \n",
193 |     "an enemy reward at that time step = -100\n",
194 |     "the goal reward at that time step = 100\n",
195 |     "for every other time step reward is = -1\n",
196 |     "\"\"\""
197 |    ]
198 |   },
199 |   {
200 |    "cell_type": "code",
201 |    "execution_count": 3,
202 |    "metadata": {},
203 |    "outputs": [],
204 |    "source": [
205 |     "def E_policy(q,s,epsilon):\n",
206 |     "    r = random.random()\n",
207 |     "    if r<epsilon:\n",
208 |     "        a = env.sample_action()\n",
209 |     "    else:\n",
210 |     "        a = np.argmax(q[s])\n",
211 |     "    return a"
212 |    ]
213 |   },
214 |   {
215 |    "cell_type": "code",
216 |    "execution_count": 4,
217 |    "metadata": {},
218 |    "outputs": [],
219 |    "source": [
220 |     "Q = np.zeros([nS,nA])\n",
221 |     "for e in range(episodes):\n",
222 |     "    pos_i,r0,term = env.startover(newpos = True)\n",
223 |     "    s_i = pos_i[0]*10 + pos_i[1]\n",
224 |     "    a_i = E_policy(Q,s_i,epsilon)\n",
225 |     "    while term == False:\n",
226 |     "        #a_i = E_policy(Q,s_i,epsilon)\n",
227 |     "        pos_f,(r,term) = env.step(a_i)\n",
228 |     "        s_f = pos_f[0]*10 + pos_f[1]\n",
229 |     "        Q[s_i,a_i] = (1-learning_rate)*Q[s_i,a_i] + learning_rate*(r + gamma*np.max(Q[s_f]))\n",
230 |     "        a_i = E_policy(Q,s_f,epsilon)\n",
231 |     "        s_i = s_f\n",
232 |     "    #if (e+1)%500 == 0:\n",
233 |     "    #    print(\"Epsiode = \",e)\n",
234 |     "    epsilon = epsilon*0.09998"
235 |    ]
236 |   },
237 |   {
238 |    "cell_type": "code",
239 |    "execution_count": 5,
240 |    "metadata": {},
241 |    "outputs": [],
242 |    "source": [
243 |     "pol = np.zeros(nS)\n",
244 |     "for s in range(nS):\n",
245 |     "    pol[s] = np.argmax(Q[s])"
246 |    ]
247 |   },
248 |   {
249 |    "cell_type": "code",
250 |    "execution_count": 6,
251 |    "metadata": {},
252 |    "outputs": [],
253 |    "source": [
254 |     "def play():\n",
255 |     "    pos_i,k,ter = env.startover(newpos=True)\n",
256 |     "    env.render()\n",
257 |     "    T = False\n",
258 |     "    i = 0\n",
259 |     "    while T == False and i<=20:   \n",
260 |     "        s = pos_i[0]*10 + pos_i[1]\n",
261 |     "        #print(pos_i)\n",
262 |     "        pos_i,R = env.step(pol[s])\n",
263 |     "        #print(R)\n",
264 |     "        T=R[1]\n",
265 |     "        env.render(500)\n",
266 |     "        i = i+1\n",
267 |     "    cv2.destroyAllWindows()"
268 |    ]
269 |   },
270 |   {
271 |    "cell_type": "code",
272 |    "execution_count": 16,
273 |    "metadata": {},
274 |    "outputs": [],
275 |    "source": [
276 |     "for i in range(10):\n",
277 |     "    play()"
278 |    ]
279 |   },
280 |   {
281 |    "cell_type": "code",
282 |    "execution_count": null,
283 |    "metadata": {},
284 |    "outputs": [],
285 |    "source": []
286 |   }
287 |  ],
288 |  "metadata": {
289 |   "kernelspec": {
290 |    "display_name": "Python 3",
291 |    "language": "python",
292 |    "name": "python3"
293 |   },
294 |   "language_info": {
295 |    "codemirror_mode": {
296 |     "name": "ipython",
297 |     "version": 3
298 |    },
299 |    "file_extension": ".py",
300 |    "mimetype": "text/x-python",
301 |    "name": "python",
302 |    "nbconvert_exporter": "python",
303 |    "pygments_lexer": "ipython3",
304 |    "version": "3.7.4"
305 |   }
306 |  },
307 |  "nbformat": 4,
308 |  "nbformat_minor": 2
309 | }
310 | 


--------------------------------------------------------------------------------
/Unit4-Temporal-Difference-Methods/SARSA.ipynb:
--------------------------------------------------------------------------------
  1 | {
  2 |  "cells": [
  3 |   {
  4 |    "cell_type": "code",
  5 |    "execution_count": 9,
  6 |    "metadata": {},
  7 |    "outputs": [],
  8 |    "source": [
  9 |     "#Custom Environment\n",
 10 |     "import numpy as np\n",
 11 |     "from PIL import Image\n",
 12 |     "import cv2\n",
 13 |     "import matplotlib.pyplot as plt\n",
 14 |     "from matplotlib import style\n",
 15 |     "import time\n",
 16 |     "import numpy as np\n",
 17 |     "import random\n",
 18 |     "\n",
 19 |     "\n",
 20 |     "style.use(\"ggplot\")\n",
 21 |     "\n",
 22 |     "\n",
 23 |     "class Blob():\n",
 24 |     "    def __init__(self, SIZE = 10):\n",
 25 |     "        self.size = SIZE\n",
 26 |     "        self.x = np.random.randint(0, SIZE)\n",
 27 |     "        self.y = np.random.randint(0, SIZE)\n",
 28 |     "\n",
 29 |     "    def __str__(self):\n",
 30 |     "        return f\"{self.x}, {self.y}\"\n",
 31 |     "\n",
 32 |     "    def __sub__(self, other):\n",
 33 |     "        return (self.x-other.x, self.y-other.y)\n",
 34 |     "\n",
 35 |     "    def act(self, choice, diagonal = False):\n",
 36 |     "        '''\n",
 37 |     "        Gives us 4 total movement options. (0,1,2,3)\n",
 38 |     "        '''\n",
 39 |     "        if diagonal:\n",
 40 |     "\n",
 41 |     "            if choice == 0:\n",
 42 |     "                self.move(x=1, y=1)\n",
 43 |     "            elif choice == 1:\n",
 44 |     "                self.move(x=-1, y=-1)\n",
 45 |     "            elif choice == 2:\n",
 46 |     "                self.move(x=-1, y=1)\n",
 47 |     "            elif choice == 3:\n",
 48 |     "                self.move(x=1, y=-1)\n",
 49 |     "\n",
 50 |     "        else:\n",
 51 |     "            if choice == 0:\n",
 52 |     "                self.move(x=0, y=1)\n",
 53 |     "            elif choice == 1:\n",
 54 |     "                self.move(x=0, y=-1)\n",
 55 |     "            elif choice == 2:\n",
 56 |     "                self.move(x=-1, y=0)\n",
 57 |     "            elif choice == 3:\n",
 58 |     "                self.move(x=1, y=0)\n",
 59 |     "\n",
 60 |     "\n",
 61 |     "    def move(self, x=-100, y=-100):\n",
 62 |     "\n",
 63 |     "        if x == -100:\n",
 64 |     "            self.x += np.random.randint(-1, 2)\n",
 65 |     "        else:\n",
 66 |     "            self.x += x\n",
 67 |     "\n",
 68 |     "        if y == -100:\n",
 69 |     "            self.y += np.random.randint(-1, 2)\n",
 70 |     "        else:\n",
 71 |     "            self.y += y\n",
 72 |     "\n",
 73 |     "        if self.x < 0:\n",
 74 |     "            self.x = 0\n",
 75 |     "        elif self.x > self.size-1:\n",
 76 |     "            self.x = self.size-1\n",
 77 |     "        if self.y < 0:\n",
 78 |     "            self.y = 0\n",
 79 |     "        elif self.y > self.size-1:\n",
 80 |     "            self.y = self.size-1\n",
 81 |     "\n",
 82 |     "\n",
 83 |     "class ENVIRONMENT():\n",
 84 |     "\n",
 85 |     "\n",
 86 |     "\n",
 87 |     "    def __init__(self, num_player=1, num_enemy=1, num_food=1, size = 10, diagonal = False):\n",
 88 |     "        self.size = size\n",
 89 |     "        self.naction = 4\n",
 90 |     "        self.diagonal = diagonal\n",
 91 |     "        self.num_enemy = num_enemy\n",
 92 |     "        self.num_food = num_food\n",
 93 |     "        self.player = Blob(size)\n",
 94 |     "        self.enemy = [Blob() for _ in range(self.num_enemy)]\n",
 95 |     "        self.food = [Blob() for _ in range(self.num_food)]\n",
 96 |     "        self.reward = 0\n",
 97 |     "        self.colors = {1: (255, 0, 0),\n",
 98 |     "         2: (0, 255, 0),\n",
 99 |     "         3: (0, 0, 255)}\n",
100 |     "        self.px,self.py = self.player.x,self.player.y\n",
101 |     "        self.ex,self.ey = [self.enemy[iter].x for iter in range(self.num_enemy)], [self.enemy[iter].y for iter in range(self.num_enemy)]\n",
102 |     "        self.fx,self.fy = [self.food[iter].x for iter in range(self.num_food)], [self.food[iter].y for iter in range(self.num_food)]\n",
103 |     "\n",
104 |     "\n",
105 |     "    def startover(self, newpos=False):\n",
106 |     "\n",
107 |     "        self.player.x, self.player.y = self.px, self.py\n",
108 |     "        for iter in range(self.num_enemy):\n",
109 |     "            self.enemy[iter].x, self.enemy[iter].y = self.ex[iter], self.ey[iter]\n",
110 |     "        for iter in range(self.num_food):\n",
111 |     "            self.food[iter].x, self.food[iter].y = self.fx[iter], self.fy[iter]\n",
112 |     "        if newpos == True:\n",
113 |     "            self.player = Blob(self.size)\n",
114 |     "        self.reward = 0\n",
115 |     "\n",
116 |     "        return (self.player.x, self.player.y), self.reward, False\n",
117 |     "\n",
118 |     "    def step(self, action):\n",
119 |     "\n",
120 |     "        self.player.act(action, self.diagonal)\n",
121 |     "        self.reward = self.calculate_reward()\n",
122 |     "        return (self.player.x, self.player.y), self.reward\n",
123 |     "\n",
124 |     "    def calculate_reward(self):\n",
125 |     "\n",
126 |     "        if self.player.x in [self.enemy[iter].x for iter in range(self.num_enemy)] and self.player.y in [self.enemy[iter].y for iter in range(self.num_enemy)]:\n",
127 |     "            return -100, True\n",
128 |     "\n",
129 |     "        if self.player.x in [self.food[iter].x for iter in range(self.num_food)] and self.player.y in [self.food[iter].y for iter in range(self.num_food)]:\n",
130 |     "            return 100, True\n",
131 |     "\n",
132 |     "        else:\n",
133 |     "            return -1, False\n",
134 |     "\n",
135 |     "\n",
136 |     "    def render(self,renderTime=100):\n",
137 |     "\n",
138 |     "        env = np.zeros((self.size, self.size, 3), dtype=np.uint8)\n",
139 |     "        for iter in range(self.num_food):\n",
140 |     "            env[self.food[iter].x][self.food[iter].y] = self.colors[2]\n",
141 |     "        for iter in range(self.num_enemy):\n",
142 |     "            env[self.enemy[iter].x][self.enemy[iter].y] = self.colors[3]\n",
143 |     "        env[self.player.x][self.player.y] = self.colors[1]\n",
144 |     "        img = Image.fromarray(env, 'RGB')\n",
145 |     "        img = img.resize((300, 300))\n",
146 |     "        cv2.imshow(\"image\", np.array(img))\n",
147 |     "        cv2.waitKey(renderTime)\n",
148 |     "        # cv2.destroyAllWindows()\n",
149 |     "\n",
150 |     "    def sample_action(self):\n",
151 |     "        return np.random.randint(0, self.naction)"
152 |    ]
153 |   },
154 |   {
155 |    "cell_type": "code",
156 |    "execution_count": 10,
157 |    "metadata": {},
158 |    "outputs": [
159 |     {
160 |      "data": {
161 |       "text/plain": [
162 |        "\"\\nActions \\ndiagonal = True\\n0 = down_right\\n1 = up_left\\n2 = up_right\\n3 = down_left\\nWhen space is not available action = action.split('_')[0]\\n\\nEnvironment\\nplayer = Blue\\nenemy = red\\ngoal = green\\n\\nIf a player is on \\nan enemy reward at that time step = -100\\nthe goal reward at that time step = 100\\nfor every other time step reward is = -1\\n\""
163 |       ]
164 |      },
165 |      "execution_count": 10,
166 |      "metadata": {},
167 |      "output_type": "execute_result"
168 |     }
169 |    ],
170 |    "source": [
171 |     "env = ENVIRONMENT(diagonal=True, size=10, num_enemy = 3, num_food = 1)\n",
172 |     "episodes = 25000\n",
173 |     "nS = 100\n",
174 |     "nA = 4\n",
175 |     "learning_rate = 0.01\n",
176 |     "gamma = 0.9\n",
177 |     "epsilon = 0.95\n",
178 |     "\"\"\"\n",
179 |     "Actions \n",
180 |     "diagonal = True\n",
181 |     "0 = down_right\n",
182 |     "1 = up_left\n",
183 |     "2 = up_right\n",
184 |     "3 = down_left\n",
185 |     "When space is not available action = action.split('_')[0]\n",
186 |     "\n",
187 |     "Environment\n",
188 |     "player = Blue\n",
189 |     "enemy = red\n",
190 |     "goal = green\n",
191 |     "\n",
192 |     "If a player is on \n",
193 |     "an enemy reward at that time step = -100\n",
194 |     "the goal reward at that time step = 100\n",
195 |     "for every other time step reward is = -1\n",
196 |     "\"\"\""
197 |    ]
198 |   },
199 |   {
200 |    "cell_type": "code",
201 |    "execution_count": 11,
202 |    "metadata": {},
203 |    "outputs": [],
204 |    "source": [
205 |     "def E_policy(q,s,epsilon):\n",
206 |     "    r = random.random()\n",
207 |     "    if r<epsilon:\n",
208 |     "        a = env.sample_action()\n",
209 |     "    else:\n",
210 |     "        a = np.argmax(q[s])\n",
211 |     "    return a"
212 |    ]
213 |   },
214 |   {
215 |    "cell_type": "code",
216 |    "execution_count": 12,
217 |    "metadata": {},
218 |    "outputs": [],
219 |    "source": [
220 |     "Q = np.zeros([nS,nA])\n",
221 |     "for e in range(episodes):\n",
222 |     "    pos_i,r0,term = env.startover(newpos=True)\n",
223 |     "    s_i = pos_i[0]*10 + pos_i[1]\n",
224 |     "    a_i = E_policy(Q,s_i,epsilon)\n",
225 |     "    while term == False:\n",
226 |     "        pos_f,(rf,term) = env.step(a_i)\n",
227 |     "        s_f = pos_f[0]*10 + pos_f[1]\n",
228 |     "        a_f = E_policy(Q,s_f,epsilon)\n",
229 |     "        Q[s_i,a_i] = (1-learning_rate)*Q[s_i,a_i] + learning_rate*(rf + gamma*Q[s_f,a_f])\n",
230 |     "        a_i = a_f\n",
231 |     "        s_i = s_f                                                           \n",
232 |     "    #if (e+1)%500 == 0:\n",
233 |     "    #    print('current episode = ',e)\n",
234 |     "    epsilon = epsilon*0.9998 "
235 |    ]
236 |   },
237 |   {
238 |    "cell_type": "code",
239 |    "execution_count": 13,
240 |    "metadata": {},
241 |    "outputs": [],
242 |    "source": [
243 |     "pol = np.zeros(nS)\n",
244 |     "for s in range(nS):\n",
245 |     "    pol[s] = np.argmax(Q[s])"
246 |    ]
247 |   },
248 |   {
249 |    "cell_type": "code",
250 |    "execution_count": 14,
251 |    "metadata": {},
252 |    "outputs": [],
253 |    "source": [
254 |     "def play():\n",
255 |     "    pos_i,k,ter = env.startover(newpos=True)\n",
256 |     "    env.render()\n",
257 |     "    T = False\n",
258 |     "    i = 0\n",
259 |     "    while T == False and i<=20:   \n",
260 |     "        s = pos_i[0]*10 + pos_i[1]\n",
261 |     "        #print(pos_i)\n",
262 |     "        pos_i,R = env.step(pol[s])\n",
263 |     "        #print(R)\n",
264 |     "        T=R[1]\n",
265 |     "        env.render(500)\n",
266 |     "        i = i+1\n",
267 |     "    cv2.destroyAllWindows()"
268 |    ]
269 |   },
270 |   {
271 |    "cell_type": "code",
272 |    "execution_count": 7,
273 |    "metadata": {},
274 |    "outputs": [],
275 |    "source": [
276 |     "cv2.destroyAllWindows()"
277 |    ]
278 |   },
279 |   {
280 |    "cell_type": "code",
281 |    "execution_count": 15,
282 |    "metadata": {
283 |     "scrolled": true
284 |    },
285 |    "outputs": [],
286 |    "source": [
287 |     "for i in range(20):\n",
288 |     "    play()    "
289 |    ]
290 |   },
291 |   {
292 |    "cell_type": "code",
293 |    "execution_count": null,
294 |    "metadata": {},
295 |    "outputs": [],
296 |    "source": []
297 |   }
298 |  ],
299 |  "metadata": {
300 |   "kernelspec": {
301 |    "display_name": "Python 3",
302 |    "language": "python",
303 |    "name": "python3"
304 |   },
305 |   "language_info": {
306 |    "codemirror_mode": {
307 |     "name": "ipython",
308 |     "version": 3
309 |    },
310 |    "file_extension": ".py",
311 |    "mimetype": "text/x-python",
312 |    "name": "python",
313 |    "nbconvert_exporter": "python",
314 |    "pygments_lexer": "ipython3",
315 |    "version": "3.7.4"
316 |   }
317 |  },
318 |  "nbformat": 4,
319 |  "nbformat_minor": 2
320 | }
321 | 


--------------------------------------------------------------------------------