├── README.md ├── p0_taxi-v2 ├── README.md ├── agent.py ├── images │ ├── all_perf.png │ ├── expected_sarsa_algo.png │ ├── expected_sarsa_perf.png │ ├── expected_sarsa_update_rule.png │ ├── sarsa_algo.png │ ├── sarsa_perf.png │ ├── sarsa_update_rule.png │ ├── sarsamax_algo.png │ ├── sarsamax_perf.png │ ├── sarsamax_update_rule.png │ ├── taxi-game.gif │ └── taxi_game_gif.gif ├── main.py └── monitor.py ├── p1_navigation ├── Navigation.ipynb ├── README.md ├── config.json ├── dqn_agent.py ├── main.py ├── model.py ├── report.pdf ├── requirements.txt ├── saved │ └── DQN_exp │ │ └── model_trained_solved.pth └── utils.py ├── p2_continuous_control ├── Continuous_Control.ipynb ├── README.md ├── config.json ├── ddpg_agent.py ├── images │ └── reacher_gif.gif ├── models.py ├── report.pdf ├── requirements.txt └── saved │ └── DDPG_exp │ ├── checkpoint_actor_solved.pth │ └── checkpoint_critic_solved.pth └── p3_collab_compet ├── DDPGAgents.py ├── OUNoise.py ├── README.md ├── ReplayBuffer.py ├── Tennis.ipynb ├── config.json ├── images └── tennis_gif.gif ├── report.pdf ├── requirements.txt ├── saved └── DDPGAgents_exp │ ├── checkpoint_actor_solved.pth │ └── checkpoint_critic_solved.pth └── utils.py /README.md: -------------------------------------------------------------------------------- 1 | # Deep Reinforcement Learning Nanodegree Udacity 2 | 3 | This repository contains project files for Udacity's Deep Reinforcement Learning Nanodegree program 4 | 5 | ## Projects 6 | 7 | ### Reinforcement Learning 8 | >[P0_Taxi](https://github.com/vmelan/DRLND-udacity/tree/master/p0_taxi-v2) 9 | 10 | For this project, we use OpenAI Gym Taxi-v2 environment to design an algorithm to teach a taxi agent to navigate a small gridworld using Reinforcement Learning methods. 
11 | 12 | ### Navigation 13 | >[P1_Navigation](https://github.com/vmelan/DRLND-udacity/tree/master/p1_navigation) 14 | 15 | For this project, we train an agent to navigate and collect bananas in a large, square world using a Deep Q-Network (DQN). 16 | 17 | ### Continuous Control 18 | >[P2_Continuous Control](https://github.com/vmelan/DRLND-udacity/tree/master/p2_continuous_control) 19 | 20 | For this project, we train a double-jointed arm agent to follow a target location using Deep Deterministic Policy 21 | Gradient (DDPG). 22 | 23 | ### Collaboration and Competition 24 | >[P3_Collaboration Competition](https://github.com/vmelan/DRLND-udacity/tree/master/p3_collab_compet) 25 | 26 | For this project, we train a pair of agents to play tennis using DDPG with a shared replay buffer. 27 | 28 | -------------------------------------------------------------------------------- /p0_taxi-v2/README.md: -------------------------------------------------------------------------------- 1 | # Project: OpenAI Gym's Taxi-v2 Task 2 | 3 | For this project, we use the OpenAI Gym Taxi-v2 environment to design an 4 | algorithm to teach a taxi agent to navigate a small gridworld. 5 | 6 |

7 | 8 |

9 | 10 | ## Problem Statement 11 | This problem comes from the paper Hierarchical Reinforcement 12 | Learning with the MAXQ Value Function Decomposition by Tom Dietterich (link 13 | to the paper: https://arxiv.org/pdf/cs/9905014.pdf), section 3.1 A Motivation Example. 14 | 15 | There are four specially-designated locations in 16 | this world, marked as R(ed), B(lue), G(reen), and Y(ellow). The taxi problem is episodic. In 17 | each episode, the taxi starts in a randomly-chosen square. There is a passenger at one of the 18 | four locations (chosen randomly), and that passenger wishes to be transported to one of the four 19 | locations (also chosen randomly). The taxi must go to the passenger’s location (the “source”), pick 20 | up the passenger, go to the destination location (the “destination”), and put down the passenger 21 | there. (To keep things uniform, the taxi must pick up and drop off the passenger even if he/she 22 | is already located at the destination!) The episode ends when the passenger is deposited at the 23 | destination location. 24 | 25 | There are six primitive actions in this domain: (a) four navigation actions that move the taxi 26 | one square North, South, East, or West, (b) a Pickup action, and (c) a Putdown action. Each action 27 | is deterministic. There is a reward of −1 for each action and an additional reward of +20 for 28 | successfully delivering the passenger. There is a reward of −10 if the taxi attempts to execute the 29 | Putdown or Pickup actions illegally. If a navigation action would cause the taxi to hit a wall, the 30 | action is a no-op, and there is only the usual reward of −1. 31 | 32 | We seek a policy that maximizes the total reward per episode. There are 500 possible states: 33 | 25 squares, 5 locations for the passenger (counting the four starting locations and the taxi), and 4 34 | destinations. 
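The count of 500 states is the product of the three factors listed above (25 x 5 x 4). A minimal sketch of how such a state could be flattened into a single index, assuming hypothetical helpers `encode_state` and `decode_state` (the names and the ordering of the factors are illustrative and not taken from this repository):

```python
def encode_state(taxi_row, taxi_col, passenger_loc, destination):
    """Map (row, col, passenger location, destination) to an index in [0, 500)."""
    assert 0 <= taxi_row < 5 and 0 <= taxi_col < 5
    assert 0 <= passenger_loc < 5   # four landmarks, plus "inside the taxi"
    assert 0 <= destination < 4     # four landmarks
    return ((taxi_row * 5 + taxi_col) * 5 + passenger_loc) * 4 + destination


def decode_state(index):
    """Inverse of encode_state: recover the four components from the index."""
    index, destination = divmod(index, 4)
    index, passenger_loc = divmod(index, 5)
    taxi_row, taxi_col = divmod(index, 5)
    return taxi_row, taxi_col, passenger_loc, destination


# 25 squares * 5 passenger locations * 4 destinations
n_states = 5 * 5 * 5 * 4
```

A single integer index like this is what makes a tabular Q-table practical: each state is one key, and each key holds six action values.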
35 | 36 | ## Files 37 | - `agent.py`: Agent class in which we develop our reinforcement learning methods 38 | - `monitor.py`: The `interact` function tests how well the agent learns from interaction with the environment 39 | - `main.py`: Main file to run the project and check the performance of the agent 40 | 41 | ## Temporal-Difference (TD) Control Methods 42 | **Monte-Carlo** approaches require running the agent for a whole episode before updating its value estimates. 43 | This is not viable for **continuing** tasks, which have no terminal state, nor for **episodic** tasks where we 44 | do not want to wait for the terminal state before learning from the episode. 45 | 46 | This is where Temporal-Difference (TD) control methods step in: they update estimates based in part 47 | on other learned estimates, without waiting for the final outcome. As such, TD methods update the 48 | **Q-table** after every time step. 49 | 50 | ### Sarsa 51 | The Sarsa update rule is the following: 52 | 53 |

54 | 55 |

56 | 57 | Notice that the action-value update uses the **S**tate, **A**ction, **R**eward, next **S**tate and next **A**ction, hence the name 58 | of the algorithm: **Sarsa(0)**, or simply **Sarsa**. 59 | 60 |
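The five quantities named above map directly onto the arguments of a one-step update. A minimal sketch, assuming a hypothetical `sarsa_update` function (not part of `agent.py`) and reusing the `alpha = 0.1` and `gamma = 0.9` values from the agent as defaults:

```python
from collections import defaultdict

import numpy as np


def sarsa_update(Q, state, action, reward, next_state, next_action,
                 alpha=0.1, gamma=0.9, done=False):
    """Apply one Sarsa(0) update to Q[state][action] in place and return the new value."""
    # Value of the next state-action pair actually chosen; zero once the episode has ended.
    next_value = 0.0 if done else Q[next_state][next_action]
    td_target = reward + gamma * next_value
    Q[state][action] += alpha * (td_target - Q[state][action])
    return Q[state][action]


# Tiny usage example with a Q-table of six actions per state.
Q = defaultdict(lambda: np.zeros(6))
Q[1][2] = 1.0
new_value = sarsa_update(Q, state=0, action=3, reward=-1, next_state=1, next_action=2)
```

Because the target uses the next action the agent really takes, the caller has to select `next_action` before calling the update, which is what distinguishes Sarsa from Sarsamax.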

61 | 62 |

63 | 64 | Here is the performance of Sarsa on the Taxi task : 65 |

66 | 67 |

68 | The average reward over the last 100 episodes keeps improving until the 2000th episode, where it 69 | converges and stops improving. 70 | 71 | ### Expected Sarsa 72 | The Expected Sarsa update rule is the following: 73 | 74 |

75 | 76 |

77 | Expected Sarsa uses the expected value of the next state-action pair, where the expectation takes into account the probability that the agent selects each possible action from the next state. 78 | 79 |
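The expectation described above can be computed as a dot product between the policy's action probabilities and the action values of the next state. A minimal sketch, assuming an epsilon-greedy policy and two hypothetical helpers, `epsilon_greedy_probs` and `expected_sarsa_update` (neither is part of `agent.py`):

```python
from collections import defaultdict

import numpy as np


def epsilon_greedy_probs(q_values, eps):
    """Action probabilities of an epsilon-greedy policy for one state."""
    n_actions = len(q_values)
    probs = np.full(n_actions, eps / n_actions)
    probs[np.argmax(q_values)] += 1.0 - eps
    return probs


def expected_sarsa_update(Q, state, action, reward, next_state, eps,
                          alpha=0.1, gamma=0.9, done=False):
    """Apply one Expected Sarsa update to Q[state][action] in place and return the new value."""
    if done:
        expected_value = 0.0
    else:
        # Weight each next action value by the probability of selecting that action.
        probs = epsilon_greedy_probs(Q[next_state], eps)
        expected_value = float(np.dot(probs, Q[next_state]))
    td_target = reward + gamma * expected_value
    Q[state][action] += alpha * (td_target - Q[state][action])
    return Q[state][action]


# Tiny usage example with a Q-table of six actions per state.
Q = defaultdict(lambda: np.zeros(6))
Q[1][2] = 6.0
new_value = expected_sarsa_update(Q, state=0, action=0, reward=-1, next_state=1, eps=0.6)
```

Unlike Sarsa, this update does not need the next action to be sampled first, only the policy's probabilities in the next state.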

80 | 81 |

82 | 83 | Here is the performance of Expected Sarsa on the Taxi task : 84 |

85 | 86 |

87 | The resulting graph is noisier than Sarsa's, due to the fact that we are averaging over all the possible 88 | actions in the next state. Convergence takes more time, with small, gradual improvements along the way. 89 | 90 | ### Sarsamax (or Q-Learning) 91 | The Sarsamax (or Q-Learning) update rule is the following: 92 | 93 |

94 | 95 |

96 | In Sarsamax, the update rule attempts to approximate the optimal value function 97 | at every time step. 98 | 99 |

100 | 101 |

102 | Here is the performance of Sarsamax on the Taxi task: 103 |

104 | 105 |

106 | Sarsamax is smoother and follows the same trend as Sarsa. 107 | 108 | ### Overview 109 | Sarsa, Expected Sarsa and Sarsamax were each trained for 20000 episodes, and their performance can be 110 | compared on the same graph: 111 |

112 | 113 |

-------------------------------------------------------------------------------- /p0_taxi-v2/agent.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | from collections import defaultdict 3 | 4 | class Agent: 5 | 6 | def __init__(self, nA=6): 7 | """ Initialize agent. 8 | 9 | Params 10 | ====== 11 | - nA: number of actions available to the agent 12 | """ 13 | self.nA = nA 14 | self.Q = defaultdict(lambda: np.zeros(self.nA)) 15 | 16 | self.eps = 1.0 17 | self.eps_decay = 0.99 18 | self.eps_min = 0.005 19 | 20 | self.alpha = 0.1 21 | self.gamma = 0.9 22 | 23 | def get_policy(self, Q_s): 24 | """ Obtain the action probabilities corresponding to epsilon-greedy policies """ 25 | self.eps = max(self.eps*self.eps_decay, self.eps_min) 26 | policy_s = np.ones(self.nA) * (self.eps / self.nA) 27 | best_a = np.argmax(Q_s) 28 | policy_s[best_a] = 1 - self.eps + (self.eps / self.nA) 29 | 30 | return policy_s 31 | 32 | def select_action(self, state): 33 | """ Given the state, select an action. 34 | 35 | Params 36 | ====== 37 | - state: the current state of the environment 38 | 39 | Returns 40 | ======= 41 | - action: an integer, compatible with the task's action space 42 | """ 43 | policy_s = self.get_policy(self.Q[state]) 44 | action = np.random.choice(np.arange(self.nA), p=policy_s) 45 | 46 | return action 47 | 48 | def step(self, state, action, reward, next_state, done): 49 | """ Update the agent's knowledge, using the most recently sampled tuple. 
50 | 51 | Params 52 | ====== 53 | - state: the previous state of the environment 54 | - action: the agent's previous choice of action 55 | - reward: last reward received 56 | - next_state: the current state of the environment 57 | - done: whether the episode is complete (True or False) 58 | 59 | """ 60 | ## Using update rule of Sarsamax (Q-Learning) 61 | 62 | if not done: 63 | self.Q[state][action] = self.Q[state][action] + self.alpha * (reward + (self.gamma * np.max(self.Q[next_state])) - self.Q[state][action]) 64 | if done: 65 | self.Q[state][action] = self.Q[state][action] + self.alpha * (reward + self.gamma * 0 - self.Q[state][action]) 66 | 67 | -------------------------------------------------------------------------------- /p0_taxi-v2/images/all_perf.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/vmelan/DRLND-udacity/8d2a38d2894b0f692fc6ffbe652eb94440b516de/p0_taxi-v2/images/all_perf.png -------------------------------------------------------------------------------- /p0_taxi-v2/images/expected_sarsa_algo.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/vmelan/DRLND-udacity/8d2a38d2894b0f692fc6ffbe652eb94440b516de/p0_taxi-v2/images/expected_sarsa_algo.png -------------------------------------------------------------------------------- /p0_taxi-v2/images/expected_sarsa_perf.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/vmelan/DRLND-udacity/8d2a38d2894b0f692fc6ffbe652eb94440b516de/p0_taxi-v2/images/expected_sarsa_perf.png -------------------------------------------------------------------------------- /p0_taxi-v2/images/expected_sarsa_update_rule.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/vmelan/DRLND-udacity/8d2a38d2894b0f692fc6ffbe652eb94440b516de/p0_taxi-v2/images/expected_sarsa_update_rule.png -------------------------------------------------------------------------------- /p0_taxi-v2/images/sarsa_algo.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/vmelan/DRLND-udacity/8d2a38d2894b0f692fc6ffbe652eb94440b516de/p0_taxi-v2/images/sarsa_algo.png -------------------------------------------------------------------------------- /p0_taxi-v2/images/sarsa_perf.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/vmelan/DRLND-udacity/8d2a38d2894b0f692fc6ffbe652eb94440b516de/p0_taxi-v2/images/sarsa_perf.png -------------------------------------------------------------------------------- /p0_taxi-v2/images/sarsa_update_rule.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/vmelan/DRLND-udacity/8d2a38d2894b0f692fc6ffbe652eb94440b516de/p0_taxi-v2/images/sarsa_update_rule.png -------------------------------------------------------------------------------- /p0_taxi-v2/images/sarsamax_algo.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/vmelan/DRLND-udacity/8d2a38d2894b0f692fc6ffbe652eb94440b516de/p0_taxi-v2/images/sarsamax_algo.png -------------------------------------------------------------------------------- /p0_taxi-v2/images/sarsamax_perf.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/vmelan/DRLND-udacity/8d2a38d2894b0f692fc6ffbe652eb94440b516de/p0_taxi-v2/images/sarsamax_perf.png -------------------------------------------------------------------------------- /p0_taxi-v2/images/sarsamax_update_rule.png: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/vmelan/DRLND-udacity/8d2a38d2894b0f692fc6ffbe652eb94440b516de/p0_taxi-v2/images/sarsamax_update_rule.png -------------------------------------------------------------------------------- /p0_taxi-v2/images/taxi-game.gif: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/vmelan/DRLND-udacity/8d2a38d2894b0f692fc6ffbe652eb94440b516de/p0_taxi-v2/images/taxi-game.gif -------------------------------------------------------------------------------- /p0_taxi-v2/images/taxi_game_gif.gif: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/vmelan/DRLND-udacity/8d2a38d2894b0f692fc6ffbe652eb94440b516de/p0_taxi-v2/images/taxi_game_gif.gif -------------------------------------------------------------------------------- /p0_taxi-v2/main.py: -------------------------------------------------------------------------------- 1 | from agent import Agent 2 | from monitor import interact 3 | import gym 4 | import numpy as np 5 | import matplotlib.pyplot as plt 6 | 7 | import sys 8 | from collections import defaultdict 9 | import time 10 | 11 | def plot_performance(num_episodes, avg_rewards, label, disp_plot=True): 12 | plt.plot(np.linspace(0, num_episodes, len(avg_rewards),endpoint=False), np.asarray(avg_rewards), label=label) 13 | plt.xlabel('Episode Number') 14 | plt.ylabel('Average Reward (Over Next %d Episodes)' % (100)) 15 | plt.title(label + " " + "performance") 16 | if disp_plot: plt.show() 17 | 18 | def plot_all_performances(num_episodes, all_avg_rewards, title): 19 | for (avg_reward, method) in zip(all_avg_rewards, ['Sarsa', 'Expected Sarsa', 'Sarsamax (Q-Learning)']): 20 | plot_performance(num_episodes, avg_reward, method, disp_plot=False) 21 | plt.title(title) 22 | plt.legend(loc='best') 23 | plt.show() 24 | 25 | def main(): 26 | env = 
gym.make('Taxi-v2') 27 | num_episodes = 20000 28 | 29 | ## Sarsa 30 | agent = Agent(method='Sarsa') 31 | sarsa_avg_rewards, sarsa_best_avg_reward = interact(env, agent, num_episodes=num_episodes) 32 | plot_performance(num_episodes, sarsa_avg_rewards, "Sarsa", disp_plot=True) 33 | 34 | # ## Expected Sarsa 35 | agent = Agent(method='Expected Sarsa') 36 | exp_sarsa_avg_rewards, exp_sarsa_best_avg_reward = interact(env, agent, num_episodes=num_episodes) 37 | plot_performance(num_episodes, exp_sarsa_avg_rewards, "Expected Sarsa", disp_plot=True) 38 | 39 | ## Q-Learning 40 | agent = Agent(method='Q-Learning') 41 | sarsamax_avg_rewards, sarsamax_best_avg_reward = interact(env, agent, num_episodes=num_episodes) 42 | plot_performance(num_episodes, sarsamax_avg_rewards, "Sarsamax (Q-Learning)", disp_plot=True) 43 | 44 | ## All performances 45 | plot_all_performances(num_episodes, [sarsa_avg_rewards, exp_sarsa_avg_rewards, sarsamax_avg_rewards], 46 | title="Comparison of Temporal Difference control methods") 47 | 48 | if __name__ == '__main__': 49 | main() -------------------------------------------------------------------------------- /p0_taxi-v2/monitor.py: -------------------------------------------------------------------------------- 1 | from collections import deque 2 | import sys 3 | import math 4 | import numpy as np 5 | 6 | def interact(env, agent, num_episodes=20000, window=100): 7 | # def interact(env, agent, num_episodes=1000, window=100): 8 | 9 | """ Monitor agent's performance. 
10 | 11 | Params 12 | ====== 13 | - env: instance of OpenAI Gym's Taxi-v2 environment 14 | - agent: instance of class Agent (see agent.py for details) 15 | - num_episodes: number of episodes of agent-environment interaction 16 | - window: number of episodes to consider when calculating average rewards 17 | 18 | Returns 19 | ======= 20 | - avg_rewards: deque containing average rewards 21 | - best_avg_reward: largest value in the avg_rewards deque 22 | """ 23 | # initialize average rewards 24 | avg_rewards = deque(maxlen=num_episodes) 25 | # initialize best average reward 26 | best_avg_reward = -math.inf 27 | # initialize monitor for most recent rewards 28 | samp_rewards = deque(maxlen=window) 29 | # for each episode 30 | for i_episode in range(1, num_episodes+1): 31 | # begin the episode 32 | state = env.reset() 33 | # initialize the sampled reward 34 | samp_reward = 0 35 | while True: 36 | # agent selects an action 37 | action = agent.select_action(state) 38 | # agent performs the selected action 39 | next_state, reward, done, _ = env.step(action) 40 | # agent performs internal updates based on sampled experience 41 | agent.step(state, action, reward, next_state, done) 42 | # update the sampled reward 43 | samp_reward += reward 44 | # update the state (s <- s') to next time step 45 | state = next_state 46 | if done: 47 | # save final sampled reward 48 | samp_rewards.append(samp_reward) 49 | break 50 | if (i_episode >= 100): 51 | # get average reward from last 100 episodes 52 | avg_reward = np.mean(samp_rewards) 53 | # append to deque 54 | avg_rewards.append(avg_reward) 55 | # update best average reward 56 | if avg_reward > best_avg_reward: 57 | best_avg_reward = avg_reward 58 | # monitor progress 59 | print("\rEpisode {}/{} || Best average reward {}".format(i_episode, num_episodes, best_avg_reward), end="") 60 | sys.stdout.flush() 61 | # check if task is solved (according to OpenAI Gym) 62 | if best_avg_reward >= 9.7: 63 | print('\nEnvironment solved in {} 
episodes.'.format(i_episode), end="") 64 | break 65 | if i_episode == num_episodes: print('\n') 66 | 67 | return avg_rewards, best_avg_reward -------------------------------------------------------------------------------- /p1_navigation/README.md: -------------------------------------------------------------------------------- 1 | # Project : Navigation 2 | 3 | ## Description 4 | For this project, we train an agent to navigate and collect bananas in a large, 5 | square world. 6 | 7 | ## Problem statement 8 | A reward of +1 is provided for collecting a yellow banana, and a reward of -1 is provided 9 | for collecting a blue banana. Thus, the goal of the agent is to collect 10 | as many yellow bananas as possible while avoiding blue bananas. 11 | 12 | The state space has 37 dimensions, and contains the agent's velocity, along 13 | with ray-based perception of objects around the agent's forward 14 | direction. Given this information, the agent has to learn how to best select 15 | actions. 16 | Four discrete actions are available, corresponding to: 17 | - `0` - move forward 18 | - `1` - move backward 19 | - `2` - turn left 20 | - `3` - turn right 21 | The task is episodic, and in order to solve the environment, the 22 | agent must get an average score of +13 over 100 consecutive episodes. 
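The solving criterion above is a rolling average over the most recent 100 episode scores. A minimal sketch of that check, assuming a hypothetical helper `first_solved_episode` (not part of this repository) that only evaluates the average once a full window of scores is available:

```python
from collections import deque

import numpy as np


def first_solved_episode(scores, target=13.0, window=100):
    """Return the 1-based episode at which the mean of the last `window`
    scores first reaches `target`, or None if that never happens."""
    recent = deque(maxlen=window)
    for i_episode, score in enumerate(scores, start=1):
        recent.append(score)
        # Only judge the average once `window` consecutive episodes are available.
        if len(recent) == window and np.mean(recent) >= target:
            return i_episode
    return None
```

For example, 50 episodes scoring 0 followed by episodes scoring 14 reach an average of 13.02 at episode 143, the first point where the last 100 scores contain 93 scores of 14.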
23 | 24 | ## Files 25 | - `Navigation.ipynb`: Notebook used to control and train the agent 26 | - `main.py`: Main script used to control and train the agent for experimentation 27 | - `dqn_agent.py`: Agent class that interacts with and learns from the environment 28 | - `model.py`: Q-network class used to map state to action values 29 | - `config.json`: Configuration file to store variables and paths 30 | - `utils.py`: Helper functions 31 | - `report.pdf`: Technical report 32 | 33 | ## Dependencies 34 | To run this code, you will need an environment with Python 3. 35 | The dependencies are listed in the `requirements.txt` file and can be installed 36 | using the following command: 37 | ``` 38 | pip install -r requirements.txt 39 | ``` 40 | 41 | Furthermore, you need to download the environment from one of the links below. You only need to select 42 | the environment that matches your operating system: 43 | - Linux : [link](https://s3-us-west-1.amazonaws.com/udacity-drlnd/P1/Banana/Banana_Linux.zip) 44 | - Mac OSX : [link](https://s3-us-west-1.amazonaws.com/udacity-drlnd/P1/Banana/Banana.app.zip) 45 | - Windows : [link](https://s3-us-west-1.amazonaws.com/udacity-drlnd/P1/Banana/Banana_Windows_x86_64.zip) 46 | 47 | ## Running 48 | Run the cells in the notebook `Navigation.ipynb` to train an agent that solves our required 49 | task of collecting bananas. 
-------------------------------------------------------------------------------- /p1_navigation/config.json: -------------------------------------------------------------------------------- 1 | { 2 | "exp_name": "DQN_exp", 3 | "cuda": false, 4 | "gpu": 0, 5 | 6 | "optimizer": { 7 | "optimizer_type": "Adam", 8 | "betas": [0.9, 0.999], 9 | "optimizer_params": { 10 | "lr": 0.0005, 11 | "eps": 1e-7, 12 | "weight_decay": 0 13 | } 14 | }, 15 | 16 | "GLIE": { 17 | "eps_start": 1.0, 18 | "eps_end": 0.005, 19 | "eps_decay": 0.999 20 | }, 21 | 22 | "DQN": { 23 | "gamma": 0.99, 24 | "tau": 1e-3, 25 | "update_every": 4, 26 | "buffer_size": 5e4 27 | }, 28 | 29 | "architecture": { 30 | "hidden_layers_units": [500, 200, 100], 31 | "use_dropout": false, 32 | "dropout_proba": 0.5 33 | }, 34 | 35 | "trainer": { 36 | "num_episodes": 2000, 37 | "batch_size": 32, 38 | "max_timesteps_per_ep": 1000, 39 | "save_dir": "./saved/", 40 | "save_trained_name": "model_trained", 41 | "save_freq": 500, 42 | "verbose": 1 43 | } 44 | 45 | } -------------------------------------------------------------------------------- /p1_navigation/dqn_agent.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import random 3 | from collections import namedtuple, deque 4 | import logging 5 | 6 | from model import QNetwork 7 | 8 | import torch 9 | import torch.nn.functional as F 10 | import torch.optim as optim 11 | from utils import pick_device 12 | 13 | import pdb 14 | 15 | class Agent(): 16 | """ Agent used to interact with and learns from the environment """ 17 | 18 | def __init__(self, state_size, action_size, config): 19 | """ Initialize an Agent object """ 20 | 21 | self.state_size = state_size 22 | self.action_size = action_size 23 | self.config = config 24 | 25 | # logging for this class 26 | self.logger = logging.getLogger(self.__class__.__name__) 27 | 28 | # gpu support 29 | self.device = pick_device(config, self.logger) 30 | 31 | ## Q-Networks 
32 | self.qnetwork_local = QNetwork(state_size, action_size, config).to(self.device) 33 | self.qnetwork_target = QNetwork(state_size, action_size, config).to(self.device) 34 | 35 | ## Get optimizer for local network 36 | self.optimizer = getattr(optim, config["optimizer"]["optimizer_type"])( 37 | self.qnetwork_local.parameters(), 38 | betas=tuple(config["optimizer"]["betas"]), 39 | **config["optimizer"]["optimizer_params"]) 40 | 41 | ## Replay memory 42 | self.memory = ReplayBuffer( 43 | config=config, 44 | action_size=action_size, 45 | buffer_size=int(config["DQN"]["buffer_size"]), 46 | batch_size=config["trainer"]["batch_size"] 47 | ) 48 | 49 | ## Initialize time step (for update every `update_every` steps) 50 | self.t_step = 0 51 | 52 | 53 | def step(self, state, action, reward, next_state, done): 54 | 55 | # Save experience in replay memory 56 | self.memory.add(state, action, reward, next_state, done) 57 | 58 | # Learn every `update_every` time steps 59 | self.t_step = (self.t_step + 1) % self.config["DQN"]["update_every"] 60 | if (self.t_step == 0): 61 | # If enough samples are available in memory, get random subset and learn 62 | if len(self.memory) > self.config["trainer"]["batch_size"]: 63 | experiences = self.memory.sample() 64 | self.learn(experiences, self.config["DQN"]["gamma"]) 65 | 66 | 67 | 68 | def act(self, state, epsilon): 69 | """ Returns actions for given state as per current policy """ 70 | # pdb.set_trace() 71 | 72 | # Convert state to tensor 73 | state = torch.from_numpy(state).float().unsqueeze(0).to(self.device) 74 | 75 | ## Evaluation mode 76 | self.qnetwork_local.eval() 77 | with torch.no_grad(): 78 | # Forward pass of local qnetwork 79 | action_values = self.qnetwork_local.forward(state) 80 | 81 | ## Training mode 82 | self.qnetwork_local.train() 83 | # Epsilon-greedy action selection 84 | if random.random() > epsilon: 85 | # Choose the best action (exploitation) 86 | return np.argmax(action_values.cpu().data.numpy()) 87 | else: 88 | # 
Choose random action (exploration) 89 | return random.choice(np.arange(self.action_size)) 90 | 91 | 92 | def learn(self, experiences, gamma): 93 | """ Update value parameters using given batch of experience tuples """ 94 | 95 | states, actions, rewards, next_states, dones = experiences 96 | 97 | ## TD target 98 | # Get max predicted Q-values (for next states) from target model 99 | # Q_targets_next = torch.argmax(self.qnetwork_target(next_states).detach(), dim=1).unsqueeze(1) 100 | Q_targets_next = self.qnetwork_target(next_states).detach().max(1)[0].unsqueeze(1) 101 | Q_targets_next = Q_targets_next.float() # cast to float32 without moving the tensor off its device 102 | 103 | # Compute Q-targets for current states 104 | Q_targets = rewards + (gamma * Q_targets_next * (1 - dones)) 105 | 106 | ## old value 107 | # Get expected Q-values from local model 108 | Q_expected = torch.gather(self.qnetwork_local(states), dim=1, index=actions) 109 | 110 | # Compute loss 111 | loss = F.mse_loss(Q_expected, Q_targets) 112 | # Minimize loss 113 | self.optimizer.zero_grad() 114 | loss.backward() 115 | self.optimizer.step() 116 | 117 | # update target network with a soft update 118 | self.soft_update(self.qnetwork_local, self.qnetwork_target, self.config["DQN"]["tau"]) 119 | 120 | 121 | 122 | def soft_update(self, local_model, target_model, tau): 123 | """ 124 | Soft update model parameters 125 | θ_target = τ*θ_local + (1 - τ)*θ_target 126 | 127 | Parameters 128 | ---------- 129 | local_model (PyTorch model): weights will be copied from 130 | target_model (PyTorch model): weights will be copied to 131 | tau (float): interpolation parameter 132 | """ 133 | 134 | for target_param, local_param in zip(target_model.parameters(), local_model.parameters()): 135 | target_param.data.copy_(tau*local_param.data + (1.0 - tau)*target_param.data) 136 | 137 | 138 | class ReplayBuffer(): 139 | """ Fixed-size buffer to store experience tuples """ 140 | 141 | def __init__(self, config, action_size, buffer_size, batch_size): 142 | """ 
Initialize a ReplayBuffer object """ 143 | 144 | self.config = config 145 | self.action_size = action_size 146 | self.memory = deque(maxlen=buffer_size) 147 | self.batch_size = batch_size 148 | self.experience = namedtuple("Experience", 149 | field_names=["state", "action", "reward", "next_state", "done"]) 150 | 151 | # logging for this class 152 | self.logger = logging.getLogger(self.__class__.__name__) 153 | 154 | # gpu support 155 | self.device = pick_device(config, self.logger) 156 | 157 | 158 | def add(self, state, action, reward, next_state, done): 159 | """ Add a new experience to memory """ 160 | e = self.experience(state, action, reward, next_state, done) 161 | self.memory.append(e) 162 | 163 | def sample(self): 164 | """ Randomly sample a batch of experiences from memory """ 165 | experiences = random.sample(self.memory, k=self.batch_size) 166 | 167 | states = torch.from_numpy( 168 | np.vstack([e.state for e in experiences if e is not None]) 169 | ).float().to(self.device) 170 | actions = torch.from_numpy( 171 | np.vstack([e.action for e in experiences if e is not None]) 172 | ).long().to(self.device) 173 | rewards = torch.from_numpy( 174 | np.vstack([e.reward for e in experiences if e is not None]) 175 | ).float().to(self.device) 176 | next_states = torch.from_numpy( 177 | np.vstack([e.next_state for e in experiences if e is not None]) 178 | ).float().to(self.device) 179 | dones = torch.from_numpy( 180 | np.vstack([e.done for e in experiences if e is not None]).astype(np.uint8) 181 | ).float().to(self.device) 182 | 183 | return (states, actions, rewards, next_states, dones) 184 | 185 | 186 | def __len__(self): 187 | """ Return the current size of internal memory """ 188 | return len(self.memory) 189 | -------------------------------------------------------------------------------- /p1_navigation/main.py: -------------------------------------------------------------------------------- 1 | import json 2 | import logging 3 | import torch 4 | import numpy as 
np 5 | from collections import deque 6 | from dqn_agent import Agent 7 | from utils import ensure_dir 8 | import matplotlib.pyplot as plt 9 | from unityagents import UnityEnvironment 10 | 11 | plt.ion() 12 | 13 | 14 | def dqn(agent, 15 | brain_name, 16 | config, 17 | n_episodes, 18 | max_timesteps_per_ep, 19 | eps_start, 20 | eps_end, 21 | eps_decay 22 | ): 23 | 24 | """ 25 | Deep Q-Learning 26 | """ 27 | logger = logging.getLogger('dqn') # logger 28 | flag = False # When environment is technically solved 29 | # Save path 30 | save_path = config["trainer"]["save_dir"] + config["exp_name"] + "/" 31 | ensure_dir(save_path) 32 | scores = [] # list containing scores from each episodes 33 | scores_window = deque(maxlen=100) 34 | epsilon = eps_start # init epsilon 35 | 36 | for i_episode in range(1, n_episodes + 1): 37 | # reset the environment 38 | env_info = env.reset(train_mode=True)[brain_name] 39 | # get the current state 40 | state = env_info.vector_observations[0] 41 | score = 0 42 | for t in range(max_timesteps_per_ep): 43 | # choose action based on epsilon-greedy policy 44 | action = agent.act(state, epsilon) 45 | # send the action to the environment 46 | env_info = env.step(action)[brain_name] 47 | # get the next state 48 | next_state = env_info.vector_observations[0] 49 | # get the reward 50 | reward = env_info.rewards[0] 51 | # see if episode has finished 52 | done = env_info.local_done[0] 53 | # step 54 | agent.step(state, action, reward, next_state, done) 55 | # cumulative rewards into score variable 56 | score += reward 57 | # get next_state and set it to state 58 | state = next_state 59 | 60 | if done: 61 | break 62 | 63 | # Update epsilon 64 | epsilon = max(eps_decay*epsilon, eps_end) 65 | 66 | # save most recent score 67 | scores.append(score) 68 | scores_window.append(score) 69 | 70 | logger.info('\rEpisode {}\tAverage Score: {:.3f}'.format(i_episode, np.mean(scores_window))) 71 | 72 | if (i_episode % 100 == 0): 73 | logger.info("\rEpisode {}\tAverage 
Score: {:.3f}".format(i_episode, \ 74 | np.mean(scores_window))) 75 | 76 | # Save occasionnally 77 | if (i_episode % config["trainer"]["save_freq"] == 0): 78 | 79 | torch.save(agent.qnetwork_local.state_dict(), save_path + 80 | config["trainer"]["save_trained_name"] + "_" + str(i_episode) + ".pth") 81 | 82 | # Check if environment solved (if not already) 83 | if not flag: 84 | if (np.mean(scores_window) >= 13.0): 85 | logger.info('\nEnvironment solved in {:d} episodes!\tAverage Score: {:.3f}'.format( 86 | i_episode-100, np.mean(scores_window))) 87 | # Save solved model 88 | torch.save(agent.qnetwork_local.state_dict(), save_path + 89 | config["trainer"]["save_trained_name"] + "_solved.pth") 90 | flag = True 91 | 92 | return scores 93 | 94 | if __name__ == '__main__': 95 | # Configure logging for all loggers 96 | logging.basicConfig(level=logging.INFO, format='') 97 | 98 | # Load config file 99 | with open("config.json", "r") as f: 100 | config = json.load(f) 101 | 102 | # Start the environment 103 | env = UnityEnvironment(file_name="./Banana_Windows_x86_64/Banana.exe") 104 | 105 | # get the default brain 106 | brain_name = env.brain_names[0] 107 | brain = env.brains[brain_name] 108 | 109 | # Create agent 110 | agent = Agent(state_size=37, action_size=4, config=config) 111 | 112 | # Train the agent 113 | scores = dqn(agent=agent, 114 | brain_name=brain_name, 115 | config=config, 116 | n_episodes=config["trainer"]["num_episodes"], 117 | max_timesteps_per_ep=config["trainer"]["max_timesteps_per_ep"], 118 | eps_start=config["GLIE"]["eps_start"], 119 | eps_end=config["GLIE"]["eps_end"], 120 | eps_decay=config["GLIE"]["eps_decay"] 121 | ) 122 | 123 | # Close the environment 124 | env.close() -------------------------------------------------------------------------------- /p1_navigation/model.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import torch.nn as nn 3 | import torch.nn.functional as F 4 | 5 | 6 | class 
QNetwork(nn.Module): 7 | """ Policy model that maps state to actions """ 8 | 9 | def __init__(self, state_size, action_size, config): 10 | """ Initialize parameters and build model """ 11 | super().__init__() 12 | 13 | 14 | # self.fc1 = nn.Linear(state_size, fc1_units) 15 | # self.fc2 = nn.Linear(fc1_units, fc2_units) 16 | # self.fc3 = nn.Linear(fc2_units, action_size) 17 | 18 | self.config = config 19 | # Retrieve variable from config file 20 | hidden_layers_units = config["architecture"]["hidden_layers_units"] 21 | dropout_proba = config["architecture"]["dropout_proba"] 22 | 23 | # Add the first layer 24 | self.layers = nn.ModuleList([nn.Linear(state_size, hidden_layers_units[0])]) 25 | 26 | # Add a variable number of more hidden layers 27 | layer_sizes = zip(hidden_layers_units[:-1], hidden_layers_units[1:]) 28 | self.layers.extend([nn.Linear(h1, h2) for h1, h2 in layer_sizes]) 29 | 30 | # Add last layer 31 | self.output = nn.Linear(hidden_layers_units[-1], action_size) 32 | 33 | # Dropout 34 | self.dropout = nn.Dropout(p=dropout_proba) 35 | 36 | def forward(self, x): 37 | """ Forward pass """ 38 | # x = F.relu(self.fc1(state)) 39 | # x = F.relu(self.fc2(x)) 40 | # x = self.fc3(x) 41 | 42 | for layer in self.layers: 43 | x = F.relu(layer(x)) 44 | if self.config["architecture"]["use_dropout"]: 45 | x = self.dropout(x) 46 | 47 | x = self.output(x) 48 | 49 | return x 50 | 51 | if __name__ == '__main__': 52 | import json 53 | with open("config.json", "r") as f: 54 | config = json.load(f) 55 | 56 | net = QNetwork(state_size=37, action_size=4, config=config) 57 | print("net:", net) 58 | -------------------------------------------------------------------------------- /p1_navigation/report.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/vmelan/DRLND-udacity/8d2a38d2894b0f692fc6ffbe652eb94440b516de/p1_navigation/report.pdf 
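The way `QNetwork.__init__` in `model.py` above chains a variable number of hidden layers can be sketched without PyTorch. The helper below is a hypothetical illustration, not part of the repository: `layer_shapes` is an invented name, and the hidden sizes `[64, 32]` are assumed values, since the project's `config.json` is not shown here. It only lists the `(in_features, out_features)` pair each `nn.Linear` would receive.

```python
def layer_shapes(state_size, hidden_layers_units, action_size):
    """Return the (in_features, out_features) pair of every linear layer, input to output."""
    # Full chain of widths: input, each hidden layer, output
    sizes = [state_size] + list(hidden_layers_units) + [action_size]
    # zip pairs every width with its successor, as model.py does for the hidden layers
    return list(zip(sizes[:-1], sizes[1:]))


# state_size=37 and action_size=4 come from main.py; [64, 32] is an assumed config value
shapes = layer_shapes(37, [64, 32], 4)
print(shapes)  # [(37, 64), (64, 32), (32, 4)]
```

With a single hidden size the inner `zip` in `model.py` yields nothing, so the network reduces to the first layer followed by the output layer.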
-------------------------------------------------------------------------------- /p1_navigation/requirements.txt: -------------------------------------------------------------------------------- 1 | matplotlib 2 | numpy>=1.11.0 3 | jupyter 4 | unityagents==0.4.0 5 | torch==0.4.0 6 | ipykernel 7 | -------------------------------------------------------------------------------- /p1_navigation/saved/DQN_exp/model_trained_solved.pth: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/vmelan/DRLND-udacity/8d2a38d2894b0f692fc6ffbe652eb94440b516de/p1_navigation/saved/DQN_exp/model_trained_solved.pth -------------------------------------------------------------------------------- /p1_navigation/utils.py: -------------------------------------------------------------------------------- 1 | import os 2 | import torch 3 | 4 | 5 | def pick_device(config, logger): 6 | """ Pick the device (cpu or gpu) used for training """ 7 | if config["cuda"] and not torch.cuda.is_available(): 8 | logger.warning("Warning: There's no CUDA support on this machine, " 9 | "training is performed on cpu.") 10 | device = torch.device("cpu") 11 | elif not config["cuda"] and torch.cuda.is_available(): 12 | logger.info("Training is performed on cpu by user's choice") 13 | device = torch.device("cpu") 14 | elif not config["cuda"] and not torch.cuda.is_available(): 15 | logger.info("Training on cpu") 16 | device = torch.device("cpu") 17 | else: 18 | logger.info("Training on gpu") 19 | device = torch.device("cuda:" + str(config["gpu"])) 20 | 21 | return device 22 | 23 | def ensure_dir(path): 24 | if not os.path.exists(path): 25 | os.makedirs(path) -------------------------------------------------------------------------------- /p2_continuous_control/README.md: -------------------------------------------------------------------------------- 1 | # Project : Continuous Control 2 | 3 | ## Description 4 | For this project, we train a double-jointed arm agent to follow a target
location. 5 | 6 |

7 | 8 |

9 | 10 | ## Problem Statement 11 | A reward of +0.1 is provided for each step that the agent's hand is in the goal location. 12 | Thus, the goal of the agent is to maintain its position at 13 | the target location for as many 14 | steps as possible. 15 | 16 | The observation space consists of 33 variables corresponding to position, 17 | rotation, velocity, and angular velocity of the arm. 18 | Each action is a vector with four numbers, corresponding to the torques 19 | applied to the two joints. Every 20 | entry in the action vector should be a number between -1 and 1. 21 | 22 | The task is episodic, with 1000 timesteps per episode. In order to solve 23 | the environment, the agent must get an average score of +30 over 100 consecutive 24 | episodes. 25 | 26 | ## Files 27 | - `Continuous_Control.ipynb`: Notebook used to control and train the agent 28 | - `ddpg_agent.py`: Creates an Agent class that interacts with and learns from the environment 29 | - `models.py`: Actor and Critic classes 30 | - `config.json`: Configuration file to store variables and paths 31 | - `utils.py`: Helper functions 32 | - `report.pdf`: Technical report 33 | 34 | ## Dependencies 35 | To run this code, you need an environment with Python 3. 36 | The dependencies are listed in the `requirements.txt` file and can be installed 37 | with the following command: 38 | ``` 39 | pip install -r requirements.txt 40 | ``` 41 | 42 | Furthermore, you need to download the environment from one of the links below. 
You only need to select 43 | the environment that matches your operating system: 44 | - Linux : [link](https://s3-us-west-1.amazonaws.com/udacity-drlnd/P2/Reacher/one_agent/Reacher_Linux.zip) 45 | - MAC OSX : [link](https://s3-us-west-1.amazonaws.com/udacity-drlnd/P2/Reacher/Reacher.app.zip) 46 | - Windows : [link](https://s3-us-west-1.amazonaws.com/udacity-drlnd/P2/Reacher/Reacher_Windows_x86_64.zip) 47 | 48 | ## Running 49 | Run the cells in the notebook `Continuous_Control.ipynb` to train an agent that solves our required 50 | task of moving the double-jointed arm. -------------------------------------------------------------------------------- /p2_continuous_control/config.json: -------------------------------------------------------------------------------- 1 | { 2 | "exp_name": "DDPG_exp", 3 | "cuda": true, 4 | "gpu": 0, 5 | 6 | "optimizer_actor": { 7 | "optimizer_type": "Adam", 8 | "betas": [0.9, 0.999], 9 | "optimizer_params": { 10 | "lr": 1e-4, 11 | "eps": 1e-7, 12 | "weight_decay": 0 13 | } 14 | }, 15 | 16 | "optimizer_critic": { 17 | "optimizer_type": "Adam", 18 | "betas": [0.9, 0.999], 19 | "optimizer_params": { 20 | "lr": 1e-4, 21 | "eps": 1e-7, 22 | "weight_decay": 0 23 | } 24 | }, 25 | 26 | "DDPG": { 27 | "gamma": 0.99, 28 | "tau": 0.001, 29 | "buffer_size": 10e6 30 | }, 31 | 32 | "architecture": { 33 | "fc1_units": 250, 34 | "fc2_units": 100 35 | }, 36 | 37 | "trainer" : { 38 | "num_episodes": 500, 39 | "batch_size": 128, 40 | "save_dir": "./saved/", 41 | "save_freq": 200 42 | } 43 | } -------------------------------------------------------------------------------- /p2_continuous_control/ddpg_agent.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import random 3 | import copy 4 | import logging 5 | from collections import namedtuple, deque 6 | from models import Actor, Critic 7 | import torch 8 | import torch.nn.functional as F 9 | import torch.optim as optim 10 | from utils import
pick_device 11 | 12 | 13 | class Agent(): 14 | """ Agent that interacts with and learns from the environment """ 15 | 16 | def __init__(self, state_size, action_size, config): 17 | """ Initialize an agent object """ 18 | 19 | self.state_size = state_size 20 | self.action_size = action_size 21 | self.config = config 22 | 23 | # logging for this class 24 | self.logger = logging.getLogger(self.__class__.__name__) 25 | 26 | # gpu support 27 | self.device = pick_device(config, self.logger) 28 | 29 | ## Actor local and target networks 30 | self.actor_local = Actor(state_size, action_size, config).to(self.device) 31 | self.actor_target = Actor(state_size, action_size, config).to(self.device) 32 | self.actor_optimizer = getattr(optim, config["optimizer_actor"]["optimizer_type"])( 33 | self.actor_local.parameters(), 34 | betas=tuple(config["optimizer_actor"]["betas"]), 35 | **config["optimizer_actor"]["optimizer_params"]) 36 | 37 | ## Critic local and target networks 38 | self.critic_local = Critic(state_size, action_size, config).to(self.device) 39 | self.critic_target = Critic(state_size, action_size, config).to(self.device) 40 | self.critic_optimizer = getattr(optim, config["optimizer_critic"]["optimizer_type"])( 41 | self.critic_local.parameters(), 42 | betas=tuple(config["optimizer_critic"]["betas"]), 43 | **config["optimizer_critic"]["optimizer_params"]) 44 | 45 | ## Noise process 46 | self.noise = OUNoise(action_size) 47 | 48 | ## Replay memory 49 | self.memory = ReplayBuffer( 50 | config=config, 51 | action_size=action_size, 52 | buffer_size=int(config["DDPG"]["buffer_size"]), 53 | batch_size=config["trainer"]["batch_size"] 54 | ) 55 | 56 | 57 | def step(self, state, action, reward, next_state, done): 58 | """ Save experience in replay memory, 59 | and use random sample from buffer to learn """ 60 | 61 | # Save experience in replay memory 62 | self.memory.add(state, action, reward, next_state, done) 63 | 64 | # learn every timestep as long as enough samples
are available in memory 65 | if len(self.memory) > self.config["trainer"]["batch_size"]: 66 | experiences = self.memory.sample() 67 | self.learn(experiences, self.config["DDPG"]["gamma"]) 68 | 69 | 70 | def act(self, state): 71 | """ Returns actions for given state as per current policy """ 72 | 73 | # Convert state to tensor 74 | state = torch.from_numpy(state).float().to(self.device) 75 | 76 | ## Evaluation mode 77 | self.actor_local.eval() 78 | with torch.no_grad(): 79 | # Forward pass of local actor network, converted to a numpy array 80 | action_values = self.actor_local.forward(state).cpu().data.numpy() 81 | 82 | ## Training mode 83 | self.actor_local.train() 84 | # Add noise to the actor's output to improve exploration 85 | action_values += self.noise.sample() 86 | # Clip action to stay in the range [-1, 1] for our task 87 | action_values = np.clip(action_values, -1, 1) 88 | 89 | return action_values 90 | 91 | 92 | def learn(self, experiences, gamma): 93 | """ Update policy and value parameters using given batch of experience tuples """ 94 | 95 | states, actions, rewards, next_states, dones = experiences 96 | 97 | ## Update actor (policy) network using the sampled policy gradient 98 | # Compute actor loss 99 | actions_pred = self.actor_local.forward(states) 100 | actor_loss = -self.critic_local.forward(states, actions_pred).mean() 101 | # Minimize the loss 102 | self.actor_optimizer.zero_grad() 103 | actor_loss.backward() 104 | self.actor_optimizer.step() 105 | 106 | ## Update critic (value) network 107 | # Get predicted next-state actions and Q-values from target models 108 | actions_next = self.actor_target.forward(next_states) 109 | Q_targets_next = self.critic_target.forward(next_states, actions_next) 110 | # Compute Q-targets for current states 111 | Q_targets = rewards + (gamma * Q_targets_next * (1 - dones)) 112 | # Get expected Q-values from local critic model 113 | Q_expected = self.critic_local.forward(states, actions) 114 | # Compute loss 115 | critic_loss = F.mse_loss(Q_expected, Q_targets) 116 | # Minimize the
loss 117 | self.critic_optimizer.zero_grad() 118 | critic_loss.backward() 119 | self.critic_optimizer.step() 120 | 121 | 122 | ## Update target networks with a soft update 123 | self.soft_update(self.actor_local, self.actor_target, self.config["DDPG"]["tau"]) 124 | self.soft_update(self.critic_local, self.critic_target, self.config["DDPG"]["tau"]) 125 | 126 | 127 | def soft_update(self, local_model, target_model, tau): 128 | """ Soft update model parameters, 129 | improves the stability of learning """ 130 | 131 | for target_param, local_param in zip(target_model.parameters(), local_model.parameters()): 132 | target_param.data.copy_(tau*local_param.data + (1.0 - tau)*target_param.data) 133 | 134 | 135 | 136 | class OUNoise(): 137 | """ Ornstein-Uhlenbeck process """ 138 | 139 | def __init__(self, size, mu=0.0, theta=0.15, sigma=0.2): 140 | """ Initialize parameters and noise process """ 141 | self.mu = mu * np.ones(size) 142 | self.theta = theta 143 | self.sigma = sigma 144 | self.reset() 145 | 146 | def reset(self): 147 | """ Reset the internal state (= noise) to mean (mu). 
""" 148 | self.state = copy.copy(self.mu) 149 | 150 | def sample(self): 151 | """ Update internal state and return it as a noise sample """ 152 | x = self.state 153 | dx = self.theta * (self.mu - x) + self.sigma * np.random.standard_normal(x.shape) # zero-mean Gaussian noise 154 | self.state = x + dx 155 | 156 | return self.state 157 | 158 | 159 | 160 | 161 | class ReplayBuffer(): 162 | """ Fixed-size buffer to store experience tuples """ 163 | 164 | def __init__(self, config, action_size, buffer_size, batch_size): 165 | """ Initialize a ReplayBuffer object """ 166 | 167 | self.config = config 168 | self.action_size = action_size 169 | self.memory = deque(maxlen=buffer_size) 170 | self.batch_size = batch_size 171 | self.experience = namedtuple("Experience", 172 | field_names=["state", "action", "reward", "next_state", "done"]) 173 | 174 | # logging for this class 175 | self.logger = logging.getLogger(self.__class__.__name__) 176 | 177 | # gpu support 178 | self.device = pick_device(config, self.logger) 179 | 180 | 181 | def add(self, state, action, reward, next_state, done): 182 | """ Add a new experience to memory """ 183 | e = self.experience(state, action, reward, next_state, done) 184 | self.memory.append(e) 185 | 186 | 187 | def sample(self): 188 | """ Randomly sample a batch of experiences from memory """ 189 | experiences = random.sample(self.memory, k=self.batch_size) 190 | 191 | states = torch.from_numpy( 192 | np.vstack([e.state for e in experiences if e is not None]) 193 | ).float().to(self.device) 194 | actions = torch.from_numpy( 195 | np.vstack([e.action for e in experiences if e is not None]) 196 | ).float().to(self.device) 197 | rewards = torch.from_numpy( 198 | np.vstack([e.reward for e in experiences if e is not None]) 199 | ).float().to(self.device) 200 | next_states = torch.from_numpy( 201 | np.vstack([e.next_state for e in experiences if e is not None]) 202 | ).float().to(self.device) 203 | dones = torch.from_numpy( 204 | np.vstack([e.done for e in
experiences if e is not None]).astype(np.uint8) 205 | ).float().to(self.device) 206 | 207 | return (states, actions, rewards, next_states, dones) 208 | 209 | 210 | def __len__(self): 211 | """ Return the current size of internal memory """ 212 | return len(self.memory) -------------------------------------------------------------------------------- /p2_continuous_control/images/reacher_gif.gif: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/vmelan/DRLND-udacity/8d2a38d2894b0f692fc6ffbe652eb94440b516de/p2_continuous_control/images/reacher_gif.gif -------------------------------------------------------------------------------- /p2_continuous_control/models.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import torch 3 | import torch.nn as nn 4 | import torch.nn.functional as F 5 | 6 | 7 | class Actor(nn.Module): 8 | """ Actor (Policy) model """ 9 | 10 | def __init__(self, state_size, action_size, config): 11 | """ Initialize parameters and build model """ 12 | 13 | super(Actor, self).__init__() 14 | fc1_units = config["architecture"]["fc1_units"] 15 | fc2_units = config["architecture"]["fc2_units"] 16 | 17 | self.fc1 = nn.Linear(in_features=state_size, out_features=fc1_units) 18 | self.fc2 = nn.Linear(in_features=fc1_units, out_features=fc2_units) 19 | self.fc3 = nn.Linear(in_features=fc2_units, out_features=action_size) 20 | # weights initialization 21 | for m in self.modules(): 22 | if isinstance(m, nn.Linear): 23 | # FC layers have weights initialized with Glorot (in place) 24 | nn.init.xavier_uniform_(m.weight, gain=1) 25 | 26 | def forward(self, state): 27 | """ Build an actor (policy) network that maps states to actions """ 28 | x = F.relu(self.fc1(state)) 29 | x = F.relu(self.fc2(x)) 30 | x = torch.tanh(self.fc3(x)) # outputs are in the range [-1, 1] 31 | 32 | return x 33 | 34 | 35 | class Critic(nn.Module): 36 | """ Critic (Value) Model """ 37 | 38 | def __init__(self, state_size, action_size, config): 39 | 
""" Initialize parameters and build model """ 40 | super(Critic, self).__init__() 41 | 42 | fc1_units = config["architecture"]["fc1_units"] 43 | fc2_units = config["architecture"]["fc2_units"] 44 | 45 | self.fc1 = nn.Linear(in_features=state_size, out_features=fc1_units) 46 | self.fc2 = nn.Linear(in_features=fc1_units + action_size, 47 | out_features=fc2_units) 48 | self.fc3 = nn.Linear(in_features=fc2_units, out_features=1) 49 | 50 | # weights initialization 51 | for m in self.modules(): 52 | if isinstance(m, nn.Linear): 53 | # FC layers have weights initialized with Glorot (in place) 54 | nn.init.xavier_uniform_(m.weight, gain=1) 55 | 56 | def forward(self, state, action): 57 | """ Build a critic (value) network that maps 58 | (state, action) pairs -> Q-values """ 59 | x = F.relu(self.fc1(state)) 60 | x = F.relu(self.fc2(torch.cat([x, action], dim=1))) # add action too for the mapping 61 | x = F.relu(self.fc3(x)) 62 | 63 | return x 64 | 65 | -------------------------------------------------------------------------------- /p2_continuous_control/report.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/vmelan/DRLND-udacity/8d2a38d2894b0f692fc6ffbe652eb94440b516de/p2_continuous_control/report.pdf -------------------------------------------------------------------------------- /p2_continuous_control/requirements.txt: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/vmelan/DRLND-udacity/8d2a38d2894b0f692fc6ffbe652eb94440b516de/p2_continuous_control/requirements.txt -------------------------------------------------------------------------------- /p2_continuous_control/saved/DDPG_exp/checkpoint_actor_solved.pth: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/vmelan/DRLND-udacity/8d2a38d2894b0f692fc6ffbe652eb94440b516de/p2_continuous_control/saved/DDPG_exp/checkpoint_actor_solved.pth
-------------------------------------------------------------------------------- /p2_continuous_control/saved/DDPG_exp/checkpoint_critic_solved.pth: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/vmelan/DRLND-udacity/8d2a38d2894b0f692fc6ffbe652eb94440b516de/p2_continuous_control/saved/DDPG_exp/checkpoint_critic_solved.pth -------------------------------------------------------------------------------- /p3_collab_compet/DDPGAgents.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import logging 3 | from models import Actor, Critic 4 | from ReplayBuffer import ReplayBuffer 5 | from OUNoise import OUNoise 6 | import torch 7 | import torch.nn.functional as F 8 | import torch.optim as optim 9 | from utils import pick_device 10 | 11 | import pdb 12 | 13 | class DDPGAgents(): 14 | """ Agent used to interact with and learns from the environment """ 15 | 16 | def __init__(self, state_size, action_size, config): 17 | """ Initialize an agent object """ 18 | 19 | self.state_size = state_size 20 | self.action_size = action_size 21 | self.config = config 22 | 23 | # retrieve number of agents 24 | self.num_agents = config["DDPG"]["num_agents"] 25 | 26 | # logging for this class 27 | self.logger = logging.getLogger(self.__class__.__name__) 28 | 29 | # gpu support 30 | self.device = pick_device(config, self.logger) 31 | 32 | ## Actor local and target networks 33 | self.actor_local = Actor(state_size, action_size, config).to(self.device) 34 | self.actor_target = Actor(state_size, action_size, config).to(self.device) 35 | self.actor_optimizer = getattr(optim, config["optimizer_actor"]["optimizer_type"])( 36 | self.actor_local.parameters(), 37 | betas=tuple(config["optimizer_actor"]["betas"]), 38 | **config["optimizer_actor"]["optimizer_params"]) 39 | 40 | ## Critic local and target networks 41 | self.critic_local = Critic(state_size, action_size, 
config).to(self.device) 42 | self.critic_target = Critic(state_size, action_size, config).to(self.device) 43 | self.critic_optimizer = getattr(optim, config["optimizer_critic"]["optimizer_type"])( 44 | self.critic_local.parameters(), 45 | betas=tuple(config["optimizer_critic"]["betas"]), 46 | **config["optimizer_critic"]["optimizer_params"]) 47 | 48 | ## Noise process 49 | self.noise = OUNoise((self.num_agents, action_size)) 50 | 51 | ## Replay memory 52 | self.memory = ReplayBuffer( 53 | config=config, 54 | action_size=action_size, 55 | buffer_size=int(config["DDPG"]["buffer_size"]), 56 | batch_size=config["trainer"]["batch_size"] 57 | ) 58 | 59 | 60 | def step(self, state, action, reward, next_state, done): 61 | """ Save experience in replay memory, 62 | and use random sample from buffer to learn """ 63 | 64 | # Save experience in replay memory shared by all agents 65 | for agent in range(self.num_agents): 66 | self.memory.add(state[agent, :], 67 | action[agent, :], 68 | reward[agent], 69 | next_state[agent, :], 70 | done[agent] 71 | ) 72 | 73 | # learn every timestep as long as enough samples are available in memory 74 | if len(self.memory) > self.config["trainer"]["batch_size"]: 75 | experiences = self.memory.sample() 76 | self.learn(experiences, self.config["DDPG"]["gamma"]) 77 | 78 | 79 | def act(self, states, add_noise=False): 80 | """ Returns actions for given state as per current policy """ 81 | 82 | # Convert state to tensor 83 | states = torch.from_numpy(states).float().to(self.device) 84 | 85 | # prepare actions numpy array for all agents 86 | actions = np.zeros((self.num_agents, self.action_size)) 87 | 88 | ## Evaluation mode 89 | self.actor_local.eval() 90 | with torch.no_grad(): 91 | # Forward pass of local actor network 92 | for agent, state in enumerate(states): 93 | action_values = self.actor_local.forward(state).cpu().data.numpy() 94 | actions[agent, :] = action_values 95 | 96 | # pdb.set_trace() 97 | ## Training mode 98 | 
self.actor_local.train() 99 | if add_noise: 100 | # Add noise to improve exploration to our actor policy 101 | # action_values += torch.from_numpy(self.noise.sample()).type(torch.FloatTensor).to(self.device) 102 | actions += self.noise.sample() 103 | # Clip action to stay in the range [-1, 1] for our task 104 | actions = np.clip(actions, -1, 1) 105 | 106 | return actions 107 | 108 | 109 | def learn(self, experiences, gamma): 110 | """ Update value parameters using given batch of experience tuples """ 111 | 112 | states, actions, rewards, next_states, dones = experiences 113 | 114 | ## Update actor (policy) network using the sampled policy gradient 115 | # Compute actor loss 116 | actions_pred = self.actor_local.forward(states) 117 | actor_loss = -self.critic_local.forward(states, actions_pred).mean() 118 | # Minimize the loss 119 | self.actor_optimizer.zero_grad() 120 | actor_loss.backward() 121 | self.actor_optimizer.step() 122 | 123 | ## Update critic (value) network 124 | # Get predicted next-state actions and Q-values from target models 125 | actions_next = self.actor_target.forward(next_states) 126 | Q_targets_next = self.critic_target.forward(next_states, actions_next) 127 | # Compute Q-targets for current states 128 | Q_targets = rewards + (gamma * Q_targets_next * (1 - dones)) 129 | # Get expected Q-values from local critic model 130 | Q_expected = self.critic_local.forward(states, actions) 131 | # Compute loss 132 | critic_loss = F.mse_loss(Q_expected, Q_targets) 133 | # Minimize the loss 134 | self.critic_optimizer.zero_grad() 135 | critic_loss.backward() 136 | self.critic_optimizer.step() 137 | 138 | 139 | ## Update target networks with a soft update 140 | self.soft_update(self.actor_local, self.actor_target, self.config["DDPG"]["tau"]) 141 | self.soft_update(self.critic_local, self.critic_target, self.config["DDPG"]["tau"]) 142 | 143 | 144 | def soft_update(self, local_model, target_model, tau): 145 | """ Soft update model parameters, 146 | improves the 
stability of learning """ 147 | 148 | for target_param, local_param in zip(target_model.parameters(), local_model.parameters()): 149 | target_param.data.copy_(tau*local_param.data + (1.0 - tau)*target_param.data) 150 | 151 | 152 | def reset(self): 153 | """ Reset noise """ 154 | self.noise.reset() 155 | 156 | 157 | 158 | 159 | 160 | 161 | -------------------------------------------------------------------------------- /p3_collab_compet/OUNoise.py: -------------------------------------------------------------------------------- 1 | import random 2 | import numpy as np 3 | import copy 4 | 5 | class OUNoise(): 6 | """ Ornstein-Uhlenbeck process """ 7 | 8 | def __init__(self, size, mu=0.0, theta=0.15, sigma=0.2): 9 | """ Initialize parameters and noise process """ 10 | self.mu = mu * np.ones(size) 11 | self.theta = theta 12 | self.sigma = sigma 13 | self.reset() 14 | 15 | def reset(self): 16 | """ Reset the internal state (= noise) to mean (mu). """ 17 | self.state = copy.copy(self.mu) 18 | 19 | def sample(self): 20 | """ Update internal state and return it as a noise sample """ 21 | x = self.state 22 | dx = self.theta * (self.mu - x) + self.sigma * np.random.standard_normal(x.shape) # self.size was never set, so use the state's shape 23 | self.state = x + dx 24 | 25 | return self.state -------------------------------------------------------------------------------- /p3_collab_compet/README.md: -------------------------------------------------------------------------------- 1 | # Project : Collaboration and Competition 2 | 3 | ## Description 4 | For this project, we train a pair of agents to play tennis. 5 | 6 |

7 | 8 |

9 | 10 | ## Problem Statement 11 | A reward of +0.1 is provided to an agent each time it hits the ball over the net. 12 | A reward of -0.01 is provided when an agent lets the ball hit the ground or hits the ball out of bounds. 13 | Thus, the goal of each agent is to keep the ball in play. 14 | 15 | The observation space consists of 24 variables corresponding to the position and velocity of the ball and racket. Each agent receives its own, local observation. Two continuous actions are available, corresponding to movement toward (or away from) the net, and jumping. 16 | 17 | The task is episodic. In order to solve 18 | the environment, one of the agents must get an average score of +0.5 over 100 consecutive 19 | episodes. 20 | 21 | ## Files 22 | - `Tennis.ipynb`: Notebook used to control and train the agent 23 | - `DDPGAgents.py`: Creates a DDPGAgents class that interacts with and learns from the environment 24 | - `ReplayBuffer.py`: Replay Buffer class to store the experiences 25 | - `OUNoise.py`: Ornstein-Uhlenbeck noise for the actor to improve exploration 26 | - `model.py`: Actor and Critic classes 27 | - `config.json`: Configuration file to store variables and paths 28 | - `utils.py`: Helper functions 29 | - `report.pdf`: Technical report 30 | 31 | ## Dependencies 32 | To run this code, you need an environment with Python 3. 33 | The dependencies are listed in the `requirements.txt` file and can be installed 34 | with the following command: 35 | ``` 36 | pip install -r requirements.txt 37 | ``` 38 | 39 | Furthermore, you need to download the environment from one of the links below. 
You only need to select 40 | the environment that matches your operating system: 41 | - Linux : [link](https://s3-us-west-1.amazonaws.com/udacity-drlnd/P3/Tennis/Tennis_Linux.zip) 42 | - MAC OSX : [link](https://s3-us-west-1.amazonaws.com/udacity-drlnd/P3/Tennis/Tennis.app.zip) 43 | - Windows : [link](https://s3-us-west-1.amazonaws.com/udacity-drlnd/P3/Tennis/Tennis_Windows_x86_64.zip) 44 | 45 | ## Running 46 | Run the cells in the notebook `Tennis.ipynb` to train the agents that solve our required 47 | task of keeping the ball in play. -------------------------------------------------------------------------------- /p3_collab_compet/ReplayBuffer.py: -------------------------------------------------------------------------------- 1 | import logging 2 | import torch 3 | from collections import namedtuple, deque 4 | import random 5 | from utils import pick_device 6 | import numpy as np 7 | 8 | class ReplayBuffer(): 9 | """ Fixed-size buffer to store experience tuples """ 10 | 11 | def __init__(self, config, action_size, buffer_size, batch_size): 12 | """ Initialize a ReplayBuffer object """ 13 | 14 | self.config = config 15 | self.action_size = action_size 16 | self.memory = deque(maxlen=buffer_size) 17 | self.batch_size = batch_size 18 | self.experience = namedtuple("Experience", 19 | field_names=["state", "action", "reward", "next_state", "done"]) 20 | 21 | # logging for this class 22 | self.logger = logging.getLogger(self.__class__.__name__) 23 | 24 | # gpu support 25 | self.device = pick_device(config, self.logger) 26 | 27 | 28 | def add(self, state, action, reward, next_state, done): 29 | """ Add a new experience to memory """ 30 | e = self.experience(state, action, reward, next_state, done) 31 | self.memory.append(e) 32 | 33 | 34 | def sample(self): 35 | """ Randomly sample a batch of experiences from memory """ 36 | experiences = random.sample(self.memory, k=self.batch_size) 37 | 38 | states = torch.from_numpy( 39 | np.vstack([e.state for e in experiences
if e is not None]) 40 | ).float().to(self.device) 41 | actions = torch.from_numpy( 42 | np.vstack([e.action for e in experiences if e is not None]) 43 | ).float().to(self.device) 44 | rewards = torch.from_numpy( 45 | np.vstack([e.reward for e in experiences if e is not None]) 46 | ).float().to(self.device) 47 | next_states = torch.from_numpy( 48 | np.vstack([e.next_state for e in experiences if e is not None]) 49 | ).float().to(self.device) 50 | dones = torch.from_numpy( 51 | np.vstack([e.done for e in experiences if e is not None]).astype(np.uint8) 52 | ).float().to(self.device) 53 | 54 | return (states, actions, rewards, next_states, dones) 55 | 56 | 57 | def __len__(self): 58 | """ Return the current size of internal memory """ 59 | return len(self.memory) -------------------------------------------------------------------------------- /p3_collab_compet/Tennis.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Collaboration and Competition\n", 8 | "\n", 9 | "---\n", 10 | "\n", 11 | "You are welcome to use this coding environment to train your agent for the project. Follow the instructions below to get started!\n", 12 | "\n", 13 | "### 1. Start the Environment\n", 14 | "\n", 15 | "Run the next code cell to install a few packages. This line will take a few minutes to run!" 16 | ] 17 | }, 18 | { 19 | "cell_type": "code", 20 | "execution_count": 1, 21 | "metadata": {}, 22 | "outputs": [], 23 | "source": [ 24 | "!pip -q install ./python" 25 | ] 26 | }, 27 | { 28 | "cell_type": "markdown", 29 | "metadata": {}, 30 | "source": [ 31 | "The environment is already saved in the Workspace and can be accessed at the file path provided below. 
" 32 | ] 33 | }, 34 | { 35 | "cell_type": "code", 36 | "execution_count": 2, 37 | "metadata": {}, 38 | "outputs": [ 39 | { 40 | "name": "stderr", 41 | "output_type": "stream", 42 | "text": [ 43 | "INFO:unityagents:\n", 44 | "'Academy' started successfully!\n", 45 | "Unity Academy name: Academy\n", 46 | " Number of Brains: 1\n", 47 | " Number of External Brains : 1\n", 48 | " Lesson number : 0\n", 49 | " Reset Parameters :\n", 50 | "\t\t\n", 51 | "Unity brain name: TennisBrain\n", 52 | " Number of Visual Observations (per agent): 0\n", 53 | " Vector Observation space type: continuous\n", 54 | " Vector Observation space size (per agent): 8\n", 55 | " Number of stacked Vector Observation: 3\n", 56 | " Vector Action space type: continuous\n", 57 | " Vector Action space size (per agent): 2\n", 58 | " Vector Action descriptions: , \n" 59 | ] 60 | } 61 | ], 62 | "source": [ 63 | "from unityagents import UnityEnvironment\n", 64 | "import numpy as np\n", 65 | "\n", 66 | "env = UnityEnvironment(file_name=\"/data/Tennis_Linux_NoVis/Tennis\")" 67 | ] 68 | }, 69 | { 70 | "cell_type": "markdown", 71 | "metadata": {}, 72 | "source": [ 73 | "Environments contain **_brains_** which are responsible for deciding the actions of their associated agents. Here we check for the first brain available, and set it as the default brain we will be controlling from Python." 
74 | ] 75 | }, 76 | { 77 | "cell_type": "code", 78 | "execution_count": 3, 79 | "metadata": {}, 80 | "outputs": [], 81 | "source": [ 82 | "# get the default brain\n", 83 | "brain_name = env.brain_names[0]\n", 84 | "brain = env.brains[brain_name]" 85 | ] 86 | }, 87 | { 88 | "cell_type": "code", 89 | "execution_count": 4, 90 | "metadata": {}, 91 | "outputs": [ 92 | { 93 | "data": { 94 | "text/plain": [ 95 | "('TennisBrain', )" 96 | ] 97 | }, 98 | "execution_count": 4, 99 | "metadata": {}, 100 | "output_type": "execute_result" 101 | } 102 | ], 103 | "source": [ 104 | "brain_name, brain" 105 | ] 106 | }, 107 | { 108 | "cell_type": "markdown", 109 | "metadata": {}, 110 | "source": [ 111 | "### 2. Examine the State and Action Spaces\n", 112 | "\n", 113 | "Run the code cell below to print some information about the environment." 114 | ] 115 | }, 116 | { 117 | "cell_type": "code", 118 | "execution_count": 5, 119 | "metadata": {}, 120 | "outputs": [ 121 | { 122 | "name": "stdout", 123 | "output_type": "stream", 124 | "text": [ 125 | "Number of agents: 2\n", 126 | "Size of each action: 2\n", 127 | "There are 2 agents. Each observes a state with length: 24\n", 128 | "The state for the first agent looks like: \n", 129 | " [ 0. 0. 0. 0. 0. 0. 0.\n", 130 | " 0. 0. 0. 0. 0. 0. 0.\n", 131 | " 0. 0. -6.65278625 -1.5 -0. 0.\n", 132 | " 6.83172083 6. -0. 0. ]\n" 133 | ] 134 | } 135 | ], 136 | "source": [ 137 | "# reset the environment\n", 138 | "env_info = env.reset(train_mode=True)[brain_name]\n", 139 | "\n", 140 | "# number of agents \n", 141 | "num_agents = len(env_info.agents)\n", 142 | "print('Number of agents:', num_agents)\n", 143 | "\n", 144 | "# size of each action\n", 145 | "action_size = brain.vector_action_space_size\n", 146 | "print('Size of each action:', action_size)\n", 147 | "\n", 148 | "# examine the state space \n", 149 | "states = env_info.vector_observations\n", 150 | "state_size = states.shape[1]\n", 151 | "print('There are {} agents. 
Each observes a state with length: {}'.format(states.shape[0], state_size))\n", 152 | "print('The state for the first agent looks like: \\n', states[0])" 153 | ] 154 | }, 155 | { 156 | "cell_type": "markdown", 157 | "metadata": {}, 158 | "source": [ 159 | "### 3. Take Random Actions in the Environment\n", 160 | "\n", 161 | "In the next code cell, you will learn how to use the Python API to control the agent and receive feedback from the environment.\n", 162 | "\n", 163 | "Note that **in this coding environment, you will not be able to watch the agents while they are training**, and you should set `train_mode=True` to restart the environment." 164 | ] 165 | }, 166 | { 167 | "cell_type": "code", 168 | "execution_count": 6, 169 | "metadata": { 170 | "scrolled": false 171 | }, 172 | "outputs": [], 173 | "source": [ 174 | "# for i in range(5): # play game for 5 episodes\n", 175 | "# env_info = env.reset(train_mode=False)[brain_name] # reset the environment \n", 176 | "# states = env_info.vector_observations # get the current state (for each agent)\n", 177 | "# scores = np.zeros(num_agents) # initialize the score (for each agent)\n", 178 | "# while True:\n", 179 | "# actions = np.random.randn(num_agents, action_size) # select an action (for each agent)\n", 180 | "# actions = np.clip(actions, -1, 1) # all actions between -1 and 1\n", 181 | "# env_info = env.step(actions)[brain_name] # send all actions to the environment\n", 182 | "# next_states = env_info.vector_observations # get next state (for each agent)\n", 183 | "# rewards = env_info.rewards # get reward (for each agent)\n", 184 | "# dones = env_info.local_done # see if episode finished\n", 185 | "# scores += env_info.rewards # update the score (for each agent)\n", 186 | "# states = next_states # roll over states to next time step\n", 187 | "# if np.any(dones): # exit loop if episode finished\n", 188 | "# break\n", 189 | "# print('Total score (averaged over agents) this episode: {}'.format(np.mean(scores)))" 190 | ]
191 | }, 192 | { 193 | "cell_type": "markdown", 194 | "metadata": {}, 195 | "source": [ 196 | "When finished, you can close the environment." 197 | ] 198 | }, 199 | { 200 | "cell_type": "code", 201 | "execution_count": 7, 202 | "metadata": {}, 203 | "outputs": [], 204 | "source": [ 205 | "# env.close()" 206 | ] 207 | }, 208 | { 209 | "cell_type": "markdown", 210 | "metadata": {}, 211 | "source": [ 212 | "### 4. It's Your Turn!\n", 213 | "\n", 214 | "Now it's your turn to train your own agent to solve the environment! A few **important notes**:\n", 215 | "- When training the environment, set `train_mode=True`, so that the line for resetting the environment looks like the following:\n", 216 | "```python\n", 217 | "env_info = env.reset(train_mode=True)[brain_name]\n", 218 | "```\n", 219 | "- To structure your work, you're welcome to work directly in this Jupyter notebook, or you might like to start over with a new file! You can see the list of files in the workspace by clicking on **_Jupyter_** in the top left corner of the notebook.\n", 220 | "- In this coding environment, you will not be able to watch the agents while they are training. However, **_after training the agents_**, you can download the saved model weights to watch the agents on your own machine! 
" 221 | ] 222 | }, 223 | { 224 | "cell_type": "code", 225 | "execution_count": 8, 226 | "metadata": {}, 227 | "outputs": [], 228 | "source": [ 229 | "## code to keep session awake in Udacity workspace\n", 230 | "import signal\n", 231 | "\n", 232 | "from contextlib import contextmanager\n", 233 | "\n", 234 | "import requests\n", 235 | "\n", 236 | "\n", 237 | "DELAY = INTERVAL = 4 * 60 # interval time in seconds\n", 238 | "MIN_DELAY = MIN_INTERVAL = 2 * 60\n", 239 | "KEEPALIVE_URL = \"https://nebula.udacity.com/api/v1/remote/keep-alive\"\n", 240 | "TOKEN_URL = \"http://metadata.google.internal/computeMetadata/v1/instance/attributes/keep_alive_token\"\n", 241 | "TOKEN_HEADERS = {\"Metadata-Flavor\":\"Google\"}\n", 242 | "\n", 243 | "\n", 244 | "def _request_handler(headers):\n", 245 | " def _handler(signum, frame):\n", 246 | " requests.request(\"POST\", KEEPALIVE_URL, headers=headers)\n", 247 | " return _handler\n", 248 | "\n", 249 | "\n", 250 | "@contextmanager\n", 251 | "def active_session(delay=DELAY, interval=INTERVAL):\n", 252 | " \"\"\"\n", 253 | " Example:\n", 254 | "\n", 255 | " from workspace_utils import active session\n", 256 | "\n", 257 | " with active_session():\n", 258 | " # do long-running work here\n", 259 | " \"\"\"\n", 260 | " token = requests.request(\"GET\", TOKEN_URL, headers=TOKEN_HEADERS).text\n", 261 | " headers = {'Authorization': \"STAR \" + token}\n", 262 | " delay = max(delay, MIN_DELAY)\n", 263 | " interval = max(interval, MIN_INTERVAL)\n", 264 | " original_handler = signal.getsignal(signal.SIGALRM)\n", 265 | " try:\n", 266 | " signal.signal(signal.SIGALRM, _request_handler(headers))\n", 267 | " signal.setitimer(signal.ITIMER_REAL, delay, interval)\n", 268 | " yield\n", 269 | " finally:\n", 270 | " signal.signal(signal.SIGALRM, original_handler)\n", 271 | " signal.setitimer(signal.ITIMER_REAL, 0)\n", 272 | "\n", 273 | "\n", 274 | "def keep_awake(iterable, delay=DELAY, interval=INTERVAL):\n", 275 | " \"\"\"\n", 276 | " Example:\n", 277 | 
"\n", 278 | " from workspace_utils import keep_awake\n", 279 | "\n", 280 | " for i in keep_awake(range(5)):\n", 281 | " # do iteration with lots of work here\n", 282 | " \"\"\"\n", 283 | " with active_session(delay, interval): yield from iterable" 284 | ] 285 | }, 286 | { 287 | "cell_type": "code", 288 | "execution_count": 9, 289 | "metadata": {}, 290 | "outputs": [], 291 | "source": [ 292 | "## Watch changes and reload automatically\n", 293 | "% load_ext autoreload\n", 294 | "% autoreload 2" 295 | ] 296 | }, 297 | { 298 | "cell_type": "code", 299 | "execution_count": 10, 300 | "metadata": {}, 301 | "outputs": [], 302 | "source": [ 303 | "import pdb\n", 304 | "import json\n", 305 | "import numpy as np \n", 306 | "import torch \n", 307 | "from collections import deque\n", 308 | "from DDPGAgents import DDPGAgents\n", 309 | "from utils import ensure_dir\n", 310 | "import matplotlib.pyplot as plt\n", 311 | "\n", 312 | "import logging\n", 313 | "logging.basicConfig(level=logging.INFO, format='')\n", 314 | "\n", 315 | "with open(\"config.json\", \"r\") as f: \n", 316 | " config = json.load(f)" 317 | ] 318 | }, 319 | { 320 | "cell_type": "code", 321 | "execution_count": 11, 322 | "metadata": {}, 323 | "outputs": [ 324 | { 325 | "name": "stderr", 326 | "output_type": "stream", 327 | "text": [ 328 | "INFO:DDPGAgents:Training on gpu\n", 329 | "INFO:ReplayBuffer:Training on gpu\n" 330 | ] 331 | }, 332 | { 333 | "name": "stdout", 334 | "output_type": "stream", 335 | "text": [ 336 | "Episode 100\tAverage Score: 0.000\n", 337 | "Episode 200\tAverage Score: 0.000\n", 338 | "Episode 300\tAverage Score: 0.000\n", 339 | "Episode 400\tAverage Score: 0.000\n", 340 | "Episode 500\tAverage Score: 0.000\n", 341 | "Episode 600\tAverage Score: 0.002\n", 342 | "Episode 700\tAverage Score: 0.000\n", 343 | "Episode 800\tAverage Score: 0.000\n", 344 | "Episode 900\tAverage Score: 0.000\n", 345 | "Episode 1000\tAverage Score: 0.000\n", 346 | "Episode 1100\tAverage Score: 0.011\n", 347 | 
"Episode 1200\tAverage Score: 0.019\n", 348 | "Episode 1300\tAverage Score: 0.006\n", 349 | "Episode 1400\tAverage Score: 0.029\n", 350 | "Episode 1500\tAverage Score: 0.060\n", 351 | "Episode 1600\tAverage Score: 0.235\n", 352 | "Episode 1700\tAverage Score: 0.133\n", 353 | "Episode 1800\tAverage Score: 0.168\n", 354 | "Episode 1900\tAverage Score: 0.324\n", 355 | "Episode 1909\tAverage Score: 0.503\n", 356 | "Environment solved in 1809 episodes!\tAverage Score: 0.503\n" 357 | ] 358 | } 359 | ], 360 | "source": [ 361 | "agent = DDPGAgents(state_size=24, action_size=2, config=config)\n", 362 | "brain_name = env.brain_names[0]\n", 363 | "\n", 364 | "def ddpg(agent, \n", 365 | " brain_name, \n", 366 | " config, \n", 367 | " n_episodes=config[\"trainer\"][\"num_episodes\"]\n", 368 | " ):\n", 369 | " \"\"\" Deep Deterministic Policy Gradient \"\"\"\n", 370 | " \n", 371 | " # Set logger for this function\n", 372 | " logger = logging.getLogger(\"ddpg\")\n", 373 | " \n", 374 | " # number of agents\n", 375 | " num_agents = config[\"DDPG\"][\"num_agents\"]\n", 376 | " \n", 377 | " max_t = 1000\n", 378 | " \n", 379 | " flag = False # When environment is technically solved\n", 380 | " # Save path \n", 381 | " save_path = config[\"trainer\"][\"save_dir\"] + config[\"exp_name\"] + \"/\"\n", 382 | " ensure_dir(save_path)\n", 383 | " scores = [] # list containing scores from each episodes \n", 384 | " scores_window = deque(maxlen=100)\n", 385 | " \n", 386 | " for i_episode in keep_awake(range(1, n_episodes + 1)):\n", 387 | " # reset the environment\n", 388 | " env_info = env.reset(train_mode=True)[brain_name]\n", 389 | " \n", 390 | " # reset noise\n", 391 | " agent.reset()\n", 392 | " \n", 393 | " # get the current state\n", 394 | " state = env_info.vector_observations\n", 395 | "\n", 396 | " # score of the agents\n", 397 | " score = np.zeros(num_agents)\n", 398 | " \n", 399 | " for t in range(max_t):\n", 400 | " # choose actions\n", 401 | " action = agent.act(state)\n", 402 | " 
# send the actions to the environment \n", 403 | " env_info = env.step(action)[brain_name]\n", 404 | " # get the next state\n", 405 | " next_state = env_info.vector_observations\n", 406 | " # get the rewards\n", 407 | " rewards = env_info.rewards\n", 408 | " # see if episode has finished\n", 409 | " dones = env_info.local_done\n", 410 | " # step \n", 411 | " agent.step(state, action, rewards, next_state, dones)\n", 412 | " # accumulate rewards into score variable\n", 413 | " score += rewards\n", 414 | " # get next_state and set it to state\n", 415 | " state = next_state\n", 416 | " \n", 417 | " if any(dones): \n", 418 | " break\n", 419 | " \n", 420 | " # save most recent score (max over the agents)\n", 421 | " scores.append(np.max(score))\n", 422 | " scores_window.append(np.max(score))\n", 423 | " \n", 424 | " print('\\rEpisode {}\\tAverage Score: {:.3f}'.format(i_episode, np.mean(scores_window)), end=\"\")\n", 425 | " \n", 426 | " if (i_episode % 100 == 0):\n", 427 | " print(\"\\rEpisode {}\\tAverage Score: {:.3f}\".format(i_episode, \\\n", 428 | " np.mean(scores_window)))\n", 429 | " \n", 430 | " # Save occasionally \n", 431 | " if (i_episode % config[\"trainer\"][\"save_freq\"] == 0):\n", 432 | " torch.save(agent.actor_local.state_dict(), save_path + \n", 433 | " \"checkpoint_actor_\" + str(i_episode) + \".pth\")\n", 434 | " torch.save(agent.critic_local.state_dict(), save_path + \n", 435 | " \"checkpoint_critic_\" + str(i_episode) + \".pth\")\n", 436 | " \n", 437 | " # Check if environment solved \n", 438 | " if not flag:\n", 439 | " if (np.mean(scores_window) >= 0.5):\n", 440 | " print(\"\\nEnvironment solved in {:d} episodes!\\tAverage Score: {:.3f}\".format(\n", 441 | " i_episode-100, np.mean(scores_window)))\n", 442 | " # Save solved model \n", 443 | " torch.save(agent.actor_local.state_dict(), save_path + \n", 444 | " \"checkpoint_actor_solved.pth\")\n", 445 | " torch.save(agent.critic_local.state_dict(), save_path + \n", 446 | "
\"checkpoint_critic_solved.pth\")\n", 447 | " flag = True\n", 448 | " \n", 449 | " break\n", 450 | " \n", 451 | " return scores\n", 452 | " \n", 453 | "scores = ddpg(agent=agent, \n", 454 | " brain_name=brain_name, \n", 455 | " config=config)" 456 | ] 457 | }, 458 | { 459 | "cell_type": "code", 460 | "execution_count": 12, 461 | "metadata": {}, 462 | "outputs": [], 463 | "source": [ 464 | "env.close()" 465 | ] 466 | }, 467 | { 468 | "cell_type": "code", 469 | "execution_count": 13, 470 | "metadata": {}, 471 | "outputs": [ 472 | { 473 | "data": { 474 | "image/png": "\n", 475 | "text/plain": [ 476 | "" 477 | ] 478 | }, 479 | "metadata": { 480 | "needs_background": "light" 481 | }, 482 | "output_type": "display_data" 483 | } 484 | ], 485 | "source": [ 486 | "# plot the scores\n", 487 | "fig = plt.figure(figsize=(16, 12))\n", 488 | "ax = fig.add_subplot(111)\n", 489 | "plt.plot(np.arange(len(scores)), scores)\n", 490 | "plt.xlabel('Episode number')\n", 491 | "plt.ylabel('Score')\n", 492 | "plt.show()" 493 | ] 494 | } 495 | ], 496 | "metadata": { 497 | "kernelspec": { 498 | "display_name": "Python 3", 499 | "language": "python", 500 | "name": "python3" 501 | }, 502 | "language_info": { 503 | "codemirror_mode": { 504 | "name": "ipython", 505 | "version": 3 506 | }, 507 | "file_extension": ".py", 508 | "mimetype": "text/x-python", 509 | "name": "python", 510 | "nbconvert_exporter": "python", 511 | "pygments_lexer": "ipython3", 512 | "version": "3.6.3" 513 | } 514 | }, 515 | "nbformat": 4, 516 | "nbformat_minor": 2 517 | } 518 | -------------------------------------------------------------------------------- /p3_collab_compet/config.json: -------------------------------------------------------------------------------- 1 | { 2 | "exp_name": "DDPGAgents_exp", 3 | "cuda": true, 4 | "gpu": 0, 5 | 6 | "optimizer_actor": { 7 | "optimizer_type": "Adam", 8 | "betas": [0.9, 0.999], 9 | "optimizer_params": { 10 | "lr": 1e-4, 11 | "eps": 1e-7, 12 | "weight_decay": 0 13 | } 14 | }, 15 
| 16 | "optimizer_critic": { 17 | "optimizer_type": "Adam", 18 | "betas": [0.9, 0.999], 19 | "optimizer_params": { 20 | "lr": 1e-3, 21 | "eps": 1e-7, 22 | "weight_decay": 0 23 | } 24 | }, 25 | 26 | "DDPG": { 27 | "num_agents": 2, 28 | "gamma": 0.99, 29 | "tau": 0.001, 30 | "buffer_size": 10e6 31 | }, 32 | 33 | "architecture": { 34 | "fc1_units": 250, 35 | "fc2_units": 100 36 | }, 37 | 38 | "trainer" : { 39 | "num_episodes": 15000, 40 | "batch_size": 128, 41 | "save_dir": "./saved/", 42 | "save_freq": 1000 43 | } 44 | } -------------------------------------------------------------------------------- /p3_collab_compet/images/tennis_gif.gif: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/vmelan/DRLND-udacity/8d2a38d2894b0f692fc6ffbe652eb94440b516de/p3_collab_compet/images/tennis_gif.gif -------------------------------------------------------------------------------- /p3_collab_compet/report.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/vmelan/DRLND-udacity/8d2a38d2894b0f692fc6ffbe652eb94440b516de/p3_collab_compet/report.pdf -------------------------------------------------------------------------------- /p3_collab_compet/requirements.txt: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/vmelan/DRLND-udacity/8d2a38d2894b0f692fc6ffbe652eb94440b516de/p3_collab_compet/requirements.txt -------------------------------------------------------------------------------- /p3_collab_compet/saved/DDPGAgents_exp/checkpoint_actor_solved.pth: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/vmelan/DRLND-udacity/8d2a38d2894b0f692fc6ffbe652eb94440b516de/p3_collab_compet/saved/DDPGAgents_exp/checkpoint_actor_solved.pth -------------------------------------------------------------------------------- 
/p3_collab_compet/saved/DDPGAgents_exp/checkpoint_critic_solved.pth: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/vmelan/DRLND-udacity/8d2a38d2894b0f692fc6ffbe652eb94440b516de/p3_collab_compet/saved/DDPGAgents_exp/checkpoint_critic_solved.pth -------------------------------------------------------------------------------- /p3_collab_compet/utils.py: -------------------------------------------------------------------------------- 1 | import os 2 | import torch 3 | 4 | 5 | def pick_device(config, logger): 6 | """ Pick device """ 7 | if config["cuda"] and not torch.cuda.is_available(): 8 | logger.warning("Warning: There's no CUDA support on this machine, " 9 | "training is performed on cpu.") 10 | device = torch.device("cpu") 11 | elif not config["cuda"] and torch.cuda.is_available(): 12 | logger.info("Training is performed on cpu by user's choice") 13 | device = torch.device("cpu") 14 | elif not config["cuda"] and not torch.cuda.is_available(): 15 | logger.info("Training on cpu") 16 | device = torch.device("cpu") 17 | else: 18 | logger.info("Training on gpu") 19 | device = torch.device("cuda:" + str(config["gpu"])) 20 | 21 | return device 22 | 23 | def ensure_dir(path): 24 | if not os.path.exists(path): 25 | os.makedirs(path) --------------------------------------------------------------------------------
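The four branches of `pick_device` in `utils.py` reduce to a single condition: the GPU is used only when the config requests it and CUDA is available, and every other case falls back to the CPU. The sketch below illustrates that decision together with an `ensure_dir` variant that cannot fail if the directory appears between the check and the creation. It is not the repository's code: `device_name` is a hypothetical helper that takes CUDA availability as a plain boolean (standing in for `torch.cuda.is_available()`) so the example runs without PyTorch, and the logging messages are omitted.

```python
import os
import tempfile


def device_name(config, cuda_available):
    """Return the device string that pick_device would select.

    Hypothetical helper: `cuda_available` stands in for
    torch.cuda.is_available() so this sketch needs no PyTorch.
    """
    if config["cuda"] and cuda_available:
        return "cuda:" + str(config["gpu"])
    # Every other combination of the two flags ends up on the CPU.
    return "cpu"


def ensure_dir(path):
    """Create `path` and any missing parents; do nothing if it exists."""
    os.makedirs(path, exist_ok=True)


if __name__ == "__main__":
    cfg = {"cuda": True, "gpu": 0}
    print(device_name(cfg, cuda_available=True))
    print(device_name(cfg, cuda_available=False))

    with tempfile.TemporaryDirectory() as root:
        target = os.path.join(root, "saved", "DDPGAgents_exp")
        ensure_dir(target)
        ensure_dir(target)  # calling it again is harmless
        print(os.path.isdir(target))
```

With PyTorch installed, the same decision could presumably be wired in as `torch.device(device_name(config, torch.cuda.is_available()))`.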