├── LICENSE ├── README.md ├── hw1 ├── README.md ├── demo.bash ├── experts │ ├── Ant-v2.pkl │ ├── HalfCheetah-v2.pkl │ ├── Hopper-v2.pkl │ ├── Humanoid-v2.pkl │ ├── Reacher-v2.pkl │ └── Walker2d-v2.pkl ├── hw1_instructions.pdf ├── load_policy.py ├── requirements.txt └── run_expert.py ├── hw2 ├── README.md ├── hw2_instructions.pdf ├── hw2_instructions.tex ├── logz.py ├── lunar_lander.py ├── plot.py ├── requirements.txt └── train_pg_f18.py └── hw3 ├── README.md ├── atari_wrappers.py ├── dqn.py ├── dqn_utils.py ├── hw3_instructions.pdf ├── logz.py ├── lunar_lander.py ├── plot.py ├── requirements.txt ├── run_dqn_atari.py ├── run_dqn_lander.py ├── run_dqn_ram.py └── train_ac_f18.py /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2018 KuNya 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Berkeley DeepRLcourse Homework in PyTorch 2 | ## Introduction 3 | 4 | In recent years, with the booming of deep learning, reinforcement learning has made great progress in solving complex tasks and has attracted more and more people`s attention. Also, many researchers start applying reinforcement learning algorithms to solve the problem in other fields (such as Natural Language Processing). 5 | 6 | So, there is a big need for learning those classic reinforcement learning algorithms in an easy way. 7 | 8 | As beginners in reinforcement learning, we found that [CS 294-112](http://rail.eecs.berkeley.edu/deeprlcourse/) at UC Berkeley is a great course where we can learn a lot of classic and advanced reinforcement learning algorithms. 9 | 10 | As the saying goes, “talk is cheap, show me your code.” It is very important to write algorithm in code correctly, instead of just knowing the algorithm. Luckily, CS 294-112 also provides programming assignments for those reinforcement learning algorithms. While, these assignments are mainly implemented in **TensorFlow**, which might be bad news for people who are more familiar with other deep learning frameworks. 11 | 12 | For the reasons above, we modified those assignments (for Fall 2018) and implemented in **PyTorch**, which is a framework that we often use in our research. 
13 | 14 | Moreover, we also provide [solutions](https://github.com/KuNyaa/berkeleydeeprlcourse-homework-pytorch-solution) to these assignments, which you can consult when you get stuck. 15 | 16 | Hope you enjoy it : ) 17 | 18 | 19 | 20 | ## What can you learn from it? 21 | 22 | - ### HW1: Imitation Learning 23 | 24 | In this assignment, you will implement the **Behavioral Cloning** and **DAgger** algorithms. 25 | 26 | In the experiments, you will see a case where Behavioral Cloning works well, and a case where DAgger learns a better policy than Behavioral Cloning. 27 | 28 | - ### HW2: Policy Gradients 29 | 30 | In this assignment, you will implement the **Policy Gradients** algorithm. 31 | 32 | In the experiments, you will compare different gradient estimators (full-trajectory and reward-to-go) and study how batch size and learning rate affect the algorithm's performance. Moreover, you will implement a **neural network baseline** that reduces the variance of the gradient estimator and helps the agent learn a better policy. 33 | 34 | - ### HW3: Q-Learning and Actor-Critic 35 | 36 | In this assignment, you will implement the **Deep Q-learning** and **Actor-Critic** algorithms. 37 | 38 | In the Deep Q-learning part, you will implement **vanilla DQN** and **double DQN** and compare their performance in different Atari game environments. You will also experiment with how hyperparameters affect the final results. 39 | 40 | In the Actor-Critic part, you will implement an **Actor-Critic** model based on your Policy Gradients implementation from HW2. Additionally, you will learn how to tune the hyperparameters of the Actor-Critic model so that it outperforms your previous Policy Gradients model equipped with the reward-to-go gradient estimator and the neural network baseline. 41 | 42 | - ### HW4: Model-Based RL 43 | 44 | ###### Coming Soon...... 45 | 46 | - ### HW5: Advanced Topics 47 | 48 | ###### Coming Soon...... 49 | 50 | 51 | 52 | ## How can you use it? 53 | 54 | #### If you want to learn: 55 | 56 | - ##### The whole course: 57 | 58 | You can simply follow the course syllabus and use this repo for the programming assignments. 59 | 60 | - ##### Policy Optimization style RL algorithms: 61 | 62 | You may want to finish HW2 and the Actor-Critic part of HW3, and read the related material on the course website. 63 | 64 | - ##### Dynamic Programming style RL algorithms: 65 | 66 | You may want to finish the Deep Q-learning part of HW3, and read the related material on the course website. 67 | 68 | #### Or you can just use it as you like : ) -------------------------------------------------------------------------------- /hw1/README.md: -------------------------------------------------------------------------------- 1 | # CS294-112 HW 1: Imitation Learning 2 | 3 | Modification: 4 | 5 | We implemented the forward pass of the expert policy network in numpy, so you can use any deep learning framework you like for this assignment. 6 | 7 | ------ 8 | 9 | Dependencies: 10 | 11 | * Python **3.5** 12 | * Numpy 13 | * MuJoCo version **1.50** and mujoco-py **1.50.1.56** 14 | * OpenAI Gym version **0.10.5** 15 | 16 | Once Python **3.5** is installed, you can install the remaining dependencies using `pip install -r requirements.txt`. 17 | 18 | **Note**: MuJoCo versions up to 1.5 do not support NVMe disks and therefore won't be compatible with recent Mac machines. 19 | There is a request for OpenAI to support this, which can be followed [here](https://github.com/openai/gym/issues/638).
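Since the expert's forward pass is plain numpy, you are free to train your imitation policy in whatever framework you prefer. As a rough, hypothetical sketch (the file path, network sizes, and training schedule below are only examples, and it assumes you have already saved rollouts with `run_expert.py`), a minimal behavioral cloning loop in PyTorch could look like this:

```python
import pickle
import torch
import torch.nn as nn

# Load rollouts previously saved by run_expert.py (example path).
with open('expert_data/Hopper-v2.pkl', 'rb') as f:
    data = pickle.load(f)
obs = torch.as_tensor(data['observations'], dtype=torch.float32)
acts = torch.as_tensor(data['actions'], dtype=torch.float32).squeeze(1)

# Example MLP policy mapping observations to actions.
policy = nn.Sequential(
    nn.Linear(obs.shape[1], 64), nn.Tanh(),
    nn.Linear(64, 64), nn.Tanh(),
    nn.Linear(64, acts.shape[1]),
)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

# Plain supervised regression of expert actions from observations.
for epoch in range(20):
    perm = torch.randperm(obs.shape[0])
    for start in range(0, obs.shape[0], 256):
        batch = perm[start:start + 256]
        loss = nn.functional.mse_loss(policy(obs[batch]), acts[batch])
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```

DAgger wraps this same supervised step in a loop that relabels the states visited by the learned policy with the expert's actions.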
20 | 21 | 22 | 23 | The only file that you need to look at is `run_expert.py`, which is code to load up an expert policy, run a specified number of roll-outs, and save out data. 24 | 25 | In `experts/`, the provided expert policies are: 26 | * Ant-v2.pkl 27 | * HalfCheetah-v2.pkl 28 | * Hopper-v2.pkl 29 | * Humanoid-v2.pkl 30 | * Reacher-v2.pkl 31 | * Walker2d-v2.pkl 32 | 33 | The name of the pickle file corresponds to the name of the gym environment. 34 | 35 | 36 | 37 | See the [HW1 PDF](./hw1_instructions.pdf) for further instructions. -------------------------------------------------------------------------------- /hw1/demo.bash: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | set -eux 3 | for e in Hopper-v2 Ant-v2 HalfCheetah-v2 Humanoid-v2 Reacher-v2 Walker2d-v2 4 | do 5 | python run_expert.py experts/$e.pkl $e --render --num_rollouts=1 6 | done 7 | -------------------------------------------------------------------------------- /hw1/experts/Ant-v2.pkl: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/KuNyaa/berkeleydeeprlcourse-homework-pytorch/b7cb9fb3479b94c4e31fca32b55f7ce2586cc81d/hw1/experts/Ant-v2.pkl -------------------------------------------------------------------------------- /hw1/experts/HalfCheetah-v2.pkl: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/KuNyaa/berkeleydeeprlcourse-homework-pytorch/b7cb9fb3479b94c4e31fca32b55f7ce2586cc81d/hw1/experts/HalfCheetah-v2.pkl -------------------------------------------------------------------------------- /hw1/experts/Hopper-v2.pkl: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/KuNyaa/berkeleydeeprlcourse-homework-pytorch/b7cb9fb3479b94c4e31fca32b55f7ce2586cc81d/hw1/experts/Hopper-v2.pkl -------------------------------------------------------------------------------- /hw1/experts/Humanoid-v2.pkl: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/KuNyaa/berkeleydeeprlcourse-homework-pytorch/b7cb9fb3479b94c4e31fca32b55f7ce2586cc81d/hw1/experts/Humanoid-v2.pkl -------------------------------------------------------------------------------- /hw1/experts/Reacher-v2.pkl: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/KuNyaa/berkeleydeeprlcourse-homework-pytorch/b7cb9fb3479b94c4e31fca32b55f7ce2586cc81d/hw1/experts/Reacher-v2.pkl -------------------------------------------------------------------------------- /hw1/experts/Walker2d-v2.pkl: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/KuNyaa/berkeleydeeprlcourse-homework-pytorch/b7cb9fb3479b94c4e31fca32b55f7ce2586cc81d/hw1/experts/Walker2d-v2.pkl -------------------------------------------------------------------------------- /hw1/hw1_instructions.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/KuNyaa/berkeleydeeprlcourse-homework-pytorch/b7cb9fb3479b94c4e31fca32b55f7ce2586cc81d/hw1/hw1_instructions.pdf -------------------------------------------------------------------------------- /hw1/load_policy.py: -------------------------------------------------------------------------------- 1 | import pickle 2 | import numpy as np 3 | from functools import reduce 4 | 5 | 6 | 
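# Note: the expert policy file is a pickled dict with a 'nonlin_type' entry and a
# 'GaussianPolicy' entry that holds observation-normalization statistics ('obsnorm'),
# a stack of affine hidden layers ('hidden' -> 'FeedforwardNet'), and an output
# layer ('out'). load_policy() rebuilds this network with plain numpy operations and
# returns a forward_pass function mapping a [batch_size, obs_dim] observation array
# to a [batch_size, action_dim] action array.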
def load_policy(filename): 7 | def read_layer(l): 8 | assert list(l.keys()) == ['AffineLayer'] 9 | assert sorted(l['AffineLayer'].keys()) == ['W', 'b'] 10 | W, b = l['AffineLayer']['W'].astype(np.float32), l['AffineLayer']['b'].astype(np.float32) 11 | return lambda x: np.matmul(x, W) + b 12 | 13 | def build_nonlin_fn(nonlin_type): 14 | if nonlin_type == 'lrelu': 15 | leak = 0.01 # openai/imitation nn.py:233 16 | return lambda x: 0.5 * (1 + leak) * x + 0.5 * (1 - leak) * np.abs(x) 17 | elif nonlin_type == 'tanh': 18 | return lambda x: np.tanh(x) 19 | else: 20 | raise NotImplementedError(nonlin_type) 21 | 22 | with open(filename, 'rb') as f: 23 | data = pickle.loads(f.read()) 24 | 25 | # assert len(data.keys()) == 2 26 | nonlin_type = data['nonlin_type'] 27 | nonlin_fn = build_nonlin_fn(nonlin_type) 28 | policy_type = [k for k in data.keys() if k != 'nonlin_type'][0] 29 | 30 | assert policy_type == 'GaussianPolicy', 'Policy type {} not supported'.format(policy_type) 31 | policy_params = data[policy_type] 32 | 33 | assert set(policy_params.keys()) == {'logstdevs_1_Da', 'hidden', 'obsnorm', 'out'} 34 | 35 | # Build observation normalization layer 36 | assert list(policy_params['obsnorm'].keys()) == ['Standardizer'] 37 | obsnorm_mean = policy_params['obsnorm']['Standardizer']['mean_1_D'] 38 | obsnorm_meansq = policy_params['obsnorm']['Standardizer']['meansq_1_D'] 39 | obsnorm_stdev = np.sqrt(np.maximum(0, obsnorm_meansq - np.square(obsnorm_mean))) 40 | #print('obs', obsnorm_mean.shape, obsnorm_stdev.shape) 41 | 42 | 43 | # Build hidden layers 44 | assert list(policy_params['hidden'].keys()) == ['FeedforwardNet'] 45 | layer_params = policy_params['hidden']['FeedforwardNet'] 46 | layers = [] 47 | for layer_name in sorted(layer_params.keys()): 48 | l = layer_params[layer_name] 49 | fc_layer = read_layer(l) 50 | layers += [fc_layer, nonlin_fn] 51 | 52 | # Build output layer 53 | fc_layer = read_layer(policy_params['out']) 54 | layers += [fc_layer] 55 | layers_forward = lambda inp: reduce(lambda x, fn: fn(x), [inp] + layers) 56 | 57 | 58 | def forward_pass(obs): 59 | ''' Build the forward pass for policy net. 60 | 61 | Input: batched observation. (shape: [batch_size, obs_dim]) 62 | 63 | Output: batched action. (shape: [batch_size, action_dim]) 64 | ''' 65 | obs = obs.astype(np.float32) 66 | normed_obs = (obs - obsnorm_mean) / (obsnorm_stdev + 1e-6) # 1e-6 constant from Standardizer class in nn.py:409 in openai/imitation 67 | output = layers_forward(normed_obs.astype(np.float32)) 68 | 69 | return output 70 | 71 | return forward_pass 72 | -------------------------------------------------------------------------------- /hw1/requirements.txt: -------------------------------------------------------------------------------- 1 | gym==0.10.5 2 | mujoco-py==1.50.1.56 3 | numpy 4 | seaborn 5 | -------------------------------------------------------------------------------- /hw1/run_expert.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | """ 3 | Code to load an expert policy and generate roll-out data for behavioral cloning. 
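The collected rollouts are saved to expert_data/<envname>.pkl as a pickled dict with 'observations' and 'actions' numpy arrays.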
4 | Example usage: 5 | python run_expert.py experts/Humanoid-v1.pkl Humanoid-v1 --render \ 6 | --num_rollouts 20 7 | 8 | Modified from the script written by Jonathan Ho (hoj@openai.com) 9 | """ 10 | 11 | import os 12 | import argparse 13 | import pickle 14 | import numpy as np 15 | import gym 16 | import load_policy 17 | 18 | def main(): 19 | parser = argparse.ArgumentParser() 20 | parser.add_argument('expert_policy_file', type=str) 21 | parser.add_argument('envname', type=str) 22 | parser.add_argument('--render', action='store_true') 23 | parser.add_argument("--max_timesteps", type=int) 24 | parser.add_argument('--num_rollouts', type=int, default=20, 25 | help='Number of expert roll outs') 26 | args = parser.parse_args() 27 | 28 | print('loading and building expert policy') 29 | policy_net = load_policy.load_policy(args.expert_policy_file) 30 | print('loaded and built') 31 | 32 | env = gym.make(args.envname) 33 | max_steps = args.max_timesteps or env.spec.timestep_limit 34 | 35 | returns = [] 36 | observations = [] 37 | actions = [] 38 | for i in range(args.num_rollouts): 39 | print('iter', i) 40 | obs = env.reset() 41 | done = False 42 | totalr = 0. 43 | steps = 0 44 | while not done: 45 | action = policy_net(obs[None, :]) 46 | observations.append(obs) 47 | actions.append(action) 48 | obs, r, done, _ = env.step(action) 49 | totalr += r 50 | steps += 1 51 | if args.render: 52 | env.render() 53 | if steps % 100 == 0: print("%i/%i"%(steps, max_steps)) 54 | if steps >= max_steps: 55 | break 56 | returns.append(totalr) 57 | 58 | print('returns', returns) 59 | print('mean return', np.mean(returns)) 60 | print('std of return', np.std(returns)) 61 | 62 | expert_data = {'observations': np.array(observations), 63 | 'actions': np.array(actions)} 64 | 65 | if not os.path.exists('expert_data'): 66 | os.makedirs('expert_data') 67 | 68 | with open(os.path.join('expert_data', args.envname + '.pkl'), 'wb') as f: 69 | pickle.dump(expert_data, f, pickle.HIGHEST_PROTOCOL) 70 | 71 | if __name__ == '__main__': 72 | main() 73 | -------------------------------------------------------------------------------- /hw2/README.md: -------------------------------------------------------------------------------- 1 | # CS294-112 HW 2: Policy Gradient 2 | 3 | Modification: 4 | 5 | In general, we followed the code structure of the original version and modified the neural network part to pytorch. 6 | 7 | Because of the different between the static graphs framework and the dynamic graphs framework, we merged and added some code in `train_pg_f18.py`. We also adapted the instructions of this assignment for pytorch. (Thanks to CS294-112 for offering ![equation](http://latex.codecogs.com/gif.latex?\LaTeX) code for the instructions) And you can just follow the pytorch version instructions we wrote. 8 | 9 | ------ 10 | 11 | Dependencies: 12 | 13 | * Python **3.5** 14 | * Numpy version **1.14.5** 15 | * Pytorch version **0.4.0** 16 | * MuJoCo version **1.50** and mujoco-py **1.50.1.56** 17 | * OpenAI Gym version **0.10.5** 18 | * seaborn 19 | * Box2D==**2.3.2** 20 | 21 | Before doing anything, first replace `gym/envs/box2d/lunar_lander.py` with the provided `lunar_lander.py` file. 22 | 23 | The only file that you need to look at is `train_pg_f18.py`, which you will implement. 24 | 25 | See the [HW2 PDF](./hw2_instructions.pdf) for further instructions. 
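As a quick orientation before opening `train_pg_f18.py`: the methods `Agent.sample_action` and `Agent.get_log_prob` described in the instructions can be built from standard `torch.distributions` calls. The standalone functions below are only a simplified, hypothetical illustration of that machinery for a discrete-action task (the network and sizes are made up), not the assignment's actual class structure:

```python
import torch
import torch.nn as nn

# Toy policy network producing categorical logits (sizes are just examples).
policy = nn.Sequential(nn.Linear(4, 32), nn.Tanh(), nn.Linear(32, 2))

def sample_action(obs):
    # obs: numpy array of shape [obs_dim]; returns a sampled integer action.
    logits = policy(torch.as_tensor(obs, dtype=torch.float32))
    return torch.distributions.Categorical(logits=logits).sample().item()

def get_log_prob(obs_batch, act_batch):
    # obs_batch: [batch, obs_dim] float tensor; act_batch: [batch] long tensor.
    logits = policy(obs_batch)
    return torch.distributions.Categorical(logits=logits).log_prob(act_batch)
```

The policy gradient loss you will implement is then, roughly, the negative mean of these log-probabilities weighted by your estimates of the return (or advantage).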
26 | -------------------------------------------------------------------------------- /hw2/hw2_instructions.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/KuNyaa/berkeleydeeprlcourse-homework-pytorch/b7cb9fb3479b94c4e31fca32b55f7ce2586cc81d/hw2/hw2_instructions.pdf -------------------------------------------------------------------------------- /hw2/hw2_instructions.tex: -------------------------------------------------------------------------------- 1 | \documentclass[12pt]{article} 2 | \usepackage{fullpage} 3 | \usepackage{url} 4 | \usepackage{amsmath} 5 | \usepackage{amsfonts} 6 | \usepackage{algorithm} 7 | \usepackage{algorithmic} 8 | \usepackage{graphicx} 9 | \usepackage{hyperref} 10 | \usepackage{color} 11 | \usepackage{listings} 12 | \usepackage{verbatim} 13 | \usepackage{enumitem} 14 | \usepackage[parfill]{parskip} 15 | 16 | \newcommand{\xb}{\mathbf{x}} 17 | \newcommand{\yb}{\mathbf{y}} 18 | \newcommand{\wb}{\mathbf{w}} 19 | \newcommand{\Xb}{\mathbf{X}} 20 | \newcommand{\Yb}{\mathbf{Y}} 21 | \newcommand{\tr}{^T} 22 | \newcommand{\hb}{\mathbf{h}} 23 | \newcommand{\Hb}{\mathbf{H}} 24 | 25 | \newcommand{\cmt}[1]{{\footnotesize\textcolor{red}{#1}}} 26 | \newcommand{\todo}[1]{\cmt{TO-DO: #1}} 27 | 28 | \title{CS294-112 Deep Reinforcement Learning HW2: \\ Policy Gradients\\ 29 | \textbf{Pytorch Version}} 30 | 31 | \author{ 32 | } 33 | 34 | \date{} 35 | 36 | \usepackage{courier} 37 | 38 | \definecolor{codegreen}{rgb}{0,0.6,0} 39 | \definecolor{codegray}{rgb}{0.5,0.5,0.5} 40 | \definecolor{codepurple}{rgb}{0.58,0,0.82} 41 | \definecolor{backcolour}{rgb}{0.95,0.95,0.92} 42 | 43 | \lstdefinestyle{mystyle}{ 44 | backgroundcolor=\color{backcolour}, 45 | commentstyle=\color{codegreen}, 46 | keywordstyle=\color{magenta}, 47 | numberstyle=\tiny\color{codegray}, 48 | stringstyle=\color{codepurple}, 49 | basicstyle=\footnotesize\ttfamily, 50 | breakatwhitespace=false, 51 | breaklines=true, 52 | captionpos=b, 53 | keepspaces=true, 54 | %numbers=left, 55 | numbersep=5pt, 56 | showspaces=false, 57 | showstringspaces=false, 58 | showtabs=false, 59 | tabsize=2 60 | } 61 | 62 | \lstset{style=mystyle} 63 | 64 | \begin{document} 65 | 66 | 67 | \maketitle 68 | 69 | \section{Introduction} 70 | The goal of this assignment is to experiment with policy gradient and its variants, including variance reduction methods. Your goals will be to set up policy gradient for both continuous and discrete environments and experiment with variance reduction tricks, including implementing reward-to-go and neural network baselines. 71 | 72 | \section{Review} 73 | Recall that the reinforcement learning objective is to learn a $\theta^*$ that maximizes the objective function: 74 | \begin{align} \label{objective} 75 | J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta(\tau)} \left[r(\tau)\right] 76 | \end{align} 77 | where 78 | $$\pi_\theta(\tau) = p(s_1, a_1, ..., s_T, a_T) = p(s_1)\pi_\theta(a_1|s_1) \prod_{t=2}^T p(s_t | s_{t-1}, a_{t-1}) \pi_\theta(a_t | s_t)$$ 79 | and 80 | $$r(\tau) = r(s_1, a_1, ..., s_T, a_T) = \sum_{t=1}^T r(s_t, a_t).$$ 81 | 82 | The policy gradient approach is to directly take the gradient of this objective: 83 | \begin{align} 84 | \nabla_\theta J(\theta) &= \nabla_\theta \int \pi_\theta(\tau) r(\tau) d\tau \label{policygradientintegral} \\ 85 | &= \int \pi_\theta(\tau) \nabla_\theta \log \pi_\theta(\tau) r(\tau) d\tau. 
\label{scorefunctionpg} 86 | \end{align} 87 | In practice, the expectation over trajectories $\tau$ can be approximated from a batch of $N$ sampled trajectories: 88 | \begin{align} 89 | \nabla_\theta J(\theta) &\approx \frac{1}{N} \sum_{i=1}^N \nabla_\theta \log \pi_\theta(\tau_i) r(\tau_i) \\ 90 | &= \frac{1}{N} \sum_{i=1}^N \left(\sum_{t=1}^T \nabla_\theta \log \pi_\theta(a_{it} | s_{it})\right)\left(\sum_{t=1}^T r(s_{it}, a_{it})\right). \label{estimatedscorefunctionpg} 91 | \end{align} 92 | Here we see that the policy $\pi_\theta$ is a probability distribution over the action space, conditioned on the state. In the agent-environment loop, the agent samples an action $a_t$ from $\pi_\theta(\cdot | s_t)$ and the environment responds with a reward $r(s_t, a_t)$. 93 | 94 | One way to reduce the variance of the policy gradient is to exploit causality: the notion that the policy cannot affect rewards in the past, yielding following the modified objective, where the sum of rewards here is a sample estimate of the $Q$ function, known as the ``reward-to-go:'' 95 | \begin{align} 96 | \nabla_\theta J(\theta) &\approx \frac{1}{N} \sum_{i=1}^N \sum_{t=1}^T \nabla_\theta \log \pi_\theta(a_{it} | s_{it})\left(\sum_{t'=t}^T r(s_{it'}, a_{it'})\right). 97 | \end{align} 98 | 99 | Multiplying a discount factor $\gamma$ to the rewards can be interpreted as encouraging the agent to focus on rewards closer in the future, which can also be thought of as a means for reducing variance (because there is more variance possible futures further into the future). We saw in lecture that the discount factor can be incorporated in two ways. 100 | 101 | The first way applies the discount on the rewards from full trajectory: 102 | \begin{align} \label{discount_full} 103 | \nabla_\theta J(\theta) &\approx \frac{1}{N} \sum_{i=1}^N \left(\sum_{t=1}^T \nabla_\theta \log \pi_\theta(a_{it} | s_{it})\right)\left(\sum_{t=1}^T \gamma^{t-1} r(s_{it}, a_{it})\right) 104 | \end{align} 105 | and the second way applies the discount on the ``reward-to-go:'' 106 | \begin{align} \label{discount_rtg} 107 | \nabla_\theta J(\theta) &\approx \frac{1}{N} \sum_{i=1}^N \sum_{t=1}^T \nabla_\theta \log \pi_\theta(a_{it} | s_{it})\left(\sum_{t'=t}^T \gamma^{t'-t} r(s_{it'}, a_{it'})\right). 108 | \end{align}. 109 | 110 | We have seen in lecture that subtracting a baseline that is a constant with respect to $\tau$ from the sum of rewards 111 | \begin{align} \label{constant_wrt_tau} 112 | \nabla_\theta J(\theta) = \nabla_\theta \mathbb{E}_{\tau \sim \pi_\theta(\tau)} \left[r(\tau) - b\right]\ 113 | \end{align} 114 | leaves the policy gradient unbiased because $$\nabla_\theta \mathbb{E}_{\tau \sim \pi_\theta(\tau)} \left[b\right] = \mathbb{E}_{\tau \sim \pi_\theta(\tau)} \left[\nabla_\theta \log \pi_\theta(\tau) \cdot b\right] = 0.$$ 115 | 116 | In this assignment, we will implement a value function $V_\phi^\pi$ which acts as a \textit{state-dependent} baseline. The value function is trained to approximate the sum of future rewards starting from a particular state: 117 | \begin{align} 118 | V_\phi^\pi(s_t) \approx \sum_{t'=t}^T \mathbb{E}_{\pi_\theta} \left[r(s_{t'}, a_{t'}) | s_t\right], 119 | \end{align} 120 | so the approximate policy gradient now looks like this: 121 | \begin{align} \label{state_dependent_baseline} 122 | \nabla_\theta J(\theta) &\approx \frac{1}{N} \sum_{i=1}^N \sum_{t=1}^T \nabla_\theta \log \pi_\theta(a_{it} | s_{it})\left(\left(\sum_{t'=t}^T \gamma^{t'-t} r(s_{it'}, a_{it'})\right) - V_\phi^\pi\left(s_{it}\right)\right). 
123 | \end{align} 124 | 125 | \textbf{Problem 1. State-dependent baseline:} 126 | In lecture we saw that the policy gradient is unbiased if the baseline is a constant with respect to $\tau$ (Equation~\ref{constant_wrt_tau}). The purpose of this problem is to help convince ourselves that subtracting a state-dependent baseline from the return keeps the policy gradient unbiased. For clarity we will use $p_\theta(\tau)$ instead of $\pi_\theta(\tau)$, although they mean the same thing. Using the \href{https://en.wikipedia.org/wiki/Law_of_total_expectation}{\textcolor{blue}{law of iterated expectations}} we will show that the policy gradient is still unbiased if the baseline $b$ is function of a state at a particular timestep of $\tau$ (Equation~\ref{state_dependent_baseline}). Recall from equation \ref{scorefunctionpg} that the policy gradient can be expressed as: 127 | \begin{align*} 128 | &\mathbb{E}_{\tau \sim p_\theta(\tau)} \left[\nabla_\theta \log p_\theta(\tau)r(\tau)\right]. 129 | \end{align*} 130 | By breaking up $p_\theta(\tau)$ into dynamics and policy terms, we can discard the dynamics terms, which are not functions of $\theta$: 131 | \begin{align*} 132 | &\mathbb{E}_{\tau \sim p_\theta(\tau)} \left[\sum_{t=1}^T \nabla_\theta \log \pi_\theta(a_t|s_t) \left(\sum_{t'=1}^T r(s_{t'}, a_{t'})\right)\right]. 133 | \end{align*} 134 | When we subtract a state dependent baseline $b(s_t)$ (recall equation \ref{state_dependent_baseline}) we get 135 | \begin{align*} 136 | \mathbb{E}_{\tau \sim p_\theta(\tau)} \left[\sum_{t=1}^T \nabla_\theta \log \pi_\theta(a_t|s_t) \left(\left(\sum_{t'=1}^T r(s_{t'}, a_{t'})\right) - b(s_t)\right)\right]. 137 | \end{align*} 138 | An alternative approach is to look at the entire trajectory and consider a particular timestep $t^* \in [1, T-1]$ (the timestep $T$ case would be very similar to part (a)). 139 | Our goal for this problem is to show that 140 | \begin{align*} 141 | \mathbb{E}_{\tau \sim p_\theta(\tau)} \left[\sum_{t=1}^T \nabla_\theta \log \pi_\theta(a_t|s_t) b(s_t)\right] = 0. 142 | \end{align*} 143 | By \href{https://brilliant.org/wiki/linearity-of-expectation/}{\textcolor{blue}{linearity of expectation}} we can consider each term in this sum independently, so we can equivalently show that 144 | \begin{align} \label{independent} 145 | \sum_{t=1}^T \mathbb{E}_{\tau \sim p_\theta(\tau)}\left[ \nabla_\theta \log \pi_\theta(a_t|s_t) \left(b(s_t)\right)\right] = 0. 146 | \end{align} 147 | \begin{enumerate} [label=(\alph*)] 148 | \item Using the chain rule, we can express $p_\theta(\tau)$ as a product of the state-action marginal $(s_t, a_t)$ and the probability of the rest of the trajectory conditioned on $(s_t, a_t)$ (which we denote as $(\tau / s_t, a_t | s_t, a_t)$): 149 | \begin{align*} 150 | p_\theta(\tau) = p_\theta(s_t, a_t)p_\theta(\tau / s_t, a_t | s_t, a_t) 151 | \end{align*} 152 | Please show equation \ref{independent} by using the law of iterated expectations, breaking $\mathbb{E}_{\tau \sim p_\theta(\tau)}$ by decoupling the state-action marginal from the rest of the trajectory. 
153 | \item Alternatively, we can consider the structure of the MDP and express $p_\theta(\tau)$ as a product of the trajectory distribution up to $s_t$ (which we denote as $(s_{1:t}, a_{1:t-1})$) and the trajectory distribution after $s_t$ conditioned on the first part (which we denote as $(s_{t+1:T}, a_{t:T} | s_{1:t}, a_{1:t-1})$): 154 | \begin{align*} 155 | p_\theta(\tau) = p_\theta(s_{1:t}, a_{1:t-1}) p_\theta(s_{t+1:T}, a_{t:T} | s_{1:t}, a_{1:t-1}) 156 | \end{align*} 157 | \begin{enumerate} 158 | \item Explain why, for the inner expectation, conditioning on $(s_1, a_1, ..., a_{t^*-1}, s_{t^*})$ is equivalent to conditioning only on $s_{t^*}$. 159 | \item Please show equation \ref{independent} by using the law of iterated expectations, breaking $\mathbb{E}_{\tau \sim p_\theta(\tau)}$ by decoupling trajectory up to $s_t$ from the trajectory after $s_t$. 160 | \end{enumerate} 161 | \end{enumerate} 162 | Since the policy gradient with respect to $\theta$ can be decoupled as a summation of terms over timesteps $t \in [1, T]$, because we have shown that the policy gradient is unbiased for each of these terms, 163 | the entire policy gradient is also unbiased with respect to a vector of state-dependent baselines over the timesteps: $[b(s_1), b(s_2), ... b(s_T)]$. 164 | 165 | \section{Code Setup} 166 | \subsection{Files} 167 | The starter code is available \href{https://github.com/KuNyaa/berkeleydeeprlcourse-homework-pytorch/tree/master/hw2}{\textcolor{blue}{here}}. 168 | The only file you need to modify in this homework is \verb|train_pg_f18.py|. The files \verb|logz.py| and \verb|plots.py| are utility files; while you should look at them to understand their functionality, you will not modify them. For the Lunar Lander task, use the provided \verb|lunar_lander.py| file instead of \verb|gym/envs/box2d/lunar_lander.py|. After you fill in the appropriate methods, you should be able to just run \verb|python train_pg_f18.py| with some command line options to perform the experiments. To visualize the results, you can run \verb|python plot.py path/to/logdir|. 169 | 170 | \subsection{Overview} 171 | The function \verb|train_PG| is used to perform the actual training for policy gradient. The parameters passed into this function specify the algorithm's hyperparameters and environment. The \verb|Agent| class contains methods that define the neural networks, sample trajectories, estimate returns, and update the parameters of the policy. 172 | 173 | At a high level, the dataflow of the code is structured like this: 174 | \begin{enumerate} 175 | \item \textit{Define neural network components} from \verb|torch.nn| in Pytorch. 176 | \item \textit{Build the forward pass function} for your neural network model by using the components you just defined. 177 | \end{enumerate} 178 | Then we will repeat Steps 3 through 5 for $N$ iterations: 179 | \begin{enumerate}\setcounter{enumi}{2} 180 | \item \textit{Sample trajectories} by executing the functions that samples an action given an observation from the environment. Collect the states, actions, and rewards as numpy variables. 181 | \item \textit{Estimate returns} in numpy (estimated Q values, baseline predictions, advantages). 182 | \item \textit{Update parameters} by executing the functions that updates the parameters given what you computed in Step 4. 183 | \end{enumerate} 184 | 185 | \section{Building Neural Networks} 186 | 187 | \textbf{Problem 2. Neural networks:} We will now begin to implement a neural network that parametrizes $\pi_\theta$. 
188 | \begin{enumerate} [label=(\alph*)] 189 | \item Implement the utility function, \verb|build_mlp|, which will build a feedforward neural network with fully connected units (Hint: use \texttt{torch.nn.Linear}). Test it to make sure that it produces outputs of the expected size and shape. \textbf{You do not need to include anything in your write-up about this,} it will just make your life easier. 190 | \item Next, implement the functions used for forward pass. At this point, you only need to implement the parts with the ``Problem 2'' header. 191 | \begin{enumerate} [label=(\roman*)] 192 | \item Define the model components in \texttt 193 | {PolicyNet.define\_model\_components}. You should define the parameters of your model here, which will be tracked by \verb|torch.autograd| later. They can be any instance of \verb|torch.nn.Module| or \verb|torch.nn.Parameter|. 194 | \item Define the method \texttt{PolicyNet.forward}: This defines forward pass for our policy network. It outputs the parameters of a distribution $\pi_\theta(a|s)$. In this homework, when the distribution is over discrete actions these parameters will be the logits of a categorical distribution, and when the distribution is over continuous actions these parameters will be the mean and the log standard deviation of a multivariate Gaussian distribution. 195 | \item Define the method \texttt{Agent.sample\_action}: This receives an observation and produces an action that sampled from $\pi_\theta(a|s)$. This method will be called in \texttt{Agent.sample\_trajectory}. 196 | \item Define the method \texttt{Agent.get\_log\_prob}: Given an action that the agent took in the environment, this computes the log probability of that action under $\pi_\theta(a|s)$. This will be used in the loss function. 197 | 198 | \end{enumerate} 199 | \end{enumerate} 200 | 201 | \section{Implement Policy Gradient} 202 | \subsection{Implementing the policy gradient loop} 203 | \textbf{Problem 3. Policy Gradient:} Recall from lecture that an RL algorithm can viewed as consisting of three parts, which are reflected in the training loop of \verb|train_PG|: 204 | \begin{enumerate} 205 | \item \verb|Agent.sample_trajectories|: Generate samples (e.g. run the policy to collect trajectories consisting of state transitions ($s, a, s', r$)) 206 | \item \verb|Agent.estimate_return|: Estimate the return (e.g. sum together discounted rewards from the trajectories, or learn a model that predicts expected total future discounted reward) 207 | \item \verb|Agent.update_parameters|: Improve the policy (e.g. update the parameters of the policy with policy gradient) 208 | \end{enumerate} 209 | In our implementation, for clarity we will update the parameters of the value function baseline also in the third step (\verb|Agent.update_parameters|), rather than in the second step (as was described in lecture). You only need to implement the parts with the ``Problem 3'' header. 210 | \begin{enumerate} [label=(\alph*)] 211 | \item \textbf{Sample trajectories:} In \texttt{Agent.sample\_trajectories}, use the method \\ \texttt{Agent.sample\_action} which you just defined in ``Problem 2'' to sample an action given an observation from the environment. 212 | \item \textbf{Estimate return:} We will now implement $r(\tau)$ from Equation \ref{objective}. 
213 | Please implement the method \verb|Agent.sum_of_rewards|, which will return a sample estimate of the discounted return, 214 | for both the full-trajectory (Equation~\ref{discount_full}) case, where $$r(\tau_i) = \sum_{t=1}^T \gamma^{t'-1} r(s_{it}, a_{it})$$ and 215 | for the ``reward-to-go'' case (Equation~\ref{discount_rtg}) where $$r(\tau_i) = \sum_{t'=t}^T \gamma^{t'-t} r(s_{it'}, a_{it'}).$$ 216 | In \verb|Agent.estimate_return|, normalize the advantages to have a mean of zero and a standard deviation of one. This is a trick for reducing variance. 217 | \item \textbf{Update parameters:} 218 | In \verb|Agent.update_parameters| implement a loss function (which can use the result from \texttt{Agent.get\_log\_prob}) to whose gradient is $$\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^N \nabla_\theta \log \pi_\theta(\tau_i) r(\tau_i).$$ 219 | Then, set the optimizer (we use \verb|torch.optim.Adam| in this case) in the right way and perform gradient decent to update the parameters of the policy. 220 | \end{enumerate} 221 | 222 | 223 | \subsection{Experiments} 224 | After you have implemented the code, we will run experiments to get a feel for how different settings impact the performance of policy gradient methods. 225 | 226 | \textbf{Problem 4. CartPole:} Run the PG algorithm in the discrete \verb|CartPole-v0| environment from the command line as follows: 227 | \begin{lstlisting} 228 | python train_pg_f18.py CartPole-v0 -n 100 -b 1000 -e 3 -dna --exp_name sb_no_rtg_dna 229 | python train_pg_f18.py CartPole-v0 -n 100 -b 1000 -e 3 -rtg -dna --exp_name sb_rtg_dna 230 | python train_pg_f18.py CartPole-v0 -n 100 -b 1000 -e 3 -rtg --exp_name sb_rtg_na 231 | python train_pg_f18.py CartPole-v0 -n 100 -b 5000 -e 3 -dna --exp_name lb_no_rtg_dna 232 | python train_pg_f18.py CartPole-v0 -n 100 -b 5000 -e 3 -rtg -dna --exp_name lb_rtg_dna 233 | python train_pg_f18.py CartPole-v0 -n 100 -b 5000 -e 3 -rtg --exp_name lb_rtg_na 234 | \end{lstlisting} 235 | 236 | What's happening there: 237 | \begin{itemize} 238 | \item \verb|-n| : Number of iterations. 239 | \item \verb|-b| : Batch size (number of state-action pairs sampled while acting according to the current policy at each iteration). 240 | \item \verb|-e| : Number of experiments to run with the same configuration. Each experiment will start with a different randomly initialized policy, and have a different stream of random numbers. 241 | \item \verb|-dna| : Flag: if present, sets \verb|normalize_advantages| to False. Otherwise, by default, \verb|normalize_advantages=True|. 242 | \item \verb|-rtg| : Flag: if present, sets \verb|reward_to_go=True|. Otherwise, \verb|reward_to_go=False| by default. 243 | \item \verb|--exp_name| : Name for experiment, which goes into the name for the data directory. 244 | \end{itemize} 245 | 246 | Various other command line arguments will allow you to set batch size, learning rate, network architecture (number of hidden layers and the size of the hidden layers---for CartPole, you can use one hidden layer with 32 units), and more. 247 | 248 | \textbf{Deliverables for report:} 249 | 250 | \begin{itemize} 251 | \item Graph the results of your experiments \textbf{using the plot.py file we provide.} Create two graphs. 252 | \begin{itemize} 253 | \item In the first graph, compare the learning curves (average return at each iteration) for the experiments prefixed with \verb|sb_|. (The small batch experiments.) 254 | \item In the second graph, compare the learning curves for the experiments prefixed with \verb|lb_|. 
(The large batch experiments.) 255 | \end{itemize} 256 | \item Answer the following questions briefly: 257 | \begin{itemize} 258 | \item Which gradient estimator has better performance without advantage-centering---the trajectory-centric one, or the one using reward-to-go? 259 | \item Did advantage centering help? 260 | \item Did the batch size make an impact? 261 | \end{itemize} 262 | \item Provide the exact command line configurations you used to run your experiments. (To verify batch size, learning rate, architecture, and so on.) 263 | \end{itemize} 264 | 265 | \textbf{What to Expect:} 266 | \begin{itemize} 267 | \item The best configuration of CartPole in both the large and small batch cases converge to a maximum score of 200. 268 | \end{itemize} 269 | 270 | 271 | \textbf{Problem 5. InvertedPendulum:} Run experiments in \verb|InvertedPendulum-v2| continuous control environment as follows: 272 | \begin{lstlisting} 273 | python train_pg_f18.py InvertedPendulum-v2 -ep 1000 --discount 0.9 -n 100 -e 3 -l 2 -s 64 -b -lr -rtg --exp_name hc_b_r 274 | \end{lstlisting} 275 | where your task is to find the smallest batch size \texttt{b*} and largest learning rate \texttt{r*} that gets to optimum (maximum score of 1000) in less than 100 iterations. The policy performance may fluctuate around 1000 -- this is fine. The precision of \texttt{b*} and \texttt{r*} need only be one significant digit. 276 | 277 | \textbf{Deliverables:} 278 | 279 | \begin{itemize} 280 | \item Given the \texttt{b*} and \texttt{r*} you found, provide a learning curve where the policy gets to optimum (maximum score of ~1000) in less than 100 iterations. (This may be for a single random seed, or averaged over multiple.) 281 | \item Provide the exact command line configurations you used to run your experiments. 282 | \end{itemize} 283 | 284 | 285 | \section{Implement Neural Network Baselines} 286 | For the rest of the assignment we will use ``reward-to-go.'' 287 | 288 | \textbf{Problem 6. Neural network baseline:} We will now implement a value function as a state-dependent neural network baseline. The sections in the code are marked by ``Problem 6.'' 289 | \begin{enumerate} [label=(\alph*)] 290 | \item In \verb|Agent.__init__| implement $V_\phi^\pi$, a neural network that predicts the expected return conditioned on a state. 291 | \item In \verb|Agent.compute_advantage|, use the neural network to predict the expected state-conditioned return (call \texttt{self.value\_net}), normalize it to match the statistics of the current batch of ``reward-to-go'', and subtract this value from the ``reward-to-go'' to yield an estimate of the advantage. This implements $$\left(\sum_{t'=t}^T \gamma^{t'-t} r(s_{it'}, a_{it'})\right) - V_\phi^\pi\left(s_{it}\right)$$. 292 | \item In \verb|Agent.update_parameters|, implement the loss function to train this network. ``Rescale'' the target values for the neural network baseline to have a mean of zero and a standard deviation of one. 293 | \end{enumerate} 294 | 295 | \section{More Complex Tasks} 296 | \textbf{Note:} The following tasks would take quite a bit of time to train. Please start early! 297 | 298 | \textbf{Problem 7: LunarLander} For this problem, you will use your policy gradient implementation to solve \verb|LunarLanderContinuous-v2|. 299 | Use an episode length of 1000. The purpose of this problem is to help you debug your baseline implementation. 
300 | Run the following command: 301 | \begin{lstlisting} 302 | python train_pg_f18.py LunarLanderContinuous-v2 -ep 1000 --discount 0.99 -n 100 -e 3 -l 2 -s 64 -b 40000 -lr 0.005 -rtg --nn_baseline --exp_name ll_b40000_r0.005 303 | \end{lstlisting} 304 | \textbf{Deliverables:} 305 | \begin{itemize} 306 | \item Plot a learning curve for the above command. You should expect to achieve an average return of around 180. 307 | \end{itemize} 308 | 309 | \textbf{Problem 8: HalfCheetah} For this problem, you will use your policy gradient implementation to solve \verb|HalfCheetah-v2|. 310 | Use an episode length of 150, which is shorter than the default of 1000 for HalfCheetah (which would speed up your training significantly). 311 | Search over batch sizes \texttt{b} $\in [10000,30000,50000]$ and learning rates \texttt{r} $\in [0.005, 0.01, 0.02]$ to replace \texttt{} and \texttt{} below: 312 | \begin{lstlisting} 313 | python train_pg_f18.py HalfCheetah-v2 -ep 150 --discount 0.9 -n 100 -e 3 -l 2 -s 32 -b -lr -rtg --nn_baseline --exp_name hc_b_r 314 | \end{lstlisting} 315 | \textbf{Deliverables:} 316 | \begin{itemize} 317 | \item How did the batch size and learning rate affect the performance? 318 | \item Once you've found suitable values of \texttt{b} and \texttt{r} among those choices (let's call them \texttt{b*} and \texttt{r*}), use \texttt{b*} and \texttt{r*} 319 | and run the following commands (remember to replace the terms in the angle brackets): 320 | \begin{lstlisting} 321 | python train_pg_f18.py HalfCheetah-v2 -ep 150 --discount 0.95 -n 100 -e 3 -l 2 -s 32 -b -lr --exp_name hc_b_r 322 | python train_pg_f18.py HalfCheetah-v2 -ep 150 --discount 0.95 -n 100 -e 3 -l 2 -s 32 -b -lr -rtg --exp_name hc_b_r 323 | python train_pg_f18.py HalfCheetah-v2 -ep 150 --discount 0.95 -n 100 -e 3 -l 2 -s 32 -b -lr --nn_baseline --exp_name hc_b_r 324 | python train_pg_f18.py HalfCheetah-v2 -ep 150 --discount 0.95 -n 100 -e 3 -l 2 -s 32 -b -lr -rtg --nn_baseline --exp_name hc_b_r 325 | \end{lstlisting} 326 | The run with reward-to-go and the baseline should achieve an average score close to 200. Provide a single plot plotting the learning curves for all four runs. 327 | \end{itemize} 328 | 329 | 330 | \section{Bonus!} 331 | 332 | Choose any (or all) of the following: 333 | \begin{itemize} 334 | \item A serious bottleneck in the learning, for more complex environments, is the sample collection time. In \verb|train_pg_f18.py|, we only collect trajectories in a single thread, but this process can be fully parallelized across threads to get a useful speedup. Implement the parallelization and report on the difference in training time. 335 | \item Implement GAE-$\lambda$ for advantage estimation.\footnote{\url{https://arxiv.org/abs/1506.02438}} Run experiments in a MuJoCo gym environment to explore whether this speeds up training. (\verb|Walker2d-v1| may be good for this.) 336 | \item In PG, we collect a batch of data, estimate a single gradient, and then discard the data and move on. Can we potentially accelerate PG by taking multiple gradient descent steps with the same batch of data? Explore this option and report on your results. Set up a fair comparison between single-step PG and multi-step PG on at least one MuJoCo gym environment. 337 | \end{itemize} 338 | 339 | \section{Submission} 340 | Your report should be a document containing 341 | \begin{enumerate} [label=(\alph*)] 342 | \item 343 | Your mathematical response (written in \LaTeX) for Problem 1. 
344 | \item All graphs requested in Problems 4, 5, 7, and 8. 345 | \item Answers to short explanation questions in section 5 and 7. 346 | \item All command-line expressions you used to run your experiments. 347 | \item (Optionally) Your bonus results (command-line expressions, graphs, and a few sentences that comment on your findings). 348 | \end{enumerate} 349 | 350 | Please also submit your modified \verb|train_pg_f18.py| file. If your code includes additional files, provide a zip file including your \verb|train_pg_f18.py| and all other files needed to run your code. Please include a \verb|README.md| with instructions needed to exactly duplicate your results (including command-line expressions). 351 | 352 | 353 | \end{document} 354 | -------------------------------------------------------------------------------- /hw2/logz.py: -------------------------------------------------------------------------------- 1 | import json 2 | 3 | """ 4 | 5 | Some simple logging functionality, inspired by rllab's logging. 6 | Assumes that each diagnostic gets logged each iteration 7 | 8 | Call logz.configure_output_dir() to start logging to a 9 | tab-separated-values file (some_folder_name/log.txt) 10 | 11 | To load the learning curves, you can do, for example 12 | 13 | A = np.genfromtxt('/tmp/expt_1468984536/log.txt',delimiter='\t',dtype=None, names=True) 14 | A['EpRewMean'] 15 | 16 | """ 17 | 18 | import os.path as osp, shutil, time, atexit, os, subprocess 19 | import pickle 20 | import torch 21 | 22 | color2num = dict( 23 | gray=30, 24 | red=31, 25 | green=32, 26 | yellow=33, 27 | blue=34, 28 | magenta=35, 29 | cyan=36, 30 | white=37, 31 | crimson=38 32 | ) 33 | 34 | def colorize(string, color, bold=False, highlight=False): 35 | attr = [] 36 | num = color2num[color] 37 | if highlight: num += 10 38 | attr.append(str(num)) 39 | if bold: attr.append('1') 40 | return '\x1b[%sm%s\x1b[0m' % (';'.join(attr), string) 41 | 42 | class G: 43 | output_dir = None 44 | output_file = None 45 | first_row = True 46 | log_headers = [] 47 | log_current_row = {} 48 | 49 | def configure_output_dir(d=None): 50 | """ 51 | Set output directory to d, or to /tmp/somerandomnumber if d is None 52 | """ 53 | G.output_dir = d or "/tmp/experiments/%i"%int(time.time()) 54 | assert not osp.exists(G.output_dir), "Log dir %s already exists! Delete it first or use a different dir"%G.output_dir 55 | os.makedirs(G.output_dir) 56 | G.output_file = open(osp.join(G.output_dir, "log.txt"), 'w') 57 | atexit.register(G.output_file.close) 58 | print(colorize("Logging data to %s"%G.output_file.name, 'green', bold=True)) 59 | 60 | def log_tabular(key, val): 61 | """ 62 | Log a value of some diagnostic 63 | Call this once for each diagnostic quantity, each iteration 64 | """ 65 | if G.first_row: 66 | G.log_headers.append(key) 67 | else: 68 | assert key in G.log_headers, "Trying to introduce a new key %s that you didn't include in the first iteration"%key 69 | assert key not in G.log_current_row, "You already set %s this iteration. 
Maybe you forgot to call dump_tabular()"%key 70 | G.log_current_row[key] = val 71 | 72 | def save_hyperparams(params): 73 | with open(osp.join(G.output_dir, "hyperparams.json"), 'w') as out: 74 | out.write(json.dumps(params, separators=(',\n','\t:\t'), sort_keys=True)) 75 | 76 | def save_pytorch_model(model): 77 | """ 78 | Saves the entire pytorch Module 79 | """ 80 | torch.save(model, osp.join(G.output_dir, "model.pkl")) 81 | 82 | 83 | def dump_tabular(): 84 | """ 85 | Write all of the diagnostics from the current iteration 86 | """ 87 | vals = [] 88 | key_lens = [len(key) for key in G.log_headers] 89 | max_key_len = max(15,max(key_lens)) 90 | keystr = '%'+'%d'%max_key_len 91 | fmt = "| " + keystr + "s | %15s |" 92 | n_slashes = 22 + max_key_len 93 | print("-"*n_slashes) 94 | for key in G.log_headers: 95 | val = G.log_current_row.get(key, "") 96 | if hasattr(val, "__float__"): valstr = "%8.3g"%val 97 | else: valstr = val 98 | print(fmt%(key, valstr)) 99 | vals.append(val) 100 | print("-"*n_slashes) 101 | if G.output_file is not None: 102 | if G.first_row: 103 | G.output_file.write("\t".join(G.log_headers)) 104 | G.output_file.write("\n") 105 | G.output_file.write("\t".join(map(str,vals))) 106 | G.output_file.write("\n") 107 | G.output_file.flush() 108 | G.log_current_row.clear() 109 | G.first_row=False 110 | -------------------------------------------------------------------------------- /hw2/lunar_lander.py: -------------------------------------------------------------------------------- 1 | import sys, math 2 | import numpy as np 3 | 4 | import Box2D 5 | from Box2D.b2 import (edgeShape, circleShape, fixtureDef, polygonShape, revoluteJointDef, contactListener) 6 | 7 | import gym 8 | from gym import spaces 9 | from gym.utils import seeding 10 | 11 | import pyglet 12 | 13 | from copy import copy 14 | 15 | # Rocket trajectory optimization is a classic topic in Optimal Control. 16 | # 17 | # According to Pontryagin's maximum principle it's optimal to fire engine full throttle or 18 | # turn it off. That's the reason this environment is OK to have discreet actions (engine on or off). 19 | # 20 | # Landing pad is always at coordinates (0,0). Coordinates are the first two numbers in state vector. 21 | # Reward for moving from the top of the screen to landing pad and zero speed is about 100..140 points. 22 | # If lander moves away from landing pad it loses reward back. Episode finishes if the lander crashes or 23 | # comes to rest, receiving additional -100 or +100 points. Each leg ground contact is +10. Firing main 24 | # engine is -0.3 points each frame. Solved is 200 points. 25 | # 26 | # Landing outside landing pad is possible. Fuel is infinite, so an agent can learn to fly and then land 27 | # on its first attempt. Please see source code for details. 28 | # 29 | # Too see heuristic landing, run: 30 | # 31 | # python gym/envs/box2d/lunar_lander.py 32 | # 33 | # To play yourself, run: 34 | # 35 | # python examples/agents/keyboard_agent.py LunarLander-v0 36 | # 37 | # Created by Oleg Klimov. Licensed on the same terms as the rest of OpenAI Gym. 38 | 39 | # Modified by Sid Reddy (sgr@berkeley.edu) on 8/14/18 40 | # 41 | # Changelog: 42 | # - different discretization scheme for actions 43 | # - different terminal rewards 44 | # - different observations 45 | # - randomized landing site 46 | # 47 | # You can create an env object using `gym.make('LunarLanderContinuous-v2')`, 48 | # and it will use the discrete action space specified in this file, even though 49 | # the env is called "Continuous". 
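# Discrete actions are mapped to (main engine, steering) throttle pairs by
# disc_to_cont() below: actions 0-2 leave the main engine off and actions 3-5 fire
# it, while action % 3 picks the steering throttle (-THROTTLE_MAG, 0, or +THROTTLE_MAG).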
50 | # 51 | # A good agent should be able to achieve >150 reward. 52 | 53 | MAX_NUM_STEPS = 1000 54 | 55 | N_OBS_DIM = 9 56 | N_ACT_DIM = 6 # num discrete actions 57 | 58 | FPS = 50 59 | SCALE = 30.0 # affects how fast-paced the game is, forces should be adjusted as well 60 | 61 | MAIN_ENGINE_POWER = 13.0 62 | SIDE_ENGINE_POWER = 0.6 63 | 64 | INITIAL_RANDOM = 1000.0 # Set 1500 to make game harder 65 | 66 | LANDER_POLY =[ 67 | (-14,+17), (-17,0), (-17,-10), 68 | (+17,-10), (+17,0), (+14,+17) 69 | ] 70 | LEG_AWAY = 20 71 | LEG_DOWN = 18 72 | LEG_W, LEG_H = 2, 8 73 | LEG_SPRING_TORQUE = 40 # 40 is too difficult for human players, 400 a bit easier 74 | 75 | SIDE_ENGINE_HEIGHT = 14.0 76 | SIDE_ENGINE_AWAY = 12.0 77 | 78 | VIEWPORT_W = 600 79 | VIEWPORT_H = 400 80 | 81 | THROTTLE_MAG = 0.75 # discretized 'on' value for thrusters 82 | NOOP = 1 # don't fire main engine, don't steer 83 | def disc_to_cont(action): # discrete action -> continuous action 84 | if type(action) == np.ndarray: 85 | return action 86 | # main engine 87 | if action < 3: 88 | m = -THROTTLE_MAG 89 | elif action < 6: 90 | m = THROTTLE_MAG 91 | else: 92 | raise ValueError 93 | # steering 94 | if action % 3 == 0: 95 | s = -THROTTLE_MAG 96 | elif action % 3 == 1: 97 | s = 0 98 | else: 99 | s = THROTTLE_MAG 100 | return np.array([m, s]) 101 | 102 | class ContactDetector(contactListener): 103 | def __init__(self, env): 104 | contactListener.__init__(self) 105 | self.env = env 106 | def BeginContact(self, contact): 107 | if self.env.lander==contact.fixtureA.body or self.env.lander==contact.fixtureB.body: 108 | self.env.game_over = True 109 | for i in range(2): 110 | if self.env.legs[i] in [contact.fixtureA.body, contact.fixtureB.body]: 111 | self.env.legs[i].ground_contact = True 112 | def EndContact(self, contact): 113 | for i in range(2): 114 | if self.env.legs[i] in [contact.fixtureA.body, contact.fixtureB.body]: 115 | self.env.legs[i].ground_contact = False 116 | 117 | class LunarLander(gym.Env): 118 | metadata = { 119 | 'render.modes': ['human', 'rgb_array'], 120 | 'video.frames_per_second' : FPS 121 | } 122 | 123 | continuous = False 124 | 125 | def __init__(self): 126 | self._seed() 127 | self.viewer = None 128 | 129 | self.world = Box2D.b2World() 130 | self.moon = None 131 | self.lander = None 132 | self.particles = [] 133 | 134 | self.prev_reward = None 135 | 136 | high = np.array([np.inf]*N_OBS_DIM) # useful range is -1 .. 
+1, but spikes can be higher 137 | self.observation_space = spaces.Box(-high, high) 138 | 139 | self.action_space = spaces.Discrete(N_ACT_DIM) 140 | 141 | self.curr_step = None 142 | 143 | self._reset() 144 | 145 | def _seed(self, seed=None): 146 | self.np_random, seed = seeding.np_random(seed) 147 | return [seed] 148 | 149 | def _destroy(self): 150 | if not self.moon: return 151 | self.world.contactListener = None 152 | self._clean_particles(True) 153 | self.world.DestroyBody(self.moon) 154 | self.moon = None 155 | self.world.DestroyBody(self.lander) 156 | self.lander = None 157 | self.world.DestroyBody(self.legs[0]) 158 | self.world.DestroyBody(self.legs[1]) 159 | 160 | def _reset(self): 161 | self.curr_step = 0 162 | 163 | self._destroy() 164 | self.world.contactListener_keepref = ContactDetector(self) 165 | self.world.contactListener = self.world.contactListener_keepref 166 | self.game_over = False 167 | self.prev_shaping = None 168 | 169 | W = VIEWPORT_W/SCALE 170 | H = VIEWPORT_H/SCALE 171 | 172 | # terrain 173 | CHUNKS = 11 174 | height = self.np_random.uniform(0, H/2, size=(CHUNKS+1,) ) 175 | chunk_x = [W/(CHUNKS-1)*i for i in range(CHUNKS)] 176 | 177 | # randomize helipad x-coord 178 | helipad_chunk = np.random.choice(range(1, CHUNKS-1)) 179 | 180 | self.helipad_x1 = chunk_x[helipad_chunk-1] 181 | self.helipad_x2 = chunk_x[helipad_chunk+1] 182 | self.helipad_y = H/4 183 | height[helipad_chunk-2] = self.helipad_y 184 | height[helipad_chunk-1] = self.helipad_y 185 | height[helipad_chunk+0] = self.helipad_y 186 | height[helipad_chunk+1] = self.helipad_y 187 | height[helipad_chunk+2] = self.helipad_y 188 | smooth_y = [0.33*(height[i-1] + height[i+0] + height[i+1]) for i in range(CHUNKS)] 189 | 190 | self.moon = self.world.CreateStaticBody( shapes=edgeShape(vertices=[(0, 0), (W, 0)]) ) 191 | self.sky_polys = [] 192 | for i in range(CHUNKS-1): 193 | p1 = (chunk_x[i], smooth_y[i]) 194 | p2 = (chunk_x[i+1], smooth_y[i+1]) 195 | self.moon.CreateEdgeFixture( 196 | vertices=[p1,p2], 197 | density=0, 198 | friction=0.1) 199 | self.sky_polys.append( [p1, p2, (p2[0],H), (p1[0],H)] ) 200 | 201 | self.moon.color1 = (0.0,0.0,0.0) 202 | self.moon.color2 = (0.0,0.0,0.0) 203 | 204 | initial_y = VIEWPORT_H/SCALE#*0.75 205 | self.lander = self.world.CreateDynamicBody( 206 | position = (VIEWPORT_W/SCALE/2, initial_y), 207 | angle=0.0, 208 | fixtures = fixtureDef( 209 | shape=polygonShape(vertices=[ (x/SCALE,y/SCALE) for x,y in LANDER_POLY ]), 210 | density=5.0, 211 | friction=0.1, 212 | categoryBits=0x0010, 213 | maskBits=0x001, # collide only with ground 214 | restitution=0.0) # 0.99 bouncy 215 | ) 216 | self.lander.color1 = (0.5,0.4,0.9) 217 | self.lander.color2 = (0.3,0.3,0.5) 218 | self.lander.ApplyForceToCenter( ( 219 | self.np_random.uniform(-INITIAL_RANDOM, INITIAL_RANDOM), 220 | self.np_random.uniform(-INITIAL_RANDOM, INITIAL_RANDOM) 221 | ), True) 222 | 223 | self.legs = [] 224 | for i in [-1,+1]: 225 | leg = self.world.CreateDynamicBody( 226 | position = (VIEWPORT_W/SCALE/2 - i*LEG_AWAY/SCALE, initial_y), 227 | angle = (i*0.05), 228 | fixtures = fixtureDef( 229 | shape=polygonShape(box=(LEG_W/SCALE, LEG_H/SCALE)), 230 | density=1.0, 231 | restitution=0.0, 232 | categoryBits=0x0020, 233 | maskBits=0x001) 234 | ) 235 | leg.ground_contact = False 236 | leg.color1 = (0.5,0.4,0.9) 237 | leg.color2 = (0.3,0.3,0.5) 238 | rjd = revoluteJointDef( 239 | bodyA=self.lander, 240 | bodyB=leg, 241 | localAnchorA=(0, 0), 242 | localAnchorB=(i*LEG_AWAY/SCALE, LEG_DOWN/SCALE), 243 | enableMotor=True, 244 | 
enableLimit=True, 245 | maxMotorTorque=LEG_SPRING_TORQUE, 246 | motorSpeed=+0.3*i # low enough not to jump back into the sky 247 | ) 248 | if i==-1: 249 | rjd.lowerAngle = +0.9 - 0.5 # Yes, the most esoteric numbers here, angles legs have freedom to travel within 250 | rjd.upperAngle = +0.9 251 | else: 252 | rjd.lowerAngle = -0.9 253 | rjd.upperAngle = -0.9 + 0.5 254 | leg.joint = self.world.CreateJoint(rjd) 255 | self.legs.append(leg) 256 | 257 | self.drawlist = [self.lander] + self.legs 258 | 259 | return self._step(NOOP)[0] 260 | 261 | def _create_particle(self, mass, x, y, ttl): 262 | p = self.world.CreateDynamicBody( 263 | position = (x,y), 264 | angle=0.0, 265 | fixtures = fixtureDef( 266 | shape=circleShape(radius=2/SCALE, pos=(0,0)), 267 | density=mass, 268 | friction=0.1, 269 | categoryBits=0x0100, 270 | maskBits=0x001, # collide only with ground 271 | restitution=0.3) 272 | ) 273 | p.ttl = ttl 274 | self.particles.append(p) 275 | self._clean_particles(False) 276 | return p 277 | 278 | def _clean_particles(self, all): 279 | while self.particles and (all or self.particles[0].ttl<0): 280 | self.world.DestroyBody(self.particles.pop(0)) 281 | 282 | def _step(self, action): 283 | #assert self.action_space.contains(action), "%r (%s) invalid " % (action,type(action)) 284 | if type(action) in [int, np.int64]: 285 | action = disc_to_cont(action) 286 | 287 | # Engines 288 | tip = (math.sin(self.lander.angle), math.cos(self.lander.angle)) 289 | side = (-tip[1], tip[0]); 290 | dispersion = [self.np_random.uniform(-1.0, +1.0) / SCALE for _ in range(2)] 291 | 292 | m_power = 0.0 293 | if (self.continuous and action[0] > 0.0) or (not self.continuous and action==2): 294 | # Main engine 295 | if self.continuous: 296 | m_power = (np.clip(action[0], 0.0,1.0) + 1.0)*0.5 # 0.5..1.0 297 | assert m_power>=0.5 and m_power <= 1.0 298 | else: 299 | m_power = 1.0 300 | ox = tip[0]*(4/SCALE + 2*dispersion[0]) + side[0]*dispersion[1] # 4 is move a bit downwards, +-2 for randomness 301 | oy = -tip[1]*(4/SCALE + 2*dispersion[0]) - side[1]*dispersion[1] 302 | impulse_pos = (self.lander.position[0] + ox, self.lander.position[1] + oy) 303 | p = self._create_particle(3.5, impulse_pos[0], impulse_pos[1], m_power) # particles are just a decoration, 3.5 is here to make particle speed adequate 304 | p.ApplyLinearImpulse( ( ox*MAIN_ENGINE_POWER*m_power, oy*MAIN_ENGINE_POWER*m_power), impulse_pos, True) 305 | self.lander.ApplyLinearImpulse( (-ox*MAIN_ENGINE_POWER*m_power, -oy*MAIN_ENGINE_POWER*m_power), impulse_pos, True) 306 | 307 | s_power = 0.0 308 | if (self.continuous and np.abs(action[1]) > 0.5) or (not self.continuous and action in [1,3]): 309 | # Orientation engines 310 | if self.continuous: 311 | direction = np.sign(action[1]) 312 | s_power = np.clip(np.abs(action[1]), 0.5,1.0) 313 | assert s_power>=0.5 and s_power <= 1.0 314 | else: 315 | direction = action-2 316 | s_power = 1.0 317 | ox = tip[0]*dispersion[0] + side[0]*(3*dispersion[1]+direction*SIDE_ENGINE_AWAY/SCALE) 318 | oy = -tip[1]*dispersion[0] - side[1]*(3*dispersion[1]+direction*SIDE_ENGINE_AWAY/SCALE) 319 | impulse_pos = (self.lander.position[0] + ox - tip[0]*17/SCALE, self.lander.position[1] + oy + tip[1]*SIDE_ENGINE_HEIGHT/SCALE) 320 | p = self._create_particle(0.7, impulse_pos[0], impulse_pos[1], s_power) 321 | p.ApplyLinearImpulse( ( ox*SIDE_ENGINE_POWER*s_power, oy*SIDE_ENGINE_POWER*s_power), impulse_pos, True) 322 | self.lander.ApplyLinearImpulse( (-ox*SIDE_ENGINE_POWER*s_power, -oy*SIDE_ENGINE_POWER*s_power), impulse_pos, True) 323 | 324 | # 
perform normal update 325 | self.world.Step(1.0/FPS, 6*30, 2*30) 326 | 327 | pos = self.lander.position 328 | vel = self.lander.linearVelocity 329 | helipad_x = (self.helipad_x1 + self.helipad_x2) / 2 330 | state = [ 331 | (pos.x - VIEWPORT_W/SCALE/2) / (VIEWPORT_W/SCALE/2), 332 | (pos.y - (self.helipad_y+LEG_DOWN/SCALE)) / (VIEWPORT_W/SCALE/2), 333 | vel.x*(VIEWPORT_W/SCALE/2)/FPS, 334 | vel.y*(VIEWPORT_H/SCALE/2)/FPS, 335 | self.lander.angle, 336 | 20.0*self.lander.angularVelocity/FPS, 337 | 1.0 if self.legs[0].ground_contact else 0.0, 338 | 1.0 if self.legs[1].ground_contact else 0.0, 339 | (helipad_x - VIEWPORT_W/SCALE/2) / (VIEWPORT_W/SCALE/2) 340 | ] 341 | assert len(state)==N_OBS_DIM 342 | 343 | self.curr_step += 1 344 | 345 | reward = 0 346 | shaping = 0 347 | dx = (pos.x - helipad_x) / (VIEWPORT_W/SCALE/2) 348 | shaping += -100*np.sqrt(state[2]*state[2] + state[3]*state[3]) - 100*abs(state[4]) 349 | shaping += -100*np.sqrt(dx*dx + state[1]*state[1]) + 10*state[6] + 10*state[7] 350 | if self.prev_shaping is not None: 351 | reward = shaping - self.prev_shaping 352 | self.prev_shaping = shaping 353 | 354 | reward -= m_power*0.30 # less fuel spent is better, about -30 for heurisic landing 355 | reward -= s_power*0.03 356 | 357 | oob = abs(state[0]) >= 1.0 358 | timeout = self.curr_step >= MAX_NUM_STEPS 359 | not_awake = not self.lander.awake 360 | 361 | at_site = pos.x >= self.helipad_x1 and pos.x <= self.helipad_x2 and state[1] <= 0 362 | grounded = self.legs[0].ground_contact and self.legs[1].ground_contact 363 | landed = at_site and grounded 364 | 365 | done = self.game_over or oob or not_awake or timeout or landed 366 | if done: 367 | if self.game_over or oob: 368 | reward = -100 369 | self.lander.color1 = (255,0,0) 370 | elif at_site: 371 | reward = +100 372 | self.lander.color1 = (0,255,0) 373 | elif timeout: 374 | self.lander.color1 = (255,0,0) 375 | info = {} 376 | 377 | return np.array(state), reward, done, info 378 | 379 | def _render(self, mode='human', close=False): 380 | if close: 381 | if self.viewer is not None: 382 | self.viewer.close() 383 | self.viewer = None 384 | return 385 | 386 | from gym.envs.classic_control import rendering 387 | if self.viewer is None: 388 | self.viewer = rendering.Viewer(VIEWPORT_W, VIEWPORT_H) 389 | self.viewer.set_bounds(0, VIEWPORT_W/SCALE, 0, VIEWPORT_H/SCALE) 390 | 391 | for obj in self.particles: 392 | obj.ttl -= 0.15 393 | obj.color1 = (max(0.2,0.2+obj.ttl), max(0.2,0.5*obj.ttl), max(0.2,0.5*obj.ttl)) 394 | obj.color2 = (max(0.2,0.2+obj.ttl), max(0.2,0.5*obj.ttl), max(0.2,0.5*obj.ttl)) 395 | 396 | self._clean_particles(False) 397 | 398 | for p in self.sky_polys: 399 | self.viewer.draw_polygon(p, color=(0,0,0)) 400 | 401 | for obj in self.particles + self.drawlist: 402 | for f in obj.fixtures: 403 | trans = f.body.transform 404 | if type(f.shape) is circleShape: 405 | t = rendering.Transform(translation=trans*f.shape.pos) 406 | self.viewer.draw_circle(f.shape.radius, 20, color=obj.color1).add_attr(t) 407 | self.viewer.draw_circle(f.shape.radius, 20, color=obj.color2, filled=False, linewidth=2).add_attr(t) 408 | else: 409 | path = [trans*v for v in f.shape.vertices] 410 | self.viewer.draw_polygon(path, color=obj.color1) 411 | path.append(path[0]) 412 | self.viewer.draw_polyline(path, color=obj.color2, linewidth=2) 413 | 414 | for x in [self.helipad_x1, self.helipad_x2]: 415 | flagy1 = self.helipad_y 416 | flagy2 = flagy1 + 50/SCALE 417 | self.viewer.draw_polyline( [(x, flagy1), (x, flagy2)], color=(1,1,1) ) 418 | 
self.viewer.draw_polygon( [(x, flagy2), (x, flagy2-10/SCALE), (x+25/SCALE, flagy2-5/SCALE)], color=(0.8,0.8,0) ) 419 | 420 | clock_prog = self.curr_step / MAX_NUM_STEPS 421 | self.viewer.draw_polyline( [(0, 0.05*VIEWPORT_H/SCALE), (clock_prog*VIEWPORT_W/SCALE, 0.05*VIEWPORT_H/SCALE)], color=(255,0,0), linewidth=5 ) 422 | 423 | return self.viewer.render(return_rgb_array = mode=='rgb_array') 424 | 425 | class LunarLanderContinuous(LunarLander): 426 | continuous = True 427 | 428 | def heuristic(env, s): 429 | # Heuristic for: 430 | # 1. Testing. 431 | # 2. Demonstration rollout. 432 | angle_targ = s[0]*0.5 + s[2]*1.0 # angle should point towards center (s[0] is horizontal coordinate, s[2] hor speed) 433 | if angle_targ > 0.4: angle_targ = 0.4 # more than 0.4 radians (22 degrees) is bad 434 | if angle_targ < -0.4: angle_targ = -0.4 435 | hover_targ = 0.55*np.abs(s[0]) # target y should be proporional to horizontal offset 436 | 437 | # PID controller: s[4] angle, s[5] angularSpeed 438 | angle_todo = (angle_targ - s[4])*0.5 - (s[5])*1.0 439 | #print("angle_targ=%0.2f, angle_todo=%0.2f" % (angle_targ, angle_todo)) 440 | 441 | # PID controller: s[1] vertical coordinate s[3] vertical speed 442 | hover_todo = (hover_targ - s[1])*0.5 - (s[3])*0.5 443 | #print("hover_targ=%0.2f, hover_todo=%0.2f" % (hover_targ, hover_todo)) 444 | 445 | if s[6] or s[7]: # legs have contact 446 | angle_todo = 0 447 | hover_todo = -(s[3])*0.5 # override to reduce fall speed, that's all we need after contact 448 | 449 | if env.continuous: 450 | a = np.array( [hover_todo*20 - 1, -angle_todo*20] ) 451 | a = np.clip(a, -1, +1) 452 | else: 453 | a = 0 454 | if hover_todo > np.abs(angle_todo) and hover_todo > 0.05: a = 2 455 | elif angle_todo < -0.05: a = 3 456 | elif angle_todo > +0.05: a = 1 457 | return a 458 | 459 | if __name__=="__main__": 460 | #env = LunarLander() 461 | env = LunarLanderContinuous() 462 | s = env.reset() 463 | total_reward = 0 464 | steps = 0 465 | while True: 466 | a = heuristic(env, s) 467 | s, r, done, info = env.step(a) 468 | env.render() 469 | total_reward += r 470 | if steps % 20 == 0 or done: 471 | print(["{:+0.2f}".format(x) for x in s]) 472 | print("step {} total_reward {:+0.2f}".format(steps, total_reward)) 473 | steps += 1 474 | if done: break 475 | -------------------------------------------------------------------------------- /hw2/plot.py: -------------------------------------------------------------------------------- 1 | import seaborn as sns 2 | import pandas as pd 3 | import matplotlib.pyplot as plt 4 | import json 5 | import os 6 | 7 | """ 8 | Using the plotter: 9 | 10 | Call it from the command line, and supply it with logdirs to experiments. 11 | Suppose you ran an experiment with name 'test', and you ran 'test' for 10 12 | random seeds. The runner code stored it in the directory structure 13 | 14 | data 15 | L test_EnvName_DateTime 16 | L 0 17 | L log.txt 18 | L params.json 19 | L 1 20 | L log.txt 21 | L params.json 22 | . 23 | . 24 | . 25 | L 9 26 | L log.txt 27 | L params.json 28 | 29 | To plot learning curves from the experiment, averaged over all random 30 | seeds, call 31 | 32 | python plot.py data/test_EnvName_DateTime --value AverageReturn 33 | 34 | and voila. To see a different statistics, change what you put in for 35 | the keyword --value. You can also enter /multiple/ values, and it will 36 | make all of them in order. 37 | 38 | 39 | Suppose you ran two experiments: 'test1' and 'test2'. 
In 'test2' you tried 40 | a different set of hyperparameters from 'test1', and now you would like 41 | to compare them -- see their learning curves side-by-side. Just call 42 | 43 | python plot.py data/test1 data/test2 44 | 45 | and it will plot them both! They will be given titles in the legend according 46 | to their exp_name parameters. If you want to use custom legend titles, use 47 | the --legend flag and then provide a title for each logdir. 48 | 49 | """ 50 | 51 | def plot_data(data, value="AverageReturn"): 52 | if isinstance(data, list): 53 | data = pd.concat(data, ignore_index=True) 54 | 55 | sns.set(style="darkgrid", font_scale=1.5) 56 | sns.tsplot(data=data, time="Iteration", value=value, unit="Unit", condition="Condition") 57 | plt.legend(loc='best').draggable() 58 | plt.show() 59 | 60 | 61 | def get_datasets(fpath, condition=None): 62 | unit = 0 63 | datasets = [] 64 | for root, dir, files in os.walk(fpath): 65 | if 'log.txt' in files: 66 | param_path = open(os.path.join(root,'hyperparams.json')) 67 | params = json.load(param_path) 68 | exp_name = params['exp_name'] 69 | 70 | log_path = os.path.join(root,'log.txt') 71 | experiment_data = pd.read_table(log_path) 72 | 73 | experiment_data.insert( 74 | len(experiment_data.columns), 75 | 'Unit', 76 | unit 77 | ) 78 | experiment_data.insert( 79 | len(experiment_data.columns), 80 | 'Condition', 81 | condition or exp_name 82 | ) 83 | 84 | datasets.append(experiment_data) 85 | unit += 1 86 | 87 | return datasets 88 | 89 | 90 | def main(): 91 | import argparse 92 | parser = argparse.ArgumentParser() 93 | parser.add_argument('logdir', nargs='*') 94 | parser.add_argument('--legend', nargs='*') 95 | parser.add_argument('--value', default='AverageReturn', nargs='*') 96 | args = parser.parse_args() 97 | 98 | use_legend = False 99 | if args.legend is not None: 100 | assert len(args.legend) == len(args.logdir), \ 101 | "Must give a legend title for each set of experiments." 
102 | use_legend = True 103 | 104 | data = [] 105 | if use_legend: 106 | for logdir, legend_title in zip(args.logdir, args.legend): 107 | data += get_datasets(logdir, legend_title) 108 | else: 109 | for logdir in args.logdir: 110 | data += get_datasets(logdir) 111 | 112 | if isinstance(args.value, list): 113 | values = args.value 114 | else: 115 | values = [args.value] 116 | for value in values: 117 | plot_data(data, value=value) 118 | 119 | if __name__ == "__main__": 120 | main() 121 | -------------------------------------------------------------------------------- /hw2/requirements.txt: -------------------------------------------------------------------------------- 1 | mujoco-py==1.50.1.56 2 | gym==0.10.5 3 | torch==0.4.0 4 | numpy==1.14.5 5 | seaborn 6 | Box2D==2.3.2 7 | -------------------------------------------------------------------------------- /hw2/train_pg_f18.py: -------------------------------------------------------------------------------- 1 | """ 2 | Original code from John Schulman for CS294 Deep Reinforcement Learning Spring 2017 3 | Adapted for CS294-112 Fall 2017 by Abhishek Gupta and Joshua Achiam 4 | Adapted for CS294-112 Fall 2018 by Michael Chang and Soroush Nasiriany 5 | Adapted for pytorch version by Ning Dai 6 | """ 7 | import numpy as np 8 | import torch 9 | import gym 10 | import logz 11 | import scipy.signal 12 | import os 13 | import time 14 | import inspect 15 | from torch.multiprocessing import Process 16 | from torch import nn, optim 17 | 18 | #============================================================================================# 19 | # Utilities 20 | #============================================================================================# 21 | 22 | #========================================================================================# 23 | # ----------PROBLEM 2---------- 24 | #========================================================================================# 25 | def build_mlp(input_size, output_size, n_layers, hidden_size, activation=nn.Tanh): 26 | """ 27 | Builds a feedforward neural network 28 | 29 | arguments: 30 | input_size: size of the input layer 31 | output_size: size of the output layer 32 | n_layers: number of hidden layers 33 | hidden_size: dimension of the hidden layers 34 | activation: activation of the hidden layers 35 | output_activation: activation of the output layer 36 | 37 | returns: 38 | an instance of nn.Sequential which contains the feedforward neural network 39 | 40 | Hint: use nn.Linear 41 | """ 42 | layers = [] 43 | # YOUR CODE HERE 44 | raise NotImplementedError 45 | return nn.Sequential(*layers).apply(weights_init) 46 | 47 | def weights_init(m): 48 | if hasattr(m, 'weight'): 49 | torch.nn.init.xavier_uniform_(m.weight) 50 | 51 | def pathlength(path): 52 | return len(path["reward"]) 53 | 54 | def setup_logger(logdir, locals_): 55 | # Configure output directory for logging 56 | logz.configure_output_dir(logdir) 57 | # Log experimental parameters 58 | args = inspect.getargspec(train_PG)[0] 59 | hyperparams = {k: locals_[k] if k in locals_ else None for k in args} 60 | logz.save_hyperparams(hyperparams) 61 | 62 | class PolicyNet(nn.Module): 63 | def __init__(self, neural_network_args): 64 | super(PolicyNet, self).__init__() 65 | self.ob_dim = neural_network_args['ob_dim'] 66 | self.ac_dim = neural_network_args['ac_dim'] 67 | self.discrete = neural_network_args['discrete'] 68 | self.hidden_size = neural_network_args['size'] 69 | self.n_layers = neural_network_args['n_layers'] 70 | 71 | 
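# A minimal, self-contained sketch of one way the build_mlp skeleton above could be
# filled in (illustrative only, not the official solution): n_layers hidden layers
# with the given activation, then a linear output layer, initialized via weights_init.
def build_mlp_sketch(input_size, output_size, n_layers, hidden_size, activation=nn.Tanh):
    layers = []
    in_size = input_size
    for _ in range(n_layers):
        layers.append(nn.Linear(in_size, hidden_size))    # hidden layer
        layers.append(activation())                       # e.g. nn.Tanh()
        in_size = hidden_size
    layers.append(nn.Linear(in_size, output_size))        # linear output, no activation
    return nn.Sequential(*layers).apply(weights_init)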
self.define_model_components() 72 | 73 | #========================================================================================# 74 | # ----------PROBLEM 2---------- 75 | #========================================================================================# 76 | def define_model_components(self): 77 | """ 78 | Define the parameters of policy network here. 79 | You can use any instance of nn.Module or nn.Parameter. 80 | 81 | Hint: use the 'build_mlp' function defined above 82 | In the discrete case, model should output logits of a categorical distribution 83 | over the actions 84 | In the continuous case, model should output a tuple (mean, log_std) of a Gaussian 85 | distribution over actions. log_std should just be a trainable 86 | variable, not a network output. 87 | """ 88 | # YOUR_CODE_HERE 89 | if self.discrete: 90 | raise NotImplementedError 91 | else: 92 | raise NotImplementedError 93 | 94 | #========================================================================================# 95 | # ----------PROBLEM 2---------- 96 | #========================================================================================# 97 | """ 98 | Notes on notation: 99 | 100 | Pytorch tensor variables have the prefix ts_, to distinguish them from the numpy array 101 | variables that are computed later in the function 102 | 103 | Prefixes and suffixes: 104 | ob - observation 105 | ac - action 106 | _no - this tensor should have shape (batch size, observation dim) 107 | _na - this tensor should have shape (batch size, action dim) 108 | _n - this tensor should have shape (batch size) 109 | 110 | Note: batch size is defined at runtime 111 | """ 112 | def forward(self, ts_ob_no): 113 | """ 114 | Define forward pass for policy network. 115 | 116 | arguments: 117 | ts_ob_no: (batch_size, self.ob_dim) 118 | 119 | returns: 120 | the parameters of the policy. 121 | 122 | if discrete, the parameters are the logits of a categorical distribution 123 | over the actions 124 | ts_logits_na: (batch_size, self.ac_dim) 125 | 126 | if continuous, the parameters are a tuple (mean, log_std) of a Gaussian 127 | distribution over actions. log_std should just be a trainable 128 | variable, not a network output. 
129 | ts_mean: (batch_size, self.ac_dim) 130 | st_logstd: (self.ac_dim,) 131 | 132 | Hint: use the components you defined in self.define_model_components 133 | """ 134 | raise NotImplementedError 135 | if self.discrete: 136 | # YOUR_CODE_HERE 137 | ts_logits_na = None 138 | return ts_logits_na 139 | else: 140 | # YOUR_CODE_HERE 141 | ts_mean = None 142 | ts_logstd = None 143 | return (ts_mean, ts_logstd) 144 | 145 | #============================================================================================# 146 | # Policy Gradient 147 | #============================================================================================# 148 | 149 | class Agent(object): 150 | def __init__(self, neural_network_args, sample_trajectory_args, estimate_return_args): 151 | super(Agent, self).__init__() 152 | self.ob_dim = neural_network_args['ob_dim'] 153 | self.ac_dim = neural_network_args['ac_dim'] 154 | self.discrete = neural_network_args['discrete'] 155 | self.hidden_size = neural_network_args['size'] 156 | self.n_layers = neural_network_args['n_layers'] 157 | self.learning_rate = neural_network_args['learning_rate'] 158 | 159 | self.animate = sample_trajectory_args['animate'] 160 | self.max_path_length = sample_trajectory_args['max_path_length'] 161 | self.min_timesteps_per_batch = sample_trajectory_args['min_timesteps_per_batch'] 162 | 163 | self.gamma = estimate_return_args['gamma'] 164 | self.reward_to_go = estimate_return_args['reward_to_go'] 165 | self.nn_baseline = estimate_return_args['nn_baseline'] 166 | self.normalize_advantages = estimate_return_args['normalize_advantages'] 167 | 168 | self.policy_net = PolicyNet(neural_network_args) 169 | params = list(self.policy_net.parameters()) 170 | 171 | #========================================================================================# 172 | # ----------PROBLEM 6---------- 173 | # Optional Baseline 174 | # 175 | # Define a neural network baseline. 176 | #========================================================================================# 177 | if self.nn_baseline: 178 | self.value_net = build_mlp(self.ob_dim, 1, self.n_layers, self.hidden_size) 179 | params += list(self.value_net.parameters()) 180 | 181 | self.optimizer = optim.Adam(params, lr=self.learning_rate) 182 | 183 | #========================================================================================# 184 | # ----------PROBLEM 2---------- 185 | #========================================================================================# 186 | def sample_action(self, ob_no): 187 | """ 188 | Build the method used for sampling action from the policy distribution 189 | 190 | arguments: 191 | ob_no: (batch_size, self.ob_dim) 192 | 193 | returns: 194 | sampled_ac: 195 | if discrete: (batch_size) 196 | if continuous: (batch_size, self.ac_dim) 197 | 198 | Hint: for the continuous case, use the reparameterization trick: 199 | The output from a Gaussian distribution with mean 'mu' and std 'sigma' is 200 | 201 | mu + sigma * z, z ~ N(0, I) 202 | 203 | This reduces the problem to just sampling z. (Hint: use torch.normal!) 
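# A sketch of how PolicyNet.define_model_components and PolicyNet.forward could be
# completed, following the hints above (illustrative only, not the official solution;
# it assumes build_mlp has been implemented, e.g. as in the earlier sketch).
class PolicyNetSketch(nn.Module):
    def __init__(self, ob_dim, ac_dim, n_layers, hidden_size, discrete):
        super(PolicyNetSketch, self).__init__()
        self.discrete = discrete
        # MLP mapping observations to logits (discrete) or action means (continuous)
        self.mlp = build_mlp(ob_dim, ac_dim, n_layers, hidden_size)
        if not discrete:
            # log_std is a free trainable parameter, not a network output
            self.ts_logstd = nn.Parameter(torch.zeros(ac_dim))

    def forward(self, ts_ob_no):
        if self.discrete:
            ts_logits_na = self.mlp(ts_ob_no)      # (batch_size, ac_dim)
            return ts_logits_na
        else:
            ts_mean = self.mlp(ts_ob_no)           # (batch_size, ac_dim)
            return ts_mean, self.ts_logstd         # ts_logstd: (ac_dim,)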
204 | """ 205 | ts_ob_no = torch.from_numpy(ob_no).float() 206 | 207 | raise NotImplementedError 208 | if self.discrete: 209 | ts_logits_na = self.policy_net(ts_ob_no) 210 | # YOUR_CODE_HERE 211 | ts_sampled_ac = None 212 | else: 213 | ts_mean, ts_logstd = self.policy_net(ts_ob_no) 214 | # YOUR_CODE_HERE 215 | ts_sampled_ac = None 216 | 217 | sampled_ac = ts_sampled_ac.numpy() 218 | return sampled_ac 219 | 220 | #========================================================================================# 221 | # ----------PROBLEM 2---------- 222 | #========================================================================================# 223 | def get_log_prob(self, policy_parameters, ts_ac_na): 224 | """ 225 | Build the method used for computing the log probability of a set of actions 226 | that were actually taken according to the policy 227 | 228 | arguments: 229 | policy_parameters 230 | if discrete: logits of a categorical distribution over actions 231 | ts_logits_na: (batch_size, self.ac_dim) 232 | if continuous: (mean, log_std) of a Gaussian distribution over actions 233 | ts_mean: (batch_size, self.ac_dim) 234 | ts_logstd: (self.ac_dim,) 235 | 236 | ts_ac_na: (batch_size, self.ac_dim) 237 | 238 | returns: 239 | ts_logprob_n: (batch_size) 240 | 241 | Hint: 242 | For the discrete case, use the log probability under a categorical distribution. 243 | For the continuous case, use the log probability under a multivariate gaussian. 244 | """ 245 | raise NotImplementedError 246 | if self.discrete: 247 | ts_logits_na = policy_parameters 248 | # YOUR_CODE_HERE 249 | ts_logprob_n = None 250 | else: 251 | ts_mean, ts_logstd = policy_parameters 252 | # YOUR_CODE_HERE 253 | ts_logprob_n = None 254 | return ts_logprob_n 255 | 256 | def sample_trajectories(self, itr, env): 257 | # Collect paths until we have enough timesteps 258 | timesteps_this_batch = 0 259 | paths = [] 260 | while True: 261 | animate_this_episode=(len(paths)==0 and (itr % 10 == 0) and self.animate) 262 | path = self.sample_trajectory(env, animate_this_episode) 263 | paths.append(path) 264 | timesteps_this_batch += pathlength(path) 265 | if timesteps_this_batch > self.min_timesteps_per_batch: 266 | break 267 | return paths, timesteps_this_batch 268 | 269 | def sample_trajectory(self, env, animate_this_episode): 270 | ob = env.reset() 271 | obs, acs, rewards = [], [], [] 272 | steps = 0 273 | while True: 274 | if animate_this_episode: 275 | env.render() 276 | time.sleep(0.1) 277 | obs.append(ob) 278 | #====================================================================================# 279 | # ----------PROBLEM 3---------- 280 | #====================================================================================# 281 | raise NotImplementedError 282 | ac = None # YOUR CODE HERE 283 | ac = ac[0] 284 | acs.append(ac) 285 | ob, rew, done, _ = env.step(ac) 286 | rewards.append(rew) 287 | steps += 1 288 | if done or steps > self.max_path_length: 289 | break 290 | path = {"observation" : np.array(obs, dtype=np.float32), 291 | "reward" : np.array(rewards, dtype=np.float32), 292 | "action" : np.array(acs, dtype=np.float32)} 293 | return path 294 | 295 | #====================================================================================# 296 | # ----------PROBLEM 3---------- 297 | #====================================================================================# 298 | def sum_of_rewards(self, re_n): 299 | """ 300 | Monte Carlo estimation of the Q function. 
301 | 302 | let sum_of_path_lengths be the sum of the lengths of the paths sampled from 303 | Agent.sample_trajectories 304 | let num_paths be the number of paths sampled from Agent.sample_trajectories 305 | 306 | arguments: 307 | re_n: length: num_paths. Each element in re_n is a numpy array 308 | containing the rewards for the particular path 309 | 310 | returns: 311 | q_n: shape: (sum_of_path_lengths). A single vector for the estimated q values 312 | whose length is the sum of the lengths of the paths 313 | 314 | ---------------------------------------------------------------------------------- 315 | 316 | Your code should construct numpy arrays for Q-values which will be used to compute 317 | advantages (which will in turn be fed to the placeholder you defined in 318 | Agent.define_placeholders). 319 | 320 | Recall that the expression for the policy gradient PG is 321 | 322 | PG = E_{tau} [sum_{t=0}^T grad log pi(a_t|s_t) * (Q_t - b_t )] 323 | 324 | where 325 | 326 | tau=(s_0, a_0, ...) is a trajectory, 327 | Q_t is the Q-value at time t, Q^{pi}(s_t, a_t), 328 | and b_t is a baseline which may depend on s_t. 329 | 330 | You will write code for two cases, controlled by the flag 'reward_to_go': 331 | 332 | Case 1: trajectory-based PG 333 | 334 | (reward_to_go = False) 335 | 336 | Instead of Q^{pi}(s_t, a_t), we use the total discounted reward summed over 337 | entire trajectory (regardless of which time step the Q-value should be for). 338 | 339 | For this case, the policy gradient estimator is 340 | 341 | E_{tau} [sum_{t=0}^T grad log pi(a_t|s_t) * Ret(tau)] 342 | 343 | where 344 | 345 | Ret(tau) = sum_{t'=0}^T gamma^t' r_{t'}. 346 | 347 | Thus, you should compute 348 | 349 | Q_t = Ret(tau) 350 | 351 | Case 2: reward-to-go PG 352 | 353 | (reward_to_go = True) 354 | 355 | Here, you estimate Q^{pi}(s_t, a_t) by the discounted sum of rewards starting 356 | from time step t. Thus, you should compute 357 | 358 | Q_t = sum_{t'=t}^T gamma^(t'-t) * r_{t'} 359 | 360 | 361 | Store the Q-values for all timesteps and all trajectories in a variable 'q_n', 362 | like the 'ob_no' and 'ac_na' above. 363 | """ 364 | # YOUR_CODE_HERE 365 | if self.reward_to_go: 366 | raise NotImplementedError 367 | else: 368 | raise NotImplementedError 369 | return q_n 370 | 371 | def compute_advantage(self, ob_no, q_n): 372 | """ 373 | Computes advantages by (possibly) subtracting a baseline from the estimated Q values 374 | 375 | let sum_of_path_lengths be the sum of the lengths of the paths sampled from 376 | Agent.sample_trajectories 377 | let num_paths be the number of paths sampled from Agent.sample_trajectories 378 | 379 | arguments: 380 | ob_no: shape: (sum_of_path_lengths, ob_dim) 381 | q_n: shape: (sum_of_path_lengths). A single vector for the estimated q values 382 | whose length is the sum of the lengths of the paths 383 | 384 | returns: 385 | adv_n: shape: (sum_of_path_lengths). A single vector for the estimated 386 | advantages whose length is the sum of the lengths of the paths 387 | """ 388 | #====================================================================================# 389 | # ----------PROBLEM 6---------- 390 | # Computing Baselines 391 | #====================================================================================# 392 | if self.nn_baseline: 393 | # If nn_baseline is True, use your neural network to predict reward-to-go 394 | # at each timestep for each trajectory, and save the result in a variable 'b_n' 395 | # like 'ob_no', 'ac_na', and 'q_n'. 
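# A sketch of the two Q-value estimators described in the sum_of_rewards docstring
# above (illustrative only, numpy-based; re_n is the list of per-path reward arrays
# and gamma the discount factor).
def sum_of_rewards_sketch(re_n, gamma, reward_to_go):
    q_n = []
    for re in re_n:
        T = len(re)
        if reward_to_go:
            # Case 2: Q_t = sum_{t'=t}^T gamma^(t'-t) * r_{t'}, computed by a reverse scan
            q = np.zeros(T)
            running = 0.0
            for t in reversed(range(T)):
                running = re[t] + gamma * running
                q[t] = running
        else:
            # Case 1: every timestep gets the full discounted return Ret(tau)
            ret = sum((gamma ** t) * re[t] for t in range(T))
            q = np.full(T, ret)
        q_n.append(q)
    return np.concatenate(q_n)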
396 | # 397 | # Hint #bl1: rescale the output from the nn_baseline to match the statistics 398 | # (mean and std) of the current batch of Q-values. (Goes with Hint 399 | # #bl2 in Agent.update_parameters. 400 | raise NotImplementedError 401 | # YOUR CODE HERE 402 | b_n = None 403 | adv_n = q_n - b_n 404 | else: 405 | adv_n = q_n.copy() 406 | return adv_n 407 | 408 | def estimate_return(self, ob_no, re_n): 409 | """ 410 | Estimates the returns over a set of trajectories. 411 | 412 | let sum_of_path_lengths be the sum of the lengths of the paths sampled from 413 | Agent.sample_trajectories 414 | let num_paths be the number of paths sampled from Agent.sample_trajectories 415 | 416 | arguments: 417 | ob_no: shape: (sum_of_path_lengths, ob_dim) 418 | re_n: length: num_paths. Each element in re_n is a numpy array 419 | containing the rewards for the particular path 420 | 421 | returns: 422 | q_n: shape: (sum_of_path_lengths). A single vector for the estimated q values 423 | whose length is the sum of the lengths of the paths 424 | adv_n: shape: (sum_of_path_lengths). A single vector for the estimated 425 | advantages whose length is the sum of the lengths of the paths 426 | """ 427 | q_n = self.sum_of_rewards(re_n) 428 | adv_n = self.compute_advantage(ob_no, q_n) 429 | #====================================================================================# 430 | # ----------PROBLEM 3---------- 431 | # Advantage Normalization 432 | #====================================================================================# 433 | if self.normalize_advantages: 434 | # On the next line, implement a trick which is known empirically to reduce variance 435 | # in policy gradient methods: normalize adv_n to have mean zero and std=1. 436 | raise NotImplementedError 437 | adv_n = None # YOUR_CODE_HERE 438 | return q_n, adv_n 439 | 440 | def update_parameters(self, ob_no, ac_na, q_n, adv_n): 441 | """ 442 | Update the parameters of the policy and (possibly) the neural network baseline, 443 | which is trained to approximate the value function. 444 | 445 | arguments: 446 | ob_no: shape: (sum_of_path_lengths, ob_dim) 447 | ac_na: shape: (sum_of_path_lengths). 448 | q_n: shape: (sum_of_path_lengths). A single vector for the estimated q values 449 | whose length is the sum of the lengths of the paths 450 | adv_n: shape: (sum_of_path_lengths). A single vector for the estimated 451 | advantages whose length is the sum of the lengths of the paths 452 | 453 | returns: 454 | nothing 455 | 456 | """ 457 | # convert numpy array to pytorch tensor 458 | ts_ob_no, ts_ac_na, ts_q_n, ts_adv_n = map(lambda x: torch.from_numpy(x), [ob_no, ac_na, q_n, adv_n]) 459 | 460 | # The policy takes in an observation and produces a distribution over the action space 461 | policy_parameters = self.policy_net(ts_ob_no) 462 | 463 | # We can compute the logprob of the actions that were actually taken by the policy 464 | # This is used in the loss function. 
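# A sketch of the baseline rescaling (Hint #bl1) and the advantage normalization
# described above (illustrative only; value_net stands for the nn_baseline MLP
# created in Agent.__init__, and eps guards against a zero standard deviation).
def compute_advantage_sketch(value_net, ob_no, q_n, use_baseline, eps=1e-8):
    if use_baseline:
        ts_ob_no = torch.from_numpy(ob_no).float()
        b_n = value_net(ts_ob_no).squeeze(-1).detach().numpy()   # raw baseline predictions
        # rescale predictions to match the mean/std of the current batch of Q-values
        b_n = (b_n - b_n.mean()) / (b_n.std() + eps)
        b_n = b_n * q_n.std() + q_n.mean()
        return q_n - b_n
    return q_n.copy()

def normalize_advantages_sketch(adv_n, eps=1e-8):
    # normalize to mean zero and std one -- an empirical variance-reduction trick
    return (adv_n - adv_n.mean()) / (adv_n.std() + eps)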
465 | ts_logprob_n = self.get_log_prob(policy_parameters, ts_ac_na) 466 | 467 | # clean the gradient for model parameters 468 | self.optimizer.zero_grad() 469 | 470 | #========================================================================================# 471 | # ----------PROBLEM 3---------- 472 | # Loss Function for Policy Gradient 473 | #========================================================================================# 474 | raise NotImplementedError 475 | loss = None # YOUR CODE HERE 476 | loss.backward() 477 | 478 | #====================================================================================# 479 | # ----------PROBLEM 6---------- 480 | # Optimizing Neural Network Baseline 481 | #====================================================================================# 482 | if self.nn_baseline: 483 | # If a neural network baseline is used, set up the targets and the output of the 484 | # baseline. 485 | # 486 | # Fit it to the current batch in order to use for the next iteration. Use the 487 | # self.value_net you defined earlier. 488 | # 489 | # Hint #bl2: Instead of trying to target raw Q-values directly, rescale the 490 | # targets to have mean zero and std=1. (Goes with Hint #bl1 in 491 | # Agent.compute_advantage.) 492 | 493 | # YOUR_CODE_HERE 494 | raise NotImplementedError 495 | baseline_prediction = None 496 | ts_target_n = None 497 | baseline_loss = None 498 | baseline_loss.backward() 499 | 500 | #====================================================================================# 501 | # ----------PROBLEM 3---------- 502 | # Performing the Policy Update 503 | #====================================================================================# 504 | 505 | # Call the optimizer to perform the policy gradient update based on the current batch 506 | # of rollouts. 507 | # 508 | # For debug purposes, you may wish to save the value of the loss function before 509 | # and after an update, and then log them below. 510 | 511 | # YOUR_CODE_HERE 512 | raise NotImplementedError 513 | 514 | def train_PG( 515 | exp_name, 516 | env_name, 517 | n_iter, 518 | gamma, 519 | min_timesteps_per_batch, 520 | max_path_length, 521 | learning_rate, 522 | reward_to_go, 523 | animate, 524 | logdir, 525 | normalize_advantages, 526 | nn_baseline, 527 | seed, 528 | n_layers, 529 | size): 530 | 531 | start = time.time() 532 | 533 | #========================================================================================# 534 | # Set Up Logger 535 | #========================================================================================# 536 | setup_logger(logdir, locals()) 537 | 538 | #========================================================================================# 539 | # Set Up Env 540 | #========================================================================================# 541 | 542 | # Make the gym environment 543 | env = gym.make(env_name) 544 | 545 | # Set random seeds 546 | torch.manual_seed(seed) 547 | np.random.seed(seed) 548 | env.seed(seed) 549 | 550 | # Maximum length for episodes 551 | max_path_length = max_path_length or env.spec.max_episode_steps 552 | 553 | # Is this env continuous, or self.discrete? 
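# A sketch of the pieces left blank in Agent.update_parameters above: the policy
# gradient surrogate loss, the rescaled baseline targets (Hint #bl2), and the final
# optimizer step (illustrative only, not the official solution; it assumes
# optimizer.zero_grad() was already called, as in the skeleton).
def update_parameters_sketch(optimizer, value_net, nn_baseline,
                             ts_logprob_n, ts_adv_n, ts_ob_no, ts_q_n):
    # policy gradient loss: descend on -E[log pi(a|s) * advantage]
    loss = -(ts_logprob_n * ts_adv_n.float()).mean()
    loss.backward()

    if nn_baseline:
        baseline_prediction = value_net(ts_ob_no.float()).squeeze(-1)
        # Hint #bl2: regress the baseline onto Q-values rescaled to mean zero, std one
        ts_target_n = (ts_q_n - ts_q_n.mean()) / (ts_q_n.std() + 1e-8)
        baseline_loss = nn.functional.mse_loss(baseline_prediction, ts_target_n.float())
        baseline_loss.backward()

    # one gradient step over the policy (and baseline) parameters
    optimizer.step()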
554 | discrete = isinstance(env.action_space, gym.spaces.Discrete) 555 | 556 | # Observation and action sizes 557 | ob_dim = env.observation_space.shape[0] 558 | ac_dim = env.action_space.n if discrete else env.action_space.shape[0] 559 | 560 | #========================================================================================# 561 | # Initialize Agent 562 | #========================================================================================# 563 | neural_network_args = { 564 | 'n_layers': n_layers, 565 | 'ob_dim': ob_dim, 566 | 'ac_dim': ac_dim, 567 | 'discrete': discrete, 568 | 'size': size, 569 | 'learning_rate': learning_rate, 570 | } 571 | 572 | sample_trajectory_args = { 573 | 'animate': animate, 574 | 'max_path_length': max_path_length, 575 | 'min_timesteps_per_batch': min_timesteps_per_batch, 576 | } 577 | 578 | estimate_return_args = { 579 | 'gamma': gamma, 580 | 'reward_to_go': reward_to_go, 581 | 'nn_baseline': nn_baseline, 582 | 'normalize_advantages': normalize_advantages, 583 | } 584 | 585 | agent = Agent(neural_network_args, sample_trajectory_args, estimate_return_args) 586 | 587 | #========================================================================================# 588 | # Training Loop 589 | #========================================================================================# 590 | 591 | total_timesteps = 0 592 | for itr in range(n_iter): 593 | print("********** Iteration %i ************"%itr) 594 | 595 | with torch.no_grad(): # use torch.no_grad to disable the gradient calculation 596 | paths, timesteps_this_batch = agent.sample_trajectories(itr, env) 597 | total_timesteps += timesteps_this_batch 598 | 599 | # Build arrays for observation, action for the policy gradient update by concatenating 600 | # across paths 601 | ob_no = np.concatenate([path["observation"] for path in paths]) 602 | ac_na = np.concatenate([path["action"] for path in paths]) 603 | re_n = [path["reward"] for path in paths] 604 | 605 | with torch.no_grad(): 606 | q_n, adv_n = agent.estimate_return(ob_no, re_n) 607 | 608 | agent.update_parameters(ob_no, ac_na, q_n, adv_n) 609 | 610 | # Log diagnostics 611 | returns = [path["reward"].sum() for path in paths] 612 | ep_lengths = [pathlength(path) for path in paths] 613 | logz.log_tabular("Time", time.time() - start) 614 | logz.log_tabular("Iteration", itr) 615 | logz.log_tabular("AverageReturn", np.mean(returns)) 616 | logz.log_tabular("StdReturn", np.std(returns)) 617 | logz.log_tabular("MaxReturn", np.max(returns)) 618 | logz.log_tabular("MinReturn", np.min(returns)) 619 | logz.log_tabular("EpLenMean", np.mean(ep_lengths)) 620 | logz.log_tabular("EpLenStd", np.std(ep_lengths)) 621 | logz.log_tabular("TimestepsThisBatch", timesteps_this_batch) 622 | logz.log_tabular("TimestepsSoFar", total_timesteps) 623 | logz.dump_tabular() 624 | logz.save_pytorch_model(agent) 625 | 626 | 627 | def main(): 628 | import argparse 629 | parser = argparse.ArgumentParser() 630 | parser.add_argument('env_name', type=str) 631 | parser.add_argument('--exp_name', type=str, default='vpg') 632 | parser.add_argument('--render', action='store_true') 633 | parser.add_argument('--discount', type=float, default=1.0) 634 | parser.add_argument('--n_iter', '-n', type=int, default=100) 635 | parser.add_argument('--batch_size', '-b', type=int, default=1000) 636 | parser.add_argument('--ep_len', '-ep', type=float, default=-1.) 
637 | parser.add_argument('--learning_rate', '-lr', type=float, default=5e-3) 638 | parser.add_argument('--reward_to_go', '-rtg', action='store_true') 639 | parser.add_argument('--dont_normalize_advantages', '-dna', action='store_true') 640 | parser.add_argument('--nn_baseline', '-bl', action='store_true') 641 | parser.add_argument('--seed', type=int, default=1) 642 | parser.add_argument('--n_experiments', '-e', type=int, default=1) 643 | parser.add_argument('--n_layers', '-l', type=int, default=2) 644 | parser.add_argument('--size', '-s', type=int, default=64) 645 | args = parser.parse_args() 646 | 647 | if not(os.path.exists('data')): 648 | os.makedirs('data') 649 | logdir = args.exp_name + '_' + args.env_name + '_' + time.strftime("%d-%m-%Y_%H-%M-%S") 650 | logdir = os.path.join('data', logdir) 651 | if not(os.path.exists(logdir)): 652 | os.makedirs(logdir) 653 | 654 | max_path_length = args.ep_len if args.ep_len > 0 else None 655 | 656 | processes = [] 657 | 658 | for e in range(args.n_experiments): 659 | seed = args.seed + 10*e 660 | print('Running experiment with seed %d'%seed) 661 | 662 | def train_func(): 663 | train_PG( 664 | exp_name=args.exp_name, 665 | env_name=args.env_name, 666 | n_iter=args.n_iter, 667 | gamma=args.discount, 668 | min_timesteps_per_batch=args.batch_size, 669 | max_path_length=max_path_length, 670 | learning_rate=args.learning_rate, 671 | reward_to_go=args.reward_to_go, 672 | animate=args.render, 673 | logdir=os.path.join(logdir,'%d'%seed), 674 | normalize_advantages=not(args.dont_normalize_advantages), 675 | nn_baseline=args.nn_baseline, 676 | seed=seed, 677 | n_layers=args.n_layers, 678 | size=args.size 679 | ) 680 | p = Process(target=train_func, args=tuple()) 681 | p.start() 682 | processes.append(p) 683 | # if you comment in the line below, then the loop will block 684 | # until this process finishes 685 | # p.join() 686 | 687 | for p in processes: 688 | p.join() 689 | 690 | if __name__ == "__main__": 691 | main() 692 | -------------------------------------------------------------------------------- /hw3/README.md: -------------------------------------------------------------------------------- 1 | # CS294-112 HW 3: Q-Learning 2 | 3 | Modifications: 4 | 5 | In general, we followed the code structure of the original version and modified the neural network part to pytorch. 6 | 7 | Because of the different between the static graphs framework and the dynamic graphs framework, we merged and added some code. For the instructions, you can generally follow the original PDF version, and we have adapted the comments in the code for pytorch to help you finish this assignment. 8 | 9 | ------ 10 | 11 | Dependencies: 12 | 13 | * Python **3.5** 14 | * Numpy version **1.14.5** 15 | * Pytorch version **0.4.0** 16 | * MuJoCo version **1.50** and mujoco-py **1.50.1.56** 17 | * OpenAI Gym version **0.10.5** 18 | * seaborn 19 | * Box2D==**2.3.2** 20 | * OpenCV 21 | * ffmpeg 22 | 23 | Before doing anything, first replace `gym/envs/box2d/lunar_lander.py` with the provided `lunar_lander.py` file. 24 | 25 | The only files that you need to look at are `dqn.py` and `train_ac_f18.py`, which you will implement. 26 | 27 | See the [HW3 PDF](./hw3_instructions.pdf) for further instructions. 28 | 29 | The starter code was based on an implementation of Q-learning for Atari generously provided by Szymon Sidor from OpenAI. 
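If you are not sure where your installed copy of `gym` lives, the snippet below prints the path of the `lunar_lander.py` file to replace (a standard pip install layout is assumed):

```python
import os, gym
print(os.path.join(os.path.dirname(gym.__file__), "envs", "box2d", "lunar_lander.py"))
```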
30 | -------------------------------------------------------------------------------- /hw3/atari_wrappers.py: -------------------------------------------------------------------------------- 1 | import cv2 2 | import numpy as np 3 | from collections import deque 4 | import gym 5 | from gym import spaces 6 | 7 | 8 | class NoopResetEnv(gym.Wrapper): 9 | def __init__(self, env=None, noop_max=30): 10 | """Sample initial states by taking random number of no-ops on reset. 11 | No-op is assumed to be action 0. 12 | """ 13 | super(NoopResetEnv, self).__init__(env) 14 | self.noop_max = noop_max 15 | assert env.unwrapped.get_action_meanings()[0] == 'NOOP' 16 | 17 | def _reset(self): 18 | """ Do no-op action for a number of steps in [1, noop_max].""" 19 | self.env.reset() 20 | noops = np.random.randint(1, self.noop_max + 1) 21 | for _ in range(noops): 22 | obs, _, _, _ = self.env.step(0) 23 | return obs 24 | 25 | class FireResetEnv(gym.Wrapper): 26 | def __init__(self, env=None): 27 | """Take action on reset for environments that are fixed until firing.""" 28 | super(FireResetEnv, self).__init__(env) 29 | assert env.unwrapped.get_action_meanings()[1] == 'FIRE' 30 | assert len(env.unwrapped.get_action_meanings()) >= 3 31 | 32 | def _reset(self): 33 | self.env.reset() 34 | obs, _, _, _ = self.env.step(1) 35 | obs, _, _, _ = self.env.step(2) 36 | return obs 37 | 38 | class EpisodicLifeEnv(gym.Wrapper): 39 | def __init__(self, env=None): 40 | """Make end-of-life == end-of-episode, but only reset on true game over. 41 | Done by DeepMind for the DQN and co. since it helps value estimation. 42 | """ 43 | super(EpisodicLifeEnv, self).__init__(env) 44 | self.lives = 0 45 | self.was_real_done = True 46 | self.was_real_reset = False 47 | 48 | def _step(self, action): 49 | obs, reward, done, info = self.env.step(action) 50 | self.was_real_done = done 51 | # check current lives, make loss of life terminal, 52 | # then update lives to handle bonus lives 53 | lives = self.env.unwrapped.ale.lives() 54 | if lives < self.lives and lives > 0: 55 | # for Qbert somtimes we stay in lives == 0 condtion for a few frames 56 | # so its important to keep lives > 0, so that we only reset once 57 | # the environment advertises done. 58 | done = True 59 | self.lives = lives 60 | return obs, reward, done, info 61 | 62 | def _reset(self): 63 | """Reset only when lives are exhausted. 64 | This way all states are still reachable even though lives are episodic, 65 | and the learner need not know about any of this behind-the-scenes. 
66 | """ 67 | if self.was_real_done: 68 | obs = self.env.reset() 69 | self.was_real_reset = True 70 | else: 71 | # no-op step to advance from terminal/lost life state 72 | obs, _, _, _ = self.env.step(0) 73 | self.was_real_reset = False 74 | self.lives = self.env.unwrapped.ale.lives() 75 | return obs 76 | 77 | class MaxAndSkipEnv(gym.Wrapper): 78 | def __init__(self, env=None, skip=4): 79 | """Return only every `skip`-th frame""" 80 | super(MaxAndSkipEnv, self).__init__(env) 81 | # most recent raw observations (for max pooling across time steps) 82 | self._obs_buffer = deque(maxlen=2) 83 | self._skip = skip 84 | 85 | def _step(self, action): 86 | total_reward = 0.0 87 | done = None 88 | for _ in range(self._skip): 89 | obs, reward, done, info = self.env.step(action) 90 | self._obs_buffer.append(obs) 91 | total_reward += reward 92 | if done: 93 | break 94 | 95 | max_frame = np.max(np.stack(self._obs_buffer), axis=0) 96 | 97 | return max_frame, total_reward, done, info 98 | 99 | def _reset(self): 100 | """Clear past frame buffer and init. to first obs. from inner env.""" 101 | self._obs_buffer.clear() 102 | obs = self.env.reset() 103 | self._obs_buffer.append(obs) 104 | return obs 105 | 106 | def _process_frame84(frame): 107 | img = np.reshape(frame, [210, 160, 3]).astype(np.float32) 108 | img = img[:, :, 0] * 0.299 + img[:, :, 1] * 0.587 + img[:, :, 2] * 0.114 109 | resized_screen = cv2.resize(img, (84, 110), interpolation=cv2.INTER_LINEAR) 110 | x_t = resized_screen[18:102, :] 111 | x_t = np.reshape(x_t, [84, 84, 1]) 112 | return x_t.astype(np.uint8) 113 | 114 | class ProcessFrame84(gym.Wrapper): 115 | def __init__(self, env=None): 116 | super(ProcessFrame84, self).__init__(env) 117 | self.observation_space = spaces.Box(low=0, high=255, shape=(84, 84, 1)) 118 | 119 | def _step(self, action): 120 | obs, reward, done, info = self.env.step(action) 121 | return _process_frame84(obs), reward, done, info 122 | 123 | def _reset(self): 124 | return _process_frame84(self.env.reset()) 125 | 126 | class ClippedRewardsWrapper(gym.Wrapper): 127 | def _step(self, action): 128 | obs, reward, done, info = self.env.step(action) 129 | return obs, np.sign(reward), done, info 130 | 131 | def wrap_deepmind_ram(env): 132 | env = EpisodicLifeEnv(env) 133 | env = NoopResetEnv(env, noop_max=30) 134 | env = MaxAndSkipEnv(env, skip=4) 135 | if 'FIRE' in env.unwrapped.get_action_meanings(): 136 | env = FireResetEnv(env) 137 | env = ClippedRewardsWrapper(env) 138 | return env 139 | 140 | def wrap_deepmind(env): 141 | assert 'NoFrameskip' in env.spec.id 142 | env = EpisodicLifeEnv(env) 143 | env = NoopResetEnv(env, noop_max=30) 144 | env = MaxAndSkipEnv(env, skip=4) 145 | if 'FIRE' in env.unwrapped.get_action_meanings(): 146 | env = FireResetEnv(env) 147 | env = ProcessFrame84(env) 148 | env = ClippedRewardsWrapper(env) 149 | return env 150 | -------------------------------------------------------------------------------- /hw3/dqn.py: -------------------------------------------------------------------------------- 1 | import time 2 | import pickle 3 | import sys 4 | import gym.spaces 5 | import logz 6 | import numpy as np 7 | import random 8 | import torch 9 | import torch.nn.functional as F 10 | from torch import nn, optim 11 | from collections import namedtuple 12 | from dqn_utils import LinearSchedule, ReplayBuffer, get_wrapper_by_name 13 | 14 | OptimizerSpec = namedtuple("OptimizerSpec", ["constructor", "kwargs", "lr_lambda"]) 15 | 16 | 17 | class QLearner(object): 18 | 19 | def __init__( 20 | self, 21 | env, 22 | 
q_func, 23 | optimizer_spec, 24 | exploration=LinearSchedule(1000000, 0.1), 25 | stopping_criterion=None, 26 | replay_buffer_size=1000000, 27 | batch_size=32, 28 | gamma=0.99, 29 | learning_starts=50000, 30 | learning_freq=4, 31 | frame_history_len=4, 32 | target_update_freq=10000, 33 | grad_norm_clipping=10, 34 | double_q=True, 35 | lander=False): 36 | """Run Deep Q-learning algorithm. 37 | 38 | You can specify your own convnet using q_func. 39 | 40 | All schedules are w.r.t. total number of steps taken in the environment. 41 | 42 | Parameters 43 | ---------- 44 | env: gym.Env 45 | gym environment to train on. 46 | q_func: function 47 | Model to use for computing the q function. It should accept the 48 | following named arguments: 49 | in_channels: int 50 | number of channels for the input 51 | num_actions: int 52 | number of actions 53 | optimizer_spec: OptimizerSpec 54 | Specifying the constructor and kwargs, as well as learning rate schedule 55 | for the optimizer 56 | exploration: rl_algs.deepq.utils.schedules.Schedule 57 | schedule for probability of chosing random action. 58 | stopping_criterion: (env, t) -> bool 59 | should return true when it's ok for the RL algorithm to stop. 60 | takes in env and the number of steps executed so far. 61 | replay_buffer_size: int 62 | How many memories to store in the replay buffer. 63 | batch_size: int 64 | How many transitions to sample each time experience is replayed. 65 | gamma: float 66 | Discount Factor 67 | learning_starts: int 68 | After how many environment steps to start replaying experiences 69 | learning_freq: int 70 | How many steps of environment to take between every experience replay 71 | frame_history_len: int 72 | How many past frames to include as input to the model. 73 | target_update_freq: int 74 | How many experience replay rounds (not steps!) to perform between 75 | each update to the target Q network 76 | grad_norm_clipping: float or None 77 | If not None gradients' norms are clipped to this value. 78 | double_q: bool 79 | If True, then use double Q-learning to compute target values. Otherwise, use vanilla DQN. 80 | https://papers.nips.cc/paper/3964-double-q-learning.pdf 81 | """ 82 | assert type(env.observation_space) == gym.spaces.Box 83 | assert type(env.action_space) == gym.spaces.Discrete 84 | 85 | self.target_update_freq = target_update_freq 86 | self.optimizer_spec = optimizer_spec 87 | self.batch_size = batch_size 88 | self.learning_freq = learning_freq 89 | self.learning_starts = learning_starts 90 | self.stopping_criterion = stopping_criterion 91 | self.env = env 92 | self.exploration = exploration 93 | self.gamma = gamma 94 | self.double_q = double_q 95 | self.device = torch.device('cuda' if torch.cuda.is_available else 'cpu') 96 | 97 | ############### 98 | # BUILD MODEL # 99 | ############### 100 | 101 | if len(self.env.observation_space.shape) == 1: 102 | # This means we are running on low-dimensional observations (e.g. 
RAM) 103 | in_features = self.env.observation_space.shape[0] 104 | else: 105 | img_h, img_w, img_c = self.env.observation_space.shape 106 | in_features = frame_history_len * img_c 107 | self.num_actions = self.env.action_space.n 108 | 109 | # define deep Q network and target Q network 110 | self.q_net = q_func(in_features, self.num_actions).to(self.device) 111 | self.target_q_net = q_func(in_features, self.num_actions).to(self.device) 112 | 113 | # construct optimization op (with gradient clipping) 114 | parameters = self.q_net.parameters() 115 | self.optimizer = self.optimizer_spec.constructor(parameters, lr=1, 116 | **self.optimizer_spec.kwargs) 117 | self.lr_scheduler = optim.lr_scheduler.LambdaLR(self.optimizer, self.optimizer_spec.lr_lambda) 118 | # clip_grad_norm_fn will be called before doing gradient decent 119 | self.clip_grad_norm_fn = lambda : nn.utils.clip_grad_norm_(parameters, max_norm=grad_norm_clipping) 120 | 121 | # update_target_fn will be called periodically to copy Q network to target Q network 122 | self.update_target_fn = lambda : self.target_q_net.load_state_dict(self.q_net.state_dict()) 123 | 124 | # construct the replay buffer 125 | self.replay_buffer = ReplayBuffer(replay_buffer_size, frame_history_len, lander=lander) 126 | self.replay_buffer_idx = None 127 | 128 | ############### 129 | # RUN ENV # 130 | ############### 131 | self.model_initialized = False 132 | self.num_param_updates = 0 133 | self.mean_episode_reward = -float('nan') 134 | self.best_mean_episode_reward = -float('inf') 135 | self.last_obs = self.env.reset() 136 | self.log_every_n_steps = 10000 137 | 138 | self.start_time = time.time() 139 | self.t = 0 140 | 141 | def calc_loss(self, obs, ac, rw, nxobs, done): 142 | """ 143 | Calculate the loss for a batch of transitions. 144 | 145 | Here, you should fill in your own code to compute the Bellman error. This requires 146 | evaluating the current and next Q-values and constructing the corresponding error. 147 | 148 | arguments: 149 | ob: The observation for current step 150 | ac: The corresponding action for current step 151 | rw: The reward for each timestep 152 | nxob: The observation after taking one step forward 153 | done: The mask for terminal state. This value is 1 if the next state corresponds to 154 | the end of an episode, in which case there is no Q-value at the next state; 155 | at the end of an episode, only the current state reward contributes to the target, 156 | not the next state Q-value (i.e. target is just rew_t_ph, not rew_t_ph + gamma * q_tp1) 157 | 158 | inputs are generated from self.replay_buffer.sample, you can refer the code in dqn_utils.py 159 | for more details 160 | 161 | returns: 162 | a scalar tensor represent the loss 163 | 164 | Hint: use smooth_l1_loss (a.k.a huber_loss) instead of mean squared error. 165 | use self.double_q to switch between double DQN and vanilla DQN. 166 | """ 167 | 168 | # YOUR CODE HERE 169 | 170 | 171 | def stopping_criterion_met(self): 172 | return self.stopping_criterion is not None and self.stopping_criterion(self.env, self.t) 173 | 174 | def step_env(self): 175 | ### 2. Step the env and store the transition 176 | # At this point, "self.last_obs" contains the latest observation that was 177 | # recorded from the simulator. Here, your code needs to store this 178 | # observation and its outcome (reward, next observation, etc.) into 179 | # the replay buffer while stepping the simulator forward one step. 
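# A sketch of one way calc_loss above could be implemented (illustrative only, not
# the official solution). It follows the hints: Huber (smooth L1) loss on the Bellman
# error, with double_q switching between double DQN and vanilla DQN. For Atari, image
# batches would additionally need permuting to NCHW and scaling, which is omitted here.
def calc_loss_sketch(q_net, target_q_net, obs, ac, rw, nxobs, done, gamma, double_q, device):
    obs   = torch.from_numpy(obs).float().to(device)
    ac    = torch.from_numpy(ac).long().to(device)
    rw    = torch.from_numpy(rw).float().to(device)
    nxobs = torch.from_numpy(nxobs).float().to(device)
    done  = torch.from_numpy(done).float().to(device)

    # Q(s, a) for the actions that were actually taken
    q_values = q_net(obs).gather(1, ac.unsqueeze(1)).squeeze(1)

    with torch.no_grad():
        if double_q:
            # double DQN: online net selects the next action, target net evaluates it
            next_ac = q_net(nxobs).max(dim=1)[1].unsqueeze(1)
            next_q = target_q_net(nxobs).gather(1, next_ac).squeeze(1)
        else:
            # vanilla DQN: max over the target network's Q-values
            next_q = target_q_net(nxobs).max(dim=1)[0]
        # no bootstrapping past terminal states (done == 1)
        target = rw + gamma * (1.0 - done) * next_q

    return F.smooth_l1_loss(q_values, target)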
180 | # At the end of this block of code, the simulator should have been 181 | # advanced one step, and the replay buffer should contain one more 182 | # transition. 183 | # Specifically, self.last_obs must point to the new latest observation. 184 | # Useful functions you'll need to call: 185 | # obs, reward, done, info = env.step(action) 186 | # this steps the environment forward one step 187 | # obs = env.reset() 188 | # this resets the environment if you reached an episode boundary. 189 | # Don't forget to call env.reset() to get a new observation if done 190 | # is true!! 191 | # Note that you cannot use "self.last_obs" directly as input 192 | # into your network, since it needs to be processed to include context 193 | # from previous frames. You should check out the replay buffer 194 | # implementation in dqn_utils.py to see what functionality the replay 195 | # buffer exposes. The replay buffer has a function called 196 | # encode_recent_observation that will take the latest observation 197 | # that you pushed into the buffer and compute the corresponding 198 | # input that should be given to a Q network by appending some 199 | # previous frames. 200 | # Don't forget to include epsilon greedy exploration! 201 | # And remember that the first time you enter this loop, the model 202 | # may not yet have been initialized (but of course, the first step 203 | # might as well be random, since you haven't trained your net...) 204 | 205 | ##### 206 | 207 | # YOUR CODE HERE 208 | 209 | 210 | def update_model(self): 211 | ### 3. Perform experience replay and train the network. 212 | # note that this is only done if the replay buffer contains enough samples 213 | # for us to learn something useful -- until then, the model will not be 214 | # initialized and random actions should be taken 215 | self.lr_scheduler.step() 216 | 217 | if (self.t > self.learning_starts and \ 218 | self.t % self.learning_freq == 0 and \ 219 | self.replay_buffer.can_sample(self.batch_size)): 220 | 221 | # Here, you should perform training. Training consists of four steps: 222 | # 3.a: use the replay buffer to sample a batch of transitions (see the 223 | # replay buffer code for function definition, each batch that you sample 224 | # should consist of current observations, current actions, rewards, 225 | # next observations, and done indicator). 226 | # 3.b: set the self.model_initialized to True. Because the newwork in starting 227 | # to train, and you will use it to take action in self.step_env. 228 | # 3.c: train the model. To do this, you'll need to use the self.optimizer and 229 | # self.calc_loss that were created earlier: self.calc_loss is what you 230 | # created to compute the total Bellman error in a batch, and self.optimizer 231 | # will actually perform a gradient step and update the network parameters 232 | # to reduce the loss. 233 | # Before your optimizer take step, don`t forget to call self.clip_grad_norm_fn 234 | # to perform gradient clipping. 
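# A sketch of the environment step (epsilon-greedy) and the training step described
# in the comments above and in 3.d just below (illustrative only, not the official
# solution). It assumes the replay buffer exposes store_frame / store_effect as in the
# upstream Berkeley starter code; for Atari, encode_recent_observation returns a raw
# uint8 frame stack that would still need preprocessing before the forward pass.
def step_env_sketch(alg):
    idx = alg.replay_buffer.store_frame(alg.last_obs)           # assumed buffer API
    recent_obs = alg.replay_buffer.encode_recent_observation()

    eps = alg.exploration.value(alg.t)
    if (not alg.model_initialized) or random.random() < eps:
        action = np.random.randint(alg.num_actions)             # explore / untrained net
    else:
        obs_t = torch.from_numpy(recent_obs[None]).float().to(alg.device)
        with torch.no_grad():
            action = int(alg.q_net(obs_t).max(dim=1)[1].item()) # greedy action

    obs, reward, done, _ = alg.env.step(action)
    alg.replay_buffer.store_effect(idx, action, reward, done)   # assumed buffer API
    alg.last_obs = alg.env.reset() if done else obs

def update_model_step_sketch(alg):
    # 3.a sample a batch of transitions
    obs, ac, rw, nxobs, done = alg.replay_buffer.sample(alg.batch_size)
    # 3.b the network is now being trained, so step_env may start using it
    alg.model_initialized = True
    # 3.c Bellman-error loss, gradient clipping, one optimizer step
    alg.optimizer.zero_grad()
    loss = alg.calc_loss(obs, ac, rw, nxobs, done)
    loss.backward()
    alg.clip_grad_norm_fn()
    alg.optimizer.step()
    # 3.d periodically copy the online network into the target network
    if alg.num_param_updates % alg.target_update_freq == 0:
        alg.update_target_fn()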
235 | # 3.d: periodically update the target network by calling self.update_target_fn 236 | # you should update every target_update_freq steps, and you may find the 237 | # variable self.num_param_updates useful for this (it was initialized to 0) 238 | ##### 239 | 240 | # YOUR CODE HERE 241 | 242 | 243 | self.num_param_updates += 1 244 | 245 | self.t += 1 246 | 247 | def log_progress(self): 248 | episode_rewards = get_wrapper_by_name(self.env, "Monitor").get_episode_rewards() 249 | 250 | if len(episode_rewards) > 0: 251 | self.mean_episode_reward = np.mean(episode_rewards[-100:]) 252 | 253 | if len(episode_rewards) > 100: 254 | self.best_mean_episode_reward = max(self.best_mean_episode_reward, self.mean_episode_reward) 255 | 256 | if self.t % self.log_every_n_steps == 0 and self.model_initialized: 257 | logz.log_tabular("TimeStep", self.t) 258 | logz.log_tabular("MeanReturn", self.mean_episode_reward) 259 | logz.log_tabular("BestMeanReturn", max(self.best_mean_episode_reward, self.mean_episode_reward)) 260 | logz.log_tabular("Episodes", len(episode_rewards)) 261 | logz.log_tabular("Exploration", self.exploration.value(self.t)) 262 | logz.log_tabular("LearningRate", self.optimizer_spec.lr_lambda(self.t)) 263 | logz.log_tabular("Time", (time.time() - self.start_time) / 60.) 264 | logz.dump_tabular() 265 | logz.save_pytorch_model(self.q_net) 266 | 267 | def learn(*args, **kwargs): 268 | alg = QLearner(*args, **kwargs) 269 | while not alg.stopping_criterion_met(): 270 | alg.step_env() 271 | # at this point, the environment should have been advanced one step (and 272 | # reset if done was true), and self.last_obs should point to the new latest 273 | # observation 274 | alg.update_model() 275 | alg.log_progress() 276 | 277 | -------------------------------------------------------------------------------- /hw3/dqn_utils.py: -------------------------------------------------------------------------------- 1 | """This file includes a collection of utility functions that are useful for 2 | implementing DQN.""" 3 | import gym 4 | import numpy as np 5 | import random 6 | 7 | def sample_n_unique(sampling_f, n): 8 | """Helper function. Given a function `sampling_f` that returns 9 | comparable objects, sample n such unique objects. 10 | """ 11 | res = [] 12 | while len(res) < n: 13 | candidate = sampling_f() 14 | if candidate not in res: 15 | res.append(candidate) 16 | return res 17 | 18 | class Schedule(object): 19 | def value(self, t): 20 | """Value of the schedule at time t""" 21 | raise NotImplementedError() 22 | 23 | class ConstantSchedule(object): 24 | def __init__(self, value): 25 | """Value remains constant over time. 26 | Parameters 27 | ---------- 28 | value: float 29 | Constant value of the schedule 30 | """ 31 | self._v = value 32 | 33 | def value(self, t): 34 | """See Schedule.value""" 35 | return self._v 36 | 37 | def linear_interpolation(l, r, alpha): 38 | return l + alpha * (r - l) 39 | 40 | class PiecewiseSchedule(object): 41 | def __init__(self, endpoints, interpolation=linear_interpolation, outside_value=None): 42 | """Piecewise schedule. 43 | endpoints: [(int, int)] 44 | list of pairs `(time, value)` meanining that schedule should output 45 | `value` when `t==time`. All the values for time must be sorted in 46 | an increasing order. When t is between two times, e.g. 
`(time_a, value_a)` 47 | and `(time_b, value_b)`, such that `time_a <= t < time_b` then value outputs 48 | `interpolation(value_a, value_b, alpha)` where alpha is a fraction of 49 | time passed between `time_a` and `time_b` for time `t`. 50 | interpolation: lambda float, float, float: float 51 | a function that takes value to the left and to the right of t according 52 | to the `endpoints`. Alpha is the fraction of distance from left endpoint to 53 | right endpoint that t has covered. See linear_interpolation for example. 54 | outside_value: float 55 | if the value is requested outside of all the intervals sepecified in 56 | `endpoints` this value is returned. If None then AssertionError is 57 | raised when outside value is requested. 58 | """ 59 | idxes = [e[0] for e in endpoints] 60 | assert idxes == sorted(idxes) 61 | self._interpolation = interpolation 62 | self._outside_value = outside_value 63 | self._endpoints = endpoints 64 | 65 | def value(self, t): 66 | """See Schedule.value""" 67 | for (l_t, l), (r_t, r) in zip(self._endpoints[:-1], self._endpoints[1:]): 68 | if l_t <= t and t < r_t: 69 | alpha = float(t - l_t) / (r_t - l_t) 70 | return self._interpolation(l, r, alpha) 71 | 72 | # t does not belong to any of the pieces, so doom. 73 | assert self._outside_value is not None 74 | return self._outside_value 75 | 76 | class LinearSchedule(object): 77 | def __init__(self, schedule_timesteps, final_p, initial_p=1.0): 78 | """Linear interpolation between initial_p and final_p over 79 | schedule_timesteps. After this many timesteps pass final_p is 80 | returned. 81 | Parameters 82 | ---------- 83 | schedule_timesteps: int 84 | Number of timesteps for which to linearly anneal initial_p 85 | to final_p 86 | initial_p: float 87 | initial output value 88 | final_p: float 89 | final output value 90 | """ 91 | self.schedule_timesteps = schedule_timesteps 92 | self.final_p = final_p 93 | self.initial_p = initial_p 94 | 95 | def value(self, t): 96 | """See Schedule.value""" 97 | fraction = min(float(t) / self.schedule_timesteps, 1.0) 98 | return self.initial_p + fraction * (self.final_p - self.initial_p) 99 | 100 | 101 | def get_wrapper_by_name(env, classname): 102 | currentenv = env 103 | while True: 104 | if classname in currentenv.__class__.__name__: 105 | return currentenv 106 | elif isinstance(env, gym.Wrapper): 107 | currentenv = currentenv.env 108 | else: 109 | raise ValueError("Couldn't find wrapper named %s"%classname) 110 | 111 | class ReplayBuffer(object): 112 | def __init__(self, size, frame_history_len, lander=False): 113 | """This is a memory efficient implementation of the replay buffer. 114 | 115 | The sepecific memory optimizations use here are: 116 | - only store each frame once rather than k times 117 | even if every observation normally consists of k last frames 118 | - store frames as np.uint8 (actually it is most time-performance 119 | to cast them back to float32 on GPU to minimize memory transfer 120 | time) 121 | - store frame_t and frame_(t+1) in the same buffer. 122 | 123 | For the tipical use case in Atari Deep RL buffer with 1M frames the total 124 | memory footprint of this buffer is 10^6 * 84 * 84 bytes ~= 7 gigabytes 125 | 126 | Warning! Assumes that returning frame of zeros at the beginning 127 | of the episode, when there is less frames than `frame_history_len`, 128 | is acceptable. 129 | 130 | Parameters 131 | ---------- 132 | size: int 133 | Max number of transitions to store in the buffer. When the buffer 134 | overflows the old memories are dropped. 
135 | frame_history_len: int 136 | Number of memories to be retried for each observation. 137 | """ 138 | self.lander = lander 139 | 140 | self.size = size 141 | self.frame_history_len = frame_history_len 142 | 143 | self.next_idx = 0 144 | self.num_in_buffer = 0 145 | 146 | self.obs = None 147 | self.action = None 148 | self.reward = None 149 | self.done = None 150 | 151 | def can_sample(self, batch_size): 152 | """Returns true if `batch_size` different transitions can be sampled from the buffer.""" 153 | return batch_size + 1 <= self.num_in_buffer 154 | 155 | def _encode_sample(self, idxes): 156 | obs_batch = np.concatenate([self._encode_observation(idx)[None] for idx in idxes], 0) 157 | act_batch = self.action[idxes] 158 | rew_batch = self.reward[idxes] 159 | next_obs_batch = np.concatenate([self._encode_observation(idx + 1)[None] for idx in idxes], 0) 160 | done_mask = np.array([1.0 if self.done[idx] else 0.0 for idx in idxes], dtype=np.float32) 161 | 162 | return obs_batch, act_batch, rew_batch, next_obs_batch, done_mask 163 | 164 | 165 | def sample(self, batch_size): 166 | """Sample `batch_size` different transitions. 167 | 168 | i-th sample transition is the following: 169 | 170 | when observing `obs_batch[i]`, action `act_batch[i]` was taken, 171 | after which reward `rew_batch[i]` was received and subsequent 172 | observation next_obs_batch[i] was observed, unless the epsiode 173 | was done which is represented by `done_mask[i]` which is equal 174 | to 1 if episode has ended as a result of that action. 175 | 176 | Parameters 177 | ---------- 178 | batch_size: int 179 | How many transitions to sample. 180 | 181 | Returns 182 | ------- 183 | obs_batch: np.array 184 | Array of shape 185 | (batch_size, img_h, img_w, img_c * frame_history_len) 186 | and dtype np.uint8 187 | act_batch: np.array 188 | Array of shape (batch_size,) and dtype np.int32 189 | rew_batch: np.array 190 | Array of shape (batch_size,) and dtype np.float32 191 | next_obs_batch: np.array 192 | Array of shape 193 | (batch_size, img_h, img_w, img_c * frame_history_len) 194 | and dtype np.uint8 195 | done_mask: np.array 196 | Array of shape (batch_size,) and dtype np.float32 197 | """ 198 | assert self.can_sample(batch_size) 199 | idxes = sample_n_unique(lambda: random.randint(0, self.num_in_buffer - 2), batch_size) 200 | return self._encode_sample(idxes) 201 | 202 | def encode_recent_observation(self): 203 | """Return the most recent `frame_history_len` frames. 204 | 205 | Returns 206 | ------- 207 | observation: np.array 208 | Array of shape (img_h, img_w, img_c * frame_history_len) 209 | and dtype np.uint8, where observation[:, :, i*img_c:(i+1)*img_c] 210 | encodes frame at time `t - frame_history_len + i` 211 | """ 212 | assert self.num_in_buffer > 0 213 | return self._encode_observation((self.next_idx - 1) % self.size) 214 | 215 | def _encode_observation(self, idx): 216 | end_idx = idx + 1 # make noninclusive 217 | start_idx = end_idx - self.frame_history_len 218 | # this checks if we are using low-dimensional observations, such as RAM 219 | # state, in which case we just directly return the latest RAM. 
220 | if len(self.obs.shape) == 2: 221 | return self.obs[end_idx-1] 222 | # if there weren't enough frames ever in the buffer for context 223 | if start_idx < 0 and self.num_in_buffer != self.size: 224 | start_idx = 0 225 | for idx in range(start_idx, end_idx - 1): 226 | if self.done[idx % self.size]: 227 | start_idx = idx + 1 228 | missing_context = self.frame_history_len - (end_idx - start_idx) 229 | # if zero padding is needed for missing context 230 | # or we are on the boundry of the buffer 231 | if start_idx < 0 or missing_context > 0: 232 | frames = [np.zeros_like(self.obs[0]) for _ in range(missing_context)] 233 | for idx in range(start_idx, end_idx): 234 | frames.append(self.obs[idx % self.size]) 235 | return np.concatenate(frames, 2) 236 | else: 237 | # this optimization has potential to saves about 30% compute time \o/ 238 | img_h, img_w = self.obs.shape[1], self.obs.shape[2] 239 | return self.obs[start_idx:end_idx].transpose(1, 2, 0, 3).reshape(img_h, img_w, -1) 240 | 241 | def store_frame(self, frame): 242 | """Store a single frame in the buffer at the next available index, overwriting 243 | old frames if necessary. 244 | 245 | Parameters 246 | ---------- 247 | frame: np.array 248 | Array of shape (img_h, img_w, img_c) and dtype np.uint8 249 | the frame to be stored 250 | 251 | Returns 252 | ------- 253 | idx: int 254 | Index at which the frame is stored. To be used for `store_effect` later. 255 | """ 256 | if self.obs is None: 257 | self.obs = np.empty([self.size] + list(frame.shape), dtype=np.float32 if self.lander else np.uint8) 258 | self.action = np.empty([self.size], dtype=np.int32) 259 | self.reward = np.empty([self.size], dtype=np.float32) 260 | self.done = np.empty([self.size], dtype=np.bool) 261 | self.obs[self.next_idx] = frame 262 | 263 | ret = self.next_idx 264 | self.next_idx = (self.next_idx + 1) % self.size 265 | self.num_in_buffer = min(self.size, self.num_in_buffer + 1) 266 | 267 | return ret 268 | 269 | def store_effect(self, idx, action, reward, done): 270 | """Store effects of action taken after obeserving frame stored 271 | at index idx. The reason `store_frame` and `store_effect` is broken 272 | up into two functions is so that once can call `encode_recent_observation` 273 | in between. 274 | 275 | Paramters 276 | --------- 277 | idx: int 278 | Index in buffer of recently observed frame (returned by `store_frame`). 279 | action: int 280 | Action that was performed upon observing this frame. 281 | reward: float 282 | Reward that was received when the actions was performed. 283 | done: bool 284 | True if episode was finished after performing that action. 285 | """ 286 | self.action[idx] = action 287 | self.reward[idx] = reward 288 | self.done[idx] = done 289 | 290 | -------------------------------------------------------------------------------- /hw3/hw3_instructions.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/KuNyaa/berkeleydeeprlcourse-homework-pytorch/b7cb9fb3479b94c4e31fca32b55f7ce2586cc81d/hw3/hw3_instructions.pdf -------------------------------------------------------------------------------- /hw3/logz.py: -------------------------------------------------------------------------------- 1 | import json 2 | 3 | """ 4 | 5 | Some simple logging functionality, inspired by rllab's logging. 
6 | Assumes that each diagnostic gets logged each iteration 7 | 8 | Call logz.configure_output_dir() to start logging to a 9 | tab-separated-values file (some_folder_name/log.txt) 10 | 11 | To load the learning curves, you can do, for example 12 | 13 | A = np.genfromtxt('/tmp/expt_1468984536/log.txt',delimiter='\t',dtype=None, names=True) 14 | A['EpRewMean'] 15 | 16 | """ 17 | 18 | import os.path as osp, shutil, time, atexit, os, subprocess 19 | import pickle 20 | import torch 21 | 22 | color2num = dict( 23 | gray=30, 24 | red=31, 25 | green=32, 26 | yellow=33, 27 | blue=34, 28 | magenta=35, 29 | cyan=36, 30 | white=37, 31 | crimson=38 32 | ) 33 | 34 | def colorize(string, color, bold=False, highlight=False): 35 | attr = [] 36 | num = color2num[color] 37 | if highlight: num += 10 38 | attr.append(str(num)) 39 | if bold: attr.append('1') 40 | return '\x1b[%sm%s\x1b[0m' % (';'.join(attr), string) 41 | 42 | class G: 43 | output_dir = None 44 | output_file = None 45 | first_row = True 46 | log_headers = [] 47 | log_current_row = {} 48 | 49 | def configure_output_dir(d=None): 50 | """ 51 | Set output directory to d, or to /tmp/somerandomnumber if d is None 52 | """ 53 | G.output_dir = d or "/tmp/experiments/%i"%int(time.time()) 54 | assert not osp.exists(G.output_dir), "Log dir %s already exists! Delete it first or use a different dir"%G.output_dir 55 | os.makedirs(G.output_dir) 56 | G.output_file = open(osp.join(G.output_dir, "log.txt"), 'w') 57 | atexit.register(G.output_file.close) 58 | print(colorize("Logging data to %s"%G.output_file.name, 'green', bold=True)) 59 | 60 | def log_tabular(key, val): 61 | """ 62 | Log a value of some diagnostic 63 | Call this once for each diagnostic quantity, each iteration 64 | """ 65 | if G.first_row: 66 | G.log_headers.append(key) 67 | else: 68 | assert key in G.log_headers, "Trying to introduce a new key %s that you didn't include in the first iteration"%key 69 | assert key not in G.log_current_row, "You already set %s this iteration. 
Maybe you forgot to call dump_tabular()"%key 70 | G.log_current_row[key] = val 71 | 72 | def save_hyperparams(params): 73 | with open(osp.join(G.output_dir, "hyperparams.json"), 'w') as out: 74 | out.write(json.dumps(params, separators=(',\n','\t:\t'), sort_keys=True)) 75 | 76 | def save_pytorch_model(model): 77 | """ 78 | Saves the entire pytorch Module 79 | """ 80 | torch.save(model, osp.join(G.output_dir, "model.pkl")) 81 | 82 | 83 | def dump_tabular(): 84 | """ 85 | Write all of the diagnostics from the current iteration 86 | """ 87 | vals = [] 88 | key_lens = [len(key) for key in G.log_headers] 89 | max_key_len = max(15,max(key_lens)) 90 | keystr = '%'+'%d'%max_key_len 91 | fmt = "| " + keystr + "s | %15s |" 92 | n_slashes = 22 + max_key_len 93 | print("-"*n_slashes) 94 | for key in G.log_headers: 95 | val = G.log_current_row.get(key, "") 96 | if hasattr(val, "__float__"): valstr = "%8.3g"%val 97 | else: valstr = val 98 | print(fmt%(key, valstr)) 99 | vals.append(val) 100 | print("-"*n_slashes) 101 | if G.output_file is not None: 102 | if G.first_row: 103 | G.output_file.write("\t".join(G.log_headers)) 104 | G.output_file.write("\n") 105 | G.output_file.write("\t".join(map(str,vals))) 106 | G.output_file.write("\n") 107 | G.output_file.flush() 108 | G.log_current_row.clear() 109 | G.first_row=False 110 | -------------------------------------------------------------------------------- /hw3/lunar_lander.py: -------------------------------------------------------------------------------- 1 | import sys, math 2 | import numpy as np 3 | 4 | import Box2D 5 | from Box2D.b2 import (edgeShape, circleShape, fixtureDef, polygonShape, revoluteJointDef, contactListener) 6 | 7 | import gym 8 | from gym import spaces 9 | from gym.utils import seeding 10 | 11 | import pyglet 12 | 13 | from copy import copy 14 | 15 | # Rocket trajectory optimization is a classic topic in Optimal Control. 16 | # 17 | # According to Pontryagin's maximum principle it's optimal to fire engine full throttle or 18 | # turn it off. That's the reason this environment is OK to have discreet actions (engine on or off). 19 | # 20 | # Landing pad is always at coordinates (0,0). Coordinates are the first two numbers in state vector. 21 | # Reward for moving from the top of the screen to landing pad and zero speed is about 100..140 points. 22 | # If lander moves away from landing pad it loses reward back. Episode finishes if the lander crashes or 23 | # comes to rest, receiving additional -100 or +100 points. Each leg ground contact is +10. Firing main 24 | # engine is -0.3 points each frame. Solved is 200 points. 25 | # 26 | # Landing outside landing pad is possible. Fuel is infinite, so an agent can learn to fly and then land 27 | # on its first attempt. Please see source code for details. 28 | # 29 | # Too see heuristic landing, run: 30 | # 31 | # python gym/envs/box2d/lunar_lander.py 32 | # 33 | # To play yourself, run: 34 | # 35 | # python examples/agents/keyboard_agent.py LunarLander-v0 36 | # 37 | # Created by Oleg Klimov. Licensed on the same terms as the rest of OpenAI Gym. 38 | 39 | # Modified by Sid Reddy (sgr@berkeley.edu) on 8/14/18 40 | # 41 | # Changelog: 42 | # - different discretization scheme for actions 43 | # - different terminal rewards 44 | # - different observations 45 | # - randomized landing site 46 | # 47 | # A good agent should be able to achieve >150 reward. 
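#
# Editor's note (not part of the original assignment code): a minimal usage
# sketch of this modified environment. The discrete action k in [0, 5] is
# decoded by disc_to_cont() below into a (main engine, steering) pair --
# actions 0-2 keep the main engine off, actions 3-5 fire it, and k % 3
# selects the steering thruster (0 or 2 fires a side engine, 1 does nothing),
# so NOOP == 1 means "engine off, no steering". The observation is a length-9
# vector (position, velocity, angle and angular velocity, leg contacts,
# helipad x-coordinate). Variable names below are illustrative only:
#
#     env = LunarLander()
#     obs = env.reset()
#     assert obs.shape == (N_OBS_DIM,)   # 9-dimensional state
#     obs, reward, done, info = env.step(env.action_space.sample())
#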
48 | 49 | MAX_NUM_STEPS = 1000 50 | 51 | N_OBS_DIM = 9 52 | N_ACT_DIM = 6 # num discrete actions 53 | 54 | FPS = 50 55 | SCALE = 30.0 # affects how fast-paced the game is, forces should be adjusted as well 56 | 57 | MAIN_ENGINE_POWER = 13.0 58 | SIDE_ENGINE_POWER = 0.6 59 | 60 | INITIAL_RANDOM = 1000.0 # Set 1500 to make game harder 61 | 62 | LANDER_POLY =[ 63 | (-14,+17), (-17,0), (-17,-10), 64 | (+17,-10), (+17,0), (+14,+17) 65 | ] 66 | LEG_AWAY = 20 67 | LEG_DOWN = 18 68 | LEG_W, LEG_H = 2, 8 69 | LEG_SPRING_TORQUE = 40 # 40 is too difficult for human players, 400 a bit easier 70 | 71 | SIDE_ENGINE_HEIGHT = 14.0 72 | SIDE_ENGINE_AWAY = 12.0 73 | 74 | VIEWPORT_W = 600 75 | VIEWPORT_H = 400 76 | 77 | THROTTLE_MAG = 0.75 # discretized 'on' value for thrusters 78 | NOOP = 1 # don't fire main engine, don't steer 79 | def disc_to_cont(action): # discrete action -> continuous action 80 | if type(action) == np.ndarray: 81 | return action 82 | # main engine 83 | if action < 3: 84 | m = -THROTTLE_MAG 85 | elif action < 6: 86 | m = THROTTLE_MAG 87 | else: 88 | raise ValueError 89 | # steering 90 | if action % 3 == 0: 91 | s = -THROTTLE_MAG 92 | elif action % 3 == 1: 93 | s = 0 94 | else: 95 | s = THROTTLE_MAG 96 | return np.array([m, s]) 97 | 98 | class ContactDetector(contactListener): 99 | def __init__(self, env): 100 | contactListener.__init__(self) 101 | self.env = env 102 | def BeginContact(self, contact): 103 | if self.env.lander==contact.fixtureA.body or self.env.lander==contact.fixtureB.body: 104 | self.env.game_over = True 105 | for i in range(2): 106 | if self.env.legs[i] in [contact.fixtureA.body, contact.fixtureB.body]: 107 | self.env.legs[i].ground_contact = True 108 | def EndContact(self, contact): 109 | for i in range(2): 110 | if self.env.legs[i] in [contact.fixtureA.body, contact.fixtureB.body]: 111 | self.env.legs[i].ground_contact = False 112 | 113 | class LunarLander(gym.Env): 114 | metadata = { 115 | 'render.modes': ['human', 'rgb_array'], 116 | 'video.frames_per_second' : FPS 117 | } 118 | 119 | continuous = False 120 | 121 | def __init__(self): 122 | self._seed() 123 | self.viewer = None 124 | 125 | self.world = Box2D.b2World() 126 | self.moon = None 127 | self.lander = None 128 | self.particles = [] 129 | 130 | self.prev_reward = None 131 | 132 | high = np.array([np.inf]*N_OBS_DIM) # useful range is -1 .. 
+1, but spikes can be higher 133 | self.observation_space = spaces.Box(-high, high) 134 | 135 | self.action_space = spaces.Discrete(N_ACT_DIM) 136 | 137 | self.curr_step = None 138 | 139 | self._reset() 140 | 141 | def _seed(self, seed=None): 142 | self.np_random, seed = seeding.np_random(seed) 143 | return [seed] 144 | 145 | def _destroy(self): 146 | if not self.moon: return 147 | self.world.contactListener = None 148 | self._clean_particles(True) 149 | self.world.DestroyBody(self.moon) 150 | self.moon = None 151 | self.world.DestroyBody(self.lander) 152 | self.lander = None 153 | self.world.DestroyBody(self.legs[0]) 154 | self.world.DestroyBody(self.legs[1]) 155 | 156 | def _reset(self): 157 | self.curr_step = 0 158 | 159 | self._destroy() 160 | self.world.contactListener_keepref = ContactDetector(self) 161 | self.world.contactListener = self.world.contactListener_keepref 162 | self.game_over = False 163 | self.prev_shaping = None 164 | 165 | W = VIEWPORT_W/SCALE 166 | H = VIEWPORT_H/SCALE 167 | 168 | # terrain 169 | CHUNKS = 11 170 | height = self.np_random.uniform(0, H/2, size=(CHUNKS+1,) ) 171 | chunk_x = [W/(CHUNKS-1)*i for i in range(CHUNKS)] 172 | 173 | # randomize helipad x-coord 174 | helipad_chunk = np.random.choice(range(1, CHUNKS-1)) 175 | 176 | self.helipad_x1 = chunk_x[helipad_chunk-1] 177 | self.helipad_x2 = chunk_x[helipad_chunk+1] 178 | self.helipad_y = H/4 179 | height[helipad_chunk-2] = self.helipad_y 180 | height[helipad_chunk-1] = self.helipad_y 181 | height[helipad_chunk+0] = self.helipad_y 182 | height[helipad_chunk+1] = self.helipad_y 183 | height[helipad_chunk+2] = self.helipad_y 184 | smooth_y = [0.33*(height[i-1] + height[i+0] + height[i+1]) for i in range(CHUNKS)] 185 | 186 | self.moon = self.world.CreateStaticBody( shapes=edgeShape(vertices=[(0, 0), (W, 0)]) ) 187 | self.sky_polys = [] 188 | for i in range(CHUNKS-1): 189 | p1 = (chunk_x[i], smooth_y[i]) 190 | p2 = (chunk_x[i+1], smooth_y[i+1]) 191 | self.moon.CreateEdgeFixture( 192 | vertices=[p1,p2], 193 | density=0, 194 | friction=0.1) 195 | self.sky_polys.append( [p1, p2, (p2[0],H), (p1[0],H)] ) 196 | 197 | self.moon.color1 = (0.0,0.0,0.0) 198 | self.moon.color2 = (0.0,0.0,0.0) 199 | 200 | initial_y = VIEWPORT_H/SCALE#*0.75 201 | self.lander = self.world.CreateDynamicBody( 202 | position = (VIEWPORT_W/SCALE/2, initial_y), 203 | angle=0.0, 204 | fixtures = fixtureDef( 205 | shape=polygonShape(vertices=[ (x/SCALE,y/SCALE) for x,y in LANDER_POLY ]), 206 | density=5.0, 207 | friction=0.1, 208 | categoryBits=0x0010, 209 | maskBits=0x001, # collide only with ground 210 | restitution=0.0) # 0.99 bouncy 211 | ) 212 | self.lander.color1 = (0.5,0.4,0.9) 213 | self.lander.color2 = (0.3,0.3,0.5) 214 | self.lander.ApplyForceToCenter( ( 215 | self.np_random.uniform(-INITIAL_RANDOM, INITIAL_RANDOM), 216 | self.np_random.uniform(-INITIAL_RANDOM, INITIAL_RANDOM) 217 | ), True) 218 | 219 | self.legs = [] 220 | for i in [-1,+1]: 221 | leg = self.world.CreateDynamicBody( 222 | position = (VIEWPORT_W/SCALE/2 - i*LEG_AWAY/SCALE, initial_y), 223 | angle = (i*0.05), 224 | fixtures = fixtureDef( 225 | shape=polygonShape(box=(LEG_W/SCALE, LEG_H/SCALE)), 226 | density=1.0, 227 | restitution=0.0, 228 | categoryBits=0x0020, 229 | maskBits=0x001) 230 | ) 231 | leg.ground_contact = False 232 | leg.color1 = (0.5,0.4,0.9) 233 | leg.color2 = (0.3,0.3,0.5) 234 | rjd = revoluteJointDef( 235 | bodyA=self.lander, 236 | bodyB=leg, 237 | localAnchorA=(0, 0), 238 | localAnchorB=(i*LEG_AWAY/SCALE, LEG_DOWN/SCALE), 239 | enableMotor=True, 240 | 
enableLimit=True, 241 | maxMotorTorque=LEG_SPRING_TORQUE, 242 | motorSpeed=+0.3*i # low enough not to jump back into the sky 243 | ) 244 | if i==-1: 245 | rjd.lowerAngle = +0.9 - 0.5 # Yes, the most esoteric numbers here, angles legs have freedom to travel within 246 | rjd.upperAngle = +0.9 247 | else: 248 | rjd.lowerAngle = -0.9 249 | rjd.upperAngle = -0.9 + 0.5 250 | leg.joint = self.world.CreateJoint(rjd) 251 | self.legs.append(leg) 252 | 253 | self.drawlist = [self.lander] + self.legs 254 | 255 | return self._step(NOOP)[0] 256 | 257 | def _create_particle(self, mass, x, y, ttl): 258 | p = self.world.CreateDynamicBody( 259 | position = (x,y), 260 | angle=0.0, 261 | fixtures = fixtureDef( 262 | shape=circleShape(radius=2/SCALE, pos=(0,0)), 263 | density=mass, 264 | friction=0.1, 265 | categoryBits=0x0100, 266 | maskBits=0x001, # collide only with ground 267 | restitution=0.3) 268 | ) 269 | p.ttl = ttl 270 | self.particles.append(p) 271 | self._clean_particles(False) 272 | return p 273 | 274 | def _clean_particles(self, all): 275 | while self.particles and (all or self.particles[0].ttl<0): 276 | self.world.DestroyBody(self.particles.pop(0)) 277 | 278 | def _step(self, action): 279 | assert self.action_space.contains(action), "%r (%s) invalid " % (action,type(action)) 280 | action = disc_to_cont(action) 281 | 282 | # Engines 283 | tip = (math.sin(self.lander.angle), math.cos(self.lander.angle)) 284 | side = (-tip[1], tip[0]); 285 | dispersion = [self.np_random.uniform(-1.0, +1.0) / SCALE for _ in range(2)] 286 | 287 | m_power = 0.0 288 | if action[0] > 0.0: 289 | # Main engine 290 | m_power = (np.clip(action[0], 0.0,1.0) + 1.0)*0.5 # 0.5..1.0 291 | assert m_power>=0.5 and m_power <= 1.0 292 | ox = tip[0]*(4/SCALE + 2*dispersion[0]) + side[0]*dispersion[1] # 4 is move a bit downwards, +-2 for randomness 293 | oy = -tip[1]*(4/SCALE + 2*dispersion[0]) - side[1]*dispersion[1] 294 | impulse_pos = (self.lander.position[0] + ox, self.lander.position[1] + oy) 295 | p = self._create_particle(3.5, impulse_pos[0], impulse_pos[1], m_power) # particles are just a decoration, 3.5 is here to make particle speed adequate 296 | p.ApplyLinearImpulse( ( ox*MAIN_ENGINE_POWER*m_power, oy*MAIN_ENGINE_POWER*m_power), impulse_pos, True) 297 | self.lander.ApplyLinearImpulse( (-ox*MAIN_ENGINE_POWER*m_power, -oy*MAIN_ENGINE_POWER*m_power), impulse_pos, True) 298 | 299 | s_power = 0.0 300 | if np.abs(action[1]) > 0.5: 301 | # Orientation engines 302 | direction = np.sign(action[1]) 303 | s_power = np.clip(np.abs(action[1]), 0.5,1.0) 304 | assert s_power>=0.5 and s_power <= 1.0 305 | ox = tip[0]*dispersion[0] + side[0]*(3*dispersion[1]+direction*SIDE_ENGINE_AWAY/SCALE) 306 | oy = -tip[1]*dispersion[0] - side[1]*(3*dispersion[1]+direction*SIDE_ENGINE_AWAY/SCALE) 307 | impulse_pos = (self.lander.position[0] + ox - tip[0]*17/SCALE, self.lander.position[1] + oy + tip[1]*SIDE_ENGINE_HEIGHT/SCALE) 308 | p = self._create_particle(0.7, impulse_pos[0], impulse_pos[1], s_power) 309 | p.ApplyLinearImpulse( ( ox*SIDE_ENGINE_POWER*s_power, oy*SIDE_ENGINE_POWER*s_power), impulse_pos, True) 310 | self.lander.ApplyLinearImpulse( (-ox*SIDE_ENGINE_POWER*s_power, -oy*SIDE_ENGINE_POWER*s_power), impulse_pos, True) 311 | 312 | # perform normal update 313 | self.world.Step(1.0/FPS, 6*30, 2*30) 314 | 315 | pos = self.lander.position 316 | vel = self.lander.linearVelocity 317 | helipad_x = (self.helipad_x1 + self.helipad_x2) / 2 318 | state = [ 319 | (pos.x - VIEWPORT_W/SCALE/2) / (VIEWPORT_W/SCALE/2), 320 | (pos.y - 
(self.helipad_y+LEG_DOWN/SCALE)) / (VIEWPORT_W/SCALE/2), 321 | vel.x*(VIEWPORT_W/SCALE/2)/FPS, 322 | vel.y*(VIEWPORT_H/SCALE/2)/FPS, 323 | self.lander.angle, 324 | 20.0*self.lander.angularVelocity/FPS, 325 | 1.0 if self.legs[0].ground_contact else 0.0, 326 | 1.0 if self.legs[1].ground_contact else 0.0, 327 | (helipad_x - VIEWPORT_W/SCALE/2) / (VIEWPORT_W/SCALE/2) 328 | ] 329 | assert len(state)==N_OBS_DIM 330 | 331 | self.curr_step += 1 332 | 333 | reward = 0 334 | shaping = 0 335 | dx = (pos.x - helipad_x) / (VIEWPORT_W/SCALE/2) 336 | shaping += -100*np.sqrt(state[2]*state[2] + state[3]*state[3]) - 100*abs(state[4]) 337 | shaping += -100*np.sqrt(dx*dx + state[1]*state[1]) + 10*state[6] + 10*state[7] 338 | if self.prev_shaping is not None: 339 | reward = shaping - self.prev_shaping 340 | self.prev_shaping = shaping 341 | 342 | reward -= m_power*0.30 # less fuel spent is better, about -30 for heurisic landing 343 | reward -= s_power*0.03 344 | 345 | oob = abs(state[0]) >= 1.0 346 | timeout = self.curr_step >= MAX_NUM_STEPS 347 | not_awake = not self.lander.awake 348 | 349 | at_site = pos.x >= self.helipad_x1 and pos.x <= self.helipad_x2 and state[1] <= 0 350 | grounded = self.legs[0].ground_contact and self.legs[1].ground_contact 351 | landed = at_site and grounded 352 | 353 | done = self.game_over or oob or not_awake or timeout or landed 354 | if done: 355 | if self.game_over or oob: 356 | reward = -100 357 | self.lander.color1 = (255,0,0) 358 | elif at_site: 359 | reward = +100 360 | self.lander.color1 = (0,255,0) 361 | elif timeout: 362 | self.lander.color1 = (255,0,0) 363 | info = {} 364 | 365 | return np.array(state), reward, done, info 366 | 367 | def _render(self, mode='human', close=False): 368 | if close: 369 | if self.viewer is not None: 370 | self.viewer.close() 371 | self.viewer = None 372 | return 373 | 374 | from gym.envs.classic_control import rendering 375 | if self.viewer is None: 376 | self.viewer = rendering.Viewer(VIEWPORT_W, VIEWPORT_H) 377 | self.viewer.set_bounds(0, VIEWPORT_W/SCALE, 0, VIEWPORT_H/SCALE) 378 | 379 | for obj in self.particles: 380 | obj.ttl -= 0.15 381 | obj.color1 = (max(0.2,0.2+obj.ttl), max(0.2,0.5*obj.ttl), max(0.2,0.5*obj.ttl)) 382 | obj.color2 = (max(0.2,0.2+obj.ttl), max(0.2,0.5*obj.ttl), max(0.2,0.5*obj.ttl)) 383 | 384 | self._clean_particles(False) 385 | 386 | for p in self.sky_polys: 387 | self.viewer.draw_polygon(p, color=(0,0,0)) 388 | 389 | for obj in self.particles + self.drawlist: 390 | for f in obj.fixtures: 391 | trans = f.body.transform 392 | if type(f.shape) is circleShape: 393 | t = rendering.Transform(translation=trans*f.shape.pos) 394 | self.viewer.draw_circle(f.shape.radius, 20, color=obj.color1).add_attr(t) 395 | self.viewer.draw_circle(f.shape.radius, 20, color=obj.color2, filled=False, linewidth=2).add_attr(t) 396 | else: 397 | path = [trans*v for v in f.shape.vertices] 398 | self.viewer.draw_polygon(path, color=obj.color1) 399 | path.append(path[0]) 400 | self.viewer.draw_polyline(path, color=obj.color2, linewidth=2) 401 | 402 | for x in [self.helipad_x1, self.helipad_x2]: 403 | flagy1 = self.helipad_y 404 | flagy2 = flagy1 + 50/SCALE 405 | self.viewer.draw_polyline( [(x, flagy1), (x, flagy2)], color=(1,1,1) ) 406 | self.viewer.draw_polygon( [(x, flagy2), (x, flagy2-10/SCALE), (x+25/SCALE, flagy2-5/SCALE)], color=(0.8,0.8,0) ) 407 | 408 | clock_prog = self.curr_step / MAX_NUM_STEPS 409 | self.viewer.draw_polyline( [(0, 0.05*VIEWPORT_H/SCALE), (clock_prog*VIEWPORT_W/SCALE, 0.05*VIEWPORT_H/SCALE)], color=(255,0,0), linewidth=5 
) 410 | 411 | return self.viewer.render(return_rgb_array = mode=='rgb_array') 412 | 413 | def reset(self): 414 | return self._reset() 415 | 416 | def step(self, *args, **kwargs): 417 | return self._step(*args, **kwargs) 418 | 419 | 420 | class LunarLanderContinuous(LunarLander): 421 | continuous = True 422 | 423 | def heuristic(env, s): 424 | # Heuristic for: 425 | # 1. Testing. 426 | # 2. Demonstration rollout. 427 | angle_targ = s[0]*0.5 + s[2]*1.0 # angle should point towards center (s[0] is horizontal coordinate, s[2] hor speed) 428 | if angle_targ > 0.4: angle_targ = 0.4 # more than 0.4 radians (22 degrees) is bad 429 | if angle_targ < -0.4: angle_targ = -0.4 430 | hover_targ = 0.55*np.abs(s[0]) # target y should be proporional to horizontal offset 431 | 432 | # PID controller: s[4] angle, s[5] angularSpeed 433 | angle_todo = (angle_targ - s[4])*0.5 - (s[5])*1.0 434 | #print("angle_targ=%0.2f, angle_todo=%0.2f" % (angle_targ, angle_todo)) 435 | 436 | # PID controller: s[1] vertical coordinate s[3] vertical speed 437 | hover_todo = (hover_targ - s[1])*0.5 - (s[3])*0.5 438 | #print("hover_targ=%0.2f, hover_todo=%0.2f" % (hover_targ, hover_todo)) 439 | 440 | if s[6] or s[7]: # legs have contact 441 | angle_todo = 0 442 | hover_todo = -(s[3])*0.5 # override to reduce fall speed, that's all we need after contact 443 | 444 | a = np.array( [hover_todo*20 - 1, -angle_todo*20] ) 445 | a = np.clip(a, -1, +1) 446 | return a 447 | 448 | if __name__=="__main__": 449 | #env = LunarLander() 450 | env = LunarLanderContinuous() 451 | s = env.reset() 452 | total_reward = 0 453 | steps = 0 454 | while True: 455 | a = heuristic(env, s) 456 | s, r, done, info = env.step(a) 457 | env.render() 458 | total_reward += r 459 | if steps % 20 == 0 or done: 460 | print(["{:+0.2f}".format(x) for x in s]) 461 | print("step {} total_reward {:+0.2f}".format(steps, total_reward)) 462 | steps += 1 463 | if done: break 464 | -------------------------------------------------------------------------------- /hw3/plot.py: -------------------------------------------------------------------------------- 1 | import seaborn as sns 2 | import pandas as pd 3 | import matplotlib.pyplot as plt 4 | import json 5 | import os 6 | 7 | """ 8 | Using the plotter: 9 | 10 | Call it from the command line, and supply it with logdirs to experiments. 11 | Suppose you ran an experiment with name 'test', and you ran 'test' for 10 12 | random seeds. The runner code stored it in the directory structure 13 | 14 | data 15 | L test_EnvName_DateTime 16 | L 0 17 | L log.txt 18 | L params.json 19 | L 1 20 | L log.txt 21 | L params.json 22 | . 23 | . 24 | . 25 | L 9 26 | L log.txt 27 | L params.json 28 | 29 | To plot learning curves from the experiment, averaged over all random 30 | seeds, call 31 | 32 | python plot.py data/test_EnvName_DateTime --value AverageReturn 33 | 34 | and voila. To see a different statistics, change what you put in for 35 | the keyword --value. You can also enter /multiple/ values, and it will 36 | make all of them in order. 37 | 38 | 39 | Suppose you ran two experiments: 'test1' and 'test2'. In 'test2' you tried 40 | a different set of hyperparameters from 'test1', and now you would like 41 | to compare them -- see their learning curves side-by-side. Just call 42 | 43 | python plot.py data/test1 data/test2 44 | 45 | and it will plot them both! They will be given titles in the legend according 46 | to their exp_name parameters. 
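You can also pass several statistics to --value; each one is plotted in turn.
If you pass exactly two statistics together with the --combine flag, they are
drawn in a single figure and distinguished by line style. For example (the log
directory name here is only illustrative, and this assumes your logs contain
MeanReturn and BestMeanReturn columns, as the DQN logger in this homework
writes, with --time set to the matching x-axis column):

    python plot.py data/dqn_expname_EnvName_DateTime --time TimeStep --value MeanReturn BestMeanReturn --combine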
If you want to use custom legend titles, use 47 | the --legend flag and then provide a title for each logdir. 48 | 49 | """ 50 | 51 | def plot_data(data, time="Iteration", value="AverageReturn", combine=False): 52 | if isinstance(data, list): 53 | data = pd.concat(data, ignore_index=True) 54 | plt.figure(figsize=(16, 9)) 55 | sns.set(style="darkgrid", font_scale=1.5) 56 | if not combine: 57 | sns.tsplot(data=data, time=time, value=value, unit="Unit", condition="Condition") 58 | else: 59 | df1 = data.loc[:, [time, value[0], 'Condition']] 60 | df1['Statistics'] = value[0] 61 | df1.rename(columns={value[0]:'Value', 'Condition':'ExpName'}, inplace = True) 62 | df2 = data.loc[:, [time, value[1], 'Condition']] 63 | df2['Statistics'] = value[1] 64 | df2.rename(columns={value[1]:'Value', 'Condition':'ExpName'}, inplace = True) 65 | data = pd.concat([df1, df2], ignore_index=True) 66 | sns.lineplot(x=time, y='Value', hue='ExpName', style='Statistics', data=data) 67 | 68 | plt.legend(loc='best').draggable() 69 | plt.savefig('result.png', bbox_inches='tight') 70 | plt.show() 71 | 72 | 73 | def get_datasets(fpath, condition=None): 74 | unit = 0 75 | datasets = [] 76 | for root, dir, files in os.walk(fpath): 77 | if 'log.txt' in files: 78 | param_path = open(os.path.join(root,'hyperparams.json')) 79 | params = json.load(param_path) 80 | exp_name = params['exp_name'] 81 | 82 | log_path = os.path.join(root,'log.txt') 83 | experiment_data = pd.read_table(log_path) 84 | 85 | experiment_data.insert( 86 | len(experiment_data.columns), 87 | 'Unit', 88 | unit 89 | ) 90 | experiment_data.insert( 91 | len(experiment_data.columns), 92 | 'Condition', 93 | condition or exp_name 94 | ) 95 | 96 | datasets.append(experiment_data) 97 | unit += 1 98 | 99 | return datasets 100 | 101 | 102 | def main(): 103 | import argparse 104 | parser = argparse.ArgumentParser() 105 | parser.add_argument('logdir', nargs='*') 106 | parser.add_argument('--legend', nargs='*') 107 | parser.add_argument('--time', type=str, default='Iteration') 108 | parser.add_argument('--value', default='AverageReturn', nargs='*') 109 | parser.add_argument('--combine', action='store_true') 110 | args = parser.parse_args() 111 | 112 | use_legend = False 113 | if args.legend is not None: 114 | assert len(args.legend) == len(args.logdir), \ 115 | "Must give a legend title for each set of experiments." 
116 | use_legend = True 117 | 118 | data = [] 119 | if use_legend: 120 | for logdir, legend_title in zip(args.logdir, args.legend): 121 | data += get_datasets(logdir, legend_title) 122 | else: 123 | for logdir in args.logdir: 124 | data += get_datasets(logdir) 125 | 126 | time = args.time 127 | 128 | if isinstance(args.value, list): 129 | values = args.value 130 | else: 131 | values = [args.value] 132 | 133 | if args.combine and len(values) == 2: 134 | plot_data(data, time=time, value=values, combine=True) 135 | else: 136 | for value in values: 137 | plot_data(data, time=time, value=value, combine=False) 138 | 139 | if __name__ == "__main__": 140 | main() 141 | -------------------------------------------------------------------------------- /hw3/requirements.txt: -------------------------------------------------------------------------------- 1 | gym==0.10.5 2 | gym[atari] 3 | box2d 4 | mujoco-py==1.50.1.56 5 | torch==0.4.0 6 | numpy 7 | seaborn 8 | opencv-python 9 | -------------------------------------------------------------------------------- /hw3/run_dqn_atari.py: -------------------------------------------------------------------------------- 1 | import gym 2 | from gym import wrappers 3 | import time 4 | import logz 5 | import os.path as osp 6 | import random 7 | import numpy as np 8 | import torch 9 | from torch import nn 10 | 11 | import dqn 12 | from dqn_utils import PiecewiseSchedule, get_wrapper_by_name 13 | from atari_wrappers import wrap_deepmind 14 | 15 | def weights_init(m): 16 | if hasattr(m, 'weight'): 17 | nn.init.xavier_normal_(m.weight) 18 | if hasattr(m, 'bias'): 19 | nn.init.constant_(m.bias, 0) 20 | 21 | class DQN(nn.Module): # for atari 22 | def __init__(self, in_channels, num_actions): 23 | # as described in https://storage.googleapis.com/deepmind-data/assets/papers/DeepMindNature14236Paper.pdf 24 | super(DQN, self).__init__() 25 | self.convnet = nn.Sequential( 26 | nn.Conv2d(in_channels, out_channels=32, kernel_size=8, stride=4), 27 | nn.ReLU(True), 28 | nn.Conv2d(in_channels=32, out_channels=64, kernel_size=4, stride=2), 29 | nn.ReLU(True), 30 | nn.Conv2d(in_channels=64, out_channels=64, kernel_size=3, stride=1), 31 | nn.ReLU(True), 32 | ) 33 | self.classifier = nn.Sequential( 34 | nn.Linear(in_features=7 * 7 * 64, out_features=512), 35 | nn.ReLU(True), 36 | nn.Linear(in_features=512, out_features=num_actions), 37 | ) 38 | 39 | self.apply(weights_init) 40 | 41 | def forward(self, obs): 42 | out = obs.float() / 255 # convert 8-bits RGB color to float in [0, 1] 43 | out = out.permute(0, 3, 1, 2) # reshape to [batch_size, img_c * frames, img_h, img_w] 44 | out = self.convnet(out) 45 | out = out.view(out.size(0), -1) # flatten feature maps to a big vector 46 | out = self.classifier(out) 47 | return out 48 | 49 | def atari_learn(env, 50 | num_timesteps): 51 | # This is just a rough estimate 52 | num_iterations = float(num_timesteps) / 4.0 53 | 54 | lr_multiplier = 1.0 55 | lr_schedule = PiecewiseSchedule( 56 | [ 57 | (0, 1e-4 * lr_multiplier), 58 | (num_iterations / 10, 1e-4 * lr_multiplier), 59 | (num_iterations / 2, 5e-5 * lr_multiplier), 60 | ], 61 | outside_value=5e-5 * lr_multiplier 62 | ) 63 | lr_lambda = lambda t: lr_schedule.value(t) 64 | 65 | optimizer = dqn.OptimizerSpec( 66 | constructor=torch.optim.Adam, 67 | kwargs=dict(eps=1e-4), 68 | lr_lambda=lr_lambda 69 | ) 70 | 71 | def stopping_criterion(env, t): 72 | # notice that here t is the number of steps of the wrapped env, 73 | # which is different from the number of steps in the underlying env 74 | return 
get_wrapper_by_name(env, "Monitor").get_total_steps() >= num_timesteps 75 | 76 | exploration_schedule = PiecewiseSchedule( 77 | [ 78 | (0, 1.0), 79 | (1e6, 0.1), 80 | (num_iterations / 2, 0.01), 81 | ], 82 | outside_value=0.01 83 | ) 84 | 85 | dqn.learn( 86 | env=env, 87 | q_func=DQN, 88 | optimizer_spec=optimizer, 89 | exploration=exploration_schedule, 90 | stopping_criterion=stopping_criterion, 91 | replay_buffer_size=1000000, 92 | batch_size=32, 93 | gamma=0.99, 94 | learning_starts=50000, 95 | learning_freq=4, 96 | frame_history_len=4, 97 | target_update_freq=10000, 98 | grad_norm_clipping=10, 99 | double_q=True 100 | ) 101 | env.close() 102 | 103 | def set_global_seeds(i): 104 | torch.manual_seed(i) 105 | if torch.cuda.is_available: 106 | torch.cuda.manual_seed(i) 107 | np.random.seed(i) 108 | random.seed(i) 109 | 110 | def get_env(env_name, exp_name, seed): 111 | env = gym.make(env_name) 112 | 113 | set_global_seeds(seed) 114 | env.seed(seed) 115 | 116 | # Set Up Logger 117 | logdir = 'dqn_' + exp_name + '_' + env_name + '_' + time.strftime("%d-%m-%Y_%H-%M-%S") 118 | logdir = osp.join('data', logdir) 119 | logdir = osp.join(logdir, '%d'%seed) 120 | logz.configure_output_dir(logdir) 121 | hyperparams = {'exp_name': exp_name, 'env_name': env_name} 122 | logz.save_hyperparams(hyperparams) 123 | 124 | expt_dir = '/tmp/hw3_vid_dir2/' 125 | env = wrappers.Monitor(env, osp.join(expt_dir, "gym"), force=True) 126 | env = wrap_deepmind(env) 127 | 128 | return env 129 | 130 | def main(): 131 | # Choose Atari games. 132 | env_name = 'PongNoFrameskip-v4' 133 | exp_name = 'Pong_double_dqn' # you can use it to mark different experiments 134 | 135 | # Run training 136 | seed = random.randint(0, 9999) 137 | print('random seed = %d' % seed) 138 | env = get_env(env_name, exp_name, seed) 139 | atari_learn(env, num_timesteps=2e8) 140 | 141 | if __name__ == "__main__": 142 | main() 143 | -------------------------------------------------------------------------------- /hw3/run_dqn_lander.py: -------------------------------------------------------------------------------- 1 | import gym 2 | from gym import wrappers 3 | import time 4 | import logz 5 | import os.path as osp 6 | import random 7 | import numpy as np 8 | import torch 9 | from torch import nn 10 | 11 | import dqn 12 | from dqn_utils import ConstantSchedule, PiecewiseSchedule, get_wrapper_by_name 13 | 14 | 15 | def weights_init(m): 16 | if hasattr(m, 'weight'): 17 | nn.init.orthogonal_(m.weight) 18 | if hasattr(m, 'bias'): 19 | nn.init.constant_(m.bias, 0) 20 | 21 | class DQN(nn.Module): # for lunar lander 22 | def __init__(self, in_features, num_actions): 23 | super(DQN, self).__init__() 24 | self.classifier = nn.Sequential( 25 | nn.Linear(in_features, out_features=64), 26 | nn.ReLU(True), 27 | nn.Linear(in_features=64, out_features=64), 28 | nn.ReLU(True), 29 | nn.Linear(in_features=64, out_features=num_actions), 30 | ) 31 | 32 | self.apply(weights_init) 33 | 34 | def forward(self, obs): 35 | out = self.classifier(obs) 36 | return out 37 | 38 | def lander_optimizer(): 39 | lr_schedule = ConstantSchedule(1e-3) 40 | lr_lambda = lambda t: lr_schedule.value(t) 41 | return dqn.OptimizerSpec( 42 | constructor=torch.optim.Adam, 43 | lr_lambda=lr_lambda, 44 | kwargs={} 45 | ) 46 | 47 | def lander_stopping_criterion(num_timesteps): 48 | def stopping_criterion(env, t): 49 | # notice that here t is the number of steps of the wrapped env, 50 | # which is different from the number of steps in the underlying env 51 | return get_wrapper_by_name(env, 
"Monitor").get_total_steps() >= num_timesteps 52 | return stopping_criterion 53 | 54 | def lander_exploration_schedule(num_timesteps): 55 | return PiecewiseSchedule( 56 | [ 57 | (0, 1), 58 | (num_timesteps * 0.1, 0.02), 59 | ], outside_value=0.02 60 | ) 61 | 62 | def lander_kwargs(): 63 | return { 64 | 'optimizer_spec': lander_optimizer(), 65 | 'q_func': DQN, 66 | 'replay_buffer_size': 50000, 67 | 'batch_size': 32, 68 | 'gamma': 1.00, 69 | 'learning_starts': 1000, 70 | 'learning_freq': 1, 71 | 'frame_history_len': 1, 72 | 'target_update_freq': 3000, 73 | 'grad_norm_clipping': 10, 74 | 'lander': True 75 | } 76 | 77 | def lander_learn(env, 78 | num_timesteps): 79 | 80 | optimizer = lander_optimizer() 81 | stopping_criterion = lander_stopping_criterion(num_timesteps) 82 | exploration_schedule = lander_exploration_schedule(num_timesteps) 83 | 84 | dqn.learn( 85 | env=env, 86 | exploration=lander_exploration_schedule(num_timesteps), 87 | stopping_criterion=lander_stopping_criterion(num_timesteps), 88 | double_q=True, 89 | **lander_kwargs() 90 | ) 91 | env.close() 92 | 93 | def set_global_seeds(i): 94 | torch.manual_seed(i) 95 | if torch.cuda.is_available: 96 | torch.cuda.manual_seed(i) 97 | np.random.seed(i) 98 | random.seed(i) 99 | 100 | def get_env(env_name, exp_name, seed): 101 | env = gym.make(env_name) 102 | 103 | set_global_seeds(seed) 104 | env.seed(seed) 105 | 106 | # Set Up Logger 107 | logdir = 'dqn_' + exp_name + '_' + env_name + '_' + time.strftime("%d-%m-%Y_%H-%M-%S") 108 | logdir = osp.join('data', logdir) 109 | logdir = osp.join(logdir, '%d'%seed) 110 | logz.configure_output_dir(logdir) 111 | hyperparams = {'exp_name': exp_name, 'env_name': env_name} 112 | logz.save_hyperparams(hyperparams) 113 | 114 | expt_dir = '/tmp/hw3_vid_dir/' 115 | env = wrappers.Monitor(env, osp.join(expt_dir, "gym"), force=True, video_callable=False) 116 | 117 | 118 | return env 119 | 120 | def main(): 121 | # Choose Atari games. 
122 | env_name = 'LunarLander-v2' 123 | exp_name = 'LunarLander_double_dqn' # you can use it to mark different experiments 124 | 125 | # Run training 126 | seed = 4565 # you may want to randomize this 127 | print('random seed = %d' % seed) 128 | env = get_env(env_name, exp_name, seed) 129 | lander_learn(env, num_timesteps=500000) 130 | 131 | if __name__ == "__main__": 132 | main() 133 | -------------------------------------------------------------------------------- /hw3/run_dqn_ram.py: -------------------------------------------------------------------------------- 1 | import gym 2 | from gym import wrappers 3 | import time 4 | import logz 5 | import os.path as osp 6 | import random 7 | import numpy as np 8 | import torch 9 | from torch import nn 10 | 11 | import dqn 12 | from dqn_utils import PiecewiseSchedule, get_wrapper_by_name 13 | from atari_wrappers import wrap_deepmind_ram 14 | 15 | def weights_init(m): 16 | if hasattr(m, 'weight'): 17 | nn.init.xavier_uniform_(m.weight) 18 | if hasattr(m, 'bias'): 19 | nn.init.constant_(m.bias, 0) 20 | 21 | class DQN(nn.Module): # for atari ram 22 | def __init__(self, in_features, num_actions): 23 | super(DQN, self).__init__() 24 | self.classifier = nn.Sequential( 25 | nn.Linear(in_features, out_features=256), 26 | nn.ReLU(True), 27 | nn.Linear(in_features=256, out_features=128), 28 | nn.ReLU(True), 29 | nn.Linear(in_features=128, out_features=64), 30 | nn.ReLU(True), 31 | nn.Linear(in_features=64, out_features=num_actions), 32 | ) 33 | 34 | self.apply(weights_init) 35 | 36 | def forward(self, obs): 37 | out = obs.float() / 255 # convert 8-bits ram state to float in [0, 1] 38 | out = self.classifier(out) 39 | return out 40 | 41 | def atari_learn(env, 42 | num_timesteps): 43 | # This is just a rough estimate 44 | num_iterations = float(num_timesteps) / 4.0 45 | 46 | lr_multiplier = 1.0 47 | lr_schedule = PiecewiseSchedule( 48 | [ 49 | (0, 1e-4 * lr_multiplier), 50 | (num_iterations / 10, 1e-4 * lr_multiplier), 51 | (num_iterations / 2, 5e-5 * lr_multiplier), 52 | ], 53 | outside_value=5e-5 * lr_multiplier 54 | ) 55 | lr_lambda = lambda t: lr_schedule.value(t) 56 | 57 | optimizer = dqn.OptimizerSpec( 58 | constructor=torch.optim.Adam, 59 | kwargs=dict(eps=1e-4), 60 | lr_lambda=lr_lambda 61 | ) 62 | 63 | def stopping_criterion(env, t): 64 | # notice that here t is the number of steps of the wrapped env, 65 | # which is different from the number of steps in the underlying env 66 | return get_wrapper_by_name(env, "Monitor").get_total_steps() >= num_timesteps 67 | 68 | exploration_schedule = PiecewiseSchedule( 69 | [ 70 | (0, 0.2), 71 | (1e6, 0.1), 72 | (num_iterations / 2, 0.01), 73 | ], outside_value=0.01 74 | ) 75 | 76 | dqn.learn( 77 | env, 78 | q_func=DQN, 79 | optimizer_spec=optimizer, 80 | exploration=exploration_schedule, 81 | stopping_criterion=stopping_criterion, 82 | replay_buffer_size=1000000, 83 | batch_size=32, 84 | gamma=0.99, 85 | learning_starts=50000, 86 | learning_freq=4, 87 | frame_history_len=1, 88 | target_update_freq=10000, 89 | grad_norm_clipping=10 90 | ) 91 | env.close() 92 | 93 | def set_global_seeds(i): 94 | torch.manual_seed(i) 95 | if torch.cuda.is_available: 96 | torch.cuda.manual_seed(i) 97 | np.random.seed(i) 98 | random.seed(i) 99 | 100 | def get_env(env_name, exp_name, seed): 101 | env = gym.make(env_name) 102 | 103 | set_global_seeds(seed) 104 | env.seed(seed) 105 | 106 | # Set Up Logger 107 | logdir = 'dqn_' + exp_name + '_' + env_name + '_' + time.strftime("%d-%m-%Y_%H-%M-%S") 108 | logdir = osp.join('data', 
logdir) 109 | logdir = osp.join(logdir, '%d'%seed) 110 | logz.configure_output_dir(logdir) 111 | hyperparams = {'exp_name': exp_name, 'env_name': env_name} 112 | logz.save_hyperparams(hyperparams) 113 | 114 | expt_dir = '/tmp/hw3_vid_dir/' 115 | env = wrappers.Monitor(env, osp.join(expt_dir, "gym"), force=True) 116 | env = wrap_deepmind_ram(env) 117 | 118 | return env 119 | 120 | def main(): 121 | # Choose Atari games. 122 | env_name = 'Pong-ram-v0' 123 | exp_name = 'Pong_double_dqn' # you can use it to mark different experiments 124 | 125 | # Run training 126 | seed = 0 # Use a seed of zero (you may want to randomize the seed!) 127 | print('random seed = %d' % seed) 128 | env = get_env(env_name, exp_name, seed) 129 | atari_learn(env, num_timesteps=int(4e7)) 130 | 131 | if __name__ == "__main__": 132 | main() 133 | -------------------------------------------------------------------------------- /hw3/train_ac_f18.py: -------------------------------------------------------------------------------- 1 | """ 2 | Original code from John Schulman for CS294 Deep Reinforcement Learning Spring 2017 3 | Adapted for CS294-112 Fall 2017 by Abhishek Gupta and Joshua Achiam 4 | Adapted for CS294-112 Fall 2018 by Soroush Nasiriany, Sid Reddy, and Greg Kahn 5 | Adapted for pytorch version by Ning Dai 6 | """ 7 | import numpy as np 8 | import torch 9 | import gym 10 | import logz 11 | import os 12 | import time 13 | import inspect 14 | from torch.multiprocessing import Process 15 | from torch import nn, optim 16 | 17 | #============================================================================================# 18 | # Utilities 19 | #============================================================================================# 20 | 21 | def build_mlp(input_size, output_size, n_layers, hidden_size, activation=nn.Tanh): 22 | """ 23 | Builds a feedforward neural network 24 | 25 | arguments: 26 | input_size: size of the input layer 27 | output_size: size of the output layer 28 | n_layers: number of hidden layers 29 | hidden_size: dimension of the hidden layers 30 | activation: activation of the hidden layers 31 | output_activation: activation of the output layer 32 | 33 | returns: 34 | an instance of nn.Sequential which contains the feedforward neural network 35 | 36 | Hint: use nn.Linear 37 | """ 38 | layers = [] 39 | # YOUR HW2 CODE HERE 40 | raise NotImplementedError 41 | 42 | return nn.Sequential(*layers).apply(weights_init) 43 | 44 | def weights_init(m): 45 | if hasattr(m, 'weight'): 46 | nn.init.xavier_uniform_(m.weight) 47 | 48 | def pathlength(path): 49 | return len(path["reward"]) 50 | 51 | def setup_logger(logdir, locals_): 52 | # Configure output directory for logging 53 | logz.configure_output_dir(logdir) 54 | # Log experimental parameters 55 | args = inspect.getargspec(train_AC)[0] 56 | hyperparams = {k: locals_[k] if k in locals_ else None for k in args} 57 | logz.save_hyperparams(hyperparams) 58 | 59 | class PolicyNet(nn.Module): 60 | def __init__(self, neural_network_args): 61 | super(PolicyNet, self).__init__() 62 | self.ob_dim = neural_network_args['ob_dim'] 63 | self.ac_dim = neural_network_args['ac_dim'] 64 | self.discrete = neural_network_args['discrete'] 65 | self.hidden_size = neural_network_args['size'] 66 | self.n_layers = neural_network_args['actor_n_layers'] 67 | 68 | self.define_model_components() 69 | 70 | def define_model_components(self): 71 | """ 72 | Define the parameters of policy network here. 73 | You can use any instance of nn.Module or nn.Parameter. 
74 | 75 | Hint: use the 'build_mlp' function above 76 | In the discrete case, model should output logits of a categorical distribution 77 | over the actions 78 | In the continuous case, model should output a tuple (mean, log_std) of a Gaussian 79 | distribution over actions. log_std should just be a trainable 80 | variable, not a network output. 81 | """ 82 | # YOUR HW2 CODE HERE 83 | if self.discrete: 84 | raise NotImplementedError 85 | else: 86 | raise NotImplementedError 87 | 88 | #========================================================================================# 89 | # ----------PROBLEM 2---------- 90 | #========================================================================================# 91 | """ 92 | Notes on notation: 93 | 94 | Pytorch tensor variables have the prefix ts_, to distinguish them from the numpy array 95 | variables that are computed later in the function 96 | 97 | Prefixes and suffixes: 98 | ob - observation 99 | ac - action 100 | _no - this tensor should have shape (batch size, observation dim) 101 | _na - this tensor should have shape (batch size, action dim) 102 | _n - this tensor should have shape (batch size) 103 | 104 | Note: batch size is defined at runtime 105 | """ 106 | def forward(self, ts_ob_no): 107 | """ 108 | Define forward pass for policy network. 109 | 110 | arguments: 111 | ts_ob_no: (batch_size, self.ob_dim) 112 | 113 | returns: 114 | the parameters of the policy. 115 | 116 | if discrete, the parameters are the logits of a categorical distribution 117 | over the actions 118 | ts_logits_na: (batch_size, self.ac_dim) 119 | 120 | if continuous, the parameters are a tuple (mean, log_std) of a Gaussian 121 | distribution over actions. log_std should just be a trainable 122 | variable, not a network output. 
123 | ts_mean: (batch_size, self.ac_dim) 124 | st_logstd: (self.ac_dim,) 125 | 126 | Hint: use the components you defined in self.define_model_components 127 | """ 128 | raise NotImplementedError 129 | if self.discrete: 130 | # YOUR HW2 CODE HERE 131 | ts_logits_na = None 132 | return ts_logits_na 133 | else: 134 | # YOUR HW2 CODE HERE 135 | ts_mean = None 136 | ts_logstd = None 137 | return (ts_mean, ts_logstd) 138 | 139 | #============================================================================================# 140 | # Actor Critic 141 | #============================================================================================# 142 | 143 | class Agent(object): 144 | def __init__(self, neural_network_args, sample_trajectory_args, estimate_advantage_args): 145 | super(Agent, self).__init__() 146 | self.ob_dim = neural_network_args['ob_dim'] 147 | self.ac_dim = neural_network_args['ac_dim'] 148 | self.discrete = neural_network_args['discrete'] 149 | self.hidden_size = neural_network_args['size'] 150 | self.critic_n_layers = neural_network_args['critic_n_layers'] 151 | self.actor_learning_rate = neural_network_args['actor_learning_rate'] 152 | self.critic_learning_rate = neural_network_args['critic_learning_rate'] 153 | self.num_target_updates = neural_network_args['num_target_updates'] 154 | self.num_grad_steps_per_target_update = neural_network_args['num_grad_steps_per_target_update'] 155 | 156 | self.animate = sample_trajectory_args['animate'] 157 | self.max_path_length = sample_trajectory_args['max_path_length'] 158 | self.min_timesteps_per_batch = sample_trajectory_args['min_timesteps_per_batch'] 159 | 160 | self.gamma = estimate_advantage_args['gamma'] 161 | self.normalize_advantages = estimate_advantage_args['normalize_advantages'] 162 | 163 | self.policy_net = PolicyNet(neural_network_args) 164 | self.value_net = build_mlp(self.ob_dim, 1, self.critic_n_layers, self.hidden_size) 165 | 166 | self.actor_optimizer = optim.Adam(self.policy_net.parameters(), lr=self.actor_learning_rate) 167 | self.critic_optimizer = optim.Adam(self.value_net.parameters(), lr=self.critic_learning_rate) 168 | 169 | def sample_action(self, ob_no): 170 | """ 171 | Build the method used for sampling action from the policy distribution 172 | 173 | arguments: 174 | ob_no: (batch_size, self.ob_dim) 175 | 176 | returns: 177 | sampled_ac: 178 | if discrete: (batch_size) 179 | if continuous: (batch_size, self.ac_dim) 180 | 181 | Hint: for the continuous case, use the reparameterization trick: 182 | The output from a Gaussian distribution with mean 'mu' and std 'sigma' is 183 | 184 | mu + sigma * z, z ~ N(0, I) 185 | 186 | This reduces the problem to just sampling z. (Hint: use torch.normal!) 
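For reference, a minimal sketch of the continuous case (one possible
implementation following the hint above, not necessarily the expected one):

    z = torch.normal(torch.zeros_like(ts_mean), torch.ones_like(ts_mean))
    ts_sampled_ac = ts_mean + torch.exp(ts_logstd) * z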
187 | """ 188 | ts_ob_no = torch.from_numpy(ob_no).float() 189 | 190 | raise NotImplementedError 191 | if self.discrete: 192 | ts_logits_na = self.policy_net(ts_ob_no) 193 | # YOUR HW2 CODE HERE 194 | ts_probs = None 195 | ts_sampled_ac = None 196 | else: 197 | ts_mean, ts_logstd = self.policy_net(ts_ob_no) 198 | # YOUR HW2 CODE HERE 199 | ts_sampled_ac = None 200 | 201 | sampled_ac = ts_sampled_ac.numpy() 202 | 203 | return sampled_ac 204 | 205 | def get_log_prob(self, policy_parameters, ts_ac_na): 206 | """ 207 | Build the method used for computing the log probability of a set of actions 208 | that were actually taken according to the policy 209 | 210 | arguments: 211 | policy_parameters 212 | if discrete: logits of a categorical distribution over actions 213 | ts_logits_na: (batch_size, self.ac_dim) 214 | if continuous: (mean, log_std) of a Gaussian distribution over actions 215 | ts_mean: (batch_size, self.ac_dim) 216 | ts_logstd: (self.ac_dim,) 217 | 218 | ts_ac_na: (batch_size, self.ac_dim) 219 | 220 | returns: 221 | ts_logprob_n: (batch_size) 222 | 223 | Hint: 224 | For the discrete case, use the log probability under a categorical distribution. 225 | For the continuous case, use the log probability under a multivariate gaussian. 226 | """ 227 | raise NotImplementedError 228 | if self.discrete: 229 | ts_logits_na = policy_parameters 230 | # YOUR HW2 CODE HERE 231 | ts_logprob_n = None 232 | else: 233 | ts_mean, ts_logstd = policy_parameters 234 | # YOUR HW2 CODE HERE 235 | ts_logprob_n = None 236 | 237 | return ts_logprob_n 238 | 239 | def sample_trajectories(self, itr, env): 240 | # Collect paths until we have enough timesteps 241 | timesteps_this_batch = 0 242 | paths = [] 243 | while True: 244 | animate_this_episode=(len(paths)==0 and (itr % 10 == 0) and self.animate) 245 | path = self.sample_trajectory(env, animate_this_episode) 246 | paths.append(path) 247 | timesteps_this_batch += pathlength(path) 248 | if timesteps_this_batch > self.min_timesteps_per_batch: 249 | break 250 | return paths, timesteps_this_batch 251 | 252 | def sample_trajectory(self, env, animate_this_episode): 253 | ob = env.reset() 254 | obs, acs, rewards, next_obs, terminals = [], [], [], [], [] 255 | steps = 0 256 | while True: 257 | if animate_this_episode: 258 | env.render() 259 | time.sleep(0.1) 260 | obs.append(ob) 261 | raise NotImplementedError 262 | ac = None # YOUR HW2 CODE HERE 263 | ac = ac[0] 264 | acs.append(ac) 265 | ob, rew, done, _ = env.step(ac) 266 | # add the observation after taking a step to next_obs 267 | # YOUR CODE HERE 268 | raise NotImplementedError 269 | rewards.append(rew) 270 | steps += 1 271 | # If the episode ended, the corresponding terminal value is 1 272 | # otherwise, it is 0 273 | # YOUR CODE HERE 274 | if done or steps > self.max_path_length: 275 | raise NotImplementedError 276 | break 277 | else: 278 | raise NotImplementedError 279 | path = {"observation" : np.array(obs, dtype=np.float32), 280 | "reward" : np.array(rewards, dtype=np.float32), 281 | "action" : np.array(acs, dtype=np.float32), 282 | "next_observation": np.array(next_obs, dtype=np.float32), 283 | "terminal": np.array(terminals, dtype=np.float32)} 284 | return path 285 | 286 | def estimate_advantage(self, ob_no, next_ob_no, re_n, terminal_n): 287 | """ 288 | Estimates the advantage function value for each timestep. 
289 | 
290 |         let sum_of_path_lengths be the sum of the lengths of the paths sampled from
291 |             Agent.sample_trajectories
292 | 
293 |         arguments:
294 |             ob_no: shape: (sum_of_path_lengths, ob_dim)
295 |             next_ob_no: shape: (sum_of_path_lengths, ob_dim). The observation after taking one step forward
296 |             re_n: length: sum_of_path_lengths. Each element in re_n is a scalar containing
297 |                 the reward for each timestep
298 |             terminal_n: length: sum_of_path_lengths. Each element in terminal_n is either 1 if the episode ended
299 |                 at that timestep or 0 if the episode did not end
300 | 
301 |         returns:
302 |             adv_n: shape: (sum_of_path_lengths). A single vector for the estimated
303 |                 advantages whose length is the sum of the lengths of the paths
304 |         """
305 |         # First, estimate the Q value as Q(s, a) = r(s, a) + gamma*V(s')
306 |         # To get the advantage, subtract V(s) to get A(s, a) = Q(s, a) - V(s)
307 |         # This requires calling the critic twice --- to obtain V(s') when calculating Q(s, a),
308 |         # and V(s) when subtracting the baseline
309 |         # Note: don't forget to use terminal_n to cut off the V(s') term when computing Q(s, a)
310 |         # otherwise the values will grow without bound.
311 |         # YOUR CODE HERE
312 |         raise NotImplementedError
313 |         adv_n = None
314 | 
315 |         if self.normalize_advantages:
316 |             raise NotImplementedError
317 |             adv_n = None # YOUR HW2 CODE HERE
318 |         return adv_n
319 | 
320 |     def update_critic(self, ob_no, next_ob_no, re_n, terminal_n):
321 |         """
322 |         Update the parameters of the critic.
323 | 
324 |         let sum_of_path_lengths be the sum of the lengths of the paths sampled from
325 |             Agent.sample_trajectories
326 |         let num_paths be the number of paths sampled from Agent.sample_trajectories
327 | 
328 |         arguments:
329 |             ob_no: shape: (sum_of_path_lengths, ob_dim)
330 |             next_ob_no: shape: (sum_of_path_lengths, ob_dim). The observation after taking one step forward
331 |             re_n: length: sum_of_path_lengths. Each element in re_n is a scalar containing
332 |                 the reward for each timestep
333 |             terminal_n: length: sum_of_path_lengths. Each element in terminal_n is either 1 if the episode ended
334 |                 at that timestep or 0 if the episode did not end
335 | 
336 |         returns:
337 |             nothing
338 |         """
339 |         # Use bootstrapped target values to update the critic
340 |         # Compute the target values r(s, a) + gamma*V(s') by calling the critic to compute V(s')
341 |         # In total, take n=self.num_grad_steps_per_target_update*self.num_target_updates gradient update steps
342 |         # Every self.num_grad_steps_per_target_update steps, recompute the target values
343 |         # by evaluating V(s') on the updated critic
344 |         # Note: don't forget to use terminal_n to cut off the V(s') term when computing the target
345 |         # otherwise the values will grow without bound.
346 |         # YOUR CODE HERE
347 |         raise NotImplementedError
348 | 
349 |     def update_actor(self, ob_no, ac_na, adv_n):
350 |         """
351 |         Update the parameters of the policy.
352 | 
353 |         arguments:
354 |             ob_no: shape: (sum_of_path_lengths, ob_dim)
355 |             ac_na: shape: (sum_of_path_lengths).
356 |             adv_n: shape: (sum_of_path_lengths). A single vector for the estimated
357 |                 advantages whose length is the sum of the lengths of the paths
358 | 
359 |         returns:
360 |             nothing
361 | 
362 |         """
363 |         # convert numpy array to pytorch tensor
364 |         ts_ob_no, ts_ac_na, ts_adv_n = map(lambda x: torch.from_numpy(x), [ob_no, ac_na, adv_n])
365 | 
366 |         # The policy takes in an observation and produces a distribution over the action space
367 |         policy_parameters = self.policy_net(ts_ob_no)
368 | 
369 |         # We can compute the logprob of the actions that were actually taken by the policy
370 |         # This is used in the loss function.
371 |         ts_logprob_n = self.get_log_prob(policy_parameters, ts_ac_na)
372 | 
373 |         # clean the gradient for model parameters
374 |         self.actor_optimizer.zero_grad()
375 | 
376 |         actor_loss = - (ts_logprob_n * ts_adv_n).mean()
377 |         actor_loss.backward()
378 | 
379 |         self.actor_optimizer.step()
380 | 
381 | def train_AC(
382 |         exp_name,
383 |         env_name,
384 |         n_iter,
385 |         gamma,
386 |         min_timesteps_per_batch,
387 |         max_path_length,
388 |         actor_learning_rate,
389 |         critic_learning_rate,
390 |         num_target_updates,
391 |         num_grad_steps_per_target_update,
392 |         animate,
393 |         logdir,
394 |         normalize_advantages,
395 |         seed,
396 |         actor_n_layers,
397 |         critic_n_layers,
398 |         size):
399 | 
400 |     start = time.time()
401 | 
402 |     #========================================================================================#
403 |     # Set Up Logger
404 |     #========================================================================================#
405 |     setup_logger(logdir, locals())
406 | 
407 |     #========================================================================================#
408 |     # Set Up Env
409 |     #========================================================================================#
410 | 
411 |     # Make the gym environment
412 |     env = gym.make(env_name)
413 | 
414 |     # Set random seeds
415 |     torch.manual_seed(seed)
416 |     np.random.seed(seed)
417 |     env.seed(seed)
418 | 
419 |     # Maximum length for episodes
420 |     max_path_length = max_path_length or env.spec.max_episode_steps
421 | 
422 |     # Is this env continuous, or discrete?
423 |     discrete = isinstance(env.action_space, gym.spaces.Discrete)
424 | 
425 | 
426 |     # Observation and action sizes
427 |     ob_dim = env.observation_space.shape[0]
428 |     ac_dim = env.action_space.n if discrete else env.action_space.shape[0]
429 | 
430 |     #========================================================================================#
431 |     # Initialize Agent
432 |     #========================================================================================#
433 |     neural_network_args = {
434 |         'actor_n_layers': actor_n_layers,
435 |         'critic_n_layers': critic_n_layers,
436 |         'ob_dim': ob_dim,
437 |         'ac_dim': ac_dim,
438 |         'discrete': discrete,
439 |         'size': size,
440 |         'actor_learning_rate': actor_learning_rate,
441 |         'critic_learning_rate': critic_learning_rate,
442 |         'num_target_updates': num_target_updates,
443 |         'num_grad_steps_per_target_update': num_grad_steps_per_target_update,
444 |         }
445 | 
446 |     sample_trajectory_args = {
447 |         'animate': animate,
448 |         'max_path_length': max_path_length,
449 |         'min_timesteps_per_batch': min_timesteps_per_batch,
450 |         }
451 | 
452 |     estimate_advantage_args = {
453 |         'gamma': gamma,
454 |         'normalize_advantages': normalize_advantages,
455 |         }
456 | 
457 |     agent = Agent(neural_network_args, sample_trajectory_args, estimate_advantage_args)
458 | 
459 |     #========================================================================================#
460 |     # Training Loop
461 |     #========================================================================================#
462 | 
463 |     total_timesteps = 0
464 |     for itr in range(n_iter):
465 |         print("********** Iteration %i ************"%itr)
466 | 
467 |         with torch.no_grad(): # use torch.no_grad to disable the gradient calculation
468 |             paths, timesteps_this_batch = agent.sample_trajectories(itr, env)
469 |         total_timesteps += timesteps_this_batch
470 | 
471 |         # Build arrays for observation, action for the policy gradient update by concatenating
472 |         # across paths
473 |         ob_no = np.concatenate([path["observation"] for path in paths])
474 |         ac_na = np.concatenate([path["action"] for path in paths])
475 |         re_n = np.concatenate([path["reward"] for path in paths])
476 |         next_ob_no = np.concatenate([path["next_observation"] for path in paths])
477 |         terminal_n = np.concatenate([path["terminal"] for path in paths])
478 | 
479 |         # Call the agent's methods to:
480 |         # (1) update the critic, by calling agent.update_critic
481 |         # (2) use the updated critic to compute the advantage, by calling agent.estimate_advantage
482 |         # (3) use the estimated advantage values to update the actor, by calling agent.update_actor
483 |         # YOUR CODE HERE
484 |         raise NotImplementedError
485 | 
486 |         # Log diagnostics
487 |         returns = [path["reward"].sum() for path in paths]
488 |         ep_lengths = [pathlength(path) for path in paths]
489 |         logz.log_tabular("Time", time.time() - start)
490 |         logz.log_tabular("Iteration", itr)
491 |         logz.log_tabular("AverageReturn", np.mean(returns))
492 |         logz.log_tabular("StdReturn", np.std(returns))
493 |         logz.log_tabular("MaxReturn", np.max(returns))
494 |         logz.log_tabular("MinReturn", np.min(returns))
495 |         logz.log_tabular("EpLenMean", np.mean(ep_lengths))
496 |         logz.log_tabular("EpLenStd", np.std(ep_lengths))
497 |         logz.log_tabular("TimestepsThisBatch", timesteps_this_batch)
498 |         logz.log_tabular("TimestepsSoFar", total_timesteps)
499 |         logz.dump_tabular()
500 |         logz.save_pytorch_model(agent)
501 | 
502 | 
503 | def main():
504 |     import argparse
505 |     parser = argparse.ArgumentParser()
506 |     parser.add_argument('env_name', type=str)
507 |     parser.add_argument('--exp_name', type=str, default='vac')
508 |     parser.add_argument('--render', action='store_true')
509 |     parser.add_argument('--discount', type=float, default=1.0)
510 |     parser.add_argument('--n_iter', '-n', type=int, default=100)
511 |     parser.add_argument('--batch_size', '-b', type=int, default=1000)
512 |     parser.add_argument('--ep_len', '-ep', type=float, default=-1.)
513 |     parser.add_argument('--actor_learning_rate', '-lr', type=float, default=5e-3)
514 |     parser.add_argument('--critic_learning_rate', '-clr', type=float)
515 |     parser.add_argument('--dont_normalize_advantages', '-dna', action='store_true')
516 |     parser.add_argument('--num_target_updates', '-ntu', type=int, default=10)
517 |     parser.add_argument('--num_grad_steps_per_target_update', '-ngsptu', type=int, default=10)
518 |     parser.add_argument('--seed', type=int, default=1)
519 |     parser.add_argument('--n_experiments', '-e', type=int, default=1)
520 |     parser.add_argument('--actor_n_layers', '-l', type=int, default=2)
521 |     parser.add_argument('--critic_n_layers', '-cl', type=int)
522 |     parser.add_argument('--size', '-s', type=int, default=64)
523 |     args = parser.parse_args()
524 | 
525 |     if not(os.path.exists('data')):
526 |         os.makedirs('data')
527 |     logdir = 'ac_' + args.exp_name + '_' + args.env_name + '_' + time.strftime("%d-%m-%Y_%H-%M-%S")
528 |     logdir = os.path.join('data', logdir)
529 |     if not(os.path.exists(logdir)):
530 |         os.makedirs(logdir)
531 | 
532 |     max_path_length = args.ep_len if args.ep_len > 0 else None
533 | 
534 |     if not args.critic_learning_rate:
535 |         args.critic_learning_rate = args.actor_learning_rate
536 | 
537 |     if not args.critic_n_layers:
538 |         args.critic_n_layers = args.actor_n_layers
539 | 
540 |     processes = []
541 | 
542 |     for e in range(args.n_experiments):
543 |         seed = args.seed + 10*e
544 |         print('Running experiment with seed %d'%seed)
545 | 
546 |         def train_func():
547 |             train_AC(
548 |                 exp_name=args.exp_name,
549 |                 env_name=args.env_name,
550 |                 n_iter=args.n_iter,
551 |                 gamma=args.discount,
552 |                 min_timesteps_per_batch=args.batch_size,
553 |                 max_path_length=max_path_length,
554 |                 actor_learning_rate=args.actor_learning_rate,
555 |                 critic_learning_rate=args.critic_learning_rate,
556 |                 num_target_updates=args.num_target_updates,
557 |                 num_grad_steps_per_target_update=args.num_grad_steps_per_target_update,
558 |                 animate=args.render,
559 |                 logdir=os.path.join(logdir,'%d'%seed),
560 |                 normalize_advantages=not(args.dont_normalize_advantages),
561 |                 seed=seed,
562 |                 actor_n_layers=args.actor_n_layers,
563 |                 critic_n_layers=args.critic_n_layers,
564 |                 size=args.size
565 |                 )
566 |         p = Process(target=train_func, args=tuple())
567 |         p.start()
568 |         processes.append(p)
569 |         # if you uncomment the line below, then the loop will block
570 |         # until this process finishes
571 |         # p.join()
572 | 
573 |     for p in processes:
574 |         p.join()
575 | 
576 | if __name__ == "__main__":
577 |     main()
578 | 
--------------------------------------------------------------------------------
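
For reference, here is a minimal sketch of how the remaining placeholders in the file
above might be filled in. It is one possible implementation, not the official solution:
the ts_* names stand for assumed tensor conversions of the corresponding numpy arrays,
and only standard torch tensor operations already available in the file are used.

    # Agent.update_critic: recompute the bootstrapped target every
    # self.num_grad_steps_per_target_update gradient steps, for
    # self.num_target_updates rounds in total.
    for _ in range(self.num_target_updates):
        with torch.no_grad():
            ts_v_next = self.value_net(ts_next_ob_no).squeeze(-1)
            ts_target = ts_re_n + self.gamma * ts_v_next * (1 - ts_terminal_n)
        for _ in range(self.num_grad_steps_per_target_update):
            ts_v = self.value_net(ts_ob_no).squeeze(-1)
            critic_loss = ((ts_v - ts_target) ** 2).mean()
            self.critic_optimizer.zero_grad()
            critic_loss.backward()
            self.critic_optimizer.step()

    # Training-loop step marked "YOUR CODE HERE" inside train_AC: update the
    # critic, use it to estimate advantages, then update the actor.
    agent.update_critic(ob_no, next_ob_no, re_n, terminal_n)
    adv_n = agent.estimate_advantage(ob_no, next_ob_no, re_n, terminal_n)
    agent.update_actor(ob_no, ac_na, adv_n)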