├── LICENSE
├── README.md
├── hw1
    ├── README.md
    ├── demo.bash
    ├── experts
    │   ├── Ant-v2.pkl
    │   ├── HalfCheetah-v2.pkl
    │   ├── Hopper-v2.pkl
    │   ├── Humanoid-v2.pkl
    │   ├── Reacher-v2.pkl
    │   └── Walker2d-v2.pkl
    ├── hw1_instructions.pdf
    ├── load_policy.py
    ├── requirements.txt
    └── run_expert.py
├── hw2
    ├── README.md
    ├── hw2_instructions.pdf
    ├── hw2_instructions.tex
    ├── logz.py
    ├── lunar_lander.py
    ├── plot.py
    ├── requirements.txt
    └── train_pg_f18.py
└── hw3
    ├── README.md
    ├── atari_wrappers.py
    ├── dqn.py
    ├── dqn_utils.py
    ├── hw3_instructions.pdf
    ├── logz.py
    ├── lunar_lander.py
    ├── plot.py
    ├── requirements.txt
    ├── run_dqn_atari.py
    ├── run_dqn_lander.py
    ├── run_dqn_ram.py
    └── train_ac_f18.py


/LICENSE:
--------------------------------------------------------------------------------
 1 | MIT License
 2 | 
 3 | Copyright (c) 2018 KuNya
 4 | 
 5 | Permission is hereby granted, free of charge, to any person obtaining a copy
 6 | of this software and associated documentation files (the "Software"), to deal
 7 | in the Software without restriction, including without limitation the rights
 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
 9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 | 
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 | 
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 | 


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
 1 | # Berkeley DeepRLcourse Homework in PyTorch
 2 | ## Introduction
 3 | 
 4 | In recent years, with the booming of deep learning, reinforcement learning has made great progress in solving complex tasks and has attracted more and more people`s attention. Also, many researchers start applying reinforcement learning algorithms to solve the problem in other fields (such as Natural Language Processing).
 5 | 
 6 | So, there is a big need for learning those classic reinforcement learning algorithms in an easy way.
 7 | 
 8 | As beginners in reinforcement learning, we found that [CS 294-112](http://rail.eecs.berkeley.edu/deeprlcourse/) at UC Berkeley is a great course where we can learn a lot of classic and advanced reinforcement learning algorithms.
 9 | 
10 | As the saying goes, “talk is cheap, show me your code.” It is very important to write algorithm in code correctly, instead of just knowing the algorithm. Luckily, CS 294-112 also provides programming assignments for those reinforcement learning algorithms. While, these assignments are mainly implemented in **TensorFlow**, which might be bad news for people who are more familiar with other deep learning frameworks.
11 | 
12 | For the reasons above, we modified those assignments (for Fall 2018) and implemented in **PyTorch**, which is a framework that we often use in our research. 
13 | 
14 | Moreover, we also provide [solutions](https://github.com/KuNyaa/berkeleydeeprlcourse-homework-pytorch-solution) to these assignments, and you can use them when you get stuck.
15 | 
16 | Hope you will enjoy it : )
17 | 
18 | 
19 | 
20 | ## What can you learn from it?
21 | 
22 | - ### HW1: Imitation Learning
23 | 
24 |   In this assignment, you will implement the **Behavioral Cloning** and **DAgger** algorithm. 
25 | 
26 |   In the experiments, you will see the case where Behavioral Cloning work well, and the case where DAgger can learn a better policy than Behavioral Cloning.
27 | 
28 | - ### HW2: Policy Gradients
29 | 
30 |   In this assignment, you will implement the **Policy Gradients** algorithm.
31 | 
32 |   In the experiments, you will compare the difference between gradient estimators(full-trajectory case and reward-to-go case) and learn how batch size and learning rate can affect the algorithm performance. Moreover, you will implement a **neural network baseline** to help the gradient estimator to reduce variance and assist the agent to learn a better policy.
33 | 
34 | - ### HW3: Q-Learning and Actor-Critic
35 | 
36 |   In this assignment, you will implement the **Deep Q-learning** and **Actor-Critic** algorithm.
37 | 
38 |   In the Deep Q-learning part, you will implement **vanilla DQN** and **double DQN** and compare their performance in different atari game environments. Also, you will experiment how hyperparameters affect the final results.
39 | 
40 |   In the Actor-Critic part, you will implement a **Actor-Critic** model based on your Policy Gradients implementations in HW2. Additionally, you will learn how to tune the hyperparameters for the Actor-Critic model, and make it outperform your previous Policy Gradients model which is equiped with reward-to-go gradient estimator and neural network baseline.
41 | 
42 | - ### HW4: Model-Based RL
43 | 
44 |   ###### Coming Soon......
45 | 
46 | - ### HW5: Advanced Topics
47 | 
48 |   ###### Coming Soon......
49 | 
50 | 
51 | 
52 | ## How can you use it?
53 | 
54 | #### If you want to learn：
55 | 
56 | - ##### The whole course:
57 | 
58 |   You can just follow the course syllabus, and use this as programming assignments.
59 | 
60 | - ##### Policy Optimization style RL algorithm:
61 | 
62 |   You may want to finish the HW2 and the Actor-Critic part in HW3, and read relative material from the course website.
63 | 
64 | - ##### Dynamic Programming style RL algorithm:
65 | 
66 |   You may want to finish the Deep Q-learning part in HW3, and read relative material from the course website.
67 | 
68 | #### Or you can just use it as you like : )


--------------------------------------------------------------------------------
/hw1/README.md:
--------------------------------------------------------------------------------
 1 | # CS294-112 HW 1: Imitation Learning
 2 | 
 3 | Modification:
 4 | 
 5 | We implemented the forward pass of the expert policy network in numpy, and you can use any deep learning framework you like to write this assignment. 
 6 | 
 7 | ------
 8 | 
 9 | Dependencies:
10 | 
11 |  * Python **3.5**
12 |  * Numpy
13 |  * MuJoCo version **1.50** and mujoco-py **1.50.1.56**
14 |  * OpenAI Gym version **0.10.5**
15 | 
16 | Once Python **3.5** is installed, you can install the remaining dependencies using `pip install -r requirements.txt`.
17 | 
18 | **Note**: MuJoCo versions until 1.5 do not support NVMe disks therefore won't be compatible with recent Mac machines.
19 | There is a request for OpenAI to support it that can be followed [here](https://github.com/openai/gym/issues/638).
20 | 
21 | 
22 | 
23 | The only file that you need to look at is `run_expert.py`, which is code to load up an expert policy, run a specified number of roll-outs, and save out data.
24 | 
25 | In `experts/`, the provided expert policies are:
26 | * Ant-v2.pkl
27 | * HalfCheetah-v2.pkl
28 | * Hopper-v2.pkl
29 | * Humanoid-v2.pkl
30 | * Reacher-v2.pkl
31 | * Walker2d-v2.pkl
32 | 
33 | The name of the pickle file corresponds to the name of the gym environment.
34 | 
35 | 
36 | 
37 | See the [HW1 PDF](./hw1_instructions.pdf) for further instructions.


--------------------------------------------------------------------------------
/hw1/demo.bash:
--------------------------------------------------------------------------------
1 | #!/bin/bash
2 | set -eux
3 | for e in Hopper-v2 Ant-v2 HalfCheetah-v2 Humanoid-v2 Reacher-v2 Walker2d-v2
4 | do
5 |     python run_expert.py experts/$e.pkl $e --render --num_rollouts=1
6 | done
7 | 


--------------------------------------------------------------------------------
/hw1/experts/Ant-v2.pkl:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/KuNyaa/berkeleydeeprlcourse-homework-pytorch/b7cb9fb3479b94c4e31fca32b55f7ce2586cc81d/hw1/experts/Ant-v2.pkl


--------------------------------------------------------------------------------
/hw1/experts/HalfCheetah-v2.pkl:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/KuNyaa/berkeleydeeprlcourse-homework-pytorch/b7cb9fb3479b94c4e31fca32b55f7ce2586cc81d/hw1/experts/HalfCheetah-v2.pkl


--------------------------------------------------------------------------------
/hw1/experts/Hopper-v2.pkl:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/KuNyaa/berkeleydeeprlcourse-homework-pytorch/b7cb9fb3479b94c4e31fca32b55f7ce2586cc81d/hw1/experts/Hopper-v2.pkl


--------------------------------------------------------------------------------
/hw1/experts/Humanoid-v2.pkl:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/KuNyaa/berkeleydeeprlcourse-homework-pytorch/b7cb9fb3479b94c4e31fca32b55f7ce2586cc81d/hw1/experts/Humanoid-v2.pkl


--------------------------------------------------------------------------------
/hw1/experts/Reacher-v2.pkl:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/KuNyaa/berkeleydeeprlcourse-homework-pytorch/b7cb9fb3479b94c4e31fca32b55f7ce2586cc81d/hw1/experts/Reacher-v2.pkl


--------------------------------------------------------------------------------
/hw1/experts/Walker2d-v2.pkl:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/KuNyaa/berkeleydeeprlcourse-homework-pytorch/b7cb9fb3479b94c4e31fca32b55f7ce2586cc81d/hw1/experts/Walker2d-v2.pkl


--------------------------------------------------------------------------------
/hw1/hw1_instructions.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/KuNyaa/berkeleydeeprlcourse-homework-pytorch/b7cb9fb3479b94c4e31fca32b55f7ce2586cc81d/hw1/hw1_instructions.pdf


--------------------------------------------------------------------------------
/hw1/load_policy.py:
--------------------------------------------------------------------------------
 1 | import pickle
 2 | import numpy as np
 3 | from functools import reduce
 4 | 
 5 | 
 6 | def load_policy(filename):
 7 |     def read_layer(l):
 8 |         assert list(l.keys()) == ['AffineLayer']
 9 |         assert sorted(l['AffineLayer'].keys()) == ['W', 'b']
10 |         W, b = l['AffineLayer']['W'].astype(np.float32), l['AffineLayer']['b'].astype(np.float32)
11 |         return lambda x: np.matmul(x, W) + b
12 |         
13 |     def build_nonlin_fn(nonlin_type):
14 |         if nonlin_type == 'lrelu':
15 |             leak = 0.01 # openai/imitation nn.py:233
16 |             return lambda x: 0.5 * (1 + leak) * x + 0.5 * (1 - leak) * np.abs(x)
17 |         elif nonlin_type == 'tanh':
18 |             return lambda x: np.tanh(x)
19 |         else:
20 |             raise NotImplementedError(nonlin_type)
21 |     
22 |     with open(filename, 'rb') as f:
23 |         data = pickle.loads(f.read())
24 | 
25 |     # assert len(data.keys()) == 2
26 |     nonlin_type = data['nonlin_type']
27 |     nonlin_fn = build_nonlin_fn(nonlin_type)
28 |     policy_type = [k for k in data.keys() if k != 'nonlin_type'][0]
29 | 
30 |     assert policy_type == 'GaussianPolicy', 'Policy type {} not supported'.format(policy_type)
31 |     policy_params = data[policy_type]
32 | 
33 |     assert set(policy_params.keys()) == {'logstdevs_1_Da', 'hidden', 'obsnorm', 'out'}
34 |     
35 |     # Build observation normalization layer
36 |     assert list(policy_params['obsnorm'].keys()) == ['Standardizer']
37 |     obsnorm_mean = policy_params['obsnorm']['Standardizer']['mean_1_D']
38 |     obsnorm_meansq = policy_params['obsnorm']['Standardizer']['meansq_1_D']
39 |     obsnorm_stdev = np.sqrt(np.maximum(0, obsnorm_meansq - np.square(obsnorm_mean)))
40 |     #print('obs', obsnorm_mean.shape, obsnorm_stdev.shape)
41 | 
42 |     
43 |     # Build hidden layers
44 |     assert list(policy_params['hidden'].keys()) == ['FeedforwardNet']
45 |     layer_params = policy_params['hidden']['FeedforwardNet']
46 |     layers = []
47 |     for layer_name in sorted(layer_params.keys()):
48 |         l = layer_params[layer_name]
49 |         fc_layer = read_layer(l)
50 |         layers += [fc_layer, nonlin_fn]
51 | 
52 |     # Build output layer
53 |     fc_layer = read_layer(policy_params['out'])
54 |     layers += [fc_layer]
55 |     layers_forward = lambda inp: reduce(lambda x, fn: fn(x), [inp] + layers)
56 |     
57 |     
58 |     def forward_pass(obs):
59 |         ''' Build the forward pass for policy net.
60 | 
61 |         Input: batched observation. (shape: [batch_size, obs_dim])
62 | 
63 |         Output: batched action. (shape: [batch_size, action_dim])
64 |         '''
65 |         obs = obs.astype(np.float32)
66 |         normed_obs = (obs - obsnorm_mean) / (obsnorm_stdev + 1e-6) # 1e-6 constant from Standardizer class in nn.py:409 in openai/imitation
67 |         output = layers_forward(normed_obs.astype(np.float32))
68 | 
69 |         return output
70 | 
71 |     return forward_pass
72 | 


--------------------------------------------------------------------------------
/hw1/requirements.txt:
--------------------------------------------------------------------------------
1 | gym==0.10.5
2 | mujoco-py==1.50.1.56
3 | numpy
4 | seaborn
5 | 


--------------------------------------------------------------------------------
/hw1/run_expert.py:
--------------------------------------------------------------------------------
 1 | #!/usr/bin/env python
 2 | """
 3 | Code to load an expert policy and generate roll-out data for behavioral cloning.
 4 | Example usage:
 5 |     python run_expert.py experts/Humanoid-v1.pkl Humanoid-v1 --render \
 6 |             --num_rollouts 20
 7 | 
 8 | Modified from the script written by Jonathan Ho (hoj@openai.com)
 9 | """
10 | 
11 | import os
12 | import argparse
13 | import pickle
14 | import numpy as np
15 | import gym
16 | import load_policy
17 | 
18 | def main():
19 |     parser = argparse.ArgumentParser()
20 |     parser.add_argument('expert_policy_file', type=str)
21 |     parser.add_argument('envname', type=str)
22 |     parser.add_argument('--render', action='store_true')
23 |     parser.add_argument("--max_timesteps", type=int)
24 |     parser.add_argument('--num_rollouts', type=int, default=20,
25 |                         help='Number of expert roll outs')
26 |     args = parser.parse_args()
27 | 
28 |     print('loading and building expert policy')
29 |     policy_net = load_policy.load_policy(args.expert_policy_file)
30 |     print('loaded and built')
31 | 
32 |     env = gym.make(args.envname)
33 |     max_steps = args.max_timesteps or env.spec.timestep_limit
34 | 
35 |     returns = []
36 |     observations = []
37 |     actions = []
38 |     for i in range(args.num_rollouts):
39 |         print('iter', i)
40 |         obs = env.reset()
41 |         done = False
42 |         totalr = 0.
43 |         steps = 0
44 |         while not done:
45 |             action = policy_net(obs[None, :])
46 |             observations.append(obs)
47 |             actions.append(action)
48 |             obs, r, done, _ = env.step(action)
49 |             totalr += r
50 |             steps += 1
51 |             if args.render:
52 |                 env.render()
53 |             if steps % 100 == 0: print("%i/%i"%(steps, max_steps))
54 |             if steps >= max_steps:
55 |                 break
56 |         returns.append(totalr)
57 | 
58 |     print('returns', returns)
59 |     print('mean return', np.mean(returns))
60 |     print('std of return', np.std(returns))
61 |     
62 |     expert_data = {'observations': np.array(observations),
63 |                    'actions': np.array(actions)}
64 | 
65 |     if not os.path.exists('expert_data'):
66 |         os.makedirs('expert_data')
67 |     
68 |     with open(os.path.join('expert_data', args.envname + '.pkl'), 'wb') as f:
69 |         pickle.dump(expert_data, f, pickle.HIGHEST_PROTOCOL)
70 | 
71 | if __name__ == '__main__':
72 |     main()
73 | 


--------------------------------------------------------------------------------
/hw2/README.md:
--------------------------------------------------------------------------------
 1 | # CS294-112 HW 2: Policy Gradient
 2 | 
 3 | Modification:
 4 | 
 5 | In general, we followed the code structure of the original version and modified the neural network part to pytorch. 
 6 | 
 7 | Because of the different between the static graphs framework and the dynamic graphs framework, we merged and added some code in `train_pg_f18.py`. We also adapted the instructions of this assignment for pytorch. (Thanks to CS294-112 for offering ![equation](http://latex.codecogs.com/gif.latex?\LaTeX) code for the instructions) And you can just follow the pytorch version instructions we wrote.
 8 | 
 9 | ------
10 | 
11 | Dependencies:
12 | 
13 |  * Python **3.5**
14 |  * Numpy version **1.14.5**
15 |  * Pytorch version **0.4.0**
16 |  * MuJoCo version **1.50** and mujoco-py **1.50.1.56**
17 |  * OpenAI Gym version **0.10.5**
18 |  * seaborn
19 |  * Box2D==**2.3.2**
20 | 
21 | Before doing anything, first replace `gym/envs/box2d/lunar_lander.py` with the provided `lunar_lander.py` file.
22 | 
23 | The only file that you need to look at is `train_pg_f18.py`, which you will implement.
24 | 
25 | See the [HW2 PDF](./hw2_instructions.pdf) for further instructions.
26 | 


--------------------------------------------------------------------------------
/hw2/hw2_instructions.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/KuNyaa/berkeleydeeprlcourse-homework-pytorch/b7cb9fb3479b94c4e31fca32b55f7ce2586cc81d/hw2/hw2_instructions.pdf


--------------------------------------------------------------------------------
/hw2/hw2_instructions.tex:
--------------------------------------------------------------------------------
  1 | \documentclass[12pt]{article}
  2 | \usepackage{fullpage}
  3 | \usepackage{url}
  4 | \usepackage{amsmath}
  5 | \usepackage{amsfonts}
  6 | \usepackage{algorithm}
  7 | \usepackage{algorithmic}
  8 | \usepackage{graphicx}
  9 | \usepackage{hyperref}
 10 | \usepackage{color}
 11 | \usepackage{listings}
 12 | \usepackage{verbatim}
 13 | \usepackage{enumitem}
 14 | \usepackage[parfill]{parskip}
 15 | 
 16 | \newcommand{\xb}{\mathbf{x}}
 17 | \newcommand{\yb}{\mathbf{y}}
 18 | \newcommand{\wb}{\mathbf{w}}
 19 | \newcommand{\Xb}{\mathbf{X}}
 20 | \newcommand{\Yb}{\mathbf{Y}}
 21 | \newcommand{\tr}{^T}
 22 | \newcommand{\hb}{\mathbf{h}}
 23 | \newcommand{\Hb}{\mathbf{H}}
 24 | 
 25 | \newcommand{\cmt}[1]{{\footnotesize\textcolor{red}{#1}}}
 26 | \newcommand{\todo}[1]{\cmt{TO-DO: #1}}
 27 | 
 28 | \title{CS294-112 Deep Reinforcement Learning HW2: \\ Policy Gradients\\
 29 | \textbf{Pytorch Version}}
 30 | 
 31 | \author{
 32 | }
 33 | 
 34 | \date{}
 35 | 
 36 | \usepackage{courier}
 37 |  
 38 | \definecolor{codegreen}{rgb}{0,0.6,0}
 39 | \definecolor{codegray}{rgb}{0.5,0.5,0.5}
 40 | \definecolor{codepurple}{rgb}{0.58,0,0.82}
 41 | \definecolor{backcolour}{rgb}{0.95,0.95,0.92}
 42 | 
 43 | \lstdefinestyle{mystyle}{
 44 |     backgroundcolor=\color{backcolour},   
 45 |     commentstyle=\color{codegreen},
 46 |     keywordstyle=\color{magenta},
 47 |     numberstyle=\tiny\color{codegray},
 48 |     stringstyle=\color{codepurple},
 49 |     basicstyle=\footnotesize\ttfamily,
 50 |     breakatwhitespace=false,         
 51 |     breaklines=true,                 
 52 |     captionpos=b,                    
 53 |     keepspaces=true,                 
 54 |     %numbers=left,                    
 55 |     numbersep=5pt,                  
 56 |     showspaces=false,                
 57 |     showstringspaces=false,
 58 |     showtabs=false,                  
 59 |     tabsize=2
 60 | }
 61 | 
 62 | \lstset{style=mystyle}
 63 | 
 64 | \begin{document}
 65 | 
 66 | 
 67 | \maketitle
 68 | 
 69 | \section{Introduction}
 70 | The goal of this assignment is to experiment with policy gradient and its variants, including variance reduction methods. Your goals will be to set up policy gradient for both continuous and discrete environments and experiment with variance reduction tricks, including implementing reward-to-go and neural network baselines.
 71 | 
 72 | \section{Review}
 73 | Recall that the reinforcement learning objective is to learn a $\theta^*$ that maximizes the objective function:
 74 | \begin{align} \label{objective}
 75 | J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta(\tau)} \left[r(\tau)\right]
 76 | \end{align}
 77 | where 
 78 | $$\pi_\theta(\tau) = p(s_1, a_1, ..., s_T, a_T) = p(s_1)\pi_\theta(a_1|s_1) \prod_{t=2}^T p(s_t | s_{t-1}, a_{t-1}) \pi_\theta(a_t | s_t)$$
 79 | and 
 80 | $$r(\tau) = r(s_1, a_1, ..., s_T, a_T) = \sum_{t=1}^T r(s_t, a_t).$$
 81 | 
 82 | The policy gradient approach is to directly take the gradient of this objective:
 83 | \begin{align}
 84 | \nabla_\theta J(\theta) &= \nabla_\theta \int \pi_\theta(\tau) r(\tau) d\tau \label{policygradientintegral} \\
 85 | &= \int \pi_\theta(\tau) \nabla_\theta \log \pi_\theta(\tau) r(\tau) d\tau. \label{scorefunctionpg}
 86 | \end{align}
 87 | In practice, the expectation over trajectories $\tau$ can be approximated from a batch of $N$ sampled trajectories:
 88 | \begin{align}
 89 | \nabla_\theta J(\theta) &\approx \frac{1}{N} \sum_{i=1}^N \nabla_\theta \log \pi_\theta(\tau_i) r(\tau_i) \\
 90 | &= \frac{1}{N} \sum_{i=1}^N \left(\sum_{t=1}^T \nabla_\theta \log \pi_\theta(a_{it} | s_{it})\right)\left(\sum_{t=1}^T r(s_{it}, a_{it})\right). \label{estimatedscorefunctionpg}
 91 | \end{align}
 92 | Here we see that the policy $\pi_\theta$ is a probability distribution over the action space, conditioned on the state. In the agent-environment loop, the agent samples an action $a_t$ from $\pi_\theta(\cdot | s_t)$ and the environment responds with a reward $r(s_t, a_t)$.
 93 | 
 94 | One way to reduce the variance of the policy gradient is to exploit causality: the notion that the policy cannot affect rewards in the past, yielding following the modified objective, where the sum of rewards here is a sample estimate of the $Q$ function, known as the ``reward-to-go:''
 95 | \begin{align}
 96 | \nabla_\theta J(\theta) &\approx \frac{1}{N} \sum_{i=1}^N \sum_{t=1}^T \nabla_\theta \log \pi_\theta(a_{it} | s_{it})\left(\sum_{t'=t}^T r(s_{it'}, a_{it'})\right).
 97 | \end{align}
 98 | 
 99 | Multiplying a discount factor $\gamma$ to the rewards can be interpreted as encouraging the agent to focus on rewards closer in the future, which can also be thought of as a means for reducing variance (because there is more variance possible futures further into the future). We saw in lecture that the discount factor can be incorporated in two ways.
100 | 
101 | The first way applies the discount on the rewards from full trajectory:
102 | \begin{align} \label{discount_full}
103 | \nabla_\theta J(\theta) &\approx \frac{1}{N} \sum_{i=1}^N \left(\sum_{t=1}^T \nabla_\theta \log \pi_\theta(a_{it} | s_{it})\right)\left(\sum_{t=1}^T \gamma^{t-1} r(s_{it}, a_{it})\right)
104 | \end{align}
105 | and the second way applies the discount on the ``reward-to-go:''
106 | \begin{align} \label{discount_rtg}
107 | \nabla_\theta J(\theta) &\approx \frac{1}{N} \sum_{i=1}^N \sum_{t=1}^T \nabla_\theta \log \pi_\theta(a_{it} | s_{it})\left(\sum_{t'=t}^T \gamma^{t'-t} r(s_{it'}, a_{it'})\right).
108 | \end{align}.
109 | 
110 | We have seen in lecture that subtracting a baseline that is a constant with respect to $\tau$ from the sum of rewards
111 | \begin{align} \label{constant_wrt_tau}
112 | \nabla_\theta J(\theta) = \nabla_\theta \mathbb{E}_{\tau \sim \pi_\theta(\tau)} \left[r(\tau) - b\right]\
113 | \end{align}
114 | leaves the policy gradient unbiased because $$\nabla_\theta \mathbb{E}_{\tau \sim \pi_\theta(\tau)} \left[b\right] = \mathbb{E}_{\tau \sim \pi_\theta(\tau)} \left[\nabla_\theta \log \pi_\theta(\tau) \cdot b\right] = 0.$$
115 | 
116 | In this assignment, we will implement a value function $V_\phi^\pi$ which acts as a \textit{state-dependent} baseline. The value function is trained to approximate the sum of future rewards starting from a particular state:
117 | \begin{align}
118 | V_\phi^\pi(s_t) \approx \sum_{t'=t}^T \mathbb{E}_{\pi_\theta} \left[r(s_{t'}, a_{t'}) | s_t\right],
119 | \end{align}
120 | so the approximate policy gradient now looks like this:
121 | \begin{align} \label{state_dependent_baseline}
122 | \nabla_\theta J(\theta) &\approx \frac{1}{N} \sum_{i=1}^N \sum_{t=1}^T \nabla_\theta \log \pi_\theta(a_{it} | s_{it})\left(\left(\sum_{t'=t}^T \gamma^{t'-t} r(s_{it'}, a_{it'})\right) - V_\phi^\pi\left(s_{it}\right)\right).
123 | \end{align}
124 | 
125 | \textbf{Problem 1. State-dependent baseline:}
126 | In lecture we saw that the policy gradient is unbiased if the baseline is a constant with respect to $\tau$ (Equation~\ref{constant_wrt_tau}). The purpose of this problem is to help convince ourselves that subtracting a state-dependent baseline from the return keeps the policy gradient unbiased. For clarity we will use $p_\theta(\tau)$ instead of $\pi_\theta(\tau)$, although they mean the same thing. Using the \href{https://en.wikipedia.org/wiki/Law_of_total_expectation}{\textcolor{blue}{law of iterated expectations}} we will show that the policy gradient is still unbiased if the baseline $b$ is function of a state at a particular timestep of $\tau$ (Equation~\ref{state_dependent_baseline}). Recall from equation \ref{scorefunctionpg} that the policy gradient can be expressed as:
127 | \begin{align*}
128 |     &\mathbb{E}_{\tau \sim p_\theta(\tau)} \left[\nabla_\theta \log p_\theta(\tau)r(\tau)\right].
129 | \end{align*}
130 | By breaking up $p_\theta(\tau)$ into dynamics and policy terms, we can discard the dynamics terms, which are not functions of $\theta$:
131 | \begin{align*}
132 |    &\mathbb{E}_{\tau \sim p_\theta(\tau)} \left[\sum_{t=1}^T \nabla_\theta \log \pi_\theta(a_t|s_t) \left(\sum_{t'=1}^T r(s_{t'}, a_{t'})\right)\right].
133 | \end{align*}
134 | When we subtract a state dependent baseline $b(s_t)$ (recall equation \ref{state_dependent_baseline}) we get
135 | \begin{align*}
136 | \mathbb{E}_{\tau \sim p_\theta(\tau)} \left[\sum_{t=1}^T \nabla_\theta \log \pi_\theta(a_t|s_t) \left(\left(\sum_{t'=1}^T r(s_{t'}, a_{t'})\right) - b(s_t)\right)\right].
137 | \end{align*}
138 | An alternative approach is to look at the entire trajectory and consider a particular timestep $t^* \in [1, T-1]$ (the timestep $T$ case would be very similar to part (a)). 
139 | Our goal for this problem is to show that 
140 | \begin{align*}
141 | \mathbb{E}_{\tau \sim p_\theta(\tau)} \left[\sum_{t=1}^T \nabla_\theta \log \pi_\theta(a_t|s_t) b(s_t)\right] = 0.
142 | \end{align*}
143 | By \href{https://brilliant.org/wiki/linearity-of-expectation/}{\textcolor{blue}{linearity of expectation}} we can consider each term in this sum independently, so we can equivalently show that
144 | \begin{align} \label{independent}
145 | \sum_{t=1}^T \mathbb{E}_{\tau \sim p_\theta(\tau)}\left[ \nabla_\theta \log \pi_\theta(a_t|s_t) \left(b(s_t)\right)\right] = 0.
146 | \end{align}
147 | \begin{enumerate} [label=(\alph*)]
148 | \item Using the chain rule, we can express $p_\theta(\tau)$ as a product of the state-action marginal $(s_t, a_t)$ and the probability of the rest of the trajectory conditioned on $(s_t, a_t)$ (which we denote as $(\tau / s_t, a_t | s_t, a_t)$):
149 |     \begin{align*}
150 |         p_\theta(\tau) = p_\theta(s_t, a_t)p_\theta(\tau / s_t, a_t | s_t, a_t)
151 |     \end{align*}
152 |     Please show equation \ref{independent} by using the law of iterated expectations, breaking $\mathbb{E}_{\tau \sim p_\theta(\tau)}$ by decoupling the state-action marginal from the rest of the trajectory.
153 | \item Alternatively, we can consider the structure of the MDP and express $p_\theta(\tau)$ as a product of the trajectory distribution up to $s_t$ (which we denote as $(s_{1:t}, a_{1:t-1})$) and the trajectory distribution after $s_t$ conditioned on the first part (which we denote as $(s_{t+1:T}, a_{t:T} | s_{1:t}, a_{1:t-1})$):
154 |     \begin{align*}
155 |         p_\theta(\tau) = p_\theta(s_{1:t}, a_{1:t-1}) p_\theta(s_{t+1:T}, a_{t:T} | s_{1:t}, a_{1:t-1})
156 |     \end{align*}
157 | \begin{enumerate}
158 | \item Explain why, for the inner expectation, conditioning on $(s_1, a_1, ..., a_{t^*-1}, s_{t^*})$ is equivalent to conditioning only on $s_{t^*}$.
159 | \item Please show equation \ref{independent} by using the law of iterated expectations, breaking $\mathbb{E}_{\tau \sim p_\theta(\tau)}$ by decoupling trajectory up to $s_t$ from the trajectory after $s_t$.
160 | \end{enumerate}
161 | \end{enumerate}
162 | Since the policy gradient with respect to $\theta$ can be decoupled as a summation of terms over timesteps $t \in [1, T]$, because we have shown that the policy gradient is unbiased for each of these terms,
163 | the entire policy gradient is also unbiased with respect to a vector of state-dependent baselines over the timesteps: $[b(s_1), b(s_2), ... b(s_T)]$.
164 | 
165 | \section{Code Setup}
166 | \subsection{Files}
167 | The starter code is available \href{https://github.com/KuNyaa/berkeleydeeprlcourse-homework-pytorch/tree/master/hw2}{\textcolor{blue}{here}}.
168 | The only file you need to modify in this homework is \verb|train_pg_f18.py|. The files \verb|logz.py| and \verb|plots.py| are utility files; while you should look at them to understand their functionality, you will not modify them. For the Lunar Lander task, use the provided \verb|lunar_lander.py| file instead of \verb|gym/envs/box2d/lunar_lander.py|. After you fill in the appropriate methods, you should be able to just run \verb|python train_pg_f18.py| with some command line options to perform the experiments. To visualize the results, you can run \verb|python plot.py path/to/logdir|.
169 | 
170 | \subsection{Overview}
171 | The function \verb|train_PG| is used to perform the actual training for policy gradient. The parameters passed into this function specify the algorithm's hyperparameters and environment. The \verb|Agent| class contains methods that define the neural networks, sample trajectories, estimate returns, and update the parameters of the policy.
172 | 
173 | At a high level, the dataflow of the code is structured like this:
174 | \begin{enumerate}
175 |     \item \textit{Define neural network components} from \verb|torch.nn| in Pytorch.
176 |     \item \textit{Build the forward pass function} for your neural network model by using the components you just defined.
177 | \end{enumerate}
178 | Then we will repeat Steps 3 through 5 for $N$ iterations:
179 | \begin{enumerate}\setcounter{enumi}{2}
180 |     \item \textit{Sample trajectories} by executing the functions that samples an action given an observation from the environment. Collect the states, actions, and rewards as numpy variables.
181 |     \item \textit{Estimate returns} in numpy (estimated Q values, baseline predictions, advantages).
182 |     \item \textit{Update parameters} by executing the functions that updates the parameters given what you computed in Step 4.
183 | \end{enumerate}
184 | 
185 | \section{Building Neural Networks}
186 | 
187 | \textbf{Problem 2. Neural networks:} We will now begin to implement a neural network that parametrizes $\pi_\theta$.
188 | \begin{enumerate} [label=(\alph*)]
189 | \item Implement the utility function, \verb|build_mlp|, which will build a feedforward neural network with fully connected units (Hint: use \texttt{torch.nn.Linear}). Test it to make sure that it produces outputs of the expected size and shape. \textbf{You do not need to include anything in your write-up about this,} it will just make your life easier.
190 | \item Next, implement the functions used for forward pass. At this point, you only need to implement the parts with the ``Problem 2'' header.
191 |     \begin{enumerate} [label=(\roman*)]
192 |         \item Define the model components in \texttt
193 |         {PolicyNet.define\_model\_components}. You should define the parameters of your model here, which will be tracked by \verb|torch.autograd| later. They can be any instance of \verb|torch.nn.Module| or \verb|torch.nn.Parameter|.
194 |         \item Define the method  \texttt{PolicyNet.forward}: This defines forward pass for our policy network. It outputs the parameters of a distribution $\pi_\theta(a|s)$. In this homework, when the distribution is over discrete actions these parameters will be the logits of a categorical distribution, and when the distribution is over continuous actions these parameters will be the mean and the log standard deviation of a multivariate Gaussian distribution. 
195 |         \item Define the method \texttt{Agent.sample\_action}: This receives an observation and produces an action that  sampled from $\pi_\theta(a|s)$. This method will be called in \texttt{Agent.sample\_trajectory}.
196 |         \item Define the method \texttt{Agent.get\_log\_prob}: Given an action that the agent took in the environment, this computes the log probability of that action under $\pi_\theta(a|s)$. This will be used in the loss function.
197 |         
198 |     \end{enumerate}
199 | \end{enumerate}
200 | 
201 | \section{Implement Policy Gradient}
202 | \subsection{Implementing the policy gradient loop}
203 | \textbf{Problem 3. Policy Gradient:} Recall from lecture that an RL algorithm can viewed as consisting of three parts, which are reflected in the training loop of \verb|train_PG|:
204 | \begin{enumerate}
205 |     \item \verb|Agent.sample_trajectories|: Generate samples (e.g. run the policy to collect trajectories consisting of state transitions ($s, a, s', r$))
206 |     \item \verb|Agent.estimate_return|: Estimate the return (e.g. sum together discounted rewards from the trajectories, or learn a model that predicts expected total future discounted reward)
207 |     \item \verb|Agent.update_parameters|: Improve the policy (e.g. update the parameters of the policy with policy gradient)
208 | \end{enumerate}
209 | In our implementation, for clarity we will update the parameters of the value function baseline also in the third step (\verb|Agent.update_parameters|), rather than in the second step (as was described in lecture). You only need to implement the parts with the ``Problem 3'' header. 
210 | \begin{enumerate} [label=(\alph*)]
211 |     \item \textbf{Sample trajectories:} In \texttt{Agent.sample\_trajectories}, use the method \\ \texttt{Agent.sample\_action} which you just defined in ``Problem 2'' to sample an action given an observation from the environment.
212 |     \item \textbf{Estimate return:} We will now implement $r(\tau)$ from Equation \ref{objective}. 
213 |     Please implement the method \verb|Agent.sum_of_rewards|, which will return a sample estimate of the discounted return, 
214 |     for both the full-trajectory (Equation~\ref{discount_full}) case, where $$r(\tau_i) = \sum_{t=1}^T \gamma^{t'-1} r(s_{it}, a_{it})$$ and 
215 |     for the ``reward-to-go'' case (Equation~\ref{discount_rtg}) where $$r(\tau_i) = \sum_{t'=t}^T \gamma^{t'-t} r(s_{it'}, a_{it'}).$$
216 |     In \verb|Agent.estimate_return|, normalize the advantages to have a mean of zero and a standard deviation of one. This is a trick for reducing variance.
217 |     \item \textbf{Update parameters:}
218 |     In \verb|Agent.update_parameters| implement a loss function (which can use the result from \texttt{Agent.get\_log\_prob}) to whose gradient is $$\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^N \nabla_\theta \log \pi_\theta(\tau_i) r(\tau_i).$$
219 |     Then, set the optimizer (we use \verb|torch.optim.Adam| in this case) in the right way and perform gradient decent to update the parameters of the policy. 
220 | \end{enumerate}
221 | 
222 | 
223 | \subsection{Experiments}
224 | After you have implemented the code, we will run experiments to get a feel for how different settings impact the performance of policy gradient methods.
225 | 
226 | \textbf{Problem 4. CartPole:} Run the PG algorithm in the discrete \verb|CartPole-v0| environment from the command line as follows:
227 | \begin{lstlisting}
228 | python train_pg_f18.py CartPole-v0 -n 100 -b 1000 -e 3 -dna --exp_name sb_no_rtg_dna
229 | python train_pg_f18.py CartPole-v0 -n 100 -b 1000 -e 3 -rtg -dna --exp_name sb_rtg_dna
230 | python train_pg_f18.py CartPole-v0 -n 100 -b 1000 -e 3 -rtg --exp_name sb_rtg_na
231 | python train_pg_f18.py CartPole-v0 -n 100 -b 5000 -e 3 -dna --exp_name lb_no_rtg_dna
232 | python train_pg_f18.py CartPole-v0 -n 100 -b 5000 -e 3 -rtg -dna --exp_name lb_rtg_dna
233 | python train_pg_f18.py CartPole-v0 -n 100 -b 5000 -e 3 -rtg --exp_name lb_rtg_na
234 | \end{lstlisting}
235 | 
236 | What's happening there:
237 | \begin{itemize}
238 | \item \verb|-n| : Number of iterations.
239 | \item \verb|-b| : Batch size (number of state-action pairs sampled while acting according to the current policy at each iteration).
240 | \item \verb|-e| : Number of experiments to run with the same configuration. Each experiment will start with a different randomly initialized policy, and have a different stream of random numbers. 
241 | \item \verb|-dna| : Flag: if present, sets \verb|normalize_advantages| to False. Otherwise, by default, \verb|normalize_advantages=True|. 
242 | \item \verb|-rtg| : Flag: if present, sets \verb|reward_to_go=True|. Otherwise, \verb|reward_to_go=False| by default.
243 | \item \verb|--exp_name| : Name for experiment, which goes into the name for the data directory.
244 | \end{itemize}
245 | 
246 | Various other command line arguments will allow you to set batch size, learning rate, network architecture (number of hidden layers and the size of the hidden layers---for CartPole, you can use one hidden layer with 32 units), and more. 
247 | 
248 | \textbf{Deliverables for report:}
249 | 
250 | \begin{itemize}
251 | \item Graph the results of your experiments \textbf{using the plot.py file we provide.} Create two graphs. 
252 | \begin{itemize}
253 | \item In the first graph, compare the learning curves (average return at each iteration) for the experiments prefixed with \verb|sb_|. (The small batch experiments.)
254 | \item In the second graph, compare the learning curves for the experiments prefixed with \verb|lb_|. (The large batch experiments.)
255 | \end{itemize}
256 | \item Answer the following questions briefly: 
257 | \begin{itemize}
258 | \item Which gradient estimator has better performance without advantage-centering---the trajectory-centric one, or the one using reward-to-go? 
259 | \item Did advantage centering help?
260 | \item Did the batch size make an impact?
261 | \end{itemize}
262 | \item Provide the exact command line configurations you used to run your experiments. (To verify batch size, learning rate, architecture, and so on.)
263 | \end{itemize}
264 | 
265 | \textbf{What to Expect:}
266 | \begin{itemize}
267 | \item The best configuration of CartPole in both the large and small batch cases converge to a maximum score of 200.
268 | \end{itemize}
269 | 
270 | 
271 | \textbf{Problem 5. InvertedPendulum:} Run experiments in \verb|InvertedPendulum-v2| continuous control environment as follows:
272 | \begin{lstlisting}
273 | python train_pg_f18.py InvertedPendulum-v2 -ep 1000 --discount 0.9 -n 100 -e 3 -l 2 -s 64 -b <b*> -lr <r*> -rtg --exp_name hc_b<b*>_r<r*>
274 | \end{lstlisting}
275 | where your task is to find the smallest batch size \texttt{b*} and largest learning rate \texttt{r*} that gets to optimum (maximum score of 1000) in less than 100 iterations. The policy performance may fluctuate around 1000 -- this is fine. The precision of \texttt{b*} and \texttt{r*} need only be one significant digit.
276 | 
277 | \textbf{Deliverables:}
278 | 
279 | \begin{itemize}
280 | \item Given the \texttt{b*} and \texttt{r*} you found, provide a learning curve where the policy gets to optimum (maximum score of ~1000) in less than 100 iterations. (This may be for a single random seed, or averaged over multiple.)
281 | \item Provide the exact command line configurations you used to run your experiments.
282 | \end{itemize}
283 | 
284 | 
285 | \section{Implement Neural Network Baselines}
286 | For the rest of the assignment we will use ``reward-to-go.''
287 | 
288 | \textbf{Problem 6. Neural network baseline:} We will now implement a value function as a state-dependent neural network baseline. The sections in the code are marked by ``Problem 6.''
289 | \begin{enumerate} [label=(\alph*)]
290 |     \item In \verb|Agent.__init__| implement $V_\phi^\pi$, a neural network that predicts the expected return conditioned on a state. 
291 |     \item In \verb|Agent.compute_advantage|, use the neural network to predict the expected state-conditioned return (call \texttt{self.value\_net}), normalize it to match the statistics of the current batch of ``reward-to-go'', and subtract this value from the ``reward-to-go'' to yield an estimate of the advantage. This implements $$\left(\sum_{t'=t}^T \gamma^{t'-t} r(s_{it'}, a_{it'})\right) - V_\phi^\pi\left(s_{it}\right)$$.
292 |     \item In \verb|Agent.update_parameters|, implement the loss function to train this network. ``Rescale'' the target values for the neural network baseline to have a mean of zero and a standard deviation of one.
293 | \end{enumerate}
294 | 
295 | \section{More Complex Tasks}
296 | \textbf{Note:} The following tasks would take quite a bit of time to train. Please start early!
297 | 
298 | \textbf{Problem 7: LunarLander} For this problem, you will use your policy gradient implementation to solve \verb|LunarLanderContinuous-v2|.
299 | Use an episode length of 1000. The purpose of this problem is to help you debug your baseline implementation.
300 | Run the following command:
301 | \begin{lstlisting}
302 | python train_pg_f18.py LunarLanderContinuous-v2 -ep 1000 --discount 0.99 -n 100 -e 3 -l 2 -s 64 -b 40000 -lr 0.005 -rtg --nn_baseline --exp_name ll_b40000_r0.005
303 | \end{lstlisting}
304 | \textbf{Deliverables:}
305 | \begin{itemize}
306 |     \item Plot a learning curve for the above command. You should expect to achieve an average return of around 180.
307 | \end{itemize}
308 | 
309 | \textbf{Problem 8: HalfCheetah} For this problem, you will use your policy gradient implementation to solve \verb|HalfCheetah-v2|. 
310 | Use an episode length of 150, which is shorter than the default of 1000 for HalfCheetah (which would speed up your training significantly).
311 | Search over batch sizes \texttt{b} $\in [10000,30000,50000]$ and learning rates \texttt{r} $\in [0.005, 0.01, 0.02]$ to replace \texttt{<b>} and \texttt{<r>} below:
312 | \begin{lstlisting}
313 | python train_pg_f18.py HalfCheetah-v2 -ep 150 --discount 0.9 -n 100 -e 3 -l 2 -s 32 -b <b> -lr <r> -rtg --nn_baseline --exp_name hc_b<b>_r<r>
314 | \end{lstlisting}
315 | \textbf{Deliverables:}
316 | \begin{itemize}
317 |     \item How did the batch size and learning rate affect the performance?
318 |     \item Once you've found suitable values of \texttt{b} and \texttt{r} among those choices (let's call them \texttt{b*} and \texttt{r*}), use \texttt{b*} and \texttt{r*}
319 | and run the following commands (remember to replace the terms in the angle brackets):
320 | \begin{lstlisting}
321 | python train_pg_f18.py HalfCheetah-v2 -ep 150 --discount 0.95 -n 100 -e 3 -l 2 -s 32 -b <b*> -lr <r*> --exp_name hc_b<b*>_r<r*>
322 | python train_pg_f18.py HalfCheetah-v2 -ep 150 --discount 0.95 -n 100 -e 3 -l 2 -s 32 -b <b*> -lr <r*> -rtg --exp_name hc_b<b*>_r<r*>
323 | python train_pg_f18.py HalfCheetah-v2 -ep 150 --discount 0.95 -n 100 -e 3 -l 2 -s 32 -b <b*> -lr <r*> --nn_baseline --exp_name hc_b<b*>_r<r*>
324 | python train_pg_f18.py HalfCheetah-v2 -ep 150 --discount 0.95 -n 100 -e 3 -l 2 -s 32 -b <b*> -lr <r*> -rtg --nn_baseline --exp_name hc_b<b*>_r<r*>
325 | \end{lstlisting}
326 | The run with reward-to-go and the baseline should achieve an average score close to 200. Provide a single plot plotting the learning curves for all four runs.
327 | \end{itemize}
328 | 
329 | 
330 | \section{Bonus!}
331 | 
332 | Choose any (or all) of the following:
333 | \begin{itemize}
334 | \item A serious bottleneck in the learning, for more complex environments, is the sample collection time. In \verb|train_pg_f18.py|, we only collect trajectories in a single thread, but this process can be fully parallelized across threads to get a useful speedup. Implement the parallelization and report on the difference in training time. 
335 | \item Implement GAE-$\lambda$ for advantage estimation.\footnote{\url{https://arxiv.org/abs/1506.02438}} Run experiments in a MuJoCo gym environment to explore whether this speeds up training. (\verb|Walker2d-v1| may be good for this.)
336 | \item In PG, we collect a batch of data, estimate a single gradient, and then discard the data and move on. Can we potentially accelerate PG by taking multiple gradient descent steps with the same batch of data? Explore this option and report on your results. Set up a fair comparison between single-step PG and multi-step PG on at least one MuJoCo gym environment. 
337 | \end{itemize}
338 | 
339 | \section{Submission}
340 | Your report should be a document containing 
341 | \begin{enumerate} [label=(\alph*)]
342 | \item 
343 | Your mathematical response (written in \LaTeX) for Problem 1.
344 | \item All graphs requested in Problems 4, 5, 7, and 8.
345 | \item Answers to short explanation questions in section 5 and 7.
346 | \item All command-line expressions you used to run your experiments.
347 | \item (Optionally) Your bonus results (command-line expressions, graphs, and a few sentences that comment on your findings).
348 | \end{enumerate}
349 | 
350 | Please also submit your modified \verb|train_pg_f18.py| file. If your code includes additional files, provide a zip file including your \verb|train_pg_f18.py| and all other files needed to run your code. Please include a \verb|README.md| with instructions needed to exactly duplicate your results (including command-line expressions).
351 | 
352 | 
353 | \end{document}
354 | 


--------------------------------------------------------------------------------
/hw2/logz.py:
--------------------------------------------------------------------------------
  1 | import json
  2 | 
  3 | """
  4 | 
  5 | Some simple logging functionality, inspired by rllab's logging.
  6 | Assumes that each diagnostic gets logged each iteration
  7 | 
  8 | Call logz.configure_output_dir() to start logging to a 
  9 | tab-separated-values file (some_folder_name/log.txt)
 10 | 
 11 | To load the learning curves, you can do, for example
 12 | 
 13 | A = np.genfromtxt('/tmp/expt_1468984536/log.txt',delimiter='\t',dtype=None, names=True)
 14 | A['EpRewMean']
 15 | 
 16 | """
 17 | 
 18 | import os.path as osp, shutil, time, atexit, os, subprocess
 19 | import pickle
 20 | import torch
 21 | 
 22 | color2num = dict(
 23 |     gray=30,
 24 |     red=31,
 25 |     green=32,
 26 |     yellow=33,
 27 |     blue=34,
 28 |     magenta=35,
 29 |     cyan=36,
 30 |     white=37,
 31 |     crimson=38
 32 | )
 33 | 
 34 | def colorize(string, color, bold=False, highlight=False):
 35 |     attr = []
 36 |     num = color2num[color]
 37 |     if highlight: num += 10
 38 |     attr.append(str(num))
 39 |     if bold: attr.append('1')
 40 |     return '\x1b[%sm%s\x1b[0m' % (';'.join(attr), string)
 41 | 
 42 | class G:
 43 |     output_dir = None
 44 |     output_file = None
 45 |     first_row = True
 46 |     log_headers = []
 47 |     log_current_row = {}
 48 | 
 49 | def configure_output_dir(d=None):
 50 |     """
 51 |     Set output directory to d, or to /tmp/somerandomnumber if d is None
 52 |     """
 53 |     G.output_dir = d or "/tmp/experiments/%i"%int(time.time())
 54 |     assert not osp.exists(G.output_dir), "Log dir %s already exists! Delete it first or use a different dir"%G.output_dir
 55 |     os.makedirs(G.output_dir)
 56 |     G.output_file = open(osp.join(G.output_dir, "log.txt"), 'w')
 57 |     atexit.register(G.output_file.close)
 58 |     print(colorize("Logging data to %s"%G.output_file.name, 'green', bold=True))
 59 | 
 60 | def log_tabular(key, val):
 61 |     """
 62 |     Log a value of some diagnostic
 63 |     Call this once for each diagnostic quantity, each iteration
 64 |     """
 65 |     if G.first_row:
 66 |         G.log_headers.append(key)
 67 |     else:
 68 |         assert key in G.log_headers, "Trying to introduce a new key %s that you didn't include in the first iteration"%key
 69 |     assert key not in G.log_current_row, "You already set %s this iteration. Maybe you forgot to call dump_tabular()"%key
 70 |     G.log_current_row[key] = val
 71 | 
 72 | def save_hyperparams(params):
 73 |     with open(osp.join(G.output_dir, "hyperparams.json"), 'w') as out:
 74 |         out.write(json.dumps(params, separators=(',\n','\t:\t'), sort_keys=True))
 75 | 
 76 | def save_pytorch_model(model):  
 77 |     """
 78 |     Saves the entire pytorch Module 
 79 |     """
 80 |     torch.save(model, osp.join(G.output_dir, "model.pkl"))
 81 |     
 82 | 
 83 | def dump_tabular():
 84 |     """
 85 |     Write all of the diagnostics from the current iteration
 86 |     """
 87 |     vals = []
 88 |     key_lens = [len(key) for key in G.log_headers]
 89 |     max_key_len = max(15,max(key_lens))
 90 |     keystr = '%'+'%d'%max_key_len
 91 |     fmt = "| " + keystr + "s | %15s |"
 92 |     n_slashes = 22 + max_key_len
 93 |     print("-"*n_slashes)
 94 |     for key in G.log_headers:
 95 |         val = G.log_current_row.get(key, "")
 96 |         if hasattr(val, "__float__"): valstr = "%8.3g"%val
 97 |         else: valstr = val
 98 |         print(fmt%(key, valstr))
 99 |         vals.append(val)
100 |     print("-"*n_slashes)
101 |     if G.output_file is not None:
102 |         if G.first_row:
103 |             G.output_file.write("\t".join(G.log_headers))
104 |             G.output_file.write("\n")
105 |         G.output_file.write("\t".join(map(str,vals)))
106 |         G.output_file.write("\n")
107 |         G.output_file.flush()
108 |     G.log_current_row.clear()
109 |     G.first_row=False
110 | 


--------------------------------------------------------------------------------
/hw2/lunar_lander.py:
--------------------------------------------------------------------------------
  1 | import sys, math
  2 | import numpy as np
  3 | 
  4 | import Box2D
  5 | from Box2D.b2 import (edgeShape, circleShape, fixtureDef, polygonShape, revoluteJointDef, contactListener)
  6 | 
  7 | import gym
  8 | from gym import spaces
  9 | from gym.utils import seeding
 10 | 
 11 | import pyglet
 12 | 
 13 | from copy import copy
 14 | 
 15 | # Rocket trajectory optimization is a classic topic in Optimal Control.
 16 | #
 17 | # According to Pontryagin's maximum principle it's optimal to fire engine full throttle or
 18 | # turn it off. That's the reason this environment is OK to have discreet actions (engine on or off).
 19 | #
 20 | # Landing pad is always at coordinates (0,0). Coordinates are the first two numbers in state vector.
 21 | # Reward for moving from the top of the screen to landing pad and zero speed is about 100..140 points.
 22 | # If lander moves away from landing pad it loses reward back. Episode finishes if the lander crashes or
 23 | # comes to rest, receiving additional -100 or +100 points. Each leg ground contact is +10. Firing main
 24 | # engine is -0.3 points each frame. Solved is 200 points.
 25 | #
 26 | # Landing outside landing pad is possible. Fuel is infinite, so an agent can learn to fly and then land
 27 | # on its first attempt. Please see source code for details.
 28 | #
 29 | # Too see heuristic landing, run:
 30 | #
 31 | # python gym/envs/box2d/lunar_lander.py
 32 | #
 33 | # To play yourself, run:
 34 | #
 35 | # python examples/agents/keyboard_agent.py LunarLander-v0
 36 | #
 37 | # Created by Oleg Klimov. Licensed on the same terms as the rest of OpenAI Gym.
 38 | 
 39 | # Modified by Sid Reddy (sgr@berkeley.edu) on 8/14/18
 40 | #
 41 | # Changelog:
 42 | # - different discretization scheme for actions
 43 | # - different terminal rewards
 44 | # - different observations
 45 | # - randomized landing site
 46 | #
 47 | # You can create an env object using `gym.make('LunarLanderContinuous-v2')`,
 48 | # and it will use the discrete action space specified in this file, even though
 49 | # the env is called "Continuous".
 50 | #
 51 | # A good agent should be able to achieve >150 reward.
 52 | 
 53 | MAX_NUM_STEPS = 1000
 54 | 
 55 | N_OBS_DIM = 9
 56 | N_ACT_DIM = 6 # num discrete actions
 57 | 
 58 | FPS    = 50
 59 | SCALE  = 30.0   # affects how fast-paced the game is, forces should be adjusted as well
 60 | 
 61 | MAIN_ENGINE_POWER  = 13.0
 62 | SIDE_ENGINE_POWER  =  0.6
 63 | 
 64 | INITIAL_RANDOM = 1000.0   # Set 1500 to make game harder
 65 | 
 66 | LANDER_POLY =[
 67 |     (-14,+17), (-17,0), (-17,-10),
 68 |     (+17,-10), (+17,0), (+14,+17)
 69 |     ]
 70 | LEG_AWAY = 20
 71 | LEG_DOWN = 18
 72 | LEG_W, LEG_H = 2, 8
 73 | LEG_SPRING_TORQUE = 40 # 40 is too difficult for human players, 400 a bit easier
 74 | 
 75 | SIDE_ENGINE_HEIGHT = 14.0
 76 | SIDE_ENGINE_AWAY   = 12.0
 77 | 
 78 | VIEWPORT_W = 600
 79 | VIEWPORT_H = 400
 80 | 
 81 | THROTTLE_MAG = 0.75 # discretized 'on' value for thrusters
 82 | NOOP = 1 # don't fire main engine, don't steer
 83 | def disc_to_cont(action): # discrete action -> continuous action
 84 |   if type(action) == np.ndarray:
 85 |     return action
 86 |   # main engine
 87 |   if action < 3:
 88 |     m = -THROTTLE_MAG
 89 |   elif action < 6:
 90 |     m = THROTTLE_MAG
 91 |   else:
 92 |     raise ValueError
 93 |   # steering
 94 |   if action % 3 == 0:
 95 |     s = -THROTTLE_MAG
 96 |   elif action % 3 == 1:
 97 |     s = 0
 98 |   else:
 99 |     s = THROTTLE_MAG
100 |   return np.array([m, s])
101 | 
102 | class ContactDetector(contactListener):
103 |     def __init__(self, env):
104 |         contactListener.__init__(self)
105 |         self.env = env
106 |     def BeginContact(self, contact):
107 |         if self.env.lander==contact.fixtureA.body or self.env.lander==contact.fixtureB.body:
108 |             self.env.game_over = True
109 |         for i in range(2):
110 |             if self.env.legs[i] in [contact.fixtureA.body, contact.fixtureB.body]:
111 |                 self.env.legs[i].ground_contact = True
112 |     def EndContact(self, contact):
113 |         for i in range(2):
114 |             if self.env.legs[i] in [contact.fixtureA.body, contact.fixtureB.body]:
115 |                 self.env.legs[i].ground_contact = False
116 | 
117 | class LunarLander(gym.Env):
118 |     metadata = {
119 |         'render.modes': ['human', 'rgb_array'],
120 |         'video.frames_per_second' : FPS
121 |     }
122 | 
123 |     continuous = False
124 | 
125 |     def __init__(self):
126 |         self._seed()
127 |         self.viewer = None
128 | 
129 |         self.world = Box2D.b2World()
130 |         self.moon = None
131 |         self.lander = None
132 |         self.particles = []
133 | 
134 |         self.prev_reward = None
135 | 
136 |         high = np.array([np.inf]*N_OBS_DIM)  # useful range is -1 .. +1, but spikes can be higher
137 |         self.observation_space = spaces.Box(-high, high)
138 | 
139 |         self.action_space = spaces.Discrete(N_ACT_DIM)
140 | 
141 |         self.curr_step = None
142 | 
143 |         self._reset()
144 | 
145 |     def _seed(self, seed=None):
146 |         self.np_random, seed = seeding.np_random(seed)
147 |         return [seed]
148 | 
149 |     def _destroy(self):
150 |         if not self.moon: return
151 |         self.world.contactListener = None
152 |         self._clean_particles(True)
153 |         self.world.DestroyBody(self.moon)
154 |         self.moon = None
155 |         self.world.DestroyBody(self.lander)
156 |         self.lander = None
157 |         self.world.DestroyBody(self.legs[0])
158 |         self.world.DestroyBody(self.legs[1])
159 | 
160 |     def _reset(self):
161 |         self.curr_step = 0
162 | 
163 |         self._destroy()
164 |         self.world.contactListener_keepref = ContactDetector(self)
165 |         self.world.contactListener = self.world.contactListener_keepref
166 |         self.game_over = False
167 |         self.prev_shaping = None
168 | 
169 |         W = VIEWPORT_W/SCALE
170 |         H = VIEWPORT_H/SCALE
171 | 
172 |         # terrain
173 |         CHUNKS = 11
174 |         height = self.np_random.uniform(0, H/2, size=(CHUNKS+1,) )
175 |         chunk_x  = [W/(CHUNKS-1)*i for i in range(CHUNKS)]
176 | 
177 |         # randomize helipad x-coord
178 |         helipad_chunk = np.random.choice(range(1, CHUNKS-1))
179 | 
180 |         self.helipad_x1 = chunk_x[helipad_chunk-1]
181 |         self.helipad_x2 = chunk_x[helipad_chunk+1]
182 |         self.helipad_y  = H/4
183 |         height[helipad_chunk-2] = self.helipad_y
184 |         height[helipad_chunk-1] = self.helipad_y
185 |         height[helipad_chunk+0] = self.helipad_y
186 |         height[helipad_chunk+1] = self.helipad_y
187 |         height[helipad_chunk+2] = self.helipad_y
188 |         smooth_y = [0.33*(height[i-1] + height[i+0] + height[i+1]) for i in range(CHUNKS)]
189 | 
190 |         self.moon = self.world.CreateStaticBody( shapes=edgeShape(vertices=[(0, 0), (W, 0)]) )
191 |         self.sky_polys = []
192 |         for i in range(CHUNKS-1):
193 |             p1 = (chunk_x[i],   smooth_y[i])
194 |             p2 = (chunk_x[i+1], smooth_y[i+1])
195 |             self.moon.CreateEdgeFixture(
196 |                 vertices=[p1,p2],
197 |                 density=0,
198 |                 friction=0.1)
199 |             self.sky_polys.append( [p1, p2, (p2[0],H), (p1[0],H)] )
200 | 
201 |         self.moon.color1 = (0.0,0.0,0.0)
202 |         self.moon.color2 = (0.0,0.0,0.0)
203 | 
204 |         initial_y = VIEWPORT_H/SCALE#*0.75
205 |         self.lander = self.world.CreateDynamicBody(
206 |             position = (VIEWPORT_W/SCALE/2, initial_y),
207 |             angle=0.0,
208 |             fixtures = fixtureDef(
209 |                 shape=polygonShape(vertices=[ (x/SCALE,y/SCALE) for x,y in LANDER_POLY ]),
210 |                 density=5.0,
211 |                 friction=0.1,
212 |                 categoryBits=0x0010,
213 |                 maskBits=0x001,  # collide only with ground
214 |                 restitution=0.0) # 0.99 bouncy
215 |                 )
216 |         self.lander.color1 = (0.5,0.4,0.9)
217 |         self.lander.color2 = (0.3,0.3,0.5)
218 |         self.lander.ApplyForceToCenter( (
219 |             self.np_random.uniform(-INITIAL_RANDOM, INITIAL_RANDOM),
220 |             self.np_random.uniform(-INITIAL_RANDOM, INITIAL_RANDOM)
221 |             ), True)
222 | 
223 |         self.legs = []
224 |         for i in [-1,+1]:
225 |             leg = self.world.CreateDynamicBody(
226 |                 position = (VIEWPORT_W/SCALE/2 - i*LEG_AWAY/SCALE, initial_y),
227 |                 angle = (i*0.05),
228 |                 fixtures = fixtureDef(
229 |                     shape=polygonShape(box=(LEG_W/SCALE, LEG_H/SCALE)),
230 |                     density=1.0,
231 |                     restitution=0.0,
232 |                     categoryBits=0x0020,
233 |                     maskBits=0x001)
234 |                 )
235 |             leg.ground_contact = False
236 |             leg.color1 = (0.5,0.4,0.9)
237 |             leg.color2 = (0.3,0.3,0.5)
238 |             rjd = revoluteJointDef(
239 |                 bodyA=self.lander,
240 |                 bodyB=leg,
241 |                 localAnchorA=(0, 0),
242 |                 localAnchorB=(i*LEG_AWAY/SCALE, LEG_DOWN/SCALE),
243 |                 enableMotor=True,
244 |                 enableLimit=True,
245 |                 maxMotorTorque=LEG_SPRING_TORQUE,
246 |                 motorSpeed=+0.3*i  # low enough not to jump back into the sky
247 |                 )
248 |             if i==-1:
249 |                 rjd.lowerAngle = +0.9 - 0.5  # Yes, the most esoteric numbers here, angles legs have freedom to travel within
250 |                 rjd.upperAngle = +0.9
251 |             else:
252 |                 rjd.lowerAngle = -0.9
253 |                 rjd.upperAngle = -0.9 + 0.5
254 |             leg.joint = self.world.CreateJoint(rjd)
255 |             self.legs.append(leg)
256 | 
257 |         self.drawlist = [self.lander] + self.legs
258 | 
259 |         return self._step(NOOP)[0]
260 | 
261 |     def _create_particle(self, mass, x, y, ttl):
262 |         p = self.world.CreateDynamicBody(
263 |             position = (x,y),
264 |             angle=0.0,
265 |             fixtures = fixtureDef(
266 |                 shape=circleShape(radius=2/SCALE, pos=(0,0)),
267 |                 density=mass,
268 |                 friction=0.1,
269 |                 categoryBits=0x0100,
270 |                 maskBits=0x001,  # collide only with ground
271 |                 restitution=0.3)
272 |                 )
273 |         p.ttl = ttl
274 |         self.particles.append(p)
275 |         self._clean_particles(False)
276 |         return p
277 | 
278 |     def _clean_particles(self, all):
279 |         while self.particles and (all or self.particles[0].ttl<0):
280 |             self.world.DestroyBody(self.particles.pop(0))
281 | 
282 |     def _step(self, action):
283 |         #assert self.action_space.contains(action), "%r (%s) invalid " % (action,type(action))
284 |         if type(action) in [int, np.int64]:
285 |           action = disc_to_cont(action)
286 | 
287 |         # Engines
288 |         tip  = (math.sin(self.lander.angle), math.cos(self.lander.angle))
289 |         side = (-tip[1], tip[0]);
290 |         dispersion = [self.np_random.uniform(-1.0, +1.0) / SCALE for _ in range(2)]
291 | 
292 |         m_power = 0.0
293 |         if (self.continuous and action[0] > 0.0) or (not self.continuous and action==2):
294 |             # Main engine
295 |             if self.continuous:
296 |                 m_power = (np.clip(action[0], 0.0,1.0) + 1.0)*0.5   # 0.5..1.0
297 |                 assert m_power>=0.5 and m_power <= 1.0
298 |             else:
299 |                 m_power = 1.0
300 |             ox =  tip[0]*(4/SCALE + 2*dispersion[0]) + side[0]*dispersion[1]   # 4 is move a bit downwards, +-2 for randomness
301 |             oy = -tip[1]*(4/SCALE + 2*dispersion[0]) - side[1]*dispersion[1]
302 |             impulse_pos = (self.lander.position[0] + ox, self.lander.position[1] + oy)
303 |             p = self._create_particle(3.5, impulse_pos[0], impulse_pos[1], m_power)    # particles are just a decoration, 3.5 is here to make particle speed adequate
304 |             p.ApplyLinearImpulse(           ( ox*MAIN_ENGINE_POWER*m_power,  oy*MAIN_ENGINE_POWER*m_power), impulse_pos, True)
305 |             self.lander.ApplyLinearImpulse( (-ox*MAIN_ENGINE_POWER*m_power, -oy*MAIN_ENGINE_POWER*m_power), impulse_pos, True)
306 | 
307 |         s_power = 0.0
308 |         if (self.continuous and np.abs(action[1]) > 0.5) or (not self.continuous and action in [1,3]):
309 |             # Orientation engines
310 |             if self.continuous:
311 |                 direction = np.sign(action[1])
312 |                 s_power = np.clip(np.abs(action[1]), 0.5,1.0)
313 |                 assert s_power>=0.5 and s_power <= 1.0
314 |             else:
315 |                 direction = action-2
316 |                 s_power = 1.0
317 |             ox =  tip[0]*dispersion[0] + side[0]*(3*dispersion[1]+direction*SIDE_ENGINE_AWAY/SCALE)
318 |             oy = -tip[1]*dispersion[0] - side[1]*(3*dispersion[1]+direction*SIDE_ENGINE_AWAY/SCALE)
319 |             impulse_pos = (self.lander.position[0] + ox - tip[0]*17/SCALE, self.lander.position[1] + oy + tip[1]*SIDE_ENGINE_HEIGHT/SCALE)
320 |             p = self._create_particle(0.7, impulse_pos[0], impulse_pos[1], s_power)
321 |             p.ApplyLinearImpulse(           ( ox*SIDE_ENGINE_POWER*s_power,  oy*SIDE_ENGINE_POWER*s_power), impulse_pos, True)
322 |             self.lander.ApplyLinearImpulse( (-ox*SIDE_ENGINE_POWER*s_power, -oy*SIDE_ENGINE_POWER*s_power), impulse_pos, True)
323 | 
324 |         # perform normal update
325 |         self.world.Step(1.0/FPS, 6*30, 2*30)
326 | 
327 |         pos = self.lander.position
328 |         vel = self.lander.linearVelocity
329 |         helipad_x = (self.helipad_x1 + self.helipad_x2) / 2
330 |         state = [
331 |             (pos.x - VIEWPORT_W/SCALE/2) / (VIEWPORT_W/SCALE/2),
332 |             (pos.y - (self.helipad_y+LEG_DOWN/SCALE)) / (VIEWPORT_W/SCALE/2),
333 |             vel.x*(VIEWPORT_W/SCALE/2)/FPS,
334 |             vel.y*(VIEWPORT_H/SCALE/2)/FPS,
335 |             self.lander.angle,
336 |             20.0*self.lander.angularVelocity/FPS,
337 |             1.0 if self.legs[0].ground_contact else 0.0,
338 |             1.0 if self.legs[1].ground_contact else 0.0,
339 |             (helipad_x - VIEWPORT_W/SCALE/2) / (VIEWPORT_W/SCALE/2)
340 |             ]
341 |         assert len(state)==N_OBS_DIM
342 | 
343 |         self.curr_step += 1
344 | 
345 |         reward = 0
346 |         shaping = 0
347 |         dx = (pos.x - helipad_x) / (VIEWPORT_W/SCALE/2)
348 |         shaping += -100*np.sqrt(state[2]*state[2] + state[3]*state[3]) - 100*abs(state[4])
349 |         shaping += -100*np.sqrt(dx*dx + state[1]*state[1]) + 10*state[6] + 10*state[7]
350 |         if self.prev_shaping is not None:
351 |             reward = shaping - self.prev_shaping
352 |         self.prev_shaping = shaping
353 | 
354 |         reward -= m_power*0.30  # less fuel spent is better, about -30 for heurisic landing
355 |         reward -= s_power*0.03
356 | 
357 |         oob = abs(state[0]) >= 1.0
358 |         timeout = self.curr_step >= MAX_NUM_STEPS
359 |         not_awake = not self.lander.awake
360 | 
361 |         at_site = pos.x >= self.helipad_x1 and pos.x <= self.helipad_x2 and state[1] <= 0
362 |         grounded = self.legs[0].ground_contact and self.legs[1].ground_contact
363 |         landed = at_site and grounded
364 | 
365 |         done = self.game_over or oob or not_awake or timeout or landed
366 |         if done:
367 |           if self.game_over or oob:
368 |             reward = -100
369 |             self.lander.color1 = (255,0,0)
370 |           elif at_site:
371 |             reward = +100
372 |             self.lander.color1 = (0,255,0)
373 |           elif timeout:
374 |             self.lander.color1 = (255,0,0)
375 |         info = {}
376 | 
377 |         return np.array(state), reward, done, info
378 | 
379 |     def _render(self, mode='human', close=False):
380 |         if close:
381 |             if self.viewer is not None:
382 |                 self.viewer.close()
383 |                 self.viewer = None
384 |             return
385 | 
386 |         from gym.envs.classic_control import rendering
387 |         if self.viewer is None:
388 |             self.viewer = rendering.Viewer(VIEWPORT_W, VIEWPORT_H)
389 |             self.viewer.set_bounds(0, VIEWPORT_W/SCALE, 0, VIEWPORT_H/SCALE)
390 | 
391 |         for obj in self.particles:
392 |             obj.ttl -= 0.15
393 |             obj.color1 = (max(0.2,0.2+obj.ttl), max(0.2,0.5*obj.ttl), max(0.2,0.5*obj.ttl))
394 |             obj.color2 = (max(0.2,0.2+obj.ttl), max(0.2,0.5*obj.ttl), max(0.2,0.5*obj.ttl))
395 | 
396 |         self._clean_particles(False)
397 | 
398 |         for p in self.sky_polys:
399 |             self.viewer.draw_polygon(p, color=(0,0,0))
400 | 
401 |         for obj in self.particles + self.drawlist:
402 |             for f in obj.fixtures:
403 |                 trans = f.body.transform
404 |                 if type(f.shape) is circleShape:
405 |                     t = rendering.Transform(translation=trans*f.shape.pos)
406 |                     self.viewer.draw_circle(f.shape.radius, 20, color=obj.color1).add_attr(t)
407 |                     self.viewer.draw_circle(f.shape.radius, 20, color=obj.color2, filled=False, linewidth=2).add_attr(t)
408 |                 else:
409 |                     path = [trans*v for v in f.shape.vertices]
410 |                     self.viewer.draw_polygon(path, color=obj.color1)
411 |                     path.append(path[0])
412 |                     self.viewer.draw_polyline(path, color=obj.color2, linewidth=2)
413 | 
414 |         for x in [self.helipad_x1, self.helipad_x2]:
415 |             flagy1 = self.helipad_y
416 |             flagy2 = flagy1 + 50/SCALE
417 |             self.viewer.draw_polyline( [(x, flagy1), (x, flagy2)], color=(1,1,1) )
418 |             self.viewer.draw_polygon( [(x, flagy2), (x, flagy2-10/SCALE), (x+25/SCALE, flagy2-5/SCALE)], color=(0.8,0.8,0) )
419 | 
420 |         clock_prog = self.curr_step / MAX_NUM_STEPS
421 |         self.viewer.draw_polyline( [(0, 0.05*VIEWPORT_H/SCALE), (clock_prog*VIEWPORT_W/SCALE, 0.05*VIEWPORT_H/SCALE)], color=(255,0,0), linewidth=5  )
422 | 
423 |         return self.viewer.render(return_rgb_array = mode=='rgb_array')
424 | 
425 | class LunarLanderContinuous(LunarLander):
426 |     continuous = True
427 | 
428 | def heuristic(env, s):
429 |     # Heuristic for:
430 |     # 1. Testing.
431 |     # 2. Demonstration rollout.
432 |     angle_targ = s[0]*0.5 + s[2]*1.0         # angle should point towards center (s[0] is horizontal coordinate, s[2] hor speed)
433 |     if angle_targ >  0.4: angle_targ =  0.4  # more than 0.4 radians (22 degrees) is bad
434 |     if angle_targ < -0.4: angle_targ = -0.4
435 |     hover_targ = 0.55*np.abs(s[0])           # target y should be proporional to horizontal offset
436 | 
437 |     # PID controller: s[4] angle, s[5] angularSpeed
438 |     angle_todo = (angle_targ - s[4])*0.5 - (s[5])*1.0
439 |     #print("angle_targ=%0.2f, angle_todo=%0.2f" % (angle_targ, angle_todo))
440 | 
441 |     # PID controller: s[1] vertical coordinate s[3] vertical speed
442 |     hover_todo = (hover_targ - s[1])*0.5 - (s[3])*0.5
443 |     #print("hover_targ=%0.2f, hover_todo=%0.2f" % (hover_targ, hover_todo))
444 | 
445 |     if s[6] or s[7]: # legs have contact
446 |         angle_todo = 0
447 |         hover_todo = -(s[3])*0.5  # override to reduce fall speed, that's all we need after contact
448 | 
449 |     if env.continuous:
450 |         a = np.array( [hover_todo*20 - 1, -angle_todo*20] )
451 |         a = np.clip(a, -1, +1)
452 |     else:
453 |         a = 0
454 |         if hover_todo > np.abs(angle_todo) and hover_todo > 0.05: a = 2
455 |         elif angle_todo < -0.05: a = 3
456 |         elif angle_todo > +0.05: a = 1
457 |     return a
458 | 
459 | if __name__=="__main__":
460 |     #env = LunarLander()
461 |     env = LunarLanderContinuous()
462 |     s = env.reset()
463 |     total_reward = 0
464 |     steps = 0
465 |     while True:
466 |         a = heuristic(env, s)
467 |         s, r, done, info = env.step(a)
468 |         env.render()
469 |         total_reward += r
470 |         if steps % 20 == 0 or done:
471 |             print(["{:+0.2f}".format(x) for x in s])
472 |             print("step {} total_reward {:+0.2f}".format(steps, total_reward))
473 |         steps += 1
474 |         if done: break
475 | 


--------------------------------------------------------------------------------
/hw2/plot.py:
--------------------------------------------------------------------------------
  1 | import seaborn as sns
  2 | import pandas as pd
  3 | import matplotlib.pyplot as plt
  4 | import json
  5 | import os
  6 | 
  7 | """
  8 | Using the plotter:
  9 | 
 10 | Call it from the command line, and supply it with logdirs to experiments.
 11 | Suppose you ran an experiment with name 'test', and you ran 'test' for 10 
 12 | random seeds. The runner code stored it in the directory structure
 13 | 
 14 |     data
 15 |     L test_EnvName_DateTime
 16 |       L  0
 17 |         L log.txt
 18 |         L params.json
 19 |       L  1
 20 |         L log.txt
 21 |         L params.json
 22 |        .
 23 |        .
 24 |        .
 25 |       L  9
 26 |         L log.txt
 27 |         L params.json
 28 | 
 29 | To plot learning curves from the experiment, averaged over all random
 30 | seeds, call
 31 | 
 32 |     python plot.py data/test_EnvName_DateTime --value AverageReturn
 33 | 
 34 | and voila. To see a different statistics, change what you put in for
 35 | the keyword --value. You can also enter /multiple/ values, and it will 
 36 | make all of them in order.
 37 | 
 38 | 
 39 | Suppose you ran two experiments: 'test1' and 'test2'. In 'test2' you tried
 40 | a different set of hyperparameters from 'test1', and now you would like 
 41 | to compare them -- see their learning curves side-by-side. Just call
 42 | 
 43 |     python plot.py data/test1 data/test2
 44 | 
 45 | and it will plot them both! They will be given titles in the legend according
 46 | to their exp_name parameters. If you want to use custom legend titles, use
 47 | the --legend flag and then provide a title for each logdir.
 48 | 
 49 | """
 50 | 
 51 | def plot_data(data, value="AverageReturn"):
 52 |     if isinstance(data, list):
 53 |         data = pd.concat(data, ignore_index=True)
 54 | 
 55 |     sns.set(style="darkgrid", font_scale=1.5)
 56 |     sns.tsplot(data=data, time="Iteration", value=value, unit="Unit", condition="Condition")
 57 |     plt.legend(loc='best').draggable()
 58 |     plt.show()
 59 | 
 60 | 
 61 | def get_datasets(fpath, condition=None):
 62 |     unit = 0
 63 |     datasets = []
 64 |     for root, dir, files in os.walk(fpath):
 65 |         if 'log.txt' in files:
 66 |             param_path = open(os.path.join(root,'hyperparams.json'))
 67 |             params = json.load(param_path)
 68 |             exp_name = params['exp_name']
 69 |             
 70 |             log_path = os.path.join(root,'log.txt')
 71 |             experiment_data = pd.read_table(log_path)
 72 | 
 73 |             experiment_data.insert(
 74 |                 len(experiment_data.columns),
 75 |                 'Unit',
 76 |                 unit
 77 |                 )        
 78 |             experiment_data.insert(
 79 |                 len(experiment_data.columns),
 80 |                 'Condition',
 81 |                 condition or exp_name
 82 |                 )
 83 | 
 84 |             datasets.append(experiment_data)
 85 |             unit += 1
 86 | 
 87 |     return datasets
 88 | 
 89 | 
 90 | def main():
 91 |     import argparse
 92 |     parser = argparse.ArgumentParser()
 93 |     parser.add_argument('logdir', nargs='*')
 94 |     parser.add_argument('--legend', nargs='*')
 95 |     parser.add_argument('--value', default='AverageReturn', nargs='*')
 96 |     args = parser.parse_args()
 97 | 
 98 |     use_legend = False
 99 |     if args.legend is not None:
100 |         assert len(args.legend) == len(args.logdir), \
101 |             "Must give a legend title for each set of experiments."
102 |         use_legend = True
103 | 
104 |     data = []
105 |     if use_legend:
106 |         for logdir, legend_title in zip(args.logdir, args.legend):
107 |             data += get_datasets(logdir, legend_title)
108 |     else:
109 |         for logdir in args.logdir:
110 |             data += get_datasets(logdir)
111 | 
112 |     if isinstance(args.value, list):
113 |         values = args.value
114 |     else:
115 |         values = [args.value]
116 |     for value in values:
117 |         plot_data(data, value=value)
118 | 
119 | if __name__ == "__main__":
120 |     main()
121 | 


--------------------------------------------------------------------------------
/hw2/requirements.txt:
--------------------------------------------------------------------------------
1 | mujoco-py==1.50.1.56
2 | gym==0.10.5
3 | torch==0.4.0
4 | numpy==1.14.5
5 | seaborn
6 | Box2D==2.3.2
7 | 


--------------------------------------------------------------------------------
/hw2/train_pg_f18.py:
--------------------------------------------------------------------------------
  1 | """
  2 | Original code from John Schulman for CS294 Deep Reinforcement Learning Spring 2017
  3 | Adapted for CS294-112 Fall 2017 by Abhishek Gupta and Joshua Achiam
  4 | Adapted for CS294-112 Fall 2018 by Michael Chang and Soroush Nasiriany
  5 | Adapted for pytorch version by Ning Dai
  6 | """
  7 | import numpy as np
  8 | import torch
  9 | import gym
 10 | import logz
 11 | import scipy.signal
 12 | import os
 13 | import time
 14 | import inspect
 15 | from torch.multiprocessing import Process
 16 | from torch import nn, optim
 17 | 
 18 | #============================================================================================#
 19 | # Utilities
 20 | #============================================================================================#
 21 | 
 22 | #========================================================================================#
 23 | #                           ----------PROBLEM 2----------
 24 | #========================================================================================#  
 25 | def build_mlp(input_size, output_size, n_layers, hidden_size, activation=nn.Tanh):
 26 |     """
 27 |         Builds a feedforward neural network
 28 |         
 29 |         arguments:
 30 |             input_size: size of the input layer
 31 |             output_size: size of the output layer
 32 |             n_layers: number of hidden layers
 33 |             hidden_size: dimension of the hidden layers
 34 |             activation: activation of the hidden layers
 35 |             output_activation: activation of the output layer
 36 | 
 37 |         returns:
 38 |             an instance of nn.Sequential which contains the feedforward neural network
 39 | 
 40 |         Hint: use nn.Linear
 41 |     """
 42 |     layers = []
 43 |     # YOUR CODE HERE
 44 |     raise NotImplementedError
 45 |     return nn.Sequential(*layers).apply(weights_init)
 46 | 
 47 | def weights_init(m):
 48 |     if hasattr(m, 'weight'):
 49 |         torch.nn.init.xavier_uniform_(m.weight) 
 50 | 
 51 | def pathlength(path):
 52 |     return len(path["reward"])
 53 | 
 54 | def setup_logger(logdir, locals_):
 55 |     # Configure output directory for logging
 56 |     logz.configure_output_dir(logdir)
 57 |     # Log experimental parameters
 58 |     args = inspect.getargspec(train_PG)[0]
 59 |     hyperparams = {k: locals_[k] if k in locals_ else None for k in args}
 60 |     logz.save_hyperparams(hyperparams)
 61 | 
 62 | class PolicyNet(nn.Module):
 63 |     def __init__(self, neural_network_args):
 64 |         super(PolicyNet, self).__init__()
 65 |         self.ob_dim = neural_network_args['ob_dim']
 66 |         self.ac_dim = neural_network_args['ac_dim']
 67 |         self.discrete = neural_network_args['discrete']
 68 |         self.hidden_size = neural_network_args['size']
 69 |         self.n_layers = neural_network_args['n_layers']
 70 | 
 71 |         self.define_model_components()
 72 |         
 73 |     #========================================================================================#
 74 |     #                           ----------PROBLEM 2----------
 75 |     #========================================================================================#
 76 |     def define_model_components(self):
 77 |         """
 78 |             Define the parameters of policy network here.
 79 |             You can use any instance of nn.Module or nn.Parameter.
 80 | 
 81 |             Hint: use the 'build_mlp' function defined above
 82 |                 In the discrete case, model should output logits of a categorical distribution
 83 |                     over the actions
 84 |                 In the continuous case, model should output a tuple (mean, log_std) of a Gaussian
 85 |                     distribution over actions. log_std should just be a trainable
 86 |                     variable, not a network output.
 87 |         """
 88 |         # YOUR_CODE_HERE
 89 |         if self.discrete:
 90 |             raise NotImplementedError
 91 |         else:
 92 |             raise NotImplementedError
 93 |             
 94 |     #========================================================================================#
 95 |     #                           ----------PROBLEM 2----------
 96 |     #========================================================================================#
 97 |     """
 98 |         Notes on notation:
 99 |         
100 |         Pytorch tensor variables have the prefix ts_, to distinguish them from the numpy array
101 |         variables that are computed later in the function
102 |     
103 |         Prefixes and suffixes:
104 |         ob - observation 
105 |         ac - action
106 |         _no - this tensor should have shape (batch size, observation dim)
107 |         _na - this tensor should have shape (batch size, action dim)
108 |         _n  - this tensor should have shape (batch size)
109 |             
110 |         Note: batch size is defined at runtime
111 |     """
112 |     def forward(self, ts_ob_no):
113 |         """
114 |             Define forward pass for policy network.
115 | 
116 |             arguments:
117 |                 ts_ob_no: (batch_size, self.ob_dim) 
118 | 
119 |             returns:
120 |                 the parameters of the policy.
121 | 
122 |                 if discrete, the parameters are the logits of a categorical distribution
123 |                     over the actions
124 |                     ts_logits_na: (batch_size, self.ac_dim)
125 | 
126 |                 if continuous, the parameters are a tuple (mean, log_std) of a Gaussian
127 |                     distribution over actions. log_std should just be a trainable
128 |                     variable, not a network output.
129 |                     ts_mean: (batch_size, self.ac_dim)
130 |                     st_logstd: (self.ac_dim,)
131 |         
132 |             Hint: use the components you defined in self.define_model_components
133 |         """
134 |         raise NotImplementedError
135 |         if self.discrete:
136 |             # YOUR_CODE_HERE
137 |             ts_logits_na = None
138 |             return ts_logits_na
139 |         else:
140 |             # YOUR_CODE_HERE
141 |             ts_mean = None
142 |             ts_logstd = None
143 |             return (ts_mean, ts_logstd)
144 |     
145 | #============================================================================================#
146 | # Policy Gradient
147 | #============================================================================================#
148 | 
149 | class Agent(object):
150 |     def __init__(self, neural_network_args, sample_trajectory_args, estimate_return_args):
151 |         super(Agent, self).__init__()
152 |         self.ob_dim = neural_network_args['ob_dim']
153 |         self.ac_dim = neural_network_args['ac_dim']
154 |         self.discrete = neural_network_args['discrete']
155 |         self.hidden_size = neural_network_args['size']
156 |         self.n_layers = neural_network_args['n_layers']
157 |         self.learning_rate = neural_network_args['learning_rate']
158 | 
159 |         self.animate = sample_trajectory_args['animate']
160 |         self.max_path_length = sample_trajectory_args['max_path_length']
161 |         self.min_timesteps_per_batch = sample_trajectory_args['min_timesteps_per_batch']
162 | 
163 |         self.gamma = estimate_return_args['gamma']
164 |         self.reward_to_go = estimate_return_args['reward_to_go']
165 |         self.nn_baseline = estimate_return_args['nn_baseline']
166 |         self.normalize_advantages = estimate_return_args['normalize_advantages']
167 | 
168 |         self.policy_net = PolicyNet(neural_network_args)
169 |         params = list(self.policy_net.parameters())
170 | 
171 |         #========================================================================================#
172 |         #                           ----------PROBLEM 6----------
173 |         # Optional Baseline
174 |         #
175 |         # Define a neural network baseline.
176 |         #========================================================================================#
177 |         if self.nn_baseline:
178 |             self.value_net = build_mlp(self.ob_dim, 1, self.n_layers, self.hidden_size)
179 |             params += list(self.value_net.parameters())
180 | 
181 |         self.optimizer = optim.Adam(params, lr=self.learning_rate)
182 |         
183 |     #========================================================================================#
184 |     #                           ----------PROBLEM 2----------
185 |     #========================================================================================#
186 |     def sample_action(self, ob_no):
187 |         """
188 |             Build the method used for sampling action from the policy distribution
189 |     
190 |             arguments:
191 |                 ob_no: (batch_size, self.ob_dim)
192 | 
193 |             returns:
194 |                 sampled_ac: 
195 |                     if discrete: (batch_size)
196 |                     if continuous: (batch_size, self.ac_dim)
197 | 
198 |             Hint: for the continuous case, use the reparameterization trick:
199 |                  The output from a Gaussian distribution with mean 'mu' and std 'sigma' is
200 |         
201 |                       mu + sigma * z,         z ~ N(0, I)
202 |         
203 |                  This reduces the problem to just sampling z. (Hint: use torch.normal!)
204 |         """
205 |         ts_ob_no = torch.from_numpy(ob_no).float()
206 |         
207 |         raise NotImplementedError
208 |         if self.discrete:
209 |             ts_logits_na = self.policy_net(ts_ob_no)
210 |             # YOUR_CODE_HERE
211 |             ts_sampled_ac = None
212 |         else:
213 |             ts_mean, ts_logstd = self.policy_net(ts_ob_no)
214 |             # YOUR_CODE_HERE
215 |             ts_sampled_ac = None
216 | 
217 |         sampled_ac = ts_sampled_ac.numpy()
218 |         return sampled_ac
219 | 
220 |     #========================================================================================#
221 |     #                           ----------PROBLEM 2----------
222 |     #========================================================================================#
223 |     def get_log_prob(self, policy_parameters, ts_ac_na):
224 |         """
225 |             Build the method used for computing the log probability of a set of actions
226 |             that were actually taken according to the policy
227 | 
228 |             arguments:
229 |                 policy_parameters
230 |                     if discrete: logits of a categorical distribution over actions 
231 |                         ts_logits_na: (batch_size, self.ac_dim)
232 |                     if continuous: (mean, log_std) of a Gaussian distribution over actions
233 |                         ts_mean: (batch_size, self.ac_dim)
234 |                         ts_logstd: (self.ac_dim,)
235 | 
236 |                 ts_ac_na: (batch_size, self.ac_dim)
237 | 
238 |             returns:
239 |                 ts_logprob_n: (batch_size)
240 | 
241 |             Hint:
242 |                 For the discrete case, use the log probability under a categorical distribution.
243 |                 For the continuous case, use the log probability under a multivariate gaussian.
244 |         """
245 |         raise NotImplementedError
246 |         if self.discrete:
247 |             ts_logits_na = policy_parameters
248 |             # YOUR_CODE_HERE
249 |             ts_logprob_n = None
250 |         else:
251 |             ts_mean, ts_logstd = policy_parameters
252 |             # YOUR_CODE_HERE
253 |             ts_logprob_n = None
254 |         return ts_logprob_n
255 | 
256 |     def sample_trajectories(self, itr, env):
257 |         # Collect paths until we have enough timesteps
258 |         timesteps_this_batch = 0
259 |         paths = []
260 |         while True:
261 |             animate_this_episode=(len(paths)==0 and (itr % 10 == 0) and self.animate)
262 |             path = self.sample_trajectory(env, animate_this_episode)
263 |             paths.append(path)
264 |             timesteps_this_batch += pathlength(path)
265 |             if timesteps_this_batch > self.min_timesteps_per_batch:
266 |                 break
267 |         return paths, timesteps_this_batch
268 | 
269 |     def sample_trajectory(self, env, animate_this_episode):
270 |         ob = env.reset()
271 |         obs, acs, rewards = [], [], []
272 |         steps = 0
273 |         while True:
274 |             if animate_this_episode:
275 |                 env.render()
276 |                 time.sleep(0.1)
277 |             obs.append(ob)
278 |             #====================================================================================#
279 |             #                           ----------PROBLEM 3----------
280 |             #====================================================================================#
281 |             raise NotImplementedError
282 |             ac = None # YOUR CODE HERE
283 |             ac = ac[0]
284 |             acs.append(ac)
285 |             ob, rew, done, _ = env.step(ac)
286 |             rewards.append(rew)
287 |             steps += 1
288 |             if done or steps > self.max_path_length:
289 |                 break
290 |         path = {"observation" : np.array(obs, dtype=np.float32), 
291 |                 "reward" : np.array(rewards, dtype=np.float32), 
292 |                 "action" : np.array(acs, dtype=np.float32)}
293 |         return path
294 | 
295 |     #====================================================================================#
296 |     #                           ----------PROBLEM 3----------
297 |     #====================================================================================#
298 |     def sum_of_rewards(self, re_n):
299 |         """
300 |             Monte Carlo estimation of the Q function.
301 | 
302 |             let sum_of_path_lengths be the sum of the lengths of the paths sampled from 
303 |                 Agent.sample_trajectories
304 |             let num_paths be the number of paths sampled from Agent.sample_trajectories
305 | 
306 |             arguments:
307 |                 re_n: length: num_paths. Each element in re_n is a numpy array 
308 |                     containing the rewards for the particular path
309 | 
310 |             returns:
311 |                 q_n: shape: (sum_of_path_lengths). A single vector for the estimated q values 
312 |                     whose length is the sum of the lengths of the paths
313 | 
314 |             ----------------------------------------------------------------------------------
315 |             
316 |             Your code should construct numpy arrays for Q-values which will be used to compute
317 |             advantages (which will in turn be fed to the placeholder you defined in 
318 |             Agent.define_placeholders). 
319 |             
320 |             Recall that the expression for the policy gradient PG is
321 |             
322 |                   PG = E_{tau} [sum_{t=0}^T grad log pi(a_t|s_t) * (Q_t - b_t )]
323 |             
324 |             where 
325 |             
326 |                   tau=(s_0, a_0, ...) is a trajectory,
327 |                   Q_t is the Q-value at time t, Q^{pi}(s_t, a_t),
328 |                   and b_t is a baseline which may depend on s_t. 
329 |             
330 |             You will write code for two cases, controlled by the flag 'reward_to_go':
331 |             
332 |               Case 1: trajectory-based PG 
333 |             
334 |                   (reward_to_go = False)
335 |             
336 |                   Instead of Q^{pi}(s_t, a_t), we use the total discounted reward summed over 
337 |                   entire trajectory (regardless of which time step the Q-value should be for). 
338 |             
339 |                   For this case, the policy gradient estimator is
340 |             
341 |                       E_{tau} [sum_{t=0}^T grad log pi(a_t|s_t) * Ret(tau)]
342 |             
343 |                   where
344 |             
345 |                       Ret(tau) = sum_{t'=0}^T gamma^t' r_{t'}.
346 |             
347 |                   Thus, you should compute
348 |             
349 |                       Q_t = Ret(tau)
350 |             
351 |               Case 2: reward-to-go PG 
352 |             
353 |                   (reward_to_go = True)
354 |             
355 |                   Here, you estimate Q^{pi}(s_t, a_t) by the discounted sum of rewards starting
356 |                   from time step t. Thus, you should compute
357 |             
358 |                       Q_t = sum_{t'=t}^T gamma^(t'-t) * r_{t'}
359 |             
360 |             
361 |             Store the Q-values for all timesteps and all trajectories in a variable 'q_n',
362 |             like the 'ob_no' and 'ac_na' above. 
363 |         """
364 |         # YOUR_CODE_HERE
365 |         if self.reward_to_go:
366 |             raise NotImplementedError
367 |         else:
368 |             raise NotImplementedError
369 |         return q_n
370 | 
371 |     def compute_advantage(self, ob_no, q_n):
372 |         """
373 |             Computes advantages by (possibly) subtracting a baseline from the estimated Q values
374 | 
375 |             let sum_of_path_lengths be the sum of the lengths of the paths sampled from 
376 |                 Agent.sample_trajectories
377 |             let num_paths be the number of paths sampled from Agent.sample_trajectories
378 | 
379 |             arguments:
380 |                 ob_no: shape: (sum_of_path_lengths, ob_dim)
381 |                 q_n: shape: (sum_of_path_lengths). A single vector for the estimated q values 
382 |                     whose length is the sum of the lengths of the paths
383 | 
384 |             returns:
385 |                 adv_n: shape: (sum_of_path_lengths). A single vector for the estimated 
386 |                     advantages whose length is the sum of the lengths of the paths
387 |         """
388 |         #====================================================================================#
389 |         #                           ----------PROBLEM 6----------
390 |         # Computing Baselines
391 |         #====================================================================================#
392 |         if self.nn_baseline:
393 |             # If nn_baseline is True, use your neural network to predict reward-to-go
394 |             # at each timestep for each trajectory, and save the result in a variable 'b_n'
395 |             # like 'ob_no', 'ac_na', and 'q_n'.
396 |             #
397 |             # Hint #bl1: rescale the output from the nn_baseline to match the statistics
398 |             # (mean and std) of the current batch of Q-values. (Goes with Hint
399 |             # #bl2 in Agent.update_parameters.
400 |             raise NotImplementedError
401 |             # YOUR CODE HERE
402 |             b_n = None 
403 |             adv_n = q_n - b_n
404 |         else:
405 |             adv_n = q_n.copy()
406 |         return adv_n
407 | 
408 |     def estimate_return(self, ob_no, re_n):
409 |         """
410 |             Estimates the returns over a set of trajectories.
411 | 
412 |             let sum_of_path_lengths be the sum of the lengths of the paths sampled from 
413 |                 Agent.sample_trajectories
414 |             let num_paths be the number of paths sampled from Agent.sample_trajectories
415 | 
416 |             arguments:
417 |                 ob_no: shape: (sum_of_path_lengths, ob_dim)
418 |                 re_n: length: num_paths. Each element in re_n is a numpy array 
419 |                     containing the rewards for the particular path
420 | 
421 |             returns:
422 |                 q_n: shape: (sum_of_path_lengths). A single vector for the estimated q values 
423 |                     whose length is the sum of the lengths of the paths
424 |                 adv_n: shape: (sum_of_path_lengths). A single vector for the estimated 
425 |                     advantages whose length is the sum of the lengths of the paths
426 |         """
427 |         q_n = self.sum_of_rewards(re_n)
428 |         adv_n = self.compute_advantage(ob_no, q_n)
429 |         #====================================================================================#
430 |         #                           ----------PROBLEM 3----------
431 |         # Advantage Normalization
432 |         #====================================================================================#
433 |         if self.normalize_advantages:
434 |             # On the next line, implement a trick which is known empirically to reduce variance
435 |             # in policy gradient methods: normalize adv_n to have mean zero and std=1.
436 |             raise NotImplementedError
437 |             adv_n = None # YOUR_CODE_HERE
438 |         return q_n, adv_n
439 | 
440 |     def update_parameters(self, ob_no, ac_na, q_n, adv_n):
441 |         """ 
442 |             Update the parameters of the policy and (possibly) the neural network baseline, 
443 |             which is trained to approximate the value function.
444 | 
445 |             arguments:
446 |                 ob_no: shape: (sum_of_path_lengths, ob_dim)
447 |                 ac_na: shape: (sum_of_path_lengths).
448 |                 q_n: shape: (sum_of_path_lengths). A single vector for the estimated q values 
449 |                     whose length is the sum of the lengths of the paths
450 |                 adv_n: shape: (sum_of_path_lengths). A single vector for the estimated 
451 |                     advantages whose length is the sum of the lengths of the paths
452 | 
453 |             returns:
454 |                 nothing
455 | 
456 |         """
457 |         # convert numpy array to pytorch tensor
458 |         ts_ob_no, ts_ac_na, ts_q_n, ts_adv_n = map(lambda x: torch.from_numpy(x), [ob_no, ac_na, q_n, adv_n])
459 | 
460 |         # The policy takes in an observation and produces a distribution over the action space
461 |         policy_parameters = self.policy_net(ts_ob_no)
462 | 
463 |         # We can compute the logprob of the actions that were actually taken by the policy
464 |         # This is used in the loss function.
465 |         ts_logprob_n = self.get_log_prob(policy_parameters, ts_ac_na)
466 | 
467 |         # clean the gradient for model parameters
468 |         self.optimizer.zero_grad()
469 |         
470 |         #========================================================================================#
471 |         #                           ----------PROBLEM 3----------
472 |         # Loss Function for Policy Gradient
473 |         #========================================================================================#
474 |         raise NotImplementedError
475 |         loss = None # YOUR CODE HERE
476 |         loss.backward()
477 |         
478 |         #====================================================================================#
479 |         #                           ----------PROBLEM 6----------
480 |         # Optimizing Neural Network Baseline
481 |         #====================================================================================#
482 |         if self.nn_baseline:
483 |             # If a neural network baseline is used, set up the targets and the output of the 
484 |             # baseline. 
485 |             # 
486 |             # Fit it to the current batch in order to use for the next iteration. Use the 
487 |             # self.value_net you defined earlier.
488 |             #
489 |             # Hint #bl2: Instead of trying to target raw Q-values directly, rescale the 
490 |             # targets to have mean zero and std=1. (Goes with Hint #bl1 in 
491 |             # Agent.compute_advantage.)
492 | 
493 |             # YOUR_CODE_HERE
494 |             raise NotImplementedError
495 |             baseline_prediction = None
496 |             ts_target_n = None
497 |             baseline_loss = None
498 |             baseline_loss.backward()
499 | 
500 |         #====================================================================================#
501 |         #                           ----------PROBLEM 3----------
502 |         # Performing the Policy Update
503 |         #====================================================================================#
504 | 
505 |         # Call the optimizer to perform the policy gradient update based on the current batch 
506 |         # of rollouts.
507 |         # 
508 |         # For debug purposes, you may wish to save the value of the loss function before
509 |         # and after an update, and then log them below. 
510 | 
511 |         # YOUR_CODE_HERE
512 |         raise NotImplementedError
513 | 
514 | def train_PG(
515 |         exp_name,
516 |         env_name,
517 |         n_iter, 
518 |         gamma, 
519 |         min_timesteps_per_batch, 
520 |         max_path_length,
521 |         learning_rate, 
522 |         reward_to_go, 
523 |         animate, 
524 |         logdir, 
525 |         normalize_advantages,
526 |         nn_baseline, 
527 |         seed,
528 |         n_layers,
529 |         size):
530 | 
531 |     start = time.time()
532 | 
533 |     #========================================================================================#
534 |     # Set Up Logger
535 |     #========================================================================================#
536 |     setup_logger(logdir, locals())
537 | 
538 |     #========================================================================================#
539 |     # Set Up Env
540 |     #========================================================================================#
541 | 
542 |     # Make the gym environment
543 |     env = gym.make(env_name)
544 | 
545 |     # Set random seeds
546 |     torch.manual_seed(seed)
547 |     np.random.seed(seed)
548 |     env.seed(seed)
549 | 
550 |     # Maximum length for episodes
551 |     max_path_length = max_path_length or env.spec.max_episode_steps
552 | 
553 |     # Is this env continuous, or self.discrete?
554 |     discrete = isinstance(env.action_space, gym.spaces.Discrete)
555 | 
556 |     # Observation and action sizes
557 |     ob_dim = env.observation_space.shape[0]
558 |     ac_dim = env.action_space.n if discrete else env.action_space.shape[0]
559 | 
560 |     #========================================================================================#
561 |     # Initialize Agent
562 |     #========================================================================================#
563 |     neural_network_args = {
564 |         'n_layers': n_layers,
565 |         'ob_dim': ob_dim,
566 |         'ac_dim': ac_dim,
567 |         'discrete': discrete,
568 |         'size': size,
569 |         'learning_rate': learning_rate,
570 |         }
571 | 
572 |     sample_trajectory_args = {
573 |         'animate': animate,
574 |         'max_path_length': max_path_length,
575 |         'min_timesteps_per_batch': min_timesteps_per_batch,
576 |     }
577 | 
578 |     estimate_return_args = {
579 |         'gamma': gamma,
580 |         'reward_to_go': reward_to_go,
581 |         'nn_baseline': nn_baseline,
582 |         'normalize_advantages': normalize_advantages,
583 |     }
584 | 
585 |     agent = Agent(neural_network_args, sample_trajectory_args, estimate_return_args)
586 | 
587 |     #========================================================================================#
588 |     # Training Loop
589 |     #========================================================================================#
590 | 
591 |     total_timesteps = 0
592 |     for itr in range(n_iter):
593 |         print("********** Iteration %i ************"%itr)
594 |         
595 |         with torch.no_grad(): # use torch.no_grad to disable the gradient calculation
596 |             paths, timesteps_this_batch = agent.sample_trajectories(itr, env)
597 |         total_timesteps += timesteps_this_batch
598 | 
599 |         # Build arrays for observation, action for the policy gradient update by concatenating 
600 |         # across paths
601 |         ob_no = np.concatenate([path["observation"] for path in paths])
602 |         ac_na = np.concatenate([path["action"] for path in paths])
603 |         re_n = [path["reward"] for path in paths]
604 | 
605 |         with torch.no_grad():
606 |             q_n, adv_n = agent.estimate_return(ob_no, re_n)
607 |             
608 |         agent.update_parameters(ob_no, ac_na, q_n, adv_n)
609 | 
610 |         # Log diagnostics
611 |         returns = [path["reward"].sum() for path in paths]
612 |         ep_lengths = [pathlength(path) for path in paths]
613 |         logz.log_tabular("Time", time.time() - start)
614 |         logz.log_tabular("Iteration", itr)
615 |         logz.log_tabular("AverageReturn", np.mean(returns))
616 |         logz.log_tabular("StdReturn", np.std(returns))
617 |         logz.log_tabular("MaxReturn", np.max(returns))
618 |         logz.log_tabular("MinReturn", np.min(returns))
619 |         logz.log_tabular("EpLenMean", np.mean(ep_lengths))
620 |         logz.log_tabular("EpLenStd", np.std(ep_lengths))
621 |         logz.log_tabular("TimestepsThisBatch", timesteps_this_batch)
622 |         logz.log_tabular("TimestepsSoFar", total_timesteps)
623 |         logz.dump_tabular()
624 |         logz.save_pytorch_model(agent)
625 | 
626 | 
627 | def main():
628 |     import argparse
629 |     parser = argparse.ArgumentParser()
630 |     parser.add_argument('env_name', type=str)
631 |     parser.add_argument('--exp_name', type=str, default='vpg')
632 |     parser.add_argument('--render', action='store_true')
633 |     parser.add_argument('--discount', type=float, default=1.0)
634 |     parser.add_argument('--n_iter', '-n', type=int, default=100)
635 |     parser.add_argument('--batch_size', '-b', type=int, default=1000)
636 |     parser.add_argument('--ep_len', '-ep', type=float, default=-1.)
637 |     parser.add_argument('--learning_rate', '-lr', type=float, default=5e-3)
638 |     parser.add_argument('--reward_to_go', '-rtg', action='store_true')
639 |     parser.add_argument('--dont_normalize_advantages', '-dna', action='store_true')
640 |     parser.add_argument('--nn_baseline', '-bl', action='store_true')
641 |     parser.add_argument('--seed', type=int, default=1)
642 |     parser.add_argument('--n_experiments', '-e', type=int, default=1)
643 |     parser.add_argument('--n_layers', '-l', type=int, default=2)
644 |     parser.add_argument('--size', '-s', type=int, default=64)
645 |     args = parser.parse_args()
646 | 
647 |     if not(os.path.exists('data')):
648 |         os.makedirs('data')
649 |     logdir = args.exp_name + '_' + args.env_name + '_' + time.strftime("%d-%m-%Y_%H-%M-%S")
650 |     logdir = os.path.join('data', logdir)
651 |     if not(os.path.exists(logdir)):
652 |         os.makedirs(logdir)
653 | 
654 |     max_path_length = args.ep_len if args.ep_len > 0 else None
655 | 
656 |     processes = []
657 | 
658 |     for e in range(args.n_experiments):
659 |         seed = args.seed + 10*e
660 |         print('Running experiment with seed %d'%seed)
661 | 
662 |         def train_func():
663 |             train_PG(
664 |                 exp_name=args.exp_name,
665 |                 env_name=args.env_name,
666 |                 n_iter=args.n_iter,
667 |                 gamma=args.discount,
668 |                 min_timesteps_per_batch=args.batch_size,
669 |                 max_path_length=max_path_length,
670 |                 learning_rate=args.learning_rate,
671 |                 reward_to_go=args.reward_to_go,
672 |                 animate=args.render,
673 |                 logdir=os.path.join(logdir,'%d'%seed),
674 |                 normalize_advantages=not(args.dont_normalize_advantages),
675 |                 nn_baseline=args.nn_baseline, 
676 |                 seed=seed,
677 |                 n_layers=args.n_layers,
678 |                 size=args.size
679 |                 )
680 |         p = Process(target=train_func, args=tuple())
681 |         p.start()
682 |         processes.append(p)
683 |         # if you comment in the line below, then the loop will block 
684 |         # until this process finishes
685 |         # p.join()
686 | 
687 |     for p in processes:
688 |         p.join()
689 | 
690 | if __name__ == "__main__":
691 |     main()
692 | 


--------------------------------------------------------------------------------
/hw3/README.md:
--------------------------------------------------------------------------------
 1 | # CS294-112 HW 3: Q-Learning
 2 | 
 3 | Modifications:
 4 | 
 5 | In general, we followed the code structure of the original version and modified the neural network part to pytorch. 
 6 | 
 7 | Because of the different between the static graphs framework and the dynamic graphs framework, we merged and added some code. For the instructions, you can generally follow the original PDF version, and we have adapted the comments in the code for pytorch to help you finish this assignment.
 8 | 
 9 | ------
10 | 
11 | Dependencies:
12 | 
13 |  * Python **3.5**
14 |  * Numpy version **1.14.5**
15 |  * Pytorch version **0.4.0**
16 |  * MuJoCo version **1.50** and mujoco-py **1.50.1.56**
17 |  * OpenAI Gym version **0.10.5**
18 |  * seaborn
19 |  * Box2D==**2.3.2**
20 |  * OpenCV
21 |  * ffmpeg
22 | 
23 | Before doing anything, first replace `gym/envs/box2d/lunar_lander.py` with the provided `lunar_lander.py` file.
24 | 
25 | The only files that you need to look at are `dqn.py` and `train_ac_f18.py`, which you will implement.
26 | 
27 | See the [HW3 PDF](./hw3_instructions.pdf) for further instructions.
28 | 
29 | The starter code was based on an implementation of Q-learning for Atari generously provided by Szymon Sidor from OpenAI.
30 | 


--------------------------------------------------------------------------------
/hw3/atari_wrappers.py:
--------------------------------------------------------------------------------
  1 | import cv2
  2 | import numpy as np
  3 | from collections import deque
  4 | import gym
  5 | from gym import spaces
  6 | 
  7 | 
  8 | class NoopResetEnv(gym.Wrapper):
  9 |     def __init__(self, env=None, noop_max=30):
 10 |         """Sample initial states by taking random number of no-ops on reset.
 11 |         No-op is assumed to be action 0.
 12 |         """
 13 |         super(NoopResetEnv, self).__init__(env)
 14 |         self.noop_max = noop_max
 15 |         assert env.unwrapped.get_action_meanings()[0] == 'NOOP'
 16 | 
 17 |     def _reset(self):
 18 |         """ Do no-op action for a number of steps in [1, noop_max]."""
 19 |         self.env.reset()
 20 |         noops = np.random.randint(1, self.noop_max + 1)
 21 |         for _ in range(noops):
 22 |             obs, _, _, _ = self.env.step(0)
 23 |         return obs
 24 | 
 25 | class FireResetEnv(gym.Wrapper):
 26 |     def __init__(self, env=None):
 27 |         """Take action on reset for environments that are fixed until firing."""
 28 |         super(FireResetEnv, self).__init__(env)
 29 |         assert env.unwrapped.get_action_meanings()[1] == 'FIRE'
 30 |         assert len(env.unwrapped.get_action_meanings()) >= 3
 31 | 
 32 |     def _reset(self):
 33 |         self.env.reset()
 34 |         obs, _, _, _ = self.env.step(1)
 35 |         obs, _, _, _ = self.env.step(2)
 36 |         return obs
 37 | 
 38 | class EpisodicLifeEnv(gym.Wrapper):
 39 |     def __init__(self, env=None):
 40 |         """Make end-of-life == end-of-episode, but only reset on true game over.
 41 |         Done by DeepMind for the DQN and co. since it helps value estimation.
 42 |         """
 43 |         super(EpisodicLifeEnv, self).__init__(env)
 44 |         self.lives = 0
 45 |         self.was_real_done  = True
 46 |         self.was_real_reset = False
 47 | 
 48 |     def _step(self, action):
 49 |         obs, reward, done, info = self.env.step(action)
 50 |         self.was_real_done = done
 51 |         # check current lives, make loss of life terminal,
 52 |         # then update lives to handle bonus lives
 53 |         lives = self.env.unwrapped.ale.lives()
 54 |         if lives < self.lives and lives > 0:
 55 |             # for Qbert somtimes we stay in lives == 0 condtion for a few frames
 56 |             # so its important to keep lives > 0, so that we only reset once
 57 |             # the environment advertises done.
 58 |             done = True
 59 |         self.lives = lives
 60 |         return obs, reward, done, info
 61 | 
 62 |     def _reset(self):
 63 |         """Reset only when lives are exhausted.
 64 |         This way all states are still reachable even though lives are episodic,
 65 |         and the learner need not know about any of this behind-the-scenes.
 66 |         """
 67 |         if self.was_real_done:
 68 |             obs = self.env.reset()
 69 |             self.was_real_reset = True
 70 |         else:
 71 |             # no-op step to advance from terminal/lost life state
 72 |             obs, _, _, _ = self.env.step(0)
 73 |             self.was_real_reset = False
 74 |         self.lives = self.env.unwrapped.ale.lives()
 75 |         return obs
 76 | 
 77 | class MaxAndSkipEnv(gym.Wrapper):
 78 |     def __init__(self, env=None, skip=4):
 79 |         """Return only every `skip`-th frame"""
 80 |         super(MaxAndSkipEnv, self).__init__(env)
 81 |         # most recent raw observations (for max pooling across time steps)
 82 |         self._obs_buffer = deque(maxlen=2)
 83 |         self._skip       = skip
 84 | 
 85 |     def _step(self, action):
 86 |         total_reward = 0.0
 87 |         done = None
 88 |         for _ in range(self._skip):
 89 |             obs, reward, done, info = self.env.step(action)
 90 |             self._obs_buffer.append(obs)
 91 |             total_reward += reward
 92 |             if done:
 93 |                 break
 94 | 
 95 |         max_frame = np.max(np.stack(self._obs_buffer), axis=0)
 96 | 
 97 |         return max_frame, total_reward, done, info
 98 | 
 99 |     def _reset(self):
100 |         """Clear past frame buffer and init. to first obs. from inner env."""
101 |         self._obs_buffer.clear()
102 |         obs = self.env.reset()
103 |         self._obs_buffer.append(obs)
104 |         return obs
105 | 
106 | def _process_frame84(frame):
107 |     img = np.reshape(frame, [210, 160, 3]).astype(np.float32)
108 |     img = img[:, :, 0] * 0.299 + img[:, :, 1] * 0.587 + img[:, :, 2] * 0.114
109 |     resized_screen = cv2.resize(img, (84, 110),  interpolation=cv2.INTER_LINEAR)
110 |     x_t = resized_screen[18:102, :]
111 |     x_t = np.reshape(x_t, [84, 84, 1])
112 |     return x_t.astype(np.uint8)
113 | 
114 | class ProcessFrame84(gym.Wrapper):
115 |     def __init__(self, env=None):
116 |         super(ProcessFrame84, self).__init__(env)
117 |         self.observation_space = spaces.Box(low=0, high=255, shape=(84, 84, 1))
118 | 
119 |     def _step(self, action):
120 |         obs, reward, done, info = self.env.step(action)
121 |         return _process_frame84(obs), reward, done, info
122 | 
123 |     def _reset(self):
124 |         return _process_frame84(self.env.reset())
125 | 
126 | class ClippedRewardsWrapper(gym.Wrapper):
127 |     def _step(self, action):
128 |         obs, reward, done, info = self.env.step(action)
129 |         return obs, np.sign(reward), done, info
130 | 
131 | def wrap_deepmind_ram(env):
132 |     env = EpisodicLifeEnv(env)
133 |     env = NoopResetEnv(env, noop_max=30)
134 |     env = MaxAndSkipEnv(env, skip=4)
135 |     if 'FIRE' in env.unwrapped.get_action_meanings():
136 |         env = FireResetEnv(env)
137 |     env = ClippedRewardsWrapper(env)
138 |     return env
139 | 
140 | def wrap_deepmind(env):
141 |     assert 'NoFrameskip' in env.spec.id
142 |     env = EpisodicLifeEnv(env)
143 |     env = NoopResetEnv(env, noop_max=30)
144 |     env = MaxAndSkipEnv(env, skip=4)
145 |     if 'FIRE' in env.unwrapped.get_action_meanings():
146 |         env = FireResetEnv(env)
147 |     env = ProcessFrame84(env)
148 |     env = ClippedRewardsWrapper(env)
149 |     return env
150 | 


--------------------------------------------------------------------------------
/hw3/dqn.py:
--------------------------------------------------------------------------------
  1 | import time
  2 | import pickle
  3 | import sys
  4 | import gym.spaces
  5 | import logz
  6 | import numpy as np
  7 | import random
  8 | import torch
  9 | import torch.nn.functional as F
 10 | from torch import nn, optim
 11 | from collections import namedtuple
 12 | from dqn_utils import LinearSchedule, ReplayBuffer, get_wrapper_by_name
 13 | 
 14 | OptimizerSpec = namedtuple("OptimizerSpec", ["constructor", "kwargs", "lr_lambda"])
 15 |  
 16 | 
 17 | class QLearner(object):
 18 | 
 19 |   def __init__(
 20 |     self,
 21 |     env,
 22 |     q_func,
 23 |     optimizer_spec,
 24 |     exploration=LinearSchedule(1000000, 0.1),
 25 |     stopping_criterion=None,
 26 |     replay_buffer_size=1000000,
 27 |     batch_size=32,
 28 |     gamma=0.99,
 29 |     learning_starts=50000,
 30 |     learning_freq=4,
 31 |     frame_history_len=4,
 32 |     target_update_freq=10000,
 33 |     grad_norm_clipping=10,
 34 |     double_q=True,
 35 |     lander=False):
 36 |     """Run Deep Q-learning algorithm.
 37 | 
 38 |     You can specify your own convnet using q_func.
 39 | 
 40 |     All schedules are w.r.t. total number of steps taken in the environment.
 41 | 
 42 |     Parameters
 43 |     ----------
 44 |     env: gym.Env
 45 |         gym environment to train on.
 46 |     q_func: function
 47 |         Model to use for computing the q function. It should accept the
 48 |         following named arguments:
 49 |             in_channels: int
 50 |                 number of channels for the input
 51 |             num_actions: int
 52 |                 number of actions
 53 |     optimizer_spec: OptimizerSpec
 54 |         Specifying the constructor and kwargs, as well as learning rate schedule
 55 |         for the optimizer
 56 |     exploration: rl_algs.deepq.utils.schedules.Schedule
 57 |         schedule for probability of chosing random action.
 58 |     stopping_criterion: (env, t) -> bool
 59 |         should return true when it's ok for the RL algorithm to stop.
 60 |         takes in env and the number of steps executed so far.
 61 |     replay_buffer_size: int
 62 |         How many memories to store in the replay buffer.
 63 |     batch_size: int
 64 |         How many transitions to sample each time experience is replayed.
 65 |     gamma: float
 66 |         Discount Factor
 67 |     learning_starts: int
 68 |         After how many environment steps to start replaying experiences
 69 |     learning_freq: int
 70 |         How many steps of environment to take between every experience replay
 71 |     frame_history_len: int
 72 |         How many past frames to include as input to the model.
 73 |     target_update_freq: int
 74 |         How many experience replay rounds (not steps!) to perform between
 75 |         each update to the target Q network
 76 |     grad_norm_clipping: float or None
 77 |         If not None gradients' norms are clipped to this value.
 78 |     double_q: bool
 79 |         If True, then use double Q-learning to compute target values. Otherwise, use vanilla DQN.
 80 |         https://papers.nips.cc/paper/3964-double-q-learning.pdf
 81 |     """
 82 |     assert type(env.observation_space) == gym.spaces.Box
 83 |     assert type(env.action_space)      == gym.spaces.Discrete
 84 | 
 85 |     self.target_update_freq = target_update_freq
 86 |     self.optimizer_spec = optimizer_spec
 87 |     self.batch_size = batch_size
 88 |     self.learning_freq = learning_freq
 89 |     self.learning_starts = learning_starts
 90 |     self.stopping_criterion = stopping_criterion
 91 |     self.env = env
 92 |     self.exploration = exploration
 93 |     self.gamma = gamma
 94 |     self.double_q = double_q
 95 |     self.device = torch.device('cuda' if torch.cuda.is_available else 'cpu')
 96 |     
 97 |     ###############
 98 |     # BUILD MODEL #
 99 |     ###############
100 | 
101 |     if len(self.env.observation_space.shape) == 1:
102 |         # This means we are running on low-dimensional observations (e.g. RAM)
103 |         in_features = self.env.observation_space.shape[0]
104 |     else:
105 |         img_h, img_w, img_c = self.env.observation_space.shape
106 |         in_features = frame_history_len * img_c
107 |     self.num_actions = self.env.action_space.n
108 | 
109 |     # define deep Q network and target Q network
110 |     self.q_net = q_func(in_features, self.num_actions).to(self.device)
111 |     self.target_q_net = q_func(in_features, self.num_actions).to(self.device)
112 | 
113 |     # construct optimization op (with gradient clipping)
114 |     parameters = self.q_net.parameters()
115 |     self.optimizer = self.optimizer_spec.constructor(parameters, lr=1, 
116 |                                                      **self.optimizer_spec.kwargs)
117 |     self.lr_scheduler = optim.lr_scheduler.LambdaLR(self.optimizer, self.optimizer_spec.lr_lambda)
118 |     # clip_grad_norm_fn will be called before doing gradient decent
119 |     self.clip_grad_norm_fn = lambda : nn.utils.clip_grad_norm_(parameters, max_norm=grad_norm_clipping)
120 | 
121 |     # update_target_fn will be called periodically to copy Q network to target Q network
122 |     self.update_target_fn = lambda : self.target_q_net.load_state_dict(self.q_net.state_dict())
123 | 
124 |     # construct the replay buffer
125 |     self.replay_buffer = ReplayBuffer(replay_buffer_size, frame_history_len, lander=lander)
126 |     self.replay_buffer_idx = None
127 | 
128 |     ###############
129 |     # RUN ENV     #
130 |     ###############
131 |     self.model_initialized = False
132 |     self.num_param_updates = 0
133 |     self.mean_episode_reward      = -float('nan')
134 |     self.best_mean_episode_reward = -float('inf')
135 |     self.last_obs = self.env.reset()
136 |     self.log_every_n_steps = 10000
137 | 
138 |     self.start_time = time.time()
139 |     self.t = 0
140 | 
141 |   def calc_loss(self, obs, ac, rw, nxobs, done):
142 |     """
143 |         Calculate the loss for a batch of transitions. 
144 | 
145 |         Here, you should fill in your own code to compute the Bellman error. This requires
146 |         evaluating the current and next Q-values and constructing the corresponding error.
147 | 
148 |         arguments:
149 |             ob: The observation for current step
150 |             ac: The corresponding action for current step
151 |             rw: The reward for each timestep
152 |             nxob: The observation after taking one step forward
153 |             done: The mask for terminal state. This value is 1 if the next state corresponds to
154 |                   the end of an episode, in which case there is no Q-value at the next state;
155 |                   at the end of an episode, only the current state reward contributes to the target,
156 |                   not the next state Q-value (i.e. target is just rew_t_ph, not rew_t_ph + gamma * q_tp1)
157 | 
158 |             inputs are generated from self.replay_buffer.sample, you can refer the code in dqn_utils.py
159 |             for more details 
160 | 
161 |         returns:
162 |             a scalar tensor represent the loss
163 | 
164 |         Hint: use smooth_l1_loss (a.k.a huber_loss) instead of mean squared error.
165 |               use self.double_q to switch between double DQN and vanilla DQN.
166 |     """
167 |     
168 |     # YOUR CODE HERE
169 |     
170 |     
171 |   def stopping_criterion_met(self):
172 |     return self.stopping_criterion is not None and self.stopping_criterion(self.env, self.t)
173 | 
174 |   def step_env(self):
175 |     ### 2. Step the env and store the transition
176 |     # At this point, "self.last_obs" contains the latest observation that was
177 |     # recorded from the simulator. Here, your code needs to store this
178 |     # observation and its outcome (reward, next observation, etc.) into
179 |     # the replay buffer while stepping the simulator forward one step.
180 |     # At the end of this block of code, the simulator should have been
181 |     # advanced one step, and the replay buffer should contain one more
182 |     # transition.
183 |     # Specifically, self.last_obs must point to the new latest observation.
184 |     # Useful functions you'll need to call:
185 |     # obs, reward, done, info = env.step(action)
186 |     # this steps the environment forward one step
187 |     # obs = env.reset()
188 |     # this resets the environment if you reached an episode boundary.
189 |     # Don't forget to call env.reset() to get a new observation if done
190 |     # is true!!
191 |     # Note that you cannot use "self.last_obs" directly as input
192 |     # into your network, since it needs to be processed to include context
193 |     # from previous frames. You should check out the replay buffer
194 |     # implementation in dqn_utils.py to see what functionality the replay
195 |     # buffer exposes. The replay buffer has a function called
196 |     # encode_recent_observation that will take the latest observation
197 |     # that you pushed into the buffer and compute the corresponding
198 |     # input that should be given to a Q network by appending some
199 |     # previous frames.
200 |     # Don't forget to include epsilon greedy exploration!
201 |     # And remember that the first time you enter this loop, the model
202 |     # may not yet have been initialized (but of course, the first step
203 |     # might as well be random, since you haven't trained your net...)
204 | 
205 |     #####
206 | 
207 |     # YOUR CODE HERE
208 |     
209 | 
210 |   def update_model(self):
211 |     ### 3. Perform experience replay and train the network.
212 |     # note that this is only done if the replay buffer contains enough samples
213 |     # for us to learn something useful -- until then, the model will not be
214 |     # initialized and random actions should be taken
215 |     self.lr_scheduler.step()
216 |     
217 |     if (self.t > self.learning_starts and \
218 |         self.t % self.learning_freq == 0 and \
219 |         self.replay_buffer.can_sample(self.batch_size)):
220 |       
221 |       # Here, you should perform training. Training consists of four steps:
222 |       # 3.a: use the replay buffer to sample a batch of transitions (see the
223 |       # replay buffer code for function definition, each batch that you sample
224 |       # should consist of current observations, current actions, rewards,
225 |       # next observations, and done indicator).
226 |       # 3.b: set the self.model_initialized to True. Because the newwork in starting
227 |       # to train, and you will use it to take action in self.step_env.
228 |       # 3.c: train the model. To do this, you'll need to use the self.optimizer and
229 |       # self.calc_loss that were created earlier: self.calc_loss is what you
230 |       # created to compute the total Bellman error in a batch, and self.optimizer
231 |       # will actually perform a gradient step and update the network parameters
232 |       # to reduce the loss. 
233 |       # Before your optimizer take step, don`t forget to call self.clip_grad_norm_fn
234 |       # to perform gradient clipping.
235 |       # 3.d: periodically update the target network by calling self.update_target_fn
236 |       # you should update every target_update_freq steps, and you may find the
237 |       # variable self.num_param_updates useful for this (it was initialized to 0)
238 |       #####
239 |       
240 |       # YOUR CODE HERE
241 |       
242 | 
243 |       self.num_param_updates += 1
244 | 
245 |     self.t += 1
246 | 
247 |   def log_progress(self):
248 |     episode_rewards = get_wrapper_by_name(self.env, "Monitor").get_episode_rewards()
249 | 
250 |     if len(episode_rewards) > 0:
251 |       self.mean_episode_reward = np.mean(episode_rewards[-100:])
252 | 
253 |     if len(episode_rewards) > 100:
254 |       self.best_mean_episode_reward = max(self.best_mean_episode_reward, self.mean_episode_reward)
255 | 
256 |     if self.t % self.log_every_n_steps == 0 and self.model_initialized:
257 |       logz.log_tabular("TimeStep", self.t)
258 |       logz.log_tabular("MeanReturn", self.mean_episode_reward)
259 |       logz.log_tabular("BestMeanReturn", max(self.best_mean_episode_reward, self.mean_episode_reward))
260 |       logz.log_tabular("Episodes", len(episode_rewards))
261 |       logz.log_tabular("Exploration", self.exploration.value(self.t))
262 |       logz.log_tabular("LearningRate", self.optimizer_spec.lr_lambda(self.t))
263 |       logz.log_tabular("Time", (time.time() - self.start_time) / 60.)
264 |       logz.dump_tabular()
265 |       logz.save_pytorch_model(self.q_net)
266 |       
267 | def learn(*args, **kwargs):
268 |   alg = QLearner(*args, **kwargs)
269 |   while not alg.stopping_criterion_met():
270 |     alg.step_env()
271 |     # at this point, the environment should have been advanced one step (and
272 |     # reset if done was true), and self.last_obs should point to the new latest
273 |     # observation
274 |     alg.update_model()
275 |     alg.log_progress()
276 | 
277 | 


--------------------------------------------------------------------------------
/hw3/dqn_utils.py:
--------------------------------------------------------------------------------
  1 | """This file includes a collection of utility functions that are useful for
  2 | implementing DQN."""
  3 | import gym
  4 | import numpy as np
  5 | import random
  6 | 
  7 | def sample_n_unique(sampling_f, n):
  8 |     """Helper function. Given a function `sampling_f` that returns
  9 |     comparable objects, sample n such unique objects.
 10 |     """
 11 |     res = []
 12 |     while len(res) < n:
 13 |         candidate = sampling_f()
 14 |         if candidate not in res:
 15 |             res.append(candidate)
 16 |     return res
 17 | 
 18 | class Schedule(object):
 19 |     def value(self, t):
 20 |         """Value of the schedule at time t"""
 21 |         raise NotImplementedError()
 22 | 
 23 | class ConstantSchedule(object):
 24 |     def __init__(self, value):
 25 |         """Value remains constant over time.
 26 |         Parameters
 27 |         ----------
 28 |         value: float
 29 |             Constant value of the schedule
 30 |         """
 31 |         self._v = value
 32 | 
 33 |     def value(self, t):
 34 |         """See Schedule.value"""
 35 |         return self._v
 36 | 
 37 | def linear_interpolation(l, r, alpha):
 38 |     return l + alpha * (r - l)
 39 | 
 40 | class PiecewiseSchedule(object):
 41 |     def __init__(self, endpoints, interpolation=linear_interpolation, outside_value=None):
 42 |         """Piecewise schedule.
 43 |         endpoints: [(int, int)]
 44 |             list of pairs `(time, value)` meanining that schedule should output
 45 |             `value` when `t==time`. All the values for time must be sorted in
 46 |             an increasing order. When t is between two times, e.g. `(time_a, value_a)`
 47 |             and `(time_b, value_b)`, such that `time_a <= t < time_b` then value outputs
 48 |             `interpolation(value_a, value_b, alpha)` where alpha is a fraction of
 49 |             time passed between `time_a` and `time_b` for time `t`.
 50 |         interpolation: lambda float, float, float: float
 51 |             a function that takes value to the left and to the right of t according
 52 |             to the `endpoints`. Alpha is the fraction of distance from left endpoint to
 53 |             right endpoint that t has covered. See linear_interpolation for example.
 54 |         outside_value: float
 55 |             if the value is requested outside of all the intervals sepecified in
 56 |             `endpoints` this value is returned. If None then AssertionError is
 57 |             raised when outside value is requested.
 58 |         """
 59 |         idxes = [e[0] for e in endpoints]
 60 |         assert idxes == sorted(idxes)
 61 |         self._interpolation = interpolation
 62 |         self._outside_value = outside_value
 63 |         self._endpoints     = endpoints
 64 | 
 65 |     def value(self, t):
 66 |         """See Schedule.value"""
 67 |         for (l_t, l), (r_t, r) in zip(self._endpoints[:-1], self._endpoints[1:]):
 68 |             if l_t <= t and t < r_t:
 69 |                 alpha = float(t - l_t) / (r_t - l_t)
 70 |                 return self._interpolation(l, r, alpha)
 71 | 
 72 |         # t does not belong to any of the pieces, so doom.
 73 |         assert self._outside_value is not None
 74 |         return self._outside_value
 75 | 
 76 | class LinearSchedule(object):
 77 |     def __init__(self, schedule_timesteps, final_p, initial_p=1.0):
 78 |         """Linear interpolation between initial_p and final_p over
 79 |         schedule_timesteps. After this many timesteps pass final_p is
 80 |         returned.
 81 |         Parameters
 82 |         ----------
 83 |         schedule_timesteps: int
 84 |             Number of timesteps for which to linearly anneal initial_p
 85 |             to final_p
 86 |         initial_p: float
 87 |             initial output value
 88 |         final_p: float
 89 |             final output value
 90 |         """
 91 |         self.schedule_timesteps = schedule_timesteps
 92 |         self.final_p            = final_p
 93 |         self.initial_p          = initial_p
 94 | 
 95 |     def value(self, t):
 96 |         """See Schedule.value"""
 97 |         fraction  = min(float(t) / self.schedule_timesteps, 1.0)
 98 |         return self.initial_p + fraction * (self.final_p - self.initial_p)
 99 | 
100 | 
101 | def get_wrapper_by_name(env, classname):
102 |     currentenv = env
103 |     while True:
104 |         if classname in currentenv.__class__.__name__:
105 |             return currentenv
106 |         elif isinstance(env, gym.Wrapper):
107 |             currentenv = currentenv.env
108 |         else:
109 |             raise ValueError("Couldn't find wrapper named %s"%classname)
110 | 
111 | class ReplayBuffer(object):
112 |     def __init__(self, size, frame_history_len, lander=False):
113 |         """This is a memory efficient implementation of the replay buffer.
114 | 
115 |         The sepecific memory optimizations use here are:
116 |             - only store each frame once rather than k times
117 |               even if every observation normally consists of k last frames
118 |             - store frames as np.uint8 (actually it is most time-performance
119 |               to cast them back to float32 on GPU to minimize memory transfer
120 |               time)
121 |             - store frame_t and frame_(t+1) in the same buffer.
122 | 
123 |         For the tipical use case in Atari Deep RL buffer with 1M frames the total
124 |         memory footprint of this buffer is 10^6 * 84 * 84 bytes ~= 7 gigabytes
125 | 
126 |         Warning! Assumes that returning frame of zeros at the beginning
127 |         of the episode, when there is less frames than `frame_history_len`,
128 |         is acceptable.
129 | 
130 |         Parameters
131 |         ----------
132 |         size: int
133 |             Max number of transitions to store in the buffer. When the buffer
134 |             overflows the old memories are dropped.
135 |         frame_history_len: int
136 |             Number of memories to be retried for each observation.
137 |         """
138 |         self.lander = lander
139 | 
140 |         self.size = size
141 |         self.frame_history_len = frame_history_len
142 | 
143 |         self.next_idx      = 0
144 |         self.num_in_buffer = 0
145 | 
146 |         self.obs      = None
147 |         self.action   = None
148 |         self.reward   = None
149 |         self.done     = None
150 | 
151 |     def can_sample(self, batch_size):
152 |         """Returns true if `batch_size` different transitions can be sampled from the buffer."""
153 |         return batch_size + 1 <= self.num_in_buffer
154 | 
155 |     def _encode_sample(self, idxes):
156 |         obs_batch      = np.concatenate([self._encode_observation(idx)[None] for idx in idxes], 0)
157 |         act_batch      = self.action[idxes]
158 |         rew_batch      = self.reward[idxes]
159 |         next_obs_batch = np.concatenate([self._encode_observation(idx + 1)[None] for idx in idxes], 0)
160 |         done_mask      = np.array([1.0 if self.done[idx] else 0.0 for idx in idxes], dtype=np.float32)
161 | 
162 |         return obs_batch, act_batch, rew_batch, next_obs_batch, done_mask
163 | 
164 | 
165 |     def sample(self, batch_size):
166 |         """Sample `batch_size` different transitions.
167 | 
168 |         i-th sample transition is the following:
169 | 
170 |         when observing `obs_batch[i]`, action `act_batch[i]` was taken,
171 |         after which reward `rew_batch[i]` was received and subsequent
172 |         observation  next_obs_batch[i] was observed, unless the epsiode
173 |         was done which is represented by `done_mask[i]` which is equal
174 |         to 1 if episode has ended as a result of that action.
175 | 
176 |         Parameters
177 |         ----------
178 |         batch_size: int
179 |             How many transitions to sample.
180 | 
181 |         Returns
182 |         -------
183 |         obs_batch: np.array
184 |             Array of shape
185 |             (batch_size, img_h, img_w, img_c * frame_history_len)
186 |             and dtype np.uint8
187 |         act_batch: np.array
188 |             Array of shape (batch_size,) and dtype np.int32
189 |         rew_batch: np.array
190 |             Array of shape (batch_size,) and dtype np.float32
191 |         next_obs_batch: np.array
192 |             Array of shape
193 |             (batch_size, img_h, img_w, img_c * frame_history_len)
194 |             and dtype np.uint8
195 |         done_mask: np.array
196 |             Array of shape (batch_size,) and dtype np.float32
197 |         """
198 |         assert self.can_sample(batch_size)
199 |         idxes = sample_n_unique(lambda: random.randint(0, self.num_in_buffer - 2), batch_size)
200 |         return self._encode_sample(idxes)
201 | 
202 |     def encode_recent_observation(self):
203 |         """Return the most recent `frame_history_len` frames.
204 | 
205 |         Returns
206 |         -------
207 |         observation: np.array
208 |             Array of shape (img_h, img_w, img_c * frame_history_len)
209 |             and dtype np.uint8, where observation[:, :, i*img_c:(i+1)*img_c]
210 |             encodes frame at time `t - frame_history_len + i`
211 |         """
212 |         assert self.num_in_buffer > 0
213 |         return self._encode_observation((self.next_idx - 1) % self.size)
214 | 
215 |     def _encode_observation(self, idx):
216 |         end_idx   = idx + 1 # make noninclusive
217 |         start_idx = end_idx - self.frame_history_len
218 |         # this checks if we are using low-dimensional observations, such as RAM
219 |         # state, in which case we just directly return the latest RAM.
220 |         if len(self.obs.shape) == 2:
221 |             return self.obs[end_idx-1]
222 |         # if there weren't enough frames ever in the buffer for context
223 |         if start_idx < 0 and self.num_in_buffer != self.size:
224 |             start_idx = 0
225 |         for idx in range(start_idx, end_idx - 1):
226 |             if self.done[idx % self.size]:
227 |                 start_idx = idx + 1
228 |         missing_context = self.frame_history_len - (end_idx - start_idx)
229 |         # if zero padding is needed for missing context
230 |         # or we are on the boundry of the buffer
231 |         if start_idx < 0 or missing_context > 0:
232 |             frames = [np.zeros_like(self.obs[0]) for _ in range(missing_context)]
233 |             for idx in range(start_idx, end_idx):
234 |                 frames.append(self.obs[idx % self.size])
235 |             return np.concatenate(frames, 2)
236 |         else:
237 |             # this optimization has potential to saves about 30% compute time \o/
238 |             img_h, img_w = self.obs.shape[1], self.obs.shape[2]
239 |             return self.obs[start_idx:end_idx].transpose(1, 2, 0, 3).reshape(img_h, img_w, -1)
240 | 
241 |     def store_frame(self, frame):
242 |         """Store a single frame in the buffer at the next available index, overwriting
243 |         old frames if necessary.
244 | 
245 |         Parameters
246 |         ----------
247 |         frame: np.array
248 |             Array of shape (img_h, img_w, img_c) and dtype np.uint8
249 |             the frame to be stored
250 | 
251 |         Returns
252 |         -------
253 |         idx: int
254 |             Index at which the frame is stored. To be used for `store_effect` later.
255 |         """
256 |         if self.obs is None:
257 |             self.obs      = np.empty([self.size] + list(frame.shape), dtype=np.float32 if self.lander else np.uint8)
258 |             self.action   = np.empty([self.size],                     dtype=np.int32)
259 |             self.reward   = np.empty([self.size],                     dtype=np.float32)
260 |             self.done     = np.empty([self.size],                     dtype=np.bool)
261 |         self.obs[self.next_idx] = frame
262 | 
263 |         ret = self.next_idx
264 |         self.next_idx = (self.next_idx + 1) % self.size
265 |         self.num_in_buffer = min(self.size, self.num_in_buffer + 1)
266 | 
267 |         return ret
268 | 
269 |     def store_effect(self, idx, action, reward, done):
270 |         """Store effects of action taken after obeserving frame stored
271 |         at index idx. The reason `store_frame` and `store_effect` is broken
272 |         up into two functions is so that once can call `encode_recent_observation`
273 |         in between.
274 | 
275 |         Paramters
276 |         ---------
277 |         idx: int
278 |             Index in buffer of recently observed frame (returned by `store_frame`).
279 |         action: int
280 |             Action that was performed upon observing this frame.
281 |         reward: float
282 |             Reward that was received when the actions was performed.
283 |         done: bool
284 |             True if episode was finished after performing that action.
285 |         """
286 |         self.action[idx] = action
287 |         self.reward[idx] = reward
288 |         self.done[idx]   = done
289 | 
290 | 


--------------------------------------------------------------------------------
/hw3/hw3_instructions.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/KuNyaa/berkeleydeeprlcourse-homework-pytorch/b7cb9fb3479b94c4e31fca32b55f7ce2586cc81d/hw3/hw3_instructions.pdf


--------------------------------------------------------------------------------
/hw3/logz.py:
--------------------------------------------------------------------------------
  1 | import json
  2 | 
  3 | """
  4 | 
  5 | Some simple logging functionality, inspired by rllab's logging.
  6 | Assumes that each diagnostic gets logged each iteration
  7 | 
  8 | Call logz.configure_output_dir() to start logging to a 
  9 | tab-separated-values file (some_folder_name/log.txt)
 10 | 
 11 | To load the learning curves, you can do, for example
 12 | 
 13 | A = np.genfromtxt('/tmp/expt_1468984536/log.txt',delimiter='\t',dtype=None, names=True)
 14 | A['EpRewMean']
 15 | 
 16 | """
 17 | 
 18 | import os.path as osp, shutil, time, atexit, os, subprocess
 19 | import pickle
 20 | import torch
 21 | 
 22 | color2num = dict(
 23 |     gray=30,
 24 |     red=31,
 25 |     green=32,
 26 |     yellow=33,
 27 |     blue=34,
 28 |     magenta=35,
 29 |     cyan=36,
 30 |     white=37,
 31 |     crimson=38
 32 | )
 33 | 
 34 | def colorize(string, color, bold=False, highlight=False):
 35 |     attr = []
 36 |     num = color2num[color]
 37 |     if highlight: num += 10
 38 |     attr.append(str(num))
 39 |     if bold: attr.append('1')
 40 |     return '\x1b[%sm%s\x1b[0m' % (';'.join(attr), string)
 41 | 
 42 | class G:
 43 |     output_dir = None
 44 |     output_file = None
 45 |     first_row = True
 46 |     log_headers = []
 47 |     log_current_row = {}
 48 | 
 49 | def configure_output_dir(d=None):
 50 |     """
 51 |     Set output directory to d, or to /tmp/somerandomnumber if d is None
 52 |     """
 53 |     G.output_dir = d or "/tmp/experiments/%i"%int(time.time())
 54 |     assert not osp.exists(G.output_dir), "Log dir %s already exists! Delete it first or use a different dir"%G.output_dir
 55 |     os.makedirs(G.output_dir)
 56 |     G.output_file = open(osp.join(G.output_dir, "log.txt"), 'w')
 57 |     atexit.register(G.output_file.close)
 58 |     print(colorize("Logging data to %s"%G.output_file.name, 'green', bold=True))
 59 | 
 60 | def log_tabular(key, val):
 61 |     """
 62 |     Log a value of some diagnostic
 63 |     Call this once for each diagnostic quantity, each iteration
 64 |     """
 65 |     if G.first_row:
 66 |         G.log_headers.append(key)
 67 |     else:
 68 |         assert key in G.log_headers, "Trying to introduce a new key %s that you didn't include in the first iteration"%key
 69 |     assert key not in G.log_current_row, "You already set %s this iteration. Maybe you forgot to call dump_tabular()"%key
 70 |     G.log_current_row[key] = val
 71 | 
 72 | def save_hyperparams(params):
 73 |     with open(osp.join(G.output_dir, "hyperparams.json"), 'w') as out:
 74 |         out.write(json.dumps(params, separators=(',\n','\t:\t'), sort_keys=True))
 75 | 
 76 | def save_pytorch_model(model):  
 77 |     """
 78 |     Saves the entire pytorch Module 
 79 |     """
 80 |     torch.save(model, osp.join(G.output_dir, "model.pkl"))
 81 |     
 82 | 
 83 | def dump_tabular():
 84 |     """
 85 |     Write all of the diagnostics from the current iteration
 86 |     """
 87 |     vals = []
 88 |     key_lens = [len(key) for key in G.log_headers]
 89 |     max_key_len = max(15,max(key_lens))
 90 |     keystr = '%'+'%d'%max_key_len
 91 |     fmt = "| " + keystr + "s | %15s |"
 92 |     n_slashes = 22 + max_key_len
 93 |     print("-"*n_slashes)
 94 |     for key in G.log_headers:
 95 |         val = G.log_current_row.get(key, "")
 96 |         if hasattr(val, "__float__"): valstr = "%8.3g"%val
 97 |         else: valstr = val
 98 |         print(fmt%(key, valstr))
 99 |         vals.append(val)
100 |     print("-"*n_slashes)
101 |     if G.output_file is not None:
102 |         if G.first_row:
103 |             G.output_file.write("\t".join(G.log_headers))
104 |             G.output_file.write("\n")
105 |         G.output_file.write("\t".join(map(str,vals)))
106 |         G.output_file.write("\n")
107 |         G.output_file.flush()
108 |     G.log_current_row.clear()
109 |     G.first_row=False
110 | 


--------------------------------------------------------------------------------
/hw3/lunar_lander.py:
--------------------------------------------------------------------------------
  1 | import sys, math
  2 | import numpy as np
  3 | 
  4 | import Box2D
  5 | from Box2D.b2 import (edgeShape, circleShape, fixtureDef, polygonShape, revoluteJointDef, contactListener)
  6 | 
  7 | import gym
  8 | from gym import spaces
  9 | from gym.utils import seeding
 10 | 
 11 | import pyglet
 12 | 
 13 | from copy import copy
 14 | 
 15 | # Rocket trajectory optimization is a classic topic in Optimal Control.
 16 | #
 17 | # According to Pontryagin's maximum principle it's optimal to fire engine full throttle or
 18 | # turn it off. That's the reason this environment is OK to have discreet actions (engine on or off).
 19 | #
 20 | # Landing pad is always at coordinates (0,0). Coordinates are the first two numbers in state vector.
 21 | # Reward for moving from the top of the screen to landing pad and zero speed is about 100..140 points.
 22 | # If lander moves away from landing pad it loses reward back. Episode finishes if the lander crashes or
 23 | # comes to rest, receiving additional -100 or +100 points. Each leg ground contact is +10. Firing main
 24 | # engine is -0.3 points each frame. Solved is 200 points.
 25 | #
 26 | # Landing outside landing pad is possible. Fuel is infinite, so an agent can learn to fly and then land
 27 | # on its first attempt. Please see source code for details.
 28 | #
 29 | # Too see heuristic landing, run:
 30 | #
 31 | # python gym/envs/box2d/lunar_lander.py
 32 | #
 33 | # To play yourself, run:
 34 | #
 35 | # python examples/agents/keyboard_agent.py LunarLander-v0
 36 | #
 37 | # Created by Oleg Klimov. Licensed on the same terms as the rest of OpenAI Gym.
 38 | 
 39 | # Modified by Sid Reddy (sgr@berkeley.edu) on 8/14/18
 40 | #
 41 | # Changelog:
 42 | # - different discretization scheme for actions
 43 | # - different terminal rewards
 44 | # - different observations
 45 | # - randomized landing site
 46 | #
 47 | # A good agent should be able to achieve >150 reward.
 48 | 
 49 | MAX_NUM_STEPS = 1000
 50 | 
 51 | N_OBS_DIM = 9
 52 | N_ACT_DIM = 6 # num discrete actions
 53 | 
 54 | FPS    = 50
 55 | SCALE  = 30.0   # affects how fast-paced the game is, forces should be adjusted as well
 56 | 
 57 | MAIN_ENGINE_POWER  = 13.0
 58 | SIDE_ENGINE_POWER  =  0.6
 59 | 
 60 | INITIAL_RANDOM = 1000.0   # Set 1500 to make game harder
 61 | 
 62 | LANDER_POLY =[
 63 |     (-14,+17), (-17,0), (-17,-10),
 64 |     (+17,-10), (+17,0), (+14,+17)
 65 |     ]
 66 | LEG_AWAY = 20
 67 | LEG_DOWN = 18
 68 | LEG_W, LEG_H = 2, 8
 69 | LEG_SPRING_TORQUE = 40 # 40 is too difficult for human players, 400 a bit easier
 70 | 
 71 | SIDE_ENGINE_HEIGHT = 14.0
 72 | SIDE_ENGINE_AWAY   = 12.0
 73 | 
 74 | VIEWPORT_W = 600
 75 | VIEWPORT_H = 400
 76 | 
 77 | THROTTLE_MAG = 0.75 # discretized 'on' value for thrusters
 78 | NOOP = 1 # don't fire main engine, don't steer
 79 | def disc_to_cont(action): # discrete action -> continuous action
 80 |   if type(action) == np.ndarray:
 81 |     return action
 82 |   # main engine
 83 |   if action < 3:
 84 |     m = -THROTTLE_MAG
 85 |   elif action < 6:
 86 |     m = THROTTLE_MAG
 87 |   else:
 88 |     raise ValueError
 89 |   # steering
 90 |   if action % 3 == 0:
 91 |     s = -THROTTLE_MAG
 92 |   elif action % 3 == 1:
 93 |     s = 0
 94 |   else:
 95 |     s = THROTTLE_MAG
 96 |   return np.array([m, s])
 97 | 
 98 | class ContactDetector(contactListener):
 99 |     def __init__(self, env):
100 |         contactListener.__init__(self)
101 |         self.env = env
102 |     def BeginContact(self, contact):
103 |         if self.env.lander==contact.fixtureA.body or self.env.lander==contact.fixtureB.body:
104 |             self.env.game_over = True
105 |         for i in range(2):
106 |             if self.env.legs[i] in [contact.fixtureA.body, contact.fixtureB.body]:
107 |                 self.env.legs[i].ground_contact = True
108 |     def EndContact(self, contact):
109 |         for i in range(2):
110 |             if self.env.legs[i] in [contact.fixtureA.body, contact.fixtureB.body]:
111 |                 self.env.legs[i].ground_contact = False
112 | 
113 | class LunarLander(gym.Env):
114 |     metadata = {
115 |         'render.modes': ['human', 'rgb_array'],
116 |         'video.frames_per_second' : FPS
117 |     }
118 | 
119 |     continuous = False
120 | 
121 |     def __init__(self):
122 |         self._seed()
123 |         self.viewer = None
124 | 
125 |         self.world = Box2D.b2World()
126 |         self.moon = None
127 |         self.lander = None
128 |         self.particles = []
129 | 
130 |         self.prev_reward = None
131 | 
132 |         high = np.array([np.inf]*N_OBS_DIM)  # useful range is -1 .. +1, but spikes can be higher
133 |         self.observation_space = spaces.Box(-high, high)
134 | 
135 |         self.action_space = spaces.Discrete(N_ACT_DIM)
136 | 
137 |         self.curr_step = None
138 | 
139 |         self._reset()
140 | 
141 |     def _seed(self, seed=None):
142 |         self.np_random, seed = seeding.np_random(seed)
143 |         return [seed]
144 | 
145 |     def _destroy(self):
146 |         if not self.moon: return
147 |         self.world.contactListener = None
148 |         self._clean_particles(True)
149 |         self.world.DestroyBody(self.moon)
150 |         self.moon = None
151 |         self.world.DestroyBody(self.lander)
152 |         self.lander = None
153 |         self.world.DestroyBody(self.legs[0])
154 |         self.world.DestroyBody(self.legs[1])
155 | 
156 |     def _reset(self):
157 |         self.curr_step = 0
158 | 
159 |         self._destroy()
160 |         self.world.contactListener_keepref = ContactDetector(self)
161 |         self.world.contactListener = self.world.contactListener_keepref
162 |         self.game_over = False
163 |         self.prev_shaping = None
164 | 
165 |         W = VIEWPORT_W/SCALE
166 |         H = VIEWPORT_H/SCALE
167 | 
168 |         # terrain
169 |         CHUNKS = 11
170 |         height = self.np_random.uniform(0, H/2, size=(CHUNKS+1,) )
171 |         chunk_x  = [W/(CHUNKS-1)*i for i in range(CHUNKS)]
172 | 
173 |         # randomize helipad x-coord
174 |         helipad_chunk = np.random.choice(range(1, CHUNKS-1))
175 | 
176 |         self.helipad_x1 = chunk_x[helipad_chunk-1]
177 |         self.helipad_x2 = chunk_x[helipad_chunk+1]
178 |         self.helipad_y  = H/4
179 |         height[helipad_chunk-2] = self.helipad_y
180 |         height[helipad_chunk-1] = self.helipad_y
181 |         height[helipad_chunk+0] = self.helipad_y
182 |         height[helipad_chunk+1] = self.helipad_y
183 |         height[helipad_chunk+2] = self.helipad_y
184 |         smooth_y = [0.33*(height[i-1] + height[i+0] + height[i+1]) for i in range(CHUNKS)]
185 | 
186 |         self.moon = self.world.CreateStaticBody( shapes=edgeShape(vertices=[(0, 0), (W, 0)]) )
187 |         self.sky_polys = []
188 |         for i in range(CHUNKS-1):
189 |             p1 = (chunk_x[i],   smooth_y[i])
190 |             p2 = (chunk_x[i+1], smooth_y[i+1])
191 |             self.moon.CreateEdgeFixture(
192 |                 vertices=[p1,p2],
193 |                 density=0,
194 |                 friction=0.1)
195 |             self.sky_polys.append( [p1, p2, (p2[0],H), (p1[0],H)] )
196 | 
197 |         self.moon.color1 = (0.0,0.0,0.0)
198 |         self.moon.color2 = (0.0,0.0,0.0)
199 | 
200 |         initial_y = VIEWPORT_H/SCALE#*0.75
201 |         self.lander = self.world.CreateDynamicBody(
202 |             position = (VIEWPORT_W/SCALE/2, initial_y),
203 |             angle=0.0,
204 |             fixtures = fixtureDef(
205 |                 shape=polygonShape(vertices=[ (x/SCALE,y/SCALE) for x,y in LANDER_POLY ]),
206 |                 density=5.0,
207 |                 friction=0.1,
208 |                 categoryBits=0x0010,
209 |                 maskBits=0x001,  # collide only with ground
210 |                 restitution=0.0) # 0.99 bouncy
211 |                 )
212 |         self.lander.color1 = (0.5,0.4,0.9)
213 |         self.lander.color2 = (0.3,0.3,0.5)
214 |         self.lander.ApplyForceToCenter( (
215 |             self.np_random.uniform(-INITIAL_RANDOM, INITIAL_RANDOM),
216 |             self.np_random.uniform(-INITIAL_RANDOM, INITIAL_RANDOM)
217 |             ), True)
218 | 
219 |         self.legs = []
220 |         for i in [-1,+1]:
221 |             leg = self.world.CreateDynamicBody(
222 |                 position = (VIEWPORT_W/SCALE/2 - i*LEG_AWAY/SCALE, initial_y),
223 |                 angle = (i*0.05),
224 |                 fixtures = fixtureDef(
225 |                     shape=polygonShape(box=(LEG_W/SCALE, LEG_H/SCALE)),
226 |                     density=1.0,
227 |                     restitution=0.0,
228 |                     categoryBits=0x0020,
229 |                     maskBits=0x001)
230 |                 )
231 |             leg.ground_contact = False
232 |             leg.color1 = (0.5,0.4,0.9)
233 |             leg.color2 = (0.3,0.3,0.5)
234 |             rjd = revoluteJointDef(
235 |                 bodyA=self.lander,
236 |                 bodyB=leg,
237 |                 localAnchorA=(0, 0),
238 |                 localAnchorB=(i*LEG_AWAY/SCALE, LEG_DOWN/SCALE),
239 |                 enableMotor=True,
240 |                 enableLimit=True,
241 |                 maxMotorTorque=LEG_SPRING_TORQUE,
242 |                 motorSpeed=+0.3*i  # low enough not to jump back into the sky
243 |                 )
244 |             if i==-1:
245 |                 rjd.lowerAngle = +0.9 - 0.5  # Yes, the most esoteric numbers here, angles legs have freedom to travel within
246 |                 rjd.upperAngle = +0.9
247 |             else:
248 |                 rjd.lowerAngle = -0.9
249 |                 rjd.upperAngle = -0.9 + 0.5
250 |             leg.joint = self.world.CreateJoint(rjd)
251 |             self.legs.append(leg)
252 | 
253 |         self.drawlist = [self.lander] + self.legs
254 | 
255 |         return self._step(NOOP)[0]
256 | 
257 |     def _create_particle(self, mass, x, y, ttl):
258 |         p = self.world.CreateDynamicBody(
259 |             position = (x,y),
260 |             angle=0.0,
261 |             fixtures = fixtureDef(
262 |                 shape=circleShape(radius=2/SCALE, pos=(0,0)),
263 |                 density=mass,
264 |                 friction=0.1,
265 |                 categoryBits=0x0100,
266 |                 maskBits=0x001,  # collide only with ground
267 |                 restitution=0.3)
268 |                 )
269 |         p.ttl = ttl
270 |         self.particles.append(p)
271 |         self._clean_particles(False)
272 |         return p
273 | 
274 |     def _clean_particles(self, all):
275 |         while self.particles and (all or self.particles[0].ttl<0):
276 |             self.world.DestroyBody(self.particles.pop(0))
277 | 
278 |     def _step(self, action):
279 |         assert self.action_space.contains(action), "%r (%s) invalid " % (action,type(action))
280 |         action = disc_to_cont(action)
281 | 
282 |         # Engines
283 |         tip  = (math.sin(self.lander.angle), math.cos(self.lander.angle))
284 |         side = (-tip[1], tip[0]);
285 |         dispersion = [self.np_random.uniform(-1.0, +1.0) / SCALE for _ in range(2)]
286 | 
287 |         m_power = 0.0
288 |         if action[0] > 0.0:
289 |             # Main engine
290 |             m_power = (np.clip(action[0], 0.0,1.0) + 1.0)*0.5   # 0.5..1.0
291 |             assert m_power>=0.5 and m_power <= 1.0
292 |             ox =  tip[0]*(4/SCALE + 2*dispersion[0]) + side[0]*dispersion[1]   # 4 is move a bit downwards, +-2 for randomness
293 |             oy = -tip[1]*(4/SCALE + 2*dispersion[0]) - side[1]*dispersion[1]
294 |             impulse_pos = (self.lander.position[0] + ox, self.lander.position[1] + oy)
295 |             p = self._create_particle(3.5, impulse_pos[0], impulse_pos[1], m_power)    # particles are just a decoration, 3.5 is here to make particle speed adequate
296 |             p.ApplyLinearImpulse(           ( ox*MAIN_ENGINE_POWER*m_power,  oy*MAIN_ENGINE_POWER*m_power), impulse_pos, True)
297 |             self.lander.ApplyLinearImpulse( (-ox*MAIN_ENGINE_POWER*m_power, -oy*MAIN_ENGINE_POWER*m_power), impulse_pos, True)
298 | 
299 |         s_power = 0.0
300 |         if np.abs(action[1]) > 0.5:
301 |             # Orientation engines
302 |             direction = np.sign(action[1])
303 |             s_power = np.clip(np.abs(action[1]), 0.5,1.0)
304 |             assert s_power>=0.5 and s_power <= 1.0
305 |             ox =  tip[0]*dispersion[0] + side[0]*(3*dispersion[1]+direction*SIDE_ENGINE_AWAY/SCALE)
306 |             oy = -tip[1]*dispersion[0] - side[1]*(3*dispersion[1]+direction*SIDE_ENGINE_AWAY/SCALE)
307 |             impulse_pos = (self.lander.position[0] + ox - tip[0]*17/SCALE, self.lander.position[1] + oy + tip[1]*SIDE_ENGINE_HEIGHT/SCALE)
308 |             p = self._create_particle(0.7, impulse_pos[0], impulse_pos[1], s_power)
309 |             p.ApplyLinearImpulse(           ( ox*SIDE_ENGINE_POWER*s_power,  oy*SIDE_ENGINE_POWER*s_power), impulse_pos, True)
310 |             self.lander.ApplyLinearImpulse( (-ox*SIDE_ENGINE_POWER*s_power, -oy*SIDE_ENGINE_POWER*s_power), impulse_pos, True)
311 | 
312 |         # perform normal update
313 |         self.world.Step(1.0/FPS, 6*30, 2*30)
314 | 
315 |         pos = self.lander.position
316 |         vel = self.lander.linearVelocity
317 |         helipad_x = (self.helipad_x1 + self.helipad_x2) / 2
318 |         state = [
319 |             (pos.x - VIEWPORT_W/SCALE/2) / (VIEWPORT_W/SCALE/2),
320 |             (pos.y - (self.helipad_y+LEG_DOWN/SCALE)) / (VIEWPORT_W/SCALE/2),
321 |             vel.x*(VIEWPORT_W/SCALE/2)/FPS,
322 |             vel.y*(VIEWPORT_H/SCALE/2)/FPS,
323 |             self.lander.angle,
324 |             20.0*self.lander.angularVelocity/FPS,
325 |             1.0 if self.legs[0].ground_contact else 0.0,
326 |             1.0 if self.legs[1].ground_contact else 0.0,
327 |             (helipad_x - VIEWPORT_W/SCALE/2) / (VIEWPORT_W/SCALE/2)
328 |             ]
329 |         assert len(state)==N_OBS_DIM
330 | 
331 |         self.curr_step += 1
332 | 
333 |         reward = 0
334 |         shaping = 0
335 |         dx = (pos.x - helipad_x) / (VIEWPORT_W/SCALE/2)
336 |         shaping += -100*np.sqrt(state[2]*state[2] + state[3]*state[3]) - 100*abs(state[4])
337 |         shaping += -100*np.sqrt(dx*dx + state[1]*state[1]) + 10*state[6] + 10*state[7]
338 |         if self.prev_shaping is not None:
339 |             reward = shaping - self.prev_shaping
340 |         self.prev_shaping = shaping
341 | 
342 |         reward -= m_power*0.30  # less fuel spent is better, about -30 for heurisic landing
343 |         reward -= s_power*0.03
344 | 
345 |         oob = abs(state[0]) >= 1.0
346 |         timeout = self.curr_step >= MAX_NUM_STEPS
347 |         not_awake = not self.lander.awake
348 | 
349 |         at_site = pos.x >= self.helipad_x1 and pos.x <= self.helipad_x2 and state[1] <= 0
350 |         grounded = self.legs[0].ground_contact and self.legs[1].ground_contact
351 |         landed = at_site and grounded
352 | 
353 |         done = self.game_over or oob or not_awake or timeout or landed
354 |         if done:
355 |           if self.game_over or oob:
356 |             reward = -100
357 |             self.lander.color1 = (255,0,0)
358 |           elif at_site:
359 |             reward = +100
360 |             self.lander.color1 = (0,255,0)
361 |           elif timeout:
362 |             self.lander.color1 = (255,0,0)
363 |         info = {}
364 | 
365 |         return np.array(state), reward, done, info
366 | 
367 |     def _render(self, mode='human', close=False):
368 |         if close:
369 |             if self.viewer is not None:
370 |                 self.viewer.close()
371 |                 self.viewer = None
372 |             return
373 | 
374 |         from gym.envs.classic_control import rendering
375 |         if self.viewer is None:
376 |             self.viewer = rendering.Viewer(VIEWPORT_W, VIEWPORT_H)
377 |             self.viewer.set_bounds(0, VIEWPORT_W/SCALE, 0, VIEWPORT_H/SCALE)
378 | 
379 |         for obj in self.particles:
380 |             obj.ttl -= 0.15
381 |             obj.color1 = (max(0.2,0.2+obj.ttl), max(0.2,0.5*obj.ttl), max(0.2,0.5*obj.ttl))
382 |             obj.color2 = (max(0.2,0.2+obj.ttl), max(0.2,0.5*obj.ttl), max(0.2,0.5*obj.ttl))
383 | 
384 |         self._clean_particles(False)
385 | 
386 |         for p in self.sky_polys:
387 |             self.viewer.draw_polygon(p, color=(0,0,0))
388 | 
389 |         for obj in self.particles + self.drawlist:
390 |             for f in obj.fixtures:
391 |                 trans = f.body.transform
392 |                 if type(f.shape) is circleShape:
393 |                     t = rendering.Transform(translation=trans*f.shape.pos)
394 |                     self.viewer.draw_circle(f.shape.radius, 20, color=obj.color1).add_attr(t)
395 |                     self.viewer.draw_circle(f.shape.radius, 20, color=obj.color2, filled=False, linewidth=2).add_attr(t)
396 |                 else:
397 |                     path = [trans*v for v in f.shape.vertices]
398 |                     self.viewer.draw_polygon(path, color=obj.color1)
399 |                     path.append(path[0])
400 |                     self.viewer.draw_polyline(path, color=obj.color2, linewidth=2)
401 | 
402 |         for x in [self.helipad_x1, self.helipad_x2]:
403 |             flagy1 = self.helipad_y
404 |             flagy2 = flagy1 + 50/SCALE
405 |             self.viewer.draw_polyline( [(x, flagy1), (x, flagy2)], color=(1,1,1) )
406 |             self.viewer.draw_polygon( [(x, flagy2), (x, flagy2-10/SCALE), (x+25/SCALE, flagy2-5/SCALE)], color=(0.8,0.8,0) )
407 | 
408 |         clock_prog = self.curr_step / MAX_NUM_STEPS
409 |         self.viewer.draw_polyline( [(0, 0.05*VIEWPORT_H/SCALE), (clock_prog*VIEWPORT_W/SCALE, 0.05*VIEWPORT_H/SCALE)], color=(255,0,0), linewidth=5  )
410 | 
411 |         return self.viewer.render(return_rgb_array = mode=='rgb_array')
412 | 
413 |     def reset(self):
414 |         return self._reset()
415 | 
416 |     def step(self, *args, **kwargs):
417 |         return self._step(*args, **kwargs)
418 | 
419 | 
420 | class LunarLanderContinuous(LunarLander):
421 |    continuous = True
422 | 
423 | def heuristic(env, s):
424 |     # Heuristic for:
425 |     # 1. Testing.
426 |     # 2. Demonstration rollout.
427 |     angle_targ = s[0]*0.5 + s[2]*1.0         # angle should point towards center (s[0] is horizontal coordinate, s[2] hor speed)
428 |     if angle_targ >  0.4: angle_targ =  0.4  # more than 0.4 radians (22 degrees) is bad
429 |     if angle_targ < -0.4: angle_targ = -0.4
430 |     hover_targ = 0.55*np.abs(s[0])           # target y should be proporional to horizontal offset
431 | 
432 |     # PID controller: s[4] angle, s[5] angularSpeed
433 |     angle_todo = (angle_targ - s[4])*0.5 - (s[5])*1.0
434 |     #print("angle_targ=%0.2f, angle_todo=%0.2f" % (angle_targ, angle_todo))
435 | 
436 |     # PID controller: s[1] vertical coordinate s[3] vertical speed
437 |     hover_todo = (hover_targ - s[1])*0.5 - (s[3])*0.5
438 |     #print("hover_targ=%0.2f, hover_todo=%0.2f" % (hover_targ, hover_todo))
439 | 
440 |     if s[6] or s[7]: # legs have contact
441 |         angle_todo = 0
442 |         hover_todo = -(s[3])*0.5  # override to reduce fall speed, that's all we need after contact
443 | 
444 |     a = np.array( [hover_todo*20 - 1, -angle_todo*20] )
445 |     a = np.clip(a, -1, +1)
446 |     return a
447 | 
448 | if __name__=="__main__":
449 |     #env = LunarLander()
450 |     env = LunarLanderContinuous()
451 |     s = env.reset()
452 |     total_reward = 0
453 |     steps = 0
454 |     while True:
455 |         a = heuristic(env, s)
456 |         s, r, done, info = env.step(a)
457 |         env.render()
458 |         total_reward += r
459 |         if steps % 20 == 0 or done:
460 |             print(["{:+0.2f}".format(x) for x in s])
461 |             print("step {} total_reward {:+0.2f}".format(steps, total_reward))
462 |         steps += 1
463 |         if done: break
464 | 


--------------------------------------------------------------------------------
/hw3/plot.py:
--------------------------------------------------------------------------------
  1 | import seaborn as sns
  2 | import pandas as pd
  3 | import matplotlib.pyplot as plt
  4 | import json
  5 | import os
  6 | 
  7 | """
  8 | Using the plotter:
  9 | 
 10 | Call it from the command line, and supply it with logdirs to experiments.
 11 | Suppose you ran an experiment with name 'test', and you ran 'test' for 10 
 12 | random seeds. The runner code stored it in the directory structure
 13 | 
 14 |     data
 15 |     L test_EnvName_DateTime
 16 |       L  0
 17 |         L log.txt
 18 |         L params.json
 19 |       L  1
 20 |         L log.txt
 21 |         L params.json
 22 |        .
 23 |        .
 24 |        .
 25 |       L  9
 26 |         L log.txt
 27 |         L params.json
 28 | 
 29 | To plot learning curves from the experiment, averaged over all random
 30 | seeds, call
 31 | 
 32 |     python plot.py data/test_EnvName_DateTime --value AverageReturn
 33 | 
 34 | and voila. To see a different statistics, change what you put in for
 35 | the keyword --value. You can also enter /multiple/ values, and it will 
 36 | make all of them in order.
 37 | 
 38 | 
 39 | Suppose you ran two experiments: 'test1' and 'test2'. In 'test2' you tried
 40 | a different set of hyperparameters from 'test1', and now you would like 
 41 | to compare them -- see their learning curves side-by-side. Just call
 42 | 
 43 |     python plot.py data/test1 data/test2
 44 | 
 45 | and it will plot them both! They will be given titles in the legend according
 46 | to their exp_name parameters. If you want to use custom legend titles, use
 47 | the --legend flag and then provide a title for each logdir.
 48 | 
 49 | """
 50 | 
 51 | def plot_data(data, time="Iteration", value="AverageReturn", combine=False):
 52 |     if isinstance(data, list):
 53 |         data = pd.concat(data, ignore_index=True)
 54 |     plt.figure(figsize=(16, 9))
 55 |     sns.set(style="darkgrid", font_scale=1.5)
 56 |     if not combine:
 57 |         sns.tsplot(data=data, time=time, value=value, unit="Unit", condition="Condition")
 58 |     else:
 59 |         df1 = data.loc[:, [time, value[0], 'Condition']]
 60 |         df1['Statistics'] = value[0]
 61 |         df1.rename(columns={value[0]:'Value', 'Condition':'ExpName'}, inplace = True)
 62 |         df2 = data.loc[:, [time, value[1], 'Condition']]
 63 |         df2['Statistics'] = value[1]
 64 |         df2.rename(columns={value[1]:'Value', 'Condition':'ExpName'}, inplace = True)
 65 |         data = pd.concat([df1, df2], ignore_index=True)
 66 |         sns.lineplot(x=time, y='Value', hue='ExpName', style='Statistics', data=data)
 67 |         
 68 |     plt.legend(loc='best').draggable()
 69 |     plt.savefig('result.png', bbox_inches='tight')
 70 |     plt.show()
 71 | 
 72 | 
 73 | def get_datasets(fpath, condition=None):
 74 |     unit = 0
 75 |     datasets = []
 76 |     for root, dir, files in os.walk(fpath):
 77 |         if 'log.txt' in files:
 78 |             param_path = open(os.path.join(root,'hyperparams.json'))
 79 |             params = json.load(param_path)
 80 |             exp_name = params['exp_name']
 81 |             
 82 |             log_path = os.path.join(root,'log.txt')
 83 |             experiment_data = pd.read_table(log_path)
 84 | 
 85 |             experiment_data.insert(
 86 |                 len(experiment_data.columns),
 87 |                 'Unit',
 88 |                 unit
 89 |                 )        
 90 |             experiment_data.insert(
 91 |                 len(experiment_data.columns),
 92 |                 'Condition',
 93 |                 condition or exp_name
 94 |                 )
 95 | 
 96 |             datasets.append(experiment_data)
 97 |             unit += 1
 98 | 
 99 |     return datasets
100 | 
101 | 
102 | def main():
103 |     import argparse
104 |     parser = argparse.ArgumentParser()
105 |     parser.add_argument('logdir', nargs='*')
106 |     parser.add_argument('--legend', nargs='*')
107 |     parser.add_argument('--time', type=str, default='Iteration')
108 |     parser.add_argument('--value', default='AverageReturn', nargs='*')
109 |     parser.add_argument('--combine', action='store_true')
110 |     args = parser.parse_args()
111 | 
112 |     use_legend = False
113 |     if args.legend is not None:
114 |         assert len(args.legend) == len(args.logdir), \
115 |             "Must give a legend title for each set of experiments."
116 |         use_legend = True
117 | 
118 |     data = []
119 |     if use_legend:
120 |         for logdir, legend_title in zip(args.logdir, args.legend):
121 |             data += get_datasets(logdir, legend_title)
122 |     else:
123 |         for logdir in args.logdir:
124 |             data += get_datasets(logdir)
125 | 
126 |     time = args.time
127 |             
128 |     if isinstance(args.value, list):
129 |         values = args.value
130 |     else:
131 |         values = [args.value]
132 | 
133 |     if args.combine and len(values) == 2:
134 |         plot_data(data, time=time, value=values, combine=True)
135 |     else:
136 |         for value in values:
137 |             plot_data(data, time=time, value=value, combine=False)
138 | 
139 | if __name__ == "__main__":
140 |     main()
141 | 


--------------------------------------------------------------------------------
/hw3/requirements.txt:
--------------------------------------------------------------------------------
1 | gym==0.10.5
2 | gym[atari]
3 | box2d
4 | mujoco-py==1.50.1.56
5 | torch==0.4.0
6 | numpy
7 | seaborn
8 | opencv-python
9 | 


--------------------------------------------------------------------------------
/hw3/run_dqn_atari.py:
--------------------------------------------------------------------------------
  1 | import gym
  2 | from gym import wrappers
  3 | import time
  4 | import logz
  5 | import os.path as osp
  6 | import random
  7 | import numpy as np
  8 | import torch
  9 | from torch import nn
 10 | 
 11 | import dqn
 12 | from dqn_utils import PiecewiseSchedule, get_wrapper_by_name
 13 | from atari_wrappers import wrap_deepmind
 14 | 
 15 | def weights_init(m):
 16 |     if hasattr(m, 'weight'):
 17 |         nn.init.xavier_normal_(m.weight)
 18 |     if hasattr(m, 'bias'):
 19 |         nn.init.constant_(m.bias, 0)
 20 | 
 21 | class DQN(nn.Module): # for atari
 22 |     def __init__(self, in_channels, num_actions):
 23 |         # as described in https://storage.googleapis.com/deepmind-data/assets/papers/DeepMindNature14236Paper.pdf
 24 |         super(DQN, self).__init__()
 25 |         self.convnet = nn.Sequential(
 26 |             nn.Conv2d(in_channels, out_channels=32, kernel_size=8, stride=4),
 27 |             nn.ReLU(True),
 28 |             nn.Conv2d(in_channels=32, out_channels=64, kernel_size=4, stride=2),
 29 |             nn.ReLU(True),
 30 |             nn.Conv2d(in_channels=64, out_channels=64, kernel_size=3, stride=1),
 31 |             nn.ReLU(True),
 32 |         )
 33 |         self.classifier = nn.Sequential(
 34 |             nn.Linear(in_features=7 * 7 * 64, out_features=512),
 35 |             nn.ReLU(True),
 36 |             nn.Linear(in_features=512, out_features=num_actions),
 37 |         )
 38 | 
 39 |         self.apply(weights_init)
 40 | 
 41 |     def forward(self, obs):
 42 |         out = obs.float() / 255 # convert 8-bits RGB color to float in [0, 1]
 43 |         out = out.permute(0, 3, 1, 2) # reshape to [batch_size, img_c * frames, img_h, img_w]
 44 |         out = self.convnet(out)
 45 |         out = out.view(out.size(0), -1) # flatten feature maps to a big vector
 46 |         out = self.classifier(out)
 47 |         return out
 48 | 
 49 | def atari_learn(env,
 50 |                 num_timesteps):
 51 |     # This is just a rough estimate
 52 |     num_iterations = float(num_timesteps) / 4.0
 53 | 
 54 |     lr_multiplier = 1.0
 55 |     lr_schedule = PiecewiseSchedule(
 56 |         [
 57 |             (0,                   1e-4 * lr_multiplier),
 58 |             (num_iterations / 10, 1e-4 * lr_multiplier),
 59 |             (num_iterations / 2,  5e-5 * lr_multiplier),
 60 |         ],
 61 |         outside_value=5e-5 * lr_multiplier
 62 |     )
 63 |     lr_lambda = lambda t: lr_schedule.value(t)
 64 | 
 65 |     optimizer = dqn.OptimizerSpec(
 66 |         constructor=torch.optim.Adam,
 67 |         kwargs=dict(eps=1e-4),
 68 |         lr_lambda=lr_lambda
 69 |     )
 70 | 
 71 |     def stopping_criterion(env, t):
 72 |         # notice that here t is the number of steps of the wrapped env,
 73 |         # which is different from the number of steps in the underlying env
 74 |         return get_wrapper_by_name(env, "Monitor").get_total_steps() >= num_timesteps
 75 | 
 76 |     exploration_schedule = PiecewiseSchedule(
 77 |         [
 78 |             (0, 1.0),
 79 |             (1e6, 0.1),
 80 |             (num_iterations / 2, 0.01),
 81 |         ],
 82 |         outside_value=0.01
 83 |     )
 84 | 
 85 |     dqn.learn(
 86 |         env=env,
 87 |         q_func=DQN,
 88 |         optimizer_spec=optimizer,
 89 |         exploration=exploration_schedule,
 90 |         stopping_criterion=stopping_criterion,
 91 |         replay_buffer_size=1000000,
 92 |         batch_size=32,
 93 |         gamma=0.99,
 94 |         learning_starts=50000,
 95 |         learning_freq=4,
 96 |         frame_history_len=4,
 97 |         target_update_freq=10000,
 98 |         grad_norm_clipping=10,
 99 |         double_q=True
100 |     )
101 |     env.close()
102 | 
103 | def set_global_seeds(i):
104 |     torch.manual_seed(i)
105 |     if torch.cuda.is_available:
106 |         torch.cuda.manual_seed(i)
107 |     np.random.seed(i)
108 |     random.seed(i)
109 | 
110 | def get_env(env_name, exp_name, seed):
111 |     env = gym.make(env_name)
112 | 
113 |     set_global_seeds(seed)
114 |     env.seed(seed)
115 |     
116 |     # Set Up Logger
117 |     logdir = 'dqn_' + exp_name + '_' + env_name + '_' + time.strftime("%d-%m-%Y_%H-%M-%S")
118 |     logdir = osp.join('data', logdir)
119 |     logdir = osp.join(logdir, '%d'%seed)
120 |     logz.configure_output_dir(logdir)
121 |     hyperparams = {'exp_name': exp_name, 'env_name': env_name}
122 |     logz.save_hyperparams(hyperparams)
123 | 
124 |     expt_dir = '/tmp/hw3_vid_dir2/'
125 |     env = wrappers.Monitor(env, osp.join(expt_dir, "gym"), force=True)
126 |     env = wrap_deepmind(env)
127 | 
128 |     return env
129 | 
130 | def main():
131 |     # Choose Atari games.
132 |     env_name = 'PongNoFrameskip-v4'
133 |     exp_name = 'Pong_double_dqn' # you can use it to mark different experiments
134 | 
135 |     # Run training
136 |     seed = random.randint(0, 9999)
137 |     print('random seed = %d' % seed)
138 |     env = get_env(env_name, exp_name, seed)
139 |     atari_learn(env, num_timesteps=2e8)
140 | 
141 | if __name__ == "__main__":
142 |     main()
143 | 


--------------------------------------------------------------------------------
/hw3/run_dqn_lander.py:
--------------------------------------------------------------------------------
  1 | import gym
  2 | from gym import wrappers
  3 | import time
  4 | import logz
  5 | import os.path as osp
  6 | import random
  7 | import numpy as np
  8 | import torch
  9 | from torch import nn
 10 | 
 11 | import dqn
 12 | from dqn_utils import ConstantSchedule, PiecewiseSchedule, get_wrapper_by_name
 13 | 
 14 | 
 15 | def weights_init(m):
 16 |     if hasattr(m, 'weight'):
 17 |         nn.init.orthogonal_(m.weight)
 18 |     if hasattr(m, 'bias'):
 19 |         nn.init.constant_(m.bias, 0)
 20 | 
 21 | class DQN(nn.Module): # for lunar lander
 22 |     def __init__(self, in_features, num_actions):
 23 |         super(DQN, self).__init__()
 24 |         self.classifier = nn.Sequential(
 25 |             nn.Linear(in_features, out_features=64),
 26 |             nn.ReLU(True),
 27 |             nn.Linear(in_features=64, out_features=64),
 28 |             nn.ReLU(True),
 29 |             nn.Linear(in_features=64, out_features=num_actions),
 30 |         )
 31 | 
 32 |         self.apply(weights_init)
 33 | 
 34 |     def forward(self, obs):
 35 |         out = self.classifier(obs)
 36 |         return out
 37 | 
 38 | def lander_optimizer():
 39 |     lr_schedule = ConstantSchedule(1e-3)
 40 |     lr_lambda = lambda t: lr_schedule.value(t)
 41 |     return dqn.OptimizerSpec(
 42 |         constructor=torch.optim.Adam,
 43 |         lr_lambda=lr_lambda,
 44 |         kwargs={}
 45 |     )
 46 | 
 47 | def lander_stopping_criterion(num_timesteps):
 48 |     def stopping_criterion(env, t):
 49 |         # notice that here t is the number of steps of the wrapped env,
 50 |         # which is different from the number of steps in the underlying env
 51 |         return get_wrapper_by_name(env, "Monitor").get_total_steps() >= num_timesteps
 52 |     return stopping_criterion
 53 | 
 54 | def lander_exploration_schedule(num_timesteps):
 55 |     return PiecewiseSchedule(
 56 |         [
 57 |             (0, 1),
 58 |             (num_timesteps * 0.1, 0.02),
 59 |         ], outside_value=0.02
 60 |     )
 61 | 
 62 | def lander_kwargs():
 63 |     return {
 64 |         'optimizer_spec': lander_optimizer(),
 65 |         'q_func': DQN,
 66 |         'replay_buffer_size': 50000,
 67 |         'batch_size': 32,
 68 |         'gamma': 1.00,
 69 |         'learning_starts': 1000,
 70 |         'learning_freq': 1,
 71 |         'frame_history_len': 1,
 72 |         'target_update_freq': 3000,
 73 |         'grad_norm_clipping': 10,
 74 |         'lander': True
 75 |     }
 76 | 
 77 | def lander_learn(env,
 78 |                  num_timesteps):
 79 | 
 80 |     optimizer = lander_optimizer()
 81 |     stopping_criterion = lander_stopping_criterion(num_timesteps)
 82 |     exploration_schedule = lander_exploration_schedule(num_timesteps)
 83 | 
 84 |     dqn.learn(
 85 |         env=env,
 86 |         exploration=lander_exploration_schedule(num_timesteps),
 87 |         stopping_criterion=lander_stopping_criterion(num_timesteps),
 88 |         double_q=True,
 89 |         **lander_kwargs()
 90 |     )
 91 |     env.close()
 92 | 
 93 | def set_global_seeds(i):
 94 |     torch.manual_seed(i)
 95 |     if torch.cuda.is_available:
 96 |         torch.cuda.manual_seed(i)
 97 |     np.random.seed(i)
 98 |     random.seed(i)
 99 | 
100 | def get_env(env_name, exp_name, seed):
101 |     env = gym.make(env_name)
102 | 
103 |     set_global_seeds(seed)
104 |     env.seed(seed)
105 | 
106 |     # Set Up Logger
107 |     logdir = 'dqn_' + exp_name + '_' + env_name + '_' + time.strftime("%d-%m-%Y_%H-%M-%S")
108 |     logdir = osp.join('data', logdir)
109 |     logdir = osp.join(logdir, '%d'%seed)
110 |     logz.configure_output_dir(logdir)
111 |     hyperparams = {'exp_name': exp_name, 'env_name': env_name}
112 |     logz.save_hyperparams(hyperparams)
113 | 
114 |     expt_dir = '/tmp/hw3_vid_dir/'
115 |     env = wrappers.Monitor(env, osp.join(expt_dir, "gym"), force=True, video_callable=False)
116 |     
117 | 
118 |     return env
119 | 
120 | def main():
121 |     # Choose Atari games.
122 |     env_name = 'LunarLander-v2'
123 |     exp_name = 'LunarLander_double_dqn' # you can use it to mark different experiments
124 |     
125 |     # Run training
126 |     seed = 4565 # you may want to randomize this
127 |     print('random seed = %d' % seed)
128 |     env = get_env(env_name, exp_name, seed)
129 |     lander_learn(env, num_timesteps=500000)
130 | 
131 | if __name__ == "__main__":
132 |     main()
133 | 


--------------------------------------------------------------------------------
/hw3/run_dqn_ram.py:
--------------------------------------------------------------------------------
  1 | import gym
  2 | from gym import wrappers
  3 | import time
  4 | import logz
  5 | import os.path as osp
  6 | import random
  7 | import numpy as np
  8 | import torch
  9 | from torch import nn
 10 | 
 11 | import dqn
 12 | from dqn_utils import PiecewiseSchedule, get_wrapper_by_name
 13 | from atari_wrappers import wrap_deepmind_ram
 14 | 
 15 | def weights_init(m):
 16 |     if hasattr(m, 'weight'):
 17 |         nn.init.xavier_uniform_(m.weight)
 18 |     if hasattr(m, 'bias'):
 19 |         nn.init.constant_(m.bias, 0)
 20 |         
 21 | class DQN(nn.Module): # for atari ram
 22 |     def __init__(self, in_features, num_actions):
 23 |         super(DQN, self).__init__()
 24 |         self.classifier = nn.Sequential(
 25 |             nn.Linear(in_features, out_features=256),
 26 |             nn.ReLU(True),
 27 |             nn.Linear(in_features=256, out_features=128),
 28 |             nn.ReLU(True),
 29 |             nn.Linear(in_features=128, out_features=64),
 30 |             nn.ReLU(True),
 31 |             nn.Linear(in_features=64, out_features=num_actions),
 32 |         )
 33 | 
 34 |         self.apply(weights_init)
 35 | 
 36 |     def forward(self, obs):
 37 |         out = obs.float() / 255 # convert 8-bits ram state to float in [0, 1]
 38 |         out = self.classifier(out)
 39 |         return out
 40 | 
 41 | def atari_learn(env,
 42 |                 num_timesteps):
 43 |     # This is just a rough estimate
 44 |     num_iterations = float(num_timesteps) / 4.0
 45 | 
 46 |     lr_multiplier = 1.0
 47 |     lr_schedule = PiecewiseSchedule(
 48 |         [
 49 |             (0,                   1e-4 * lr_multiplier),
 50 |             (num_iterations / 10, 1e-4 * lr_multiplier),
 51 |             (num_iterations / 2,  5e-5 * lr_multiplier),
 52 |         ],
 53 |         outside_value=5e-5 * lr_multiplier
 54 |     )
 55 |     lr_lambda = lambda t: lr_schedule.value(t)
 56 | 
 57 |     optimizer = dqn.OptimizerSpec(
 58 |         constructor=torch.optim.Adam,
 59 |         kwargs=dict(eps=1e-4),
 60 |         lr_lambda=lr_lambda
 61 |     )
 62 | 
 63 |     def stopping_criterion(env, t):
 64 |         # notice that here t is the number of steps of the wrapped env,
 65 |         # which is different from the number of steps in the underlying env
 66 |         return get_wrapper_by_name(env, "Monitor").get_total_steps() >= num_timesteps
 67 | 
 68 |     exploration_schedule = PiecewiseSchedule(
 69 |         [
 70 |             (0, 0.2),
 71 |             (1e6, 0.1),
 72 |             (num_iterations / 2, 0.01),
 73 |         ], outside_value=0.01
 74 |     )
 75 | 
 76 |     dqn.learn(
 77 |         env,
 78 |         q_func=DQN,
 79 |         optimizer_spec=optimizer,
 80 |         exploration=exploration_schedule,
 81 |         stopping_criterion=stopping_criterion,
 82 |         replay_buffer_size=1000000,
 83 |         batch_size=32,
 84 |         gamma=0.99,
 85 |         learning_starts=50000,
 86 |         learning_freq=4,
 87 |         frame_history_len=1,
 88 |         target_update_freq=10000,
 89 |         grad_norm_clipping=10
 90 |     )
 91 |     env.close()
 92 | 
 93 | def set_global_seeds(i):
 94 |     torch.manual_seed(i)
 95 |     if torch.cuda.is_available:
 96 |         torch.cuda.manual_seed(i)
 97 |     np.random.seed(i)
 98 |     random.seed(i)
 99 | 
100 | def get_env(env_name, exp_name, seed):
101 |     env = gym.make(env_name)
102 | 
103 |     set_global_seeds(seed)
104 |     env.seed(seed)
105 | 
106 |     # Set Up Logger
107 |     logdir = 'dqn_' + exp_name + '_' + env_name + '_' + time.strftime("%d-%m-%Y_%H-%M-%S")
108 |     logdir = osp.join('data', logdir)
109 |     logdir = osp.join(logdir, '%d'%seed)
110 |     logz.configure_output_dir(logdir)
111 |     hyperparams = {'exp_name': exp_name, 'env_name': env_name}
112 |     logz.save_hyperparams(hyperparams)
113 | 
114 |     expt_dir = '/tmp/hw3_vid_dir/'
115 |     env = wrappers.Monitor(env, osp.join(expt_dir, "gym"), force=True)
116 |     env = wrap_deepmind_ram(env)
117 | 
118 |     return env
119 | 
120 | def main():
121 |     # Choose Atari games.
122 |     env_name = 'Pong-ram-v0'
123 |     exp_name = 'Pong_double_dqn' # you can use it to mark different experiments
124 |     
125 |     # Run training
126 |     seed = 0 # Use a seed of zero (you may want to randomize the seed!)
127 |     print('random seed = %d' % seed)
128 |     env = get_env(env_name, exp_name, seed)
129 |     atari_learn(env, num_timesteps=int(4e7))
130 | 
131 | if __name__ == "__main__":
132 |     main()
133 | 


--------------------------------------------------------------------------------
/hw3/train_ac_f18.py:
--------------------------------------------------------------------------------
  1 | """
  2 | Original code from John Schulman for CS294 Deep Reinforcement Learning Spring 2017
  3 | Adapted for CS294-112 Fall 2017 by Abhishek Gupta and Joshua Achiam
  4 | Adapted for CS294-112 Fall 2018 by Soroush Nasiriany, Sid Reddy, and Greg Kahn
  5 | Adapted for pytorch version by Ning Dai
  6 | """
  7 | import numpy as np
  8 | import torch
  9 | import gym
 10 | import logz
 11 | import os
 12 | import time
 13 | import inspect
 14 | from torch.multiprocessing import Process
 15 | from torch import nn, optim
 16 | 
 17 | #============================================================================================#
 18 | # Utilities
 19 | #============================================================================================#
 20 | 
 21 | def build_mlp(input_size, output_size, n_layers, hidden_size, activation=nn.Tanh):
 22 |     """
 23 |         Builds a feedforward neural network
 24 |         
 25 |         arguments:
 26 |             input_size: size of the input layer
 27 |             output_size: size of the output layer
 28 |             n_layers: number of hidden layers
 29 |             hidden_size: dimension of the hidden layers
 30 |             activation: activation of the hidden layers
 31 |             output_activation: activation of the output layer
 32 | 
 33 |         returns:
 34 |             an instance of nn.Sequential which contains the feedforward neural network
 35 | 
 36 |         Hint: use nn.Linear
 37 |     """
 38 |     layers = []
 39 |     # YOUR HW2 CODE HERE
 40 |     raise NotImplementedError
 41 | 
 42 |     return nn.Sequential(*layers).apply(weights_init)
 43 | 
 44 | def weights_init(m):
 45 |     if hasattr(m, 'weight'):
 46 |         nn.init.xavier_uniform_(m.weight)
 47 | 
 48 | def pathlength(path):
 49 |     return len(path["reward"])
 50 | 
 51 | def setup_logger(logdir, locals_):
 52 |     # Configure output directory for logging
 53 |     logz.configure_output_dir(logdir)
 54 |     # Log experimental parameters
 55 |     args = inspect.getargspec(train_AC)[0]
 56 |     hyperparams = {k: locals_[k] if k in locals_ else None for k in args}
 57 |     logz.save_hyperparams(hyperparams)
 58 | 
 59 | class PolicyNet(nn.Module):
 60 |     def __init__(self, neural_network_args):
 61 |         super(PolicyNet, self).__init__()
 62 |         self.ob_dim = neural_network_args['ob_dim']
 63 |         self.ac_dim = neural_network_args['ac_dim']
 64 |         self.discrete = neural_network_args['discrete']
 65 |         self.hidden_size = neural_network_args['size']
 66 |         self.n_layers = neural_network_args['actor_n_layers']
 67 | 
 68 |         self.define_model_components()
 69 |         
 70 |     def define_model_components(self):
 71 |         """
 72 |             Define the parameters of policy network here.
 73 |             You can use any instance of nn.Module or nn.Parameter.
 74 | 
 75 |             Hint: use the 'build_mlp' function above
 76 |                 In the discrete case, model should output logits of a categorical distribution
 77 |                     over the actions
 78 |                 In the continuous case, model should output a tuple (mean, log_std) of a Gaussian
 79 |                     distribution over actions. log_std should just be a trainable
 80 |                     variable, not a network output.
 81 |         """
 82 |         # YOUR HW2 CODE HERE
 83 |         if self.discrete:
 84 |             raise NotImplementedError
 85 |         else:
 86 |             raise NotImplementedError
 87 |             
 88 |     #========================================================================================#
 89 |     #                           ----------PROBLEM 2----------
 90 |     #========================================================================================#
 91 |     """
 92 |         Notes on notation:
 93 |         
 94 |         Pytorch tensor variables have the prefix ts_, to distinguish them from the numpy array
 95 |         variables that are computed later in the function
 96 |     
 97 |         Prefixes and suffixes:
 98 |         ob - observation 
 99 |         ac - action
100 |         _no - this tensor should have shape (batch size, observation dim)
101 |         _na - this tensor should have shape (batch size, action dim)
102 |         _n  - this tensor should have shape (batch size)
103 |             
104 |         Note: batch size is defined at runtime
105 |     """
106 |     def forward(self, ts_ob_no):
107 |         """
108 |             Define forward pass for policy network.
109 | 
110 |             arguments:
111 |                 ts_ob_no: (batch_size, self.ob_dim) 
112 | 
113 |             returns:
114 |                 the parameters of the policy.
115 | 
116 |                 if discrete, the parameters are the logits of a categorical distribution
117 |                     over the actions
118 |                     ts_logits_na: (batch_size, self.ac_dim)
119 | 
120 |                 if continuous, the parameters are a tuple (mean, log_std) of a Gaussian
121 |                     distribution over actions. log_std should just be a trainable
122 |                     variable, not a network output.
123 |                     ts_mean: (batch_size, self.ac_dim)
124 |                     st_logstd: (self.ac_dim,)
125 |         
126 |             Hint: use the components you defined in self.define_model_components
127 |         """
128 |         raise NotImplementedError
129 |         if self.discrete:
130 |             # YOUR HW2 CODE HERE
131 |             ts_logits_na = None
132 |             return ts_logits_na
133 |         else:
134 |             # YOUR HW2 CODE HERE
135 |             ts_mean = None
136 |             ts_logstd = None
137 |             return (ts_mean, ts_logstd)
138 |     
139 | #============================================================================================#
140 | # Actor Critic
141 | #============================================================================================#
142 | 
143 | class Agent(object):
144 |     def __init__(self, neural_network_args, sample_trajectory_args, estimate_advantage_args):
145 |         super(Agent, self).__init__()
146 |         self.ob_dim = neural_network_args['ob_dim']
147 |         self.ac_dim = neural_network_args['ac_dim']
148 |         self.discrete = neural_network_args['discrete']
149 |         self.hidden_size = neural_network_args['size']
150 |         self.critic_n_layers = neural_network_args['critic_n_layers']
151 |         self.actor_learning_rate = neural_network_args['actor_learning_rate']
152 |         self.critic_learning_rate = neural_network_args['critic_learning_rate']
153 |         self.num_target_updates = neural_network_args['num_target_updates']
154 |         self.num_grad_steps_per_target_update = neural_network_args['num_grad_steps_per_target_update']
155 | 
156 |         self.animate = sample_trajectory_args['animate']
157 |         self.max_path_length = sample_trajectory_args['max_path_length']
158 |         self.min_timesteps_per_batch = sample_trajectory_args['min_timesteps_per_batch']
159 | 
160 |         self.gamma = estimate_advantage_args['gamma']
161 |         self.normalize_advantages = estimate_advantage_args['normalize_advantages']
162 | 
163 |         self.policy_net = PolicyNet(neural_network_args)
164 |         self.value_net = build_mlp(self.ob_dim, 1, self.critic_n_layers, self.hidden_size)
165 | 
166 |         self.actor_optimizer = optim.Adam(self.policy_net.parameters(), lr=self.actor_learning_rate)
167 |         self.critic_optimizer = optim.Adam(self.value_net.parameters(), lr=self.critic_learning_rate)
168 |         
169 |     def sample_action(self, ob_no):
170 |         """
171 |             Build the method used for sampling action from the policy distribution
172 |     
173 |             arguments:
174 |                 ob_no: (batch_size, self.ob_dim)
175 | 
176 |             returns:
177 |                 sampled_ac: 
178 |                     if discrete: (batch_size)
179 |                     if continuous: (batch_size, self.ac_dim)
180 | 
181 |             Hint: for the continuous case, use the reparameterization trick:
182 |                  The output from a Gaussian distribution with mean 'mu' and std 'sigma' is
183 |         
184 |                       mu + sigma * z,         z ~ N(0, I)
185 |         
186 |                  This reduces the problem to just sampling z. (Hint: use torch.normal!)
187 |         """
188 |         ts_ob_no = torch.from_numpy(ob_no).float()
189 |         
190 |         raise NotImplementedError
191 |         if self.discrete:
192 |             ts_logits_na = self.policy_net(ts_ob_no)
193 |             # YOUR HW2 CODE HERE
194 |             ts_probs = None
195 |             ts_sampled_ac = None
196 |         else:
197 |             ts_mean, ts_logstd = self.policy_net(ts_ob_no)
198 |             # YOUR HW2 CODE HERE
199 |             ts_sampled_ac = None
200 | 
201 |         sampled_ac = ts_sampled_ac.numpy()
202 |             
203 |         return sampled_ac
204 |     
205 |     def get_log_prob(self, policy_parameters, ts_ac_na):
206 |         """
207 |             Build the method used for computing the log probability of a set of actions
208 |             that were actually taken according to the policy
209 | 
210 |             arguments:
211 |                 policy_parameters
212 |                     if discrete: logits of a categorical distribution over actions 
213 |                         ts_logits_na: (batch_size, self.ac_dim)
214 |                     if continuous: (mean, log_std) of a Gaussian distribution over actions
215 |                         ts_mean: (batch_size, self.ac_dim)
216 |                         ts_logstd: (self.ac_dim,)
217 | 
218 |                 ts_ac_na: (batch_size, self.ac_dim)
219 | 
220 |             returns:
221 |                 ts_logprob_n: (batch_size)
222 | 
223 |             Hint:
224 |                 For the discrete case, use the log probability under a categorical distribution.
225 |                 For the continuous case, use the log probability under a multivariate gaussian.
226 |         """
227 |         raise NotImplementedError
228 |         if self.discrete:
229 |             ts_logits_na = policy_parameters
230 |             # YOUR HW2 CODE HERE
231 |             ts_logprob_n = None
232 |         else:
233 |             ts_mean, ts_logstd = policy_parameters
234 |             # YOUR HW2 CODE HERE
235 |             ts_logprob_n = None
236 |             
237 |         return ts_logprob_n
238 | 
239 |     def sample_trajectories(self, itr, env):
240 |         # Collect paths until we have enough timesteps
241 |         timesteps_this_batch = 0
242 |         paths = []
243 |         while True:
244 |             animate_this_episode=(len(paths)==0 and (itr % 10 == 0) and self.animate)
245 |             path = self.sample_trajectory(env, animate_this_episode)
246 |             paths.append(path)
247 |             timesteps_this_batch += pathlength(path)
248 |             if timesteps_this_batch > self.min_timesteps_per_batch:
249 |                 break
250 |         return paths, timesteps_this_batch
251 | 
252 |     def sample_trajectory(self, env, animate_this_episode):
253 |         ob = env.reset()
254 |         obs, acs, rewards, next_obs, terminals = [], [], [], [], []
255 |         steps = 0
256 |         while True:
257 |             if animate_this_episode:
258 |                 env.render()
259 |                 time.sleep(0.1)
260 |             obs.append(ob)
261 |             raise NotImplementedError
262 |             ac = None # YOUR HW2 CODE HERE
263 |             ac = ac[0]
264 |             acs.append(ac)
265 |             ob, rew, done, _ = env.step(ac)
266 |             # add the observation after taking a step to next_obs
267 |             # YOUR CODE HERE
268 |             raise NotImplementedError
269 |             rewards.append(rew)
270 |             steps += 1
271 |             # If the episode ended, the corresponding terminal value is 1
272 |             # otherwise, it is 0
273 |             # YOUR CODE HERE
274 |             if done or steps > self.max_path_length:
275 |                 raise NotImplementedError
276 |                 break
277 |             else:
278 |                 raise NotImplementedError
279 |         path = {"observation" : np.array(obs, dtype=np.float32), 
280 |                 "reward" : np.array(rewards, dtype=np.float32), 
281 |                 "action" : np.array(acs, dtype=np.float32),
282 |                 "next_observation": np.array(next_obs, dtype=np.float32),
283 |                 "terminal": np.array(terminals, dtype=np.float32)}
284 |         return path
285 | 
286 |     def estimate_advantage(self, ob_no, next_ob_no, re_n, terminal_n):
287 |         """
288 |             Estimates the advantage function value for each timestep.
289 | 
290 |             let sum_of_path_lengths be the sum of the lengths of the paths sampled from 
291 |                 Agent.sample_trajectories
292 | 
293 |             arguments:
294 |                 ob_no: shape: (sum_of_path_lengths, ob_dim)
295 |                 next_ob_no: shape: (sum_of_path_lengths, ob_dim). The observation after taking one step forward
296 |                 re_n: length: sum_of_path_lengths. Each element in re_n is a scalar containing
297 |                     the reward for each timestep
298 |                 terminal_n: length: sum_of_path_lengths. Each element in terminal_n is either 1 if the episode ended
299 |                     at that timestep of 0 if the episode did not end
300 | 
301 |             returns:
302 |                 adv_n: shape: (sum_of_path_lengths). A single vector for the estimated 
303 |                     advantages whose length is the sum of the lengths of the paths
304 |         """
305 |         # First, estimate the Q value as Q(s, a) = r(s, a) + gamma*V(s')
306 |         # To get the advantage, subtract the V(s) to get A(s, a) = Q(s, a) - V(s)
307 |         # This requires calling the critic twice --- to obtain V(s') when calculating Q(s, a),
308 |         # and V(s) when subtracting the baseline
309 |         # Note: don't forget to use terminal_n to cut off the V(s') term when computing Q(s, a)
310 |         # otherwise the values will grow without bound.
311 |         # YOUR CODE HERE
312 |         raise NotImplementedError
313 |         adv_n = None
314 |         
315 |         if self.normalize_advantages:
316 |             raise NotImplementedError
317 |             adv_n = None # YOUR HW2 CODE HERE
318 |         return adv_n
319 | 
320 |     def update_critic(self, ob_no, next_ob_no, re_n, terminal_n):
321 |         """
322 |             Update the parameters of the critic.
323 | 
324 |             let sum_of_path_lengths be the sum of the lengths of the paths sampled from
325 |                 Agent.sample_trajectories
326 |             let num_paths be the number of paths sampled from Agent.sample_trajectories
327 | 
328 |             arguments:
329 |                 ob_no: shape: (sum_of_path_lengths, ob_dim)
330 |                 next_ob_no: shape: (sum_of_path_lengths, ob_dim). The observation after taking one step forward
331 |                 re_n: length: sum_of_path_lengths. Each element in re_n is a scalar containing
332 |                     the reward for each timestep
333 |                 terminal_n: length: sum_of_path_lengths. Each element in terminal_n is either 1 if the episode ended
334 |                     at that timestep of 0 if the episode did not end
335 | 
336 |             returns:
337 |                 nothing
338 |         """
339 |         # Use a bootstrapped target values to update the critic
340 |         # Compute the target values r(s, a) + gamma*V(s') by calling the critic to compute V(s')
341 |         # In total, take n=self.num_grad_steps_per_target_update*self.num_target_updates gradient update steps
342 |         # Every self.num_grad_steps_per_target_update steps, recompute the target values
343 |         # by evaluating V(s') on the updated critic
344 |         # Note: don't forget to use terminal_n to cut off the V(s') term when computing the target
345 |         # otherwise the values will grow without bound.
346 |         # YOUR CODE HERE
347 |         raise NotImplementedError
348 |                 
349 |     def update_actor(self, ob_no, ac_na, adv_n):
350 |         """ 
351 |             Update the parameters of the policy.
352 | 
353 |             arguments:
354 |                 ob_no: shape: (sum_of_path_lengths, ob_dim)
355 |                 ac_na: shape: (sum_of_path_lengths).
356 |                 adv_n: shape: (sum_of_path_lengths). A single vector for the estimated
357 |                     advantages whose length is the sum of the lengths of the paths
358 | 
359 |             returns:
360 |                 nothing
361 | 
362 |         """
363 |         # convert numpy array to pytorch tensor
364 |         ts_ob_no, ts_ac_na, ts_adv_n = map(lambda x: torch.from_numpy(x), [ob_no, ac_na, adv_n])
365 | 
366 |         # The policy takes in an observation and produces a distribution over the action space
367 |         policy_parameters = self.policy_net(ts_ob_no)
368 | 
369 |         # We can compute the logprob of the actions that were actually taken by the policy
370 |         # This is used in the loss function.
371 |         ts_logprob_n = self.get_log_prob(policy_parameters, ts_ac_na)
372 | 
373 |         # clean the gradient for model parameters
374 |         self.actor_optimizer.zero_grad()
375 |         
376 |         actor_loss = - (ts_logprob_n * ts_adv_n).mean() 
377 |         actor_loss.backward()
378 |         
379 |         self.actor_optimizer.step()
380 | 
381 | def train_AC(
382 |         exp_name,
383 |         env_name,
384 |         n_iter, 
385 |         gamma, 
386 |         min_timesteps_per_batch, 
387 |         max_path_length,
388 |         actor_learning_rate,
389 |         critic_learning_rate,
390 |         num_target_updates,
391 |         num_grad_steps_per_target_update,
392 |         animate, 
393 |         logdir, 
394 |         normalize_advantages,
395 |         seed,
396 |         actor_n_layers,
397 |         critic_n_layers,
398 |         size):
399 | 
400 |     start = time.time()
401 | 
402 |     #========================================================================================#
403 |     # Set Up Logger
404 |     #========================================================================================#
405 |     setup_logger(logdir, locals())
406 | 
407 |     #========================================================================================#
408 |     # Set Up Env
409 |     #========================================================================================#
410 | 
411 |     # Make the gym environment
412 |     env = gym.make(env_name)
413 | 
414 |     # Set random seeds
415 |     torch.manual_seed(seed)
416 |     np.random.seed(seed)
417 |     env.seed(seed)
418 | 
419 |     # Maximum length for episodes
420 |     max_path_length = max_path_length or env.spec.max_episode_steps
421 | 
422 |     # Is this env continuous, or self.discrete?
423 |     discrete = isinstance(env.action_space, gym.spaces.Discrete)
424 |     
425 | 
426 |     # Observation and action sizes
427 |     ob_dim = env.observation_space.shape[0]
428 |     ac_dim = env.action_space.n if discrete else env.action_space.shape[0]
429 | 
430 |     #========================================================================================#
431 |     # Initialize Agent
432 |     #========================================================================================#
433 |     neural_network_args = {
434 |         'actor_n_layers': actor_n_layers,
435 |         'critic_n_layers': critic_n_layers,
436 |         'ob_dim': ob_dim,
437 |         'ac_dim': ac_dim,
438 |         'discrete': discrete,
439 |         'size': size,
440 |         'actor_learning_rate': actor_learning_rate,
441 |         'critic_learning_rate': critic_learning_rate,
442 |         'num_target_updates': num_target_updates,
443 |         'num_grad_steps_per_target_update': num_grad_steps_per_target_update,
444 |         }
445 | 
446 |     sample_trajectory_args = {
447 |         'animate': animate,
448 |         'max_path_length': max_path_length,
449 |         'min_timesteps_per_batch': min_timesteps_per_batch,
450 |     }
451 | 
452 |     estimate_advantage_args = {
453 |         'gamma': gamma,
454 |         'normalize_advantages': normalize_advantages,
455 |     }
456 | 
457 |     agent = Agent(neural_network_args, sample_trajectory_args, estimate_advantage_args)
458 | 
459 |     #========================================================================================#
460 |     # Training Loop
461 |     #========================================================================================#
462 | 
463 |     total_timesteps = 0
464 |     for itr in range(n_iter):
465 |         print("********** Iteration %i ************"%itr)
466 |         
467 |         with torch.no_grad(): # use torch.no_grad to disable the gradient calculation
468 |             paths, timesteps_this_batch = agent.sample_trajectories(itr, env)
469 |         total_timesteps += timesteps_this_batch
470 | 
471 |         # Build arrays for observation, action for the policy gradient update by concatenating 
472 |         # across paths
473 |         ob_no = np.concatenate([path["observation"] for path in paths])
474 |         ac_na = np.concatenate([path["action"] for path in paths])
475 |         re_n = np.concatenate([path["reward"] for path in paths])
476 |         next_ob_no = np.concatenate([path["next_observation"] for path in paths])
477 |         terminal_n = np.concatenate([path["terminal"] for path in paths])
478 | 
479 |         # Call tensorflow operations to:
480 |         # (1) update the critic, by calling agent.update_critic
481 |         # (2) use the updated critic to compute the advantage by, calling agent.estimate_advantage
482 |         # (3) use the estimated advantage values to update the actor, by calling agent.update_actor
483 |         # YOUR CODE HERE
484 |         raise NotImplementedError
485 | 
486 |         # Log diagnostics
487 |         returns = [path["reward"].sum() for path in paths]
488 |         ep_lengths = [pathlength(path) for path in paths]
489 |         logz.log_tabular("Time", time.time() - start)
490 |         logz.log_tabular("Iteration", itr)
491 |         logz.log_tabular("AverageReturn", np.mean(returns))
492 |         logz.log_tabular("StdReturn", np.std(returns))
493 |         logz.log_tabular("MaxReturn", np.max(returns))
494 |         logz.log_tabular("MinReturn", np.min(returns))
495 |         logz.log_tabular("EpLenMean", np.mean(ep_lengths))
496 |         logz.log_tabular("EpLenStd", np.std(ep_lengths))
497 |         logz.log_tabular("TimestepsThisBatch", timesteps_this_batch)
498 |         logz.log_tabular("TimestepsSoFar", total_timesteps)
499 |         logz.dump_tabular()
500 |         logz.save_pytorch_model(agent)
501 | 
502 | 
503 | def main():
504 |     import argparse
505 |     parser = argparse.ArgumentParser()
506 |     parser.add_argument('env_name', type=str)
507 |     parser.add_argument('--exp_name', type=str, default='vac')
508 |     parser.add_argument('--render', action='store_true')
509 |     parser.add_argument('--discount', type=float, default=1.0)
510 |     parser.add_argument('--n_iter', '-n', type=int, default=100)
511 |     parser.add_argument('--batch_size', '-b', type=int, default=1000)
512 |     parser.add_argument('--ep_len', '-ep', type=float, default=-1.)
513 |     parser.add_argument('--actor_learning_rate', '-lr', type=float, default=5e-3)
514 |     parser.add_argument('--critic_learning_rate', '-clr', type=float)
515 |     parser.add_argument('--dont_normalize_advantages', '-dna', action='store_true')
516 |     parser.add_argument('--num_target_updates', '-ntu', type=int, default=10)
517 |     parser.add_argument('--num_grad_steps_per_target_update', '-ngsptu', type=int, default=10)
518 |     parser.add_argument('--seed', type=int, default=1)
519 |     parser.add_argument('--n_experiments', '-e', type=int, default=1)
520 |     parser.add_argument('--actor_n_layers', '-l', type=int, default=2)
521 |     parser.add_argument('--critic_n_layers', '-cl', type=int)
522 |     parser.add_argument('--size', '-s', type=int, default=64)
523 |     args = parser.parse_args()
524 | 
525 |     if not(os.path.exists('data')):
526 |         os.makedirs('data')
527 |     logdir = 'ac_' + args.exp_name + '_' + args.env_name + '_' + time.strftime("%d-%m-%Y_%H-%M-%S")
528 |     logdir = os.path.join('data', logdir)
529 |     if not(os.path.exists(logdir)):
530 |         os.makedirs(logdir)
531 | 
532 |     max_path_length = args.ep_len if args.ep_len > 0 else None
533 | 
534 |     if not args.critic_learning_rate:
535 |         args.critic_learning_rate = args.actor_learning_rate
536 | 
537 |     if not args.critic_n_layers:
538 |         args.critic_n_layers = args.actor_n_layers
539 |         
540 |     processes = []
541 | 
542 |     for e in range(args.n_experiments):
543 |         seed = args.seed + 10*e
544 |         print('Running experiment with seed %d'%seed)
545 | 
546 |         def train_func():
547 |             train_AC(
548 |                 exp_name=args.exp_name,
549 |                 env_name=args.env_name,
550 |                 n_iter=args.n_iter,
551 |                 gamma=args.discount,
552 |                 min_timesteps_per_batch=args.batch_size,
553 |                 max_path_length=max_path_length,
554 |                 actor_learning_rate=args.actor_learning_rate,
555 |                 critic_learning_rate=args.critic_learning_rate,
556 |                 num_target_updates=args.num_target_updates,
557 |                 num_grad_steps_per_target_update=args.num_grad_steps_per_target_update,
558 |                 animate=args.render,
559 |                 logdir=os.path.join(logdir,'%d'%seed),
560 |                 normalize_advantages=not(args.dont_normalize_advantages),
561 |                 seed=seed,
562 |                 actor_n_layers=args.actor_n_layers,
563 |                 critic_n_layers=args.critic_n_layers,
564 |                 size=args.size
565 |                 )
566 |         p = Process(target=train_func, args=tuple())
567 |         p.start()
568 |         processes.append(p)
569 |         # if you comment in the line below, then the loop will block 
570 |         # until this process finishes
571 |         # p.join()
572 | 
573 |     for p in processes:
574 |         p.join()
575 | 
576 | if __name__ == "__main__":
577 |     main()
578 | 


--------------------------------------------------------------------------------