├── .gitignore
├── LICENSE.md
├── README.md
├── assets
│   ├── a3c_con.gif
│   ├── a3c_pong.gif
│   ├── a3c_pong.png
│   ├── breakout.gif
│   └── cartpole.gif
├── core
│   ├── __init__.py
│   ├── agent.py
│   ├── agent_single_process.py
│   ├── agents
│   │   ├── __init__.py
│   │   ├── a3c.py
│   │   ├── a3c_single_process.py
│   │   ├── acer.py
│   │   ├── acer_single_process.py
│   │   ├── dqn.py
│   │   └── empty.py
│   ├── env.py
│   ├── envs
│   │   ├── __init__.py
│   │   ├── atari.py
│   │   ├── atari_ram.py
│   │   ├── gym.py
│   │   └── lab.py
│   ├── memories
│   │   ├── __init__.py
│   │   ├── episode_parameter.py
│   │   ├── episodic.py
│   │   └── sequential.py
│   ├── memory.py
│   ├── model.py
│   └── models
│       ├── __init__.py
│       ├── a3c_cnn_dis.py
│       ├── a3c_mlp_con.py
│       ├── acer_cnn_dis.py
│       ├── acer_mlp_dis.py
│       ├── dqn_cnn.py
│       ├── dqn_mlp.py
│       └── empty.py
├── figs
│   └── .gitignore
├── imgs
│   └── .gitignore
├── logs
│   └── .gitignore
├── main.py
├── models
│   └── .gitignore
├── optims
│   ├── __init__.py
│   ├── helpers.py
│   ├── sharedAdam.py
│   └── sharedRMSprop.py
├── plot.sh
├── plot_compare.sh
└── utils
    ├── __init__.py
    ├── distributions.py
    ├── factory.py
    ├── helpers.py
    ├── init_weights.py
    └── options.py
/.gitignore:
--------------------------------------------------------------------------------
1 | core/*.pyc
2 | core/envs/*.pyc
3 | core/models/*.pyc
4 | core/memories/*.pyc
5 | core/agents/*.pyc
6 | optims/*.pyc
7 | utils/*.pyc
8 | models/*
9 | logs/*
10 |
--------------------------------------------------------------------------------
/LICENSE.md:
--------------------------------------------------------------------------------
1 | The MIT License (MIT)
2 |
3 | Copyright (c) 2017 Jingwei Zhang
4 |
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 |
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 |
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # **Deep Reinforcement Learning** with
2 | # **pytorch** & **visdom**
3 | *******
4 |
5 |
6 | * Sample test runs of trained agents (DQN on Breakout, A3C on Pong, Double DQN on CartPole, continuous A3C on InvertedPendulum (MuJoCo)):
7 |
8 |
9 | _(demo GIFs: ./assets/breakout.gif, ./assets/a3c_pong.gif, ./assets/cartpole.gif, ./assets/a3c_con.gif)_
13 |
14 |
15 |
16 | * Sample on-line plotting while training an A3C agent on Pong (with 16 learner processes):
17 | _(training curves: ./assets/a3c_pong.png)_
18 |
19 | * Sample logs while training a DQN agent on CartPole (we currently use ```WARNING``` as the logging level to suppress the ```INFO``` printouts from visdom):
20 | ```bash
21 | [WARNING ] (MainProcess) <===================================>
22 | [WARNING ] (MainProcess) bash$: python -m visdom.server
23 | [WARNING ] (MainProcess) http://localhost:8097/env/daim_17040900
24 | [WARNING ] (MainProcess) <===================================> DQN
25 | [WARNING ] (MainProcess) <-----------------------------------> Env
26 | [WARNING ] (MainProcess) Creating {gym | CartPole-v0} w/ Seed: 123
27 | [INFO ] (MainProcess) Making new env: CartPole-v0
28 | [WARNING ] (MainProcess) Action Space: [0, 1]
29 | [WARNING ] (MainProcess) State Space: 4
30 | [WARNING ] (MainProcess) <-----------------------------------> Model
31 | [WARNING ] (MainProcess) MlpModel (
32 | (fc1): Linear (4 -> 16)
33 | (rl1): ReLU ()
34 | (fc2): Linear (16 -> 16)
35 | (rl2): ReLU ()
36 | (fc3): Linear (16 -> 16)
37 | (rl3): ReLU ()
38 | (fc4): Linear (16 -> 2)
39 | )
40 | [WARNING ] (MainProcess) No Pretrained Model. Will Train From Scratch.
41 | [WARNING ] (MainProcess) <===================================> Training ...
42 | [WARNING ] (MainProcess) Validation Data @ Step: 501
43 | [WARNING ] (MainProcess) Start Training @ Step: 501
44 | [WARNING ] (MainProcess) Reporting @ Step: 2500 | Elapsed Time: 5.32397913933
45 | [WARNING ] (MainProcess) Training Stats: epsilon: 0.972
46 | [WARNING ] (MainProcess) Training Stats: total_reward: 2500.0
47 | [WARNING ] (MainProcess) Training Stats: avg_reward: 21.7391304348
48 | [WARNING ] (MainProcess) Training Stats: nepisodes: 115
49 | [WARNING ] (MainProcess) Training Stats: nepisodes_solved: 114
50 | [WARNING ] (MainProcess) Training Stats: repisodes_solved: 0.991304347826
51 | [WARNING ] (MainProcess) Evaluating @ Step: 2500
52 | [WARNING ] (MainProcess) Iteration: 2500; v_avg: 1.73136949539
53 | [WARNING ] (MainProcess) Iteration: 2500; tderr_avg: 0.0964358523488
54 | [WARNING ] (MainProcess) Iteration: 2500; steps_avg: 9.34579439252
55 | [WARNING ] (MainProcess) Iteration: 2500; steps_std: 0.798395631184
56 | [WARNING ] (MainProcess) Iteration: 2500; reward_avg: 9.34579439252
57 | [WARNING ] (MainProcess) Iteration: 2500; reward_std: 0.798395631184
58 | [WARNING ] (MainProcess) Iteration: 2500; nepisodes: 107
59 | [WARNING ] (MainProcess) Iteration: 2500; nepisodes_solved: 106
60 | [WARNING ] (MainProcess) Iteration: 2500; repisodes_solved: 0.990654205607
61 | [WARNING ] (MainProcess) Saving Model @ Step: 2500: /home/zhang/ws/17_ws/pytorch-rl/models/daim_17040900.pth ...
62 | [WARNING ] (MainProcess) Saved Model @ Step: 2500: /home/zhang/ws/17_ws/pytorch-rl/models/daim_17040900.pth.
63 | [WARNING ] (MainProcess) Resume Training @ Step: 2500
64 | ...
65 | ```
66 | *******
67 |
68 |
69 | ## What is included?
70 | This repo currently contains the following agents:
71 |
72 | - Deep Q Learning (DQN) [[1]](http://arxiv.org/abs/1312.5602), [[2]](http://home.uchicago.edu/~arij/journalclub/papers/2015_Mnih_et_al.pdf)
73 | - Double DQN [[3]](http://arxiv.org/abs/1509.06461)
74 | - Dueling network DQN (Dueling DQN) [[4]](https://arxiv.org/abs/1511.06581)
75 | - Asynchronous Advantage Actor-Critic (A3C) (w/ both discrete/continuous action space support) [[5]](https://arxiv.org/abs/1602.01783), [[6]](https://arxiv.org/abs/1506.02438)
76 | - Sample Efficient Actor-Critic with Experience Replay (ACER) (currently w/ discrete action space support (Truncated Importance Sampling, 1st Order TRPO)) [[7]](https://arxiv.org/abs/1611.01224), [[8]](https://arxiv.org/abs/1606.02647)
77 |
78 | Work in progress:
79 | - Testing ACER
80 |
81 | Future Plans:
82 | - Deep Deterministic Policy Gradient (DDPG) [[9]](http://arxiv.org/abs/1509.02971), [[10]](http://proceedings.mlr.press/v32/silver14.pdf)
83 | - Continuous DQN (CDQN or NAF) [[11]](http://arxiv.org/abs/1603.00748)
84 |
85 |
86 | ## Code structure & Naming conventions:
87 | NOTE: we follow the same code structure as [pytorch-dnc](https://github.com/jingweiz/pytorch-dnc), so that code can easily be transplanted between the two repos.
88 | * ```./utils/factory.py```
89 | > We suggest users start from ```./utils/factory.py```,
90 | where all the integrated ```Env```, ```Model```,
91 | ```Memory``` and ```Agent``` classes are collected into ```Dict```'s.
92 | All four of these core classes are implemented in ```./core/```.
93 | The factory pattern in ```./utils/factory.py``` keeps the code clean:
94 | no matter which type of ```Agent``` you want to train,
95 | or which type of ```Env``` you want to train it on,
96 | all you need to do is modify a few parameters in ```./utils/options.py```,
97 | and ```./main.py``` will do the rest (NOTE: ```./main.py``` itself never needs to be modified; a minimal sketch of this pattern is given at the end of this section).
98 | * Naming conventions
99 | > To keep the code clean and readable, variables are named using the following pattern (mainly in the inherited ```Agent```'s):
100 | > * ```*_vb```: ```torch.autograd.Variable```'s or a list of such objects
101 | > * ```*_ts```: ```torch.Tensor```'s or a list of such objects
102 | > * otherwise: normal python datatypes
103 |
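A minimal, self-contained sketch of the factory pattern described above (both the type strings and the class names below are illustrative placeholders; the real mappings and imports live in ```./utils/factory.py```):

```python
# Sketch only: stand-ins for the real classes under ./core/ (names are illustrative).
class GymEnv: pass              # stands in for core/envs/gym.py
class DqnMlpModel: pass         # stands in for core/models/dqn_mlp.py
class SequentialMemory: pass    # stands in for core/memories/sequential.py

class DQNAgent:                 # stands in for core/agents/dqn.py
    def __init__(self, args, env_prototype, model_prototype, memory_prototype=None):
        # prototypes are stored here and instantiated later inside the agent
        self.env_prototype = env_prototype
        self.model_prototype = model_prototype
        self.memory_prototype = memory_prototype

# factory.py collects the integrated classes into plain dicts keyed by type strings:
EnvDict    = {"gym":        GymEnv}            # env_type    -> Env class
ModelDict  = {"dqn-mlp":    DqnMlpModel}       # model_type  -> Model class
MemoryDict = {"sequential": SequentialMemory}  # memory_type -> Memory class
AgentDict  = {"dqn":        DQNAgent}          # agent_type  -> Agent class

# main.py then only performs dictionary lookups, so it never needs to be edited:
agent = AgentDict["dqn"](args=None,
                         env_prototype=EnvDict["gym"],
                         model_prototype=ModelDict["dqn-mlp"],
                         memory_prototype=MemoryDict["sequential"])
```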
104 |
105 | ## Dependencies
106 | - Python 2.7
107 | - [PyTorch >=v0.2.0](http://pytorch.org/)
108 | - [Visdom](https://github.com/facebookresearch/visdom)
109 | - [OpenAI Gym >=v0.9.0 (for lower versions, just change to the game names available in that version, e.g. change PongDeterministic-v4 to PongDeterministic-v3)](https://github.com/openai/gym)
110 | - [mujoco-py (Optional: for training the continuous version of A3C)](https://github.com/openai/mujoco-py)
111 | *******
112 |
113 |
114 | ## How to run:
115 | You only need to modify some parameters in ```./utils/options.py``` to train a new configuration.
116 |
117 | * Configure your training in ```./utils/options.py```:
118 | > * ```line 14```: add an entry into ```CONFIGS``` to define your training (```agent_type```, ```env_type```, ```game```, ```model_type```, ```memory_type```); an illustrative entry is sketched below this list
119 | > * ```line 33```: choose the entry you just added
120 | > * ```line 29-30```: fill in your machine/cluster ID (```MACHINE```) and timestamp (```TIMESTAMP```) to define your training signature (```MACHINE_TIMESTAMP```);
121 | the model file and the log file of this training will be saved under this signature (```./models/MACHINE_TIMESTAMP.pth``` & ```./logs/MACHINE_TIMESTAMP.log``` respectively).
122 | The visdom visualization will also be displayed under this signature (first start the visdom server by typing ```python -m visdom.server &``` in bash, then open ```http://localhost:8097/env/MACHINE_TIMESTAMP``` in your browser).
123 | > * ```line 32```: to train a model, set ```mode=1``` (training visualization will be under ```http://localhost:8097/env/MACHINE_TIMESTAMP```); to test the model from this training, simply set ```mode=2``` (testing visualization will be under ```http://localhost:8097/env/MACHINE_TIMESTAMP_test```).
124 |
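For orientation only, a ```CONFIGS``` entry is essentially the tuple of type strings listed above; a hypothetical entry for DQN on CartPole could look roughly like this (the exact format and the registered type strings are defined in ```./utils/options.py``` and ```./utils/factory.py```, so follow the existing entries there):

```python
# Hypothetical sketch of a CONFIGS entry -- not copied from ./utils/options.py.
#             agent_type   env_type   game            model_type   memory_type
CONFIGS = [ [ "dqn",       "gym",     "CartPole-v0",  "dqn-mlp",   "sequential" ] ]
```
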
125 | * Run:
126 | > ```python main.py```
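
A typical session then looks like the following (assuming the default visdom port and a training signature of ```MACHINE_TIMESTAMP```, as described above):

```bash
# start the visdom server once, in the background
python -m visdom.server &
# train (mode=1) or test (mode=2), as configured in ./utils/options.py
python main.py
# then watch the plots in your browser:
#   http://localhost:8097/env/MACHINE_TIMESTAMP        (training)
#   http://localhost:8097/env/MACHINE_TIMESTAMP_test   (testing)
```
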
127 | *******
128 |
129 |
130 | ## Bonus Scripts :)
131 | We also provide two additional scripts for quickly evaluating your results after training. (Dependencies: [lmj-plot](https://github.com/lmjohns3/py-plot))
132 | * ```plot.sh``` (e.g., plot from log file: ```logs/machine1_17080801.log```)
133 | > * ```./plot.sh machine1 17080801```
134 | > * the generated figures will be saved into ```figs/machine1_17080801/```
135 | * ```plot_compare.sh``` (e.g., compare log files: ```logs/machine1_17080801.log```, ```logs/machine2_17080802.log```)
136 | > * ```./plot_compare.sh 00 machine1 17080801 machine2 17080802```
137 | > * the generated figures will be saved into ```figs/compare_00/```
138 | > * the color coding will be in the order of: ```red green blue magenta yellow cyan```
139 | *******
140 |
141 |
142 | ## Repos we referred to during the development of this repo:
143 | * [matthiasplappert/keras-rl](https://github.com/matthiasplappert/keras-rl)
144 | * [transedward/pytorch-dqn](https://github.com/transedward/pytorch-dqn)
145 | * [ikostrikov/pytorch-a3c](https://github.com/ikostrikov/pytorch-a3c)
146 | * [onlytailei/A3C-PyTorch](https://github.com/onlytailei/A3C-PyTorch)
147 | * [Kaixhin/ACER](https://github.com/Kaixhin/ACER)
148 | * And a private implementation of A3C from [@stokasto](https://github.com/stokasto)
149 | *******
150 |
151 |
152 | ## Citation
153 | If you find this library useful and would like to cite it, the following would be appropriate:
154 | ```
155 | @misc{pytorch-rl,
156 | author = {Zhang, Jingwei and Tai, Lei},
157 | title = {jingweiz/pytorch-rl},
158 | url = {https://github.com/jingweiz/pytorch-rl},
159 | year = {2017}
160 | }
161 | ```
162 |
--------------------------------------------------------------------------------
/assets/a3c_con.gif:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jingweiz/pytorch-rl/20b3b738ca400b1916197f27a91367878b09803c/assets/a3c_con.gif
--------------------------------------------------------------------------------
/assets/a3c_pong.gif:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jingweiz/pytorch-rl/20b3b738ca400b1916197f27a91367878b09803c/assets/a3c_pong.gif
--------------------------------------------------------------------------------
/assets/a3c_pong.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jingweiz/pytorch-rl/20b3b738ca400b1916197f27a91367878b09803c/assets/a3c_pong.png
--------------------------------------------------------------------------------
/assets/breakout.gif:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jingweiz/pytorch-rl/20b3b738ca400b1916197f27a91367878b09803c/assets/breakout.gif
--------------------------------------------------------------------------------
/assets/cartpole.gif:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jingweiz/pytorch-rl/20b3b738ca400b1916197f27a91367878b09803c/assets/cartpole.gif
--------------------------------------------------------------------------------
/core/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jingweiz/pytorch-rl/20b3b738ca400b1916197f27a91367878b09803c/core/__init__.py
--------------------------------------------------------------------------------
/core/agent.py:
--------------------------------------------------------------------------------
1 | from __future__ import absolute_import
2 | from __future__ import division
3 | from __future__ import print_function
4 | import torch
5 | import torch.optim as optim
6 |
7 | from utils.helpers import Experience
8 |
9 | class Agent(object):
10 | def __init__(self, args, env_prototype, model_prototype, memory_prototype=None):
11 | # logging
12 | self.logger = args.logger
13 |
14 | # prototypes for env & model & memory
15 | self.env_prototype = env_prototype # NOTE: instantiated in fit_model() of inherited Agents
16 | self.env_params = args.env_params
17 | self.model_prototype = model_prototype # NOTE: instantiated in fit_model() of inherited Agents
18 | self.model_params = args.model_params
19 | self.memory_prototype = memory_prototype # NOTE: instantiated in __init__() of inherited Agents (dqn needs it; a3c doesn't, so it just passes in None)
20 | self.memory_params = args.memory_params
21 |
22 | # params
23 | self.model_name = args.model_name # NOTE: will save the current model to model_name
24 | self.model_file = args.model_file # NOTE: will load pretrained model_file if not None
25 |
26 | self.render = args.render
27 | self.visualize = args.visualize
28 | if self.visualize:
29 | self.vis = args.vis
30 | self.refs = args.refs
31 |
32 | self.save_best = args.save_best
33 | if self.save_best:
34 | self.best_step = None # NOTE: achieves best_reward at this step
35 | self.best_reward = None # NOTE: only save a new model if achieves higher reward
36 |
37 | self.hist_len = args.hist_len
38 | self.hidden_dim = args.hidden_dim
39 |
40 | self.use_cuda = args.use_cuda
41 | self.dtype = args.dtype
42 |
43 | # agent_params
44 | # criteria and optimizer
45 | self.value_criteria = args.value_criteria
46 | self.optim = args.optim
47 | # hyperparameters
48 | self.steps = args.steps
49 | self.early_stop = args.early_stop
50 | self.gamma = args.gamma
51 | self.clip_grad = args.clip_grad
52 | self.lr = args.lr
53 | self.lr_decay = args.lr_decay
54 | self.weight_decay = args.weight_decay
55 | self.eval_freq = args.eval_freq
56 | self.eval_steps = args.eval_steps
57 | self.prog_freq = args.prog_freq
58 | self.test_nepisodes = args.test_nepisodes
59 | if args.agent_type == "dqn":
60 | self.enable_double_dqn = args.enable_double_dqn
61 | self.enable_dueling = args.enable_dueling
62 | self.dueling_type = args.dueling_type
63 |
64 | self.learn_start = args.learn_start
65 | self.batch_size = args.batch_size
66 | self.valid_size = args.valid_size
67 | self.eps_start = args.eps_start
68 | self.eps_end = args.eps_end
69 | self.eps_eval = args.eps_eval
70 | self.eps_decay = args.eps_decay
71 | self.target_model_update = args.target_model_update
72 | self.action_repetition = args.action_repetition
73 | self.memory_interval = args.memory_interval
74 | self.train_interval = args.train_interval
75 | elif args.agent_type == "a3c":
76 | self.enable_log_at_train_step = args.enable_log_at_train_step
77 |
78 | self.enable_lstm = args.enable_lstm
79 | self.enable_continuous = args.enable_continuous
80 | self.num_processes = args.num_processes
81 |
82 | self.rollout_steps = args.rollout_steps
83 | self.tau = args.tau
84 | self.beta = args.beta
85 | elif args.agent_type == "acer":
86 | self.enable_bias_correction = args.enable_bias_correction
87 | self.enable_1st_order_trpo = args.enable_1st_order_trpo
88 | self.enable_log_at_train_step = args.enable_log_at_train_step
89 |
90 | self.enable_lstm = args.enable_lstm
91 | self.enable_continuous = args.enable_continuous
92 | self.num_processes = args.num_processes
93 |
94 | self.replay_ratio = args.replay_ratio
95 | self.replay_start = args.replay_start
96 | self.batch_size = args.batch_size
97 | self.valid_size = args.valid_size
98 | self.clip_trace = args.clip_trace
99 | self.clip_1st_order_trpo = args.clip_1st_order_trpo
100 | self.avg_model_decay = args.avg_model_decay
101 |
102 | self.rollout_steps = args.rollout_steps
103 | self.tau = args.tau
104 | self.beta = args.beta
105 |
106 | def _reset_experience(self):
107 | self.experience = Experience(state0 = None,
108 | action = None,
109 | reward = None,
110 | state1 = None,
111 | terminal1 = False)
112 |
113 | def _load_model(self, model_file):
114 | if model_file:
115 | self.logger.warning("Loading Model: " + self.model_file + " ...")
116 | self.model.load_state_dict(torch.load(model_file))
117 | self.logger.warning("Loaded Model: " + self.model_file + " ...")
118 | else:
119 | self.logger.warning("No Pretrained Model. Will Train From Scratch.")
120 |
121 | def _save_model(self, step, curr_reward):
122 | self.logger.warning("Saving Model @ Step: " + str(step) + ": " + self.model_name + " ...")
123 | if self.save_best:
124 | if self.best_step is None:
125 | self.best_step = step
126 | self.best_reward = curr_reward
127 | if curr_reward >= self.best_reward:
128 | self.best_step = step
129 | self.best_reward = curr_reward
130 | torch.save(self.model.state_dict(), self.model_name)
131 | self.logger.warning("Saved Model @ Step: " + str(step) + ": " + self.model_name + ". {Best Step: " + str(self.best_step) + " | Best Reward: " + str(self.best_reward) + "}")
132 | else:
133 | torch.save(self.model.state_dict(), self.model_name)
134 | self.logger.warning("Saved Model @ Step: " + str(step) + ": " + self.model_name + ".")
135 |
136 | def _forward(self, observation):
137 | raise NotImplementedError("not implemented in base class")
138 |
139 | def _backward(self, reward, terminal):
140 | raise NotImplementedError("not implemented in base class")
141 |
142 | def _eval_model(self): # evaluation during training
143 | raise NotImplementedError("not implemented in base class")
144 |
145 | def fit_model(self): # training
146 | raise NotImplementedError("not implemented in base class")
147 |
148 | def test_model(self): # testing pre-trained models
149 | raise NotImplementedError("not implemented in base class")
150 |
--------------------------------------------------------------------------------
/core/agent_single_process.py:
--------------------------------------------------------------------------------
1 | from __future__ import absolute_import
2 | from __future__ import division
3 | from __future__ import print_function
4 | import numpy as np
5 | import random
6 | import time
7 | import math
8 | import torch
9 | import torch.optim as optim
10 | from torch.autograd import Variable
11 | import torch.nn.functional as F
12 | import torch.multiprocessing as mp
13 |
14 | from utils.helpers import Experience, one_hot
15 |
16 | class AgentSingleProcess(mp.Process):
17 | def __init__(self, master, process_id=0):
18 | super(AgentSingleProcess, self).__init__(name = "Process-%d" % process_id)
19 | # NOTE: self.master.* refers to parameters shared across all processes
20 | # NOTE: self.* refers to process-specific properties
21 | # NOTE: we are not copying self.master.* to self.* to keep the code clean
22 |
23 | self.master = master
24 | self.process_id = process_id
25 |
26 | # env
27 | self.env = self.master.env_prototype(self.master.env_params, self.process_id)
28 | # model
29 | self.model = self.master.model_prototype(self.master.model_params)
30 | self._sync_local_with_global()
31 |
32 | # experience
33 | self._reset_experience()
34 |
35 | def _reset_experience(self): # for getting one set of observation from env for every action taken
36 | self.experience = Experience(state0 = None,
37 | action = None,
38 | reward = None,
39 | state1 = None,
40 | terminal1 = False) # TODO: should check this again
41 |
42 | def _sync_local_with_global(self): # grab the current global model for local learning/evaluating
43 | self.model.load_state_dict(self.master.model.state_dict())
44 |
45 | # NOTE: since no backward passes have ever been run on the global model
46 | # NOTE: its grad has never been initialized, here we ensure proper initialization
47 | # NOTE: reference: https://discuss.pytorch.org/t/problem-on-variable-grad-data/957
48 | def _ensure_global_grads(self):
49 | for global_param, local_param in zip(self.master.model.parameters(),
50 | self.model.parameters()):
51 | if global_param.grad is not None:
52 | return
53 | global_param._grad = local_param.grad
54 |
55 | def _forward(self, observation):
56 | raise NotImplementedError("not implemented in base class")
57 |
58 | def _backward(self, reward, terminal):
59 | raise NotImplementedError("not implemented in base class")
60 |
61 | def run(self):
62 | raise NotImplementedError("not implemented in base class")
63 |
--------------------------------------------------------------------------------
/core/agents/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jingweiz/pytorch-rl/20b3b738ca400b1916197f27a91367878b09803c/core/agents/__init__.py
--------------------------------------------------------------------------------
/core/agents/a3c.py:
--------------------------------------------------------------------------------
1 | from __future__ import absolute_import
2 | from __future__ import division
3 | from __future__ import print_function
4 | import torch.multiprocessing as mp
5 |
6 | from core.agent import Agent
7 | from core.agents.a3c_single_process import A3CLearner, A3CEvaluator, A3CTester
8 |
9 | class A3CAgent(Agent):
10 | def __init__(self, args, env_prototype, model_prototype, memory_prototype):
11 | super(A3CAgent, self).__init__(args, env_prototype, model_prototype, memory_prototype)
12 | self.logger.warning("<===================================> A3C-Master {Env(dummy) & Model}")
13 |
14 | # dummy_env just to get state_shape & action_dim
15 | self.dummy_env = self.env_prototype(self.env_params, self.num_processes)
16 | self.state_shape = self.dummy_env.state_shape
17 | self.action_dim = self.dummy_env.action_dim
18 | del self.dummy_env
19 |
20 | # global shared model
21 | self.model_params.state_shape = self.state_shape
22 | self.model_params.action_dim = self.action_dim
23 | self.model = self.model_prototype(self.model_params)
24 | self._load_model(self.model_file) # load pretrained model if provided
25 | self.model.share_memory() # NOTE
26 |
27 | # learning algorithm
28 | self.optimizer = self.optim(self.model.parameters(), lr = self.lr)
29 | self.optimizer.share_memory() # NOTE
30 | self.lr_adjusted = mp.Value('d', self.lr) # adjusted lr
31 |
32 | # global counters
33 | self.frame_step = mp.Value('l', 0) # global frame step counter
34 | self.train_step = mp.Value('l', 0) # global train step counter
35 | # global training stats
36 | self.p_loss_avg = mp.Value('d', 0.) # global policy loss
37 | self.v_loss_avg = mp.Value('d', 0.) # global value loss
38 | self.loss_avg = mp.Value('d', 0.) # global loss
39 | self.loss_counter = mp.Value('l', 0) # storing this many losses
40 | self._reset_training_loggings()
41 |
42 | def _reset_training_loggings(self):
43 | self.p_loss_avg.value = 0.
44 | self.v_loss_avg.value = 0.
45 | self.loss_avg.value = 0.
46 | self.loss_counter.value = 0
47 |
48 | def fit_model(self):
49 | self.jobs = []
50 | for process_id in range(self.num_processes):
51 | self.jobs.append(A3CLearner(self, process_id))
52 | self.jobs.append(A3CEvaluator(self, self.num_processes))
53 |
54 | self.logger.warning("<===================================> Training ...")
55 | for job in self.jobs:
56 | job.start()
57 | for job in self.jobs:
58 | job.join()
59 |
60 | def test_model(self):
61 | self.jobs = []
62 | self.jobs.append(A3CTester(self))
63 |
64 | self.logger.warning("<===================================> Testing ...")
65 | for job in self.jobs:
66 | job.start()
67 | for job in self.jobs:
68 | job.join()
69 |
--------------------------------------------------------------------------------
/core/agents/a3c_single_process.py:
--------------------------------------------------------------------------------
1 | from __future__ import absolute_import
2 | from __future__ import division
3 | from __future__ import print_function
4 | import numpy as np
5 | import random
6 | import time
7 | import math
8 | import torch
9 | from torch.autograd import Variable
10 | import torch.nn.functional as F
11 |
12 | from utils.helpers import A3C_Experience
13 | from core.agent_single_process import AgentSingleProcess
14 | from optims.helpers import adjust_learning_rate
15 | class A3CSingleProcess(AgentSingleProcess):
16 | def __init__(self, master, process_id=0):
17 | super(A3CSingleProcess, self).__init__(master, process_id)
18 |
19 | # lstm hidden states
20 | if self.master.enable_lstm:
21 | self._reset_lstm_hidden_vb_episode() # clear up hidden state
22 | self._reset_lstm_hidden_vb_rollout() # detach the previous variable from the computation graph
23 |
24 | # NOTE global variable pi
25 | if self.master.enable_continuous:
26 | self.pi_vb = Variable(torch.Tensor([math.pi]).type(self.master.dtype))
27 |
28 | self.master.logger.warning("Registered A3C-SingleProcess-Agent #" + str(self.process_id) + " w/ Env (seed:" + str(self.env.seed) + ").")
29 |
30 | # NOTE: to be called at the beginning of each new episode, clear up the hidden state
31 | def _reset_lstm_hidden_vb_episode(self, training=True): # seq_len, batch_size, hidden_dim
32 | not_training = not training
33 | if self.master.enable_continuous:
34 | self.lstm_hidden_vb = (Variable(torch.zeros(2, self.master.hidden_dim).type(self.master.dtype), volatile=not_training),
35 | Variable(torch.zeros(2, self.master.hidden_dim).type(self.master.dtype), volatile=not_training))
36 | else:
37 | self.lstm_hidden_vb = (Variable(torch.zeros(1, self.master.hidden_dim).type(self.master.dtype), volatile=not_training),
38 | Variable(torch.zeros(1, self.master.hidden_dim).type(self.master.dtype), volatile=not_training))
39 |
40 | # NOTE: to be called at the beginning of each rollout, detach the previous variable from the graph
41 | def _reset_lstm_hidden_vb_rollout(self):
42 | self.lstm_hidden_vb = (Variable(self.lstm_hidden_vb[0].data),
43 | Variable(self.lstm_hidden_vb[1].data))
44 |
45 | def _preprocessState(self, state, is_volatile=False):
46 | if isinstance(state, list):
47 | state_vb = []
48 | for i in range(len(state)):
49 | state_vb.append(Variable(torch.from_numpy(state[i]).unsqueeze(0).type(self.master.dtype), volatile=is_volatile))
50 | else:
51 | state_vb = Variable(torch.from_numpy(state).unsqueeze(0).type(self.master.dtype), volatile=is_volatile)
52 | return state_vb
53 |
54 | def _forward(self, state_vb):
55 | if self.master.enable_continuous: # NOTE: in continuous control, p_vb here is the mu_vb of the continuous action dist
56 | if self.master.enable_lstm:
57 | p_vb, sig_vb, v_vb, self.lstm_hidden_vb = self.model(state_vb, self.lstm_hidden_vb)
58 | else:
59 | p_vb, sig_vb, v_vb = self.model(state_vb)
60 | if self.training:
61 | _eps = torch.randn(p_vb.size())
62 | action = (p_vb + sig_vb.sqrt()*Variable(_eps)).data.numpy()
63 | else:
64 | action = p_vb.data.numpy()
65 | return action, p_vb, sig_vb, v_vb
66 | else:
67 | if self.master.enable_lstm:
68 | p_vb, v_vb, self.lstm_hidden_vb = self.model(state_vb, self.lstm_hidden_vb)
69 | else:
70 | p_vb, v_vb = self.model(state_vb)
71 | if self.training:
72 | action = p_vb.multinomial().data[0][0]
73 | else:
74 | action = p_vb.max(1)[1].data.squeeze().numpy()[0]
75 | return action, p_vb, v_vb
76 |
77 | def _normal(self, x, mu, sigma_sq):
78 | a = (-1 * (x - mu).pow(2) / (2 * sigma_sq)).exp()
79 | b = 1 / (2 * sigma_sq * self.pi_vb.expand_as(sigma_sq)).sqrt()
80 | return (a * b).log()
81 |
82 | class A3CLearner(A3CSingleProcess):
83 | def __init__(self, master, process_id=0):
84 | master.logger.warning("<===================================> A3C-Learner #" + str(process_id) + " {Env & Model}")
85 | super(A3CLearner, self).__init__(master, process_id)
86 |
87 | self._reset_rollout()
88 |
89 | self.training = True # choose actions by multinomial sampling
90 | self.model.train(self.training)
91 | # local counters
92 | self.frame_step = 0 # local frame step counter
93 | self.train_step = 0 # local train step counter
94 | # local training stats
95 | self.p_loss_avg = 0. # local policy loss
96 | self.v_loss_avg = 0. # local value loss
97 | self.loss_avg = 0. # local total loss
98 | self.loss_counter = 0 # number of losses accumulated so far
99 | self._reset_training_loggings()
100 |
101 | # copy local training stats to global every prog_freq
102 | self.last_prog = time.time()
103 |
104 | def _reset_training_loggings(self):
105 | self.p_loss_avg = 0.
106 | self.v_loss_avg = 0.
107 | self.loss_avg = 0.
108 | self.loss_counter = 0
109 |
110 | def _reset_rollout(self): # for storing the experiences collected through one rollout
111 | self.rollout = A3C_Experience(state0 = [],
112 | action = [],
113 | reward = [],
114 | state1 = [],
115 | terminal1 = [],
116 | policy_vb = [],
117 | sigmoid_vb = [],
118 | value0_vb = [])
119 |
120 | def _get_valueT_vb(self):
121 | if self.rollout.terminal1[-1]: # for terminal sT
122 | valueT_vb = Variable(torch.zeros(1, 1))
123 | else: # for non-terminal sT
124 | sT_vb = self._preprocessState(self.rollout.state1[-1], True) # bootstrap from last state
125 | if self.master.enable_continuous:
126 | if self.master.enable_lstm:
127 | _, _, valueT_vb, _ = self.model(sT_vb, self.lstm_hidden_vb) # NOTE: only doing inference here
128 | else:
129 | _, _, valueT_vb = self.model(sT_vb) # NOTE: only doing inference here
130 | else:
131 | if self.master.enable_lstm:
132 | _, valueT_vb, _ = self.model(sT_vb, self.lstm_hidden_vb) # NOTE: only doing inference here
133 | else:
134 | _, valueT_vb = self.model(sT_vb) # NOTE: only doing inference here
135 | # NOTE: here valueT_vb.volatile=True since sT_vb.volatile=True
136 | # NOTE: if we use detach() here, it would remain volatile
137 | # NOTE: then all the follow-up computations would only give volatile loss variables
138 | valueT_vb = Variable(valueT_vb.data)
139 |
140 | return valueT_vb
141 |
142 | def _backward(self):
143 | # preparation
144 | rollout_steps = len(self.rollout.reward)
145 | policy_vb = self.rollout.policy_vb
146 | if self.master.enable_continuous:
147 | action_batch_vb = Variable(torch.from_numpy(np.array(self.rollout.action)))
148 | if self.master.use_cuda:
149 | action_batch_vb = action_batch_vb.cuda()
150 | sigma_vb = self.rollout.sigmoid_vb
151 | else:
152 | action_batch_vb = Variable(torch.from_numpy(np.array(self.rollout.action)).long())
153 | if self.master.use_cuda:
154 | action_batch_vb = action_batch_vb.cuda()
155 | policy_log_vb = [torch.log(policy_vb[i]) for i in range(rollout_steps)]
156 | entropy_vb = [- (policy_log_vb[i] * policy_vb[i]).sum(1) for i in range(rollout_steps)]
157 | policy_log_vb = [policy_log_vb[i].gather(1, action_batch_vb[i].unsqueeze(0)) for i in range(rollout_steps) ]
158 | valueT_vb = self._get_valueT_vb()
159 | self.rollout.value0_vb.append(Variable(valueT_vb.data)) # NOTE: only this last entry is detached from the graph, all others are still in the graph
160 | gae_ts = torch.zeros(1, 1)
161 |
162 | # compute loss
163 | policy_loss_vb = 0.
164 | value_loss_vb = 0.
165 | for i in reversed(range(rollout_steps)):
166 | valueT_vb = self.master.gamma * valueT_vb + self.rollout.reward[i]
167 | advantage_vb = valueT_vb - self.rollout.value0_vb[i]
168 | value_loss_vb = value_loss_vb + 0.5 * advantage_vb.pow(2)
169 |
170 | # Generalized Advantage Estimation
171 | tderr_ts = self.rollout.reward[i] + self.master.gamma * self.rollout.value0_vb[i + 1].data - self.rollout.value0_vb[i].data
172 | gae_ts = self.master.gamma * gae_ts * self.master.tau + tderr_ts
173 | if self.master.enable_continuous:
174 | _log_prob = self._normal(action_batch_vb[i], policy_vb[i], sigma_vb[i])
175 | _entropy = 0.5 * ((sigma_vb[i] * 2 * self.pi_vb.expand_as(sigma_vb[i])).log() + 1)
176 | policy_loss_vb -= (_log_prob * Variable(gae_ts).expand_as(_log_prob)).sum() + self.master.beta * _entropy.sum()
177 | else:
178 | policy_loss_vb -= policy_log_vb[i] * Variable(gae_ts) + self.master.beta * entropy_vb[i]
179 |
180 | loss_vb = policy_loss_vb + 0.5 * value_loss_vb
181 | loss_vb.backward()
182 | torch.nn.utils.clip_grad_norm(self.model.parameters(), self.master.clip_grad)
183 |
184 | self._ensure_global_grads()
185 | self.master.optimizer.step()
186 | self.train_step += 1
187 | self.master.train_step.value += 1
188 |
189 | # adjust learning rate if enabled
190 | if self.master.lr_decay:
191 | self.master.lr_adjusted.value = max(self.master.lr * (self.master.steps - self.master.train_step.value) / self.master.steps, 1e-32)
192 | adjust_learning_rate(self.master.optimizer, self.master.lr_adjusted.value)
193 |
194 | # log training stats
195 | self.p_loss_avg += policy_loss_vb.data.numpy()
196 | self.v_loss_avg += value_loss_vb.data.numpy()
197 | self.loss_avg += loss_vb.data.numpy()
198 | self.loss_counter += 1
199 |
200 | def _rollout(self, episode_steps, episode_reward):
201 | # reset rollout experiences
202 | self._reset_rollout()
203 |
204 | t_start = self.frame_step
205 | # continue to rollout only if:
206 | # 1. not running out of max steps of this current rollout, and
207 | # 2. not terminal, and
208 | # 3. not exceeding max steps of this current episode
209 | # 4. master not exceeding max train steps
210 | while (self.frame_step - t_start) < self.master.rollout_steps \
211 | and not self.experience.terminal1 \
212 | and (self.master.early_stop is None or episode_steps < self.master.early_stop):
213 | # NOTE: here first store the last frame: experience.state1 as rollout.state0
214 | self.rollout.state0.append(self.experience.state1)
215 | # then get the action to take from rollout.state0 (experience.state1)
216 | if self.master.enable_continuous:
217 | action, p_vb, sig_vb, v_vb = self._forward(self._preprocessState(self.experience.state1))
218 | self.rollout.sigmoid_vb.append(sig_vb)
219 | else:
220 | action, p_vb, v_vb = self._forward(self._preprocessState(self.experience.state1))
221 | # then execute action in env to get a new experience.state1 -> rollout.state1
222 | self.experience = self.env.step(action)
223 | # push experience into rollout
224 | self.rollout.action.append(action)
225 | self.rollout.reward.append(self.experience.reward)
226 | self.rollout.state1.append(self.experience.state1)
227 | self.rollout.terminal1.append(self.experience.terminal1)
228 | self.rollout.policy_vb.append(p_vb)
229 | self.rollout.value0_vb.append(v_vb)
230 |
231 | episode_steps += 1
232 | episode_reward += self.experience.reward
233 | self.frame_step += 1
234 | self.master.frame_step.value += 1
235 |
236 | # NOTE: we put this condition in the end to make sure this current rollout won't be empty
237 | if self.master.train_step.value >= self.master.steps:
238 | break
239 |
240 | return episode_steps, episode_reward
241 |
242 | def run(self):
243 | # make sure processes are not completely synced by sleeping a bit
244 | time.sleep(int(np.random.rand() * (self.process_id + 5)))
245 |
246 | nepisodes = 0
247 | nepisodes_solved = 0
248 | episode_steps = None
249 | episode_reward = None
250 | should_start_new = True
251 | while self.master.train_step.value < self.master.steps:
252 | # sync in every step
253 | self._sync_local_with_global()
254 | self.model.zero_grad()
255 |
256 | # start of a new episode
257 | if should_start_new:
258 | episode_steps = 0
259 | episode_reward = 0.
260 | # reset lstm_hidden_vb for new episode
261 | if self.master.enable_lstm:
262 | # NOTE: clear hidden state at the beginning of each episode
263 | self._reset_lstm_hidden_vb_episode()
264 | # Obtain the initial observation by resetting the environment
265 | self._reset_experience()
266 | self.experience = self.env.reset()
267 | assert self.experience.state1 is not None
268 | # reset flag
269 | should_start_new = False
270 | if self.master.enable_lstm:
271 | # NOTE: detach the previous hidden variable from the graph at the beginning of each rollout
272 | self._reset_lstm_hidden_vb_rollout()
273 | # Run a rollout for rollout_steps or until terminal
274 | episode_steps, episode_reward = self._rollout(episode_steps, episode_reward)
275 |
276 | if self.experience.terminal1 or \
277 | self.master.early_stop and episode_steps >= self.master.early_stop:
278 | nepisodes += 1
279 | should_start_new = True
280 | if self.experience.terminal1:
281 | nepisodes_solved += 1
282 |
283 | # calculate loss
284 | self._backward()
285 |
286 | # copy local training stats to global at prog_freq, and clear up local stats
287 | if time.time() - self.last_prog >= self.master.prog_freq:
288 | self.master.p_loss_avg.value += self.p_loss_avg
289 | self.master.v_loss_avg.value += self.v_loss_avg
290 | self.master.loss_avg.value += self.loss_avg
291 | self.master.loss_counter.value += self.loss_counter
292 | self._reset_training_loggings()
293 | self.last_prog = time.time()
294 |
295 | class A3CEvaluator(A3CSingleProcess):
296 | def __init__(self, master, process_id=0):
297 | master.logger.warning("<===================================> A3C-Evaluator {Env & Model}")
298 | super(A3CEvaluator, self).__init__(master, process_id)
299 |
300 | self.training = False # choose actions w/ max probability
301 | self.model.train(self.training)
302 | self._reset_loggings()
303 |
304 | self.start_time = time.time()
305 | self.last_eval = time.time()
306 |
307 | def _reset_loggings(self):
308 | # training stats across all processes
309 | self.p_loss_avg_log = []
310 | self.v_loss_avg_log = []
311 | self.loss_avg_log = []
312 | # evaluation stats
313 | self.entropy_avg_log = []
314 | self.v_avg_log = []
315 | self.steps_avg_log = []
316 | self.steps_std_log = []
317 | self.reward_avg_log = []
318 | self.reward_std_log = []
319 | self.nepisodes_log = []
320 | self.nepisodes_solved_log = []
321 | self.repisodes_solved_log = []
322 | # placeholders for windows for online curve plotting
323 | if self.master.visualize:
324 | # training stats across all processes
325 | self.win_p_loss_avg = "win_p_loss_avg"
326 | self.win_v_loss_avg = "win_v_loss_avg"
327 | self.win_loss_avg = "win_loss_avg"
328 | # evaluation stats
329 | self.win_entropy_avg = "win_entropy_avg"
330 | self.win_v_avg = "win_v_avg"
331 | self.win_steps_avg = "win_steps_avg"
332 | self.win_steps_std = "win_steps_std"
333 | self.win_reward_avg = "win_reward_avg"
334 | self.win_reward_std = "win_reward_std"
335 | self.win_nepisodes = "win_nepisodes"
336 | self.win_nepisodes_solved = "win_nepisodes_solved"
337 | self.win_repisodes_solved = "win_repisodes_solved"
338 |
339 | def _eval_model(self):
340 | self.last_eval = time.time()
341 | eval_at_train_step = self.master.train_step.value
342 | eval_at_frame_step = self.master.frame_step.value
343 | # first grab the latest global model to do the evaluation
344 | self._sync_local_with_global()
345 |
346 | # evaluate
347 | eval_step = 0
348 |
349 | eval_entropy_log = []
350 | eval_v_log = []
351 | eval_nepisodes = 0
352 | eval_nepisodes_solved = 0
353 | eval_episode_steps = None
354 | eval_episode_steps_log = []
355 | eval_episode_reward = None
356 | eval_episode_reward_log = []
357 | eval_should_start_new = True
358 | while eval_step < self.master.eval_steps:
359 | if eval_should_start_new: # start of a new episode
360 | eval_episode_steps = 0
361 | eval_episode_reward = 0.
362 | # reset lstm_hidden_vb for new episode
363 | if self.master.enable_lstm:
364 | # NOTE: clear hidden state at the beginning of each episode
365 | self._reset_lstm_hidden_vb_episode(self.training)
366 | # Obtain the initial observation by resetting the environment
367 | self._reset_experience()
368 | self.experience = self.env.reset()
369 | assert self.experience.state1 is not None
370 | if not self.training:
371 | if self.master.visualize: self.env.visual()
372 | if self.master.render: self.env.render()
373 | # reset flag
374 | eval_should_start_new = False
375 | if self.master.enable_lstm:
376 | # NOTE: detach the previous hidden variable from the graph at the beginning of each step
377 | # NOTE: not necessary here in evaluation but we do it anyways
378 | self._reset_lstm_hidden_vb_rollout()
379 | # Run a single step
380 | if self.master.enable_continuous:
381 | eval_action, p_vb, sig_vb, v_vb = self._forward(self._preprocessState(self.experience.state1, True))
382 | else:
383 | eval_action, p_vb, v_vb = self._forward(self._preprocessState(self.experience.state1, True))
384 | self.experience = self.env.step(eval_action)
385 | if not self.training:
386 | if self.master.visualize: self.env.visual()
387 | if self.master.render: self.env.render()
388 | if self.experience.terminal1 or \
389 | self.master.early_stop and (eval_episode_steps + 1) == self.master.early_stop or \
390 | (eval_step + 1) == self.master.eval_steps:
391 | eval_should_start_new = True
392 |
393 | eval_episode_steps += 1
394 | eval_episode_reward += self.experience.reward
395 | eval_step += 1
396 |
397 | if eval_should_start_new:
398 | eval_nepisodes += 1
399 | if self.experience.terminal1:
400 | eval_nepisodes_solved += 1
401 |
402 | # This episode is finished, report and reset
403 | # NOTE make no sense for continuous
404 | if self.master.enable_continuous:
405 | eval_entropy_log.append([0.5 * ((sig_vb * 2 * self.pi_vb.expand_as(sig_vb)).log() + 1).data.numpy()])
406 | else:
407 | eval_entropy_log.append([np.mean((-torch.log(p_vb.data.squeeze()) * p_vb.data.squeeze()).numpy())])
408 | eval_v_log.append([v_vb.data.numpy()])
409 | eval_episode_steps_log.append([eval_episode_steps])
410 | eval_episode_reward_log.append([eval_episode_reward])
411 | self._reset_experience()
412 | eval_episode_steps = None
413 | eval_episode_reward = None
414 |
415 | # Logging for this evaluation phase
416 | loss_counter = self.master.loss_counter.value
417 | p_loss_avg = self.master.p_loss_avg.value / loss_counter if loss_counter > 0 else 0.
418 | v_loss_avg = self.master.v_loss_avg.value / loss_counter if loss_counter > 0 else 0.
419 | loss_avg = self.master.loss_avg.value / loss_counter if loss_counter > 0 else 0.
420 | self.master._reset_training_loggings()
421 | def _log_at_step(eval_at_step):
422 | self.p_loss_avg_log.append([eval_at_step, p_loss_avg])
423 | self.v_loss_avg_log.append([eval_at_step, v_loss_avg])
424 | self.loss_avg_log.append([eval_at_step, loss_avg])
425 | self.entropy_avg_log.append([eval_at_step, np.mean(np.asarray(eval_entropy_log))])
426 | self.v_avg_log.append([eval_at_step, np.mean(np.asarray(eval_v_log))])
427 | self.steps_avg_log.append([eval_at_step, np.mean(np.asarray(eval_episode_steps_log))])
428 | self.steps_std_log.append([eval_at_step, np.std(np.asarray(eval_episode_steps_log))])
429 | self.reward_avg_log.append([eval_at_step, np.mean(np.asarray(eval_episode_reward_log))])
430 | self.reward_std_log.append([eval_at_step, np.std(np.asarray(eval_episode_reward_log))])
431 | self.nepisodes_log.append([eval_at_step, eval_nepisodes])
432 | self.nepisodes_solved_log.append([eval_at_step, eval_nepisodes_solved])
433 | self.repisodes_solved_log.append([eval_at_step, (eval_nepisodes_solved/eval_nepisodes) if eval_nepisodes > 0 else 0.])
434 | # logging
435 | self.master.logger.warning("Reporting @ Step: " + str(eval_at_step) + " | Elapsed Time: " + str(time.time() - self.start_time))
436 | self.master.logger.warning("Iteration: {}; lr: {}".format(eval_at_step, self.master.lr_adjusted.value))
437 | self.master.logger.warning("Iteration: {}; p_loss_avg: {}".format(eval_at_step, self.p_loss_avg_log[-1][1]))
438 | self.master.logger.warning("Iteration: {}; v_loss_avg: {}".format(eval_at_step, self.v_loss_avg_log[-1][1]))
439 | self.master.logger.warning("Iteration: {}; loss_avg: {}".format(eval_at_step, self.loss_avg_log[-1][1]))
440 | self.master._reset_training_loggings()
441 | self.master.logger.warning("Evaluating @ Step: " + str(eval_at_train_step) + " | (" + str(eval_at_frame_step) + " frames)...")
442 | self.master.logger.warning("Evaluation Took: " + str(time.time() - self.last_eval))
443 | self.master.logger.warning("Iteration: {}; entropy_avg: {}".format(eval_at_step, self.entropy_avg_log[-1][1]))
444 | self.master.logger.warning("Iteration: {}; v_avg: {}".format(eval_at_step, self.v_avg_log[-1][1]))
445 | self.master.logger.warning("Iteration: {}; steps_avg: {}".format(eval_at_step, self.steps_avg_log[-1][1]))
446 | self.master.logger.warning("Iteration: {}; steps_std: {}".format(eval_at_step, self.steps_std_log[-1][1]))
447 | self.master.logger.warning("Iteration: {}; reward_avg: {}".format(eval_at_step, self.reward_avg_log[-1][1]))
448 | self.master.logger.warning("Iteration: {}; reward_std: {}".format(eval_at_step, self.reward_std_log[-1][1]))
449 | self.master.logger.warning("Iteration: {}; nepisodes: {}".format(eval_at_step, self.nepisodes_log[-1][1]))
450 | self.master.logger.warning("Iteration: {}; nepisodes_solved: {}".format(eval_at_step, self.nepisodes_solved_log[-1][1]))
451 | self.master.logger.warning("Iteration: {}; repisodes_solved: {}".format(eval_at_step, self.repisodes_solved_log[-1][1]))
452 | if self.master.enable_log_at_train_step:
453 | _log_at_step(eval_at_train_step)
454 | else:
455 | _log_at_step(eval_at_frame_step)
456 |
457 | # plotting
458 | if self.master.visualize:
459 | self.win_p_loss_avg = self.master.vis.scatter(X=np.array(self.p_loss_avg_log), env=self.master.refs, win=self.win_p_loss_avg, opts=dict(title="p_loss_avg"))
460 | self.win_v_loss_avg = self.master.vis.scatter(X=np.array(self.v_loss_avg_log), env=self.master.refs, win=self.win_v_loss_avg, opts=dict(title="v_loss_avg"))
461 | self.win_loss_avg = self.master.vis.scatter(X=np.array(self.loss_avg_log), env=self.master.refs, win=self.win_loss_avg, opts=dict(title="loss_avg"))
462 | self.win_entropy_avg = self.master.vis.scatter(X=np.array(self.entropy_avg_log), env=self.master.refs, win=self.win_entropy_avg, opts=dict(title="entropy_avg"))
463 | self.win_v_avg = self.master.vis.scatter(X=np.array(self.v_avg_log), env=self.master.refs, win=self.win_v_avg, opts=dict(title="v_avg"))
464 | self.win_steps_avg = self.master.vis.scatter(X=np.array(self.steps_avg_log), env=self.master.refs, win=self.win_steps_avg, opts=dict(title="steps_avg"))
465 | # self.win_steps_std = self.master.vis.scatter(X=np.array(self.steps_std_log), env=self.master.refs, win=self.win_steps_std, opts=dict(title="steps_std"))
466 | self.win_reward_avg = self.master.vis.scatter(X=np.array(self.reward_avg_log), env=self.master.refs, win=self.win_reward_avg, opts=dict(title="reward_avg"))
467 | # self.win_reward_std = self.master.vis.scatter(X=np.array(self.reward_std_log), env=self.master.refs, win=self.win_reward_std, opts=dict(title="reward_std"))
468 | self.win_nepisodes = self.master.vis.scatter(X=np.array(self.nepisodes_log), env=self.master.refs, win=self.win_nepisodes, opts=dict(title="nepisodes"))
469 | self.win_nepisodes_solved = self.master.vis.scatter(X=np.array(self.nepisodes_solved_log), env=self.master.refs, win=self.win_nepisodes_solved, opts=dict(title="nepisodes_solved"))
470 | self.win_repisodes_solved = self.master.vis.scatter(X=np.array(self.repisodes_solved_log), env=self.master.refs, win=self.win_repisodes_solved, opts=dict(title="repisodes_solved"))
471 | self.last_eval = time.time()
472 |
473 | # save model
474 | self.master._save_model(eval_at_train_step, self.reward_avg_log[-1][1])
475 |
476 | def run(self):
477 | while self.master.train_step.value < self.master.steps:
478 | if time.time() - self.last_eval > self.master.eval_freq:
479 | self._eval_model()
480 | # we also do a final evaluation after training is done
481 | self._eval_model()
482 |
483 | class A3CTester(A3CSingleProcess):
484 | def __init__(self, master, process_id=0):
485 | master.logger.warning("<===================================> A3C-Tester {Env & Model}")
486 | super(A3CTester, self).__init__(master, process_id)
487 |
488 | self.training = False # choose actions w/ max probability
489 | self.model.train(self.training)
490 | self._reset_loggings()
491 |
492 | self.start_time = time.time()
493 |
494 | def _reset_loggings(self):
495 | # testing stats
496 | self.steps_avg_log = []
497 | self.steps_std_log = []
498 | self.reward_avg_log = []
499 | self.reward_std_log = []
500 | self.nepisodes_log = []
501 | self.nepisodes_solved_log = []
502 | self.repisodes_solved_log = []
503 | # placeholders for windows for online curve plotting
504 | if self.master.visualize:
505 | # evaluation stats
506 | self.win_steps_avg = "win_steps_avg"
507 | self.win_steps_std = "win_steps_std"
508 | self.win_reward_avg = "win_reward_avg"
509 | self.win_reward_std = "win_reward_std"
510 | self.win_nepisodes = "win_nepisodes"
511 | self.win_nepisodes_solved = "win_nepisodes_solved"
512 | self.win_repisodes_solved = "win_repisodes_solved"
513 |
514 | def run(self):
515 | test_step = 0
516 | test_nepisodes = 0
517 | test_nepisodes_solved = 0
518 | test_episode_steps = None
519 | test_episode_steps_log = []
520 | test_episode_reward = None
521 | test_episode_reward_log = []
522 | test_should_start_new = True
523 | while test_nepisodes < self.master.test_nepisodes:
524 | if test_should_start_new: # start of a new episode
525 | test_episode_steps = 0
526 | test_episode_reward = 0.
527 | # reset lstm_hidden_vb for new episode
528 | if self.master.enable_lstm:
529 | # NOTE: clear hidden state at the beginning of each episode
530 | self._reset_lstm_hidden_vb_episode(self.training)
531 | # Obtain the initial observation by resetting the environment
532 | self._reset_experience()
533 | self.experience = self.env.reset()
534 | assert self.experience.state1 is not None
535 | if not self.training:
536 | if self.master.visualize: self.env.visual()
537 | if self.master.render: self.env.render()
538 | # reset flag
539 | test_should_start_new = False
540 | if self.master.enable_lstm:
541 | # NOTE: detach the previous hidden variable from the graph at the beginning of each step
542 | # NOTE: not necessary here in testing but we do it anyways
543 | self._reset_lstm_hidden_vb_rollout()
544 | # Run a single step
545 | if self.master.enable_continuous:
546 | test_action, p_vb, sig_vb, v_vb = self._forward(self._preprocessState(self.experience.state1, True))
547 | else:
548 | test_action, p_vb, v_vb = self._forward(self._preprocessState(self.experience.state1, True))
549 | self.experience = self.env.step(test_action)
550 | if not self.training:
551 | if self.master.visualize: self.env.visual()
552 | if self.master.render: self.env.render()
553 | if self.experience.terminal1 or \
554 | self.master.early_stop and (test_episode_steps + 1) == self.master.early_stop:
555 | test_should_start_new = True
556 |
557 | test_episode_steps += 1
558 | test_episode_reward += self.experience.reward
559 | test_step += 1
560 |
561 | if test_should_start_new:
562 | test_nepisodes += 1
563 | if self.experience.terminal1:
564 | test_nepisodes_solved += 1
565 |
566 | # This episode is finished, report and reset
567 | test_episode_steps_log.append([test_episode_steps])
568 | test_episode_reward_log.append([test_episode_reward])
569 | self._reset_experience()
570 | test_episode_steps = None
571 | test_episode_reward = None
572 |
573 | self.steps_avg_log.append([test_nepisodes, np.mean(np.asarray(test_episode_steps_log))])
574 | self.steps_std_log.append([test_nepisodes, np.std(np.asarray(test_episode_steps_log))]); del test_episode_steps_log
575 | self.reward_avg_log.append([test_nepisodes, np.mean(np.asarray(test_episode_reward_log))])
576 | self.reward_std_log.append([test_nepisodes, np.std(np.asarray(test_episode_reward_log))]); del test_episode_reward_log
577 | self.nepisodes_log.append([test_nepisodes, test_nepisodes])
578 | self.nepisodes_solved_log.append([test_nepisodes, test_nepisodes_solved])
579 | self.repisodes_solved_log.append([test_nepisodes, (test_nepisodes_solved/test_nepisodes) if test_nepisodes > 0 else 0.])
580 | # plotting
581 | if self.master.visualize:
582 | self.win_steps_avg = self.master.vis.scatter(X=np.array(self.steps_avg_log), env=self.master.refs, win=self.win_steps_avg, opts=dict(title="steps_avg"))
583 | # self.win_steps_std = self.master.vis.scatter(X=np.array(self.steps_std_log), env=self.master.refs, win=self.win_steps_std, opts=dict(title="steps_std"))
584 | self.win_reward_avg = self.master.vis.scatter(X=np.array(self.reward_avg_log), env=self.master.refs, win=self.win_reward_avg, opts=dict(title="reward_avg"))
585 | # self.win_reward_std = self.master.vis.scatter(X=np.array(self.reward_std_log), env=self.master.refs, win=self.win_reward_std, opts=dict(title="reward_std"))
586 | self.win_nepisodes = self.master.vis.scatter(X=np.array(self.nepisodes_log), env=self.master.refs, win=self.win_nepisodes, opts=dict(title="nepisodes"))
587 | self.win_nepisodes_solved = self.master.vis.scatter(X=np.array(self.nepisodes_solved_log), env=self.master.refs, win=self.win_nepisodes_solved, opts=dict(title="nepisodes_solved"))
588 | self.win_repisodes_solved = self.master.vis.scatter(X=np.array(self.repisodes_solved_log), env=self.master.refs, win=self.win_repisodes_solved, opts=dict(title="repisodes_solved"))
589 | # logging
590 | self.master.logger.warning("Testing Took: " + str(time.time() - self.start_time))
591 | self.master.logger.warning("Testing: steps_avg: {}".format(self.steps_avg_log[-1][1]))
592 | self.master.logger.warning("Testing: steps_std: {}".format(self.steps_std_log[-1][1]))
593 | self.master.logger.warning("Testing: reward_avg: {}".format(self.reward_avg_log[-1][1]))
594 | self.master.logger.warning("Testing: reward_std: {}".format(self.reward_std_log[-1][1]))
595 | self.master.logger.warning("Testing: nepisodes: {}".format(self.nepisodes_log[-1][1]))
596 | self.master.logger.warning("Testing: nepisodes_solved: {}".format(self.nepisodes_solved_log[-1][1]))
597 | self.master.logger.warning("Testing: repisodes_solved: {}".format(self.repisodes_solved_log[-1][1]))
598 |
--------------------------------------------------------------------------------
/core/agents/acer.py:
--------------------------------------------------------------------------------
1 | from __future__ import absolute_import
2 | from __future__ import division
3 | from __future__ import print_function
4 | import torch.multiprocessing as mp
5 |
6 | from core.agent import Agent
7 | from core.agents.acer_single_process import ACERLearner, ACEREvaluator, ACERTester
8 |
9 | class ACERAgent(Agent):
10 | def __init__(self, args, env_prototype, model_prototype, memory_prototype):
11 | super(ACERAgent, self).__init__(args, env_prototype, model_prototype, memory_prototype)
12 | self.logger.warning("<===================================> ACER-Master {Env(dummy) & Model}")
13 |
14 | # dummy_env just to get state_shape & action_dim
15 | self.dummy_env = self.env_prototype(self.env_params, self.num_processes)
16 | self.state_shape = self.dummy_env.state_shape
17 | self.action_dim = self.dummy_env.action_dim
18 | del self.dummy_env
19 |
20 | # global shared model
21 | self.model_params.state_shape = self.state_shape
22 | self.model_params.action_dim = self.action_dim
23 | self.model = self.model_prototype(self.model_params)
24 | self._load_model(self.model_file) # load pretrained model if provided
25 | self.model.share_memory() # NOTE
26 |
27 | # learning algorithm # TODO: could also linearly anneal learning rate
28 | self.optimizer = self.optim(self.model.parameters(), lr = self.lr)
29 | self.optimizer.share_memory() # NOTE
30 | self.lr_adjusted = mp.Value('d', self.lr) # adjusted lr
31 |
32 | # global shared average model: for 1st order trpo policy update
33 | self.avg_model = self.model_prototype(self.model_params)
34 | self.avg_model.load_state_dict(self.model.state_dict())
35 | self.avg_model.share_memory() # NOTE
36 | for param in self.avg_model.parameters(): param.requires_grad = False
37 |
38 | # global counters
39 | self.frame_step = mp.Value('l', 0) # global frame step counter
40 | self.train_step = mp.Value('l', 0) # global train step counter
41 | self.on_policy_train_step = mp.Value('l', 0) # global on-policy train step counter
42 | self.off_policy_train_step = mp.Value('l', 0) # global off-policy train step counter
43 | # global training stats
44 | self.p_loss_avg = mp.Value('d', 0.) # global policy loss
45 | self.v_loss_avg = mp.Value('d', 0.) # global value loss
46 |         self.entropy_loss_avg = mp.Value('d', 0.) # global entropy loss
47 | self.loss_counter = mp.Value('l', 0) # storing this many losses
48 | self._reset_training_loggings()
49 |
50 | def _reset_training_loggings(self):
51 | self.p_loss_avg.value = 0.
52 | self.v_loss_avg.value = 0.
53 | self.entropy_loss_avg.value = 0.
54 | self.loss_counter.value = 0
55 |
56 | def fit_model(self):
57 | self.jobs = []
58 | for process_id in range(self.num_processes):
59 | self.jobs.append(ACERLearner(self, process_id))
60 | self.jobs.append(ACEREvaluator(self, self.num_processes))
61 |
62 | self.logger.warning("<===================================> Training ...")
63 | for job in self.jobs:
64 | job.start()
65 | for job in self.jobs:
66 | job.join()
67 |
68 | def test_model(self):
69 | self.jobs = []
70 | self.jobs.append(ACERTester(self))
71 |
72 | self.logger.warning("<===================================> Testing ...")
73 | for job in self.jobs:
74 | job.start()
75 | for job in self.jobs:
76 | job.join()
77 |
--------------------------------------------------------------------------------
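
The `avg_model` created above is ACER's averaged policy network: the learner processes keep it close to the shared `model` so it can act as the trust-region anchor for the 1st-order TRPO-style policy update. As a rough reference, the sketch below shows the usual Polyak-style soft update such an average network receives; `update_average_model` and `tau` are illustrative names, not symbols from this repository, and the real update happens inside the learner processes.

```python
import torch.nn as nn

def update_average_model(avg_model, model, tau=0.01):
    # avg <- (1 - tau) * avg + tau * online, done on raw tensors so no autograd graph is built
    for avg_p, p in zip(avg_model.parameters(), model.parameters()):
        avg_p.data.copy_((1.0 - tau) * avg_p.data + tau * p.data)

# toy usage with two identically shaped networks
online, average = nn.Linear(4, 2), nn.Linear(4, 2)
average.load_state_dict(online.state_dict())
update_average_model(average, online, tau=0.01)
```
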
/core/agents/dqn.py:
--------------------------------------------------------------------------------
1 | from __future__ import absolute_import
2 | from __future__ import division
3 | from __future__ import print_function
4 | import numpy as np
5 | import random
6 | import time
7 | import torch
8 | from torch.autograd import Variable
9 |
10 | from optims.helpers import adjust_learning_rate
11 | from core.agent import Agent
12 |
13 | class DQNAgent(Agent):
14 | def __init__(self, args, env_prototype, model_prototype, memory_prototype):
15 | super(DQNAgent, self).__init__(args, env_prototype, model_prototype, memory_prototype)
16 | self.logger.warning("<===================================> DQN")
17 |
18 | # env
19 | self.env = self.env_prototype(self.env_params)
20 | self.state_shape = self.env.state_shape
21 | self.action_dim = self.env.action_dim
22 |
23 | # model
24 | self.model_params.state_shape = self.state_shape
25 | self.model_params.action_dim = self.action_dim
26 | self.model = self.model_prototype(self.model_params)
27 | self._load_model(self.model_file) # load pretrained model if provided
28 | # target_model
29 | self.target_model = self.model_prototype(self.model_params)
30 | self._update_target_model_hard()
31 |
32 | # memory
33 | # NOTE: we instantiate memory objects only inside fit_model/test_model
34 | # NOTE: since in fit_model we need both replay memory and recent memory
35 | # NOTE: while in test_model we only need recent memory, in which case memory_size=0
36 | self.memory_params = args.memory_params
37 |
38 | # experience & states
39 | self._reset_states()
40 |
41 | def _reset_training_loggings(self):
42 | self._reset_testing_loggings()
43 | # during the evaluation in training, we additionally log for
44 | # the predicted Q-values and TD-errors on validation data
45 | self.v_avg_log = []
46 | self.tderr_avg_log = []
47 | # placeholders for windows for online curve plotting
48 | if self.visualize:
49 | self.win_v_avg = "win_v_avg"
50 | self.win_tderr_avg = "win_tderr_avg"
51 |
52 | def _reset_testing_loggings(self):
53 | # setup logging for testing/evaluation stats
54 | self.steps_avg_log = []
55 | self.steps_std_log = []
56 | self.reward_avg_log = []
57 | self.reward_std_log = []
58 | self.nepisodes_log = []
59 | self.nepisodes_solved_log = []
60 | self.repisodes_solved_log = []
61 | # placeholders for windows for online curve plotting
62 | if self.visualize:
63 | self.win_steps_avg = "win_steps_avg"
64 | self.win_steps_std = "win_steps_std"
65 | self.win_reward_avg = "win_reward_avg"
66 | self.win_reward_std = "win_reward_std"
67 | self.win_nepisodes = "win_nepisodes"
68 | self.win_nepisodes_solved = "win_nepisodes_solved"
69 | self.win_repisodes_solved = "win_repisodes_solved"
70 |
71 | def _reset_states(self):
72 | self._reset_experience()
73 | self.recent_action = None
74 | self.recent_observation = None
75 |
76 | # Hard update every `target_model_update` steps.
77 | def _update_target_model_hard(self):
78 | self.target_model.load_state_dict(self.model.state_dict())
79 |
80 | # Soft update with `(1 - target_model_update) * old + target_model_update * new`.
81 | def _update_target_model_soft(self):
82 |         for key, target_weights in self.target_model.state_dict().items(): # NOTE: .items() works on both py2 & py3
83 |             target_weights.mul_(1. - self.target_model_update).add_(self.target_model_update * self.model.state_dict()[key])
84 |
85 | def _sample_validation_data(self):
86 | self.logger.warning("Validation Data @ Step: " + str(self.step))
87 | self.valid_data = self.memory.sample(self.valid_size)
88 |
89 | def _compute_validation_stats(self):
90 | return self._get_q_update(self.valid_data)
91 |
92 | def _get_q_update(self, experiences): # compute temporal difference error for a batch
93 | # Start by extracting the necessary parameters (we use a vectorized implementation).
94 | state0_batch_vb = Variable(torch.from_numpy(np.array(tuple(experiences[i].state0 for i in range(len(experiences))))).type(self.dtype))
95 | action_batch_vb = Variable(torch.from_numpy(np.array(tuple(experiences[i].action for i in range(len(experiences))))).long())
96 | reward_batch_vb = Variable(torch.from_numpy(np.array(tuple(experiences[i].reward for i in range(len(experiences)))))).type(self.dtype)
97 | state1_batch_vb = Variable(torch.from_numpy(np.array(tuple(experiences[i].state1 for i in range(len(experiences))))).type(self.dtype))
98 | terminal1_batch_vb = Variable(torch.from_numpy(np.array(tuple(0. if experiences[i].terminal1 else 1. for i in range(len(experiences)))))).type(self.dtype)
99 |
100 | if self.use_cuda:
101 | action_batch_vb = action_batch_vb.cuda()
102 |
103 | # Compute target Q values for mini-batch update.
104 | if self.enable_double_dqn:
105 | # According to the paper "Deep Reinforcement Learning with Double Q-learning"
106 | # (van Hasselt et al., 2015), in Double DQN, the online network predicts the actions
107 | # while the target network is used to estimate the Q value.
108 | q_values_vb = self.model(state1_batch_vb)
109 | # Detach this variable from the current graph since we don't want gradients to propagate
110 | q_values_vb = Variable(q_values_vb.data)
111 | # _, q_max_actions_vb = q_values_vb.max(dim=1) # 0.1.12
112 | _, q_max_actions_vb = q_values_vb.max(dim=1, keepdim=True) # 0.2.0
113 | # Now, estimate Q values using the target network but select the values with the
114 | # highest Q value wrt to the online model (as computed above).
115 | next_max_q_values_vb = self.target_model(state1_batch_vb)
116 | # Detach this variable from the current graph since we don't want gradients to propagate
117 | next_max_q_values_vb = Variable(next_max_q_values_vb.data)
118 | next_max_q_values_vb = next_max_q_values_vb.gather(1, q_max_actions_vb)
119 | else:
120 | # Compute the q_values given state1, and extract the maximum for each sample in the batch.
121 | # We perform this prediction on the target_model instead of the model for reasons
122 | # outlined in Mnih (2015). In short: it makes the algorithm more stable.
123 | next_max_q_values_vb = self.target_model(state1_batch_vb)
124 | # Detach this variable from the current graph since we don't want gradients to propagate
125 | next_max_q_values_vb = Variable(next_max_q_values_vb.data)
126 | # next_max_q_values_vb, _ = next_max_q_values_vb.max(dim = 1) # 0.1.12
127 | next_max_q_values_vb, _ = next_max_q_values_vb.max(dim = 1, keepdim=True) # 0.2.0
128 |
129 | # Compute r_t + gamma * max_a Q(s_t+1, a) and update the targets accordingly
130 | # but only for the affected output units (as given by action_batch).
131 | current_q_values_vb = self.model(state0_batch_vb).gather(1, action_batch_vb.unsqueeze(1)).squeeze()
132 | # Set discounted reward to zero for all states that were terminal.
133 | next_max_q_values_vb = next_max_q_values_vb * terminal1_batch_vb.unsqueeze(1)
134 | # expected_q_values_vb = reward_batch_vb + self.gamma * next_max_q_values_vb # 0.1.12
135 | expected_q_values_vb = reward_batch_vb + self.gamma * next_max_q_values_vb.squeeze() # 0.2.0
136 | # Compute temporal difference error, use huber loss to mitigate outlier impact
137 | # TODO: can optionally use huber loss from here: https://medium.com/@karpathy/yes-you-should-understand-backprop-e2f06eab496b
138 | td_error_vb = self.value_criteria(current_q_values_vb, expected_q_values_vb)
139 |
140 | # return v_avg, tderr_avg_vb
141 | if not self.training: # then is being called from _compute_validation_stats, which is just doing inference
142 | td_error_vb = Variable(td_error_vb.data) # detach it from the graph
143 | return next_max_q_values_vb.data.clone().mean(), td_error_vb
144 |
145 | def _epsilon_greedy(self, q_values_ts):
146 | # calculate epsilon
147 | if self.training: # linearly anneal epsilon
148 | self.eps = self.eps_end + max(0, (self.eps_start - self.eps_end) * (self.eps_decay - max(0, self.step - self.learn_start)) / self.eps_decay)
149 | else:
150 | self.eps = self.eps_eval
151 | # choose action
152 | if np.random.uniform() < self.eps: # then we choose a random action
153 | action = random.randrange(self.action_dim)
154 | else: # then we choose the greedy action
155 | if self.use_cuda:
156 | action = np.argmax(q_values_ts.cpu().numpy())
157 | else:
158 | action = np.argmax(q_values_ts.numpy())
159 | return action
160 |
161 | def _forward(self, observation):
162 | # Select an action.
163 | state = self.memory.get_recent_state(observation)
164 | state_ts = torch.from_numpy(np.array(state)).unsqueeze(0).type(self.dtype)
165 | q_values_ts = self.model(Variable(state_ts, volatile=True)).data # NOTE: only doing inference here, so volatile=True
166 | if self.training and self.step < self.learn_start: # then we don't do any learning, just accumulate experiences into replay memory
167 | action = random.randrange(self.action_dim) # thus we only randomly sample actions here, since the model hasn't been updated at all till now
168 | else:
169 | action = self._epsilon_greedy(q_values_ts)
170 |
171 | # Book keeping
172 | self.recent_observation = observation
173 | self.recent_action = action
174 |
175 | return action
176 |
177 | def _backward(self, reward, terminal):
178 | # Store most recent experience in memory.
179 | if self.step % self.memory_interval == 0:
180 |             # NOTE: so the tuples stored in memory correspond to:
181 |             # NOTE: in recent_observation(state0), take recent_action(action), get reward(reward), end up in terminal(terminal1)
182 | self.memory.append(self.recent_observation, self.recent_action, reward, terminal,
183 | training = self.training)
184 |
185 | if not self.training:
186 | # We're done here. No need to update the replay memory since we only use the
187 | # recent memory to obtain the state over the most recent observations.
188 | return
189 |
190 | # sample validation data right before training started
191 | # NOTE: here validation data is not entirely clean since the agent might see those data during training
192 |         # NOTE: but that's fine (it is also the case in the original DQN code), because this data is not used to judge performance as in supervised learning,
193 |         # NOTE: but rather to inspect the learning procedure; we could keep it entirely separate from the training data, but it's not worth the effort
194 | if self.step == self.learn_start + 1:
195 | self._sample_validation_data()
196 | self.logger.warning("Start Training @ Step: " + str(self.step))
197 |
198 | # Train the network on a single stochastic batch.
199 | if self.step > self.learn_start and self.step % self.train_interval == 0:
200 | experiences = self.memory.sample(self.batch_size)
201 | # Compute temporal difference error
202 | _, td_error_vb = self._get_q_update(experiences)
203 | # Construct optimizer and clear old gradients
204 | # TODO: can linearly anneal the lr here thus we would have to create a new optimizer here
205 | # TODO: we leave the lr constant here for now and wait for update threads maybe from: https://discuss.pytorch.org/t/adaptive-learning-rate/320/11
206 | self.optimizer.zero_grad()
207 | # run backward pass and clip gradient
208 | td_error_vb.backward()
209 | for param in self.model.parameters():
210 | param.grad.data.clamp_(-self.clip_grad, self.clip_grad)
211 | # Perform the update
212 | self.optimizer.step()
213 |
214 | # adjust learning rate if enabled
215 | if self.lr_decay:
216 | self.lr_adjusted = max(self.lr * (self.steps - self.step) / self.steps, 1e-32)
217 | adjust_learning_rate(self.optimizer, self.lr_adjusted)
218 |
219 | if self.target_model_update >= 1 and self.step % self.target_model_update == 0:
220 | self._update_target_model_hard() # Hard update every `target_model_update` steps.
221 | if self.target_model_update < 1.: # TODO: have not tested
222 | self._update_target_model_soft() # Soft update with `(1 - target_model_update) * old + target_model_update * new`.
223 |
224 | return
225 |
226 | def fit_model(self):
227 | # memory
228 | self.memory = self.memory_prototype(limit = self.memory_params.memory_size,
229 | window_length = self.memory_params.hist_len)
230 | self.eps = self.eps_start
231 | # self.optimizer = self.optim(self.model.parameters(), lr=self.lr, alpha=0.95, eps=0.01, weight_decay=self.weight_decay) # RMSprop
232 | self.optimizer = self.optim(self.model.parameters(), lr=self.lr, weight_decay=self.weight_decay) # Adam
233 | self.lr_adjusted = self.lr
234 |
235 | self.logger.warning("<===================================> Training ...")
236 | self.training = True
237 | self._reset_training_loggings()
238 |
239 | self.start_time = time.time()
240 | self.step = 0
241 |
242 | nepisodes = 0
243 | nepisodes_solved = 0
244 | episode_steps = None
245 | episode_reward = None
246 | total_reward = 0.
247 | should_start_new = True
248 | while self.step < self.steps:
249 | if should_start_new: # start of a new episode
250 | episode_steps = 0
251 | episode_reward = 0.
252 | # Obtain the initial observation by resetting the environment
253 | self._reset_states()
254 | self.experience = self.env.reset()
255 | assert self.experience.state1 is not None
256 | if not self.training:
257 | if self.visualize: self.env.visual()
258 | if self.render: self.env.render()
259 | # reset flag
260 | should_start_new = False
261 | # Run a single step
262 | # This is where all of the work happens. We first perceive and compute the action
263 | # (forward step) and then use the reward to improve (backward step)
264 | action = self._forward(self.experience.state1)
265 | reward = 0.
266 | for _ in range(self.action_repetition):
267 | self.experience = self.env.step(action)
268 | if not self.training:
269 | if self.visualize: self.env.visual()
270 | if self.render: self.env.render()
271 | reward += self.experience.reward
272 | if self.experience.terminal1:
273 | should_start_new = True
274 | break
275 | if self.early_stop and (episode_steps + 1) >= self.early_stop or (self.step + 1) % self.eval_freq == 0:
276 | # to make sure the historic observations for the first hist_len-1 steps in (the next episode / eval) would be clean
277 | should_start_new = True
278 | if should_start_new:
279 | self._backward(reward, True)
280 | else:
281 | self._backward(reward, self.experience.terminal1)
282 |
283 | episode_steps += 1
284 | episode_reward += reward
285 | self.step += 1
286 |
287 | if should_start_new:
288 | # We are in a terminal state but the agent hasn't yet seen it. We therefore
289 | # perform one more forward-backward call and simply ignore the action before
290 | # resetting the environment. We need to pass in "terminal=False" here since
291 | # the *next* state, that is the state of the newly reset environment, is
292 | # always non-terminal by convention.
293 | self._forward(self.experience.state1) # recent_observation & recent_action get updated
294 | self._backward(0., False) # recent experience gets pushed into memory
295 |                 # NOTE: the append that happens in here only saves s1; none of (a, r, t) are used for this terminal s1 when sampling
296 | total_reward += episode_reward
297 | nepisodes += 1
298 | if self.experience.terminal1:
299 | nepisodes_solved += 1
300 | self._reset_states()
301 | episode_steps = None
302 | episode_reward = None
303 |
304 | # report training stats
305 | if self.step % self.prog_freq == 0:
306 | self.logger.warning("Reporting @ Step: " + str(self.step) + " | Elapsed Time: " + str(time.time() - self.start_time))
307 | self.logger.warning("Training Stats: lr: {}".format(self.lr_adjusted))
308 | self.logger.warning("Training Stats: epsilon: {}".format(self.eps))
309 | self.logger.warning("Training Stats: total_reward: {}".format(total_reward))
310 | self.logger.warning("Training Stats: avg_reward: {}".format(total_reward/nepisodes if nepisodes > 0 else 0.))
311 | self.logger.warning("Training Stats: nepisodes: {}".format(nepisodes))
312 | self.logger.warning("Training Stats: nepisodes_solved: {}".format(nepisodes_solved))
313 | self.logger.warning("Training Stats: repisodes_solved: {}".format(nepisodes_solved/nepisodes if nepisodes > 0 else 0.))
314 |
315 | # evaluation & checkpointing
316 | if self.step > self.learn_start and self.step % self.eval_freq == 0:
317 | # Set states for evaluation
318 | self.training = False
319 | self.logger.warning("Evaluating @ Step: " + str(self.step))
320 | self._eval_model()
321 |
322 | # Set states for resume training
323 | self.training = True
324 | self.logger.warning("Resume Training @ Step: " + str(self.step))
325 | should_start_new = True
326 |
327 | def _eval_model(self):
328 | self.training = False
329 | eval_step = 0
330 |
331 | eval_nepisodes = 0
332 | eval_nepisodes_solved = 0
333 | eval_episode_steps = None
334 | eval_episode_steps_log = []
335 | eval_episode_reward = None
336 | eval_episode_reward_log = []
337 | eval_should_start_new = True
338 | while eval_step < self.eval_steps:
339 | if eval_should_start_new: # start of a new episode
340 | eval_episode_steps = 0
341 | eval_episode_reward = 0.
342 | # Obtain the initial observation by resetting the environment
343 | self._reset_states()
344 | self.experience = self.env.reset()
345 | assert self.experience.state1 is not None
346 | if not self.training:
347 | if self.visualize: self.env.visual()
348 | if self.render: self.env.render()
349 | # reset flag
350 | eval_should_start_new = False
351 | # Run a single step
352 | eval_action = self._forward(self.experience.state1)
353 | eval_reward = 0.
354 | for _ in range(self.action_repetition):
355 | self.experience = self.env.step(eval_action)
356 | if not self.training:
357 | if self.visualize: self.env.visual()
358 | if self.render: self.env.render()
359 | eval_reward += self.experience.reward
360 | if self.experience.terminal1:
361 | eval_should_start_new = True
362 | break
363 | if self.early_stop and (eval_episode_steps + 1) >= self.early_stop or (eval_step + 1) == self.eval_steps:
364 | # to make sure the historic observations for the first hist_len-1 steps in (the next episode / resume training) would be clean
365 | eval_should_start_new = True
366 | # NOTE: here NOT doing backprop, only adding into recent memory
367 | if eval_should_start_new:
368 | self._backward(eval_reward, True)
369 | else:
370 | self._backward(eval_reward, self.experience.terminal1)
371 |
372 | eval_episode_steps += 1
373 | eval_episode_reward += eval_reward
374 | eval_step += 1
375 |
376 | if eval_should_start_new:
377 | # We are in a terminal state but the agent hasn't yet seen it. We therefore
378 | # perform one more forward-backward call and simply ignore the action before
379 | # resetting the environment. We need to pass in "terminal=False" here since
380 | # the *next* state, that is the state of the newly reset environment, is
381 | # always non-terminal by convention.
382 | self._forward(self.experience.state1) # recent_observation & recent_action get updated
383 | self._backward(0., False) # NOTE: here NOT doing backprop, only adding into recent memory
384 |
385 | eval_nepisodes += 1
386 | if self.experience.terminal1:
387 | eval_nepisodes_solved += 1
388 |
389 | # This episode is finished, report and reset
390 | eval_episode_steps_log.append([eval_episode_steps])
391 | eval_episode_reward_log.append([eval_episode_reward])
392 | self._reset_states()
393 | eval_episode_steps = None
394 | eval_episode_reward = None
395 |
396 | # Computing validation stats
397 | v_avg, tderr_avg_vb = self._compute_validation_stats()
398 | # Logging for this evaluation phase
399 | self.v_avg_log.append([self.step, v_avg])
400 | self.tderr_avg_log.append([self.step, tderr_avg_vb.data.clone().mean()])
401 | self.steps_avg_log.append([self.step, np.mean(np.asarray(eval_episode_steps_log))])
402 | self.steps_std_log.append([self.step, np.std(np.asarray(eval_episode_steps_log))]); del eval_episode_steps_log
403 | self.reward_avg_log.append([self.step, np.mean(np.asarray(eval_episode_reward_log))])
404 | self.reward_std_log.append([self.step, np.std(np.asarray(eval_episode_reward_log))]); del eval_episode_reward_log
405 | self.nepisodes_log.append([self.step, eval_nepisodes])
406 | self.nepisodes_solved_log.append([self.step, eval_nepisodes_solved])
407 | self.repisodes_solved_log.append([self.step, (eval_nepisodes_solved/eval_nepisodes) if eval_nepisodes > 0 else 0])
408 | # plotting
409 | if self.visualize:
410 | self.win_v_avg = self.vis.scatter(X=np.array(self.v_avg_log), env=self.refs, win=self.win_v_avg, opts=dict(title="v_avg"))
411 | self.win_tderr_avg = self.vis.scatter(X=np.array(self.tderr_avg_log), env=self.refs, win=self.win_tderr_avg, opts=dict(title="tderr_avg"))
412 | self.win_steps_avg = self.vis.scatter(X=np.array(self.steps_avg_log), env=self.refs, win=self.win_steps_avg, opts=dict(title="steps_avg"))
413 | # self.win_steps_std = self.vis.scatter(X=np.array(self.steps_std_log), env=self.refs, win=self.win_steps_std, opts=dict(title="steps_std"))
414 | self.win_reward_avg = self.vis.scatter(X=np.array(self.reward_avg_log), env=self.refs, win=self.win_reward_avg, opts=dict(title="reward_avg"))
415 | # self.win_reward_std = self.vis.scatter(X=np.array(self.reward_std_log), env=self.refs, win=self.win_reward_std, opts=dict(title="reward_std"))
416 | self.win_nepisodes = self.vis.scatter(X=np.array(self.nepisodes_log), env=self.refs, win=self.win_nepisodes, opts=dict(title="nepisodes"))
417 | self.win_nepisodes_solved = self.vis.scatter(X=np.array(self.nepisodes_solved_log), env=self.refs, win=self.win_nepisodes_solved, opts=dict(title="nepisodes_solved"))
418 | self.win_repisodes_solved = self.vis.scatter(X=np.array(self.repisodes_solved_log), env=self.refs, win=self.win_repisodes_solved, opts=dict(title="repisodes_solved"))
419 | # logging
420 | self.logger.warning("Iteration: {}; v_avg: {}".format(self.step, self.v_avg_log[-1][1]))
421 | self.logger.warning("Iteration: {}; tderr_avg: {}".format(self.step, self.tderr_avg_log[-1][1]))
422 | self.logger.warning("Iteration: {}; steps_avg: {}".format(self.step, self.steps_avg_log[-1][1]))
423 | self.logger.warning("Iteration: {}; steps_std: {}".format(self.step, self.steps_std_log[-1][1]))
424 | self.logger.warning("Iteration: {}; reward_avg: {}".format(self.step, self.reward_avg_log[-1][1]))
425 | self.logger.warning("Iteration: {}; reward_std: {}".format(self.step, self.reward_std_log[-1][1]))
426 | self.logger.warning("Iteration: {}; nepisodes: {}".format(self.step, self.nepisodes_log[-1][1]))
427 | self.logger.warning("Iteration: {}; nepisodes_solved: {}".format(self.step, self.nepisodes_solved_log[-1][1]))
428 | self.logger.warning("Iteration: {}; repisodes_solved: {}".format(self.step, self.repisodes_solved_log[-1][1]))
429 |
430 | # save model
431 | self._save_model(self.step, self.reward_avg_log[-1][1])
432 |
433 | def test_model(self):
434 | # memory # NOTE: here we don't need a replay memory, just a recent memory
435 | self.memory = self.memory_prototype(limit = 0,
436 | window_length = self.memory_params.hist_len)
437 | self.eps = self.eps_eval
438 |
439 | self.logger.warning("<===================================> Testing ...")
440 | self.training = False
441 | self._reset_testing_loggings()
442 |
443 | self.start_time = time.time()
444 | self.step = 0
445 |
446 | test_nepisodes = 0
447 | test_nepisodes_solved = 0
448 | test_episode_steps = None
449 | test_episode_steps_log = []
450 | test_episode_reward = None
451 | test_episode_reward_log = []
452 | test_should_start_new = True
453 | while test_nepisodes < self.test_nepisodes:
454 | if test_should_start_new: # start of a new episode
455 | test_episode_steps = 0
456 | test_episode_reward = 0.
457 | # Obtain the initial observation by resetting the environment
458 | self._reset_states()
459 | self.experience = self.env.reset()
460 | assert self.experience.state1 is not None
461 | if not self.training:
462 | if self.visualize: self.env.visual()
463 | if self.render: self.env.render()
464 | # reset flag
465 | test_should_start_new = False
466 | # Run a single step
467 | test_action = self._forward(self.experience.state1)
468 | test_reward = 0.
469 | for _ in range(self.action_repetition):
470 | self.experience = self.env.step(test_action)
471 | if not self.training:
472 | if self.visualize: self.env.visual()
473 | if self.render: self.env.render()
474 | test_reward += self.experience.reward
475 | if self.experience.terminal1:
476 | test_should_start_new = True
477 | break
478 | if self.early_stop and (test_episode_steps + 1) >= self.early_stop:
479 | # to make sure the historic observations for the first hist_len-1 steps in (the next episode / resume training) would be clean
480 | test_should_start_new = True
481 | # NOTE: here NOT doing backprop, only adding into recent memory
482 | if test_should_start_new:
483 | self._backward(test_reward, True)
484 | else:
485 | self._backward(test_reward, self.experience.terminal1)
486 |
487 | test_episode_steps += 1
488 | test_episode_reward += test_reward
489 | self.step += 1
490 |
491 | if test_should_start_new:
492 | # We are in a terminal state but the agent hasn't yet seen it. We therefore
493 | # perform one more forward-backward call and simply ignore the action before
494 | # resetting the environment. We need to pass in "terminal=False" here since
495 | # the *next* state, that is the state of the newly reset environment, is
496 | # always non-terminal by convention.
497 | self._forward(self.experience.state1) # recent_observation & recent_action get updated
498 | self._backward(0., False) # NOTE: here NOT doing backprop, only adding into recent memory
499 |
500 | test_nepisodes += 1
501 | if self.experience.terminal1:
502 | test_nepisodes_solved += 1
503 |
504 | # This episode is finished, report and reset
505 | test_episode_steps_log.append([test_episode_steps])
506 | test_episode_reward_log.append([test_episode_reward])
507 | self._reset_states()
508 | test_episode_steps = None
509 | test_episode_reward = None
510 |
511 | # Logging for this testing phase
512 | self.steps_avg_log.append([self.step, np.mean(np.asarray(test_episode_steps_log))])
513 | self.steps_std_log.append([self.step, np.std(np.asarray(test_episode_steps_log))]); del test_episode_steps_log
514 | self.reward_avg_log.append([self.step, np.mean(np.asarray(test_episode_reward_log))])
515 | self.reward_std_log.append([self.step, np.std(np.asarray(test_episode_reward_log))]); del test_episode_reward_log
516 | self.nepisodes_log.append([self.step, test_nepisodes])
517 | self.nepisodes_solved_log.append([self.step, test_nepisodes_solved])
518 | self.repisodes_solved_log.append([self.step, (test_nepisodes_solved/test_nepisodes) if test_nepisodes > 0 else 0.])
519 | # plotting
520 | if self.visualize:
521 | self.win_steps_avg = self.vis.scatter(X=np.array(self.steps_avg_log), env=self.refs, win=self.win_steps_avg, opts=dict(title="steps_avg"))
522 | # self.win_steps_std = self.vis.scatter(X=np.array(self.steps_std_log), env=self.refs, win=self.win_steps_std, opts=dict(title="steps_std"))
523 | self.win_reward_avg = self.vis.scatter(X=np.array(self.reward_avg_log), env=self.refs, win=self.win_reward_avg, opts=dict(title="reward_avg"))
524 | # self.win_reward_std = self.vis.scatter(X=np.array(self.reward_std_log), env=self.refs, win=self.win_reward_std, opts=dict(title="reward_std"))
525 | self.win_nepisodes = self.vis.scatter(X=np.array(self.nepisodes_log), env=self.refs, win=self.win_nepisodes, opts=dict(title="nepisodes"))
526 | self.win_nepisodes_solved = self.vis.scatter(X=np.array(self.nepisodes_solved_log), env=self.refs, win=self.win_nepisodes_solved, opts=dict(title="nepisodes_solved"))
527 | self.win_repisodes_solved = self.vis.scatter(X=np.array(self.repisodes_solved_log), env=self.refs, win=self.win_repisodes_solved, opts=dict(title="repisodes_solved"))
528 | # logging
529 | self.logger.warning("Testing Took: " + str(time.time() - self.start_time))
530 | self.logger.warning("Testing: steps_avg: {}".format(self.steps_avg_log[-1][1]))
531 | self.logger.warning("Testing: steps_std: {}".format(self.steps_std_log[-1][1]))
532 | self.logger.warning("Testing: reward_avg: {}".format(self.reward_avg_log[-1][1]))
533 | self.logger.warning("Testing: reward_std: {}".format(self.reward_std_log[-1][1]))
534 | self.logger.warning("Testing: nepisodes: {}".format(self.nepisodes_log[-1][1]))
535 | self.logger.warning("Testing: nepisodes_solved: {}".format(self.nepisodes_solved_log[-1][1]))
536 | self.logger.warning("Testing: repisodes_solved: {}".format(self.repisodes_solved_log[-1][1]))
537 |
--------------------------------------------------------------------------------
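
`_get_q_update` above interleaves the Double-DQN target rule with old-PyTorch `Variable` plumbing. As a standalone reference, here is a minimal sketch of the same target, r + gamma * Q_target(s', argmax_a Q_online(s', a)), on plain tensors; the function name and arguments are illustrative and not part of this repository.

```python
import torch

def double_dqn_targets(q_online_next, q_target_next, reward, nonterminal, gamma=0.99):
    # q_online_next, q_target_next: (batch, action_dim); reward, nonterminal: (batch,)
    # nonterminal is 0. where the episode ended, mirroring terminal1_batch_vb above
    greedy_actions = q_online_next.max(dim=1, keepdim=True)[1]   # online net picks the action ...
    next_q = q_target_next.gather(1, greedy_actions).squeeze(1)  # ... target net evaluates it
    return reward + gamma * nonterminal * next_q

# toy usage on random numbers
q_on, q_tg = torch.rand(4, 3), torch.rand(4, 3)
targets = double_dqn_targets(q_on, q_tg, torch.rand(4), torch.ones(4))
print(targets.shape)  # torch.Size([4])
```
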
/core/agents/empty.py:
--------------------------------------------------------------------------------
1 | from __future__ import absolute_import
2 | from __future__ import division
3 | from __future__ import print_function
4 | import random
5 |
6 | from utils.helpers import Experience
7 | from core.agent import Agent
8 |
9 | class EmptyAgent(Agent):
10 | def __init__(self, args, env_prototype, model_prototype, memory_prototype):
11 | super(EmptyAgent, self).__init__(args, env_prototype, model_prototype, memory_prototype)
12 | self.logger.warning("<===================================> Empty")
13 |
14 | # env
15 | self.env = self.env_prototype(self.env_params)
16 | self.state_shape = self.env.state_shape
17 | self.action_dim = self.env.action_dim
18 |
19 | self._reset_experience()
20 |
21 | def _forward(self, state):
22 | pass
23 |
24 | def _backward(self, reward, terminal):
25 | pass
26 |
27 | def _eval_model(self):
28 | pass
29 |
30 | def fit_model(self): # the most basic control loop, to ease integration of new envs
31 | self.step = 0
32 | should_start_new = True
33 | while self.step < self.steps:
34 | if should_start_new:
35 | self._reset_experience()
36 | self.experience = self.env.reset()
37 | assert self.experience.state1 is not None
38 | if self.visualize: self.env.visual()
39 | if self.render: self.env.render()
40 |                 should_start_new, episode_steps = False, 0 # also (re)set the per-episode step counter here
41 |             action = random.randrange(self.action_dim) # this agent has no model, so we simply sample a random action
42 | self.experience = self.env.step(action)
43 | if self.visualize: self.env.visual()
44 | if self.render: self.env.render()
45 |             if self.experience.terminal1 or (self.early_stop and (episode_steps + 1) >= self.early_stop):
46 |                 should_start_new = True
47 |             episode_steps += 1
48 | self.step += 1
49 |
50 | def test_model(self):
51 | pass
52 |
--------------------------------------------------------------------------------
/core/env.py:
--------------------------------------------------------------------------------
1 | from __future__ import absolute_import
2 | from __future__ import division
3 | from __future__ import print_function
4 | import numpy as np
5 | from copy import deepcopy
6 | from gym.spaces.box import Box
7 | import inspect
8 |
9 | from utils.helpers import Experience # NOTE: here state0 is always "None"
10 | from utils.helpers import preprocessAtari, rgb2gray, rgb2y, scale
11 |
12 | class Env(object):
13 | def __init__(self, args, env_ind=0):
14 | self.logger = args.logger
15 | self.ind = env_ind # NOTE: for creating multiple environment instances
16 | # general setup
17 | self.mode = args.mode # NOTE: save frames when mode=2
18 | if self.mode == 2:
19 | try:
20 | import scipy.misc
21 | self.imsave = scipy.misc.imsave
22 | except ImportError as e: self.logger.warning("WARNING: scipy.misc not found")
23 | self.img_dir = args.root_dir + "/imgs/"
24 | self.frame_ind = 0
25 | self.seed = args.seed + self.ind # NOTE: so to give a different seed to each instance
26 | self.visualize = args.visualize
27 | if self.visualize:
28 | self.vis = args.vis
29 | self.refs = args.refs
30 | self.win_state1 = "win_state1"
31 |
32 | self.env_type = args.env_type
33 | self.game = args.game
34 | self._reset_experience()
35 |
36 | self.logger.warning("<-----------------------------------> Env")
37 | self.logger.warning("Creating {" + self.env_type + " | " + self.game + "} w/ Seed: " + str(self.seed))
38 |
39 | def _reset_experience(self):
40 | self.exp_state0 = None # NOTE: always None in this module
41 | self.exp_action = None
42 | self.exp_reward = None
43 | self.exp_state1 = None
44 | self.exp_terminal1 = None
45 |
46 | def _get_experience(self):
47 | return Experience(state0 = self.exp_state0, # NOTE: here state0 is always None
48 | action = self.exp_action,
49 | reward = self.exp_reward,
50 | state1 = self._preprocessState(self.exp_state1),
51 | terminal1 = self.exp_terminal1)
52 |
53 | def _preprocessState(self, state):
54 |         raise NotImplementedError("not implemented in base class")
55 |
56 | @property
57 | def state_shape(self):
58 |         raise NotImplementedError("not implemented in base class")
59 |
60 | @property
61 | def action_dim(self):
62 | if isinstance(self.env.action_space, Box):
63 | return self.env.action_space.shape[0]
64 | else:
65 | return self.env.action_space.n
66 |
67 | def render(self): # render using the original gl window
68 |         raise NotImplementedError("not implemented in base class")
69 |
70 | def visual(self): # visualize onto visdom
71 |         raise NotImplementedError("not implemented in base class")
72 |
73 | def reset(self):
74 |         raise NotImplementedError("not implemented in base class")
75 |
76 | def step(self, action):
77 |         raise NotImplementedError("not implemented in base class")
78 |
--------------------------------------------------------------------------------
/core/envs/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jingweiz/pytorch-rl/20b3b738ca400b1916197f27a91367878b09803c/core/envs/__init__.py
--------------------------------------------------------------------------------
/core/envs/atari.py:
--------------------------------------------------------------------------------
1 | from __future__ import absolute_import
2 | from __future__ import division
3 | from __future__ import print_function
4 | import numpy as np
5 | from copy import deepcopy
6 | from gym.spaces.box import Box
7 | import inspect
8 |
9 | from utils.helpers import Experience # NOTE: here state0 is always "None"
10 | from utils.helpers import preprocessAtari, rgb2gray, rgb2y, scale
11 | from core.env import Env
12 |
13 | class AtariEnv(Env): # pixel-level inputs
14 | def __init__(self, args, env_ind=0):
15 | super(AtariEnv, self).__init__(args, env_ind)
16 |
17 | assert self.env_type == "atari"
18 | try: import gym
19 | except ImportError as e: self.logger.warning("WARNING: gym not found")
20 |
21 | self.env = gym.make(self.game)
22 | self.env.seed(self.seed) # NOTE: so each env would be different
23 |
24 | # action space setup
25 | self.actions = range(self.action_dim)
26 | self.logger.warning("Action Space: %s", self.actions)
27 | # state space setup
28 | self.hei_state = args.hei_state
29 | self.wid_state = args.wid_state
30 |         self.preprocess_mode = args.preprocess_mode if args.preprocess_mode is not None else 0 # 0(do nothing) | 1(rgb2gray) | 2(rgb2y) | 3(crop&resize)
31 | assert self.hei_state == self.wid_state
32 | self.logger.warning("State Space: (" + str(self.state_shape) + " * " + str(self.state_shape) + ")")
33 |
34 | def _preprocessState(self, state):
35 | if self.preprocess_mode == 3: # crop then resize
36 | state = preprocessAtari(state)
37 | if self.preprocess_mode == 2: # rgb2y
38 | state = scale(rgb2y(state), self.hei_state, self.wid_state) / 255.
39 | elif self.preprocess_mode == 1: # rgb2gray
40 | state = scale(rgb2gray(state), self.hei_state, self.wid_state) / 255.
41 | elif self.preprocess_mode == 0: # do nothing
42 | pass
43 | return state.reshape(self.hei_state * self.wid_state)
44 |
45 | @property
46 | def state_shape(self):
47 | return self.hei_state
48 |
49 | def render(self):
50 | return self.env.render()
51 |
52 | def visual(self):
53 | if self.visualize:
54 | self.win_state1 = self.vis.image(np.transpose(self.exp_state1, (2, 0, 1)), env=self.refs, win=self.win_state1, opts=dict(title="state1"))
55 | if self.mode == 2:
56 | frame_name = self.img_dir + "frame_%04d.jpg" % self.frame_ind
57 | self.imsave(frame_name, self.exp_state1)
58 | self.logger.warning("Saved Frame @ Step: " + str(self.frame_ind) + " To: " + frame_name)
59 | self.frame_ind += 1
60 |
61 | def sample_random_action(self):
62 | return self.env.action_space.sample()
63 |
64 | def reset(self):
65 | # TODO: could add random start here, since random start only make sense for atari games
66 | self._reset_experience()
67 | self.exp_state1 = self.env.reset()
68 | return self._get_experience()
69 |
70 | def step(self, action_index):
71 | self.exp_action = action_index
72 | self.exp_state1, self.exp_reward, self.exp_terminal1, _ = self.env.step(self.actions[self.exp_action])
73 | return self._get_experience()
74 |
--------------------------------------------------------------------------------
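
`_preprocessState` above delegates to `preprocessAtari`, `rgb2gray`, `rgb2y` and `scale` from `utils/helpers.py`. The snippet below is only an illustrative NumPy approximation of what such helpers typically do (standard luminance weights, nearest-neighbour resize); the repository's actual helpers may differ in detail.

```python
import numpy as np

def rgb2gray(frame):
    # weighted sum over the colour channels of an (H, W, 3) frame
    return np.dot(frame[..., :3], [0.299, 0.587, 0.114])

def scale(frame, hei, wid):
    # nearest-neighbour resize by index sampling, to avoid extra dependencies
    h, w = frame.shape[:2]
    return frame[np.arange(hei) * h // hei][:, np.arange(wid) * w // wid]

frame = np.random.randint(0, 256, size=(210, 160, 3)).astype(np.float32)
state = scale(rgb2gray(frame), 84, 84) / 255.  # roughly what preprocess_mode=1 would produce
print(state.shape)  # (84, 84)
```
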
/core/envs/atari_ram.py:
--------------------------------------------------------------------------------
1 | from __future__ import absolute_import
2 | from __future__ import division
3 | from __future__ import print_function
4 | import numpy as np
5 | from copy import deepcopy
6 | from gym.spaces.box import Box
7 | import inspect
8 |
9 | from utils.helpers import Experience # NOTE: here state0 is always "None"
10 | from utils.helpers import preprocessAtari, rgb2gray, rgb2y, scale
11 | from core.env import Env
12 |
13 | class AtariRamEnv(Env): # atari games w/ ram states as input
14 | def __init__(self, args, env_ind=0):
15 | super(AtariRamEnv, self).__init__(args, env_ind)
16 |
17 | assert self.env_type == "atari-ram"
18 | try: import gym
19 | except ImportError as e: self.logger.warning("WARNING: gym not found")
20 |
21 | self.env = gym.make(self.game)
22 | self.env.seed(self.seed) # NOTE: so each env would be different
23 |
24 | # action space setup
25 | self.actions = range(self.action_dim)
26 | self.logger.warning("Action Space: %s", self.actions)
27 |
28 | # state space setup
29 | self.logger.warning("State Space: %s", self.state_shape)
30 |
31 | def _preprocessState(self, state): # NOTE: here the input is [0, 255], so we normalize
32 | return state/255. # TODO: check again the range, also syntax w/ python3
33 |
34 | @property
35 | def state_shape(self):
36 | return self.env.observation_space.shape[0]
37 |
38 | def render(self):
39 | return self.env.render()
40 |
41 | def visual(self): # TODO: try to grab also the pixel-level outputs and visualize
42 | pass
43 |
44 | def sample_random_action(self):
45 | return self.env.action_space.sample()
46 |
47 | def reset(self):
48 | self._reset_experience()
49 | self.exp_state1 = self.env.reset()
50 | return self._get_experience()
51 |
52 | def step(self, action_index):
53 | self.exp_action = action_index
54 | self.exp_state1, self.exp_reward, self.exp_terminal1, _ = self.env.step(self.actions[self.exp_action])
55 | return self._get_experience()
56 |
--------------------------------------------------------------------------------
/core/envs/gym.py:
--------------------------------------------------------------------------------
1 | from __future__ import absolute_import
2 | from __future__ import division
3 | from __future__ import print_function
4 | import numpy as np
5 | from copy import deepcopy
6 | from gym.spaces.box import Box
7 | import inspect
8 |
9 | from utils.helpers import Experience # NOTE: here state0 is always "None"
10 | from utils.helpers import preprocessAtari, rgb2gray, rgb2y, scale
11 | from core.env import Env
12 |
13 | class GymEnv(Env): # low dimensional observations
14 | def __init__(self, args, env_ind=0):
15 | super(GymEnv, self).__init__(args, env_ind)
16 |
17 | assert self.env_type == "gym"
18 | try: import gym
19 | except ImportError as e: self.logger.warning("WARNING: gym not found")
20 |
21 | self.env = gym.make(self.game)
22 | self.env.seed(self.seed) # NOTE: so each env would be different
23 |
24 | # action space setup
25 | self.actions = range(self.action_dim)
26 | self.logger.warning("Action Space: %s", self.actions)
27 |
28 | # state space setup
29 | self.logger.warning("State Space: %s", self.state_shape)
30 |
31 | # continuous space
32 | if args.agent_type == "a3c":
33 | self.enable_continuous = args.enable_continuous
34 | else:
35 | self.enable_continuous = False
36 |
37 | def _preprocessState(self, state): # NOTE: here no preprecessing is needed
38 | return state
39 |
40 | @property
41 | def state_shape(self):
42 | return self.env.observation_space.shape[0]
43 |
44 | def render(self):
45 | if self.mode == 2:
46 | frame = self.env.render(mode='rgb_array')
47 | frame_name = self.img_dir + "frame_%04d.jpg" % self.frame_ind
48 | self.imsave(frame_name, frame)
49 | self.logger.warning("Saved Frame @ Step: " + str(self.frame_ind) + " To: " + frame_name)
50 | self.frame_ind += 1
51 | return frame
52 | else:
53 | return self.env.render()
54 |
55 |
56 | def visual(self):
57 | pass
58 |
59 | def sample_random_action(self):
60 | return self.env.action_space.sample()
61 |
62 | def reset(self):
63 | self._reset_experience()
64 | self.exp_state1 = self.env.reset()
65 | return self._get_experience()
66 |
67 | def step(self, action_index):
68 | self.exp_action = action_index
69 | if self.enable_continuous:
70 | self.exp_state1, self.exp_reward, self.exp_terminal1, _ = self.env.step(self.exp_action)
71 | else:
72 | self.exp_state1, self.exp_reward, self.exp_terminal1, _ = self.env.step(self.actions[self.exp_action])
73 | return self._get_experience()
74 |
--------------------------------------------------------------------------------
/core/envs/lab.py:
--------------------------------------------------------------------------------
1 | from __future__ import absolute_import
2 | from __future__ import division
3 | from __future__ import print_function
4 | import numpy as np
5 | from copy import deepcopy
6 | from gym.spaces.box import Box
7 | import inspect
8 |
9 | from utils.helpers import Experience # NOTE: here state0 is always "None"
10 | from utils.helpers import preprocessAtari, rgb2gray, rgb2y, scale
11 | from core.env import Env
12 |
13 | class LabEnv(Env):
14 | def __init__(self, args, env_ind=0):
15 | super(LabEnv, self).__init__(args, env_ind)
16 |
17 | assert self.env_type == "lab"
18 |
--------------------------------------------------------------------------------
/core/memories/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jingweiz/pytorch-rl/20b3b738ca400b1916197f27a91367878b09803c/core/memories/__init__.py
--------------------------------------------------------------------------------
/core/memories/episode_parameter.py:
--------------------------------------------------------------------------------
1 | from __future__ import absolute_import
2 | from __future__ import division
3 | from __future__ import print_function
4 | import numpy as np
5 | from collections import deque, namedtuple
6 | import warnings
7 | import random
8 |
9 | from utils.helpers import Experience
10 | from core.memory import sample_batch_indexes, RingBuffer, Memory
11 |
12 | class EpisodeParameterMemory(Memory):
13 | def __init__(self, limit, **kwargs):
14 | super(EpisodeParameterMemory, self).__init__(**kwargs)
15 | self.limit = limit
16 |
17 | self.params = RingBuffer(limit)
18 | self.intermediate_rewards = []
19 | self.total_rewards = RingBuffer(limit)
20 |
21 | def sample(self, batch_size, batch_idxs=None):
22 | if batch_idxs is None:
23 | batch_idxs = sample_batch_indexes(0, self.nb_entries, size=batch_size)
24 | assert len(batch_idxs) == batch_size
25 |
26 | batch_params = []
27 | batch_total_rewards = []
28 | for idx in batch_idxs:
29 | batch_params.append(self.params[idx])
30 | batch_total_rewards.append(self.total_rewards[idx])
31 | return batch_params, batch_total_rewards
32 |
33 | def append(self, observation, action, reward, terminal, training=True):
34 | super(EpisodeParameterMemory, self).append(observation, action, reward, terminal, training=training)
35 | if training:
36 | self.intermediate_rewards.append(reward)
37 |
38 | def finalize_episode(self, params):
39 | total_reward = sum(self.intermediate_rewards)
40 | self.total_rewards.append(total_reward)
41 | self.params.append(params)
42 | self.intermediate_rewards = []
43 |
44 | @property
45 | def nb_entries(self):
46 | return len(self.total_rewards)
47 |
48 | def get_config(self):
49 |         config = super(EpisodeParameterMemory, self).get_config()
50 | config['limit'] = self.limit
51 | return config
52 |
--------------------------------------------------------------------------------
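
A minimal usage sketch for `EpisodeParameterMemory`, assuming the repository root is on `PYTHONPATH`; the parameter vectors stored here are whatever the caller passes to `finalize_episode` (plain lists in this toy example).

```python
from core.memories.episode_parameter import EpisodeParameterMemory

memory = EpisodeParameterMemory(limit=100, window_length=1)
for episode in range(3):
    for step in range(5):
        memory.append(observation=[float(step)], action=0, reward=1.0, terminal=(step == 4))
    memory.finalize_episode(params=[0.1 * episode])  # stores this episode's params + summed reward

params, returns = memory.sample(batch_size=2)
print(len(params), returns)  # 2 sampled parameter vectors and their episode returns
```
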
/core/memories/episodic.py:
--------------------------------------------------------------------------------
1 | from __future__ import absolute_import
2 | from __future__ import division
3 | from __future__ import print_function
4 | import random
5 | from collections import deque, namedtuple
6 |
7 | from utils.helpers import ACER_Off_Policy_Experience
8 |
9 | # TODO: should inherit from Memory to make it consistent
10 | class EpisodicMemory():
11 | def __init__(self, capacity, max_episode_length):
12 | # Max number of transitions possible will be the memory capacity, could be much less
13 | self.num_episodes = capacity // max_episode_length
14 | self.memory = deque(maxlen=self.num_episodes)
15 | self.memory.append([]) # List for first episode
16 | self.position = 0
17 |
18 | def append(self, state0, action, reward, detached_old_policy_vb):
19 |         self.memory[self.position].append(ACER_Off_Policy_Experience(state0, action, reward, detached_old_policy_vb)) # Save s_i, a_i, r_i+1, mu(.|s_i)
20 | # Terminal states are saved with actions as None, so switch to next episode
21 | if action is None:
22 | self.memory.append([])
23 | self.position = min(self.position + 1, self.num_episodes - 1)
24 |
25 | # Samples random trajectory
26 | def sample(self, maxlen=0):
27 | while True:
28 | e = random.randrange(len(self.memory))
29 | mem = self.memory[e]
30 | T = len(mem)
31 | if T > 0:
32 | # Take a random subset of trajectory if maxlen specified, otherwise return full trajectory
33 | if maxlen > 0 and T > maxlen + 1:
34 | t = random.randrange(T - maxlen - 1) # Include next state after final "maxlen" state
35 | return mem[t:t + maxlen + 1]
36 | else:
37 | return mem
38 |
39 | # Samples batch of trajectories, truncating them to the same length
40 | def sample_batch(self, batch_size, maxlen=0):
41 | batch = [self.sample(maxlen=maxlen) for _ in range(batch_size)]
42 | minimum_size = min(len(trajectory) for trajectory in batch)
43 | batch = [trajectory[:minimum_size] for trajectory in batch] # Truncate trajectories
44 | return list(map(list, zip(*batch))) # Transpose so that timesteps are packed together
45 |
46 | def __len__(self):
47 | return sum(len(episode) for episode in self.memory)
48 |
--------------------------------------------------------------------------------
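
A minimal usage sketch for `EpisodicMemory`, assuming the repository root is on `PYTHONPATH`. Terminal transitions are marked with `action=None`, which is what makes `append` switch to a new episode; the toy values below are purely illustrative.

```python
from core.memories.episodic import EpisodicMemory

memory = EpisodicMemory(capacity=1000, max_episode_length=100)  # room for ~10 episodes
for t in range(5):
    memory.append(state0=[float(t)], action=t % 2, reward=1.0, detached_old_policy_vb=None)
memory.append(state0=[5.0], action=None, reward=0.0, detached_old_policy_vb=None)  # action=None closes the episode

trajectory = memory.sample(maxlen=3)                 # random sub-trajectory of length maxlen + 1
batch = memory.sample_batch(batch_size=2, maxlen=3)  # transposed: one list per timestep
print(len(memory), len(trajectory), len(batch))
```
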
/core/memories/sequential.py:
--------------------------------------------------------------------------------
1 | from __future__ import absolute_import
2 | from __future__ import division
3 | from __future__ import print_function
4 | import numpy as np
5 | from collections import deque, namedtuple
6 | import warnings
7 | import random
8 |
9 | from utils.helpers import Experience
10 | from core.memory import sample_batch_indexes, zeroed_observation, RingBuffer, Memory
11 |
12 | class SequentialMemory(Memory):
13 | def __init__(self, limit, **kwargs):
14 | super(SequentialMemory, self).__init__(**kwargs)
15 |
16 | self.limit = limit
17 |
18 | # Do not use deque to implement the memory. This data structure may seem convenient but
19 | # it is way too slow on random access. Instead, we use our own ring buffer implementation.
20 | self.actions = RingBuffer(limit)
21 | self.rewards = RingBuffer(limit)
22 | self.terminals = RingBuffer(limit)
23 | self.observations = RingBuffer(limit)
24 |
25 | def sample(self, batch_size, batch_idxs=None):
26 | if batch_idxs is None:
27 | # Draw random indexes such that we have at least a single entry before each
28 | # index.
29 | batch_idxs = sample_batch_indexes(0, self.nb_entries - 1, size=batch_size)
30 | batch_idxs = np.array(batch_idxs) + 1
31 | assert np.min(batch_idxs) >= 1
32 | assert np.max(batch_idxs) < self.nb_entries
33 | assert len(batch_idxs) == batch_size
34 |
35 | # Create experiences
36 | experiences = []
37 | for idx in batch_idxs:
38 | terminal0 = self.terminals[idx - 2] if idx >= 2 else False
39 | while terminal0:
40 | # Skip this transition because the environment was reset here. Select a new, random
41 | # transition and use this instead. This may cause the batch to contain the same
42 | # transition twice.
43 | idx = sample_batch_indexes(1, self.nb_entries, size=1)[0]
44 | terminal0 = self.terminals[idx - 2] if idx >= 2 else False
45 | assert 1 <= idx < self.nb_entries
46 |
47 | # This code is slightly complicated by the fact that subsequent observations might be
48 | # from different episodes. We ensure that an experience never spans multiple episodes.
49 | # This is probably not that important in practice but it seems cleaner.
50 | state0 = [self.observations[idx - 1]]
51 | for offset in range(0, self.window_length - 1):
52 | current_idx = idx - 2 - offset
53 | current_terminal = self.terminals[current_idx - 1] if current_idx - 1 > 0 else False
54 | if current_idx < 0 or (not self.ignore_episode_boundaries and current_terminal):
55 | # The previously handled observation was terminal, don't add the current one.
56 | # Otherwise we would leak into a different episode.
57 | break
58 | state0.insert(0, self.observations[current_idx])
59 | while len(state0) < self.window_length:
60 | state0.insert(0, zeroed_observation(state0[0]))
61 | action = self.actions[idx - 1]
62 | reward = self.rewards[idx - 1]
63 | terminal1 = self.terminals[idx - 1]
64 |
65 |             # Okay, now we need to create the follow-up state. This is state0 shifted one timestep
66 | # to the right. Again, we need to be careful to not include an observation from the next
67 | # episode if the last state is terminal.
68 | state1 = [np.copy(x) for x in state0[1:]]
69 | state1.append(self.observations[idx])
70 |
71 | assert len(state0) == self.window_length
72 | assert len(state1) == len(state0)
73 | experiences.append(Experience(state0=state0, action=action, reward=reward,
74 | state1=state1, terminal1=terminal1))
75 | assert len(experiences) == batch_size
76 | return experiences
77 |
78 | def append(self, observation, action, reward, terminal, training=True):
79 | super(SequentialMemory, self).append(observation, action, reward, terminal, training=training)
80 |
81 | # This needs to be understood as follows: in `observation`, take `action`, obtain `reward`
82 |         # and whether the next state is `terminal` or not.
83 | if training:
84 | self.observations.append(observation)
85 | self.actions.append(action)
86 | self.rewards.append(reward)
87 | self.terminals.append(terminal)
88 |
89 | @property
90 | def nb_entries(self):
91 | return len(self.observations)
92 |
93 | def get_config(self):
94 | config = super(SequentialMemory, self).get_config()
95 | config['limit'] = self.limit
96 | return config
97 |
--------------------------------------------------------------------------------
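
A minimal usage sketch for `SequentialMemory`, assuming the repository root is on `PYTHONPATH`; the observations are 1-D toy arrays and `window_length=4` mirrors a typical `hist_len`.

```python
import numpy as np
from core.memories.sequential import SequentialMemory

memory = SequentialMemory(limit=1000, window_length=4)
for t in range(50):
    # in observation t, take action 0, obtain reward 1.0; the *next* state is terminal every 10 steps
    memory.append(np.array([float(t)]), action=0, reward=1.0, terminal=(t % 10 == 9))

recent = memory.get_recent_state(np.array([50.0]))  # window of the last 4 observations, zero-padded across episode boundaries
batch = memory.sample(batch_size=8)                 # list of Experience(state0, action, reward, state1, terminal1)
print(len(recent), len(batch), batch[0].state0[0].shape)
```
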
/core/memory.py:
--------------------------------------------------------------------------------
1 | from __future__ import absolute_import
2 | from __future__ import division
3 | from __future__ import print_function
4 | import numpy as np
5 | from collections import deque, namedtuple
6 | import warnings
7 | import random
8 |
9 | from utils.helpers import Experience
10 |
11 | def sample_batch_indexes(low, high, size):
12 | if high - low >= size:
13 | # We have enough data. Draw without replacement, that is each index is unique in the
14 | # batch. We cannot use `np.random.choice` here because it is horribly inefficient as
15 | # the memory grows. See https://github.com/numpy/numpy/issues/2764 for a discussion.
16 | # `random.sample` does the same thing (drawing without replacement) and is way faster.
17 | try:
18 | r = xrange(low, high)
19 | except NameError:
20 | r = range(low, high)
21 | batch_idxs = random.sample(r, size)
22 | else:
23 | # Not enough data. Help ourselves with sampling from the range, but the same index
24 | # can occur multiple times. This is not good and should be avoided by picking a
25 | # large enough warm-up phase.
26 | warnings.warn('Not enough entries to sample without replacement. Consider increasing your warm-up phase to avoid oversampling!')
27 |         batch_idxs = np.random.randint(low, high, size=size) # NOTE: equivalent to the deprecated np.random.random_integers(low, high - 1, size)
28 | assert len(batch_idxs) == size
29 | return batch_idxs
30 |
31 | def zeroed_observation(observation):
32 | if hasattr(observation, 'shape'):
33 | return np.zeros(observation.shape)
34 | elif hasattr(observation, '__iter__'):
35 | out = []
36 | for x in observation:
37 | out.append(zeroed_observation(x))
38 | return out
39 | else:
40 | return 0.
41 |
42 | class RingBuffer(object):
43 | def __init__(self, maxlen):
44 | self.maxlen = maxlen
45 | self.start = 0
46 | self.length = 0
47 | self.data = [None for _ in range(maxlen)]
48 |
49 | def __len__(self):
50 | return self.length
51 |
52 | def __getitem__(self, idx):
53 | if idx < 0 or idx >= self.length:
54 | raise KeyError()
55 | return self.data[(self.start + idx) % self.maxlen]
56 |
57 | def append(self, v):
58 | if self.length < self.maxlen:
59 | # We have space, simply increase the length.
60 | self.length += 1
61 | elif self.length == self.maxlen:
62 | # No space, "remove" the first item.
63 | self.start = (self.start + 1) % self.maxlen
64 | else:
65 | # This should never happen.
66 | raise RuntimeError()
67 | self.data[(self.start + self.length - 1) % self.maxlen] = v
68 |
69 | class Memory(object):
70 | def __init__(self, window_length, ignore_episode_boundaries=False):
71 | self.window_length = window_length
72 | self.ignore_episode_boundaries = ignore_episode_boundaries
73 |
74 | self.recent_observations = deque(maxlen=window_length)
75 | self.recent_terminals = deque(maxlen=window_length)
76 |
77 | def sample(self, batch_size, batch_idxs=None):
78 | raise NotImplementedError()
79 |
80 | def append(self, observation, action, reward, terminal, training=True):
81 | self.recent_observations.append(observation)
82 | self.recent_terminals.append(terminal)
83 |
84 | def get_recent_state(self, current_observation):
85 | # This code is slightly complicated by the fact that subsequent observations might be
86 | # from different episodes. We ensure that an experience never spans multiple episodes.
87 | # This is probably not that important in practice but it seems cleaner.
88 | state = [current_observation]
89 | idx = len(self.recent_observations) - 1
90 | for offset in range(0, self.window_length - 1):
91 | current_idx = idx - offset
92 | current_terminal = self.recent_terminals[current_idx - 1] if current_idx - 1 >= 0 else False
93 | if current_idx < 0 or (not self.ignore_episode_boundaries and current_terminal):
94 | # The previously handled observation was terminal, don't add the current one.
95 | # Otherwise we would leak into a different episode.
96 | break
97 | state.insert(0, self.recent_observations[current_idx])
98 | while len(state) < self.window_length:
99 | state.insert(0, zeroed_observation(state[0]))
100 | return state
101 |
102 | def get_config(self):
103 | config = {
104 | 'window_length': self.window_length,
105 | 'ignore_episode_boundaries': self.ignore_episode_boundaries,
106 | }
107 | return config
108 |
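Taken together, `RingBuffer` gives bounded FIFO storage and `Memory.get_recent_state` stacks the last `window_length` observations, zero-padding when the history is short or crosses an episode boundary. A minimal sketch (not part of the repo) of both behaviours:

```python
# Minimal sketch: RingBuffer wrap-around and zero-padded recent-state stacking.
import numpy as np
from core.memory import RingBuffer, Memory

buf = RingBuffer(maxlen=3)
for v in range(5):                       # append 0..4 into a capacity-3 buffer
    buf.append(v)
print(len(buf), [buf[i] for i in range(len(buf))])      # 3 [2, 3, 4]

mem = Memory(window_length=4)
mem.append(np.ones(2), action=0, reward=0., terminal=False)
state = mem.get_recent_state(np.full(2, 2.))
print([np.asarray(s).tolist() for s in state])          # zeros, zeros, ones, current
```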
--------------------------------------------------------------------------------
/core/model.py:
--------------------------------------------------------------------------------
1 | from __future__ import absolute_import
2 | from __future__ import division
3 | from __future__ import print_function
4 | import numpy as np
5 | import torch
6 | import torch.nn as nn
7 | import torch.nn.functional as F
8 | from torch.autograd import Variable
9 |
10 | from utils.init_weights import init_weights, normalized_columns_initializer
11 |
12 | class Model(nn.Module):
13 | def __init__(self, args):
14 | super(Model, self).__init__()
15 | # logging
16 | self.logger = args.logger
17 | # params
18 | self.hidden_dim = args.hidden_dim
19 | self.use_cuda = args.use_cuda
20 | self.dtype = args.dtype
21 | # model_params
22 | if hasattr(args, "enable_dueling"): # only set for "dqn"
23 | self.enable_dueling = args.enable_dueling
24 | self.dueling_type = args.dueling_type
25 |         if hasattr(args, "enable_lstm"): # only set for "a3c"/"acer"
26 | self.enable_lstm = args.enable_lstm
27 |
28 | self.input_dims = {}
29 | self.input_dims[0] = args.hist_len # from params
30 | self.input_dims[1] = args.state_shape
31 | self.output_dims = args.action_dim
32 |
33 | def _init_weights(self):
34 |         raise NotImplementedError("not implemented in base class")
35 |
36 | def print_model(self):
37 | self.logger.warning("<-----------------------------------> Model")
38 | self.logger.warning(self)
39 |
40 | def _reset(self): # NOTE: should be called at each child's __init__
41 | self._init_weights()
42 | self.type(self.dtype) # put on gpu if possible
43 | self.print_model()
44 |
45 | def forward(self, input):
46 |         raise NotImplementedError("not implemented in base class")
47 |
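The base class only wires up logging, dtype handling and the input/output dimensions; subclasses are expected to build their layers, implement `_init_weights` and `forward`, and call `_reset()` at the end of `__init__`. A minimal sketch (not part of the repo), with a hypothetical `args` namespace carrying the fields `Model.__init__` reads:

```python
# Minimal sketch: the smallest Model subclass satisfying the base-class contract.
import logging
from argparse import Namespace
import torch
import torch.nn as nn
from core.model import Model

args = Namespace(logger=logging.getLogger(), hidden_dim=16,
                 use_cuda=False, dtype=torch.FloatTensor,
                 hist_len=1, state_shape=4, action_dim=2)

class TinyModel(Model):
    def __init__(self, args):
        super(TinyModel, self).__init__(args)
        self.fc = nn.Linear(self.input_dims[0] * self.input_dims[1], self.output_dims)
        self._reset()                     # init weights, cast to dtype, log the model

    def _init_weights(self):
        self.fc.bias.data.fill_(0)

    def forward(self, x):
        return self.fc(x.view(x.size(0), -1))

model = TinyModel(args)
```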
--------------------------------------------------------------------------------
/core/models/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jingweiz/pytorch-rl/20b3b738ca400b1916197f27a91367878b09803c/core/models/__init__.py
--------------------------------------------------------------------------------
/core/models/a3c_cnn_dis.py:
--------------------------------------------------------------------------------
1 | from __future__ import absolute_import
2 | from __future__ import division
3 | from __future__ import print_function
4 | import numpy as np
5 | import torch
6 | import torch.nn as nn
7 | import torch.nn.functional as F
8 | from torch.autograd import Variable
9 |
10 | from utils.init_weights import init_weights, normalized_columns_initializer
11 | from core.model import Model
12 |
13 | class A3CCnnDisModel(Model):
14 | def __init__(self, args):
15 | super(A3CCnnDisModel, self).__init__(args)
16 | # build model
17 | # 0. feature layers
18 | self.conv1 = nn.Conv2d(self.input_dims[0], 32, kernel_size=3, stride=2) # NOTE: for pkg="atari"
19 | self.rl1 = nn.ReLU()
20 | self.conv2 = nn.Conv2d(32, 32, kernel_size=3, stride=2, padding=1)
21 | self.rl2 = nn.ReLU()
22 | self.conv3 = nn.Conv2d(32, 32, kernel_size=3, stride=2, padding=1)
23 | self.rl3 = nn.ReLU()
24 | self.conv4 = nn.Conv2d(32, 32, kernel_size=3, stride=2, padding=1)
25 | self.rl4 = nn.ReLU()
26 | if self.enable_lstm:
27 | self.lstm = nn.LSTMCell(3*3*32, self.hidden_dim)
28 | # 1. policy output
29 | self.policy_5 = nn.Linear(self.hidden_dim, self.output_dims)
30 | self.policy_6 = nn.Softmax()
31 | # 2. value output
32 | self.value_5 = nn.Linear(self.hidden_dim, 1)
33 |
34 | self._reset()
35 |
36 | def _init_weights(self):
37 | self.apply(init_weights)
38 | self.policy_5.weight.data = normalized_columns_initializer(self.policy_5.weight.data, 0.01)
39 | self.policy_5.bias.data.fill_(0)
40 | self.value_5.weight.data = normalized_columns_initializer(self.value_5.weight.data, 1.0)
41 | self.value_5.bias.data.fill_(0)
42 |
43 | self.lstm.bias_ih.data.fill_(0)
44 | self.lstm.bias_hh.data.fill_(0)
45 |
46 | def forward(self, x, lstm_hidden_vb=None):
47 | x = x.view(x.size(0), self.input_dims[0], self.input_dims[1], self.input_dims[1])
48 | x = self.rl1(self.conv1(x))
49 | x = self.rl2(self.conv2(x))
50 | x = self.rl3(self.conv3(x))
51 | x = self.rl4(self.conv4(x))
52 | x = x.view(-1, 3*3*32)
53 | if self.enable_lstm:
54 | x, c = self.lstm(x, lstm_hidden_vb)
55 | p = self.policy_5(x)
56 | p = self.policy_6(p)
57 | v = self.value_5(x)
58 | if self.enable_lstm:
59 | return p, v, (x, c)
60 | else:
61 | return p, v
62 |
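For pkg="atari" the agents feed this model flattened hist_len x 42 x 42 frames; the four stride-2 convolutions reduce 42x42 down to 3x3x32 = 288 features before the LSTMCell. A minimal shape-contract sketch (not part of the repo), with hypothetical `args` values:

```python
# Minimal sketch: one forward pass through A3CCnnDisModel with an LSTM hidden.
import logging
from argparse import Namespace
import torch
from torch.autograd import Variable
from core.models.a3c_cnn_dis import A3CCnnDisModel

args = Namespace(logger=logging.getLogger(), hidden_dim=128,
                 use_cuda=False, dtype=torch.FloatTensor,
                 enable_lstm=True, hist_len=1, state_shape=42, action_dim=6)
model = A3CCnnDisModel(args)

x = Variable(torch.zeros(1, 1 * 42 * 42))          # one flattened 42x42 frame
hx = Variable(torch.zeros(1, args.hidden_dim))     # LSTM hidden state
cx = Variable(torch.zeros(1, args.hidden_dim))     # LSTM cell state
p, v, (hx, cx) = model(x, (hx, cx))
print(p.size(), v.size())                          # (1, 6) action probs, (1, 1) value
```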
--------------------------------------------------------------------------------
/core/models/a3c_mlp_con.py:
--------------------------------------------------------------------------------
1 | from __future__ import absolute_import
2 | from __future__ import division
3 | from __future__ import print_function
4 | import numpy as np
5 | import torch
6 | import torch.nn as nn
7 | import torch.nn.functional as F
8 | from torch.autograd import Variable
9 |
10 | from utils.init_weights import init_weights, normalized_columns_initializer
11 | from core.model import Model
12 |
13 | class A3CMlpConModel(Model):
14 | def __init__(self, args):
15 | super(A3CMlpConModel, self).__init__(args)
16 | # build model
17 | # 0. feature layers
18 | self.fc1 = nn.Linear(self.input_dims[0] * self.input_dims[1], self.hidden_dim) # NOTE: for pkg="gym"
19 | self.rl1 = nn.ReLU()
20 | self.fc2 = nn.Linear(self.hidden_dim, self.hidden_dim)
21 | self.rl2 = nn.ReLU()
22 | self.fc3 = nn.Linear(self.hidden_dim, self.hidden_dim)
23 | self.rl3 = nn.ReLU()
24 | self.fc4 = nn.Linear(self.hidden_dim, self.hidden_dim)
25 | self.rl4 = nn.ReLU()
26 |
27 | self.fc1_v = nn.Linear(self.input_dims[0] * self.input_dims[1], self.hidden_dim) # NOTE: for pkg="gym"
28 | self.rl1_v = nn.ReLU()
29 | self.fc2_v = nn.Linear(self.hidden_dim, self.hidden_dim)
30 | self.rl2_v = nn.ReLU()
31 | self.fc3_v = nn.Linear(self.hidden_dim, self.hidden_dim)
32 | self.rl3_v = nn.ReLU()
33 | self.fc4_v = nn.Linear(self.hidden_dim, self.hidden_dim)
34 | self.rl4_v = nn.ReLU()
35 |
36 | # lstm
37 | if self.enable_lstm:
38 | self.lstm = nn.LSTMCell(self.hidden_dim, self.hidden_dim)
39 | self.lstm_v = nn.LSTMCell(self.hidden_dim, self.hidden_dim)
40 |
41 | # 1. policy output
42 | self.policy_5 = nn.Linear(self.hidden_dim, self.output_dims)
43 | self.policy_sig = nn.Linear(self.hidden_dim, self.output_dims)
44 | self.softplus = nn.Softplus()
45 | # 2. value output
46 | self.value_5 = nn.Linear(self.hidden_dim, 1)
47 |
48 | self._reset()
49 |
50 | def _init_weights(self):
51 | self.apply(init_weights)
52 | self.fc1.weight.data = normalized_columns_initializer(self.fc1.weight.data, 0.01)
53 | self.fc1.bias.data.fill_(0)
54 | self.fc2.weight.data = normalized_columns_initializer(self.fc2.weight.data, 0.01)
55 | self.fc2.bias.data.fill_(0)
56 | self.fc3.weight.data = normalized_columns_initializer(self.fc3.weight.data, 0.01)
57 | self.fc3.bias.data.fill_(0)
58 | self.fc4.weight.data = normalized_columns_initializer(self.fc4.weight.data, 0.01)
59 | self.fc4.bias.data.fill_(0)
60 | self.policy_5.weight.data = normalized_columns_initializer(self.policy_5.weight.data, 0.01)
61 | self.policy_5.bias.data.fill_(0)
62 | self.value_5.weight.data = normalized_columns_initializer(self.value_5.weight.data, 1.0)
63 | self.value_5.bias.data.fill_(0)
64 |
65 | self.lstm.bias_ih.data.fill_(0)
66 | self.lstm.bias_hh.data.fill_(0)
67 |
68 | self.lstm_v.bias_ih.data.fill_(0)
69 | self.lstm_v.bias_hh.data.fill_(0)
70 |
71 | def forward(self, x, lstm_hidden_vb=None):
72 | p = x.view(x.size(0), self.input_dims[0] * self.input_dims[1])
73 | p = self.rl1(self.fc1(p))
74 | p = self.rl2(self.fc2(p))
75 | p = self.rl3(self.fc3(p))
76 | p = self.rl4(self.fc4(p))
77 | p = p.view(-1, self.hidden_dim)
78 | if self.enable_lstm:
79 | p_, v_ = torch.split(lstm_hidden_vb[0],1)
80 | c_p, c_v = torch.split(lstm_hidden_vb[1],1)
81 | p, c_p = self.lstm(p, (p_, c_p))
82 | p_out = self.policy_5(p)
83 | sig = self.policy_sig(p)
84 | sig = self.softplus(sig)
85 |
86 | v = x.view(x.size(0), self.input_dims[0] * self.input_dims[1])
87 | v = self.rl1_v(self.fc1_v(v))
88 | v = self.rl2_v(self.fc2_v(v))
89 | v = self.rl3_v(self.fc3_v(v))
90 | v = self.rl4_v(self.fc4_v(v))
91 | v = v.view(-1, self.hidden_dim)
92 | if self.enable_lstm:
93 | v, c_v = self.lstm_v(v, (v_, c_v))
94 | v_out = self.value_5(v)
95 |
96 | if self.enable_lstm:
97 | return p_out, sig, v_out, (torch.cat((p,v),0), torch.cat((c_p, c_v),0))
98 | else:
99 | return p_out, sig, v_out
100 |
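Unlike the discrete CNN model, this continuous head returns an action mean, a Softplus sigma and the value, and it keeps two separate LSTMs whose hidden/cell states are stacked along dim 0 in the returned tuple (and split back apart at the top of `forward`). A minimal sketch (not part of the repo), with hypothetical `args` values:

```python
# Minimal sketch: one forward pass through the continuous A3C MLP model.
import logging
from argparse import Namespace
import torch
from torch.autograd import Variable
from core.models.a3c_mlp_con import A3CMlpConModel

args = Namespace(logger=logging.getLogger(), hidden_dim=128,
                 use_cuda=False, dtype=torch.FloatTensor,
                 enable_lstm=True, hist_len=1, state_shape=4, action_dim=1)
model = A3CMlpConModel(args)

x = Variable(torch.zeros(1, 1 * 4))
hidden = (Variable(torch.zeros(2, args.hidden_dim)),   # row 0: policy stream, row 1: value stream
          Variable(torch.zeros(2, args.hidden_dim)))
mu, sigma, v, hidden = model(x, hidden)
print(mu.size(), sigma.size(), v.size(), hidden[0].size())   # (1,1) (1,1) (1,1) (2,128)
```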
--------------------------------------------------------------------------------
/core/models/acer_cnn_dis.py:
--------------------------------------------------------------------------------
1 | from __future__ import absolute_import
2 | from __future__ import division
3 | from __future__ import print_function
4 | import numpy as np
5 | import torch
6 | import torch.nn as nn
7 | import torch.nn.functional as F
8 | from torch.autograd import Variable
9 |
10 | from utils.init_weights import init_weights, normalized_columns_initializer
11 | from core.model import Model
12 |
13 | class ACERCnnDisModel(Model):
14 | def __init__(self, args):
15 | super(ACERCnnDisModel, self).__init__(args)
16 | # build model
17 | # 0. feature layers
18 | self.conv1 = nn.Conv2d(self.input_dims[0], 32, kernel_size=3, stride=2) # NOTE: for pkg="atari"
19 | self.rl1 = nn.ReLU()
20 | self.conv2 = nn.Conv2d(32, 32, kernel_size=3, stride=2, padding=1)
21 | self.rl2 = nn.ReLU()
22 | self.conv3 = nn.Conv2d(32, 32, kernel_size=3, stride=2, padding=1)
23 | self.rl3 = nn.ReLU()
24 | self.conv4 = nn.Conv2d(32, 32, kernel_size=3, stride=2, padding=1)
25 | self.rl4 = nn.ReLU()
26 | if self.enable_lstm:
27 | self.lstm = nn.LSTMCell(3*3*32, self.hidden_dim)
28 |
29 | # 1. actor: /pi_{/theta}(a_t | x_t)
30 | self.actor_5 = nn.Linear(self.hidden_dim, self.output_dims)
31 | self.actor_6 = nn.Softmax()
32 | # 2. critic: Q_{/theta_v}(x_t, a_t)
33 | self.critic_5 = nn.Linear(self.hidden_dim, self.output_dims)
34 |
35 | self._reset()
36 |
37 | def _init_weights(self):
38 | self.apply(init_weights)
39 | self.actor_5.weight.data = normalized_columns_initializer(self.actor_5.weight.data, 0.01)
40 | self.actor_5.bias.data.fill_(0)
41 | self.critic_5.weight.data = normalized_columns_initializer(self.critic_5.weight.data, 1.0)
42 | self.critic_5.bias.data.fill_(0)
43 |
44 | self.lstm.bias_ih.data.fill_(0)
45 | self.lstm.bias_hh.data.fill_(0)
46 |
47 | def forward(self, x, lstm_hidden_vb=None):
48 | x = x.view(x.size(0), self.input_dims[0], self.input_dims[1], self.input_dims[1])
49 | x = self.rl1(self.conv1(x))
50 | x = self.rl2(self.conv2(x))
51 | x = self.rl3(self.conv3(x))
52 | x = self.rl4(self.conv4(x))
53 | x = x.view(-1, 3*3*32)
54 | if self.enable_lstm:
55 | x, c = self.lstm(x, lstm_hidden_vb)
56 | policy = self.actor_6(self.actor_5(x)).clamp(max=1-1e-6, min=1e-6) # TODO: max might not be necessary
57 | q = self.critic_5(x)
58 | v = (q * policy).sum(1, keepdim=True) # expectation of Q under /pi
59 | if self.enable_lstm:
60 | return policy, q, v, (x, c)
61 | else:
62 | return policy, q, v
63 |
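The critic outputs one Q-value per action, and the state value is recovered as the expectation of Q under the current policy, V(x) = sum_a pi(a|x) Q(x, a). A tiny numeric check (not part of the repo) of the expression used in `forward`:

```python
# Minimal check: V(x) = E_{a~pi}[Q(x, a)].
import torch

policy = torch.Tensor([[0.7, 0.2, 0.1]])     # pi(a|x) for one state, 3 actions
q      = torch.Tensor([[1.0, 0.0, -1.0]])    # Q(x, a)
v = (q * policy).sum(1, keepdim=True)        # same expression as in forward()
print(v)                                      # 0.7*1.0 + 0.2*0.0 + 0.1*(-1.0) = 0.6
```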
--------------------------------------------------------------------------------
/core/models/acer_mlp_dis.py:
--------------------------------------------------------------------------------
1 | from __future__ import absolute_import
2 | from __future__ import division
3 | from __future__ import print_function
4 | import numpy as np
5 | import torch
6 | import torch.nn as nn
7 | import torch.nn.functional as F
8 | from torch.autograd import Variable
9 |
10 | from utils.init_weights import init_weights, normalized_columns_initializer
11 | from core.model import Model
12 |
13 | class ACERMlpDisModel(Model):
14 | def __init__(self, args):
15 | super(ACERMlpDisModel, self).__init__(args)
16 | # build model
17 | # 0. feature layers
18 | self.fc1 = nn.Linear(self.input_dims[0] * self.input_dims[1], self.hidden_dim)
19 | self.rl1 = nn.ReLU()
20 |
21 | # lstm
22 | if self.enable_lstm:
23 | self.lstm = nn.LSTMCell(self.hidden_dim, self.hidden_dim)
24 |
25 | # 1. actor: /pi_{/theta}(a_t | x_t)
26 | self.actor_2 = nn.Linear(self.hidden_dim, self.output_dims)
27 | self.actor_3 = nn.Softmax()
28 | # 2. critic: Q_{/theta_v}(x_t, a_t)
29 | self.critic_2 = nn.Linear(self.hidden_dim, self.output_dims)
30 |
31 | self._reset()
32 |
33 | def _init_weights(self):
34 | self.apply(init_weights)
35 | self.actor_2.weight.data = normalized_columns_initializer(self.actor_2.weight.data, 0.01)
36 | self.actor_2.bias.data.fill_(0)
37 | self.critic_2.weight.data = normalized_columns_initializer(self.critic_2.weight.data, 1.0)
38 | self.critic_2.bias.data.fill_(0)
39 |
40 | self.lstm.bias_ih.data.fill_(0)
41 | self.lstm.bias_hh.data.fill_(0)
42 |
43 | def forward(self, x, lstm_hidden_vb=None):
44 | x = x.view(x.size(0), self.input_dims[0] * self.input_dims[1])
45 | x = self.rl1(self.fc1(x))
46 | # x = x.view(-1, 3*3*32)
47 | if self.enable_lstm:
48 | x, c = self.lstm(x, lstm_hidden_vb)
49 | policy = self.actor_3(self.actor_2(x)).clamp(max=1-1e-6, min=1e-6) # TODO: max might not be necessary
50 | q = self.critic_2(x)
51 | v = (q * policy).sum(1, keepdim=True) # expectation of Q under /pi
52 | if self.enable_lstm:
53 | return policy, q, v, (x, c)
54 | else:
55 | return policy, q, v
56 |
--------------------------------------------------------------------------------
/core/models/dqn_cnn.py:
--------------------------------------------------------------------------------
1 | from __future__ import absolute_import
2 | from __future__ import division
3 | from __future__ import print_function
4 | import numpy as np
5 | import torch
6 | import torch.nn as nn
7 | import torch.nn.functional as F
8 | from torch.autograd import Variable
9 |
10 | from utils.init_weights import init_weights, normalized_columns_initializer
11 | from core.model import Model
12 |
13 | class DQNCnnModel(Model):
14 | def __init__(self, args):
15 | super(DQNCnnModel, self).__init__(args)
16 | # 84x84
17 | # self.conv1 = nn.Conv2d(self.input_dims[0], 32, kernel_size=8, stride=4)
18 | # self.rl1 = nn.ReLU()
19 | # self.conv2 = nn.Conv2d(32, 64, kernel_size=4, stride=2)
20 | # self.rl2 = nn.ReLU()
21 | # self.conv3 = nn.Conv2d(64, 64, kernel_size=3, stride=1)
22 | # self.rl3 = nn.ReLU()
23 | # self.fc4 = nn.Linear(64*7*7, self.hidden_dim)
24 | # self.rl4 = nn.ReLU()
25 | # 42x42
26 | self.conv1 = nn.Conv2d(self.input_dims[0], 32, kernel_size=3, stride=2)
27 | self.rl1 = nn.ReLU()
28 | self.conv2 = nn.Conv2d(32, 32, kernel_size=3, stride=2, padding=1)
29 | self.rl2 = nn.ReLU()
30 | self.conv3 = nn.Conv2d(32, 32, kernel_size=3, stride=2, padding=1)
31 | self.rl3 = nn.ReLU()
32 | self.fc4 = nn.Linear(32*5*5, self.hidden_dim)
33 | self.rl4 = nn.ReLU()
34 | if self.enable_dueling: # [0]: V(s); [1,:]: A(s, a)
35 | self.fc5 = nn.Linear(self.hidden_dim, self.output_dims + 1)
36 | self.v_ind = torch.LongTensor(self.output_dims).fill_(0).unsqueeze(0)
37 | self.a_ind = torch.LongTensor(np.arange(1, self.output_dims + 1)).unsqueeze(0)
38 | else: # one q value output for each action
39 | self.fc5 = nn.Linear(self.hidden_dim, self.output_dims)
40 |
41 | self._reset()
42 |
43 | def _init_weights(self):
44 | self.apply(init_weights)
45 | self.fc4.weight.data = normalized_columns_initializer(self.fc4.weight.data, 0.0001)
46 | self.fc4.bias.data.fill_(0)
47 | self.fc5.weight.data = normalized_columns_initializer(self.fc5.weight.data, 0.0001)
48 | self.fc5.bias.data.fill_(0)
49 |
50 | def forward(self, x):
51 | x = x.view(x.size(0), self.input_dims[0], self.input_dims[1], self.input_dims[1])
52 | x = self.rl1(self.conv1(x))
53 | x = self.rl2(self.conv2(x))
54 | x = self.rl3(self.conv3(x))
55 | x = self.rl4(self.fc4(x.view(x.size(0), -1)))
56 | if self.enable_dueling:
57 | x = self.fc5(x)
58 | v_ind_vb = Variable(self.v_ind)
59 | a_ind_vb = Variable(self.a_ind)
60 | if self.use_cuda:
61 | v_ind_vb = v_ind_vb.cuda()
62 | a_ind_vb = a_ind_vb.cuda()
63 | v = x.gather(1, v_ind_vb.expand(x.size(0), self.output_dims))
64 | a = x.gather(1, a_ind_vb.expand(x.size(0), self.output_dims))
65 | # now calculate Q(s, a)
66 | if self.dueling_type == "avg": # Q(s,a)=V(s)+(A(s,a)-avg_a(A(s,a)))
67 |                 x = v + (a - a.mean(1, keepdim=True))  # 0.2.0 (was a.mean(1).expand(...) for 0.1.12)
68 | elif self.dueling_type == "max": # Q(s,a)=V(s)+(A(s,a)-max_a(A(s,a)))
69 |                 x = v + (a - a.max(1, keepdim=True)[0])  # 0.2.0 (was a.max(1)[0].expand(...) for 0.1.12)
70 | elif self.dueling_type == "naive": # Q(s,a)=V(s)+ A(s,a)
71 | x = v + a
72 | else:
73 | assert False, "dueling_type must be one of {'avg', 'max', 'naive'}"
74 | del v_ind_vb, a_ind_vb, v, a
75 | return x
76 | else:
77 | return self.fc5(x.view(x.size(0), -1))
78 |
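With dueling enabled, the final layer predicts one V(s) slot plus one A(s, a) per action, and the "avg" aggregation recombines them as Q(s, a) = V(s) + (A(s, a) - mean_a A(s, a)), which removes the unidentifiable constant shared between V and A. A tiny numeric check (not part of the repo):

```python
# Minimal check: "avg" dueling aggregation Q = V + (A - mean_a A).
import torch

v = torch.Tensor([[2.0, 2.0, 2.0]])          # V(s), broadcast over 3 actions
a = torch.Tensor([[1.0, 2.0, 3.0]])          # A(s, a)
q = v + (a - a.mean(1, keepdim=True))        # mean_a A = 2.0
print(q)                                      # [[1.0, 2.0, 3.0]]
```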
--------------------------------------------------------------------------------
/core/models/dqn_mlp.py:
--------------------------------------------------------------------------------
1 | from __future__ import absolute_import
2 | from __future__ import division
3 | from __future__ import print_function
4 | import numpy as np
5 | import torch
6 | import torch.nn as nn
7 | import torch.nn.functional as F
8 | from torch.autograd import Variable
9 |
10 | from utils.init_weights import init_weights, normalized_columns_initializer
11 | from core.model import Model
12 |
13 | class DQNMlpModel(Model):
14 | def __init__(self, args):
15 | super(DQNMlpModel, self).__init__(args)
16 | # build model
17 | self.fc1 = nn.Linear(self.input_dims[0] * self.input_dims[1], self.hidden_dim)
18 | self.rl1 = nn.ReLU()
19 | self.fc2 = nn.Linear(self.hidden_dim, self.hidden_dim)
20 | self.rl2 = nn.ReLU()
21 | self.fc3 = nn.Linear(self.hidden_dim, self.hidden_dim)
22 | self.rl3 = nn.ReLU()
23 | if self.enable_dueling: # [0]: V(s); [1,:]: A(s, a)
24 | self.fc4 = nn.Linear(self.hidden_dim, self.output_dims + 1)
25 | self.v_ind = torch.LongTensor(self.output_dims).fill_(0).unsqueeze(0)
26 | self.a_ind = torch.LongTensor(np.arange(1, self.output_dims + 1)).unsqueeze(0)
27 | else: # one q value output for each action
28 | self.fc4 = nn.Linear(self.hidden_dim, self.output_dims)
29 |
30 | self._reset()
31 |
32 | def _init_weights(self):
33 | # self.apply(init_weights)
34 | # self.fc1.weight.data = normalized_columns_initializer(self.fc1.weight.data, 0.01)
35 | # self.fc1.bias.data.fill_(0)
36 | # self.fc2.weight.data = normalized_columns_initializer(self.fc2.weight.data, 0.01)
37 | # self.fc2.bias.data.fill_(0)
38 | # self.fc3.weight.data = normalized_columns_initializer(self.fc3.weight.data, 0.01)
39 | # self.fc3.bias.data.fill_(0)
40 | # self.fc4.weight.data = normalized_columns_initializer(self.fc4.weight.data, 0.01)
41 | # self.fc4.bias.data.fill_(0)
42 | pass
43 |
44 | def forward(self, x):
45 | x = x.view(x.size(0), self.input_dims[0] * self.input_dims[1])
46 | x = self.rl1(self.fc1(x))
47 | x = self.rl2(self.fc2(x))
48 | x = self.rl3(self.fc3(x))
49 | if self.enable_dueling:
50 | x = self.fc4(x.view(x.size(0), -1))
51 | v_ind_vb = Variable(self.v_ind)
52 | a_ind_vb = Variable(self.a_ind)
53 | if self.use_cuda:
54 | v_ind_vb = v_ind_vb.cuda()
55 | a_ind_vb = a_ind_vb.cuda()
56 | v = x.gather(1, v_ind_vb.expand(x.size(0), self.output_dims))
57 | a = x.gather(1, a_ind_vb.expand(x.size(0), self.output_dims))
58 | # now calculate Q(s, a)
59 | if self.dueling_type == "avg": # Q(s,a)=V(s)+(A(s,a)-avg_a(A(s,a)))
60 | # x = v + (a - a.mean(1)).expand(x.size(0), self.output_dims) # 0.1.12
61 | x = v + (a - a.mean(1, keepdim=True)) # 0.2.0
62 | elif self.dueling_type == "max": # Q(s,a)=V(s)+(A(s,a)-max_a(A(s,a)))
63 | # x = v + (a - a.max(1)[0]).expand(x.size(0), self.output_dims) # 0.1.12
64 | x = v + (a - a.max(1, keepdim=True)[0]) # 0.2.0
65 | elif self.dueling_type == "naive": # Q(s,a)=V(s)+ A(s,a)
66 | x = v + a
67 | else:
68 | assert False, "dueling_type must be one of {'avg', 'max', 'naive'}"
69 | del v_ind_vb, a_ind_vb, v, a
70 | return x
71 | else:
72 | return self.fc4(x.view(x.size(0), -1))
73 |
--------------------------------------------------------------------------------
/core/models/empty.py:
--------------------------------------------------------------------------------
1 | from __future__ import absolute_import
2 | from __future__ import division
3 | from __future__ import print_function
4 | import numpy as np
5 | import torch
6 | import torch.nn as nn
7 | import torch.nn.functional as F
8 | from torch.autograd import Variable
9 |
10 | from utils.init_weights import init_weights, normalized_columns_initializer
11 | from core.model import Model
12 |
13 | class EmptyModel(Model):
14 | def __init__(self, args):
15 | super(EmptyModel, self).__init__(args)
16 |
17 | self._reset()
18 |
19 | def _init_weights(self):
20 | pass
21 |
22 | def print_model(self):
23 | self.logger.warning("<-----------------------------------> Model")
24 | self.logger.warning(self)
25 |
26 | def _reset(self): # NOTE: should be called at each child's __init__
27 | self._init_weights()
28 | self.type(self.dtype) # put on gpu if possible
29 | self.print_model()
30 |
31 | def forward(self, input):
32 | pass
33 |
--------------------------------------------------------------------------------
/figs/.gitignore:
--------------------------------------------------------------------------------
1 | # Ignore everything in this directory
2 | *
3 | # Except this file
4 | !.gitignore
5 |
--------------------------------------------------------------------------------
/imgs/.gitignore:
--------------------------------------------------------------------------------
1 | # Ignore everything in this directory
2 | *
3 | # Except this file
4 | !.gitignore
5 |
--------------------------------------------------------------------------------
/logs/.gitignore:
--------------------------------------------------------------------------------
1 | # Ignore everything in this directory
2 | *
3 | # Except this file
4 | !.gitignore
5 |
--------------------------------------------------------------------------------
/main.py:
--------------------------------------------------------------------------------
1 | import numpy as np
2 |
3 | # custom modules
4 | from utils.options import Options
5 | from utils.factory import EnvDict, ModelDict, MemoryDict, AgentDict
6 |
7 | # 0. setting up
8 | opt = Options()
9 | np.random.seed(opt.seed)
10 |
11 | # 1. env (prototype)
12 | env_prototype = EnvDict[opt.env_type]
13 | # 2. model (prototype)
14 | model_prototype = ModelDict[opt.model_type]
15 | # 3. memory (prototype)
16 | memory_prototype = MemoryDict[opt.memory_type]
17 | # 4. agent
18 | agent = AgentDict[opt.agent_type](opt.agent_params,
19 | env_prototype = env_prototype,
20 | model_prototype = model_prototype,
21 | memory_prototype = memory_prototype)
22 | # 5. fit model
23 | if opt.mode == 1: # train
24 | agent.fit_model()
25 | elif opt.mode == 2: # test opt.model_file
26 | agent.test_model()
27 |
--------------------------------------------------------------------------------
/models/.gitignore:
--------------------------------------------------------------------------------
1 | # Ignore everything in this directory
2 | *
3 | # Except this file
4 | !.gitignore
5 |
--------------------------------------------------------------------------------
/optims/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jingweiz/pytorch-rl/20b3b738ca400b1916197f27a91367878b09803c/optims/__init__.py
--------------------------------------------------------------------------------
/optims/helpers.py:
--------------------------------------------------------------------------------
1 | from __future__ import absolute_import
2 | from __future__ import division
3 | from __future__ import print_function
4 |
5 | # NOTE: refer to: https://discuss.pytorch.org/t/adaptive-learning-rate/320/31
6 | def adjust_learning_rate(optimizer, lr):
7 | for param_group in optimizer.param_groups:
8 | param_group['lr'] = lr
9 |
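A minimal usage sketch (not part of the repo), showing a simple linear decay driven by `adjust_learning_rate`; the model and schedule here are hypothetical:

```python
# Minimal sketch: annealing an optimizer's learning rate in a training loop.
import torch.nn as nn
import torch.optim as optim
from optims.helpers import adjust_learning_rate

model = nn.Linear(4, 2)                                    # hypothetical model
optimizer = optim.Adam(model.parameters(), lr=1e-4)

total_steps, base_lr = 1000, 1e-4
for step in range(total_steps):
    adjust_learning_rate(optimizer, base_lr * (1. - step / float(total_steps)))
    # ... forward / backward / optimizer.step() would go here ...
```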
--------------------------------------------------------------------------------
/optims/sharedAdam.py:
--------------------------------------------------------------------------------
1 | from __future__ import absolute_import
2 | from __future__ import division
3 | from __future__ import print_function
4 | import math
5 | import torch
6 | import torch.optim as optim
7 |
8 | class SharedAdam(optim.Adam):
9 | """Implements Adam algorithm with shared states.
10 | """
11 |
12 | def __init__(self, params, lr=1e-3, betas=(0.9, 0.999), eps=1e-8,
13 | weight_decay=0):
14 | super(SharedAdam, self).__init__(params, lr, betas, eps, weight_decay)
15 |
16 | # State initialisation (must be done before step, else will not be shared between threads)
17 | for group in self.param_groups:
18 | for p in group['params']:
19 | state = self.state[p]
20 | state['step'] = torch.zeros(1)
21 | state['exp_avg'] = p.data.new().resize_as_(p.data).zero_()
22 | state['exp_avg_sq'] = p.data.new().resize_as_(p.data).zero_()
23 |
24 | def share_memory(self):
25 | for group in self.param_groups:
26 | for p in group['params']:
27 | state = self.state[p]
28 | state['step'].share_memory_()
29 | state['exp_avg'].share_memory_()
30 | state['exp_avg_sq'].share_memory_()
31 |
32 | def step(self, closure=None):
33 | """Performs a single optimization step.
34 | Arguments:
35 | closure (callable, optional): A closure that reevaluates the model
36 | and returns the loss.
37 | """
38 | loss = None
39 | if closure is not None:
40 | loss = closure()
41 |
42 | for group in self.param_groups:
43 | for p in group['params']:
44 | if p.grad is None:
45 | continue
46 | grad = p.grad.data
47 | state = self.state[p]
48 |
49 | exp_avg, exp_avg_sq = state['exp_avg'], state['exp_avg_sq']
50 | beta1, beta2 = group['betas']
51 |
52 | state['step'] += 1
53 |
54 | if group['weight_decay'] != 0:
55 | grad = grad.add(group['weight_decay'], p.data)
56 |
57 | # Decay the first and second moment running average coefficient
58 | exp_avg.mul_(beta1).add_(1 - beta1, grad)
59 | exp_avg_sq.mul_(beta2).addcmul_(1 - beta2, grad, grad)
60 |
61 | denom = exp_avg_sq.sqrt().add_(group['eps'])
62 |
63 | bias_correction1 = 1 - beta1 ** state['step'][0]
64 | bias_correction2 = 1 - beta2 ** state['step'][0]
65 | step_size = group['lr'] * math.sqrt(bias_correction2) / bias_correction1
66 |
67 | p.data.addcdiv_(-step_size, exp_avg, denom)
68 |
69 | return loss
70 |
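The optimizer is meant to be constructed once in the main process, have `share_memory()` called on it, and then be handed to every learner process so that `step`, `exp_avg` and `exp_avg_sq` are updated hogwild-style. A minimal sketch (not part of the repo) of that wiring; the worker body is only a placeholder:

```python
# Minimal sketch: sharing SharedAdam state across learner processes.
import torch.nn as nn
import torch.multiprocessing as mp
from optims.sharedAdam import SharedAdam

def worker(shared_model, optimizer):
    # in the real agents each worker backprops a local loss, copies its
    # gradients onto shared_model, and then calls optimizer.step()
    pass

if __name__ == '__main__':
    shared_model = nn.Linear(4, 2)                 # hypothetical tiny model
    shared_model.share_memory()
    optimizer = SharedAdam(shared_model.parameters(), lr=1e-4)
    optimizer.share_memory()                       # share step/exp_avg/exp_avg_sq
    procs = [mp.Process(target=worker, args=(shared_model, optimizer)) for _ in range(2)]
    for p in procs: p.start()
    for p in procs: p.join()
```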
--------------------------------------------------------------------------------
/optims/sharedRMSprop.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | from __future__ import absolute_import
3 | from __future__ import division
4 | from __future__ import print_function
5 | from torch import optim
6 |
7 | # Non-centered RMSprop update with shared statistics (without momentum)
8 | class SharedRMSprop(optim.RMSprop):
9 | """Implements RMSprop algorithm with shared states.
10 | """
11 |
12 | def __init__(self, params, lr=1e-2, alpha=0.99, eps=1e-8, weight_decay=0):
13 | super(SharedRMSprop, self).__init__(params, lr=lr, alpha=alpha, eps=eps, weight_decay=weight_decay, momentum=0, centered=False)
14 |
15 | # State initialisation (must be done before step, else will not be shared between threads)
16 | for group in self.param_groups:
17 | for p in group['params']:
18 | state = self.state[p]
19 | state['step'] = p.data.new().resize_(1).zero_()
20 | state['square_avg'] = p.data.new().resize_as_(p.data).zero_()
21 |
22 | def share_memory(self):
23 | for group in self.param_groups:
24 | for p in group['params']:
25 | state = self.state[p]
26 | state['step'].share_memory_()
27 | state['square_avg'].share_memory_()
28 |
29 | def step(self, closure=None):
30 | """Performs a single optimization step.
31 | Arguments:
32 | closure (callable, optional): A closure that reevaluates the model
33 | and returns the loss.
34 | """
35 | loss = None
36 | if closure is not None:
37 | loss = closure()
38 |
39 | for group in self.param_groups:
40 | for p in group['params']:
41 | if p.grad is None:
42 | continue
43 | grad = p.grad.data
44 | state = self.state[p]
45 |
46 | square_avg = state['square_avg']
47 | alpha = group['alpha']
48 |
49 | state['step'] += 1
50 |
51 | if group['weight_decay'] != 0:
52 | grad = grad.add(group['weight_decay'], p.data)
53 |
54 | # g = αg + (1 - α)Δθ^2
55 | square_avg.mul_(alpha).addcmul_(1 - alpha, grad, grad)
56 | # θ ← θ - ηΔθ/√(g + ε)
57 | avg = square_avg.sqrt().add_(group['eps'])
58 | p.data.addcdiv_(-group['lr'], grad, avg)
59 |
60 | return loss
61 |
--------------------------------------------------------------------------------
/plot.sh:
--------------------------------------------------------------------------------
1 | #!/bin/bash
2 | set -e
3 |
4 | if [ $# -lt 2 ]; then
5 | echo "usage: $0 machine time_stamp"
6 | exit
7 | fi
8 |
9 | LOG_DIR="logs/"
10 | LOG_FIL=$LOG_DIR$1"_"$2".log"
11 | echo "$LOG_FIL"
12 | # sed -i 's/,/./g' $LOG_FIL
13 |
14 | SUB_DIR="figs/"$1"_"$2"/"
15 | if [ -d "$SUB_DIR" ]; then
16 | rm -r $SUB_DIR
17 | fi
18 | mkdir $SUB_DIR
19 |
20 | OPT="lmj-plot --input $LOG_FIL \
21 | --num-x-ticks 8 \
22 | --alpha 0.7 \
23 | --colors r g b m y c \
24 | --points - \
25 | -g -T"
26 |
27 | $OPT -o $SUB_DIR"p_loss_avg.png" -m 'Iteration: (\d+); p_loss_avg: (\S+)' --xlabel Iteration --ylabel p_loss_avg --title "p_loss_avg" &
28 | $OPT -o $SUB_DIR"v_loss_avg.png" -m 'Iteration: (\d+); v_loss_avg: (\S+)' --xlabel Iteration --ylabel v_loss_avg --title "v_loss_avg" &
29 | $OPT -o $SUB_DIR"loss_avg.png" -m 'Iteration: (\d+); loss_avg: (\S+)' --xlabel Iteration --ylabel loss_avg --title "loss_avg" &
30 | $OPT -o $SUB_DIR"entropy_avg.png" -m 'Iteration: (\d+); entropy_avg: (\S+)' --xlabel Iteration --ylabel entropy_avg --title "entropy_avg" &
31 | $OPT -o $SUB_DIR"v_avg.png" -m 'Iteration: (\d+); v_avg: (\S+)' --xlabel Iteration --ylabel v_avg --title "v_avg" &
32 | $OPT -o $SUB_DIR"reward_avg.png" -m 'Iteration: (\d+); reward_avg: (\S+)' --xlabel Iteration --ylabel reward_avg --title "reward_avg" &
33 | $OPT -o $SUB_DIR"reward_std.png" -m 'Iteration: (\d+); reward_std: (\S+)' --xlabel Iteration --ylabel reward_std --title "reward_std" &
34 | $OPT -o $SUB_DIR"steps_avg.png" -m 'Iteration: (\d+); steps_avg: (\S+)' --xlabel Iteration --ylabel steps_avg --title "steps_avg" &
35 | $OPT -o $SUB_DIR"steps_std.png" -m 'Iteration: (\d+); steps_std: (\S+)' --xlabel Iteration --ylabel steps_std --title "steps_std" &
36 | $OPT -o $SUB_DIR"nepisodes.png" -m 'Iteration: (\d+); nepisodes: (\S+)' --xlabel Iteration --ylabel nepisodes --title "nepisodes" &
37 | $OPT -o $SUB_DIR"nepisodes_solved.png" -m 'Iteration: (\d+); nepisodes_solved: (\S+)' --xlabel Iteration --ylabel nepisodes_solved --title "nepisodes_solved" &
38 | $OPT -o $SUB_DIR"repisodes_solved.png" -m 'Iteration: (\d+); repisodes_solved: (\S+)' --xlabel Iteration --ylabel repisodes_solved --title "repisodes_solved" &
39 |
--------------------------------------------------------------------------------
/plot_compare.sh:
--------------------------------------------------------------------------------
1 | #!/bin/bash
2 | set -e
3 |
4 | if [ $# -lt 3 ]; then
5 | echo "usage: $0 compare_ind machine_1 time_stamp_1 .. .. machine_N time_stamp_N"
6 | exit
7 | fi
8 |
9 | LOG_FIL_ALL=""
10 | LOG_DIR="logs/"
11 | j=0
12 | for i in $@
13 | do
14 | if [ $j -eq 0 ]; then
15 | SUB_DIR="figs/compare_"$i"/"
16 | let j=$j+1
17 | continue
18 | fi
19 | let j=$j+1
20 | if [ $(($j % 2)) = 0 ]; then # machine
21 | MACHINE=$i
22 | else # time_stamp
23 | TIME_STAMP=$i
24 | LOG_FIL=$LOG_DIR$MACHINE"_"$TIME_STAMP".log"
25 | echo "$LOG_FIL"
26 | LOG_FIL_ALL="$LOG_FIL_ALL $LOG_FIL"
27 | fi
28 | done
29 |
30 | if [ -d "$SUB_DIR" ]; then
31 | rm -r $SUB_DIR
32 | fi
33 | mkdir $SUB_DIR
34 |
35 | OPT="lmj-plot --input $LOG_FIL_ALL \
36 | --num-x-ticks 8 \
37 | --alpha 0.7 \
38 | --colors r g b m y c \
39 | --points - \
40 | -g -T"
41 |
42 | $OPT -o $SUB_DIR"p_loss_avg.png" -m 'Iteration: (\d+); p_loss_avg: (\S+)' --xlabel Iteration --ylabel p_loss_avg --title "p_loss_avg" &
43 | $OPT -o $SUB_DIR"v_loss_avg.png" -m 'Iteration: (\d+); v_loss_avg: (\S+)' --xlabel Iteration --ylabel v_loss_avg --title "v_loss_avg" &
44 | $OPT -o $SUB_DIR"loss_avg.png" -m 'Iteration: (\d+); loss_avg: (\S+)' --xlabel Iteration --ylabel loss_avg --title "loss_avg" &
45 | $OPT -o $SUB_DIR"entropy_avg.png" -m 'Iteration: (\d+); entropy_avg: (\S+)' --xlabel Iteration --ylabel entropy_avg --title "entropy_avg" &
46 | $OPT -o $SUB_DIR"v_avg.png" -m 'Iteration: (\d+); v_avg: (\S+)' --xlabel Iteration --ylabel v_avg --title "v_avg" &
47 | $OPT -o $SUB_DIR"reward_avg.png" -m 'Iteration: (\d+); reward_avg: (\S+)' --xlabel Iteration --ylabel reward_avg --title "reward_avg" &
48 | $OPT -o $SUB_DIR"reward_std.png" -m 'Iteration: (\d+); reward_std: (\S+)' --xlabel Iteration --ylabel reward_std --title "reward_std" &
49 | $OPT -o $SUB_DIR"steps_avg.png" -m 'Iteration: (\d+); steps_avg: (\S+)' --xlabel Iteration --ylabel steps_avg --title "steps_avg" &
50 | $OPT -o $SUB_DIR"steps_std.png" -m 'Iteration: (\d+); steps_std: (\S+)' --xlabel Iteration --ylabel steps_std --title "steps_std" &
51 | $OPT -o $SUB_DIR"nepisodes.png" -m 'Iteration: (\d+); nepisodes: (\S+)' --xlabel Iteration --ylabel nepisodes --title "nepisodes" &
52 | $OPT -o $SUB_DIR"nepisodes_solved.png" -m 'Iteration: (\d+); nepisodes_solved: (\S+)' --xlabel Iteration --ylabel nepisodes_solved --title "nepisodes_solved" &
53 | $OPT -o $SUB_DIR"repisodes_solved.png" -m 'Iteration: (\d+); repisodes_solved: (\S+)' --xlabel Iteration --ylabel repisodes_solved --title "repisodes_solved" &
54 |
--------------------------------------------------------------------------------
/utils/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jingweiz/pytorch-rl/20b3b738ca400b1916197f27a91367878b09803c/utils/__init__.py
--------------------------------------------------------------------------------
/utils/distributions.py:
--------------------------------------------------------------------------------
1 | from __future__ import absolute_import
2 | from __future__ import division
3 | from __future__ import print_function
4 | import math
5 | import random
6 |
7 | # Knuth's algorithm for generating Poisson samples
8 | def sample_poisson(lmbd):
9 | L, k, p = math.exp(-lmbd), 0, 1
10 | while p > L:
11 | k += 1
12 | p *= random.uniform(0, 1)
13 | return max(k - 1, 0)
14 |
15 | # KL divergence k = DKL[ ref_distribution || input_distribution]
16 | def categorical_kl_div(input_vb, ref_vb):
17 | """
18 | kl_div = \sum ref * (log(ref) - log(input))
19 | variables needed:
20 | input_vb: [batch_size x state_dim]
21 | ref_vb: [batch_size x state_dim]
22 | returns:
23 | kl_div_vb: [batch_size x 1]
24 | """
25 | return (ref_vb * (ref_vb.log() - input_vb.log())).sum(1, keepdim=True)
26 |
27 | # import torch
28 | # from torch.autograd import Variable
29 | # import torch.nn.functional as F
30 | # # input_vb = Variable(torch.Tensor([0.2, 0.8])).view(1, 2)
31 | # # ref_vb = Variable(torch.Tensor([0.3, 0.7])).view(1, 2)
32 | # input_vb = Variable(torch.Tensor([0.0002, 0.9998])).view(1, 2)
33 | # ref_vb = Variable(torch.Tensor([0.3, 0.7])).view(1, 2)
34 | # input_vb = Variable(torch.Tensor([0.2, 0.8, 0.5, 0.5, 0.7, 0.3])).view(3, 2)
35 | # ref_vb = Variable(torch.Tensor([0.3, 0.7, 0.5, 0.5, 0.1, 0.9])).view(3, 2)
36 | # print(F.kl_div(input_vb.log(), ref_vb, size_average=False))
37 | # print(kl_div(input_vb, ref_vb))
38 |
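A minimal check (not part of the repo): the empirical mean of `sample_poisson` should approach lambda, and `categorical_kl_div` should match the hand-computed sum ref * (log(ref) - log(input)):

```python
# Minimal check for sample_poisson and categorical_kl_div.
import math
import torch
from torch.autograd import Variable
from utils.distributions import sample_poisson, categorical_kl_div

samples = [sample_poisson(4.0) for _ in range(10000)]
print(sum(samples) / float(len(samples)))                 # close to 4.0

input_vb = Variable(torch.Tensor([[0.2, 0.8]]))
ref_vb   = Variable(torch.Tensor([[0.3, 0.7]]))
print(categorical_kl_div(input_vb, ref_vb))               # ~0.0282
print(0.3 * math.log(0.3 / 0.2) + 0.7 * math.log(0.7 / 0.8))
```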
--------------------------------------------------------------------------------
/utils/factory.py:
--------------------------------------------------------------------------------
1 | from __future__ import absolute_import
2 | from __future__ import division
3 | from __future__ import print_function
4 |
5 | from core.envs.gym import GymEnv
6 | from core.envs.atari_ram import AtariRamEnv
7 | from core.envs.atari import AtariEnv
8 | from core.envs.lab import LabEnv
9 | EnvDict = {"gym": GymEnv, # classic control games from openai w/ low-level input
10 | "atari-ram": AtariRamEnv, # atari integrations from openai, with low-level input
11 | "atari": AtariEnv, # atari integrations from openai, with pixel-level input
12 | "lab": LabEnv}
13 |
14 | from core.models.empty import EmptyModel
15 | from core.models.dqn_mlp import DQNMlpModel
16 | from core.models.dqn_cnn import DQNCnnModel
17 | from core.models.a3c_mlp_con import A3CMlpConModel
18 | from core.models.a3c_cnn_dis import A3CCnnDisModel
19 | from core.models.acer_mlp_dis import ACERMlpDisModel
20 | from core.models.acer_cnn_dis import ACERCnnDisModel
21 | ModelDict = {"empty": EmptyModel, # contains nothing, only should be used w/ EmptyAgent
22 | "dqn-mlp": DQNMlpModel, # for dqn low-level input
23 | "dqn-cnn": DQNCnnModel, # for dqn pixel-level input
24 | "a3c-mlp-con": A3CMlpConModel, # for a3c low-level input (NOTE: continuous must end in "-con")
25 | "a3c-cnn-dis": A3CCnnDisModel, # for a3c pixel-level input
26 | "acer-mlp-dis": ACERMlpDisModel, # for acer low-level input
27 | "acer-cnn-dis": ACERCnnDisModel, # for acer pixel-level input
28 | "none": None}
29 |
30 | from core.memories.sequential import SequentialMemory
31 | from core.memories.episode_parameter import EpisodeParameterMemory
32 | from core.memories.episodic import EpisodicMemory
33 | MemoryDict = {"sequential": SequentialMemory, # off-policy
34 | "episode-parameter": EpisodeParameterMemory, # not in use right now
35 | "episodic": EpisodicMemory, # on/off-policy
36 | "none": None} # on-policy
37 |
38 | from core.agents.empty import EmptyAgent
39 | from core.agents.dqn import DQNAgent
40 | from core.agents.a3c import A3CAgent
41 | from core.agents.acer import ACERAgent
42 | AgentDict = {"empty": EmptyAgent, # to test integration of new envs, contains only the most basic control loop
43 | "dqn": DQNAgent, # dqn (w/ double dqn & dueling as options)
44 | "a3c": A3CAgent, # a3c (multi-process, pure cpu version)
45 | "acer": ACERAgent} # acer (multi-process, pure cpu version)
46 |
--------------------------------------------------------------------------------
/utils/helpers.py:
--------------------------------------------------------------------------------
1 | from __future__ import absolute_import
2 | from __future__ import division
3 | from __future__ import print_function
4 | import logging
5 | import numpy as np
6 | import cv2
7 | from collections import namedtuple
8 |
9 | def loggerConfig(log_file, verbose=2):
10 | logger = logging.getLogger()
11 | formatter = logging.Formatter('[%(levelname)-8s] (%(processName)-11s) %(message)s')
12 | fileHandler = logging.FileHandler(log_file, 'w')
13 | fileHandler.setFormatter(formatter)
14 | logger.addHandler(fileHandler)
15 | if verbose >= 2:
16 | logger.setLevel(logging.DEBUG)
17 | elif verbose >= 1:
18 | logger.setLevel(logging.INFO)
19 | else:
20 | # NOTE: we currently use this level to log to get rid of visdom's info printouts
21 | logger.setLevel(logging.WARNING)
22 | return logger
23 |
24 | # This is to be understood as a transition: Given `state0`, performing `action`
25 | # yields `reward` and results in `state1`, which might be `terminal`.
26 | # NOTE: used as the return format for Env(), and as the format to push into replay memory for off-policy methods (DQN)
27 | # NOTE: when return from Env(), state0 is always None
28 | Experience = namedtuple('Experience', 'state0, action, reward, state1, terminal1')
29 | # NOTE: used by on-policy methods to collect experiences over a rollout of an episode
30 | # NOTE: policy_vb & value0_vb for storing output Variables along a rollout # NOTE: they should not be detached from the graph!
31 | A3C_Experience = namedtuple('A3C_Experience', 'state0, action, reward, state1, terminal1, policy_vb, sigmoid_vb, value0_vb')
32 | ACER_On_Policy_Experience = namedtuple('ACER_On_Policy_Experience', 'state0, action, reward, state1, terminal1, policy_vb, q0_vb, value0_vb, detached_avg_policy_vb, detached_old_policy_vb')
33 | # NOTE: used as the format to push into the replay memory for ACER; when sampled, used to get ACER_On_Policy_Experience
34 | ACER_Off_Policy_Experience = namedtuple('ACER_Off_Policy_Experience', 'state0, action, reward, detached_old_policy_vb')
35 |
36 | def preprocessAtari(frame):
37 | frame = frame[34:34 + 160, :160]
38 | frame = cv2.resize(frame, (80, 80))
39 | frame = cv2.resize(frame, (42, 42))
40 | frame = frame.mean(2)
41 | frame = frame.astype(np.float32)
42 | frame*= (1. / 255.)
43 | return frame
44 |
45 | # NOTE: assumes RGB channel order (BT.709 luma coefficients)
46 | def rgb2gray(rgb):
47 |     gray_image = 0.2126 * rgb[..., 0]
48 |     gray_image[:] += 0.7152 * rgb[..., 1]
49 |     gray_image[:] += 0.0722 * rgb[..., 2]
50 |     return gray_image
51 |
52 | # NOTE: assumes RGB channel order (BT.601 luma coefficients)
53 | def rgb2y(rgb):
54 | y_image = 0.299 * rgb[..., 0]
55 | y_image[:] += 0.587 * rgb[..., 1]
56 | y_image[:] += 0.114 * rgb[..., 2]
57 | return y_image
58 |
59 | def scale(image, hei_image, wid_image):
60 | return cv2.resize(image, (wid_image, hei_image),
61 | interpolation=cv2.INTER_LINEAR)
62 |
63 | def one_hot(n_classes, labels):
64 | one_hot_labels = np.zeros(labels.shape + (n_classes,))
65 | for c in range(n_classes):
66 | one_hot_labels[labels == c, c] = 1
67 | return one_hot_labels
68 |
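A minimal check (not part of the repo): `one_hot` turns integer labels into one-hot rows, and `preprocessAtari` maps a raw 210x160x3 Atari frame to a 42x42 float frame scaled into [0, 1]:

```python
# Minimal check for one_hot and preprocessAtari.
import numpy as np
from utils.helpers import one_hot, preprocessAtari

print(one_hot(3, np.array([0, 2, 1])))
# [[1. 0. 0.]
#  [0. 0. 1.]
#  [0. 1. 0.]]

frame = np.random.randint(0, 256, size=(210, 160, 3)).astype(np.uint8)
small = preprocessAtari(frame)
print(small.shape, small.dtype, small.max() <= 1.0)       # (42, 42) float32 True
```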
--------------------------------------------------------------------------------
/utils/init_weights.py:
--------------------------------------------------------------------------------
1 | from __future__ import absolute_import
2 | from __future__ import division
3 | from __future__ import print_function
4 |
5 | import numpy as np
6 | import torch
7 |
8 | def normalized_columns_initializer(weights, std=1.0):
9 | out = torch.randn(weights.size())
10 | # out *= std / torch.sqrt(out.pow(2).sum(1).expand_as(out)) # 0.1.12
11 | out *= std / torch.sqrt(out.pow(2).sum(1, keepdim=True).expand_as(out)) # 0.2.0
12 | return out
13 |
14 | def init_weights(m):
15 | classname = m.__class__.__name__
16 | if classname.find('Conv') != -1:
17 | weight_shape = list(m.weight.data.size())
18 | fan_in = np.prod(weight_shape[1:4])
19 | fan_out = np.prod(weight_shape[2:4]) * weight_shape[0]
20 | w_bound = np.sqrt(6. / (fan_in + fan_out))
21 | m.weight.data.uniform_(-w_bound, w_bound)
22 | m.bias.data.fill_(0)
23 | elif classname.find('Linear') != -1:
24 | weight_shape = list(m.weight.data.size())
25 | fan_in = weight_shape[1]
26 | fan_out = weight_shape[0]
27 | w_bound = np.sqrt(6. / (fan_in + fan_out))
28 | m.weight.data.uniform_(-w_bound, w_bound)
29 | m.bias.data.fill_(0)
30 |
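Despite the name, with PyTorch's (out_features, in_features) weight layout the initializer normalizes along dim 1, i.e. each output unit's weight vector ends up with L2 norm equal to `std`. A minimal check (not part of the repo):

```python
# Minimal check: every row of the initialized weight has L2 norm == std.
import torch
from utils.init_weights import normalized_columns_initializer

w = normalized_columns_initializer(torch.zeros(5, 10), std=0.01)
print(w.pow(2).sum(1).sqrt())                             # five values, all ~0.01
```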
--------------------------------------------------------------------------------
/utils/options.py:
--------------------------------------------------------------------------------
1 | from __future__ import absolute_import
2 | from __future__ import division
3 | from __future__ import print_function
4 | import numpy as np
5 | import os
6 | import visdom
7 | import torch
8 | import torch.nn as nn
9 | import torch.nn.functional as F
10 | import torch.optim as optim
11 |
12 | from utils.helpers import loggerConfig
13 | from optims.sharedAdam import SharedAdam
14 | from optims.sharedRMSprop import SharedRMSprop
15 |
16 | CONFIGS = [
17 | # agent_type, env_type, game, model_type, memory_type
18 | [ "empty", "gym", "CartPole-v0", "empty", "none" ], # 0
19 | [ "dqn", "gym", "CartPole-v0", "dqn-mlp", "sequential"], # 1
20 | [ "dqn", "atari-ram", "Pong-ram-v0", "dqn-mlp", "sequential"], # 2
21 | [ "dqn", "atari", "PongDeterministic-v4", "dqn-cnn", "sequential"], # 3
22 | [ "dqn", "atari", "BreakoutDeterministic-v4", "dqn-cnn", "sequential"], # 4
23 | [ "a3c", "atari", "PongDeterministic-v4", "a3c-cnn-dis", "none" ], # 5
24 | [ "a3c", "gym", "InvertedPendulum-v1", "a3c-mlp-con", "none" ], # 6
25 | [ "acer", "gym", "CartPole-v0", "acer-mlp-dis", "episodic" ], # 7 # NOTE: acer under testing
26 | [ "acer", "atari", "Boxing-v0", "acer-cnn-dis", "episodic" ] # 8 # NOTE: acer under testing
27 | ]
28 |
29 | class Params(object): # NOTE: shared across all modules
30 | def __init__(self):
31 | self.verbose = 0 # 0(warning) | 1(info) | 2(debug)
32 |
33 | # training signature
34 | self.machine = "aisgpu8" # "machine_id"
35 | self.timestamp = "17082701" # "yymmdd##"
36 | # training configuration
37 | self.mode = 1 # 1(train) | 2(test model_file)
38 | self.config = 7
39 |
40 | self.seed = 123
41 |         self.render = False # whether to render the window from the original envs or not
42 |         self.visualize = True # whether to do online plotting or not
43 | self.save_best = False # save model w/ highest reward if True, otherwise always save the latest model
44 |
45 | self.agent_type, self.env_type, self.game, self.model_type, self.memory_type = CONFIGS[self.config]
46 |
47 | if self.agent_type == "dqn":
48 | self.enable_double_dqn = False
49 | self.enable_dueling = False
50 | self.dueling_type = 'avg' # avg | max | naive
51 |
52 | if self.env_type == "gym":
53 | self.hist_len = 1
54 | self.hidden_dim = 16
55 | else:
56 | self.hist_len = 4
57 | self.hidden_dim = 256
58 |
59 | self.use_cuda = torch.cuda.is_available()
60 | self.dtype = torch.cuda.FloatTensor if torch.cuda.is_available() else torch.FloatTensor
61 | elif self.agent_type == "a3c":
62 | self.enable_log_at_train_step = True # when False, x-axis would be frame_step instead of train_step
63 |
64 | self.enable_lstm = True
65 | if "-con" in self.model_type:
66 | self.enable_continuous = True
67 | else:
68 | self.enable_continuous = False
69 | self.num_processes = 16
70 |
71 | self.hist_len = 1
72 | self.hidden_dim = 128
73 |
74 | self.use_cuda = False
75 | self.dtype = torch.FloatTensor
76 | elif self.agent_type == "acer":
77 | self.enable_bias_correction = True
78 | self.enable_1st_order_trpo = True
79 | self.enable_log_at_train_step = True # when False, x-axis would be frame_step instead of train_step
80 |
81 | self.enable_lstm = True
82 | if "-con" in self.model_type:
83 | self.enable_continuous = True
84 | else:
85 | self.enable_continuous = False
86 | self.num_processes = 16
87 |
88 | self.hist_len = 1
89 | self.hidden_dim = 32
90 |
91 | self.use_cuda = False
92 | self.dtype = torch.FloatTensor
93 | else:
94 | self.hist_len = 1
95 | self.hidden_dim = 256
96 |
97 | self.use_cuda = torch.cuda.is_available()
98 | self.dtype = torch.cuda.FloatTensor if torch.cuda.is_available() else torch.FloatTensor
99 |
100 | # prefix for model/log/visdom
101 | self.refs = self.machine + "_" + self.timestamp # NOTE: using this as env for visdom
102 | self.root_dir = os.getcwd()
103 |
104 | # model files
105 | # NOTE: will save the current model to model_name
106 | self.model_name = self.root_dir + "/models/" + self.refs + ".pth"
107 | # NOTE: will load pretrained model_file if not None
108 |         self.model_file = None#self.root_dir + "/models/{TODO:FILL_IN_PRETRAINED_MODEL_FILE}.pth"
109 | if self.mode == 2:
110 | self.model_file = self.model_name # NOTE: so only need to change self.mode to 2 to test the current training
111 | assert self.model_file is not None, "Pre-Trained model is None, Testing aborted!!!"
112 | self.refs = self.refs + "_test" # NOTE: using this as env for visdom for testing, to avoid accidentally redraw on the training plots
113 |
114 | # logging configs
115 | self.log_name = self.root_dir + "/logs/" + self.refs + ".log"
116 | self.logger = loggerConfig(self.log_name, self.verbose)
117 | self.logger.warning("<===================================>")
118 |
119 | if self.visualize:
120 | self.vis = visdom.Visdom()
121 | self.logger.warning("bash$: python -m visdom.server") # activate visdom server on bash
122 | self.logger.warning("http://localhost:8097/env/" + self.refs) # open this address on browser
123 |
124 | class EnvParams(Params): # settings for simulation environment
125 | def __init__(self):
126 | super(EnvParams, self).__init__()
127 |
128 | if self.env_type == "gym":
129 | pass
130 | elif self.env_type == "atari-ram":
131 | pass
132 | elif self.env_type == "atari":
133 | self.hei_state = 42
134 | self.wid_state = 42
135 | self.preprocess_mode = 3 # 0(nothing) | 1(rgb2gray) | 2(rgb2y) | 3(crop&resize)
136 | elif self.env_type == "lab":
137 | pass
138 | elif self.env_type == "gazebo":
139 | self.hei_state = 60
140 | self.wid_state = 80
141 | self.preprocess_mode = 3 # 0(nothing) | 1(rgb2gray) | 2(rgb2y) | 3(crop&resize depth)
142 | self.img_encoding_type = "passthrough"
143 | else:
144 |             assert False, "env_type must be: gym | atari-ram | atari | lab | gazebo"
145 |
146 | class ModelParams(Params): # settings for network architecture
147 | def __init__(self):
148 | super(ModelParams, self).__init__()
149 |
150 | self.state_shape = None # NOTE: set in fit_model of inherited Agents
151 | self.action_dim = None # NOTE: set in fit_model of inherited Agents
152 |
153 | class MemoryParams(Params): # settings for replay memory
154 | def __init__(self):
155 | super(MemoryParams, self).__init__()
156 |
157 | # NOTE: for multiprocess agents. this memory_size is the total number
158 | # NOTE: across all processes
159 | if self.agent_type == "dqn" and self.env_type == "gym":
160 | self.memory_size = 50000
161 | else:
162 | self.memory_size = 1000000
163 |
164 | class AgentParams(Params): # hyperparameters for drl agents
165 | def __init__(self):
166 | super(AgentParams, self).__init__()
167 |
168 | # criteria and optimizer
169 | if self.agent_type == "dqn":
170 | self.value_criteria = F.smooth_l1_loss
171 | self.optim = optim.Adam
172 | # self.optim = optim.RMSprop
173 | elif self.agent_type == "a3c":
174 | self.value_criteria = nn.MSELoss()
175 | self.optim = SharedAdam # share momentum across learners
176 | elif self.agent_type == "acer":
177 | self.value_criteria = nn.MSELoss()
178 | self.optim = SharedRMSprop # share momentum across learners
179 | else:
180 | self.value_criteria = F.smooth_l1_loss
181 | self.optim = optim.Adam
182 | # hyperparameters
183 | if self.agent_type == "dqn" and self.env_type == "gym":
184 | self.steps = 100000 # max #iterations
185 | self.early_stop = None # max #steps per episode
186 | self.gamma = 0.99
187 | self.clip_grad = 1.#np.inf
188 | self.lr = 0.0001
189 | self.lr_decay = False
190 | self.weight_decay = 0.
191 | self.eval_freq = 2500 # NOTE: here means every this many steps
192 | self.eval_steps = 1000
193 | self.prog_freq = self.eval_freq
194 | self.test_nepisodes = 1
195 |
196 | self.learn_start = 500 # start update params after this many steps
197 | self.batch_size = 32
198 | self.valid_size = 250
199 | self.eps_start = 1
200 | self.eps_end = 0.3
201 | self.eps_eval = 0.#0.05
202 | self.eps_decay = 50000
203 | self.target_model_update = 1000#0.0001
204 | self.action_repetition = 1
205 | self.memory_interval = 1
206 | self.train_interval = 1
207 | elif self.agent_type == "dqn" and self.env_type == "atari-ram" or \
208 | self.agent_type == "dqn" and self.env_type == "atari":
209 | self.steps = 50000000 # max #iterations
210 | self.early_stop = None # max #steps per episode
211 | self.gamma = 0.99
212 | self.clip_grad = 40.#np.inf
213 | self.lr = 0.00025
214 | self.lr_decay = False
215 | self.weight_decay = 0.
216 | self.eval_freq = 250000#12500 # NOTE: here means every this many steps
217 | self.eval_steps = 125000#2500
218 | self.prog_freq = 10000#self.eval_freq
219 | self.test_nepisodes = 1
220 |
221 | self.learn_start = 50000 # start update params after this many steps
222 | self.batch_size = 32
223 | self.valid_size = 500
224 | self.eps_start = 1
225 | self.eps_end = 0.1
226 | self.eps_eval = 0.#0.05
227 | self.eps_decay = 1000000
228 | self.target_model_update = 10000
229 | self.action_repetition = 4
230 | self.memory_interval = 1
231 | self.train_interval = 4
232 | elif self.agent_type == "a3c":
233 | self.steps = 20000000 # max #iterations
234 | self.early_stop = None # max #steps per episode
235 | self.gamma = 0.99
236 | self.clip_grad = 40.
237 | self.lr = 0.0001
238 | self.lr_decay = False
239 | self.weight_decay = 1e-4 if self.enable_continuous else 0.
240 | self.eval_freq = 60 # NOTE: here means every this many seconds
241 | self.eval_steps = 3000
242 | self.prog_freq = self.eval_freq
243 | self.test_nepisodes = 10
244 |
245 | self.rollout_steps = 20 # max look-ahead steps in a single rollout
246 | self.tau = 1.
247 | self.beta = 0.01 # coefficient for entropy penalty
248 | elif self.agent_type == "acer":
249 | self.steps = 20000000 # max #iterations
250 | self.early_stop = 200 # max #steps per episode
251 | self.gamma = 0.99
252 | self.clip_grad = 40.
253 | self.lr = 0.0001
254 | self.lr_decay = False
255 | self.weight_decay = 1e-4
256 | self.eval_freq = 60 # NOTE: here means every this many seconds
257 | self.eval_steps = 3000
258 | self.prog_freq = self.eval_freq
259 | self.test_nepisodes = 10
260 |
261 | self.replay_ratio = 4 # NOTE: 0: purely on-policy; otherwise mix with off-policy
262 | self.replay_start = 20000 # start off-policy learning after this many steps
263 | self.batch_size = 16
264 | self.valid_size = 500 # TODO: should do the same thing as in dqn
265 | self.clip_trace = 10#np.inf# c in retrace
266 | self.clip_1st_order_trpo = 1
267 | self.avg_model_decay = 0.99
268 |
269 | self.rollout_steps = 20 # max look-ahead steps in a single rollout
270 | self.tau = 1.
271 | self.beta = 1e-2 # coefficient for entropy penalty
272 | else:
273 | self.steps = 1000000 # max #iterations
274 | self.early_stop = None # max #steps per episode
275 | self.gamma = 0.99
276 | self.clip_grad = 1.#np.inf
277 | self.lr = 0.001
278 | self.lr_decay = False
279 | self.weight_decay = 0.
280 | self.eval_freq = 2500 # NOTE: here means every this many steps
281 | self.eval_steps = 1000
282 | self.prog_freq = self.eval_freq
283 | self.test_nepisodes = 10
284 |
285 | self.learn_start = 25000 # start update params after this many steps
286 | self.batch_size = 32
287 | self.valid_size = 500
288 | self.eps_start = 1
289 | self.eps_end = 0.1
290 | self.eps_eval = 0.#0.05
291 | self.eps_decay = 50000
292 | self.target_model_update = 1000
293 | self.action_repetition = 1
294 | self.memory_interval = 1
295 | self.train_interval = 4
296 |
297 | self.rollout_steps = 20 # max look-ahead steps in a single rollout
298 | self.tau = 1.
299 |
300 | if self.memory_type == "episodic": assert self.early_stop is not None
301 |
302 | self.env_params = EnvParams()
303 | self.model_params = ModelParams()
304 | self.memory_params = MemoryParams()
305 |
306 | class Options(Params):
307 | agent_params = AgentParams()
308 |
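The whole experiment is selected by two fields set in `Params.__init__`: `self.config` (an index into `CONFIGS`) and `self.mode` (1 = train, 2 = test the model saved under the current machine/timestamp signature); `Options.agent_params` then bundles the nested env/model/memory params that `main.py` passes to the agent. A minimal sketch (not part of the repo), assuming a visdom server is already running as noted above:

```python
# Minimal sketch: inspecting the resolved Params hierarchy.
from utils.options import Options

opt = Options()
print(opt.agent_type, opt.env_type, opt.game)         # selected by CONFIGS[opt.config]
print(opt.agent_params.gamma, opt.agent_params.lr)    # agent hyperparameters
print(opt.agent_params.memory_params.memory_size)     # nested params share the same switches
```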
--------------------------------------------------------------------------------