├── .gitignore
├── .gitmodules
├── README.md
├── modular_rl
│   ├── run_batch_size_experiment_set.sh
│   ├── run_multiple.sh
│   └── run_policy_experiment_set.sh
├── rllab
│   ├── run_ddpg_rllab.py
│   └── run_trpo.py
├── rllabplusplus
│   ├── ddpg
│   │   ├── __init__.py
│   │   └── ddpg.py
│   ├── run_ddpg.py
│   └── sampling_utils.py
└── tools
    ├── plot_results.py
    ├── run_multiple.sh
    ├── run_policy_experiment_set.sh
    └── significance_and_bootstrap.py
/.gitignore:
--------------------------------------------------------------------------------
1 | # Byte-compiled / optimized / DLL files
2 | __pycache__/
3 | *.py[cod]
4 | *$py.class
5 |
6 | # C extensions
7 | *.so
8 |
9 | # Distribution / packaging
10 | .Python
11 | env/
12 | build/
13 | develop-eggs/
14 | dist/
15 | downloads/
16 | eggs/
17 | .eggs/
18 | lib/
19 | lib64/
20 | parts/
21 | sdist/
22 | var/
23 | wheels/
24 | *.egg-info/
25 | .installed.cfg
26 | *.egg
27 |
28 | # PyInstaller
29 | # Usually these files are written by a python script from a template
30 | # before PyInstaller builds the exe, so as to inject date/other infos into it.
31 | *.manifest
32 | *.spec
33 |
34 | # Installer logs
35 | pip-log.txt
36 | pip-delete-this-directory.txt
37 |
38 | # Unit test / coverage reports
39 | htmlcov/
40 | .tox/
41 | .coverage
42 | .coverage.*
43 | .cache
44 | nosetests.xml
45 | coverage.xml
46 | *.cover
47 | .hypothesis/
48 |
49 | # Translations
50 | *.mo
51 | *.pot
52 |
53 | # Django stuff:
54 | *.log
55 | local_settings.py
56 |
57 | # Flask stuff:
58 | instance/
59 | .webassets-cache
60 |
61 | # Scrapy stuff:
62 | .scrapy
63 |
64 | # Sphinx documentation
65 | docs/_build/
66 |
67 | # PyBuilder
68 | target/
69 |
70 | # Jupyter Notebook
71 | .ipynb_checkpoints
72 |
73 | # pyenv
74 | .python-version
75 |
76 | # celery beat schedule file
77 | celerybeat-schedule
78 |
79 | # SageMath parsed files
80 | *.sage.py
81 |
82 | # dotenv
83 | .env
84 |
85 | # virtualenv
86 | .venv
87 | venv/
88 | ENV/
89 |
90 | # Spyder project settings
91 | .spyderproject
92 | .spyproject
93 |
94 | # Rope project settings
95 | .ropeproject
96 |
97 | # mkdocs documentation
98 | /site
99 |
100 | # mypy
101 | .mypy_cache/
102 |
--------------------------------------------------------------------------------
/.gitmodules:
--------------------------------------------------------------------------------
1 | [submodule "baselines"]
2 | path = baselines
3 | url = https://github.com/Breakend/baselines
4 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # DeepReinforcementLearningThatMatters
2 |
3 | Accompanying code for "Deep Reinforcement Learning that Matters"
4 |
5 | ## Baselines Experiments
6 |
7 | Our fork: https://github.com/Breakend/baselines
8 |
9 | Current baselines code: https://github.com/openai/baselines
10 |
11 | Our checkpointed version of the baselines code is found in the `baselines` folder. We make several modifications, mostly to allow for passing network structures as arguments to the MuJoCo-related run scripts.
12 |
13 | Our only internal change was to the DDPG evaluation code, made to allow comparison against other algorithms. In the original baselines DDPG code, evaluation is performed across N different policies, where N is the number of `epoch_cycles`. We did not find this consistent for comparison against other methods, so we modified it to match the rllab version of DDPG evaluation: the target policy is run for 10 full trajectories at the end of each epoch.
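As a minimal sketch of what that evaluation amounts to (illustrative only; `env`, `policy`, and `max_path_length` here are placeholders, not the exact baselines/rllab objects):

```python
import numpy as np

def evaluate_policy(env, policy, n_trajectories=10, max_path_length=1000):
    """Roll out the learned policy (no exploration noise) for a fixed number
    of full trajectories and report the average undiscounted return."""
    returns = []
    for _ in range(n_trajectories):
        obs = env.reset()
        done, path_return, path_length = False, 0.0, 0
        while not done and path_length < max_path_length:
            action, _ = policy.get_action(obs)
            obs, reward, done, _ = env.step(action)
            path_return += reward
            path_length += 1
        returns.append(path_return)
    return np.mean(returns), np.std(returns)
```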
14 |
15 | ## rllab experiments
16 |
17 | rllab code: https://github.com/rll/rllab
18 |
19 | These require the full rllab code, which we do not provide. Instead we provide some run scripts for rllab experiments in the `rllab` folder.
20 |
21 | ## rllabplusplus experiments
22 |
23 | rllabplusplus (Q-Prop) code: https://github.com/shaneshixiang/rllabplusplus
24 |
25 | This is the code provided with Q-Prop; we include only a checkpointed version of its DDPG code, which we use for evaluation here. It lives in the `rllabplusplus` folder.
26 |
27 | ## modular_rl experiments
28 |
29 | Original TRPO (Modular RL) code: https://github.com/joschu/modular_rl
30 |
31 | These are simply run scripts for the `modular_rl` codebase.
32 |
33 | ## Tools
34 |
35 | This folder contains the significance-testing tools we used, along with various associated run scripts.
36 |
37 | For bootstrap-based analysis, we use the `bootstrapped` library; its tutorials are a nice introduction to this sort of statistical analysis.
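As a minimal sketch (with `final_returns` standing in for a 1-D array of per-trial final average returns; the values below are placeholders):

```python
import numpy as np
import bootstrapped.bootstrap as bs
import bootstrapped.stats_functions as bs_stats

final_returns = np.array([3201.0, 2987.5, 3410.2, 3100.8, 2899.9])  # placeholder data

# Bootstrap confidence interval around the mean final return.
ci = bs.bootstrap(final_returns, stat_func=bs_stats.mean)
print("%.2f (%.2f, %.2f)" % (ci.value, ci.lower_bound, ci.upper_bound))
```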
38 |
39 | For the t-test and the Kolmogorov-Smirnov (KS) test, we use the SciPy implementations (`scipy.stats.ttest_ind` and `scipy.stats.ks_2samp`).
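Again as a sketch (assuming `returns_a` and `returns_b` are arrays of final average returns for two configurations; placeholder values):

```python
import numpy as np
from scipy import stats

returns_a = np.array([3201.0, 2987.5, 3410.2, 3100.8, 2899.9])  # placeholder data
returns_b = np.array([2950.1, 2810.3, 3050.7, 2700.2, 2998.4])  # placeholder data

t_stat, t_p = stats.ttest_ind(returns_a, returns_b)    # two-sample t-test
ks_stat, ks_p = stats.ks_2samp(returns_a, returns_b)   # two-sample Kolmogorov-Smirnov test
print("t = %.2f (p = %.3f), KS = %.2f (p = %.3f)" % (t_stat, t_p, ks_stat, ks_p))
```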
40 |
41 | ## Citation
42 |
43 | ```
44 | @article{hendersonRL2017,
45 | author = {{Henderson}, Peter and {Islam}, Riashat and {Bachman}, Philip and {Pineau}, Joelle and {Precup}, Doina and {Meger}, David},
46 | title = "{Deep Reinforcement Learning that Matters}",
47 | journal = {arXiv preprint arXiv:1709.06560},
48 | year = 2017,
49 | url={https://arxiv.org/pdf/1709.06560.pdf}
50 | }
51 | ```
52 |
--------------------------------------------------------------------------------
/modular_rl/run_batch_size_experiment_set.sh:
--------------------------------------------------------------------------------
1 | #!/bin/bash
2 |
3 |
4 | die() { echo "$@" 1>&2 ; exit 1; }
5 |
6 | # TODO: make this proper usage script
7 |
8 | # check whether the user supplied -h or --help; if so, display usage
9 | if [[ "$1" == "--help" || "$1" == "-h" ]]
10 | then
11 |     die "Usage: bash $0 env_name"
12 | fi
13 |
14 | env=$1
15 |
16 |
17 | # default (6464tanh)
18 | # activations
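# Note: n_iter is scaled inversely with timesteps_per_batch so that every setting
# sees roughly the same total number of timesteps (n_iter * timesteps_per_batch ~= 1e6).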
19 | KERAS_BACKEND=theano bash run_multiple.sh 5 1 ${env}_tanh_defaultbs1024 run_pg.py --gamma=0.995 --lam=0.97 --agent=modular_rl.agentzoo.TrpoAgent --max_kl=0.01 --cg_damping=0.1 --activation=tanh --n_iter=977 --timesteps_per_batch=1024 --env=${env} --filter=1 &> ${env}_default_bs1024.log
20 | KERAS_BACKEND=theano bash run_multiple.sh 5 1 ${env}_tanh_default2048 run_pg.py --gamma=0.995 --lam=0.97 --agent=modular_rl.agentzoo.TrpoAgent --max_kl=0.01 --cg_damping=0.1 --activation=tanh --n_iter=488 --timesteps_per_batch=2048 --env=${env} --filter=1 &> ${env}_default2048.log
21 | KERAS_BACKEND=theano bash run_multiple.sh 5 1 ${env}_tanh_default4096 run_pg.py --gamma=0.995 --lam=0.97 --agent=modular_rl.agentzoo.TrpoAgent --max_kl=0.01 --cg_damping=0.1 --activation=tanh --n_iter=244 --timesteps_per_batch=4096 --env=${env} --filter=1 &> ${env}_default4096.log
22 | KERAS_BACKEND=theano bash run_multiple.sh 5 1 ${env}_tanh_default8192 run_pg.py --gamma=0.995 --lam=0.97 --agent=modular_rl.agentzoo.TrpoAgent --max_kl=0.01 --cg_damping=0.1 --activation=tanh --n_iter=122 --timesteps_per_batch=8192 --env=${env} --filter=1 &> ${env}_default8192.log
23 | KERAS_BACKEND=theano bash run_multiple.sh 5 1 ${env}_tanh_default16384 run_pg.py --gamma=0.995 --lam=0.97 --agent=modular_rl.agentzoo.TrpoAgent --max_kl=0.01 --cg_damping=0.1 --activation=tanh --n_iter=61 --timesteps_per_batch=16384 --env=${env} --filter=1 &> ${env}_default16384.log
24 | KERAS_BACKEND=theano bash run_multiple.sh 5 1 ${env}_tanh_default32768 run_pg.py --gamma=0.995 --lam=0.97 --agent=modular_rl.agentzoo.TrpoAgent --max_kl=0.01 --cg_damping=0.1 --activation=tanh --n_iter=30 --timesteps_per_batch=32768 --env=${env} --filter=1 &> ${env}_default32768.log
25 |
--------------------------------------------------------------------------------
/modular_rl/run_multiple.sh:
--------------------------------------------------------------------------------
1 | #!/bin/bash
2 |
3 |
4 | die() { echo "$@" 1>&2 ; exit 1; }
5 |
6 | # TODO: make this proper usage script
7 |
8 | if [ $# -lt 4 ]
9 | then
10 |     die "Usage: bash $0 num_experiments num_experiments_in_parallel log_prefix run_script [--all --other --args --to --run_script]"
11 | fi
12 |
13 | # check whether the user supplied -h or --help; if so, display usage
14 | if [[ "$1" == "--help" || "$1" == "-h" ]]
15 | then
16 |     die "Usage: bash $0 num_experiments num_experiments_in_parallel log_prefix run_script [--all --other --args --to --run_script]"
17 | fi
18 |
19 | num_experiments=$1
20 | parallel_exps=$2
21 | log_prefix=$3
22 | run_script=$4
23 |
24 | pickle_files=()
25 |
26 | mkdir -p ./$log_prefix/
27 |
28 | trap 'jobs -p | xargs kill' EXIT
29 |
30 | for (( c=1; c<=$num_experiments; ))
31 | do
32 | for (( j=1; j<=$parallel_exps; j++ ))
33 | do
34 | echo "Launching experiment $c"
35 | mkdir -p ./$log_prefix/exp_$c/
36 | python $run_script --seed $c --outfile ./$log_prefix/exp_$c/ "${@:5}" &> ./$log_prefix/exp_$c.log &
37 | #pickle_files=("${pickle_files[@]}" "exp_$c.pickle")
38 | c=$((c+1))
39 | done
40 | wait
41 | done
42 |
43 | #python create_graphs_from_pickle.py "${pickle_files[@]}"
44 |
45 |
--------------------------------------------------------------------------------
/modular_rl/run_policy_experiment_set.sh:
--------------------------------------------------------------------------------
1 | #!/bin/bash
2 |
3 |
4 | die() { echo "$@" 1>&2 ; exit 1; }
5 |
6 | # TODO: make this proper usage script
7 |
8 | # check whether the user supplied -h or --help; if so, display usage
9 | if [[ "$1" == "--help" || "$1" == "-h" ]]
10 | then
11 |     die "Usage: bash $0 env_name"
12 | fi
13 |
14 | env=$1
15 |
16 |
17 | # default (6464tanh)
18 | # activations
19 | KERAS_BACKEND=theano bash run_multiple.sh 5 1 ${env}_tanh_default run_pg.py --gamma=0.995 --lam=0.97 --agent=modular_rl.agentzoo.TrpoAgent --max_kl=0.01 --cg_damping=0.1 --activation=tanh --n_iter=100 --timesteps_per_batch=20000 --env=${env} --filter=1 &> ${env}_default.log
20 | KERAS_BACKEND=theano bash run_multiple.sh 5 1 ${env}_vfrelu run_pg.py --gamma=0.995 --lam=0.97 --agent=modular_rl.agentzoo.TrpoAgent --max_kl=0.01 --cg_damping=0.1 --activation_vf=relu --n_iter=100 --timesteps_per_batch=20000 --env=${env} --filter=1 &> ${env}_vfrelu.log
21 | KERAS_BACKEND=theano bash run_multiple.sh 5 1 ${env}_polrelu run_pg.py --gamma=0.995 --lam=0.97 --agent=modular_rl.agentzoo.TrpoAgent --max_kl=0.01 --cg_damping=0.1 --activation=relu --n_iter=100 --timesteps_per_batch=20000 --env=${env} --filter=1 &> ${env}_relu.log
22 | KERAS_BACKEND=theano bash run_multiple.sh 5 1 ${env}_tanh1005025 run_pg.py --gamma=0.995 --lam=0.97 --agent=modular_rl.agentzoo.TrpoAgent --max_kl=0.01 --cg_damping=0.1 --activation=tanh --n_iter=100 --timesteps_per_batch=20000 --hid_sizes "100,50,25" --env=${env} --filter=1 &> ${env}_1005025.log
23 | KERAS_BACKEND=theano bash run_multiple.sh 5 1 ${env}_400300 run_pg.py --gamma=0.995 --lam=0.97 --agent=modular_rl.agentzoo.TrpoAgent --max_kl=0.01 --cg_damping=0.1 --activation=tanh --n_iter=100 --timesteps_per_batch=20000 --hid_sizes "400,300" --env=${env} --filter=1 &> ${env}_400300.log
24 | KERAS_BACKEND=theano bash run_multiple.sh 5 1 ${env}_400300vf run_pg.py --gamma=0.995 --lam=0.97 --agent=modular_rl.agentzoo.TrpoAgent --max_kl=0.01 --cg_damping=0.1 --activation=tanh --n_iter=100 --timesteps_per_batch=20000 --hid_sizes_vf "400,300" --env=${env} --filter=1 &> ${env}_400300vf.log
25 | KERAS_BACKEND=theano bash run_multiple.sh 5 1 ${env}_1005025vf run_pg.py --gamma=0.995 --lam=0.97 --agent=modular_rl.agentzoo.TrpoAgent --max_kl=0.01 --cg_damping=0.1 --activation=tanh --n_iter=100 --timesteps_per_batch=20000 --hid_sizes_vf "100,50,25" --env=${env} --filter=1 &> ${env}_1005025vf.log
26 |
--------------------------------------------------------------------------------
/rllab/run_ddpg_rllab.py:
--------------------------------------------------------------------------------
1 | import argparse
2 | import os.path as osp
3 | import pickle
4 |
5 | import tensorflow as tf
6 |
7 | from rllab.envs.gym_env import GymEnv
8 | from rllab.envs.normalized_env import normalize
9 | from rllab.exploration_strategies.ou_strategy import OUStrategy
10 | from rllab.misc import ext
11 | from rllab.misc.instrument import run_experiment_lite, stub
12 | from rllab.algos.ddpg import DDPG
13 | from rllab.exploration_strategies.ou_strategy import OUStrategy
14 | from rllab.policies.deterministic_mlp_policy import DeterministicMLPPolicy
15 | from rllab.q_functions.continuous_mlp_q_function import ContinuousMLPQFunction
16 | import lasagne.nonlinearities as NL
17 |
18 |
19 | from sandbox.rocky.tf.misc.tensor_utils import lrelu
20 |
21 | parser = argparse.ArgumentParser()
22 | parser.add_argument("env", help="The environment name from OpenAIGym environments")
23 | parser.add_argument("--num_epochs", default=200, type=int)
24 | parser.add_argument("--log_dir", default="./data_ddpg/")
25 | parser.add_argument("--reward_scale", default=1.0, type=float)
26 | parser.add_argument("--use_ec2", action="store_true", help="Use your ec2 instances if configured")
27 | parser.add_argument("--dont_terminate_machine", action="store_false", help="Whether to terminate your spot instance or not. Be careful.")
28 | parser.add_argument("--policy_size", default=[100,50,25], type=int, nargs='*')
29 | parser.add_argument("--policy_activation", default="relu", type=str)
30 | parser.add_argument("--vf_size", default=[100,50,25], type=int, nargs='*')
31 | parser.add_argument("--vf_activation", default="relu", type=str)
32 | parser.add_argument("--seed", type=int, default=0)
33 | args = parser.parse_args()
34 |
35 | stub(globals())
36 | ext.set_seed(args.seed)
37 |
38 | gymenv = GymEnv(args.env, force_reset=True, record_video=False, record_log=False)
39 |
40 | env = normalize(gymenv)
41 |
42 | activation_map = { "relu" : NL.rectify, "tanh" : NL.tanh, "leaky_relu" : NL.leaky_rectify}
43 |
44 | policy = DeterministicMLPPolicy(
45 | env_spec=env.spec,
46 |     # Hidden layer sizes are taken from --policy_size (default: 100, 50, 25).
47 | hidden_sizes=args.policy_size,
48 | hidden_nonlinearity=activation_map[args.policy_activation],
49 | )
50 |
51 | es = OUStrategy(env_spec=env.spec)
52 |
53 | qf = ContinuousMLPQFunction(env_spec=env.spec,
54 | hidden_nonlinearity=activation_map[args.vf_activation],
55 | hidden_sizes=args.vf_size,)
56 |
57 | algo = DDPG(
58 | env=env,
59 | policy=policy,
60 | es=es,
61 | qf=qf,
62 | batch_size=128,
63 | max_path_length=env.horizon,
64 | epoch_length=1000,
65 | min_pool_size=10000,
66 | n_epochs=args.num_epochs,
67 | discount=0.995,
68 | scale_reward=args.reward_scale,
69 | qf_learning_rate=1e-3,
70 | policy_learning_rate=1e-4,
71 | plot=False
72 | )
73 |
74 |
75 | run_experiment_lite(
76 | algo.train(),
77 | log_dir=None if args.use_ec2 else args.log_dir,
78 | # Number of parallel workers for sampling
79 | n_parallel=1,
80 | # Only keep the snapshot parameters for the last iteration
81 | snapshot_mode="last",
82 | # Specifies the seed for the experiment. If this is not provided, a random seed
83 | # will be used
84 | exp_prefix="DDPG_" + args.env,
85 | seed=args.seed,
86 | mode="ec2" if args.use_ec2 else "local",
87 | plot=False,
88 | # dry=True,
89 | terminate_machine=args.dont_terminate_machine,
90 | added_project_directories=[osp.abspath(osp.join(osp.dirname(__file__), '.'))]
91 | )
92 |
--------------------------------------------------------------------------------
/rllab/run_trpo.py:
--------------------------------------------------------------------------------
1 | from rllab.envs.box2d.cartpole_env import CartpoleEnv
2 | from rllab.envs.normalized_env import normalize
3 | from rllab.misc.instrument import stub, run_experiment_lite
4 | from rllab.baselines.linear_feature_baseline import LinearFeatureBaseline
5 | from rllab.envs.gym_env import GymEnv
6 | from sandbox.rocky.tf.misc.tensor_utils import lrelu
7 |
8 | from sandbox.rocky.tf.envs.base import TfEnv
9 | from sandbox.rocky.tf.policies.gaussian_mlp_policy import GaussianMLPPolicy
10 | from sandbox.rocky.tf.algos.trpo import TRPO
11 | from rllab.misc import ext
12 | from sandbox.rocky.tf.optimizers.conjugate_gradient_optimizer import ConjugateGradientOptimizer
13 | from sandbox.rocky.tf.optimizers.conjugate_gradient_optimizer import FiniteDifferenceHvp
14 |
15 | import pickle
16 | import os.path as osp
17 |
18 | import tensorflow as tf
19 |
20 | import argparse
21 | parser = argparse.ArgumentParser()
22 | parser.add_argument("env", help="The environment name from OpenAIGym environments")
23 | parser.add_argument("--num_epochs", default=100, type=int)
24 | parser.add_argument("--batch_size", default=20000, type=int)
25 | parser.add_argument("--step_size", default=0.01, type=float)
26 | parser.add_argument("--reg_coeff", default=0.1, type=float)
27 | parser.add_argument("--gae_lambda", default=.97, type=float)
28 | parser.add_argument("--policy_size", default=[100,50,25], type=int, nargs='*')
29 | parser.add_argument("--log_dir", default="./data/")
30 | parser.add_argument("--use_ec2", action="store_true", help="Use your ec2 instances if configured")
31 | parser.add_argument("--dont_terminate_machine", action="store_false", help="Whether to terminate your spot instance or not. Be careful.")
32 | parser.add_argument("--activation", default="relu", type=str)
33 | parser.add_argument("--seed", default=1, type=int)
34 | args = parser.parse_args()
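# Example invocation (illustrative values only):
#   python run_trpo.py HalfCheetah-v1 --num_epochs 100 --batch_size 20000 --policy_size 64 64 --activation tanh --seed 1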
35 |
36 | stub(globals())
37 | ext.set_seed(args.seed)
38 |
39 | supported_gym_envs = ["MountainCarContinuous-v0", "InvertedPendulum-v1", "InvertedDoublePendulum-v1", "Hopper-v1", "Walker2d-v1", "Humanoid-v1", "Reacher-v1", "HalfCheetah-v1", "Swimmer-v1", "HumanoidStandup-v1"]
40 |
41 | other_env_class_map = { "Cartpole" : CartpoleEnv}
42 |
43 | activation_map = { "relu" : tf.nn.relu, "tanh" : tf.nn.tanh, "leaky_relu" : lrelu}
44 |
45 | if args.env in supported_gym_envs:
46 | gymenv = GymEnv(args.env, force_reset=True, record_video=False, record_log=False)
47 | else:
48 | gymenv = other_env_class_map[args.env]()
49 |
50 | #TODO: assert continuous space
51 |
52 | env = TfEnv(normalize(gymenv))
53 |
54 | print("Using network arch: %s" % ", ".join([str(x) for x in args.policy_size]))
55 |
56 | policy = GaussianMLPPolicy(
57 | name="policy",
58 | env_spec=env.spec,
59 |     # Hidden layer sizes are taken from --policy_size (default: 100, 50, 25).
60 | hidden_sizes=tuple([int(x) for x in args.policy_size]),
61 | hidden_nonlinearity=activation_map[args.activation],
62 | )
63 |
64 | baseline = LinearFeatureBaseline(env_spec=env.spec)
65 |
66 | algo = TRPO(
67 | env=env,
68 | policy=policy,
69 | baseline=baseline,
70 | batch_size=args.batch_size,
71 | max_path_length=env.horizon,
72 | n_itr=args.num_epochs,
73 | discount=0.99,
74 | step_size=args.step_size,
75 | gae_lambda=args.gae_lambda,
76 | optimizer=ConjugateGradientOptimizer(reg_coeff=args.reg_coeff)
77 | )
78 |
79 | arch_name="_".join([str(x) for x in args.policy_size])
80 | pref = "TRPO_" + args.env + "_bs_" + str(args.batch_size) + "_sp_" + str(args.step_size) + "_regc_" + str(args.reg_coeff) + "_gael_" + str(args.gae_lambda) + "_na_" + arch_name + "_seed_" + str(args.seed)
81 | pref = pref.replace(".", "_")
82 | print("Using prefix %s" % pref)
83 |
84 | run_experiment_lite(
85 | algo.train(),
86 | log_dir=None if args.use_ec2 else args.log_dir,
87 | # Number of parallel workers for sampling
88 | n_parallel=1,
89 | # Only keep the snapshot parameters for the last iteration
90 | snapshot_mode="none",
91 | # Specifies the seed for the experiment. If this is not provided, a random seed
92 | # will be used
93 | exp_prefix=pref,
94 | seed=args.seed,
95 | use_gpu=False,
96 | mode="ec2" if args.use_ec2 else "local",
97 | plot=False,
98 | # dry=True,
99 | terminate_machine=args.dont_terminate_machine,
100 | added_project_directories=[osp.abspath(osp.join(osp.dirname(__file__), '.'))]
101 | )
102 |
--------------------------------------------------------------------------------
/rllabplusplus/ddpg/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Breakend/DeepReinforcementLearningThatMatters/1f4ea363623854b8072149cb404ee919d9042e79/rllabplusplus/ddpg/__init__.py
--------------------------------------------------------------------------------
/rllabplusplus/ddpg/ddpg.py:
--------------------------------------------------------------------------------
1 | # MODIFIED FROM: https://raw.githubusercontent.com/shaneshixiang/rllabplusplus/master/sandbox/rocky/tf/algos/ddpg.py
2 | import gc
3 | import time
4 |
5 | #import pickle as pickle
6 | import numpy as np
7 | import tensorflow as tf
8 |
9 | import pyprind
10 | import rllab.misc.logger as logger
11 | from rllab.algos.base import RLAlgorithm
12 | from rllab.core.serializable import Serializable
13 | from rllab.misc import ext, special
14 | from rllab.misc.overrides import overrides
15 | from rllab.plotter import plotter
16 | from rllab.sampler import parallel_sampler
17 | from sampling_utils import SimpleReplayPool
18 | from sandbox.rocky.tf.misc import tensor_utils
19 | from sandbox.rocky.tf.optimizers.first_order_optimizer import \
20 | FirstOrderOptimizer
21 |
22 |
23 | class DDPG(RLAlgorithm):
24 | """
25 | Deep Deterministic Policy Gradient.
26 | """
27 |
28 | def __init__(
29 | self,
30 | env,
31 | policy,
32 | qf,
33 | es,
34 | batch_size=32,
35 | n_epochs=200,
36 | epoch_length=1000,
37 | min_pool_size=10000,
38 | replay_pool_size=1000000,
39 | replacement_prob=1.0,
40 | discount=0.99,
41 | max_path_length=250,
42 | qf_weight_decay=0.,
43 | qf_update_method='adam',
44 | qf_learning_rate=1e-3,
45 | policy_weight_decay=0,
46 | policy_update_method='adam',
47 | policy_learning_rate=1e-3,
48 | policy_updates_ratio=1.0,
49 | eval_samples=10000,
50 | soft_target=True,
51 | soft_target_tau=0.001,
52 | n_updates_per_sample=1,
53 | scale_reward=1.0,
54 | include_horizon_terminal_transitions=False,
55 | plot=False,
56 | pause_for_plot=False,
57 | **kwargs):
58 | """
59 | :param env: Environment
60 | :param policy: Policy
61 | :param qf: Q function
62 | :param es: Exploration strategy
63 | :param batch_size: Number of samples for each minibatch.
64 | :param n_epochs: Number of epochs. Policy will be evaluated after each epoch.
65 | :param epoch_length: How many timesteps for each epoch.
66 | :param min_pool_size: Minimum size of the pool to start training.
67 | :param replay_pool_size: Size of the experience replay pool.
68 | :param discount: Discount factor for the cumulative return.
69 |         :param max_path_length: Maximum length of a single rollout (path).
70 | :param qf_weight_decay: Weight decay factor for parameters of the Q function.
71 | :param qf_update_method: Online optimization method for training Q function.
72 | :param qf_learning_rate: Learning rate for training Q function.
73 | :param policy_weight_decay: Weight decay factor for parameters of the policy.
74 | :param policy_update_method: Online optimization method for training the policy.
75 | :param policy_learning_rate: Learning rate for training the policy.
76 | :param eval_samples: Number of samples (timesteps) for evaluating the policy.
77 | :param soft_target_tau: Interpolation parameter for doing the soft target update.
78 | :param n_updates_per_sample: Number of Q function and policy updates per new sample obtained
79 | :param scale_reward: The scaling factor applied to the rewards when training
80 | :param include_horizon_terminal_transitions: whether to include transitions with terminal=True because the
81 | horizon was reached. This might make the Q value back up less stable for certain tasks.
82 | :param plot: Whether to visualize the policy performance after each eval_interval.
83 | :param pause_for_plot: Whether to pause before continuing when plotting.
84 | :return:
85 | """
86 | self.env = env
87 | self.policy = policy
88 | self.qf = qf
89 | self.es = es
90 | self.batch_size = batch_size
91 | self.n_epochs = n_epochs
92 | self.epoch_length = epoch_length
93 | self.min_pool_size = min_pool_size
94 | self.replay_pool_size = replay_pool_size
95 | self.replacement_prob = replacement_prob
96 | self.discount = discount
97 | self.max_path_length = max_path_length
98 | self.qf_weight_decay = qf_weight_decay
99 | self.qf_update_method = \
100 | FirstOrderOptimizer(
101 | update_method=qf_update_method,
102 | learning_rate=qf_learning_rate,
103 | )
104 | self.qf_learning_rate = qf_learning_rate
105 | self.policy_weight_decay = policy_weight_decay
106 | self.policy_update_method = \
107 | FirstOrderOptimizer(
108 | update_method=policy_update_method,
109 | learning_rate=policy_learning_rate,
110 | )
111 | self.policy_learning_rate = policy_learning_rate
112 | self.policy_updates_ratio = policy_updates_ratio
113 | self.eval_samples = eval_samples
114 | self.soft_target_tau = soft_target_tau
115 | self.n_updates_per_sample = n_updates_per_sample
116 | self.include_horizon_terminal_transitions = include_horizon_terminal_transitions
117 | self.plot = plot
118 | self.pause_for_plot = pause_for_plot
119 |
120 | self.qf_loss_averages = []
121 | self.policy_surr_averages = []
122 | self.q_averages = []
123 | self.y_averages = []
124 | self.paths = []
125 | self.es_path_returns = []
126 | self.paths_samples_cnt = 0
127 |
128 | self.scale_reward = scale_reward
129 |
130 | self.train_policy_itr = 0
131 |
132 | self.opt_info = None
133 |
134 | def start_worker(self):
135 | parallel_sampler.populate_task(self.env, self.policy)
136 | if self.plot:
137 | plotter.init_plot(self.env, self.policy)
138 |
139 | @overrides
140 | def train(self):
141 | gc_dump_time = time.time()
142 | with tf.Session() as sess:
143 | sess.run(tf.global_variables_initializer())
144 | # This seems like a rather sequential method
145 | pool = SimpleReplayPool(
146 | max_pool_size=self.replay_pool_size,
147 | observation_dim=self.env.observation_space.flat_dim,
148 | action_dim=self.env.action_space.flat_dim,
149 | replacement_prob=self.replacement_prob,
150 | )
151 | self.start_worker()
152 |
153 | self.init_opt()
154 | # This initializes the optimizer parameters
155 | sess.run(tf.global_variables_initializer())
156 | itr = 0
157 | path_length = 0
158 | path_return = 0
159 | terminal = False
160 | initial = False
161 | observation = self.env.reset()
162 |
163 | with tf.variable_scope("sample_policy"):
164 | sample_policy = Serializable.clone(self.policy)
165 |
166 | for epoch in range(self.n_epochs):
167 | logger.push_prefix('epoch #%d | ' % epoch)
168 | logger.log("Training started")
169 | train_qf_itr, train_policy_itr = 0, 0
170 | for epoch_itr in pyprind.prog_bar(range(self.epoch_length)):
171 | # Execute policy
172 | if terminal:
173 | # Note that if the last time step ends an episode, the very
174 | # last state and observation will be ignored and not added
175 | # to the replay pool
176 | observation = self.env.reset()
177 | sample_policy.reset()
178 | self.es_path_returns.append(path_return)
179 | path_length = 0
180 | path_return = 0
181 | initial = True
182 | else:
183 | initial = False
184 | action = self.es.get_action(itr, observation, policy=sample_policy) # qf=qf)
185 |
186 | next_observation, reward, terminal, _ = self.env.step(action)
187 | path_length += 1
188 | path_return += reward
189 |
190 | if not terminal and path_length >= self.max_path_length:
191 | terminal = True
192 | # only include the terminal transition in this case if the flag was set
193 | if self.include_horizon_terminal_transitions:
194 | pool.add_sample(observation, action, reward * self.scale_reward, terminal, initial)
195 | else:
196 | pool.add_sample(observation, action, reward * self.scale_reward, terminal, initial)
197 |
198 | observation = next_observation
199 |
200 | if pool.size >= self.min_pool_size:
201 | for update_itr in range(self.n_updates_per_sample):
202 | # Train policy
203 | batch = pool.random_batch(self.batch_size)
204 | itrs = self.do_training(itr, batch)
205 | train_qf_itr += itrs[0]
206 | train_policy_itr += itrs[1]
207 | sample_policy.set_param_values(self.policy.get_param_values())
208 |
209 | itr += 1
210 | if time.time() - gc_dump_time > 100:
211 | gc.collect()
212 | gc_dump_time = time.time()
213 |
214 | logger.log("Training finished")
215 | logger.log("Trained qf %d steps, policy %d steps"%(train_qf_itr, train_policy_itr))
216 | if pool.size >= self.min_pool_size:
217 | self.evaluate(epoch, pool)
218 | params = self.get_epoch_snapshot(epoch)
219 | logger.save_itr_params(epoch, params)
220 | logger.dump_tabular(with_prefix=False)
221 | logger.pop_prefix()
222 | if self.plot:
223 | self.update_plot()
224 | if self.pause_for_plot:
225 | input("Plotting evaluation run: Press Enter to "
226 | "continue...")
227 | self.env.terminate()
228 | self.policy.terminate()
229 |
230 | def init_opt(self):
231 |
232 | # First, create "target" policy and Q functions
233 | with tf.variable_scope("target_policy"):
234 | target_policy = Serializable.clone(self.policy)
235 | with tf.variable_scope("target_qf"):
236 | target_qf = Serializable.clone(self.qf)
237 |
238 | # y need to be computed first
239 | obs = self.env.observation_space.new_tensor_variable(
240 | 'obs',
241 | extra_dims=1,
242 | )
243 |
244 | # The yi values are computed separately as above and then passed to
245 | # the training functions below
246 | action = self.env.action_space.new_tensor_variable(
247 | 'action',
248 | extra_dims=1,
249 | )
250 |
251 | yvar = tensor_utils.new_tensor(
252 | 'ys',
253 | ndim=1,
254 | dtype=tf.float32,
255 | )
256 |
257 | qf_weight_decay_term = 0.5 * self.qf_weight_decay * \
258 | sum([tf.reduce_sum(tf.square(param)) for param in
259 | self.qf.get_params(regularizable=True)])
260 |
261 | qval = self.qf.get_qval_sym(obs, action)
262 |
263 | qf_loss = tf.reduce_mean(tf.square(yvar - qval))
264 | qf_reg_loss = qf_loss + qf_weight_decay_term
265 |
266 | policy_weight_decay_term = 0.5 * self.policy_weight_decay * \
267 | sum([tf.reduce_sum(tf.square(param))
268 | for param in self.policy.get_params(regularizable=True)])
269 | policy_qval = self.qf.get_qval_sym(
270 | obs, self.policy.get_action_sym(obs),
271 | deterministic=True
272 | )
273 | policy_surr = -tf.reduce_mean(policy_qval)
274 |
275 | policy_reg_surr = policy_surr + policy_weight_decay_term
276 |
277 | qf_input_list = [yvar, obs, action]
278 | policy_input_list = [obs]
279 |
280 | self.qf_update_method.update_opt(
281 | loss=qf_reg_loss, target=self.qf, inputs=qf_input_list)
282 | self.policy_update_method.update_opt(
283 | loss=policy_reg_surr, target=self.policy, inputs=policy_input_list)
284 |
285 | f_train_qf = tensor_utils.compile_function(
286 | inputs=qf_input_list,
287 | outputs=[qf_loss, qval, self.qf_update_method._train_op],
288 | )
289 |
290 | f_train_policy = tensor_utils.compile_function(
291 | inputs=policy_input_list,
292 | outputs=[policy_surr, self.policy_update_method._train_op],
293 | )
294 |
295 | self.opt_info = dict(
296 | f_train_qf=f_train_qf,
297 | f_train_policy=f_train_policy,
298 | target_qf=target_qf,
299 | target_policy=target_policy,
300 | )
301 |
302 | def do_training(self, itr, batch):
303 |
304 | obs, actions, rewards, next_obs, terminals = ext.extract(
305 | batch,
306 | "observations", "actions", "rewards", "next_observations",
307 | "terminals"
308 | )
309 |
310 | # compute the on-policy y values
311 | target_qf = self.opt_info["target_qf"]
312 | target_policy = self.opt_info["target_policy"]
313 |
314 | next_actions, _ = target_policy.get_actions(next_obs)
315 | next_qvals = target_qf.get_qval(next_obs, next_actions)
316 |
317 | ys = rewards + (1. - terminals) * self.discount * next_qvals.reshape(-1)
318 |
319 | f_train_qf = self.opt_info["f_train_qf"]
320 | qf_loss, qval, _ = f_train_qf(ys, obs, actions)
321 | target_qf.set_param_values(
322 | target_qf.get_param_values() * (1.0 - self.soft_target_tau) +
323 | self.qf.get_param_values() * self.soft_target_tau)
324 | self.qf_loss_averages.append(qf_loss)
325 | self.q_averages.append(qval)
326 | self.y_averages.append(ys)
327 |
328 | self.train_policy_itr += self.policy_updates_ratio
329 | train_policy_itr = 0
330 | while self.train_policy_itr > 0:
331 | f_train_policy = self.opt_info["f_train_policy"]
332 | policy_surr, _ = f_train_policy(obs)
333 | target_policy.set_param_values(
334 | target_policy.get_param_values() * (1.0 - self.soft_target_tau) +
335 | self.policy.get_param_values() * self.soft_target_tau)
336 | self.policy_surr_averages.append(policy_surr)
337 | self.train_policy_itr -= 1
338 | train_policy_itr += 1
339 | return 1, train_policy_itr # number of itrs qf, policy are trained
340 |
341 | def evaluate(self, epoch, pool):
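        # With the settings used in run_ddpg.py (default eval_samples=10000 and
        # max_path_length=env.horizon, typically 1000 for the MuJoCo tasks), this
        # collects roughly 10 full evaluation trajectories with the learned policy
        # (no exploration noise), matching the protocol described in the README.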
342 | logger.log("Collecting samples for evaluation")
343 |
344 | paths = parallel_sampler.sample_paths(
345 | policy_params=self.policy.get_param_values(),
346 | max_samples=self.eval_samples,
347 | max_path_length=self.max_path_length,
348 | )
349 |
350 | self.env.reset()
351 |
352 | average_discounted_return = np.mean(
353 | [special.discount_return(path["rewards"], self.discount) for path in paths]
354 | )
355 |
356 | returns = [sum(path["rewards"]) for path in paths]
357 |
358 | all_qs = np.concatenate(self.q_averages)
359 | all_ys = np.concatenate(self.y_averages)
360 |
361 | average_q_loss = np.mean(self.qf_loss_averages)
362 | average_policy_surr = np.mean(self.policy_surr_averages)
363 | average_action = np.mean(np.square(np.concatenate(
364 | [path["actions"] for path in paths]
365 | )))
366 |
367 | policy_reg_param_norm = np.linalg.norm(
368 | self.policy.get_param_values(regularizable=True)
369 | )
370 | qfun_reg_param_norm = np.linalg.norm(
371 | self.qf.get_param_values(regularizable=True)
372 | )
373 |
374 | logger.record_tabular('Epoch', epoch)
375 | logger.record_tabular('Iteration', epoch)
376 | logger.record_tabular('AverageReturn', np.mean(returns))
377 | logger.record_tabular('StdReturn',
378 | np.std(returns))
379 | logger.record_tabular('MaxReturn',
380 | np.max(returns))
381 | logger.record_tabular('MinReturn',
382 | np.min(returns))
383 | if len(self.es_path_returns) > 0:
384 | logger.record_tabular('AverageEsReturn',
385 | np.mean(self.es_path_returns))
386 | logger.record_tabular('StdEsReturn',
387 | np.std(self.es_path_returns))
388 | logger.record_tabular('MaxEsReturn',
389 | np.max(self.es_path_returns))
390 | logger.record_tabular('MinEsReturn',
391 | np.min(self.es_path_returns))
392 | logger.record_tabular('AverageDiscountedReturn',
393 | average_discounted_return)
394 | logger.record_tabular('AverageQLoss', average_q_loss)
395 | logger.record_tabular('AveragePolicySurr', average_policy_surr)
396 | logger.record_tabular('AverageQ', np.mean(all_qs))
397 | logger.record_tabular('AverageAbsQ', np.mean(np.abs(all_qs)))
398 | logger.record_tabular('AverageY', np.mean(all_ys))
399 | logger.record_tabular('AverageAbsY', np.mean(np.abs(all_ys)))
400 | logger.record_tabular('AverageAbsQYDiff',
401 | np.mean(np.abs(all_qs - all_ys)))
402 | logger.record_tabular('AverageAction', average_action)
403 |
404 | logger.record_tabular('PolicyRegParamNorm',
405 | policy_reg_param_norm)
406 | logger.record_tabular('QFunRegParamNorm',
407 | qfun_reg_param_norm)
408 |
409 | self.env.log_diagnostics(paths)
410 | self.policy.log_diagnostics(paths)
411 |
412 | self.qf_loss_averages = []
413 | self.policy_surr_averages = []
414 |
415 | self.q_averages = []
416 | self.y_averages = []
417 | self.es_path_returns = []
418 |
419 | def update_plot(self):
420 | if self.plot:
421 | plotter.update_plot(self.policy, self.max_path_length)
422 |
423 | def get_epoch_snapshot(self, epoch):
424 | return dict(
425 | env=self.env,
426 | epoch=epoch,
427 | qf=self.qf,
428 | policy=self.policy,
429 | target_qf=self.opt_info["target_qf"],
430 | target_policy=self.opt_info["target_policy"],
431 | es=self.es,
432 | )
433 |
--------------------------------------------------------------------------------
/rllabplusplus/run_ddpg.py:
--------------------------------------------------------------------------------
1 | import argparse
2 | import os.path as osp
3 | import pickle
4 |
5 | import tensorflow as tf
6 |
7 | from ddpg.ddpg import DDPG
8 | from rllab.envs.gym_env import GymEnv
9 | from rllab.envs.normalized_env import normalize
10 | from rllab.exploration_strategies.ou_strategy import OUStrategy
11 | from rllab.misc import ext
12 | from rllab.misc.instrument import run_experiment_lite, stub
13 | from sandbox.rocky.tf.envs.base import TfEnv
14 | from sandbox.rocky.tf.policies.deterministic_mlp_policy import \
15 | DeterministicMLPPolicy
16 | from sandbox.rocky.tf.q_functions.continuous_mlp_q_function import \
17 | ContinuousMLPQFunction
18 | from sandbox.rocky.tf.misc.tensor_utils import lrelu
19 |
20 | parser = argparse.ArgumentParser()
21 | parser.add_argument("env", help="The environment name from OpenAIGym environments")
22 | parser.add_argument("--num_epochs", default=200, type=int)
23 | parser.add_argument("--log_dir", default="./data_ddpg/")
24 | parser.add_argument("--reward_scale", default=1.0, type=float)
25 | parser.add_argument("--use_ec2", action="store_true", help="Use your ec2 instances if configured")
26 | parser.add_argument("--dont_terminate_machine", action="store_false", help="Whether to terminate your spot instance or not. Be careful.")
27 | parser.add_argument("--policy_size", default=[100,50,25], type=int, nargs='*')
28 | parser.add_argument("--policy_activation", default="relu", type=str)
29 | parser.add_argument("--vf_size", default=[100,50,25], type=int, nargs='*')
30 | parser.add_argument("--vf_activation", default="relu", type=str)
31 | parser.add_argument("--seed", type=int, default=0)
32 | args = parser.parse_args()
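# Example invocation (illustrative values only):
#   python run_ddpg.py Hopper-v1 --num_epochs 200 --policy_size 400 300 --vf_size 400 300 --reward_scale 0.1 --seed 1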
33 |
34 | stub(globals())
35 | ext.set_seed(args.seed)
36 |
37 | gymenv = GymEnv(args.env, force_reset=True, record_video=False, record_log=False)
38 |
39 | env = TfEnv(normalize(gymenv))
40 |
41 | activation_map = { "relu" : tf.nn.relu, "tanh" : tf.nn.tanh, "leaky_relu" : lrelu}
42 |
43 | policy = DeterministicMLPPolicy(
44 | env_spec=env.spec,
45 | name="policy",
46 |     # Hidden layer sizes are taken from --policy_size (default: 100, 50, 25).
47 | hidden_sizes=args.policy_size,
48 | hidden_nonlinearity=activation_map[args.policy_activation],
49 | )
50 |
51 | es = OUStrategy(env_spec=env.spec)
52 |
53 | qf = ContinuousMLPQFunction(env_spec=env.spec,
54 | hidden_nonlinearity=activation_map[args.vf_activation],
55 | hidden_sizes=args.vf_size,)
56 |
57 | algo = DDPG(
58 | env=env,
59 | policy=policy,
60 | es=es,
61 | qf=qf,
62 | batch_size=128,
63 | max_path_length=env.horizon,
64 | epoch_length=1000,
65 | min_pool_size=10000,
66 | n_epochs=args.num_epochs,
67 | discount=0.995,
68 | scale_reward=args.reward_scale,
69 | qf_learning_rate=1e-3,
70 | policy_learning_rate=1e-4,
71 | plot=False
72 | )
73 |
74 |
75 | run_experiment_lite(
76 | algo.train(),
77 | log_dir=None if args.use_ec2 else args.log_dir,
78 | # Number of parallel workers for sampling
79 | n_parallel=1,
80 | # Only keep the snapshot parameters for the last iteration
81 | snapshot_mode="last",
82 | # Specifies the seed for the experiment. If this is not provided, a random seed
83 | # will be used
84 | exp_prefix="DDPG_" + args.env,
85 | seed=args.seed,
86 | mode="ec2" if args.use_ec2 else "local",
87 | plot=False,
88 | # dry=True,
89 | terminate_machine=args.dont_terminate_machine,
90 | added_project_directories=[osp.abspath(osp.join(osp.dirname(__file__), '.'))]
91 | )
92 |
--------------------------------------------------------------------------------
/rllabplusplus/sampling_utils.py:
--------------------------------------------------------------------------------
1 | import numpy as np
2 | import rllab.misc.logger as logger
3 |
4 | class SimpleReplayPool(object):
5 | """
6 | Used from https://raw.githubusercontent.com/shaneshixiang/rllabplusplus/master/rllab/pool/simple_pool.py
7 | """
8 | def __init__(
9 | self, max_pool_size, observation_dim, action_dim,
10 | replacement_policy='stochastic', replacement_prob=1.0,
11 | max_skip_episode=10):
12 | self._observation_dim = observation_dim
13 | self._action_dim = action_dim
14 | self._max_pool_size = max_pool_size
15 | self._replacement_policy = replacement_policy
16 | self._replacement_prob = replacement_prob
17 | self._max_skip_episode = max_skip_episode
18 | self._observations = np.zeros(
19 | (max_pool_size, observation_dim),
20 | )
21 | self._actions = np.zeros(
22 | (max_pool_size, action_dim),
23 | )
24 | self._rewards = np.zeros(max_pool_size)
25 | self._terminals = np.zeros(max_pool_size, dtype='uint8')
26 | self._initials = np.zeros(max_pool_size, dtype='uint8')
27 | self._bottom = 0
28 | self._top = 0
29 | self._size = 0
30 |
31 | def add_sample(self, observation, action, reward, terminal, initial):
32 | self.check_replacement()
33 | self._observations[self._top] = observation
34 | self._actions[self._top] = action
35 | self._rewards[self._top] = reward
36 | self._terminals[self._top] = terminal
37 | self._initials[self._top] = initial
38 | self.advance()
39 |
40 | def check_replacement(self):
41 | if self._replacement_prob < 1.0:
42 | if self._size < self._max_pool_size or \
43 | not self._initials[self._top]: return
44 | self.advance_until_terminate()
45 |
46 | def get_skip_flag(self):
47 | if self._replacement_policy == 'full': skip = False
48 | elif self._replacement_policy == 'stochastic':
49 | skip = np.random.uniform() > self._replacement_prob
50 | else: raise NotImplementedError
51 | return skip
52 |
53 | def advance_until_terminate(self):
54 | skip = self.get_skip_flag()
55 | n_skips = 0
56 | old_top = self._top
57 | new_top = (old_top + 1) % self._max_pool_size
58 | while skip and old_top != new_top and n_skips < self._max_skip_episode:
59 | n_skips += 1
60 | self.advance()
61 | while not self._initials[self._top]:
62 | self.advance()
63 | skip = self.get_skip_flag()
64 | new_top = self._top
65 | logger.log("add_sample, skipped %d episodes, top=%d->%d"%(
66 | n_skips, old_top, new_top))
67 |
68 | def advance(self):
69 | self._top = (self._top + 1) % self._max_pool_size
70 | if self._size >= self._max_pool_size:
71 | self._bottom = (self._bottom + 1) % self._max_pool_size
72 | else:
73 | self._size += 1
74 |
75 | def random_batch(self, batch_size):
76 | assert self._size > batch_size
77 | indices = np.zeros(batch_size, dtype='uint64')
78 | transition_indices = np.zeros(batch_size, dtype='uint64')
79 | count = 0
80 | while count < batch_size:
81 | index = np.random.randint(self._bottom, self._bottom + self._size) % self._max_pool_size
82 | # make sure that the transition is valid: if we are at the end of the pool, we need to discard
83 | # this sample
84 | if index == self._size - 1 and self._size <= self._max_pool_size:
85 | continue
86 |
87 | transition_index = (index + 1) % self._max_pool_size
88 |
89 | # make sure that the transition is valid: discard the transition if it crosses horizon-triggered resets
90 | if not self._terminals[index] and self._initials[transition_index]:
91 | continue
92 | indices[count] = index
93 | transition_indices[count] = transition_index
94 | count += 1
95 | return dict(
96 | observations=self._observations[indices],
97 | actions=self._actions[indices],
98 | rewards=self._rewards[indices],
99 | terminals=self._terminals[indices],
100 | initials=self._initials[indices],
101 | next_observations=self._observations[transition_indices]
102 | )
103 |
104 | @property
105 | def size(self):
106 | return self._size
107 |
--------------------------------------------------------------------------------
/tools/plot_results.py:
--------------------------------------------------------------------------------
1 | import matplotlib.pyplot as plt
2 | from scipy.ndimage.filters import uniform_filter1d
3 | import time
4 | import numpy as np
5 | import pandas as pd
6 | from itertools import cycle
7 | import argparse
8 |
9 | from numpy import genfromtxt
10 | from numpy.random import choice
11 |
12 | # Make fonts pretty
13 | plt.rcParams['text.usetex'] = True
14 |
15 | def multiple_plot(average_vals_list, std_dev_list, traj_list, other_labels, env_name, smoothing_window=5, no_show=False, ignore_std=False, climit=None, extra_lines=None):
16 | fig = plt.figure(figsize=(16, 8))
17 | colors = ["#1f77b4", "#ff7f0e", "#d62728", "#9467bd", "#2ca02c", "#8c564b", "#e377c2", "#bcbd22", "#17becf"]
18 | linestyles = ['solid', 'dashed', 'dashdot', 'dotted']
19 | color_index = 0
20 | ax = plt.subplot() # Defines ax variable by creating an empty plot
21 | offset = 1
22 |
23 | # Set the tick labels font
24 | for label in (ax.get_xticklabels() + ax.get_yticklabels()):
25 | label.set_fontname('Arial')
26 | label.set_fontsize(28)
27 | if traj_list is None:
28 | traj_list = [None]*len(average_vals_list)
29 | limit = climit
30 | index = 0
31 | for average_vals, std_dev, label, trajs in zip(average_vals_list, std_dev_list, other_labels[:len(average_vals_list)], traj_list):
32 | index += 1
33 |
34 | if climit is None:
35 | limit = len(average_vals)
36 |
37 | # If we don't want reward smoothing, set smoothing window to size 1
38 | rewards_smoothed_1 = uniform_filter1d(average_vals, size=smoothing_window)[:limit]
39 | std_smoothed_1 = uniform_filter1d(std_dev, size=smoothing_window)[:limit]
40 | rewards_smoothed_1 = rewards_smoothed_1[:limit]
41 | std_dev = std_dev[:limit]
42 |
43 | if trajs is None:
44 | # in this case, we just want to use algorithm iterations, so just take the number of things we have.
45 | trajs = list(range(len(rewards_smoothed_1)))
46 | else:
47 | plt.ticklabel_format(style='sci', axis='x', scilimits=(0,0))
48 | ax.xaxis.get_offset_text().set_fontsize(20)
49 |
50 | fill_color = colors[color_index]
51 | color_index += 1
52 |
53 | cum_rwd_1, = plt.plot(trajs, rewards_smoothed_1, label=label, color=fill_color, ls=linestyles[color_index % len(linestyles)])
54 | offset += 3
55 | if not ignore_std:
56 | # uncomment this to use error bars
57 | #plt.errorbar(trajs[::25 + offset], rewards_smoothed_1[::25 + offset], yerr=std_smoothed_1[::25 + offset], linestyle='None', color=fill_color, capsize=5)
58 | plt.fill_between(trajs, rewards_smoothed_1 + std_smoothed_1, rewards_smoothed_1 - std_smoothed_1, alpha=0.3, edgecolor=fill_color, facecolor=fill_color)
59 |
60 | if extra_lines:
61 | for lin in extra_lines:
62 | plt.plot(trajs, np.repeat(lin, len(rewards_smoothed_1)), linestyle='-.', color = colors[color_index], linewidth=2.5, label=other_labels[index])
63 | color_index += 1
64 | index += 1
65 |
66 | axis_font = {'fontname':'Arial', 'size':'32'}
67 | plt.legend(loc='lower right', prop={'size' : 16})
68 | plt.xlabel("Iterations", **axis_font)
69 | if traj_list is not None and traj_list[0] is not None:
70 | plt.ticklabel_format(style='sci', axis='x', scilimits=(0,0))
71 | ax.xaxis.get_offset_text().set_fontsize(20)
72 | plt.xlabel("Timesteps", **axis_font)
73 | else:
74 | plt.xlabel("Iterations", **axis_font)
75 | plt.ylabel("Average Return", **axis_font)
76 | plt.title("%s"% env_name, **axis_font)
77 |
78 | if no_show:
79 | fig.savefig('%s.pdf' % env_name, dpi=fig.dpi, bbox_inches='tight')
80 | else:
81 | plt.show()
82 |
83 | return fig
84 |
85 | parser = argparse.ArgumentParser()
86 | parser.add_argument("paths_to_progress_csvs", nargs="+", help="All the csvs associated with the data (Need AverageReturn, StdReturn, and TimestepsSoFar columns)")
87 | parser.add_argument("env_name", help= "This is just the title of the plot and the filename." )
88 | parser.add_argument("--save", action="store_true")
89 | parser.add_argument("--ignore_std", action="store_true")
90 | parser.add_argument('--labels', nargs='+', help='List of labels to go along with the lines to plot', required=False)
91 | parser.add_argument('--smoothing_window', default=1, type=int, help="Running average to smooth with, default is 1 (i.e. no smoothing)")
92 | parser.add_argument('--limit', default=None, type=int)
93 | parser.add_argument('--extra_lines', nargs="+", type=float, help="Any extra lines to add on the graph (such as other paper results)")
94 |
95 | args = parser.parse_args()
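# Example invocation (hypothetical paths):
#   python plot_results.py exp_a/progress.csv exp_b/progress.csv HalfCheetah-v1 \
#       --labels "TRPO" "DDPG" --smoothing_window 5 --save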
96 |
97 | avg_rets = []
98 | std_dev_rets = []
99 | trajs = []
100 |
101 | # Read all csv's into arrays
102 | for o in args.paths_to_progress_csvs:
103 | data = pd.read_csv(o)
104 |     avg_ret = np.array(data["AverageReturn"])  # average return across trials
105 |     std_dev_ret = np.array(data["StdReturn"])  # standard deviation of the return across trials
106 | if "total/steps" in data and trajs is not None:
107 | trajs.append(np.array(data["total/steps"]))
108 | elif "TimestepsSoFar" in data and trajs is not None:
109 | trajs.append(np.array(data["TimestepsSoFar"]))
110 | else:
111 | trajs = None
112 | avg_rets.append(avg_ret)
113 | std_dev_rets.append(std_dev_ret)
114 |
115 | # call plotting script
116 | multiple_plot(avg_rets, std_dev_rets, trajs, args.labels, args.env_name, smoothing_window=args.smoothing_window, no_show=args.save, ignore_std=args.ignore_std, climit=args.limit, extra_lines=args.extra_lines)
117 |
--------------------------------------------------------------------------------
/tools/run_multiple.sh:
--------------------------------------------------------------------------------
1 | #!/bin/bash
2 |
3 |
4 | die() { echo "$@" 1>&2 ; exit 1; }
5 |
6 | # TODO: make this proper usage script
7 |
8 | if [ $# -lt 4 ]
9 | then
10 |     die "Usage: bash $0 num_experiments num_experiments_in_parallel log_prefix run_script [--all --other --args --to --run_script]"
11 | fi
12 |
13 | # check whether the user supplied -h or --help; if so, display usage
14 | if [[ "$1" == "--help" || "$1" == "-h" ]]
15 | then
16 |     die "Usage: bash $0 num_experiments num_experiments_in_parallel log_prefix run_script [--all --other --args --to --run_script]"
17 | fi
18 |
19 | num_experiments=$1
20 | parallel_exps=$2
21 | log_prefix=$3
22 | run_script=$4
23 |
24 | pickle_files=()
25 |
26 | mkdir -p ./$log_prefix/
27 |
28 | trap 'jobs -p | xargs kill' EXIT
29 |
30 | for (( c=1; c<=$num_experiments; ))
31 | do
32 | for (( j=1; j<=$parallel_exps; j++ ))
33 | do
34 | echo "Launching experiment $c"
35 | mkdir -p ./$log_prefix/exp_$c/
36 | python3 $run_script --seed $c --log_dir ./$log_prefix/exp_$c/ "${@:5}" &> ./$log_prefix/exp_$c.log &
37 | #pickle_files=("${pickle_files[@]}" "exp_$c.pickle")
38 | c=$((c+1))
39 | done
40 | wait
41 | done
42 |
43 | #python create_graphs_from_pickle.py "${pickle_files[@]}"
44 |
45 |
--------------------------------------------------------------------------------
/tools/run_policy_experiment_set.sh:
--------------------------------------------------------------------------------
1 | #!/bin/bash
2 |
3 |
4 | die() { echo "$@" 1>&2 ; exit 1; }
5 |
6 | # TODO: make this proper usage script
7 |
8 | # check whether the user supplied -h or --help; if so, display usage
9 | if [[ "$1" == "--help" || "$1" == "-h" ]]
10 | then
11 |     die "Usage: bash $0 run_script env_name"
12 | fi
13 |
14 | script=$1
15 | env=$2
16 |
17 |
18 | # default (6464tanh)
19 | # activations
20 | bash run_multiple.sh 5 1 ${env}_default $script ${env} &> ${env}default.log &
21 | bash run_multiple.sh 5 1 ${env}_vfleakrelu $script ${env} --activation_vf leaky_relu &> ${env}vfleakrelu.log
22 | bash run_multiple.sh 5 1 ${env}_vfrelu $script ${env} --activation_vf relu &> ${env}vfrelu.log &
23 | bash run_multiple.sh 5 1 ${env}_policyrelu $script ${env} --activation_policy relu &> ${env}policyrelu.log &
24 | bash run_multiple.sh 5 1 ${env}_policyleakyrelu $script ${env} --activation_policy leaky_relu &> ${env}policyleakyrelu.log
25 |
26 | #
27 | bash run_multiple.sh 5 1 ${env}_400300 $script $env --policy_size 400 300 &> ${env}_400300tanh.log &
28 | bash run_multiple.sh 5 1 ${env}_1005025 $script $env --policy_size 100 50 25 &> ${env}_1005025tanh.log &
29 | bash run_multiple.sh 5 1 ${env}_vf1005025 $script $env --value_func_size 100 50 25 &> ${env}_vf1005025tanh.log &
30 | bash run_multiple.sh 5 1 ${env}_vf400300 $script $env --value_func_size 400 300 &> ${env}_vf400300tanh.log
31 |
--------------------------------------------------------------------------------
/tools/significance_and_bootstrap.py:
--------------------------------------------------------------------------------
1 | import scipy.stats as stats
2 | import random
3 | import pandas as pd
4 | import numpy as np
5 | import bootstrapped.bootstrap as bas
6 | import bootstrapped.stats_functions as bs_stats
7 | import bootstrapped.compare_functions as bs_compare
8 | import bootstrapped.power as bs_power
9 | from scipy.stats import ks_2samp
10 | import argparse
11 |
12 | parser = argparse.ArgumentParser()
13 | parser.add_argument("numpy_arrays", nargs="+", help="Assumes results returned in N X M array. N=number of iterations, M = number of trials")
14 |
15 | args = parser.parse_args()
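# Example invocation (hypothetical files, each an N x M array of returns):
#   python significance_and_bootstrap.py algo_a_returns.npy algo_b_returns.npy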
16 |
17 | assert len(args.numpy_arrays) == 2
18 |
19 | data = np.load(args.numpy_arrays[0])
20 | data2 = np.load(args.numpy_arrays[1])
21 |
22 | """
23 | We assume that the final evaluation iteration in the numpy array is the one to be analyzed.
24 | """
25 |
26 | final_average_return = np.array(sorted(data[-1]))
27 | final_average_return2 = np.array(sorted(data2[-1]))
28 |
29 | print("***********************************************")
30 | print("Average Returns 1: ", final_average_return)
31 | print("Average Returns 2", final_average_return2)
32 | print("***********************************************")
33 |
34 | #print(t_test(len(final_average_return), np.array(final_average_return) - np.array(final_average_return2), 2e6, 5000))
35 | print("t-test", stats.ttest_ind(final_average_return, final_average_return2))
36 | print("ks-2sample", ks_2samp(final_average_return, final_average_return2))
37 |
38 | def run_simulation(data):
39 | lift = 1.25
40 | results = []
41 | for i in range(3000):
42 | random.shuffle(data)
43 |         test = data[:len(data) // 2] * lift
44 |         ctrl = data[len(data) // 2:]
45 | results.append(bas.bootstrap_ab(test, ctrl, bs_stats.mean, bs_compare.percent_change))
46 | return results
47 |
48 | def run_simulation2(data, data2):
49 | results = []
50 | for i in range(3000):
51 | results.append(bas.bootstrap_ab(data, data2, bs_stats.mean, bs_compare.percent_change))
52 | return results
53 |
54 | print("bootstrap a/b", bas.bootstrap_ab(final_average_return, final_average_return2, bs_stats.mean, bs_compare.percent_change))
55 | bab= bas.bootstrap_ab(final_average_return, final_average_return2, bs_stats.mean, bs_compare.percent_change)
56 | x = run_simulation2(final_average_return, final_average_return2)
57 | bootstrap_ab = bs_power.power_stats(x)
58 | print("power analysis bootstrap a/b")
59 | print(bootstrap_ab)
60 |
61 | print("***********************************************")
62 | print("Bootstrap analysis")
63 | print("***********************************************")
64 | print("***Arg1****")
65 | sim = bas.bootstrap(final_average_return, stat_func=bs_stats.mean)
66 | print("%.2f (%.2f, %.2f)" % (sim.value, sim.lower_bound, sim.upper_bound))
67 | print("***Arg2****")
68 | sim = bas.bootstrap(final_average_return2, stat_func=bs_stats.mean)
69 | print("%.2f (%.2f, %.2f)" % (sim.value, sim.lower_bound, sim.upper_bound))
70 | print("***********************************************")
71 |
72 |
73 | print("***********************************************")
74 | print("****Significance of lift analysis****")
75 | print("***********************************************")
76 | print("***Power1****")
77 | sim = run_simulation(final_average_return)
78 | sim = bs_power.power_stats(sim)
79 | print(sim)
80 | print("\\shortstack{%.2f \\%%\\\\%.2f \\%% \\\\ %.2f \\%%}" % (sim.transpose()["Insignificant"], sim.transpose()["Positive Significant"], sim.transpose()["Negative Significant"]))
81 | print("***Power2****")
82 | sim = run_simulation(final_average_return2)
83 | sim = bs_power.power_stats(sim)
84 | print(sim)
85 | #sim = bas.bootstrap(final_average_return2, stat_func=bs_stats.mean)
86 | print("\\shortstack{%.2f \\%% \\\\ %.2f \\%% \\\\ %.2f \\%%}" % (sim.transpose()["Insignificant"], sim.transpose()["Positive Significant"], sim.transpose()["Negative Significant"]))
87 | print("***********************************************")
88 |
89 |
90 | print("***********************************************")
91 | print("Significance analysis, t-test, ks test and bootstrap a/b")
92 | print("***********************************************")
93 | t,p = stats.ttest_ind(final_average_return, final_average_return2)
94 | ks,kp = ks_2samp(final_average_return, final_average_return2)
95 |
96 | a, b,c = bab.value, bab.lower_bound, bab.upper_bound
97 |
98 | if bootstrap_ab.transpose()["Positive Significant"][0] >= bootstrap_ab.transpose()["Negative Significant"][0]:
99 | boot = "+%.2f" % bootstrap_ab.transpose()["Positive Significant"]
100 | else:
101 | boot = "-%.2f" % bootstrap_ab.transpose()["Negative Significant"]
102 | print("\\shortstack{$t=%.2f,p=%.3f$\\\\$KS=%.2f,p=%.3f$\\\\%.2f \\%% (%.2f \\%%, %.2f \\%%) }" % (t,p,ks,kp,a,b,c))
103 | print("***********************************************")
104 |
--------------------------------------------------------------------------------