├── .gitignore
├── .gitmodules
├── README.md
├── modular_rl
│   ├── run_batch_size_experiment_set.sh
│   ├── run_multiple.sh
│   └── run_policy_experiment_set.sh
├── rllab
│   ├── run_ddpg_rllab.py
│   └── run_trpo.py
├── rllabplusplus
│   ├── ddpg
│   │   ├── __init__.py
│   │   └── ddpg.py
│   ├── run_ddpg.py
│   └── sampling_utils.py
└── tools
    ├── plot_results.py
    ├── run_multiple.sh
    ├── run_policy_experiment_set.sh
    └── significance_and_bootstrap.py
/.gitignore:
--------------------------------------------------------------------------------
1 | # Byte-compiled / optimized / DLL files
2 | __pycache__/
3 | *.py[cod]
4 | *$py.class
5 | 
6 | # C extensions
7 | *.so
8 | 
9 | # Distribution / packaging
10 | .Python
11 | env/
12 | build/
13 | develop-eggs/
14 | dist/
15 | downloads/
16 | eggs/
17 | .eggs/
18 | lib/
19 | lib64/
20 | parts/
21 | sdist/
22 | var/
23 | wheels/
24 | *.egg-info/
25 | .installed.cfg
26 | *.egg
27 | 
28 | # PyInstaller
29 | # Usually these files are written by a python script from a template
30 | # before PyInstaller builds the exe, so as to inject date/other infos into it.
31 | *.manifest
32 | *.spec
33 | 
34 | # Installer logs
35 | pip-log.txt
36 | pip-delete-this-directory.txt
37 | 
38 | # Unit test / coverage reports
39 | htmlcov/
40 | .tox/
41 | .coverage
42 | .coverage.*
43 | .cache
44 | nosetests.xml
45 | coverage.xml
46 | *.cover
47 | .hypothesis/
48 | 
49 | # Translations
50 | *.mo
51 | *.pot
52 | 
53 | # Django stuff:
54 | *.log
55 | local_settings.py
56 | 
57 | # Flask stuff:
58 | instance/
59 | .webassets-cache
60 | 
61 | # Scrapy stuff:
62 | .scrapy
63 | 
64 | # Sphinx documentation
65 | docs/_build/
66 | 
67 | # PyBuilder
68 | target/
69 | 
70 | # Jupyter Notebook
71 | .ipynb_checkpoints
72 | 
73 | # pyenv
74 | .python-version
75 | 
76 | # celery beat schedule file
77 | celerybeat-schedule
78 | 
79 | # SageMath parsed files
80 | *.sage.py
81 | 
82 | # dotenv
83 | .env
84 | 
85 | # virtualenv
86 | .venv
87 | venv/
88 | ENV/
89 | 
90 | # Spyder project settings
91 | .spyderproject
92 | .spyproject
93 | 
94 | # Rope project settings
95 | .ropeproject
96 | 
97 | # mkdocs documentation
98 | /site
99 | 
100 | # mypy
101 | .mypy_cache/
102 | 
--------------------------------------------------------------------------------
/.gitmodules:
--------------------------------------------------------------------------------
1 | [submodule "baselines"]
2 | path = baselines
3 | url = https://github.com/Breakend/baselines
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # DeepReinforcementLearningThatMatters
2 | 
3 | Accompanying code for "Deep Reinforcement Learning that Matters"
4 | 
5 | ## Baselines Experiments
6 | 
7 | Our Fork
8 | 
9 | Current Baselines Code
10 | 
11 | Our checkpointed version of the baselines code is found in the `baselines` folder. We make several modifications, mostly to allow for passing network structures as arguments to the MuJoCo-related run scripts.
12 | 
13 | Our only internal change was to the DDPG evaluation code, made to allow for a fair comparison against the other algorithms. In the original DDPG code, evaluation is performed across N different policies, where N is the number of `epoch_cycles`; we did not find this consistent for comparison against other methods, so we modified it to match the rllab version of DDPG evaluation. That is, we run the target policy for 10 full trajectories at the end of each epoch.
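
For concreteness, a minimal sketch of the evaluation scheme we match (the `target_policy` and `env` objects here are simplified placeholders, not the exact rllab/baselines interfaces):

```python
def evaluate_target_policy(target_policy, env, n_trajectories=10):
    """Roll out the deterministic target policy (no exploration noise) for
    full trajectories at the end of an epoch and report the average return."""
    returns = []
    for _ in range(n_trajectories):
        obs = env.reset()
        done, total_reward = False, 0.0
        while not done:
            action = target_policy.get_action(obs)
            obs, reward, done, _ = env.step(action)
            total_reward += reward
        returns.append(total_reward)
    return sum(returns) / len(returns)
```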
14 | 
15 | ## rllab experiments
16 | 
17 | rllab code
18 | 
19 | These require the full rllab code, which we do not provide. Instead, we provide some run scripts for the rllab experiments in the `rllab` folder.
20 | 
21 | ## rllabplusplus experiments
22 | 
23 | rllabplusplus (Q-Prop) code
24 | 
25 | This is the code provided with Q-Prop. We provide only a checkpointed version of its DDPG code, which we use for evaluation here; it is found in the `rllabplusplus` folder.
26 | 
27 | ## modular_rl experiments
28 | 
29 | Original TRPO (Modular RL) Code
30 | 
31 | These are simply run scripts for the Modular RL codebase.
32 | 
33 | ## Tools
34 | 
35 | This folder contains the significance-testing tools we used, along with the associated run scripts.
36 | 
37 | For bootstrap-based analysis, we use the `bootstrapped` repo. The tutorials there are a nice introduction to this sort of statistical analysis.
38 | 
39 | For the t-test and the Kolmogorov-Smirnov (KS) test, we use the SciPy implementations.
40 | 
41 | ## Citation
42 | 
43 | ```
44 | @article{hendersonRL2017,
45 | author = {{Henderson}, Peter and {Islam}, Riashat and {Bachman}, Philip and {Pineau}, Joelle and {Precup}, Doina and {Meger}, David},
46 | title = "{Deep Reinforcement Learning that Matters}",
47 | journal = {arXiv preprint arXiv:1709.06560},
48 | year = 2017,
49 | url={https://arxiv.org/pdf/1709.06560.pdf}
50 | }
51 | ```
52 | 
--------------------------------------------------------------------------------
/modular_rl/run_batch_size_experiment_set.sh:
--------------------------------------------------------------------------------
1 | #!/bin/bash
2 | 
3 | 
4 | die() { echo "$@" 1>&2 ; exit 1; }
5 | 
6 | # TODO: make this proper usage script
7 | 
8 | # check whether user had supplied -h or --help . If yes display usage
9 | if [[ ( $# == "--help") || $# == "-h" ]]
10 | then
11 | die "Usage: bash $0 num_experiments num_experiments_in_parallel run_script [--all --other --args --to --run_script]"
12 | fi
13 | 
14 | env=$1
15 | 
16 | 
17 | # default (6464tanh)
18 | # activations
19 | KERAS_BACKEND=theano bash run_multiple.sh 5 1 ${env}_tanh_defaultbs1024 run_pg.py --gamma=0.995 --lam=0.97 --agent=modular_rl.agentzoo.TrpoAgent --max_kl=0.01 --cg_damping=0.1 --activation=tanh --n_iter=977 --timesteps_per_batch=1024 --env=${env} --filter=1 &> ${env}_default_bs1024.log
20 | KERAS_BACKEND=theano bash run_multiple.sh 5 1 ${env}_tanh_default2048 run_pg.py --gamma=0.995 --lam=0.97 --agent=modular_rl.agentzoo.TrpoAgent --max_kl=0.01 --cg_damping=0.1 --activation=tanh --n_iter=488 --timesteps_per_batch=2048 --env=${env} --filter=1 &> ${env}_default2048.log
21 | KERAS_BACKEND=theano bash run_multiple.sh 5 1 ${env}_tanh_default4096 run_pg.py --gamma=0.995 --lam=0.97 --agent=modular_rl.agentzoo.TrpoAgent --max_kl=0.01 --cg_damping=0.1 --activation=tanh --n_iter=244 --timesteps_per_batch=4096 --env=${env} --filter=1 &> ${env}_default4096.log
22 | KERAS_BACKEND=theano bash run_multiple.sh 5 1 ${env}_tanh_default8192 run_pg.py --gamma=0.995 --lam=0.97 --agent=modular_rl.agentzoo.TrpoAgent --max_kl=0.01 --cg_damping=0.1 --activation=tanh --n_iter=122 --timesteps_per_batch=8192 --env=${env} --filter=1 &> ${env}_default8192.log
23 | KERAS_BACKEND=theano bash run_multiple.sh 5 1 ${env}_tanh_default16384 run_pg.py --gamma=0.995 --lam=0.97 --agent=modular_rl.agentzoo.TrpoAgent --max_kl=0.01 --cg_damping=0.1 --activation=tanh --n_iter=61 --timesteps_per_batch=16384 --env=${env} --filter=1 &> ${env}_default16384.log
24 | KERAS_BACKEND=theano bash run_multiple.sh 5 1 ${env}_tanh_default32768 run_pg.py --gamma=0.995 --lam=0.97
--agent=modular_rl.agentzoo.TrpoAgent --max_kl=0.01 --cg_damping=0.1 --activation=tanh --n_iter=30 --timesteps_per_batch=32768 --env=${env} --filter=1 &> ${env}_default32768.log 25 | -------------------------------------------------------------------------------- /modular_rl/run_multiple.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | 4 | die() { echo "$@" 1>&2 ; exit 1; } 5 | 6 | # TODO: make this proper usage script 7 | 8 | if [ $# -le 4 ] 9 | then 10 | die "Usage: bash $0 num_experiments num_experiments_in_parallel run_script [--all --other --args --to --run-script]" 11 | fi 12 | 13 | # check whether user had supplied -h or --help . If yes display usage 14 | if [[ ( $# == "--help") || $# == "-h" ]] 15 | then 16 | die "Usage: bash $0 num_experiments num_experiments_in_parallel run_script [--all --other --args --to --run_script]" 17 | fi 18 | 19 | num_experiments=$1 20 | parallel_exps=$2 21 | log_prefix=$3 22 | run_script=$4 23 | 24 | pickle_files=() 25 | 26 | mkdir -p ./$log_prefix/ 27 | 28 | trap 'jobs -p | xargs kill' EXIT 29 | 30 | for (( c=1; c<=$num_experiments; )) 31 | do 32 | for (( j=1; j<=$parallel_exps; j++ )) 33 | do 34 | echo "Launching experiment $c" 35 | mkdir -p ./$log_prefix/exp_$c/ 36 | python $run_script --seed $c --outfile ./$log_prefix/exp_$c/ "${@:5}" &> ./$log_prefix/exp_$c.log & 37 | #pickle_files=("${pickle_files[@]}" "exp_$c.pickle") 38 | c=$((c+1)) 39 | done 40 | wait 41 | done 42 | 43 | #python create_graphs_from_pickle.py "${pickle_files[@]}" 44 | 45 | -------------------------------------------------------------------------------- /modular_rl/run_policy_experiment_set.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | 4 | die() { echo "$@" 1>&2 ; exit 1; } 5 | 6 | # TODO: make this proper usage script 7 | 8 | # check whether user had supplied -h or --help . 
If yes display usage 9 | if [[ ( $# == "--help") || $# == "-h" ]] 10 | then 11 | die "Usage: bash $0 num_experiments num_experiments_in_parallel run_script [--all --other --args --to --run_script]" 12 | fi 13 | 14 | env=$1 15 | 16 | 17 | # default (6464tanh) 18 | # activations 19 | KERAS_BACKEND=theano bash run_multiple.sh 5 1 ${env}_tanh_default run_pg.py --gamma=0.995 --lam=0.97 --agent=modular_rl.agentzoo.TrpoAgent --max_kl=0.01 --cg_damping=0.1 --activation=tanh --n_iter=100 --timesteps_per_batch=20000 --env=${env} --filter=1 &> ${env}_default.log 20 | KERAS_BACKEND=theano bash run_multiple.sh 5 1 ${env}_vfrelu run_pg.py --gamma=0.995 --lam=0.97 --agent=modular_rl.agentzoo.TrpoAgent --max_kl=0.01 --cg_damping=0.1 --activation_vf=relu --n_iter=100 --timesteps_per_batch=20000 --env=${env} --filter=1 &> ${env}_vfrelu.log 21 | KERAS_BACKEND=theano bash run_multiple.sh 5 1 ${env}_polrelu run_pg.py --gamma=0.995 --lam=0.97 --agent=modular_rl.agentzoo.TrpoAgent --max_kl=0.01 --cg_damping=0.1 --activation=relu --n_iter=100 --timesteps_per_batch=20000 --env=${env} --filter=1 &> ${env}_relu.log 22 | KERAS_BACKEND=theano bash run_multiple.sh 5 1 ${env}_tanh1005025 run_pg.py --gamma=0.995 --lam=0.97 --agent=modular_rl.agentzoo.TrpoAgent --max_kl=0.01 --cg_damping=0.1 --activation=tanh --n_iter=100 --timesteps_per_batch=20000 --hid_sizes "100,50,25" --env=${env} --filter=1 &> ${env}_1005025.log 23 | KERAS_BACKEND=theano bash run_multiple.sh 5 1 ${env}_400300 run_pg.py --gamma=0.995 --lam=0.97 --agent=modular_rl.agentzoo.TrpoAgent --max_kl=0.01 --cg_damping=0.1 --activation=tanh --n_iter=100 --timesteps_per_batch=20000 --hid_sizes "400,300" --env=${env} --filter=1 &> ${env}_400300.log 24 | KERAS_BACKEND=theano bash run_multiple.sh 5 1 ${env}_400300vf run_pg.py --gamma=0.995 --lam=0.97 --agent=modular_rl.agentzoo.TrpoAgent --max_kl=0.01 --cg_damping=0.1 --activation=tanh --n_iter=100 --timesteps_per_batch=20000 --hid_sizes_vf "400,300" --env=${env} --filter=1 &> ${env}_400300vf.log 25 | KERAS_BACKEND=theano bash run_multiple.sh 5 1 ${env}_1005025vf run_pg.py --gamma=0.995 --lam=0.97 --agent=modular_rl.agentzoo.TrpoAgent --max_kl=0.01 --cg_damping=0.1 --activation=tanh --n_iter=100 --timesteps_per_batch=20000 --hid_sizes_vf "100,50,25" --env=${env} --filter=1 &> ${env}_1005025vf.log 26 | -------------------------------------------------------------------------------- /rllab/run_ddpg_rllab.py: -------------------------------------------------------------------------------- 1 | import argparse 2 | import os.path as osp 3 | import pickle 4 | 5 | import tensorflow as tf 6 | 7 | from rllab.envs.gym_env import GymEnv 8 | from rllab.envs.normalized_env import normalize 9 | from rllab.exploration_strategies.ou_strategy import OUStrategy 10 | from rllab.misc import ext 11 | from rllab.misc.instrument import run_experiment_lite, stub 12 | from rllab.algos.ddpg import DDPG 13 | from rllab.exploration_strategies.ou_strategy import OUStrategy 14 | from rllab.policies.deterministic_mlp_policy import DeterministicMLPPolicy 15 | from rllab.q_functions.continuous_mlp_q_function import ContinuousMLPQFunction 16 | import lasagne.nonlinearities as NL 17 | 18 | 19 | from sandbox.rocky.tf.misc.tensor_utils import lrelu 20 | 21 | parser = argparse.ArgumentParser() 22 | parser.add_argument("env", help="The environment name from OpenAIGym environments") 23 | parser.add_argument("--num_epochs", default=200, type=int) 24 | parser.add_argument("--log_dir", default="./data_ddpg/") 25 | parser.add_argument("--reward_scale", 
default=1.0, type=float) 26 | parser.add_argument("--use_ec2", action="store_true", help="Use your ec2 instances if configured") 27 | parser.add_argument("--dont_terminate_machine", action="store_false", help="Whether to terminate your spot instance or not. Be careful.") 28 | parser.add_argument("--policy_size", default=[100,50,25], type=int, nargs='*') 29 | parser.add_argument("--policy_activation", default="relu", type=str) 30 | parser.add_argument("--vf_size", default=[100,50,25], type=int, nargs='*') 31 | parser.add_argument("--vf_activation", default="relu", type=str) 32 | parser.add_argument("--seed", type=int, default=0) 33 | args = parser.parse_args() 34 | 35 | stub(globals()) 36 | ext.set_seed(args.seed) 37 | 38 | gymenv = GymEnv(args.env, force_reset=True, record_video=False, record_log=False) 39 | 40 | env = normalize(gymenv) 41 | 42 | activation_map = { "relu" : NL.rectify, "tanh" : NL.tanh, "leaky_relu" : NL.LeakyRectify} 43 | 44 | policy = DeterministicMLPPolicy( 45 | env_spec=env.spec, 46 | # The neural network policy should have two hidden layers, each with 32 hidden units. 47 | hidden_sizes=args.policy_size, 48 | hidden_nonlinearity=activation_map[args.policy_activation], 49 | ) 50 | 51 | es = OUStrategy(env_spec=env.spec) 52 | 53 | qf = ContinuousMLPQFunction(env_spec=env.spec, 54 | hidden_nonlinearity=activation_map[args.vf_activation], 55 | hidden_sizes=args.vf_size,) 56 | 57 | algo = DDPG( 58 | env=env, 59 | policy=policy, 60 | es=es, 61 | qf=qf, 62 | batch_size=128, 63 | max_path_length=env.horizon, 64 | epoch_length=1000, 65 | min_pool_size=10000, 66 | n_epochs=args.num_epochs, 67 | discount=0.995, 68 | scale_reward=args.reward_scale, 69 | qf_learning_rate=1e-3, 70 | policy_learning_rate=1e-4, 71 | plot=False 72 | ) 73 | 74 | 75 | run_experiment_lite( 76 | algo.train(), 77 | log_dir=None if args.use_ec2 else args.log_dir, 78 | # Number of parallel workers for sampling 79 | n_parallel=1, 80 | # Only keep the snapshot parameters for the last iteration 81 | snapshot_mode="last", 82 | # Specifies the seed for the experiment. 
If this is not provided, a random seed 83 | # will be used 84 | exp_prefix="DDPG_" + args.env, 85 | seed=args.seed, 86 | mode="ec2" if args.use_ec2 else "local", 87 | plot=False, 88 | # dry=True, 89 | terminate_machine=args.dont_terminate_machine, 90 | added_project_directories=[osp.abspath(osp.join(osp.dirname(__file__), '.'))] 91 | ) 92 | -------------------------------------------------------------------------------- /rllab/run_trpo.py: -------------------------------------------------------------------------------- 1 | from rllab.envs.box2d.cartpole_env import CartpoleEnv 2 | from rllab.envs.normalized_env import normalize 3 | from rllab.misc.instrument import stub, run_experiment_lite 4 | from rllab.baselines.linear_feature_baseline import LinearFeatureBaseline 5 | from rllab.envs.gym_env import GymEnv 6 | from sandbox.rocky.tf.misc.tensor_utils import lrelu 7 | 8 | from sandbox.rocky.tf.envs.base import TfEnv 9 | from sandbox.rocky.tf.policies.gaussian_mlp_policy import GaussianMLPPolicy 10 | from sandbox.rocky.tf.algos.trpo import TRPO 11 | from rllab.misc import ext 12 | from sandbox.rocky.tf.optimizers.conjugate_gradient_optimizer import ConjugateGradientOptimizer 13 | from sandbox.rocky.tf.optimizers.conjugate_gradient_optimizer import FiniteDifferenceHvp 14 | 15 | import pickle 16 | import os.path as osp 17 | 18 | import tensorflow as tf 19 | 20 | import argparse 21 | parser = argparse.ArgumentParser() 22 | parser.add_argument("env", help="The environment name from OpenAIGym environments") 23 | parser.add_argument("--num_epochs", default=100, type=int) 24 | parser.add_argument("--batch_size", default=20000, type=int) 25 | parser.add_argument("--step_size", default=0.01, type=float) 26 | parser.add_argument("--reg_coeff", default=0.1, type=float) 27 | parser.add_argument("--gae_lambda", default=.97, type=float) 28 | parser.add_argument("--policy_size", default=[100,50,25], type=int, nargs='*') 29 | parser.add_argument("--log_dir", default="./data/") 30 | parser.add_argument("--use_ec2", action="store_true", help="Use your ec2 instances if configured") 31 | parser.add_argument("--dont_terminate_machine", action="store_false", help="Whether to terminate your spot instance or not. Be careful.") 32 | parser.add_argument("--activation", default="relu", type=str) 33 | parser.add_argument("--seed", default=1, type=int) 34 | args = parser.parse_args() 35 | 36 | stub(globals()) 37 | ext.set_seed(args.seed) 38 | 39 | supported_gym_envs = ["MountainCarContinuous-v0", "InvertedPendulum-v1", "InvertedDoublePendulum-v1", "Hopper-v1", "Walker2d-v1", "Humanoid-v1", "Reacher-v1", "HalfCheetah-v1", "Swimmer-v1", "HumanoidStandup-v1"] 40 | 41 | other_env_class_map = { "Cartpole" : CartpoleEnv} 42 | 43 | activation_map = { "relu" : tf.nn.relu, "tanh" : tf.nn.tanh, "leaky_relu" : lrelu} 44 | 45 | if args.env in supported_gym_envs: 46 | gymenv = GymEnv(args.env, force_reset=True, record_video=False, record_log=False) 47 | else: 48 | gymenv = other_env_class_map[args.env]() 49 | 50 | #TODO: assert continuous space 51 | 52 | env = TfEnv(normalize(gymenv)) 53 | 54 | print("Using network arch: %s" % ", ".join([str(x) for x in args.policy_size])) 55 | 56 | policy = GaussianMLPPolicy( 57 | name="policy", 58 | env_spec=env.spec, 59 | # The neural network policy should have two hidden layers, each with 32 hidden units. 
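    # In this script the hidden sizes actually come from --policy_size (default: 100, 50, 25), not two 32-unit layers.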
60 | hidden_sizes=tuple([int(x) for x in args.policy_size]), 61 | hidden_nonlinearity=activation_map[args.activation], 62 | ) 63 | 64 | baseline = LinearFeatureBaseline(env_spec=env.spec) 65 | 66 | algo = TRPO( 67 | env=env, 68 | policy=policy, 69 | baseline=baseline, 70 | batch_size=args.batch_size, 71 | max_path_length=env.horizon, 72 | n_itr=args.num_epochs, 73 | discount=0.99, 74 | step_size=args.step_size, 75 | gae_lambda=args.gae_lambda, 76 | optimizer=ConjugateGradientOptimizer(reg_coeff=args.reg_coeff) 77 | ) 78 | 79 | arch_name="_".join([str(x) for x in args.policy_size]) 80 | pref = "TRPO_" + args.env + "_bs_" + str(args.batch_size) + "_sp_" + str(args.step_size) + "_regc_" + str(args.reg_coeff) + "_gael_" + str(args.gae_lambda) + "_na_" + arch_name + "_seed_" + str(args.seed) 81 | pref = pref.replace(".", "_") 82 | print("Using prefix %s" % pref) 83 | 84 | run_experiment_lite( 85 | algo.train(), 86 | log_dir=None if args.use_ec2 else args.log_dir, 87 | # Number of parallel workers for sampling 88 | n_parallel=1, 89 | # Only keep the snapshot parameters for the last iteration 90 | snapshot_mode="none", 91 | # Specifies the seed for the experiment. If this is not provided, a random seed 92 | # will be used 93 | exp_prefix=pref, 94 | seed=args.seed, 95 | use_gpu=False, 96 | mode="ec2" if args.use_ec2 else "local", 97 | plot=False, 98 | # dry=True, 99 | terminate_machine=args.dont_terminate_machine, 100 | added_project_directories=[osp.abspath(osp.join(osp.dirname(__file__), '.'))] 101 | ) 102 | -------------------------------------------------------------------------------- /rllabplusplus/ddpg/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Breakend/DeepReinforcementLearningThatMatters/1f4ea363623854b8072149cb404ee919d9042e79/rllabplusplus/ddpg/__init__.py -------------------------------------------------------------------------------- /rllabplusplus/ddpg/ddpg.py: -------------------------------------------------------------------------------- 1 | # MODIFIED FROM: https://raw.githubusercontent.com/shaneshixiang/rllabplusplus/master/sandbox/rocky/tf/algos/ddpg.py 2 | import gc 3 | import time 4 | 5 | #import pickle as pickle 6 | import numpy as np 7 | import tensorflow as tf 8 | 9 | import pyprind 10 | import rllab.misc.logger as logger 11 | from rllab.algos.base import RLAlgorithm 12 | from rllab.core.serializable import Serializable 13 | from rllab.misc import ext, special 14 | from rllab.misc.overrides import overrides 15 | from rllab.plotter import plotter 16 | from rllab.sampler import parallel_sampler 17 | from sampling_utils import SimpleReplayPool 18 | from sandbox.rocky.tf.misc import tensor_utils 19 | from sandbox.rocky.tf.optimizers.first_order_optimizer import \ 20 | FirstOrderOptimizer 21 | 22 | 23 | class DDPG(RLAlgorithm): 24 | """ 25 | Deep Deterministic Policy Gradient. 
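    Checkpointed from the rllab++ (Q-Prop) implementation (see the source URL above); this is the version used for the DDPG evaluation experiments.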
26 | """ 27 | 28 | def __init__( 29 | self, 30 | env, 31 | policy, 32 | qf, 33 | es, 34 | batch_size=32, 35 | n_epochs=200, 36 | epoch_length=1000, 37 | min_pool_size=10000, 38 | replay_pool_size=1000000, 39 | replacement_prob=1.0, 40 | discount=0.99, 41 | max_path_length=250, 42 | qf_weight_decay=0., 43 | qf_update_method='adam', 44 | qf_learning_rate=1e-3, 45 | policy_weight_decay=0, 46 | policy_update_method='adam', 47 | policy_learning_rate=1e-3, 48 | policy_updates_ratio=1.0, 49 | eval_samples=10000, 50 | soft_target=True, 51 | soft_target_tau=0.001, 52 | n_updates_per_sample=1, 53 | scale_reward=1.0, 54 | include_horizon_terminal_transitions=False, 55 | plot=False, 56 | pause_for_plot=False, 57 | **kwargs): 58 | """ 59 | :param env: Environment 60 | :param policy: Policy 61 | :param qf: Q function 62 | :param es: Exploration strategy 63 | :param batch_size: Number of samples for each minibatch. 64 | :param n_epochs: Number of epochs. Policy will be evaluated after each epoch. 65 | :param epoch_length: How many timesteps for each epoch. 66 | :param min_pool_size: Minimum size of the pool to start training. 67 | :param replay_pool_size: Size of the experience replay pool. 68 | :param discount: Discount factor for the cumulative return. 69 | :param max_path_length: Discount factor for the cumulative return. 70 | :param qf_weight_decay: Weight decay factor for parameters of the Q function. 71 | :param qf_update_method: Online optimization method for training Q function. 72 | :param qf_learning_rate: Learning rate for training Q function. 73 | :param policy_weight_decay: Weight decay factor for parameters of the policy. 74 | :param policy_update_method: Online optimization method for training the policy. 75 | :param policy_learning_rate: Learning rate for training the policy. 76 | :param eval_samples: Number of samples (timesteps) for evaluating the policy. 77 | :param soft_target_tau: Interpolation parameter for doing the soft target update. 78 | :param n_updates_per_sample: Number of Q function and policy updates per new sample obtained 79 | :param scale_reward: The scaling factor applied to the rewards when training 80 | :param include_horizon_terminal_transitions: whether to include transitions with terminal=True because the 81 | horizon was reached. This might make the Q value back up less stable for certain tasks. 82 | :param plot: Whether to visualize the policy performance after each eval_interval. 83 | :param pause_for_plot: Whether to pause before continuing when plotting. 
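        :param replacement_prob: Probability used by the replay pool's stochastic replacement scheme once the pool is full (see SimpleReplayPool in sampling_utils.py).
        :param policy_updates_ratio: Number of policy updates performed per Q-function update (may be fractional).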
84 | :return: 85 | """ 86 | self.env = env 87 | self.policy = policy 88 | self.qf = qf 89 | self.es = es 90 | self.batch_size = batch_size 91 | self.n_epochs = n_epochs 92 | self.epoch_length = epoch_length 93 | self.min_pool_size = min_pool_size 94 | self.replay_pool_size = replay_pool_size 95 | self.replacement_prob = replacement_prob 96 | self.discount = discount 97 | self.max_path_length = max_path_length 98 | self.qf_weight_decay = qf_weight_decay 99 | self.qf_update_method = \ 100 | FirstOrderOptimizer( 101 | update_method=qf_update_method, 102 | learning_rate=qf_learning_rate, 103 | ) 104 | self.qf_learning_rate = qf_learning_rate 105 | self.policy_weight_decay = policy_weight_decay 106 | self.policy_update_method = \ 107 | FirstOrderOptimizer( 108 | update_method=policy_update_method, 109 | learning_rate=policy_learning_rate, 110 | ) 111 | self.policy_learning_rate = policy_learning_rate 112 | self.policy_updates_ratio = policy_updates_ratio 113 | self.eval_samples = eval_samples 114 | self.soft_target_tau = soft_target_tau 115 | self.n_updates_per_sample = n_updates_per_sample 116 | self.include_horizon_terminal_transitions = include_horizon_terminal_transitions 117 | self.plot = plot 118 | self.pause_for_plot = pause_for_plot 119 | 120 | self.qf_loss_averages = [] 121 | self.policy_surr_averages = [] 122 | self.q_averages = [] 123 | self.y_averages = [] 124 | self.paths = [] 125 | self.es_path_returns = [] 126 | self.paths_samples_cnt = 0 127 | 128 | self.scale_reward = scale_reward 129 | 130 | self.train_policy_itr = 0 131 | 132 | self.opt_info = None 133 | 134 | def start_worker(self): 135 | parallel_sampler.populate_task(self.env, self.policy) 136 | if self.plot: 137 | plotter.init_plot(self.env, self.policy) 138 | 139 | @overrides 140 | def train(self): 141 | gc_dump_time = time.time() 142 | with tf.Session() as sess: 143 | sess.run(tf.global_variables_initializer()) 144 | # This seems like a rather sequential method 145 | pool = SimpleReplayPool( 146 | max_pool_size=self.replay_pool_size, 147 | observation_dim=self.env.observation_space.flat_dim, 148 | action_dim=self.env.action_space.flat_dim, 149 | replacement_prob=self.replacement_prob, 150 | ) 151 | self.start_worker() 152 | 153 | self.init_opt() 154 | # This initializes the optimizer parameters 155 | sess.run(tf.global_variables_initializer()) 156 | itr = 0 157 | path_length = 0 158 | path_return = 0 159 | terminal = False 160 | initial = False 161 | observation = self.env.reset() 162 | 163 | with tf.variable_scope("sample_policy"): 164 | sample_policy = Serializable.clone(self.policy) 165 | 166 | for epoch in range(self.n_epochs): 167 | logger.push_prefix('epoch #%d | ' % epoch) 168 | logger.log("Training started") 169 | train_qf_itr, train_policy_itr = 0, 0 170 | for epoch_itr in pyprind.prog_bar(range(self.epoch_length)): 171 | # Execute policy 172 | if terminal: 173 | # Note that if the last time step ends an episode, the very 174 | # last state and observation will be ignored and not added 175 | # to the replay pool 176 | observation = self.env.reset() 177 | sample_policy.reset() 178 | self.es_path_returns.append(path_return) 179 | path_length = 0 180 | path_return = 0 181 | initial = True 182 | else: 183 | initial = False 184 | action = self.es.get_action(itr, observation, policy=sample_policy) # qf=qf) 185 | 186 | next_observation, reward, terminal, _ = self.env.step(action) 187 | path_length += 1 188 | path_return += reward 189 | 190 | if not terminal and path_length >= self.max_path_length: 191 | terminal = 
True 192 | # only include the terminal transition in this case if the flag was set 193 | if self.include_horizon_terminal_transitions: 194 | pool.add_sample(observation, action, reward * self.scale_reward, terminal, initial) 195 | else: 196 | pool.add_sample(observation, action, reward * self.scale_reward, terminal, initial) 197 | 198 | observation = next_observation 199 | 200 | if pool.size >= self.min_pool_size: 201 | for update_itr in range(self.n_updates_per_sample): 202 | # Train policy 203 | batch = pool.random_batch(self.batch_size) 204 | itrs = self.do_training(itr, batch) 205 | train_qf_itr += itrs[0] 206 | train_policy_itr += itrs[1] 207 | sample_policy.set_param_values(self.policy.get_param_values()) 208 | 209 | itr += 1 210 | if time.time() - gc_dump_time > 100: 211 | gc.collect() 212 | gc_dump_time = time.time() 213 | 214 | logger.log("Training finished") 215 | logger.log("Trained qf %d steps, policy %d steps"%(train_qf_itr, train_policy_itr)) 216 | if pool.size >= self.min_pool_size: 217 | self.evaluate(epoch, pool) 218 | params = self.get_epoch_snapshot(epoch) 219 | logger.save_itr_params(epoch, params) 220 | logger.dump_tabular(with_prefix=False) 221 | logger.pop_prefix() 222 | if self.plot: 223 | self.update_plot() 224 | if self.pause_for_plot: 225 | input("Plotting evaluation run: Press Enter to " 226 | "continue...") 227 | self.env.terminate() 228 | self.policy.terminate() 229 | 230 | def init_opt(self): 231 | 232 | # First, create "target" policy and Q functions 233 | with tf.variable_scope("target_policy"): 234 | target_policy = Serializable.clone(self.policy) 235 | with tf.variable_scope("target_qf"): 236 | target_qf = Serializable.clone(self.qf) 237 | 238 | # y need to be computed first 239 | obs = self.env.observation_space.new_tensor_variable( 240 | 'obs', 241 | extra_dims=1, 242 | ) 243 | 244 | # The yi values are computed separately as above and then passed to 245 | # the training functions below 246 | action = self.env.action_space.new_tensor_variable( 247 | 'action', 248 | extra_dims=1, 249 | ) 250 | 251 | yvar = tensor_utils.new_tensor( 252 | 'ys', 253 | ndim=1, 254 | dtype=tf.float32, 255 | ) 256 | 257 | qf_weight_decay_term = 0.5 * self.qf_weight_decay * \ 258 | sum([tf.reduce_sum(tf.square(param)) for param in 259 | self.qf.get_params(regularizable=True)]) 260 | 261 | qval = self.qf.get_qval_sym(obs, action) 262 | 263 | qf_loss = tf.reduce_mean(tf.square(yvar - qval)) 264 | qf_reg_loss = qf_loss + qf_weight_decay_term 265 | 266 | policy_weight_decay_term = 0.5 * self.policy_weight_decay * \ 267 | sum([tf.reduce_sum(tf.square(param)) 268 | for param in self.policy.get_params(regularizable=True)]) 269 | policy_qval = self.qf.get_qval_sym( 270 | obs, self.policy.get_action_sym(obs), 271 | deterministic=True 272 | ) 273 | policy_surr = -tf.reduce_mean(policy_qval) 274 | 275 | policy_reg_surr = policy_surr + policy_weight_decay_term 276 | 277 | qf_input_list = [yvar, obs, action] 278 | policy_input_list = [obs] 279 | 280 | self.qf_update_method.update_opt( 281 | loss=qf_reg_loss, target=self.qf, inputs=qf_input_list) 282 | self.policy_update_method.update_opt( 283 | loss=policy_reg_surr, target=self.policy, inputs=policy_input_list) 284 | 285 | f_train_qf = tensor_utils.compile_function( 286 | inputs=qf_input_list, 287 | outputs=[qf_loss, qval, self.qf_update_method._train_op], 288 | ) 289 | 290 | f_train_policy = tensor_utils.compile_function( 291 | inputs=policy_input_list, 292 | outputs=[policy_surr, self.policy_update_method._train_op], 293 | ) 294 | 295 | 
self.opt_info = dict( 296 | f_train_qf=f_train_qf, 297 | f_train_policy=f_train_policy, 298 | target_qf=target_qf, 299 | target_policy=target_policy, 300 | ) 301 | 302 | def do_training(self, itr, batch): 303 | 304 | obs, actions, rewards, next_obs, terminals = ext.extract( 305 | batch, 306 | "observations", "actions", "rewards", "next_observations", 307 | "terminals" 308 | ) 309 | 310 | # compute the on-policy y values 311 | target_qf = self.opt_info["target_qf"] 312 | target_policy = self.opt_info["target_policy"] 313 | 314 | next_actions, _ = target_policy.get_actions(next_obs) 315 | next_qvals = target_qf.get_qval(next_obs, next_actions) 316 | 317 | ys = rewards + (1. - terminals) * self.discount * next_qvals.reshape(-1) 318 | 319 | f_train_qf = self.opt_info["f_train_qf"] 320 | qf_loss, qval, _ = f_train_qf(ys, obs, actions) 321 | target_qf.set_param_values( 322 | target_qf.get_param_values() * (1.0 - self.soft_target_tau) + 323 | self.qf.get_param_values() * self.soft_target_tau) 324 | self.qf_loss_averages.append(qf_loss) 325 | self.q_averages.append(qval) 326 | self.y_averages.append(ys) 327 | 328 | self.train_policy_itr += self.policy_updates_ratio 329 | train_policy_itr = 0 330 | while self.train_policy_itr > 0: 331 | f_train_policy = self.opt_info["f_train_policy"] 332 | policy_surr, _ = f_train_policy(obs) 333 | target_policy.set_param_values( 334 | target_policy.get_param_values() * (1.0 - self.soft_target_tau) + 335 | self.policy.get_param_values() * self.soft_target_tau) 336 | self.policy_surr_averages.append(policy_surr) 337 | self.train_policy_itr -= 1 338 | train_policy_itr += 1 339 | return 1, train_policy_itr # number of itrs qf, policy are trained 340 | 341 | def evaluate(self, epoch, pool): 342 | logger.log("Collecting samples for evaluation") 343 | 344 | paths = parallel_sampler.sample_paths( 345 | policy_params=self.policy.get_param_values(), 346 | max_samples=self.eval_samples, 347 | max_path_length=self.max_path_length, 348 | ) 349 | 350 | self.env.reset() 351 | 352 | average_discounted_return = np.mean( 353 | [special.discount_return(path["rewards"], self.discount) for path in paths] 354 | ) 355 | 356 | returns = [sum(path["rewards"]) for path in paths] 357 | 358 | all_qs = np.concatenate(self.q_averages) 359 | all_ys = np.concatenate(self.y_averages) 360 | 361 | average_q_loss = np.mean(self.qf_loss_averages) 362 | average_policy_surr = np.mean(self.policy_surr_averages) 363 | average_action = np.mean(np.square(np.concatenate( 364 | [path["actions"] for path in paths] 365 | ))) 366 | 367 | policy_reg_param_norm = np.linalg.norm( 368 | self.policy.get_param_values(regularizable=True) 369 | ) 370 | qfun_reg_param_norm = np.linalg.norm( 371 | self.qf.get_param_values(regularizable=True) 372 | ) 373 | 374 | logger.record_tabular('Epoch', epoch) 375 | logger.record_tabular('Iteration', epoch) 376 | logger.record_tabular('AverageReturn', np.mean(returns)) 377 | logger.record_tabular('StdReturn', 378 | np.std(returns)) 379 | logger.record_tabular('MaxReturn', 380 | np.max(returns)) 381 | logger.record_tabular('MinReturn', 382 | np.min(returns)) 383 | if len(self.es_path_returns) > 0: 384 | logger.record_tabular('AverageEsReturn', 385 | np.mean(self.es_path_returns)) 386 | logger.record_tabular('StdEsReturn', 387 | np.std(self.es_path_returns)) 388 | logger.record_tabular('MaxEsReturn', 389 | np.max(self.es_path_returns)) 390 | logger.record_tabular('MinEsReturn', 391 | np.min(self.es_path_returns)) 392 | logger.record_tabular('AverageDiscountedReturn', 393 | 
average_discounted_return) 394 | logger.record_tabular('AverageQLoss', average_q_loss) 395 | logger.record_tabular('AveragePolicySurr', average_policy_surr) 396 | logger.record_tabular('AverageQ', np.mean(all_qs)) 397 | logger.record_tabular('AverageAbsQ', np.mean(np.abs(all_qs))) 398 | logger.record_tabular('AverageY', np.mean(all_ys)) 399 | logger.record_tabular('AverageAbsY', np.mean(np.abs(all_ys))) 400 | logger.record_tabular('AverageAbsQYDiff', 401 | np.mean(np.abs(all_qs - all_ys))) 402 | logger.record_tabular('AverageAction', average_action) 403 | 404 | logger.record_tabular('PolicyRegParamNorm', 405 | policy_reg_param_norm) 406 | logger.record_tabular('QFunRegParamNorm', 407 | qfun_reg_param_norm) 408 | 409 | self.env.log_diagnostics(paths) 410 | self.policy.log_diagnostics(paths) 411 | 412 | self.qf_loss_averages = [] 413 | self.policy_surr_averages = [] 414 | 415 | self.q_averages = [] 416 | self.y_averages = [] 417 | self.es_path_returns = [] 418 | 419 | def update_plot(self): 420 | if self.plot: 421 | plotter.update_plot(self.policy, self.max_path_length) 422 | 423 | def get_epoch_snapshot(self, epoch): 424 | return dict( 425 | env=self.env, 426 | epoch=epoch, 427 | qf=self.qf, 428 | policy=self.policy, 429 | target_qf=self.opt_info["target_qf"], 430 | target_policy=self.opt_info["target_policy"], 431 | es=self.es, 432 | ) 433 | -------------------------------------------------------------------------------- /rllabplusplus/run_ddpg.py: -------------------------------------------------------------------------------- 1 | import argparse 2 | import os.path as osp 3 | import pickle 4 | 5 | import tensorflow as tf 6 | 7 | from ddpg.ddpg import DDPG 8 | from rllab.envs.gym_env import GymEnv 9 | from rllab.envs.normalized_env import normalize 10 | from rllab.exploration_strategies.ou_strategy import OUStrategy 11 | from rllab.misc import ext 12 | from rllab.misc.instrument import run_experiment_lite, stub 13 | from sandbox.rocky.tf.envs.base import TfEnv 14 | from sandbox.rocky.tf.policies.deterministic_mlp_policy import \ 15 | DeterministicMLPPolicy 16 | from sandbox.rocky.tf.q_functions.continuous_mlp_q_function import \ 17 | ContinuousMLPQFunction 18 | from sandbox.rocky.tf.misc.tensor_utils import lrelu 19 | 20 | parser = argparse.ArgumentParser() 21 | parser.add_argument("env", help="The environment name from OpenAIGym environments") 22 | parser.add_argument("--num_epochs", default=200, type=int) 23 | parser.add_argument("--log_dir", default="./data_ddpg/") 24 | parser.add_argument("--reward_scale", default=1.0, type=float) 25 | parser.add_argument("--use_ec2", action="store_true", help="Use your ec2 instances if configured") 26 | parser.add_argument("--dont_terminate_machine", action="store_false", help="Whether to terminate your spot instance or not. 
Be careful.") 27 | parser.add_argument("--policy_size", default=[100,50,25], type=int, nargs='*') 28 | parser.add_argument("--policy_activation", default="relu", type=str) 29 | parser.add_argument("--vf_size", default=[100,50,25], type=int, nargs='*') 30 | parser.add_argument("--vf_activation", default="relu", type=str) 31 | parser.add_argument("--seed", type=int, default=0) 32 | args = parser.parse_args() 33 | 34 | stub(globals()) 35 | ext.set_seed(args.seed) 36 | 37 | gymenv = GymEnv(args.env, force_reset=True, record_video=False, record_log=False) 38 | 39 | env = TfEnv(normalize(gymenv)) 40 | 41 | activation_map = { "relu" : tf.nn.relu, "tanh" : tf.nn.tanh, "leaky_relu" : lrelu} 42 | 43 | policy = DeterministicMLPPolicy( 44 | env_spec=env.spec, 45 | name="policy", 46 | # The neural network policy should have two hidden layers, each with 32 hidden units. 47 | hidden_sizes=args.policy_size, 48 | hidden_nonlinearity=activation_map[args.policy_activation], 49 | ) 50 | 51 | es = OUStrategy(env_spec=env.spec) 52 | 53 | qf = ContinuousMLPQFunction(env_spec=env.spec, 54 | hidden_nonlinearity=activation_map[args.vf_activation], 55 | hidden_sizes=args.vf_size,) 56 | 57 | algo = DDPG( 58 | env=env, 59 | policy=policy, 60 | es=es, 61 | qf=qf, 62 | batch_size=128, 63 | max_path_length=env.horizon, 64 | epoch_length=1000, 65 | min_pool_size=10000, 66 | n_epochs=args.num_epochs, 67 | discount=0.995, 68 | scale_reward=args.reward_scale, 69 | qf_learning_rate=1e-3, 70 | policy_learning_rate=1e-4, 71 | plot=False 72 | ) 73 | 74 | 75 | run_experiment_lite( 76 | algo.train(), 77 | log_dir=None if args.use_ec2 else args.log_dir, 78 | # Number of parallel workers for sampling 79 | n_parallel=1, 80 | # Only keep the snapshot parameters for the last iteration 81 | snapshot_mode="last", 82 | # Specifies the seed for the experiment. 
If this is not provided, a random seed 83 | # will be used 84 | exp_prefix="DDPG_" + args.env, 85 | seed=args.seed, 86 | mode="ec2" if args.use_ec2 else "local", 87 | plot=False, 88 | # dry=True, 89 | terminate_machine=args.dont_terminate_machine, 90 | added_project_directories=[osp.abspath(osp.join(osp.dirname(__file__), '.'))] 91 | ) 92 | -------------------------------------------------------------------------------- /rllabplusplus/sampling_utils.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import rllab.misc.logger as logger 3 | 4 | class SimpleReplayPool(object): 5 | """ 6 | Used from https://raw.githubusercontent.com/shaneshixiang/rllabplusplus/master/rllab/pool/simple_pool.py 7 | """ 8 | def __init__( 9 | self, max_pool_size, observation_dim, action_dim, 10 | replacement_policy='stochastic', replacement_prob=1.0, 11 | max_skip_episode=10): 12 | self._observation_dim = observation_dim 13 | self._action_dim = action_dim 14 | self._max_pool_size = max_pool_size 15 | self._replacement_policy = replacement_policy 16 | self._replacement_prob = replacement_prob 17 | self._max_skip_episode = max_skip_episode 18 | self._observations = np.zeros( 19 | (max_pool_size, observation_dim), 20 | ) 21 | self._actions = np.zeros( 22 | (max_pool_size, action_dim), 23 | ) 24 | self._rewards = np.zeros(max_pool_size) 25 | self._terminals = np.zeros(max_pool_size, dtype='uint8') 26 | self._initials = np.zeros(max_pool_size, dtype='uint8') 27 | self._bottom = 0 28 | self._top = 0 29 | self._size = 0 30 | 31 | def add_sample(self, observation, action, reward, terminal, initial): 32 | self.check_replacement() 33 | self._observations[self._top] = observation 34 | self._actions[self._top] = action 35 | self._rewards[self._top] = reward 36 | self._terminals[self._top] = terminal 37 | self._initials[self._top] = initial 38 | self.advance() 39 | 40 | def check_replacement(self): 41 | if self._replacement_prob < 1.0: 42 | if self._size < self._max_pool_size or \ 43 | not self._initials[self._top]: return 44 | self.advance_until_terminate() 45 | 46 | def get_skip_flag(self): 47 | if self._replacement_policy == 'full': skip = False 48 | elif self._replacement_policy == 'stochastic': 49 | skip = np.random.uniform() > self._replacement_prob 50 | else: raise NotImplementedError 51 | return skip 52 | 53 | def advance_until_terminate(self): 54 | skip = self.get_skip_flag() 55 | n_skips = 0 56 | old_top = self._top 57 | new_top = (old_top + 1) % self._max_pool_size 58 | while skip and old_top != new_top and n_skips < self._max_skip_episode: 59 | n_skips += 1 60 | self.advance() 61 | while not self._initials[self._top]: 62 | self.advance() 63 | skip = self.get_skip_flag() 64 | new_top = self._top 65 | logger.log("add_sample, skipped %d episodes, top=%d->%d"%( 66 | n_skips, old_top, new_top)) 67 | 68 | def advance(self): 69 | self._top = (self._top + 1) % self._max_pool_size 70 | if self._size >= self._max_pool_size: 71 | self._bottom = (self._bottom + 1) % self._max_pool_size 72 | else: 73 | self._size += 1 74 | 75 | def random_batch(self, batch_size): 76 | assert self._size > batch_size 77 | indices = np.zeros(batch_size, dtype='uint64') 78 | transition_indices = np.zeros(batch_size, dtype='uint64') 79 | count = 0 80 | while count < batch_size: 81 | index = np.random.randint(self._bottom, self._bottom + self._size) % self._max_pool_size 82 | # make sure that the transition is valid: if we are at the end of the pool, we need to discard 83 | # this sample 84 
| if index == self._size - 1 and self._size <= self._max_pool_size: 85 | continue 86 | 87 | transition_index = (index + 1) % self._max_pool_size 88 | 89 | # make sure that the transition is valid: discard the transition if it crosses horizon-triggered resets 90 | if not self._terminals[index] and self._initials[transition_index]: 91 | continue 92 | indices[count] = index 93 | transition_indices[count] = transition_index 94 | count += 1 95 | return dict( 96 | observations=self._observations[indices], 97 | actions=self._actions[indices], 98 | rewards=self._rewards[indices], 99 | terminals=self._terminals[indices], 100 | initials=self._initials[indices], 101 | next_observations=self._observations[transition_indices] 102 | ) 103 | 104 | @property 105 | def size(self): 106 | return self._size 107 | -------------------------------------------------------------------------------- /tools/plot_results.py: -------------------------------------------------------------------------------- 1 | import matplotlib.pyplot as plt 2 | from scipy.ndimage.filters import uniform_filter1d 3 | import time 4 | import numpy as np 5 | import pandas as pd 6 | from itertools import cycle 7 | import argparse 8 | 9 | from numpy import genfromtxt 10 | from numpy.random import choice 11 | 12 | # Make fonts pretty 13 | plt.rcParams['text.usetex'] = True 14 | 15 | def multiple_plot(average_vals_list, std_dev_list, traj_list, other_labels, env_name, smoothing_window=5, no_show=False, ignore_std=False, climit=None, extra_lines=None): 16 | fig = plt.figure(figsize=(16, 8)) 17 | colors = ["#1f77b4", "#ff7f0e", "#d62728", "#9467bd", "#2ca02c", "#8c564b", "#e377c2", "#bcbd22", "#17becf"] 18 | linestyles = ['solid', 'dashed', 'dashdot', 'dotted'] 19 | color_index = 0 20 | ax = plt.subplot() # Defines ax variable by creating an empty plot 21 | offset = 1 22 | 23 | # Set the tick labels font 24 | for label in (ax.get_xticklabels() + ax.get_yticklabels()): 25 | label.set_fontname('Arial') 26 | label.set_fontsize(28) 27 | if traj_list is None: 28 | traj_list = [None]*len(average_vals_list) 29 | limit = climit 30 | index = 0 31 | for average_vals, std_dev, label, trajs in zip(average_vals_list, std_dev_list, other_labels[:len(average_vals_list)], traj_list): 32 | index += 1 33 | 34 | if climit is None: 35 | limit = len(average_vals) 36 | 37 | # If we don't want reward smoothing, set smoothing window to size 1 38 | rewards_smoothed_1 = uniform_filter1d(average_vals, size=smoothing_window)[:limit] 39 | std_smoothed_1 = uniform_filter1d(std_dev, size=smoothing_window)[:limit] 40 | rewards_smoothed_1 = rewards_smoothed_1[:limit] 41 | std_dev = std_dev[:limit] 42 | 43 | if trajs is None: 44 | # in this case, we just want to use algorithm iterations, so just take the number of things we have. 
45 | trajs = list(range(len(rewards_smoothed_1))) 46 | else: 47 | plt.ticklabel_format(style='sci', axis='x', scilimits=(0,0)) 48 | ax.xaxis.get_offset_text().set_fontsize(20) 49 | 50 | fill_color = colors[color_index] 51 | color_index += 1 52 | 53 | cum_rwd_1, = plt.plot(trajs, rewards_smoothed_1, label=label, color=fill_color, ls=linestyles[color_index % len(linestyles)]) 54 | offset += 3 55 | if not ignore_std: 56 | # uncomment this to use error bars 57 | #plt.errorbar(trajs[::25 + offset], rewards_smoothed_1[::25 + offset], yerr=std_smoothed_1[::25 + offset], linestyle='None', color=fill_color, capsize=5) 58 | plt.fill_between(trajs, rewards_smoothed_1 + std_smoothed_1, rewards_smoothed_1 - std_smoothed_1, alpha=0.3, edgecolor=fill_color, facecolor=fill_color) 59 | 60 | if extra_lines: 61 | for lin in extra_lines: 62 | plt.plot(trajs, np.repeat(lin, len(rewards_smoothed_1)), linestyle='-.', color = colors[color_index], linewidth=2.5, label=other_labels[index]) 63 | color_index += 1 64 | index += 1 65 | 66 | axis_font = {'fontname':'Arial', 'size':'32'} 67 | plt.legend(loc='lower right', prop={'size' : 16}) 68 | plt.xlabel("Iterations", **axis_font) 69 | if traj_list is not None and traj_list[0] is not None: 70 | plt.ticklabel_format(style='sci', axis='x', scilimits=(0,0)) 71 | ax.xaxis.get_offset_text().set_fontsize(20) 72 | plt.xlabel("Timesteps", **axis_font) 73 | else: 74 | plt.xlabel("Iterations", **axis_font) 75 | plt.ylabel("Average Return", **axis_font) 76 | plt.title("%s"% env_name, **axis_font) 77 | 78 | if no_show: 79 | fig.savefig('%s.pdf' % env_name, dpi=fig.dpi, bbox_inches='tight') 80 | else: 81 | plt.show() 82 | 83 | return fig 84 | 85 | parser = argparse.ArgumentParser() 86 | parser.add_argument("paths_to_progress_csvs", nargs="+", help="All the csvs associated with the data (Need AverageReturn, StdReturn, and TimestepsSoFar columns)") 87 | parser.add_argument("env_name", help= "This is just the title of the plot and the filename." ) 88 | parser.add_argument("--save", action="store_true") 89 | parser.add_argument("--ignore_std", action="store_true") 90 | parser.add_argument('--labels', nargs='+', help='List of labels to go along with the lines to plot', required=False) 91 | parser.add_argument('--smoothing_window', default=1, type=int, help="Running average to smooth with, default is 1 (i.e. 
no smoothing)") 92 | parser.add_argument('--limit', default=None, type=int) 93 | parser.add_argument('--extra_lines', nargs="+", type=float, help="Any extra lines to add on the graph (such as other paper resulls)") 94 | 95 | args = parser.parse_args() 96 | 97 | avg_rets = [] 98 | std_dev_rets = [] 99 | trajs = [] 100 | 101 | # Read all csv's into arrays 102 | for o in args.paths_to_progress_csvs: 103 | data = pd.read_csv(o) 104 | avg_ret = np.array(data["AverageReturn"]) # averge return across trials 105 | std_dev_ret = np.array(data["StdReturn"]) # standard error across trials 106 | if "total/steps" in data and trajs is not None: 107 | trajs.append(np.array(data["total/steps"])) 108 | elif "TimestepsSoFar" in data and trajs is not None: 109 | trajs.append(np.array(data["TimestepsSoFar"])) 110 | else: 111 | trajs = None 112 | avg_rets.append(avg_ret) 113 | std_dev_rets.append(std_dev_ret) 114 | 115 | # call plotting script 116 | multiple_plot(avg_rets, std_dev_rets, trajs, args.labels, args.env_name, smoothing_window=args.smoothing_window, no_show=args.save, ignore_std=args.ignore_std, climit=args.limit, extra_lines=args.extra_lines) 117 | -------------------------------------------------------------------------------- /tools/run_multiple.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | 4 | die() { echo "$@" 1>&2 ; exit 1; } 5 | 6 | # TODO: make this proper usage script 7 | 8 | if [ $# -le 4 ] 9 | then 10 | die "Usage: bash $0 num_experiments num_experiments_in_parallel run_script [--all --other --args --to --run-script]" 11 | fi 12 | 13 | # check whether user had supplied -h or --help . If yes display usage 14 | if [[ ( $# == "--help") || $# == "-h" ]] 15 | then 16 | die "Usage: bash $0 num_experiments num_experiments_in_parallel run_script [--all --other --args --to --run_script]" 17 | fi 18 | 19 | num_experiments=$1 20 | parallel_exps=$2 21 | log_prefix=$3 22 | run_script=$4 23 | 24 | pickle_files=() 25 | 26 | mkdir -p ./$log_prefix/ 27 | 28 | trap 'jobs -p | xargs kill' EXIT 29 | 30 | for (( c=1; c<=$num_experiments; )) 31 | do 32 | for (( j=1; j<=$parallel_exps; j++ )) 33 | do 34 | echo "Launching experiment $c" 35 | mkdir -p ./$log_prefix/exp_$c/ 36 | python3 $run_script --seed $c --log_dir ./$log_prefix/exp_$c/ "${@:5}" &> ./$log_prefix/exp_$c.log & 37 | #pickle_files=("${pickle_files[@]}" "exp_$c.pickle") 38 | c=$((c+1)) 39 | done 40 | wait 41 | done 42 | 43 | #python create_graphs_from_pickle.py "${pickle_files[@]}" 44 | 45 | -------------------------------------------------------------------------------- /tools/run_policy_experiment_set.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | 4 | die() { echo "$@" 1>&2 ; exit 1; } 5 | 6 | # TODO: make this proper usage script 7 | 8 | # check whether user had supplied -h or --help . 
If yes display usage
9 | if [[ ( $# == "--help") || $# == "-h" ]]
10 | then
11 | die "Usage: bash $0 num_experiments num_experiments_in_parallel run_script [--all --other --args --to --run_script]"
12 | fi
13 | 
14 | script=$1
15 | env=$2
16 | 
17 | 
18 | # default (6464tanh)
19 | # activations
20 | bash run_multiple.sh 5 1 ${env}_default $script ${env} &> ${env}default.log &
21 | bash run_multiple.sh 5 1 ${env}_vfleakrelu $script ${env} --activation_vf leaky_relu &> ${env}vfleakrelu.log
22 | bash run_multiple.sh 5 1 ${env}_vfrelu $script ${env} --activation_vf relu &> ${env}vfrelu.log &
23 | bash run_multiple.sh 5 1 ${env}_policyrelu $script ${env} --activation_policy relu &> ${env}policyrelu.log &
24 | bash run_multiple.sh 5 1 ${env}_policyleakyrelu $script ${env} --activation_policy leaky_relu &> ${env}policyleakyrelu.log
25 | 
26 | #
27 | bash run_multiple.sh 5 1 ${env}_400300 $script $env --policy_size 400 300 &> ${env}_400300tanh.log &
28 | bash run_multiple.sh 5 1 ${env}_1005025 $script $env --policy_size 100 50 25 &> ${env}_1005025tanh.log &
29 | bash run_multiple.sh 5 1 ${env}_vf1005025 $script $env --value_func_size 100 50 25 &> ${env}_vf1005025tanh.log &
30 | bash run_multiple.sh 5 1 ${env}_vf400300 $script $env --value_func_size 400 300 &> ${env}_vf400300tanh.log
31 | 
--------------------------------------------------------------------------------
/tools/significance_and_bootstrap.py:
--------------------------------------------------------------------------------
1 | import scipy.stats as stats
2 | import random
3 | import pandas as pd
4 | import numpy as np
5 | import bootstrapped.bootstrap as bas
6 | import bootstrapped.stats_functions as bs_stats
7 | import bootstrapped.compare_functions as bs_compare
8 | import bootstrapped.power as bs_power
9 | from scipy.stats import ks_2samp
10 | import argparse
11 | 
12 | parser = argparse.ArgumentParser()
13 | parser.add_argument("numpy_arrays", nargs="+", help="Assumes results returned in N X M array. N=number of iterations, M = number of trials")
14 | 
15 | args = parser.parse_args()
16 | 
17 | assert len(args.numpy_arrays) == 2
18 | 
19 | data = np.load(args.numpy_arrays[0])
20 | data2 = np.load(args.numpy_arrays[1])
21 | 
22 | """
23 | We assume that the final evaluation iteration in the numpy array is the one to be analyzed.
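Each array is expected to have shape N x M (N evaluation iterations by M trials), as described in the argument help above.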
24 | """ 25 | 26 | final_average_return = np.array(sorted(data[-1])) 27 | final_average_return2 = np.array(sorted(data2[-1])) 28 | 29 | print("***********************************************") 30 | print("Average Returns 1: ", final_average_return) 31 | print("Average Returns 2", final_average_return2) 32 | print("***********************************************") 33 | 34 | #print(t_test(len(final_average_return), np.array(final_average_return) - np.array(final_average_return2), 2e6, 5000)) 35 | print("t-test", stats.ttest_ind(final_average_return, final_average_return2)) 36 | print("ks-2sample", ks_2samp(final_average_return, final_average_return2)) 37 | 38 | def run_simulation(data): 39 | lift = 1.25 40 | results = [] 41 | for i in range(3000): 42 | random.shuffle(data) 43 | test = data[:len(data)/2] * lift 44 | ctrl = data[len(data)/2:] 45 | results.append(bas.bootstrap_ab(test, ctrl, bs_stats.mean, bs_compare.percent_change)) 46 | return results 47 | 48 | def run_simulation2(data, data2): 49 | results = [] 50 | for i in range(3000): 51 | results.append(bas.bootstrap_ab(data, data2, bs_stats.mean, bs_compare.percent_change)) 52 | return results 53 | 54 | print("bootstrap a/b", bas.bootstrap_ab(final_average_return, final_average_return2, bs_stats.mean, bs_compare.percent_change)) 55 | bab= bas.bootstrap_ab(final_average_return, final_average_return2, bs_stats.mean, bs_compare.percent_change) 56 | x = run_simulation2(final_average_return, final_average_return2) 57 | bootstrap_ab = bs_power.power_stats(x) 58 | print("power analysis bootstrap a/b") 59 | print(bootstrap_ab) 60 | 61 | print("***********************************************") 62 | print("Bootstrap analysis") 63 | print("***********************************************") 64 | print("***Arg1****") 65 | sim = bas.bootstrap(final_average_return, stat_func=bs_stats.mean) 66 | print("%.2f (%.2f, %.2f)" % (sim.value, sim.lower_bound, sim.upper_bound)) 67 | print("***Arg2****") 68 | sim = bas.bootstrap(final_average_return2, stat_func=bs_stats.mean) 69 | print("%.2f (%.2f, %.2f)" % (sim.value, sim.lower_bound, sim.upper_bound)) 70 | print("***********************************************") 71 | 72 | 73 | print("***********************************************") 74 | print("****Significance of lift analysis****") 75 | print("***********************************************") 76 | print("***Power1****") 77 | sim = run_simulation(final_average_return) 78 | sim = bs_power.power_stats(sim) 79 | print(sim) 80 | print("\\shortstack{%.2f \\%%\\\\%.2f \\%% \\\\ %.2f \\%%}" % (sim.transpose()["Insignificant"], sim.transpose()["Positive Significant"], sim.transpose()["Negative Significant"])) 81 | print("***Power2****") 82 | sim = run_simulation(final_average_return2) 83 | sim = bs_power.power_stats(sim) 84 | print(sim) 85 | #sim = bas.bootstrap(final_average_return2, stat_func=bs_stats.mean) 86 | print("\\shortstack{%.2f \\%% \\\\ %.2f \\%% \\\\ %.2f \\%%}" % (sim.transpose()["Insignificant"], sim.transpose()["Positive Significant"], sim.transpose()["Negative Significant"])) 87 | print("***********************************************") 88 | 89 | 90 | print("***********************************************") 91 | print("Significance analysis, t-test, ks test and bootstrap a/b") 92 | print("***********************************************") 93 | t,p = stats.ttest_ind(final_average_return, final_average_return2) 94 | ks,kp = ks_2samp(final_average_return, final_average_return2) 95 | 96 | a, b,c = bab.value, bab.lower_bound, bab.upper_bound 97 | 98 | if 
bootstrap_ab.transpose()["Positive Significant"][0] >= bootstrap_ab.transpose()["Negative Significant"][0]: 99 | boot = "+%.2f" % bootstrap_ab.transpose()["Positive Significant"] 100 | else: 101 | boot = "-%.2f" % bootstrap_ab.transpose()["Negative Significant"] 102 | print("\\shortstack{$t=%.2f,p=%.3f$\\\\$KS=%.2f,p=%.3f$\\\\%.2f \\%% (%.2f \\%%, %.2f \\%%) }" % (t,p,ks,kp,a,b,c)) 103 | print("***********************************************") 104 | --------------------------------------------------------------------------------