├── .gitignore ├── README.md ├── args ├── simple_spread_coma_fc.py ├── simple_spread_independent_ac.py ├── simple_spread_independent_ddpg.py ├── simple_spread_maddpg.py ├── simple_spread_sqddpg.py ├── simple_tag_coma_fc.py ├── simple_tag_independent_ac.py ├── simple_tag_independent_ddpg.py ├── simple_tag_maddpg.py ├── simple_tag_sqddpg.py ├── traffic_junction_coma_fc.py ├── traffic_junction_independent_ac.py ├── traffic_junction_independent_ddpg.py ├── traffic_junction_maddpg.py └── traffic_junction_sqddpg.py ├── aux.py ├── environments ├── multiagent_particle_envs │ ├── .gitignore │ ├── LICENSE.txt │ ├── README.md │ ├── bin │ │ ├── __init__.py │ │ └── interactive.py │ ├── make_env.py │ ├── multiagent │ │ ├── __init__.py │ │ ├── core.py │ │ ├── environment.py │ │ ├── multi_discrete.py │ │ ├── policy.py │ │ ├── rendering.py │ │ ├── scenario.py │ │ └── scenarios │ │ │ ├── __init__.py │ │ │ ├── simple.py │ │ │ ├── simple_adversary.py │ │ │ ├── simple_crypto.py │ │ │ ├── simple_push.py │ │ │ ├── simple_reference.py │ │ │ ├── simple_speaker_listener.py │ │ │ ├── simple_spread.py │ │ │ ├── simple_tag.py │ │ │ └── simple_world_comm.py │ └── setup.py ├── predator_prey_env.py ├── traffic_helper.py └── traffic_junction_env.py ├── figures ├── dynamics_131.pdf ├── dynamics_135.pdf ├── dynamics_136.pdf ├── dynamics_19.pdf ├── dynamics_38.png ├── easy_reward.pdf ├── easy_road.pdf ├── easy_success.pdf ├── hard_reward.pdf ├── hard_road.pdf ├── hard_success.pdf ├── medium_reward.pdf ├── medium_road.pdf ├── medium_success.pdf ├── simple_spread_mean_reward.png ├── simple_tag_turn.png └── venn.png ├── learning_algorithms ├── actor_critic.py ├── ddpg.py └── rl_algorithms.py ├── models ├── coma_fc.py ├── independent_ac.py ├── independent_ddpg.py ├── maddpg.py ├── model.py ├── random.py └── sqddpg.py ├── test.py ├── test.sh ├── train.py ├── train.sh └── utilities ├── gym_wrapper.py ├── inspector.py ├── logger.py ├── replay_buffer.py ├── tester.py ├── trainer.py └── util.py /.gitignore: -------------------------------------------------------------------------------- 1 | logs/ 2 | model_save/ 3 | __pycache__/ 4 | learning_algorithms/__pycache__/ 5 | models/__pycache__/ 6 | utilities/__pycache__/ 7 | environments/__pycache__/ 8 | tensorboard/ 9 | arguments.py 10 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Shapley Q-value: A Local Reward Approach to Solve Global Reward Games 2 | 3 | | :exclamation: News | 4 | |:-----------------------------------------| 5 | |The Jax version of SQDDPG was implemented in [the repository of SHAQ](https://github.com/hsvgbkhgbv/shapley-q-learning) under the framework of PyMARL, to adapt to the environment of SMAC and some related environments.| 6 | 7 | ## Dependencies 8 | This project implements the algorithm of Shapley Q-value deep deterministic policy gradient (SQDDPG) mentioned in the paper accpted by AAAI2020 (Oral):https://arxiv.org/abs/1907.05707 and demonstrates the experiments in comparison with Independent DDPG, Independent A2C, MADDPG and COMA. 9 | 10 | The code is running on Ubuntu 18.04 with Python (3.5.4) and Pytorch (1.0). 11 | 12 | The suggestion is installing Anaconda 3 with Python (3.5.4): https://www.anaconda.com/download/. 13 | To enable the experimantal environments, please install OpenAI Gym (0.10.5) and Numpy (1.14.5). 14 | To use Tensorboard to monitor the training process, please install Tensorflow (r1.14). 
After installing the dependencies mentioned above, open a terminal and execute the following commands:
```bash
cd SQDDPG/environments/multiagent_particle_envs/
pip install -e .
```

Now the dependencies for running the code are installed.

## Running Code for Experiments
The experiments on Cooperative Navigation and Prey-and-Predator mentioned in the paper are based on the environments from https://github.com/openai/multiagent-particle-envs, i.e., simple_spread and simple_tag. For convenience, we merge this repository into our framework, with slight modifications to the scenario simple_tag.

The environment for the Traffic Junction experiment is from https://github.com/IC3Net/IC3Net/tree/master/ic3net-envs/ic3net_envs. For convenience, we also include it in our framework.

### Training
To make training easy to run, we provide argument files for each combination of experiment and method under the directory `args`, together with a bash script that launches an experiment with the chosen arguments.

For example, to run the simple_tag experiment with SQDDPG, edit the file `simple_tag_sqddpg.py` to change the hyperparameters. Then edit `train.sh`, setting the variable `EXP_NAME` to `"simple_tag_sqddpg"` and the variable `CUDA_VISIBLE_DEVICES` to the ID of the GPU you would like to use, e.g. 0 here:
```bash
# !/bin/bash
# sh train.sh

EXP_NAME="simple_tag_sqddpg"
ALIAS=""
export CUDA_DEVICE_ORDER=PCI_BUS_ID
export CUDA_VISIBLE_DEVICES=0

if [ ! -d "./model_save" ]
then
    mkdir ./model_save
fi

mkdir ./model_save/$EXP_NAME$ALIAS
cp ./args/$EXP_NAME.py arguments.py
python -u train.py > ./model_save/$EXP_NAME$ALIAS/exp.out &
echo $! > ./model_save/$EXP_NAME$ALIAS/exp.pid
```

If necessary, you can also edit the variable `ALIAS` to distinguish runs with different hyperparameters.
Now the experiment can be launched with the bash script:
```bash
source train.sh
```

### Testing
For testing, we provide a script `test.py` with the following arguments:
```bash
--save-model-dir # the path where the trained model was saved
--render         # whether the visualization is needed
--episodes       # the number of episodes to run for the test
```
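As an illustration, evaluating the model trained by the `simple_tag_sqddpg` configuration above for 10 rendered episodes could look like the command below. The model directory follows the `./model_save/$EXP_NAME$ALIAS` convention used by `train.sh`; check `test.py` (or the provided `test.sh`) for the exact flag spelling and whether `--render` takes a value.
```bash
python test.py --save-model-dir ./model_save/simple_tag_sqddpg --render --episodes 10
```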
### Experimental Results

#### Cooperative Navigation

*(Figure)* Mean reward per episode during training in Cooperative Navigation. SQDDPG(n) denotes SQDDPG with a sample size of n (i.e., M in Eq. 8 of the paper). Since only the sample size of 1 is run in the remaining experiments, we simply write SQDDPG for SQDDPG(1).
#### Prey-and-Predator

*(Figure)* Turns taken to capture the prey per episode during training in Prey-and-Predator. SQDDPG uses a sample size of 1 in this experiment.
*(Figure)* Credit assignment to each predator for a fixed trajectory. The leftmost panel records a trajectory sampled by an expert policy: a square marks each agent's initial position, a circle its final position, and the dots along the trajectory its intermediate positions. The other panels show the normalized credit assignments generated by different MARL algorithms for this trajectory. SQDDPG uses a sample size of 1 in this experiment.
#### Traffic Junction

| Difficulty | IA2C | IDDPG | COMA | MADDPG | SQDDPG |
|------------|--------|--------|--------|------------|------------|
| Easy | 65.01% | 93.08% | 93.01% | **93.72%** | 93.26% |
| Medium | 67.51% | 84.16% | 82.48% | 87.92% | **88.98%** |
| Hard | 60.89% | 64.99% | 85.33% | 84.21% | **87.04%** |

The success rate on Traffic Junction, tested with 20, 40 and 60 steps per episode in the easy, medium and hard versions respectively. The results are obtained by running each algorithm after training for 1000 episodes.

## Extension of the Framework
This framework can easily be extended by adding extra environments implemented in OpenAI Gym or new multi-agent algorithms implemented in Pytorch. To add an algorithm, you only need to inherit the base class in `models/model.py` and implement the following functions:
```python
construct_model(self)
policy(self, obs, last_act=None, last_hid=None, gate=None, info={}, stat={})
value(self, obs, act)
construct_policy_net(self)
construct_value_net(self)
get_loss(self)
```

After implementing the class of your own method, register the algorithm in the file `aux.py`. For example, if the algorithm is called sqddpg and the corresponding class is called `SQDDPG`, the registration looks as below:
```python
sqddpgArgs = namedtuple( 'sqddpgArgs', ['sample_size'] ) # define the exclusive hyperparameters of this algorithm
Model = dict(...,
             ...,
             sqddpg=SQDDPG
             ) # register the handle of the corresponding class of this algorithm
AuxArgs = dict(...,
               ...,
               sqddpg=sqddpgArgs
               ) # register the exclusive args of this algorithm
Strategy = dict(...,
                ...,
                sqddpg='pg'
                ) # register the training strategy of this algorithm, e.g., 'pg' or 'q'
```

Moreover, you can optionally define restrictions for your algorithm in `utilities/inspector.py` to catch mis-defined hyperparameters:
```python
if ... ...:
    ... ... ... ...
elif args.model_name == 'sqddpg':
    assert args.replay is True
    assert args.q_func is True
    assert args.target is True
    assert args.gumbel_softmax is True
    assert args.epsilon_softmax is False
    assert args.online is True
    assert hasattr(args, 'sample_size')
```

Finally, you can add auxiliary functions under the directory `utilities`.

At present, this framework only supports policy gradient methods. Support for value-based methods is under test and will be available soon.
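To make the extension steps above concrete, the following sketch outlines how a new algorithm (here called `my_algo`, a hypothetical name) could be defined and registered. It assumes the base class in `models/model.py` is named `Model`, is an `nn.Module`, takes `args` in its constructor and stores it as `self.args`; check the actual base class for the exact signatures before relying on this.
```python
from collections import namedtuple

import torch
import torch.nn as nn

from models.model import Model  # assumed name of the base class in models/model.py


class MyAlgo(Model):
    """Skeleton of a new algorithm; names and layer sizes are illustrative only."""

    def __init__(self, args):
        super(MyAlgo, self).__init__(args)

    def construct_model(self):
        # build the policy and value networks
        self.construct_policy_net()
        self.construct_value_net()

    def construct_policy_net(self):
        # map an observation to action logits (a single linear layer for brevity)
        self.policy_net = nn.Linear(self.args.obs_size, self.args.action_dim)

    def construct_value_net(self):
        # map an observation-action pair to a scalar value
        self.value_net = nn.Linear(self.args.obs_size + self.args.action_dim, 1)

    def policy(self, obs, last_act=None, last_hid=None, gate=None, info={}, stat={}):
        return self.policy_net(obs)

    def value(self, obs, act):
        return self.value_net(torch.cat([obs, act], dim=-1))

    def get_loss(self):
        # return the losses expected by the chosen training strategy
        raise NotImplementedError


# entries to add in aux.py ('my_algo' is a hypothetical key):
myalgoArgs = namedtuple('myalgoArgs', [])
# Model    -> my_algo=MyAlgo
# AuxArgs  -> my_algo=myalgoArgs
# Strategy -> my_algo='pg'
```
An argument file under `args/` would then set `model_name = 'my_algo'` and construct `aux_args = AuxArgs[model_name]()`, mirroring the existing argument files.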
146 | 147 | ## Citation 148 | If you use the framework or part of the work mentioned in the paper, please cite: 149 | ``` 150 | @article{Wang_2020, 151 | title={Shapley Q-Value: A Local Reward Approach to Solve Global Reward Games}, 152 | volume={34}, 153 | ISSN={2159-5399}, 154 | url={http://dx.doi.org/10.1609/aaai.v34i05.6220}, 155 | DOI={10.1609/aaai.v34i05.6220}, 156 | number={05}, 157 | journal={Proceedings of the AAAI Conference on Artificial Intelligence}, 158 | publisher={Association for the Advancement of Artificial Intelligence (AAAI)}, 159 | author={Wang, Jianhong and Zhang, Yuan and Kim, Tae-Kyun and Gu, Yunjie}, 160 | year={2020}, 161 | month={Apr}, 162 | pages={7285–7292} 163 | } 164 | ``` 165 | -------------------------------------------------------------------------------- /args/simple_spread_coma_fc.py: -------------------------------------------------------------------------------- 1 | from collections import namedtuple 2 | from multiagent.environment import MultiAgentEnv 3 | import multiagent.scenarios as scenario 4 | from utilities.gym_wrapper import * 5 | import numpy as np 6 | from aux import * 7 | 8 | 9 | '''define the model name''' 10 | model_name = 'coma_fc' 11 | 12 | '''define the scenario name''' 13 | scenario_name = 'simple_spread' 14 | 15 | '''define the special property''' 16 | # independentArgs = namedtuple( 'independentArgs', [] ) 17 | aux_args = AuxArgs[model_name]() 18 | alias = '_new_1' 19 | 20 | '''load scenario from script''' 21 | scenario = scenario.load(scenario_name+".py").Scenario() 22 | 23 | '''create world''' 24 | world = scenario.make_world() 25 | 26 | '''create multiagent environment''' 27 | env = MultiAgentEnv(world, scenario.reset_world, scenario.reward, scenario.observation, info_callback=None, shared_viewer=True) 28 | env = GymWrapper(env) 29 | 30 | MergeArgs = namedtuple('MergeArgs', Args._fields+AuxArgs[model_name]._fields) 31 | 32 | # under offline trainer if set batch_size=replay_buffer_size=update_freq -> epoch update 33 | args = Args(model_name=model_name, 34 | agent_num=env.get_num_of_agents(), 35 | hid_size=32, 36 | obs_size=np.max(env.get_shape_of_obs()), 37 | continuous=False, 38 | action_dim=np.max(env.get_output_shape_of_act()), 39 | init_std=0.1, 40 | policy_lrate=1e-2, 41 | value_lrate=1e-4, 42 | max_steps=200, 43 | batch_size=100, 44 | gamma=0.9, 45 | normalize_advantages=False, 46 | entr=1e-2, 47 | entr_inc=0.0, 48 | action_num=np.max(env.get_input_shape_of_act()), 49 | q_func=True, 50 | train_episodes_num=int(5e3), 51 | replay=True, 52 | replay_buffer_size=1e4, 53 | replay_warmup=0, 54 | cuda=True, 55 | grad_clip=True, 56 | save_model_freq=10, 57 | target=True, 58 | target_lr=1e-1, 59 | behaviour_update_freq=100, 60 | critic_update_times=10, 61 | target_update_freq=200, 62 | gumbel_softmax=False, 63 | epsilon_softmax=False, 64 | online=True, 65 | reward_record_type='episode_mean_step', 66 | shared_parameters=True 67 | ) 68 | 69 | args = MergeArgs(*(args+aux_args)) 70 | 71 | log_name = scenario_name + '_' + model_name + alias 72 | -------------------------------------------------------------------------------- /args/simple_spread_independent_ac.py: -------------------------------------------------------------------------------- 1 | from collections import namedtuple 2 | from multiagent.environment import MultiAgentEnv 3 | import multiagent.scenarios as scenario 4 | from utilities.gym_wrapper import * 5 | import numpy as np 6 | from aux import * 7 | 8 | 9 | 10 | '''define the model name''' 11 | model_name = 'independent_ac' 12 
| 13 | '''define the scenario name''' 14 | scenario_name = 'simple_spread' 15 | 16 | '''define the special property''' 17 | # independentArgs = namedtuple( 'independentArgs', [] ) 18 | aux_args = AuxArgs[model_name]() 19 | alias = '_new_1' 20 | 21 | '''load scenario from script''' 22 | scenario = scenario.load(scenario_name+".py").Scenario() 23 | 24 | '''create world''' 25 | world = scenario.make_world() 26 | 27 | '''create multiagent environment''' 28 | env = MultiAgentEnv(world, scenario.reset_world, scenario.reward, scenario.observation, info_callback=None, shared_viewer=True) 29 | env = GymWrapper(env) 30 | 31 | MergeArgs = namedtuple('MergeArgs', Args._fields+AuxArgs[model_name]._fields) 32 | 33 | # under offline trainer if set batch_size=replay_buffer_size=update_freq -> epoch update 34 | args = Args(model_name=model_name, 35 | agent_num=env.get_num_of_agents(), 36 | hid_size=32, 37 | obs_size=np.max(env.get_shape_of_obs()), 38 | continuous=False, 39 | action_dim=np.max(env.get_output_shape_of_act()), 40 | init_std=0.1, 41 | policy_lrate=1e-6, 42 | value_lrate=1e-5, 43 | max_steps=200, 44 | batch_size=100, 45 | gamma=0.9, 46 | normalize_advantages=False, 47 | entr=1e-2, 48 | entr_inc=0.0, 49 | action_num=np.max(env.get_input_shape_of_act()), 50 | q_func=True, 51 | train_episodes_num=int(5e3), 52 | replay=True, 53 | replay_buffer_size=1e4, 54 | replay_warmup=0, 55 | cuda=True, 56 | grad_clip=True, 57 | save_model_freq=10, 58 | target=True, 59 | target_lr=1e-1, 60 | behaviour_update_freq=100, 61 | critic_update_times=10, 62 | target_update_freq=200, 63 | gumbel_softmax=False, 64 | epsilon_softmax=False, 65 | online=True, 66 | reward_record_type='episode_mean_step', 67 | shared_parameters=False 68 | ) 69 | 70 | args = MergeArgs(*(args+aux_args)) 71 | 72 | log_name = scenario_name + '_' + model_name + alias 73 | -------------------------------------------------------------------------------- /args/simple_spread_independent_ddpg.py: -------------------------------------------------------------------------------- 1 | from collections import namedtuple 2 | from multiagent.environment import MultiAgentEnv 3 | import multiagent.scenarios as scenario 4 | from utilities.gym_wrapper import * 5 | import numpy as np 6 | from aux import * 7 | 8 | 9 | 10 | '''define the model name''' 11 | model_name = 'independent_ddpg' 12 | 13 | '''define the scenario name''' 14 | scenario_name = 'simple_spread' 15 | 16 | '''define the special property''' 17 | # independentArgs = namedtuple( 'independentArgs', [] ) 18 | aux_args = AuxArgs[model_name]() 19 | alias = '_new_6' 20 | 21 | '''load scenario from script''' 22 | scenario = scenario.load(scenario_name+".py").Scenario() 23 | 24 | '''create world''' 25 | world = scenario.make_world() 26 | 27 | '''create multiagent environment''' 28 | env = MultiAgentEnv(world, scenario.reset_world, scenario.reward, scenario.observation, info_callback=None, shared_viewer=True) 29 | env = GymWrapper(env) 30 | 31 | MergeArgs = namedtuple('MergeArgs', Args._fields+AuxArgs[model_name]._fields) 32 | 33 | # under offline trainer if set batch_size=replay_buffer_size=update_freq -> epoch update 34 | args = Args(model_name=model_name, 35 | agent_num=env.get_num_of_agents(), 36 | hid_size=32, 37 | obs_size=np.max(env.get_shape_of_obs()), 38 | continuous=False, 39 | action_dim=np.max(env.get_output_shape_of_act()), 40 | init_std=0.1, 41 | policy_lrate=1e-3, 42 | value_lrate=1e-2, 43 | max_steps=200, 44 | batch_size=32, 45 | gamma=0.9, 46 | normalize_advantages=False, 47 | entr=1e-2, 48 | 
entr_inc=0.0, 49 | action_num=np.max(env.get_input_shape_of_act()), 50 | q_func=False, 51 | train_episodes_num=int(5e3), 52 | replay=True, 53 | replay_buffer_size=1e4, 54 | replay_warmup=0, 55 | cuda=True, 56 | grad_clip=True, 57 | save_model_freq=10, 58 | target=True, 59 | target_lr=1e-1, 60 | behaviour_update_freq=100, 61 | critic_update_times=10, 62 | target_update_freq=200, 63 | gumbel_softmax=True, 64 | epsilon_softmax=False, 65 | online=True, 66 | reward_record_type='episode_mean_step', 67 | shared_parameters=False 68 | ) 69 | 70 | args = MergeArgs(*(args+aux_args)) 71 | 72 | log_name = scenario_name + '_' + model_name + alias 73 | -------------------------------------------------------------------------------- /args/simple_spread_maddpg.py: -------------------------------------------------------------------------------- 1 | from collections import namedtuple 2 | from multiagent.environment import MultiAgentEnv 3 | import multiagent.scenarios as scenario 4 | from utilities.gym_wrapper import * 5 | import numpy as np 6 | from aux import * 7 | 8 | 9 | 10 | '''define the model name''' 11 | model_name = 'maddpg' 12 | 13 | '''define the scenario name''' 14 | scenario_name = 'simple_spread' 15 | 16 | '''define the special property''' 17 | # maddpgArgs = namedtuple( 'maddpgArgs', [] ) 18 | aux_args = AuxArgs[model_name]() 19 | alias = '_new_3' 20 | 21 | '''load scenario from script''' 22 | scenario = scenario.load(scenario_name+".py").Scenario() 23 | 24 | '''create world''' 25 | world = scenario.make_world() 26 | 27 | '''create multiagent environment''' 28 | env = MultiAgentEnv(world, scenario.reset_world, scenario.reward, scenario.observation, info_callback=None, shared_viewer=True) 29 | env = GymWrapper(env) 30 | 31 | MergeArgs = namedtuple('MergeArgs', Args._fields+AuxArgs[model_name]._fields) 32 | 33 | # under offline trainer if set batch_size=replay_buffer_size=update_freq -> epoch update 34 | args = Args(model_name=model_name, 35 | agent_num=env.get_num_of_agents(), 36 | hid_size=32, 37 | obs_size=np.max(env.get_shape_of_obs()), 38 | continuous=False, 39 | action_dim=np.max(env.get_output_shape_of_act()), 40 | init_std=0.1, 41 | policy_lrate=1e-4, 42 | value_lrate=1e-3, 43 | max_steps=200, 44 | batch_size=32, 45 | gamma=0.9, 46 | normalize_advantages=False, 47 | entr=1e-3, 48 | entr_inc=0.0, 49 | action_num=np.max(env.get_input_shape_of_act()), 50 | q_func=True, 51 | train_episodes_num=int(5e3), 52 | replay=True, 53 | replay_buffer_size=1e4, 54 | replay_warmup=0, 55 | cuda=True, 56 | grad_clip=True, 57 | save_model_freq=10, 58 | target=True, 59 | target_lr=1e-1, 60 | behaviour_update_freq=100, 61 | critic_update_times=10, 62 | target_update_freq=200, 63 | gumbel_softmax=True, 64 | epsilon_softmax=False, 65 | online=True, 66 | reward_record_type='episode_mean_step', 67 | shared_parameters=False 68 | ) 69 | 70 | args = MergeArgs(*(args+aux_args)) 71 | 72 | log_name = scenario_name + '_' + model_name + alias 73 | -------------------------------------------------------------------------------- /args/simple_spread_sqddpg.py: -------------------------------------------------------------------------------- 1 | from collections import namedtuple 2 | from multiagent.environment import MultiAgentEnv 3 | import multiagent.scenarios as scenario 4 | from utilities.gym_wrapper import * 5 | import numpy as np 6 | from aux import * 7 | 8 | 9 | 10 | '''define the model name''' 11 | model_name = 'sqddpg' 12 | 13 | '''define the scenario name''' 14 | scenario_name = 'simple_spread' 15 | 16 | '''define 
the special property''' 17 | # sqddpgArgs = namedtuple( 'sqddpgArgs', ['sample_size'] ) 18 | aux_args = AuxArgs[model_name](5) 19 | alias = '_new_sample_12' 20 | 21 | '''load scenario from script''' 22 | scenario = scenario.load(scenario_name+".py").Scenario() 23 | 24 | '''create world''' 25 | world = scenario.make_world() 26 | 27 | '''create multiagent environment''' 28 | env = MultiAgentEnv(world, scenario.reset_world, scenario.reward, scenario.observation, info_callback=None, shared_viewer=True) 29 | env = GymWrapper(env) 30 | 31 | MergeArgs = namedtuple('MergeArgs', Args._fields+AuxArgs[model_name]._fields) 32 | 33 | # under offline trainer if set batch_size=replay_buffer_size=update_freq -> epoch update 34 | args = Args(model_name=model_name, 35 | agent_num=env.get_num_of_agents(), 36 | hid_size=32, 37 | obs_size=np.max(env.get_shape_of_obs()), 38 | continuous=False, 39 | action_dim=np.max(env.get_output_shape_of_act()), 40 | init_std=0.1, 41 | policy_lrate=1e-4, 42 | value_lrate=1e-3, 43 | max_steps=200, 44 | batch_size=32, 45 | gamma=0.9, 46 | normalize_advantages=False, 47 | entr=1e-2, 48 | entr_inc=0.0, 49 | action_num=np.max(env.get_input_shape_of_act()), 50 | q_func=True, 51 | train_episodes_num=int(5e3), 52 | replay=True, 53 | replay_buffer_size=1e4, 54 | replay_warmup=0, 55 | cuda=True, 56 | grad_clip=True, 57 | save_model_freq=10, 58 | target=True, 59 | target_lr=1e-1, 60 | behaviour_update_freq=100, 61 | critic_update_times=10, 62 | target_update_freq=200, 63 | gumbel_softmax=True, 64 | epsilon_softmax=False, 65 | online=True, 66 | reward_record_type='episode_mean_step', 67 | shared_parameters=False 68 | ) 69 | 70 | args = MergeArgs(*(args+aux_args)) 71 | 72 | log_name = scenario_name + '_' + model_name + alias 73 | -------------------------------------------------------------------------------- /args/simple_tag_coma_fc.py: -------------------------------------------------------------------------------- 1 | from collections import namedtuple 2 | from multiagent.environment import MultiAgentEnv 3 | import multiagent.scenarios as scenario 4 | from utilities.gym_wrapper import * 5 | import numpy as np 6 | from aux import * 7 | 8 | 9 | '''define the model name''' 10 | model_name = 'coma_fc' 11 | 12 | '''define the scenario name''' 13 | scenario_name = 'simple_tag' 14 | 15 | '''define the special property''' 16 | # independentArgs = namedtuple( 'independentArgs', [] ) 17 | aux_args = AuxArgs[model_name]() 18 | alias = '' 19 | 20 | '''load scenario from script''' 21 | scenario = scenario.load(scenario_name+".py").Scenario() 22 | 23 | '''create world''' 24 | world = scenario.make_world() 25 | 26 | '''create multiagent environment''' 27 | env = MultiAgentEnv(world, scenario.reset_world, scenario.reward, scenario.observation, info_callback=None, shared_viewer=True,done_callback=scenario.episode_over) 28 | env = GymWrapper(env) 29 | 30 | MergeArgs = namedtuple('MergeArgs', Args._fields+AuxArgs[model_name]._fields) 31 | 32 | # under offline trainer if set batch_size=replay_buffer_size=update_freq -> epoch update 33 | args = Args(model_name=model_name, 34 | agent_num=env.get_num_of_agents(), 35 | hid_size=128, 36 | obs_size=np.max(env.get_shape_of_obs()), 37 | continuous=False, 38 | action_dim=np.max(env.get_output_shape_of_act()), 39 | init_std=0.1, 40 | policy_lrate=1e-3, 41 | value_lrate=1e-4, 42 | max_steps=200, 43 | batch_size=100, 44 | gamma=0.99, 45 | normalize_advantages=False, 46 | entr=1e-3, 47 | entr_inc=0.0, 48 | action_num=np.max(env.get_input_shape_of_act()), 49 | 
q_func=True, 50 | train_episodes_num=int(5e3), 51 | replay=True, 52 | replay_buffer_size=1e4, 53 | replay_warmup=0, 54 | cuda=True, 55 | grad_clip=True, 56 | save_model_freq=10, 57 | target=True, 58 | target_lr=1e-1, 59 | behaviour_update_freq=100, 60 | critic_update_times=10, 61 | target_update_freq=200, 62 | gumbel_softmax=False, 63 | epsilon_softmax=False, 64 | online=True, 65 | reward_record_type='episode_mean_step', 66 | shared_parameters=False 67 | ) 68 | 69 | args = MergeArgs(*(args+aux_args)) 70 | 71 | log_name = scenario_name + '_' + model_name + alias 72 | -------------------------------------------------------------------------------- /args/simple_tag_independent_ac.py: -------------------------------------------------------------------------------- 1 | from collections import namedtuple 2 | from multiagent.environment import MultiAgentEnv 3 | import multiagent.scenarios as scenario 4 | from utilities.gym_wrapper import * 5 | import numpy as np 6 | from aux import * 7 | 8 | 9 | '''define the model name''' 10 | model_name = 'independent_ac' 11 | 12 | '''define the scenario name''' 13 | scenario_name = 'simple_tag' 14 | 15 | '''define the special property''' 16 | # independentArgs = namedtuple( 'independentArgs', [] ) 17 | aux_args = AuxArgs[model_name]() 18 | alias = '' 19 | 20 | '''load scenario from script''' 21 | scenario = scenario.load(scenario_name+".py").Scenario() 22 | 23 | '''create world''' 24 | world = scenario.make_world() 25 | 26 | '''create multiagent environment''' 27 | env = MultiAgentEnv(world, scenario.reset_world, scenario.reward, scenario.observation, info_callback=None, shared_viewer=True,done_callback=scenario.episode_over) 28 | env = GymWrapper(env) 29 | 30 | MergeArgs = namedtuple('MergeArgs', Args._fields+AuxArgs[model_name]._fields) 31 | 32 | # under offline trainer if set batch_size=replay_buffer_size=update_freq -> epoch update 33 | args = Args(model_name=model_name, 34 | agent_num=env.get_num_of_agents(), 35 | hid_size=128, 36 | obs_size=np.max(env.get_shape_of_obs()), 37 | continuous=False, 38 | action_dim=np.max(env.get_output_shape_of_act()), 39 | init_std=0.1, 40 | policy_lrate=1e-3, 41 | value_lrate=1e-4, 42 | max_steps=200, 43 | batch_size=100, 44 | gamma=0.99, 45 | normalize_advantages=False, 46 | entr=1e-3, 47 | entr_inc=0.0, 48 | action_num=np.max(env.get_input_shape_of_act()), 49 | q_func=True, 50 | train_episodes_num=int(5e3), 51 | replay=True, 52 | replay_buffer_size=1e4, 53 | replay_warmup=0, 54 | cuda=True, 55 | grad_clip=True, 56 | save_model_freq=10, 57 | target=True, 58 | target_lr=1e-1, 59 | behaviour_update_freq=100, 60 | critic_update_times=10, 61 | target_update_freq=200, 62 | gumbel_softmax=False, 63 | epsilon_softmax=False, 64 | online=True, 65 | reward_record_type='episode_mean_step', 66 | shared_parameters=False 67 | ) 68 | 69 | args = MergeArgs(*(args+aux_args)) 70 | 71 | log_name = scenario_name + '_' + model_name + alias 72 | -------------------------------------------------------------------------------- /args/simple_tag_independent_ddpg.py: -------------------------------------------------------------------------------- 1 | from collections import namedtuple 2 | from multiagent.environment import MultiAgentEnv 3 | import multiagent.scenarios as scenario 4 | from utilities.gym_wrapper import * 5 | import numpy as np 6 | from aux import * 7 | 8 | 9 | 10 | '''define the model name''' 11 | model_name = 'independent_ddpg' 12 | 13 | '''define the scenario name''' 14 | scenario_name = 'simple_tag' 15 | 16 | '''define the special 
property''' 17 | # independentArgs = namedtuple( 'independentArgs', [] ) 18 | aux_args = AuxArgs[model_name]() 19 | alias = '' 20 | 21 | '''load scenario from script''' 22 | scenario = scenario.load(scenario_name+".py").Scenario() 23 | 24 | '''create world''' 25 | world = scenario.make_world() 26 | 27 | '''create multiagent environment''' 28 | env = MultiAgentEnv(world, scenario.reset_world, scenario.reward, scenario.observation, info_callback=None, shared_viewer=True,done_callback=scenario.episode_over) 29 | env = GymWrapper(env) 30 | 31 | MergeArgs = namedtuple('MergeArgs', Args._fields+AuxArgs[model_name]._fields) 32 | 33 | # under offline trainer if set batch_size=replay_buffer_size=update_freq -> epoch update 34 | args = Args(model_name=model_name, 35 | agent_num=env.get_num_of_agents(), 36 | hid_size=128, 37 | obs_size=np.max(env.get_shape_of_obs()), 38 | continuous=False, 39 | action_dim=np.max(env.get_output_shape_of_act()), 40 | init_std=0.1, 41 | policy_lrate=1e-4, 42 | value_lrate=5e-4, 43 | max_steps=200, 44 | batch_size=128, 45 | gamma=0.99, 46 | normalize_advantages=False, 47 | entr=1e-3, 48 | entr_inc=0.0, 49 | action_num=np.max(env.get_input_shape_of_act()), 50 | q_func=False, 51 | train_episodes_num=int(5e3), 52 | replay=True, 53 | replay_buffer_size=1e4, 54 | replay_warmup=0, 55 | cuda=True, 56 | grad_clip=True, 57 | save_model_freq=10, 58 | target=True, 59 | target_lr=1e-1, 60 | behaviour_update_freq=100, 61 | critic_update_times=10, 62 | target_update_freq=200, 63 | gumbel_softmax=True, 64 | epsilon_softmax=False, 65 | online=True, 66 | reward_record_type='episode_mean_step', 67 | shared_parameters=False 68 | ) 69 | 70 | args = MergeArgs(*(args+aux_args)) 71 | 72 | log_name = scenario_name + '_' + model_name + alias 73 | -------------------------------------------------------------------------------- /args/simple_tag_maddpg.py: -------------------------------------------------------------------------------- 1 | from collections import namedtuple 2 | from multiagent.environment import MultiAgentEnv 3 | import multiagent.scenarios as scenario 4 | from utilities.gym_wrapper import * 5 | import numpy as np 6 | from aux import * 7 | 8 | 9 | 10 | '''define the model name''' 11 | model_name = 'maddpg' 12 | 13 | '''define the scenario name''' 14 | scenario_name = 'simple_tag' 15 | 16 | '''define the special property''' 17 | # maddpgArgs = namedtuple( 'maddpgArgs', [] ) 18 | aux_args = AuxArgs[model_name]() 19 | alias = '' 20 | 21 | '''load scenario from script''' 22 | scenario = scenario.load(scenario_name+".py").Scenario() 23 | 24 | '''create world''' 25 | world = scenario.make_world() 26 | 27 | '''create multiagent environment''' 28 | env = MultiAgentEnv(world, scenario.reset_world, scenario.reward, scenario.observation, info_callback=None, shared_viewer=True,done_callback=scenario.episode_over) 29 | env = GymWrapper(env) 30 | 31 | MergeArgs = namedtuple('MergeArgs', Args._fields+AuxArgs[model_name]._fields) 32 | 33 | # under offline trainer if set batch_size=replay_buffer_size=update_freq -> epoch update 34 | args = Args(model_name=model_name, 35 | agent_num=env.get_num_of_agents(), 36 | hid_size=128, 37 | obs_size=np.max(env.get_shape_of_obs()), 38 | continuous=False, 39 | action_dim=np.max(env.get_output_shape_of_act()), 40 | init_std=0.1, 41 | policy_lrate=1e-4, 42 | value_lrate=5e-4, 43 | max_steps=200, 44 | batch_size=128, 45 | gamma=0.99, 46 | normalize_advantages=False, 47 | entr=1e-3, 48 | entr_inc=0.0, 49 | action_num=np.max(env.get_input_shape_of_act()), 50 | 
q_func=True, 51 | train_episodes_num=int(5e3), 52 | replay=True, 53 | replay_buffer_size=1e4, 54 | replay_warmup=0, 55 | cuda=True, 56 | grad_clip=True, 57 | save_model_freq=10, 58 | target=True, 59 | target_lr=1e-1, 60 | behaviour_update_freq=100, 61 | critic_update_times=10, 62 | target_update_freq=200, 63 | gumbel_softmax=True, 64 | epsilon_softmax=False, 65 | online=True, 66 | reward_record_type='episode_mean_step', 67 | shared_parameters=False 68 | ) 69 | 70 | args = MergeArgs(*(args+aux_args)) 71 | 72 | log_name = scenario_name + '_' + model_name + alias 73 | -------------------------------------------------------------------------------- /args/simple_tag_sqddpg.py: -------------------------------------------------------------------------------- 1 | from collections import namedtuple 2 | from multiagent.environment import MultiAgentEnv 3 | import multiagent.scenarios as scenario 4 | from utilities.gym_wrapper import * 5 | import numpy as np 6 | from aux import * 7 | 8 | 9 | 10 | '''define the model name''' 11 | model_name = 'sqddpg' 12 | 13 | '''define the scenario name''' 14 | scenario_name = 'simple_tag' 15 | 16 | '''define the special property''' 17 | # sqddpgArgs = namedtuple( 'sqddpgArgs', ['sample_size'] ) 18 | aux_args = AuxArgs[model_name](1) 19 | alias = '' 20 | 21 | '''load scenario from script''' 22 | scenario = scenario.load(scenario_name+".py").Scenario() 23 | 24 | '''create world''' 25 | world = scenario.make_world() 26 | 27 | '''create multiagent environment''' 28 | env = MultiAgentEnv(world, scenario.reset_world, scenario.reward, scenario.observation, info_callback=None, shared_viewer=True,done_callback=scenario.episode_over) 29 | env = GymWrapper(env) 30 | 31 | MergeArgs = namedtuple('MergeArgs', Args._fields+AuxArgs[model_name]._fields) 32 | 33 | # under offline trainer if set batch_size=replay_buffer_size=update_freq -> epoch update 34 | args = Args(model_name=model_name, 35 | agent_num=env.get_num_of_agents(), 36 | hid_size=128, 37 | obs_size=np.max(env.get_shape_of_obs()), 38 | continuous=False, 39 | action_dim=np.max(env.get_output_shape_of_act()), 40 | init_std=0.1, 41 | policy_lrate=1e-4, 42 | value_lrate=5e-4, 43 | max_steps=200, 44 | batch_size=128, 45 | gamma=0.99, 46 | normalize_advantages=False, 47 | entr=1e-3, 48 | entr_inc=0.0, 49 | action_num=np.max(env.get_input_shape_of_act()), 50 | q_func=True, 51 | train_episodes_num=int(5e3), 52 | replay=True, 53 | replay_buffer_size=1e4, 54 | replay_warmup=0, 55 | cuda=True, 56 | grad_clip=True, 57 | save_model_freq=10, 58 | target=True, 59 | target_lr=1e-1, 60 | behaviour_update_freq=100, 61 | critic_update_times=10, 62 | target_update_freq=200, 63 | gumbel_softmax=True, 64 | epsilon_softmax=False, 65 | online=True, 66 | reward_record_type='episode_mean_step', 67 | shared_parameters=False 68 | ) 69 | 70 | args = MergeArgs(*(args+aux_args)) 71 | 72 | log_name = scenario_name + '_' + model_name + alias 73 | -------------------------------------------------------------------------------- /args/traffic_junction_coma_fc.py: -------------------------------------------------------------------------------- 1 | from collections import namedtuple 2 | from utilities.gym_wrapper import * 3 | import numpy as np 4 | from models.coma import * 5 | from aux import * 6 | from environments.traffic_junction_env import TrafficJunctionEnv 7 | 8 | 9 | 10 | '''define the model name''' 11 | model_name = 'coma_fc' 12 | 13 | '''define the special property''' 14 | # independentArgs = namedtuple( 'independentArgs', [] ) 15 | aux_args = 
AuxArgs[model_name]() 16 | alias = '_medium' 17 | 18 | '''define the scenario name''' 19 | scenario_name = 'traffic_junction' 20 | 21 | '''define the environment''' 22 | env = TrafficJunctionEnv() 23 | env = GymWrapper(env) 24 | 25 | MergeArgs = namedtuple('MergeArgs', Args._fields+AuxArgs[model_name]._fields) 26 | 27 | # under offline trainer if set batch_size=replay_buffer_size=update_freq -> epoch update 28 | args = Args(model_name=model_name, 29 | agent_num=env.get_num_of_agents(), 30 | hid_size=128, 31 | obs_size=np.max(env.get_shape_of_obs()), 32 | continuous=False, 33 | action_dim=np.max(env.get_output_shape_of_act()), 34 | init_std=0.1, 35 | policy_lrate=1e-4, 36 | value_lrate=1e-3, 37 | max_steps=50, 38 | batch_size=2, 39 | gamma=0.99, 40 | normalize_advantages=False, 41 | entr=1e-4, 42 | entr_inc=0.0, 43 | action_num=np.max(env.get_input_shape_of_act()), 44 | q_func=True, 45 | train_episodes_num=int(5e3), 46 | replay=True, 47 | replay_buffer_size=2, 48 | replay_warmup=0, 49 | cuda=True, 50 | grad_clip=True, 51 | save_model_freq=100, 52 | target=True, 53 | target_lr=1e-1, 54 | behaviour_update_freq=2, 55 | critic_update_times=10, 56 | target_update_freq=2, 57 | gumbel_softmax=False, 58 | epsilon_softmax=True, 59 | online=False, 60 | reward_record_type='episode_mean_step', 61 | shared_parameters=False 62 | ) 63 | 64 | args = MergeArgs(*(args+aux_args)) 65 | 66 | log_name = scenario_name + '_' + model_name + alias 67 | -------------------------------------------------------------------------------- /args/traffic_junction_independent_ac.py: -------------------------------------------------------------------------------- 1 | from collections import namedtuple 2 | from utilities.gym_wrapper import * 3 | import numpy as np 4 | from aux import * 5 | from environments.traffic_junction_env import TrafficJunctionEnv 6 | 7 | 8 | 9 | '''define the model name''' 10 | model_name = 'independent_ac' 11 | 12 | '''define the special property''' 13 | # independentArgs = namedtuple( 'independentArgs', [] ) 14 | aux_args = AuxArgs[model_name]() 15 | alias = '_medium' 16 | 17 | '''define the scenario name''' 18 | scenario_name = 'traffic_junction' 19 | 20 | '''define the environment''' 21 | env = TrafficJunctionEnv() 22 | env = GymWrapper(env) 23 | 24 | MergeArgs = namedtuple('MergeArgs', Args._fields+AuxArgs[model_name]._fields) 25 | 26 | # under offline trainer if set batch_size=replay_buffer_size=update_freq -> epoch update 27 | args = Args(model_name=model_name, 28 | agent_num=env.get_num_of_agents(), 29 | hid_size=128, 30 | obs_size=np.max(env.get_shape_of_obs()), 31 | continuous=False, 32 | action_dim=np.max(env.get_output_shape_of_act()), 33 | init_std=0.1, 34 | policy_lrate=1e-4, 35 | value_lrate=1e-3, 36 | max_steps=50, 37 | batch_size=64, 38 | gamma=0.99, 39 | normalize_advantages=False, 40 | entr=1e-4, 41 | entr_inc=0.0, 42 | action_num=np.max(env.get_input_shape_of_act()), 43 | q_func=True, 44 | train_episodes_num=int(5e3), 45 | replay=True, 46 | replay_buffer_size=100, 47 | replay_warmup=0, 48 | cuda=True, 49 | grad_clip=True, 50 | save_model_freq=100, 51 | target=True, 52 | target_lr=1.0, 53 | behaviour_update_freq=25, 54 | critic_update_times=10, 55 | target_update_freq=50, 56 | gumbel_softmax=False, 57 | epsilon_softmax=False, 58 | online=True, 59 | reward_record_type='episode_mean_step', 60 | shared_parameters=False 61 | ) 62 | 63 | args = MergeArgs(*(args+aux_args)) 64 | 65 | log_name = scenario_name + '_' + model_name + alias 66 | 
-------------------------------------------------------------------------------- /args/traffic_junction_independent_ddpg.py: -------------------------------------------------------------------------------- 1 | from collections import namedtuple 2 | from utilities.gym_wrapper import * 3 | import numpy as np 4 | from aux import * 5 | from environments.traffic_junction_env import TrafficJunctionEnv 6 | 7 | 8 | 9 | '''define the model name''' 10 | model_name = 'independent_ddpg' 11 | 12 | '''define the special property''' 13 | # independentArgs = namedtuple( 'independentArgs', [] ) 14 | aux_args = AuxArgs[model_name]() 15 | alias = '_medium' 16 | 17 | '''define the scenario name''' 18 | scenario_name = 'traffic_junction' 19 | 20 | '''define the environment''' 21 | env = TrafficJunctionEnv() 22 | env = GymWrapper(env) 23 | 24 | MergeArgs = namedtuple('MergeArgs', Args._fields+AuxArgs[model_name]._fields) 25 | 26 | # under offline trainer if set batch_size=replay_buffer_size=update_freq -> epoch update 27 | args = Args(model_name=model_name, 28 | agent_num=env.get_num_of_agents(), 29 | hid_size=128, 30 | obs_size=np.max(env.get_shape_of_obs()), 31 | continuous=False, 32 | action_dim=np.max(env.get_output_shape_of_act()), 33 | init_std=0.1, 34 | policy_lrate=1e-4, 35 | value_lrate=1e-3, 36 | max_steps=50, 37 | batch_size=64, 38 | gamma=0.99, 39 | normalize_advantages=False, 40 | entr=1e-4, 41 | entr_inc=0.0, 42 | action_num=np.max(env.get_input_shape_of_act()), 43 | q_func=True, 44 | train_episodes_num=int(5e3), 45 | replay=True, 46 | replay_buffer_size=1e4, 47 | replay_warmup=0, 48 | cuda=True, 49 | grad_clip=True, 50 | save_model_freq=100, 51 | target=True, 52 | target_lr=1e-1, 53 | behaviour_update_freq=25, 54 | critic_update_times=10, 55 | target_update_freq=50, 56 | gumbel_softmax=True, 57 | epsilon_softmax=False, 58 | online=True, 59 | reward_record_type='episode_mean_step', 60 | shared_parameters=False 61 | ) 62 | 63 | args = MergeArgs(*(args+aux_args)) 64 | 65 | log_name = scenario_name + '_' + model_name + alias 66 | -------------------------------------------------------------------------------- /args/traffic_junction_maddpg.py: -------------------------------------------------------------------------------- 1 | from collections import namedtuple 2 | from utilities.gym_wrapper import * 3 | import numpy as np 4 | from aux import * 5 | from environments.traffic_junction_env import TrafficJunctionEnv 6 | 7 | 8 | 9 | '''define the model name''' 10 | model_name = 'maddpg' 11 | 12 | '''define the special property''' 13 | # maddpgArgs = namedtuple( 'maddpgArgs', [] ) 14 | aux_args = AuxArgs[model_name]() # maddpg 15 | alias = '_medium' 16 | 17 | '''define the scenario name''' 18 | scenario_name = 'traffic_junction' 19 | 20 | '''define the environment''' 21 | env = TrafficJunctionEnv() 22 | env = GymWrapper(env) 23 | 24 | MergeArgs = namedtuple('MergeArgs', Args._fields+AuxArgs[model_name]._fields) 25 | 26 | # under offline trainer if set batch_size=replay_buffer_size=update_freq -> epoch update 27 | args = Args(model_name=model_name, 28 | agent_num=env.get_num_of_agents(), 29 | hid_size=128, 30 | obs_size=np.max(env.get_shape_of_obs()), 31 | continuous=False, 32 | action_dim=np.max(env.get_output_shape_of_act()), 33 | init_std=0.1, 34 | policy_lrate=1e-4, 35 | value_lrate=1e-3, 36 | max_steps=50, 37 | batch_size=64, 38 | gamma=0.99, 39 | normalize_advantages=False, 40 | entr=1e-4, 41 | entr_inc=0.0, 42 | action_num=np.max(env.get_input_shape_of_act()), 43 | q_func=True, 44 | 
train_episodes_num=int(5e3), 45 | replay=True, 46 | replay_buffer_size=1e4, 47 | replay_warmup=0, 48 | cuda=True, 49 | grad_clip=True, 50 | save_model_freq=100, 51 | target=True, 52 | target_lr=1e-1, 53 | behaviour_update_freq=25, 54 | critic_update_times=10, 55 | target_update_freq=50, 56 | gumbel_softmax=True, 57 | epsilon_softmax=False, 58 | online=True, 59 | reward_record_type='episode_mean_step', 60 | shared_parameters=False 61 | ) 62 | 63 | args = MergeArgs(*(args+aux_args)) 64 | 65 | log_name = scenario_name + '_' + model_name + alias 66 | -------------------------------------------------------------------------------- /args/traffic_junction_sqddpg.py: -------------------------------------------------------------------------------- 1 | from collections import namedtuple 2 | from utilities.gym_wrapper import * 3 | import numpy as np 4 | from aux import * 5 | from environments.traffic_junction_env import TrafficJunctionEnv 6 | 7 | 8 | 9 | '''define the model name''' 10 | model_name = 'sqddpg' 11 | 12 | '''define the special property''' 13 | # sqddpgArgs = namedtuple( 'sqddpgArgs', ['sample_size'] ) 14 | aux_args = AuxArgs[model_name](1) # sqddpg 15 | alias = '_medium' 16 | 17 | '''define the scenario name''' 18 | scenario_name = 'traffic_junction' 19 | 20 | '''define the environment''' 21 | env = TrafficJunctionEnv() 22 | env = GymWrapper(env) 23 | 24 | MergeArgs = namedtuple('MergeArgs', Args._fields+AuxArgs[model_name]._fields) 25 | 26 | # under offline trainer if set batch_size=replay_buffer_size=update_freq -> epoch update 27 | args = Args(model_name=model_name, 28 | agent_num=env.get_num_of_agents(), 29 | hid_size=128, 30 | obs_size=np.max(env.get_shape_of_obs()), 31 | continuous=False, 32 | action_dim=np.max(env.get_output_shape_of_act()), 33 | init_std=0.1, 34 | policy_lrate=1e-4, 35 | value_lrate=1e-3, 36 | max_steps=50, 37 | batch_size=32, 38 | gamma=0.99, 39 | normalize_advantages=False, 40 | entr=1e-4, 41 | entr_inc=0.0, 42 | action_num=np.max(env.get_input_shape_of_act()), 43 | q_func=True, 44 | train_episodes_num=int(5e3), 45 | replay=True, 46 | replay_buffer_size=1e4, 47 | replay_warmup=0, 48 | cuda=True, 49 | grad_clip=True, 50 | save_model_freq=100, 51 | target=True, 52 | target_lr=0.1, 53 | behaviour_update_freq=25, 54 | critic_update_times=10, 55 | target_update_freq=50, 56 | gumbel_softmax=True, 57 | epsilon_softmax=False, 58 | online=True, 59 | reward_record_type='episode_mean_step', 60 | shared_parameters=False 61 | ) 62 | 63 | args = MergeArgs(*(args+aux_args)) 64 | 65 | log_name = scenario_name + '_' + model_name + alias 66 | -------------------------------------------------------------------------------- /aux.py: -------------------------------------------------------------------------------- 1 | from collections import namedtuple 2 | from models.maddpg import * 3 | from models.sqddpg import * 4 | from models.independent_ac import * 5 | from models.independent_ddpg import * 6 | from models.coma_fc import * 7 | 8 | 9 | 10 | maddpgArgs = namedtuple( 'maddpgArgs', [] ) 11 | 12 | randomArgs = namedtuple( 'randomArgs', [] ) 13 | 14 | sqddpgArgs = namedtuple( 'sqddpgArgs', ['sample_size'] ) 15 | 16 | independentArgs = namedtuple( 'independentArgs', [] ) 17 | 18 | comafcArgs = namedtuple( 'comafcArgs', [] ) 19 | 20 | 21 | 22 | Model = dict(maddpg=MADDPG, 23 | sqddpg=SQDDPG, 24 | independent_ac=IndependentAC, 25 | independent_ddpg=IndependentDDPG, 26 | coma_fc=COMAFC 27 | ) 28 | 29 | 30 | 31 | AuxArgs = dict(maddpg=maddpgArgs, 32 | sqddpg=sqddpgArgs, 33 | 
independent_ac=independentArgs, 34 | independent_ddpg=independentArgs, 35 | coma_fc=comafcArgs 36 | ) 37 | 38 | 39 | 40 | Strategy=dict(maddpg='pg', 41 | sqddpg='pg', 42 | independent_ac='pg', 43 | independent_ddpg='pg', 44 | coma_fc='pg' 45 | ) 46 | 47 | 48 | 49 | Args = namedtuple('Args', ['model_name', 50 | 'agent_num', 51 | 'hid_size', 52 | 'obs_size', 53 | 'continuous', 54 | 'action_dim', 55 | 'init_std', 56 | 'policy_lrate', 57 | 'value_lrate', 58 | 'max_steps', 59 | 'batch_size', # steps<-online/episodes<-offline 60 | 'gamma', 61 | 'normalize_advantages', 62 | 'entr', 63 | 'entr_inc', 64 | 'action_num', 65 | 'q_func', 66 | 'train_episodes_num', 67 | 'replay', 68 | 'replay_buffer_size', 69 | 'replay_warmup', 70 | 'cuda', 71 | 'grad_clip', 72 | 'save_model_freq', # episodes 73 | 'target', 74 | 'target_lr', 75 | 'behaviour_update_freq', # steps<-online/episodes<-offline 76 | 'critic_update_times', 77 | 'target_update_freq', # steps<-online/episodes<-offline 78 | 'gumbel_softmax', 79 | 'epsilon_softmax', 80 | 'online', 81 | 'reward_record_type', 82 | 'shared_parameters' # boolean 83 | ] 84 | ) 85 | -------------------------------------------------------------------------------- /environments/multiagent_particle_envs/.gitignore: -------------------------------------------------------------------------------- 1 | __pycache__/ 2 | *.egg-info/ 3 | *.pyc -------------------------------------------------------------------------------- /environments/multiagent_particle_envs/LICENSE.txt: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2018 OpenAI 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /environments/multiagent_particle_envs/README.md: -------------------------------------------------------------------------------- 1 | **Status:** Archive (code is provided as-is, no updates expected) 2 | 3 | # Multi-Agent Particle Environment 4 | 5 | A simple multi-agent particle world with a continuous observation and discrete action space, along with some basic simulated physics. 6 | Used in the paper [Multi-Agent Actor-Critic for Mixed Cooperative-Competitive Environments](https://arxiv.org/pdf/1706.02275.pdf). 
7 | 8 | ## Getting started: 9 | 10 | - To install, `cd` into the root directory and type `pip install -e .` 11 | 12 | - To interactively view moving to landmark scenario (see others in ./scenarios/): 13 | `bin/interactive.py --scenario simple.py` 14 | 15 | - Known dependencies: Python (3.5.4), OpenAI gym (0.10.5), numpy (1.14.5) 16 | 17 | - To use the environments, look at the code for importing them in `make_env.py`. 18 | 19 | ## Code structure 20 | 21 | - `make_env.py`: contains code for importing a multiagent environment as an OpenAI Gym-like object. 22 | 23 | - `./multiagent/environment.py`: contains code for environment simulation (interaction physics, `_step()` function, etc.) 24 | 25 | - `./multiagent/core.py`: contains classes for various objects (Entities, Landmarks, Agents, etc.) that are used throughout the code. 26 | 27 | - `./multiagent/rendering.py`: used for displaying agent behaviors on the screen. 28 | 29 | - `./multiagent/policy.py`: contains code for interactive policy based on keyboard input. 30 | 31 | - `./multiagent/scenario.py`: contains base scenario object that is extended for all scenarios. 32 | 33 | - `./multiagent/scenarios/`: folder where various scenarios/ environments are stored. scenario code consists of several functions: 34 | 1) `make_world()`: creates all of the entities that inhabit the world (landmarks, agents, etc.), assigns their capabilities (whether they can communicate, or move, or both). 35 | called once at the beginning of each training session 36 | 2) `reset_world()`: resets the world by assigning properties (position, color, etc.) to all entities in the world 37 | called before every episode (including after make_world() before the first episode) 38 | 3) `reward()`: defines the reward function for a given agent 39 | 4) `observation()`: defines the observation space of a given agent 40 | 5) (optional) `benchmark_data()`: provides diagnostic data for policies trained on the environment (e.g. evaluation metrics) 41 | 42 | ### Creating new environments 43 | 44 | You can create new scenarios by implementing the first 4 functions above (`make_world()`, `reset_world()`, `reward()`, and `observation()`). 45 | 46 | ## List of environments 47 | 48 | 49 | | Env name in code (name in paper) | Communication? | Competitive? | Notes | 50 | | --- | --- | --- | --- | 51 | | `simple.py` | N | N | Single agent sees landmark position, rewarded based on how close it gets to landmark. Not a multiagent environment -- used for debugging policies. | 52 | | `simple_adversary.py` (Physical deception) | N | Y | 1 adversary (red), N good agents (green), N landmarks (usually N=2). All agents observe position of landmarks and other agents. One landmark is the ‘target landmark’ (colored green). Good agents rewarded based on how close one of them is to the target landmark, but negatively rewarded if the adversary is close to target landmark. Adversary is rewarded based on how close it is to the target, but it doesn’t know which landmark is the target landmark. So good agents have to learn to ‘split up’ and cover all landmarks to deceive the adversary. | 53 | | `simple_crypto.py` (Covert communication) | Y | Y | Two good agents (alice and bob), one adversary (eve). Alice must sent a private message to bob over a public channel. Alice and bob are rewarded based on how well bob reconstructs the message, but negatively rewarded if eve can reconstruct the message. 
Alice and bob have a private key (randomly generated at beginning of each episode), which they must learn to use to encrypt the message. | 54 | | `simple_push.py` (Keep-away) | N |Y | 1 agent, 1 adversary, 1 landmark. Agent is rewarded based on distance to landmark. Adversary is rewarded if it is close to the landmark, and if the agent is far from the landmark. So the adversary learns to push agent away from the landmark. | 55 | | `simple_reference.py` | Y | N | 2 agents, 3 landmarks of different colors. Each agent wants to get to their target landmark, which is known only by other agent. Reward is collective. So agents have to learn to communicate the goal of the other agent, and navigate to their landmark. This is the same as the simple_speaker_listener scenario where both agents are simultaneous speakers and listeners. | 56 | | `simple_speaker_listener.py` (Cooperative communication) | Y | N | Same as simple_reference, except one agent is the ‘speaker’ (gray) that does not move (observes goal of other agent), and other agent is the listener (cannot speak, but must navigate to correct landmark).| 57 | | `simple_spread.py` (Cooperative navigation) | N | N | N agents, N landmarks. Agents are rewarded based on how far any agent is from each landmark. Agents are penalized if they collide with other agents. So, agents have to learn to cover all the landmarks while avoiding collisions. | 58 | | `simple_tag.py` (Predator-prey) | N | Y | Predator-prey environment. Good agents (green) are faster and want to avoid being hit by adversaries (red). Adversaries are slower and want to hit good agents. Obstacles (large black circles) block the way. | 59 | | `simple_world_comm.py` | Y | Y | Environment seen in the video accompanying the paper. Same as simple_tag, except (1) there is food (small blue balls) that the good agents are rewarded for being near, (2) we now have ‘forests’ that hide agents inside from being seen from outside; (3) there is a ‘leader adversary” that can see the agents at all times, and can communicate with the other adversaries to help coordinate the chase. | 60 | 61 | ## Paper citation 62 | 63 | If you used this environment for your experiments or found it helpful, consider citing the following papers: 64 | 65 | Environments in this repo: 66 |
@article{lowe2017multi,
  title={Multi-Agent Actor-Critic for Mixed Cooperative-Competitive Environments},
  author={Lowe, Ryan and Wu, Yi and Tamar, Aviv and Harb, Jean and Abbeel, Pieter and Mordatch, Igor},
  journal={Neural Information Processing Systems (NIPS)},
  year={2017}
}

Original particle world environment:

@article{mordatch2017emergence,
  title={Emergence of Grounded Compositional Language in Multi-Agent Populations},
  author={Mordatch, Igor and Abbeel, Pieter},
  journal={arXiv preprint arXiv:1703.04908},
  year={2017}
}
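To complement the "Creating new environments" notes above, here is a minimal scenario sketch built from the classes in `./multiagent/core.py` and `./multiagent/scenario.py`. The file name, entity counts, colors and reward are invented for illustration; the scenarios shipped in `./multiagent/scenarios/` (e.g. `simple.py`) follow the same pattern and are the authoritative reference.
```python
# hypothetical ./multiagent/scenarios/my_scenario.py
import numpy as np

from multiagent.core import World, Agent, Landmark
from multiagent.scenario import BaseScenario


class Scenario(BaseScenario):
    def make_world(self):
        world = World()
        world.agents = [Agent() for _ in range(2)]
        world.landmarks = [Landmark() for _ in range(1)]
        for i, agent in enumerate(world.agents):
            agent.name = 'agent %d' % i
            agent.silent = True           # no communication channel needed here
        for i, landmark in enumerate(world.landmarks):
            landmark.name = 'landmark %d' % i
            landmark.collide = False
        self.reset_world(world)
        return world

    def reset_world(self, world):
        for entity in world.entities:
            entity.color = np.array([0.5, 0.5, 0.5])
            entity.state.p_pos = np.random.uniform(-1.0, +1.0, world.dim_p)
            entity.state.p_vel = np.zeros(world.dim_p)
        for agent in world.agents:
            agent.state.c = np.zeros(world.dim_c)

    def reward(self, agent, world):
        # negative distance to the single landmark
        delta = agent.state.p_pos - world.landmarks[0].state.p_pos
        return -np.sqrt(np.sum(np.square(delta)))

    def observation(self, agent, world):
        # own velocity and position plus the landmark position relative to the agent
        return np.concatenate([agent.state.p_vel,
                               agent.state.p_pos,
                               world.landmarks[0].state.p_pos - agent.state.p_pos])
```
Such a file can then be loaded the same way as the built-in scenarios, e.g. through `make_env.py` or `bin/interactive.py --scenario my_scenario.py`.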
84 | -------------------------------------------------------------------------------- /environments/multiagent_particle_envs/bin/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hsvgbkhgbv/SQDDPG/33eef74b3cbe207a340d65a65d6ff5be34acc86e/environments/multiagent_particle_envs/bin/__init__.py -------------------------------------------------------------------------------- /environments/multiagent_particle_envs/bin/interactive.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | import os,sys 3 | sys.path.insert(1, os.path.join(sys.path[0], '..')) 4 | import argparse 5 | 6 | from multiagent.environment import MultiAgentEnv 7 | from multiagent.policy import InteractivePolicy 8 | import multiagent.scenarios as scenarios 9 | 10 | if __name__ == '__main__': 11 | # parse arguments 12 | parser = argparse.ArgumentParser(description=None) 13 | parser.add_argument('-s', '--scenario', default='simple.py', help='Path of the scenario Python script.') 14 | args = parser.parse_args() 15 | 16 | # load scenario from script 17 | scenario = scenarios.load(args.scenario).Scenario() 18 | # create world 19 | world = scenario.make_world() 20 | # create multiagent environment 21 | env = MultiAgentEnv(world, scenario.reset_world, scenario.reward, scenario.observation, info_callback=None, shared_viewer = False) 22 | # render call to create viewer window (necessary only for interactive policies) 23 | env.render() 24 | # create interactive policies for each agent 25 | policies = [InteractivePolicy(env,i) for i in range(env.n)] 26 | # execution loop 27 | obs_n = env.reset() 28 | while True: 29 | # query for action from each agent's policy 30 | act_n = [] 31 | for i, policy in enumerate(policies): 32 | act_n.append(policy.action(obs_n[i])) 33 | # step environment 34 | obs_n, reward_n, done_n, _ = env.step(act_n) 35 | # render all agent views 36 | env.render() 37 | # display rewards 38 | #for agent in env.world.agents: 39 | # print(agent.name + " reward: %0.3f" % env._get_reward(agent)) 40 | -------------------------------------------------------------------------------- /environments/multiagent_particle_envs/make_env.py: -------------------------------------------------------------------------------- 1 | """ 2 | Code for creating a multiagent environment with one of the scenarios listed 3 | in ./scenarios/. 4 | Can be called by using, for example: 5 | env = make_env('simple_speaker_listener') 6 | After producing the env object, can be used similarly to an OpenAI gym 7 | environment. 8 | 9 | A policy using this environment must output actions in the form of a list 10 | for all agents. Each element of the list should be a numpy array, 11 | of size (env.world.dim_p + env.world.dim_c, 1). Physical actions precede 12 | communication actions in this array. See environment.py for more details. 13 | """ 14 | 15 | def make_env(scenario_name, benchmark=False): 16 | ''' 17 | Creates a MultiAgentEnv object as env. This can be used similar to a gym 18 | environment by calling env.reset() and env.step(). 19 | Use env.render() to view the environment on the screen. 
20 | 21 | Input: 22 | scenario_name : name of the scenario from ./scenarios/ to be Returns 23 | (without the .py extension) 24 | benchmark : whether you want to produce benchmarking data 25 | (usually only done during evaluation) 26 | 27 | Some useful env properties (see environment.py): 28 | .observation_space : Returns the observation space for each agent 29 | .action_space : Returns the action space for each agent 30 | .n : Returns the number of Agents 31 | ''' 32 | from multiagent.environment import MultiAgentEnv 33 | import multiagent.scenarios as scenarios 34 | 35 | # load scenario from script 36 | scenario = scenarios.load(scenario_name + ".py").Scenario() 37 | # create world 38 | world = scenario.make_world() 39 | # create multiagent environment 40 | if benchmark: 41 | env = MultiAgentEnv(world, scenario.reset_world, scenario.reward, scenario.observation, scenario.benchmark_data) 42 | else: 43 | env = MultiAgentEnv(world, scenario.reset_world, scenario.reward, scenario.observation) 44 | return env 45 | -------------------------------------------------------------------------------- /environments/multiagent_particle_envs/multiagent/__init__.py: -------------------------------------------------------------------------------- 1 | from gym.envs.registration import register 2 | 3 | # Multiagent envs 4 | # ---------------------------------------- 5 | 6 | register( 7 | id='MultiagentSimple-v0', 8 | entry_point='multiagent.envs:SimpleEnv', 9 | # FIXME(cathywu) currently has to be exactly max_path_length parameters in 10 | # rllab run script 11 | max_episode_steps=100, 12 | ) 13 | 14 | register( 15 | id='MultiagentSimpleSpeakerListener-v0', 16 | entry_point='multiagent.envs:SimpleSpeakerListenerEnv', 17 | max_episode_steps=100, 18 | ) 19 | -------------------------------------------------------------------------------- /environments/multiagent_particle_envs/multiagent/core.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | 3 | # physical/external base state of all entites 4 | class EntityState(object): 5 | def __init__(self): 6 | # physical position 7 | self.p_pos = None 8 | # physical velocity 9 | self.p_vel = None 10 | 11 | # state of agents (including communication and internal/mental state) 12 | class AgentState(EntityState): 13 | def __init__(self): 14 | super(AgentState, self).__init__() 15 | # communication utterance 16 | self.c = None 17 | 18 | # action of the agent 19 | class Action(object): 20 | def __init__(self): 21 | # physical action 22 | self.u = None 23 | # communication action 24 | self.c = None 25 | 26 | # properties and state of physical world entity 27 | class Entity(object): 28 | def __init__(self): 29 | # name 30 | self.name = '' 31 | # properties: 32 | self.size = 0.050 33 | # entity can move / be pushed 34 | self.movable = False 35 | # entity collides with others 36 | self.collide = True 37 | # material density (affects mass) 38 | self.density = 25.0 39 | # color 40 | self.color = None 41 | # max speed and accel 42 | self.max_speed = None 43 | self.accel = None 44 | # state 45 | self.state = EntityState() 46 | # mass 47 | self.initial_mass = 1.0 48 | 49 | @property 50 | def mass(self): 51 | return self.initial_mass 52 | 53 | # properties of landmark entities 54 | class Landmark(Entity): 55 | def __init__(self): 56 | super(Landmark, self).__init__() 57 | 58 | # properties of agent entities 59 | class Agent(Entity): 60 | def __init__(self): 61 | super(Agent, self).__init__() 62 | # agents are movable by 
default 63 | self.movable = True 64 | # cannot send communication signals 65 | self.silent = False 66 | # cannot observe the world 67 | self.blind = False 68 | # physical motor noise amount 69 | self.u_noise = None 70 | # communication noise amount 71 | self.c_noise = None 72 | # control range 73 | self.u_range = 1.0 74 | # state 75 | self.state = AgentState() 76 | # action 77 | self.action = Action() 78 | # script behavior to execute 79 | self.action_callback = None 80 | 81 | # multi-agent world 82 | class World(object): 83 | def __init__(self): 84 | # list of agents and entities (can change at execution-time!) 85 | self.agents = [] 86 | self.landmarks = [] 87 | # communication channel dimensionality 88 | self.dim_c = 0 89 | # position dimensionality 90 | self.dim_p = 2 91 | # color dimensionality 92 | self.dim_color = 3 93 | # simulation timestep 94 | self.dt = 0.1 95 | # physical damping 96 | self.damping = 0.25 97 | # contact response parameters 98 | self.contact_force = 1e+2 99 | self.contact_margin = 1e-3 100 | 101 | # return all entities in the world 102 | @property 103 | def entities(self): 104 | return self.agents + self.landmarks 105 | 106 | # return all agents controllable by external policies 107 | @property 108 | def policy_agents(self): 109 | return [agent for agent in self.agents if agent.action_callback is None] 110 | 111 | # return all agents controlled by world scripts 112 | @property 113 | def scripted_agents(self): 114 | return [agent for agent in self.agents if agent.action_callback is not None] 115 | 116 | # update state of the world 117 | def step(self): 118 | # set actions for scripted agents 119 | for agent in self.scripted_agents: 120 | agent.action = agent.action_callback(agent, self) 121 | # gather forces applied to entities 122 | p_force = [None] * len(self.entities) 123 | # apply agent physical controls 124 | p_force = self.apply_action_force(p_force) 125 | # apply environment forces 126 | p_force = self.apply_environment_force(p_force) 127 | # integrate physical state 128 | self.integrate_state(p_force) 129 | # update agent state 130 | for agent in self.agents: 131 | self.update_agent_state(agent) 132 | 133 | # gather agent action forces 134 | def apply_action_force(self, p_force): 135 | # set applied forces 136 | for i,agent in enumerate(self.agents): 137 | if agent.movable: 138 | noise = np.random.randn(*agent.action.u.shape) * agent.u_noise if agent.u_noise else 0.0 139 | p_force[i] = agent.action.u + noise 140 | return p_force 141 | 142 | # gather physical forces acting on entities 143 | def apply_environment_force(self, p_force): 144 | # simple (but inefficient) collision response 145 | for a,entity_a in enumerate(self.entities): 146 | for b,entity_b in enumerate(self.entities): 147 | if(b <= a): continue 148 | [f_a, f_b] = self.get_collision_force(entity_a, entity_b) 149 | if(f_a is not None): 150 | if(p_force[a] is None): p_force[a] = 0.0 151 | p_force[a] = f_a + p_force[a] 152 | if(f_b is not None): 153 | if(p_force[b] is None): p_force[b] = 0.0 154 | p_force[b] = f_b + p_force[b] 155 | return p_force 156 | 157 | # integrate physical state 158 | def integrate_state(self, p_force): 159 | for i,entity in enumerate(self.entities): 160 | if not entity.movable: continue 161 | entity.state.p_vel = entity.state.p_vel * (1 - self.damping) 162 | if (p_force[i] is not None): 163 | entity.state.p_vel += (p_force[i] / entity.mass) * self.dt 164 | if entity.max_speed is not None: 165 | speed = np.sqrt(np.square(entity.state.p_vel[0]) + 
np.square(entity.state.p_vel[1])) 166 | if speed > entity.max_speed: 167 | entity.state.p_vel = entity.state.p_vel / np.sqrt(np.square(entity.state.p_vel[0]) + 168 | np.square(entity.state.p_vel[1])) * entity.max_speed 169 | entity.state.p_pos += entity.state.p_vel * self.dt 170 | 171 | def update_agent_state(self, agent): 172 | # set communication state (directly for now) 173 | if agent.silent: 174 | agent.state.c = np.zeros(self.dim_c) 175 | else: 176 | noise = np.random.randn(*agent.action.c.shape) * agent.c_noise if agent.c_noise else 0.0 177 | agent.state.c = agent.action.c + noise 178 | 179 | # get collision forces for any contact between two entities 180 | def get_collision_force(self, entity_a, entity_b): 181 | if (not entity_a.collide) or (not entity_b.collide): 182 | return [None, None] # not a collider 183 | if (entity_a is entity_b): 184 | return [None, None] # don't collide against itself 185 | # compute actual distance between entities 186 | delta_pos = entity_a.state.p_pos - entity_b.state.p_pos 187 | dist = np.sqrt(np.sum(np.square(delta_pos))) 188 | # minimum allowable distance 189 | dist_min = entity_a.size + entity_b.size 190 | # softmax penetration 191 | k = self.contact_margin 192 | penetration = np.logaddexp(0, -(dist - dist_min)/k)*k 193 | force = self.contact_force * delta_pos / dist * penetration 194 | force_a = +force if entity_a.movable else None 195 | force_b = -force if entity_b.movable else None 196 | return [force_a, force_b] -------------------------------------------------------------------------------- /environments/multiagent_particle_envs/multiagent/multi_discrete.py: -------------------------------------------------------------------------------- 1 | # An old version of OpenAI Gym's multi_discrete.py. (Was getting affected by Gym updates) 2 | # (https://github.com/openai/gym/blob/1fb81d4e3fb780ccf77fec731287ba07da35eb84/gym/spaces/multi_discrete.py) 3 | 4 | import numpy as np 5 | 6 | import gym 7 | from gym.spaces import prng 8 | 9 | class MultiDiscrete(gym.Space): 10 | """ 11 | - The multi-discrete action space consists of a series of discrete action spaces with different parameters 12 | - It can be adapted to both a Discrete action space or a continuous (Box) action space 13 | - It is useful to represent game controllers or keyboards where each key can be represented as a discrete action space 14 | - It is parametrized by passing an array of arrays containing [min, max] for each discrete action space 15 | where the discrete action space can take any integers from `min` to `max` (both inclusive) 16 | Note: A value of 0 always need to represent the NOOP action. 17 | e.g. 
Nintendo Game Controller 18 | - Can be conceptualized as 3 discrete action spaces: 19 | 1) Arrow Keys: Discrete 5 - NOOP[0], UP[1], RIGHT[2], DOWN[3], LEFT[4] - params: min: 0, max: 4 20 | 2) Button A: Discrete 2 - NOOP[0], Pressed[1] - params: min: 0, max: 1 21 | 3) Button B: Discrete 2 - NOOP[0], Pressed[1] - params: min: 0, max: 1 22 | - Can be initialized as 23 | MultiDiscrete([ [0,4], [0,1], [0,1] ]) 24 | """ 25 | def __init__(self, array_of_param_array): 26 | self.low = np.array([x[0] for x in array_of_param_array]) 27 | self.high = np.array([x[1] for x in array_of_param_array]) 28 | self.num_discrete_space = self.low.shape[0] 29 | 30 | def sample(self): 31 | """ Returns a array with one sample from each discrete action space """ 32 | # For each row: round(random .* (max - min) + min, 0) 33 | random_array = prng.np_random.rand(self.num_discrete_space) 34 | return [int(x) for x in np.floor(np.multiply((self.high - self.low + 1.), random_array) + self.low)] 35 | def contains(self, x): 36 | return len(x) == self.num_discrete_space and (np.array(x) >= self.low).all() and (np.array(x) <= self.high).all() 37 | 38 | @property 39 | def shape(self): 40 | return self.num_discrete_space 41 | def __repr__(self): 42 | return "MultiDiscrete" + str(self.num_discrete_space) 43 | def __eq__(self, other): 44 | return np.array_equal(self.low, other.low) and np.array_equal(self.high, other.high) -------------------------------------------------------------------------------- /environments/multiagent_particle_envs/multiagent/policy.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | from pyglet.window import key 3 | 4 | # individual agent policy 5 | class Policy(object): 6 | def __init__(self): 7 | pass 8 | def action(self, obs): 9 | raise NotImplementedError() 10 | 11 | # interactive policy based on keyboard input 12 | # hard-coded to deal only with movement, not communication 13 | class InteractivePolicy(Policy): 14 | def __init__(self, env, agent_index): 15 | super(InteractivePolicy, self).__init__() 16 | self.env = env 17 | # hard-coded keyboard events 18 | self.move = [False for i in range(4)] 19 | self.comm = [False for i in range(env.world.dim_c)] 20 | # register keyboard events with this environment's window 21 | env.viewers[agent_index].window.on_key_press = self.key_press 22 | env.viewers[agent_index].window.on_key_release = self.key_release 23 | 24 | def action(self, obs): 25 | # ignore observation and just act based on keyboard events 26 | if self.env.discrete_action_input: 27 | u = 0 28 | if self.move[0]: u = 1 29 | if self.move[1]: u = 2 30 | if self.move[2]: u = 4 31 | if self.move[3]: u = 3 32 | else: 33 | u = np.zeros(5) # 5-d because of no-move action 34 | if self.move[0]: u[1] += 1.0 35 | if self.move[1]: u[2] += 1.0 36 | if self.move[3]: u[3] += 1.0 37 | if self.move[2]: u[4] += 1.0 38 | if True not in self.move: 39 | u[0] += 1.0 40 | return np.concatenate([u, np.zeros(self.env.world.dim_c)]) 41 | 42 | # keyboard event callbacks 43 | def key_press(self, k, mod): 44 | if k==key.LEFT: self.move[0] = True 45 | if k==key.RIGHT: self.move[1] = True 46 | if k==key.UP: self.move[2] = True 47 | if k==key.DOWN: self.move[3] = True 48 | def key_release(self, k, mod): 49 | if k==key.LEFT: self.move[0] = False 50 | if k==key.RIGHT: self.move[1] = False 51 | if k==key.UP: self.move[2] = False 52 | if k==key.DOWN: self.move[3] = False 53 | -------------------------------------------------------------------------------- 
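
Before moving on to the rendering code, it may help to see the entity dynamics from `core.py` above in isolation. `World.integrate_state` performs a damped (semi-implicit) Euler step: the velocity is first damped, then accelerated by force/mass over one timestep, optionally clamped to `max_speed`, and finally used to advance the position. The standalone sketch below (illustration only; the example numbers are made up) reproduces that update for a single entity with the default `dt = 0.1` and `damping = 0.25`.

```python
# Standalone illustration of the per-entity update in World.integrate_state
# (core.py). The example numbers are made up; the constants match the World
# defaults (dt = 0.1, damping = 0.25).
import numpy as np

dt, damping = 0.1, 0.25
mass, max_speed = 1.0, 1.0

p_vel = np.array([1.0, 0.0])            # current velocity
p_pos = np.array([0.0, 0.0])            # current position
force = np.array([5.0, 0.0])            # applied force (action + collision)

p_vel = p_vel * (1 - damping)           # damping      -> [0.75, 0.00]
p_vel += (force / mass) * dt            # acceleration -> [1.25, 0.00]

speed = np.linalg.norm(p_vel)
if max_speed is not None and speed > max_speed:
    p_vel = p_vel / speed * max_speed   # speed clamp  -> [1.00, 0.00]

p_pos += p_vel * dt                     # integrate    -> [0.10, 0.00]
```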
/environments/multiagent_particle_envs/multiagent/rendering.py: -------------------------------------------------------------------------------- 1 | """ 2 | 2D rendering framework 3 | """ 4 | from __future__ import division 5 | import os 6 | import six 7 | import sys 8 | 9 | if "Apple" in sys.version: 10 | if 'DYLD_FALLBACK_LIBRARY_PATH' in os.environ: 11 | os.environ['DYLD_FALLBACK_LIBRARY_PATH'] += ':/usr/lib' 12 | # (JDS 2016/04/15): avoid bug on Anaconda 2.3.0 / Yosemite 13 | 14 | from gym.utils import reraise 15 | from gym import error 16 | 17 | try: 18 | import pyglet 19 | except ImportError as e: 20 | reraise(suffix="HINT: you can install pyglet directly via 'pip install pyglet'. But if you really just want to install all Gym dependencies and not have to think about it, 'pip install -e .[all]' or 'pip install gym[all]' will do it.") 21 | 22 | try: 23 | from pyglet.gl import * 24 | except ImportError as e: 25 | reraise(prefix="Error occured while running `from pyglet.gl import *`",suffix="HINT: make sure you have OpenGL install. On Ubuntu, you can run 'apt-get install python-opengl'. If you're running on a server, you may need a virtual frame buffer; something like this should work: 'xvfb-run -s \"-screen 0 1400x900x24\" python '") 26 | 27 | import math 28 | import numpy as np 29 | 30 | RAD2DEG = 57.29577951308232 31 | 32 | def get_display(spec): 33 | """Convert a display specification (such as :0) into an actual Display 34 | object. 35 | 36 | Pyglet only supports multiple Displays on Linux. 37 | """ 38 | if spec is None: 39 | return None 40 | elif isinstance(spec, six.string_types): 41 | return pyglet.canvas.Display(spec) 42 | else: 43 | raise error.Error('Invalid display specification: {}. (Must be a string like :0 or None.)'.format(spec)) 44 | 45 | class Viewer(object): 46 | def __init__(self, width, height, display=None): 47 | display = get_display(display) 48 | 49 | self.width = width 50 | self.height = height 51 | 52 | self.window = pyglet.window.Window(width=width, height=height, display=display) 53 | self.window.on_close = self.window_closed_by_user 54 | self.geoms = [] 55 | self.onetime_geoms = [] 56 | self.transform = Transform() 57 | 58 | glEnable(GL_BLEND) 59 | # glEnable(GL_MULTISAMPLE) 60 | glEnable(GL_LINE_SMOOTH) 61 | # glHint(GL_LINE_SMOOTH_HINT, GL_DONT_CARE) 62 | glHint(GL_LINE_SMOOTH_HINT, GL_NICEST) 63 | glLineWidth(2.0) 64 | glBlendFunc(GL_SRC_ALPHA, GL_ONE_MINUS_SRC_ALPHA) 65 | 66 | def close(self): 67 | self.window.close() 68 | 69 | def window_closed_by_user(self): 70 | self.close() 71 | 72 | def set_bounds(self, left, right, bottom, top): 73 | assert right > left and top > bottom 74 | scalex = self.width/(right-left) 75 | scaley = self.height/(top-bottom) 76 | self.transform = Transform( 77 | translation=(-left*scalex, -bottom*scaley), 78 | scale=(scalex, scaley)) 79 | 80 | def add_geom(self, geom): 81 | self.geoms.append(geom) 82 | 83 | def add_onetime(self, geom): 84 | self.onetime_geoms.append(geom) 85 | 86 | def render(self, return_rgb_array=False): 87 | glClearColor(1,1,1,1) 88 | self.window.clear() 89 | self.window.switch_to() 90 | self.window.dispatch_events() 91 | self.transform.enable() 92 | for geom in self.geoms: 93 | geom.render() 94 | for geom in self.onetime_geoms: 95 | geom.render() 96 | self.transform.disable() 97 | arr = None 98 | if return_rgb_array: 99 | buffer = pyglet.image.get_buffer_manager().get_color_buffer() 100 | image_data = buffer.get_image_data() 101 | arr = np.fromstring(image_data.data, dtype=np.uint8, sep='') 102 | # In 
https://github.com/openai/gym-http-api/issues/2, we 103 | # discovered that someone using Xmonad on Arch was having 104 | # a window of size 598 x 398, though a 600 x 400 window 105 | # was requested. (Guess Xmonad was preserving a pixel for 106 | # the boundary.) So we use the buffer height/width rather 107 | # than the requested one. 108 | arr = arr.reshape(buffer.height, buffer.width, 4) 109 | arr = arr[::-1,:,0:3] 110 | self.window.flip() 111 | self.onetime_geoms = [] 112 | return arr 113 | 114 | # Convenience 115 | def draw_circle(self, radius=10, res=30, filled=True, **attrs): 116 | geom = make_circle(radius=radius, res=res, filled=filled) 117 | _add_attrs(geom, attrs) 118 | self.add_onetime(geom) 119 | return geom 120 | 121 | def draw_polygon(self, v, filled=True, **attrs): 122 | geom = make_polygon(v=v, filled=filled) 123 | _add_attrs(geom, attrs) 124 | self.add_onetime(geom) 125 | return geom 126 | 127 | def draw_polyline(self, v, **attrs): 128 | geom = make_polyline(v=v) 129 | _add_attrs(geom, attrs) 130 | self.add_onetime(geom) 131 | return geom 132 | 133 | def draw_line(self, start, end, **attrs): 134 | geom = Line(start, end) 135 | _add_attrs(geom, attrs) 136 | self.add_onetime(geom) 137 | return geom 138 | 139 | def get_array(self): 140 | self.window.flip() 141 | image_data = pyglet.image.get_buffer_manager().get_color_buffer().get_image_data() 142 | self.window.flip() 143 | arr = np.fromstring(image_data.data, dtype=np.uint8, sep='') 144 | arr = arr.reshape(self.height, self.width, 4) 145 | return arr[::-1,:,0:3] 146 | 147 | def _add_attrs(geom, attrs): 148 | if "color" in attrs: 149 | geom.set_color(*attrs["color"]) 150 | if "linewidth" in attrs: 151 | geom.set_linewidth(attrs["linewidth"]) 152 | 153 | class Geom(object): 154 | def __init__(self): 155 | self._color=Color((0, 0, 0, 1.0)) 156 | self.attrs = [self._color] 157 | def render(self): 158 | for attr in reversed(self.attrs): 159 | attr.enable() 160 | self.render1() 161 | for attr in self.attrs: 162 | attr.disable() 163 | def render1(self): 164 | raise NotImplementedError 165 | def add_attr(self, attr): 166 | self.attrs.append(attr) 167 | def set_color(self, r, g, b, alpha=1): 168 | self._color.vec4 = (r, g, b, alpha) 169 | 170 | class Attr(object): 171 | def enable(self): 172 | raise NotImplementedError 173 | def disable(self): 174 | pass 175 | 176 | class Transform(Attr): 177 | def __init__(self, translation=(0.0, 0.0), rotation=0.0, scale=(1,1)): 178 | self.set_translation(*translation) 179 | self.set_rotation(rotation) 180 | self.set_scale(*scale) 181 | def enable(self): 182 | glPushMatrix() 183 | glTranslatef(self.translation[0], self.translation[1], 0) # translate to GL loc ppint 184 | glRotatef(RAD2DEG * self.rotation, 0, 0, 1.0) 185 | glScalef(self.scale[0], self.scale[1], 1) 186 | def disable(self): 187 | glPopMatrix() 188 | def set_translation(self, newx, newy): 189 | self.translation = (float(newx), float(newy)) 190 | def set_rotation(self, new): 191 | self.rotation = float(new) 192 | def set_scale(self, newx, newy): 193 | self.scale = (float(newx), float(newy)) 194 | 195 | class Color(Attr): 196 | def __init__(self, vec4): 197 | self.vec4 = vec4 198 | def enable(self): 199 | glColor4f(*self.vec4) 200 | 201 | class LineStyle(Attr): 202 | def __init__(self, style): 203 | self.style = style 204 | def enable(self): 205 | glEnable(GL_LINE_STIPPLE) 206 | glLineStipple(1, self.style) 207 | def disable(self): 208 | glDisable(GL_LINE_STIPPLE) 209 | 210 | class LineWidth(Attr): 211 | def __init__(self, stroke): 212 
| self.stroke = stroke 213 | def enable(self): 214 | glLineWidth(self.stroke) 215 | 216 | class Point(Geom): 217 | def __init__(self): 218 | Geom.__init__(self) 219 | def render1(self): 220 | glBegin(GL_POINTS) # draw point 221 | glVertex3f(0.0, 0.0, 0.0) 222 | glEnd() 223 | 224 | class FilledPolygon(Geom): 225 | def __init__(self, v): 226 | Geom.__init__(self) 227 | self.v = v 228 | def render1(self): 229 | if len(self.v) == 4 : glBegin(GL_QUADS) 230 | elif len(self.v) > 4 : glBegin(GL_POLYGON) 231 | else: glBegin(GL_TRIANGLES) 232 | for p in self.v: 233 | glVertex3f(p[0], p[1],0) # draw each vertex 234 | glEnd() 235 | 236 | color = (self._color.vec4[0] * 0.5, self._color.vec4[1] * 0.5, self._color.vec4[2] * 0.5, self._color.vec4[3] * 0.5) 237 | glColor4f(*color) 238 | glBegin(GL_LINE_LOOP) 239 | for p in self.v: 240 | glVertex3f(p[0], p[1],0) # draw each vertex 241 | glEnd() 242 | 243 | def make_circle(radius=10, res=30, filled=True): 244 | points = [] 245 | for i in range(res): 246 | ang = 2*math.pi*i / res 247 | points.append((math.cos(ang)*radius, math.sin(ang)*radius)) 248 | if filled: 249 | return FilledPolygon(points) 250 | else: 251 | return PolyLine(points, True) 252 | 253 | def make_polygon(v, filled=True): 254 | if filled: return FilledPolygon(v) 255 | else: return PolyLine(v, True) 256 | 257 | def make_polyline(v): 258 | return PolyLine(v, False) 259 | 260 | def make_capsule(length, width): 261 | l, r, t, b = 0, length, width/2, -width/2 262 | box = make_polygon([(l,b), (l,t), (r,t), (r,b)]) 263 | circ0 = make_circle(width/2) 264 | circ1 = make_circle(width/2) 265 | circ1.add_attr(Transform(translation=(length, 0))) 266 | geom = Compound([box, circ0, circ1]) 267 | return geom 268 | 269 | class Compound(Geom): 270 | def __init__(self, gs): 271 | Geom.__init__(self) 272 | self.gs = gs 273 | for g in self.gs: 274 | g.attrs = [a for a in g.attrs if not isinstance(a, Color)] 275 | def render1(self): 276 | for g in self.gs: 277 | g.render() 278 | 279 | class PolyLine(Geom): 280 | def __init__(self, v, close): 281 | Geom.__init__(self) 282 | self.v = v 283 | self.close = close 284 | self.linewidth = LineWidth(1) 285 | self.add_attr(self.linewidth) 286 | def render1(self): 287 | glBegin(GL_LINE_LOOP if self.close else GL_LINE_STRIP) 288 | for p in self.v: 289 | glVertex3f(p[0], p[1],0) # draw each vertex 290 | glEnd() 291 | def set_linewidth(self, x): 292 | self.linewidth.stroke = x 293 | 294 | class Line(Geom): 295 | def __init__(self, start=(0.0, 0.0), end=(0.0, 0.0)): 296 | Geom.__init__(self) 297 | self.start = start 298 | self.end = end 299 | self.linewidth = LineWidth(1) 300 | self.add_attr(self.linewidth) 301 | 302 | def render1(self): 303 | glBegin(GL_LINES) 304 | glVertex2f(*self.start) 305 | glVertex2f(*self.end) 306 | glEnd() 307 | 308 | class Image(Geom): 309 | def __init__(self, fname, width, height): 310 | Geom.__init__(self) 311 | self.width = width 312 | self.height = height 313 | img = pyglet.image.load(fname) 314 | self.img = img 315 | self.flip = False 316 | def render1(self): 317 | self.img.blit(-self.width/2, -self.height/2, width=self.width, height=self.height) 318 | 319 | # ================================================================ 320 | 321 | class SimpleImageViewer(object): 322 | def __init__(self, display=None): 323 | self.window = None 324 | self.isopen = False 325 | self.display = display 326 | def imshow(self, arr): 327 | if self.window is None: 328 | height, width, channels = arr.shape 329 | self.window = pyglet.window.Window(width=width, 
height=height, display=self.display) 330 | self.width = width 331 | self.height = height 332 | self.isopen = True 333 | assert arr.shape == (self.height, self.width, 3), "You passed in an image with the wrong number shape" 334 | image = pyglet.image.ImageData(self.width, self.height, 'RGB', arr.tobytes(), pitch=self.width * -3) 335 | self.window.clear() 336 | self.window.switch_to() 337 | self.window.dispatch_events() 338 | image.blit(0,0) 339 | self.window.flip() 340 | def close(self): 341 | if self.isopen: 342 | self.window.close() 343 | self.isopen = False 344 | def __del__(self): 345 | self.close() -------------------------------------------------------------------------------- /environments/multiagent_particle_envs/multiagent/scenario.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | 3 | # defines scenario upon which the world is built 4 | class BaseScenario(object): 5 | # create elements of the world 6 | def make_world(self): 7 | raise NotImplementedError() 8 | # create initial conditions of the world 9 | def reset_world(self, world): 10 | raise NotImplementedError() 11 | -------------------------------------------------------------------------------- /environments/multiagent_particle_envs/multiagent/scenarios/__init__.py: -------------------------------------------------------------------------------- 1 | import imp 2 | import os.path as osp 3 | 4 | 5 | def load(name): 6 | pathname = osp.join(osp.dirname(__file__), name) 7 | return imp.load_source('', pathname) 8 | -------------------------------------------------------------------------------- /environments/multiagent_particle_envs/multiagent/scenarios/simple.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | from multiagent.core import World, Agent, Landmark 3 | from multiagent.scenario import BaseScenario 4 | 5 | class Scenario(BaseScenario): 6 | def make_world(self): 7 | world = World() 8 | # add agents 9 | world.agents = [Agent() for i in range(1)] 10 | for i, agent in enumerate(world.agents): 11 | agent.name = 'agent %d' % i 12 | agent.collide = False 13 | agent.silent = True 14 | # add landmarks 15 | world.landmarks = [Landmark() for i in range(1)] 16 | for i, landmark in enumerate(world.landmarks): 17 | landmark.name = 'landmark %d' % i 18 | landmark.collide = False 19 | landmark.movable = False 20 | # make initial conditions 21 | self.reset_world(world) 22 | return world 23 | 24 | def reset_world(self, world): 25 | # random properties for agents 26 | for i, agent in enumerate(world.agents): 27 | agent.color = np.array([0.25,0.25,0.25]) 28 | # random properties for landmarks 29 | for i, landmark in enumerate(world.landmarks): 30 | landmark.color = np.array([0.75,0.75,0.75]) 31 | world.landmarks[0].color = np.array([0.75,0.25,0.25]) 32 | # set random initial states 33 | for agent in world.agents: 34 | agent.state.p_pos = np.random.uniform(-1,+1, world.dim_p) 35 | agent.state.p_vel = np.zeros(world.dim_p) 36 | agent.state.c = np.zeros(world.dim_c) 37 | for i, landmark in enumerate(world.landmarks): 38 | landmark.state.p_pos = np.random.uniform(-1,+1, world.dim_p) 39 | landmark.state.p_vel = np.zeros(world.dim_p) 40 | 41 | def reward(self, agent, world): 42 | dist2 = np.sum(np.square(agent.state.p_pos - world.landmarks[0].state.p_pos)) 43 | return -dist2 44 | 45 | def observation(self, agent, world): 46 | # get positions of all entities in this agent's reference frame 47 | entity_pos = [] 48 | for entity in 
world.landmarks: 49 | entity_pos.append(entity.state.p_pos - agent.state.p_pos) 50 | return np.concatenate([agent.state.p_vel] + entity_pos) 51 | -------------------------------------------------------------------------------- /environments/multiagent_particle_envs/multiagent/scenarios/simple_adversary.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | from multiagent.core import World, Agent, Landmark 3 | from multiagent.scenario import BaseScenario 4 | 5 | 6 | class Scenario(BaseScenario): 7 | 8 | def make_world(self): 9 | world = World() 10 | # set any world properties first 11 | world.dim_c = 2 12 | num_agents = 3 13 | world.num_agents = num_agents 14 | num_adversaries = 1 15 | num_landmarks = num_agents - 1 16 | # add agents 17 | world.agents = [Agent() for i in range(num_agents)] 18 | for i, agent in enumerate(world.agents): 19 | agent.name = 'agent %d' % i 20 | agent.collide = False 21 | agent.silent = True 22 | agent.adversary = True if i < num_adversaries else False 23 | agent.size = 0.15 24 | # add landmarks 25 | world.landmarks = [Landmark() for i in range(num_landmarks)] 26 | for i, landmark in enumerate(world.landmarks): 27 | landmark.name = 'landmark %d' % i 28 | landmark.collide = False 29 | landmark.movable = False 30 | landmark.size = 0.08 31 | # make initial conditions 32 | self.reset_world(world) 33 | return world 34 | 35 | def reset_world(self, world): 36 | # random properties for agents 37 | world.agents[0].color = np.array([0.85, 0.35, 0.35]) 38 | for i in range(1, world.num_agents): 39 | world.agents[i].color = np.array([0.35, 0.35, 0.85]) 40 | # random properties for landmarks 41 | for i, landmark in enumerate(world.landmarks): 42 | landmark.color = np.array([0.15, 0.15, 0.15]) 43 | # set goal landmark 44 | goal = np.random.choice(world.landmarks) 45 | goal.color = np.array([0.15, 0.65, 0.15]) 46 | for agent in world.agents: 47 | agent.goal_a = goal 48 | # set random initial states 49 | for agent in world.agents: 50 | agent.state.p_pos = np.random.uniform(-1, +1, world.dim_p) 51 | agent.state.p_vel = np.zeros(world.dim_p) 52 | agent.state.c = np.zeros(world.dim_c) 53 | for i, landmark in enumerate(world.landmarks): 54 | landmark.state.p_pos = np.random.uniform(-1, +1, world.dim_p) 55 | landmark.state.p_vel = np.zeros(world.dim_p) 56 | 57 | def benchmark_data(self, agent, world): 58 | # returns data for benchmarking purposes 59 | if agent.adversary: 60 | return np.sum(np.square(agent.state.p_pos - agent.goal_a.state.p_pos)) 61 | else: 62 | dists = [] 63 | for l in world.landmarks: 64 | dists.append(np.sum(np.square(agent.state.p_pos - l.state.p_pos))) 65 | dists.append(np.sum(np.square(agent.state.p_pos - agent.goal_a.state.p_pos))) 66 | return tuple(dists) 67 | 68 | # return all agents that are not adversaries 69 | def good_agents(self, world): 70 | return [agent for agent in world.agents if not agent.adversary] 71 | 72 | # return all adversarial agents 73 | def adversaries(self, world): 74 | return [agent for agent in world.agents if agent.adversary] 75 | 76 | def reward(self, agent, world): 77 | # Agents are rewarded based on minimum agent distance to each landmark 78 | return self.adversary_reward(agent, world) if agent.adversary else self.agent_reward(agent, world) 79 | 80 | def agent_reward(self, agent, world): 81 | # Rewarded based on how close any good agent is to the goal landmark, and how far the adversary is from it 82 | shaped_reward = True 83 | shaped_adv_reward = True 84 | 85 | # Calculate 
negative reward for adversary 86 | adversary_agents = self.adversaries(world) 87 | if shaped_adv_reward: # distance-based adversary reward 88 | adv_rew = sum([np.sqrt(np.sum(np.square(a.state.p_pos - a.goal_a.state.p_pos))) for a in adversary_agents]) 89 | else: # proximity-based adversary reward (binary) 90 | adv_rew = 0 91 | for a in adversary_agents: 92 | if np.sqrt(np.sum(np.square(a.state.p_pos - a.goal_a.state.p_pos))) < 2 * a.goal_a.size: 93 | adv_rew -= 5 94 | 95 | # Calculate positive reward for agents 96 | good_agents = self.good_agents(world) 97 | if shaped_reward: # distance-based agent reward 98 | pos_rew = -min( 99 | [np.sqrt(np.sum(np.square(a.state.p_pos - a.goal_a.state.p_pos))) for a in good_agents]) 100 | else: # proximity-based agent reward (binary) 101 | pos_rew = 0 102 | if min([np.sqrt(np.sum(np.square(a.state.p_pos - a.goal_a.state.p_pos))) for a in good_agents]) \ 103 | < 2 * agent.goal_a.size: 104 | pos_rew += 5 105 | pos_rew -= min( 106 | [np.sqrt(np.sum(np.square(a.state.p_pos - a.goal_a.state.p_pos))) for a in good_agents]) 107 | return pos_rew + adv_rew 108 | 109 | def adversary_reward(self, agent, world): 110 | # Rewarded based on proximity to the goal landmark 111 | shaped_reward = True 112 | if shaped_reward: # distance-based reward 113 | return -np.sum(np.square(agent.state.p_pos - agent.goal_a.state.p_pos)) 114 | else: # proximity-based reward (binary) 115 | adv_rew = 0 116 | if np.sqrt(np.sum(np.square(agent.state.p_pos - agent.goal_a.state.p_pos))) < 2 * agent.goal_a.size: 117 | adv_rew += 5 118 | return adv_rew 119 | 120 | 121 | def observation(self, agent, world): 122 | # get positions of all entities in this agent's reference frame 123 | entity_pos = [] 124 | for entity in world.landmarks: 125 | entity_pos.append(entity.state.p_pos - agent.state.p_pos) 126 | # entity colors 127 | entity_color = [] 128 | for entity in world.landmarks: 129 | entity_color.append(entity.color) 130 | # communication of all other agents 131 | other_pos = [] 132 | for other in world.agents: 133 | if other is agent: continue 134 | other_pos.append(other.state.p_pos - agent.state.p_pos) 135 | 136 | if not agent.adversary: 137 | return np.concatenate([agent.goal_a.state.p_pos - agent.state.p_pos] + entity_pos + other_pos) 138 | else: 139 | return np.concatenate(entity_pos + other_pos) 140 | -------------------------------------------------------------------------------- /environments/multiagent_particle_envs/multiagent/scenarios/simple_crypto.py: -------------------------------------------------------------------------------- 1 | """ 2 | Scenario: 3 | 1 speaker, 2 listeners (one of which is an adversary). Good agents rewarded for proximity to goal, and distance from 4 | adversary to goal. Adversary is rewarded for its distance to the goal. 
5 | """ 6 | 7 | 8 | import numpy as np 9 | from multiagent.core import World, Agent, Landmark 10 | from multiagent.scenario import BaseScenario 11 | import random 12 | 13 | 14 | class CryptoAgent(Agent): 15 | def __init__(self): 16 | super(CryptoAgent, self).__init__() 17 | self.key = None 18 | 19 | class Scenario(BaseScenario): 20 | 21 | def make_world(self): 22 | world = World() 23 | # set any world properties first 24 | num_agents = 3 25 | num_adversaries = 1 26 | num_landmarks = 2 27 | world.dim_c = 4 28 | # add agents 29 | world.agents = [CryptoAgent() for i in range(num_agents)] 30 | for i, agent in enumerate(world.agents): 31 | agent.name = 'agent %d' % i 32 | agent.collide = False 33 | agent.adversary = True if i < num_adversaries else False 34 | agent.speaker = True if i == 2 else False 35 | agent.movable = False 36 | # add landmarks 37 | world.landmarks = [Landmark() for i in range(num_landmarks)] 38 | for i, landmark in enumerate(world.landmarks): 39 | landmark.name = 'landmark %d' % i 40 | landmark.collide = False 41 | landmark.movable = False 42 | # make initial conditions 43 | self.reset_world(world) 44 | return world 45 | 46 | 47 | def reset_world(self, world): 48 | # random properties for agents 49 | for i, agent in enumerate(world.agents): 50 | agent.color = np.array([0.25, 0.25, 0.25]) 51 | if agent.adversary: 52 | agent.color = np.array([0.75, 0.25, 0.25]) 53 | agent.key = None 54 | # random properties for landmarks 55 | color_list = [np.zeros(world.dim_c) for i in world.landmarks] 56 | for i, color in enumerate(color_list): 57 | color[i] += 1 58 | for color, landmark in zip(color_list, world.landmarks): 59 | landmark.color = color 60 | # set goal landmark 61 | goal = np.random.choice(world.landmarks) 62 | world.agents[1].color = goal.color 63 | world.agents[2].key = np.random.choice(world.landmarks).color 64 | 65 | for agent in world.agents: 66 | agent.goal_a = goal 67 | 68 | # set random initial states 69 | for agent in world.agents: 70 | agent.state.p_pos = np.random.uniform(-1, +1, world.dim_p) 71 | agent.state.p_vel = np.zeros(world.dim_p) 72 | agent.state.c = np.zeros(world.dim_c) 73 | for i, landmark in enumerate(world.landmarks): 74 | landmark.state.p_pos = np.random.uniform(-1, +1, world.dim_p) 75 | landmark.state.p_vel = np.zeros(world.dim_p) 76 | 77 | 78 | def benchmark_data(self, agent, world): 79 | # returns data for benchmarking purposes 80 | return (agent.state.c, agent.goal_a.color) 81 | 82 | # return all agents that are not adversaries 83 | def good_listeners(self, world): 84 | return [agent for agent in world.agents if not agent.adversary and not agent.speaker] 85 | 86 | # return all agents that are not adversaries 87 | def good_agents(self, world): 88 | return [agent for agent in world.agents if not agent.adversary] 89 | 90 | # return all adversarial agents 91 | def adversaries(self, world): 92 | return [agent for agent in world.agents if agent.adversary] 93 | 94 | def reward(self, agent, world): 95 | return self.adversary_reward(agent, world) if agent.adversary else self.agent_reward(agent, world) 96 | 97 | def agent_reward(self, agent, world): 98 | # Agents rewarded if Bob can reconstruct message, but adversary (Eve) cannot 99 | good_listeners = self.good_listeners(world) 100 | adversaries = self.adversaries(world) 101 | good_rew = 0 102 | adv_rew = 0 103 | for a in good_listeners: 104 | if (a.state.c == np.zeros(world.dim_c)).all(): 105 | continue 106 | else: 107 | good_rew -= np.sum(np.square(a.state.c - agent.goal_a.color)) 108 | for a in 
adversaries: 109 | if (a.state.c == np.zeros(world.dim_c)).all(): 110 | continue 111 | else: 112 | adv_l1 = np.sum(np.square(a.state.c - agent.goal_a.color)) 113 | adv_rew += adv_l1 114 | return adv_rew + good_rew 115 | 116 | def adversary_reward(self, agent, world): 117 | # Adversary (Eve) is rewarded if it can reconstruct original goal 118 | rew = 0 119 | if not (agent.state.c == np.zeros(world.dim_c)).all(): 120 | rew -= np.sum(np.square(agent.state.c - agent.goal_a.color)) 121 | return rew 122 | 123 | 124 | def observation(self, agent, world): 125 | # goal color 126 | goal_color = np.zeros(world.dim_color) 127 | if agent.goal_a is not None: 128 | goal_color = agent.goal_a.color 129 | 130 | # get positions of all entities in this agent's reference frame 131 | entity_pos = [] 132 | for entity in world.landmarks: 133 | entity_pos.append(entity.state.p_pos - agent.state.p_pos) 134 | # communication of all other agents 135 | comm = [] 136 | for other in world.agents: 137 | if other is agent or (other.state.c is None) or not other.speaker: continue 138 | comm.append(other.state.c) 139 | 140 | confer = np.array([0]) 141 | 142 | if world.agents[2].key is None: 143 | confer = np.array([1]) 144 | key = np.zeros(world.dim_c) 145 | goal_color = np.zeros(world.dim_c) 146 | else: 147 | key = world.agents[2].key 148 | 149 | prnt = False 150 | # speaker 151 | if agent.speaker: 152 | if prnt: 153 | print('speaker') 154 | print(agent.state.c) 155 | print(np.concatenate([goal_color] + [key] + [confer] + [np.random.randn(1)])) 156 | return np.concatenate([goal_color] + [key]) 157 | # listener 158 | if not agent.speaker and not agent.adversary: 159 | if prnt: 160 | print('listener') 161 | print(agent.state.c) 162 | print(np.concatenate([key] + comm + [confer])) 163 | return np.concatenate([key] + comm) 164 | if not agent.speaker and agent.adversary: 165 | if prnt: 166 | print('adversary') 167 | print(agent.state.c) 168 | print(np.concatenate(comm + [confer])) 169 | return np.concatenate(comm) 170 | -------------------------------------------------------------------------------- /environments/multiagent_particle_envs/multiagent/scenarios/simple_push.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | from multiagent.core import World, Agent, Landmark 3 | from multiagent.scenario import BaseScenario 4 | 5 | class Scenario(BaseScenario): 6 | def make_world(self): 7 | world = World() 8 | # set any world properties first 9 | world.dim_c = 2 10 | num_agents = 2 11 | num_adversaries = 1 12 | num_landmarks = 2 13 | # add agents 14 | world.agents = [Agent() for i in range(num_agents)] 15 | for i, agent in enumerate(world.agents): 16 | agent.name = 'agent %d' % i 17 | agent.collide = True 18 | agent.silent = True 19 | if i < num_adversaries: 20 | agent.adversary = True 21 | else: 22 | agent.adversary = False 23 | # add landmarks 24 | world.landmarks = [Landmark() for i in range(num_landmarks)] 25 | for i, landmark in enumerate(world.landmarks): 26 | landmark.name = 'landmark %d' % i 27 | landmark.collide = False 28 | landmark.movable = False 29 | # make initial conditions 30 | self.reset_world(world) 31 | return world 32 | 33 | def reset_world(self, world): 34 | # random properties for landmarks 35 | for i, landmark in enumerate(world.landmarks): 36 | landmark.color = np.array([0.1, 0.1, 0.1]) 37 | landmark.color[i + 1] += 0.8 38 | landmark.index = i 39 | # set goal landmark 40 | goal = np.random.choice(world.landmarks) 41 | for i, agent in 
enumerate(world.agents): 42 | agent.goal_a = goal 43 | agent.color = np.array([0.25, 0.25, 0.25]) 44 | if agent.adversary: 45 | agent.color = np.array([0.75, 0.25, 0.25]) 46 | else: 47 | j = goal.index 48 | agent.color[j + 1] += 0.5 49 | # set random initial states 50 | for agent in world.agents: 51 | agent.state.p_pos = np.random.uniform(-1, +1, world.dim_p) 52 | agent.state.p_vel = np.zeros(world.dim_p) 53 | agent.state.c = np.zeros(world.dim_c) 54 | for i, landmark in enumerate(world.landmarks): 55 | landmark.state.p_pos = np.random.uniform(-1, +1, world.dim_p) 56 | landmark.state.p_vel = np.zeros(world.dim_p) 57 | 58 | def reward(self, agent, world): 59 | # Agents are rewarded based on minimum agent distance to each landmark 60 | return self.adversary_reward(agent, world) if agent.adversary else self.agent_reward(agent, world) 61 | 62 | def agent_reward(self, agent, world): 63 | # the distance to the goal 64 | return -np.sqrt(np.sum(np.square(agent.state.p_pos - agent.goal_a.state.p_pos))) 65 | 66 | def adversary_reward(self, agent, world): 67 | # keep the nearest good agents away from the goal 68 | agent_dist = [np.sqrt(np.sum(np.square(a.state.p_pos - a.goal_a.state.p_pos))) for a in world.agents if not a.adversary] 69 | pos_rew = min(agent_dist) 70 | #nearest_agent = world.good_agents[np.argmin(agent_dist)] 71 | #neg_rew = np.sqrt(np.sum(np.square(nearest_agent.state.p_pos - agent.state.p_pos))) 72 | neg_rew = np.sqrt(np.sum(np.square(agent.goal_a.state.p_pos - agent.state.p_pos))) 73 | #neg_rew = sum([np.sqrt(np.sum(np.square(a.state.p_pos - agent.state.p_pos))) for a in world.good_agents]) 74 | return pos_rew - neg_rew 75 | 76 | def observation(self, agent, world): 77 | # get positions of all entities in this agent's reference frame 78 | entity_pos = [] 79 | for entity in world.landmarks: # world.entities: 80 | entity_pos.append(entity.state.p_pos - agent.state.p_pos) 81 | # entity colors 82 | entity_color = [] 83 | for entity in world.landmarks: # world.entities: 84 | entity_color.append(entity.color) 85 | # communication of all other agents 86 | comm = [] 87 | other_pos = [] 88 | for other in world.agents: 89 | if other is agent: continue 90 | comm.append(other.state.c) 91 | other_pos.append(other.state.p_pos - agent.state.p_pos) 92 | if not agent.adversary: 93 | return np.concatenate([agent.state.p_vel] + [agent.goal_a.state.p_pos - agent.state.p_pos] + [agent.color] + entity_pos + entity_color + other_pos) 94 | else: 95 | #other_pos = list(reversed(other_pos)) if random.uniform(0,1) > 0.5 else other_pos # randomize position of other agents in adversary network 96 | return np.concatenate([agent.state.p_vel] + entity_pos + other_pos) 97 | -------------------------------------------------------------------------------- /environments/multiagent_particle_envs/multiagent/scenarios/simple_reference.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | from multiagent.core import World, Agent, Landmark 3 | from multiagent.scenario import BaseScenario 4 | 5 | class Scenario(BaseScenario): 6 | def make_world(self): 7 | world = World() 8 | # set any world properties first 9 | world.dim_c = 10 10 | world.collaborative = True # whether agents share rewards 11 | # add agents 12 | world.agents = [Agent() for i in range(2)] 13 | for i, agent in enumerate(world.agents): 14 | agent.name = 'agent %d' % i 15 | agent.collide = False 16 | # add landmarks 17 | world.landmarks = [Landmark() for i in range(3)] 18 | for i, landmark in 
enumerate(world.landmarks): 19 | landmark.name = 'landmark %d' % i 20 | landmark.collide = False 21 | landmark.movable = False 22 | # make initial conditions 23 | self.reset_world(world) 24 | return world 25 | 26 | def reset_world(self, world): 27 | # assign goals to agents 28 | for agent in world.agents: 29 | agent.goal_a = None 30 | agent.goal_b = None 31 | # want other agent to go to the goal landmark 32 | world.agents[0].goal_a = world.agents[1] 33 | world.agents[0].goal_b = np.random.choice(world.landmarks) 34 | world.agents[1].goal_a = world.agents[0] 35 | world.agents[1].goal_b = np.random.choice(world.landmarks) 36 | # random properties for agents 37 | for i, agent in enumerate(world.agents): 38 | agent.color = np.array([0.25,0.25,0.25]) 39 | # random properties for landmarks 40 | world.landmarks[0].color = np.array([0.75,0.25,0.25]) 41 | world.landmarks[1].color = np.array([0.25,0.75,0.25]) 42 | world.landmarks[2].color = np.array([0.25,0.25,0.75]) 43 | # special colors for goals 44 | world.agents[0].goal_a.color = world.agents[0].goal_b.color 45 | world.agents[1].goal_a.color = world.agents[1].goal_b.color 46 | # set random initial states 47 | for agent in world.agents: 48 | agent.state.p_pos = np.random.uniform(-1,+1, world.dim_p) 49 | agent.state.p_vel = np.zeros(world.dim_p) 50 | agent.state.c = np.zeros(world.dim_c) 51 | for i, landmark in enumerate(world.landmarks): 52 | landmark.state.p_pos = np.random.uniform(-1,+1, world.dim_p) 53 | landmark.state.p_vel = np.zeros(world.dim_p) 54 | 55 | def reward(self, agent, world): 56 | if agent.goal_a is None or agent.goal_b is None: 57 | return 0.0 58 | dist2 = np.sum(np.square(agent.goal_a.state.p_pos - agent.goal_b.state.p_pos)) 59 | return -dist2 60 | 61 | def observation(self, agent, world): 62 | # goal color 63 | goal_color = [np.zeros(world.dim_color), np.zeros(world.dim_color)] 64 | if agent.goal_b is not None: 65 | goal_color[1] = agent.goal_b.color 66 | 67 | # get positions of all entities in this agent's reference frame 68 | entity_pos = [] 69 | for entity in world.landmarks: 70 | entity_pos.append(entity.state.p_pos - agent.state.p_pos) 71 | # entity colors 72 | entity_color = [] 73 | for entity in world.landmarks: 74 | entity_color.append(entity.color) 75 | # communication of all other agents 76 | comm = [] 77 | for other in world.agents: 78 | if other is agent: continue 79 | comm.append(other.state.c) 80 | return np.concatenate([agent.state.p_vel] + entity_pos + [goal_color[1]] + comm) 81 | -------------------------------------------------------------------------------- /environments/multiagent_particle_envs/multiagent/scenarios/simple_speaker_listener.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | from multiagent.core import World, Agent, Landmark 3 | from multiagent.scenario import BaseScenario 4 | 5 | class Scenario(BaseScenario): 6 | def make_world(self): 7 | world = World() 8 | # set any world properties first 9 | world.dim_c = 3 10 | num_landmarks = 3 11 | world.collaborative = True 12 | # add agents 13 | world.agents = [Agent() for i in range(2)] 14 | for i, agent in enumerate(world.agents): 15 | agent.name = 'agent %d' % i 16 | agent.collide = False 17 | agent.size = 0.075 18 | # speaker 19 | world.agents[0].movable = False 20 | # listener 21 | world.agents[1].silent = True 22 | # add landmarks 23 | world.landmarks = [Landmark() for i in range(num_landmarks)] 24 | for i, landmark in enumerate(world.landmarks): 25 | landmark.name = 'landmark %d' % i 
26 | landmark.collide = False 27 | landmark.movable = False 28 | landmark.size = 0.04 29 | # make initial conditions 30 | self.reset_world(world) 31 | return world 32 | 33 | def reset_world(self, world): 34 | # assign goals to agents 35 | for agent in world.agents: 36 | agent.goal_a = None 37 | agent.goal_b = None 38 | # want listener to go to the goal landmark 39 | world.agents[0].goal_a = world.agents[1] 40 | world.agents[0].goal_b = np.random.choice(world.landmarks) 41 | # random properties for agents 42 | for i, agent in enumerate(world.agents): 43 | agent.color = np.array([0.25,0.25,0.25]) 44 | # random properties for landmarks 45 | world.landmarks[0].color = np.array([0.65,0.15,0.15]) 46 | world.landmarks[1].color = np.array([0.15,0.65,0.15]) 47 | world.landmarks[2].color = np.array([0.15,0.15,0.65]) 48 | # special colors for goals 49 | world.agents[0].goal_a.color = world.agents[0].goal_b.color + np.array([0.45, 0.45, 0.45]) 50 | # set random initial states 51 | for agent in world.agents: 52 | agent.state.p_pos = np.random.uniform(-1,+1, world.dim_p) 53 | agent.state.p_vel = np.zeros(world.dim_p) 54 | agent.state.c = np.zeros(world.dim_c) 55 | for i, landmark in enumerate(world.landmarks): 56 | landmark.state.p_pos = np.random.uniform(-1,+1, world.dim_p) 57 | landmark.state.p_vel = np.zeros(world.dim_p) 58 | 59 | def benchmark_data(self, agent, world): 60 | # returns data for benchmarking purposes 61 | return self.reward(agent, reward) 62 | 63 | def reward(self, agent, world): 64 | # squared distance from listener to landmark 65 | a = world.agents[0] 66 | dist2 = np.sum(np.square(a.goal_a.state.p_pos - a.goal_b.state.p_pos)) 67 | return -dist2 68 | 69 | def observation(self, agent, world): 70 | # goal color 71 | goal_color = np.zeros(world.dim_color) 72 | if agent.goal_b is not None: 73 | goal_color = agent.goal_b.color 74 | 75 | # get positions of all entities in this agent's reference frame 76 | entity_pos = [] 77 | for entity in world.landmarks: 78 | entity_pos.append(entity.state.p_pos - agent.state.p_pos) 79 | 80 | # communication of all other agents 81 | comm = [] 82 | for other in world.agents: 83 | if other is agent or (other.state.c is None): continue 84 | comm.append(other.state.c) 85 | 86 | # speaker 87 | if not agent.movable: 88 | return np.concatenate([goal_color]) 89 | # listener 90 | if agent.silent: 91 | return np.concatenate([agent.state.p_vel] + entity_pos + comm) 92 | 93 | -------------------------------------------------------------------------------- /environments/multiagent_particle_envs/multiagent/scenarios/simple_spread.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | from multiagent.core import World, Agent, Landmark 3 | from multiagent.scenario import BaseScenario 4 | 5 | 6 | class Scenario(BaseScenario): 7 | def make_world(self): 8 | world = World() 9 | # set any world properties first 10 | world.dim_c = 2 11 | num_agents = 3 12 | num_landmarks = 3 13 | world.collaborative = True 14 | # add agents 15 | world.agents = [Agent() for i in range(num_agents)] 16 | for i, agent in enumerate(world.agents): 17 | agent.name = 'agent %d' % i 18 | agent.collide = True 19 | agent.silent = True 20 | agent.size = 0.15 21 | # add landmarks 22 | world.landmarks = [Landmark() for i in range(num_landmarks)] 23 | for i, landmark in enumerate(world.landmarks): 24 | landmark.name = 'landmark %d' % i 25 | landmark.collide = False 26 | landmark.movable = False 27 | # make initial conditions 28 | 
self.reset_world(world) 29 | return world 30 | 31 | def reset_world(self, world): 32 | # random properties for agents 33 | for i, agent in enumerate(world.agents): 34 | agent.color = np.array([0.35, 0.35, 0.85]) 35 | # random properties for landmarks 36 | for i, landmark in enumerate(world.landmarks): 37 | landmark.color = np.array([0.25, 0.25, 0.25]) 38 | # set random initial states 39 | for agent in world.agents: 40 | agent.state.p_pos = np.random.uniform(-1, +1, world.dim_p) 41 | agent.state.p_vel = np.zeros(world.dim_p) 42 | agent.state.c = np.zeros(world.dim_c) 43 | for i, landmark in enumerate(world.landmarks): 44 | landmark.state.p_pos = np.random.uniform(-1, +1, world.dim_p) 45 | landmark.state.p_vel = np.zeros(world.dim_p) 46 | 47 | def benchmark_data(self, agent, world): 48 | rew = 0 49 | collisions = 0 50 | occupied_landmarks = 0 51 | min_dists = 0 52 | for l in world.landmarks: 53 | dists = [np.sqrt(np.sum(np.square(a.state.p_pos - l.state.p_pos))) for a in world.agents] 54 | min_dists += min(dists) 55 | rew -= min(dists) 56 | if min(dists) < 0.1: 57 | occupied_landmarks += 1 58 | if agent.collide: 59 | for a in world.agents: 60 | if self.is_collision(a, agent): 61 | rew -= 1 62 | collisions += 1 63 | return (rew, collisions, min_dists, occupied_landmarks) 64 | 65 | 66 | def is_collision(self, agent1, agent2): 67 | delta_pos = agent1.state.p_pos - agent2.state.p_pos 68 | dist = np.sqrt(np.sum(np.square(delta_pos))) 69 | dist_min = agent1.size + agent2.size 70 | return True if dist < dist_min else False 71 | 72 | def reward(self, agent, world): 73 | # Agents are rewarded based on minimum agent distance to each landmark, penalized for collisions 74 | rew = 0 75 | for l in world.landmarks: 76 | dists = [np.sqrt(np.sum(np.square(a.state.p_pos - l.state.p_pos))) for a in world.agents] 77 | rew -= min(dists) 78 | if agent.collide: 79 | for a in world.agents: 80 | if self.is_collision(a, agent): 81 | rew -= 1 82 | return rew 83 | 84 | def observation(self, agent, world): 85 | # get positions of all entities in this agent's reference frame 86 | entity_pos = [] 87 | for entity in world.landmarks: # world.entities: 88 | entity_pos.append(entity.state.p_pos - agent.state.p_pos) 89 | # entity colors 90 | entity_color = [] 91 | for entity in world.landmarks: # world.entities: 92 | entity_color.append(entity.color) 93 | # communication of all other agents 94 | comm = [] 95 | other_pos = [] 96 | for other in world.agents: 97 | if other is agent: continue 98 | comm.append(other.state.c) 99 | other_pos.append(other.state.p_pos - agent.state.p_pos) 100 | return np.concatenate([agent.state.p_vel] + [agent.state.p_pos] + entity_pos + other_pos + comm) 101 | -------------------------------------------------------------------------------- /environments/multiagent_particle_envs/multiagent/scenarios/simple_tag.py: -------------------------------------------------------------------------------- 1 | ''' 2 | COPY this file into multiagent(openai) repr to replace the old one and install again 3 | ''' 4 | import numpy as np 5 | from multiagent.core import World, Agent, Landmark, Action 6 | from multiagent.scenario import BaseScenario 7 | 8 | # By Yuan Zhang: 9 | def random_action(agent,world): 10 | action = Action() 11 | action.u = np.zeros(world.dim_p) 12 | action.c = np.zeros(world.dim_c) 13 | random_action = np.random.choice(5) 14 | # process discrete action 15 | if random_action == 1: action.u[0] = -1.0 16 | if random_action == 2: action.u[0] = +1.0 17 | if random_action == 3: action.u[1] = -1.0 18 | 
if random_action == 4: action.u[1] = +1.0 19 | 20 | # accel of prey 21 | sensitivity = 5.0 22 | if agent.accel is not None: 23 | sensitivity = agent.accel 24 | action.u *= sensitivity 25 | return action 26 | 27 | class Scenario(BaseScenario): 28 | def make_world(self): 29 | world = World() 30 | # By Yuan Zhang 31 | world.collaborative = True 32 | # set any world properties first 33 | world.dim_c = 2 34 | num_good_agents = 1 35 | num_adversaries = 3 36 | num_agents = num_adversaries + num_good_agents 37 | num_landmarks = 2 38 | # add agents 39 | world.agents = [Agent() for i in range(num_agents)] 40 | for i, agent in enumerate(world.agents): 41 | agent.name = 'agent %d' % i 42 | agent.collide = True 43 | agent.silent = True 44 | agent.adversary = True if i < num_adversaries else False 45 | agent.size = 0.075 if agent.adversary else 0.05 46 | agent.accel = 3.0 if agent.adversary else 4.0 # 3.0 4.0 47 | #agent.accel = 20.0 if agent.adversary else 25.0 48 | agent.max_speed = 1.0 if agent.adversary else 1.3 # 1.0 1.3 49 | # By Yuan Zhang: 50 | agent.action_callback = random_action if not agent.adversary else None 51 | 52 | # add landmarks 53 | world.landmarks = [Landmark() for i in range(num_landmarks)] 54 | for i, landmark in enumerate(world.landmarks): 55 | landmark.name = 'landmark %d' % i 56 | landmark.collide = True 57 | landmark.movable = False 58 | landmark.size = 0.2 # 0.2 59 | landmark.boundary = False 60 | # make initial conditions 61 | self.reset_world(world) 62 | self.done = False 63 | return world 64 | 65 | 66 | def reset_world(self, world): 67 | self.done = False 68 | # random properties for agents 69 | for i, agent in enumerate(world.agents): 70 | agent.color = np.array([0.35, 0.85, 0.35]) if not agent.adversary else np.array([0.85, 0.35, 0.35]) 71 | # random properties for landmarks 72 | for i, landmark in enumerate(world.landmarks): 73 | landmark.color = np.array([0.25, 0.25, 0.25]) 74 | # set random initial states 75 | for agent in world.agents: 76 | agent.state.p_pos = np.random.uniform(-1, +1, world.dim_p) 77 | agent.state.p_vel = np.zeros(world.dim_p) 78 | agent.state.c = np.zeros(world.dim_c) 79 | for i, landmark in enumerate(world.landmarks): 80 | if not landmark.boundary: 81 | landmark.state.p_pos = np.random.uniform(-0.9, +0.9, world.dim_p) 82 | landmark.state.p_vel = np.zeros(world.dim_p) 83 | 84 | 85 | def benchmark_data(self, agent, world): 86 | # returns data for benchmarking purposes 87 | if agent.adversary: 88 | collisions = 0 89 | for a in self.good_agents(world): 90 | if self.is_collision(a, agent): 91 | collisions += 1 92 | return collisions 93 | else: 94 | return 0 95 | 96 | 97 | def is_collision(self, agent1, agent2): 98 | delta_pos = agent1.state.p_pos - agent2.state.p_pos 99 | dist = np.sqrt(np.sum(np.square(delta_pos))) 100 | dist_min = agent1.size + agent2.size 101 | return True if dist < dist_min else False 102 | 103 | # return all agents that are not adversaries 104 | def good_agents(self, world): 105 | return [agent for agent in world.agents if not agent.adversary] 106 | 107 | # return all adversarial agents 108 | def adversaries(self, world): 109 | return [agent for agent in world.agents if agent.adversary] 110 | 111 | 112 | def reward(self, agent, world): 113 | # Agents are rewarded based on minimum agent distance to each landmark 114 | main_reward = self.adversary_reward(agent, world) if agent.adversary else self.agent_reward(agent, world) 115 | return main_reward 116 | 117 | def agent_reward(self, agent, world): 118 | # Agents are negatively 
rewarded if caught by adversaries 119 | rew = 0 120 | shape = False 121 | adversaries = self.adversaries(world) 122 | if shape: # reward can optionally be shaped (increased reward for increased distance from adversary) 123 | for adv in adversaries: 124 | rew += 0.1 * np.sqrt(np.sum(np.square(agent.state.p_pos - adv.state.p_pos))) 125 | if agent.collide: 126 | for a in adversaries: 127 | if self.is_collision(a, agent): 128 | rew -= 10 129 | 130 | # agents are penalized for exiting the screen, so that they can be caught by the adversaries 131 | def bound(x): 132 | if x < 0.9: 133 | return 0 134 | if x < 1.0: 135 | return (x - 0.9) * 10 136 | return min(np.exp(2 * x - 2), 10) 137 | for p in range(world.dim_p): 138 | x = abs(agent.state.p_pos[p]) 139 | rew -= bound(x) 140 | 141 | return rew 142 | 143 | def adversary_reward(self, agent, world): 144 | # Adversaries are rewarded for collisions with agents 145 | rew = 0 146 | shape = True # False 147 | agents = self.good_agents(world) 148 | adversaries = self.adversaries(world) 149 | if shape: # reward can optionally be shaped (decreased reward for increased distance from agents) 150 | for adv in adversaries: 151 | rew -= 0.1 * min([np.sqrt(np.sum(np.square(a.state.p_pos - adv.state.p_pos))) for a in agents]) 152 | if agent.collide: 153 | for ag in agents: 154 | for adv in adversaries: 155 | if self.is_collision(ag, adv): 156 | rew += 10 157 | self.done = True 158 | return rew 159 | 160 | def observation(self, agent, world): 161 | # get positions of all entities in this agent's reference frame 162 | entity_pos = [] 163 | for entity in world.landmarks: 164 | if not entity.boundary: 165 | entity_pos.append(entity.state.p_pos - agent.state.p_pos) 166 | # communication of all other agents 167 | comm = [] 168 | other_pos = [] 169 | other_vel = [] 170 | for other in world.agents: 171 | if other is agent: continue 172 | comm.append(other.state.c) 173 | other_pos.append(other.state.p_pos - agent.state.p_pos) 174 | if not other.adversary: 175 | other_vel.append(other.state.p_vel) 176 | return np.concatenate([agent.state.p_vel] + [agent.state.p_pos] + entity_pos + other_pos + other_vel) 177 | 178 | def episode_over(self, agent, world): 179 | return self.done 180 | -------------------------------------------------------------------------------- /environments/multiagent_particle_envs/multiagent/scenarios/simple_world_comm.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | from multiagent.core import World, Agent, Landmark 3 | from multiagent.scenario import BaseScenario 4 | 5 | 6 | class Scenario(BaseScenario): 7 | def make_world(self): 8 | world = World() 9 | # set any world properties first 10 | world.dim_c = 4 11 | #world.damping = 1 12 | num_good_agents = 2 13 | num_adversaries = 4 14 | num_agents = num_adversaries + num_good_agents 15 | num_landmarks = 1 16 | num_food = 2 17 | num_forests = 2 18 | # add agents 19 | world.agents = [Agent() for i in range(num_agents)] 20 | for i, agent in enumerate(world.agents): 21 | agent.name = 'agent %d' % i 22 | agent.collide = True 23 | agent.leader = True if i == 0 else False 24 | agent.silent = True if i > 0 else False 25 | agent.adversary = True if i < num_adversaries else False 26 | agent.size = 0.075 if agent.adversary else 0.045 27 | agent.accel = 3.0 if agent.adversary else 4.0 28 | #agent.accel = 20.0 if agent.adversary else 25.0 29 | agent.max_speed = 1.0 if agent.adversary else 1.3 30 | # add landmarks 31 | world.landmarks = [Landmark() for i in 
range(num_landmarks)] 32 | for i, landmark in enumerate(world.landmarks): 33 | landmark.name = 'landmark %d' % i 34 | landmark.collide = True 35 | landmark.movable = False 36 | landmark.size = 0.2 37 | landmark.boundary = False 38 | world.food = [Landmark() for i in range(num_food)] 39 | for i, landmark in enumerate(world.food): 40 | landmark.name = 'food %d' % i 41 | landmark.collide = False 42 | landmark.movable = False 43 | landmark.size = 0.03 44 | landmark.boundary = False 45 | world.forests = [Landmark() for i in range(num_forests)] 46 | for i, landmark in enumerate(world.forests): 47 | landmark.name = 'forest %d' % i 48 | landmark.collide = False 49 | landmark.movable = False 50 | landmark.size = 0.3 51 | landmark.boundary = False 52 | world.landmarks += world.food 53 | world.landmarks += world.forests 54 | #world.landmarks += self.set_boundaries(world) # world boundaries now penalized with negative reward 55 | # make initial conditions 56 | self.reset_world(world) 57 | return world 58 | 59 | def set_boundaries(self, world): 60 | boundary_list = [] 61 | landmark_size = 1 62 | edge = 1 + landmark_size 63 | num_landmarks = int(edge * 2 / landmark_size) 64 | for x_pos in [-edge, edge]: 65 | for i in range(num_landmarks): 66 | l = Landmark() 67 | l.state.p_pos = np.array([x_pos, -1 + i * landmark_size]) 68 | boundary_list.append(l) 69 | 70 | for y_pos in [-edge, edge]: 71 | for i in range(num_landmarks): 72 | l = Landmark() 73 | l.state.p_pos = np.array([-1 + i * landmark_size, y_pos]) 74 | boundary_list.append(l) 75 | 76 | for i, l in enumerate(boundary_list): 77 | l.name = 'boundary %d' % i 78 | l.collide = True 79 | l.movable = False 80 | l.boundary = True 81 | l.color = np.array([0.75, 0.75, 0.75]) 82 | l.size = landmark_size 83 | l.state.p_vel = np.zeros(world.dim_p) 84 | 85 | return boundary_list 86 | 87 | 88 | def reset_world(self, world): 89 | # random properties for agents 90 | for i, agent in enumerate(world.agents): 91 | agent.color = np.array([0.45, 0.95, 0.45]) if not agent.adversary else np.array([0.95, 0.45, 0.45]) 92 | agent.color -= np.array([0.3, 0.3, 0.3]) if agent.leader else np.array([0, 0, 0]) 93 | # random properties for landmarks 94 | for i, landmark in enumerate(world.landmarks): 95 | landmark.color = np.array([0.25, 0.25, 0.25]) 96 | for i, landmark in enumerate(world.food): 97 | landmark.color = np.array([0.15, 0.15, 0.65]) 98 | for i, landmark in enumerate(world.forests): 99 | landmark.color = np.array([0.6, 0.9, 0.6]) 100 | # set random initial states 101 | for agent in world.agents: 102 | agent.state.p_pos = np.random.uniform(-1, +1, world.dim_p) 103 | agent.state.p_vel = np.zeros(world.dim_p) 104 | agent.state.c = np.zeros(world.dim_c) 105 | for i, landmark in enumerate(world.landmarks): 106 | landmark.state.p_pos = np.random.uniform(-0.9, +0.9, world.dim_p) 107 | landmark.state.p_vel = np.zeros(world.dim_p) 108 | for i, landmark in enumerate(world.food): 109 | landmark.state.p_pos = np.random.uniform(-0.9, +0.9, world.dim_p) 110 | landmark.state.p_vel = np.zeros(world.dim_p) 111 | for i, landmark in enumerate(world.forests): 112 | landmark.state.p_pos = np.random.uniform(-0.9, +0.9, world.dim_p) 113 | landmark.state.p_vel = np.zeros(world.dim_p) 114 | 115 | def benchmark_data(self, agent, world): 116 | if agent.adversary: 117 | collisions = 0 118 | for a in self.good_agents(world): 119 | if self.is_collision(a, agent): 120 | collisions += 1 121 | return collisions 122 | else: 123 | return 0 124 | 125 | 126 | def is_collision(self, agent1, agent2): 127 | 
delta_pos = agent1.state.p_pos - agent2.state.p_pos 128 | dist = np.sqrt(np.sum(np.square(delta_pos))) 129 | dist_min = agent1.size + agent2.size 130 | return True if dist < dist_min else False 131 | 132 | 133 | # return all agents that are not adversaries 134 | def good_agents(self, world): 135 | return [agent for agent in world.agents if not agent.adversary] 136 | 137 | # return all adversarial agents 138 | def adversaries(self, world): 139 | return [agent for agent in world.agents if agent.adversary] 140 | 141 | 142 | def reward(self, agent, world): 143 | # Agents are rewarded based on minimum agent distance to each landmark 144 | #boundary_reward = -10 if self.outside_boundary(agent) else 0 145 | main_reward = self.adversary_reward(agent, world) if agent.adversary else self.agent_reward(agent, world) 146 | return main_reward 147 | 148 | def outside_boundary(self, agent): 149 | if agent.state.p_pos[0] > 1 or agent.state.p_pos[0] < -1 or agent.state.p_pos[1] > 1 or agent.state.p_pos[1] < -1: 150 | return True 151 | else: 152 | return False 153 | 154 | 155 | def agent_reward(self, agent, world): 156 | # Agents are rewarded based on minimum agent distance to each landmark 157 | rew = 0 158 | shape = False 159 | adversaries = self.adversaries(world) 160 | if shape: 161 | for adv in adversaries: 162 | rew += 0.1 * np.sqrt(np.sum(np.square(agent.state.p_pos - adv.state.p_pos))) 163 | if agent.collide: 164 | for a in adversaries: 165 | if self.is_collision(a, agent): 166 | rew -= 5 167 | def bound(x): 168 | if x < 0.9: 169 | return 0 170 | if x < 1.0: 171 | return (x - 0.9) * 10 172 | return min(np.exp(2 * x - 2), 10) # 1 + (x - 1) * (x - 1) 173 | 174 | for p in range(world.dim_p): 175 | x = abs(agent.state.p_pos[p]) 176 | rew -= 2 * bound(x) 177 | 178 | for food in world.food: 179 | if self.is_collision(agent, food): 180 | rew += 2 181 | rew += 0.05 * min([np.sqrt(np.sum(np.square(food.state.p_pos - agent.state.p_pos))) for food in world.food]) 182 | 183 | return rew 184 | 185 | def adversary_reward(self, agent, world): 186 | # Agents are rewarded based on minimum agent distance to each landmark 187 | rew = 0 188 | shape = True 189 | agents = self.good_agents(world) 190 | adversaries = self.adversaries(world) 191 | if shape: 192 | rew -= 0.1 * min([np.sqrt(np.sum(np.square(a.state.p_pos - agent.state.p_pos))) for a in agents]) 193 | if agent.collide: 194 | for ag in agents: 195 | for adv in adversaries: 196 | if self.is_collision(ag, adv): 197 | rew += 5 198 | return rew 199 | 200 | 201 | def observation2(self, agent, world): 202 | # get positions of all entities in this agent's reference frame 203 | entity_pos = [] 204 | for entity in world.landmarks: 205 | if not entity.boundary: 206 | entity_pos.append(entity.state.p_pos - agent.state.p_pos) 207 | 208 | food_pos = [] 209 | for entity in world.food: 210 | if not entity.boundary: 211 | food_pos.append(entity.state.p_pos - agent.state.p_pos) 212 | # communication of all other agents 213 | comm = [] 214 | other_pos = [] 215 | other_vel = [] 216 | for other in world.agents: 217 | if other is agent: continue 218 | comm.append(other.state.c) 219 | other_pos.append(other.state.p_pos - agent.state.p_pos) 220 | if not other.adversary: 221 | other_vel.append(other.state.p_vel) 222 | return np.concatenate([agent.state.p_vel] + [agent.state.p_pos] + entity_pos + other_pos + other_vel) 223 | 224 | def observation(self, agent, world): 225 | # get positions of all entities in this agent's reference frame 226 | entity_pos = [] 227 | for entity in 
world.landmarks: 228 | if not entity.boundary: 229 | entity_pos.append(entity.state.p_pos - agent.state.p_pos) 230 | 231 | in_forest = [np.array([-1]), np.array([-1])] 232 | inf1 = False 233 | inf2 = False 234 | if self.is_collision(agent, world.forests[0]): 235 | in_forest[0] = np.array([1]) 236 | inf1= True 237 | if self.is_collision(agent, world.forests[1]): 238 | in_forest[1] = np.array([1]) 239 | inf2 = True 240 | 241 | food_pos = [] 242 | for entity in world.food: 243 | if not entity.boundary: 244 | food_pos.append(entity.state.p_pos - agent.state.p_pos) 245 | # communication of all other agents 246 | comm = [] 247 | other_pos = [] 248 | other_vel = [] 249 | for other in world.agents: 250 | if other is agent: continue 251 | comm.append(other.state.c) 252 | oth_f1 = self.is_collision(other, world.forests[0]) 253 | oth_f2 = self.is_collision(other, world.forests[1]) 254 | if (inf1 and oth_f1) or (inf2 and oth_f2) or (not inf1 and not oth_f1 and not inf2 and not oth_f2) or agent.leader: #without forest vis 255 | other_pos.append(other.state.p_pos - agent.state.p_pos) 256 | if not other.adversary: 257 | other_vel.append(other.state.p_vel) 258 | else: 259 | other_pos.append([0, 0]) 260 | if not other.adversary: 261 | other_vel.append([0, 0]) 262 | 263 | # to tell the pred when the prey are in the forest 264 | prey_forest = [] 265 | ga = self.good_agents(world) 266 | for a in ga: 267 | if any([self.is_collision(a, f) for f in world.forests]): 268 | prey_forest.append(np.array([1])) 269 | else: 270 | prey_forest.append(np.array([-1])) 271 | # to tell leader when pred are in forest 272 | prey_forest_lead = [] 273 | for f in world.forests: 274 | if any([self.is_collision(a, f) for a in ga]): 275 | prey_forest_lead.append(np.array([1])) 276 | else: 277 | prey_forest_lead.append(np.array([-1])) 278 | 279 | comm = [world.agents[0].state.c] 280 | 281 | if agent.adversary and not agent.leader: 282 | return np.concatenate([agent.state.p_vel] + [agent.state.p_pos] + entity_pos + other_pos + other_vel + in_forest + comm) 283 | if agent.leader: 284 | return np.concatenate( 285 | [agent.state.p_vel] + [agent.state.p_pos] + entity_pos + other_pos + other_vel + in_forest + comm) 286 | else: 287 | return np.concatenate([agent.state.p_vel] + [agent.state.p_pos] + entity_pos + other_pos + in_forest + other_vel) 288 | 289 | 290 | -------------------------------------------------------------------------------- /environments/multiagent_particle_envs/setup.py: -------------------------------------------------------------------------------- 1 | from setuptools import setup, find_packages 2 | 3 | setup(name='multiagent', 4 | version='0.0.1', 5 | description='Multi-Agent Goal-Driven Communication Environment', 6 | url='https://github.com/openai/multiagent-public', 7 | author='Igor Mordatch', 8 | author_email='mordatch@openai.com', 9 | packages=find_packages(), 10 | include_package_data=True, 11 | zip_safe=False, 12 | install_requires=['gym', 'numpy-stl'] 13 | ) 14 | -------------------------------------------------------------------------------- /environments/traffic_helper.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- coding: utf-8 -*- 3 | import numpy as np 4 | 5 | move = [(-1,0),(1,0),(0,-1),(0,1)] 6 | 7 | def get_road_blocks(w, h, difficulty): 8 | 9 | # assuming 1 is the lane width for each direction. 
10 | road_blocks = { 11 | 'easy': [ np.s_[h//2, :], 12 | np.s_[:, w//2]], 13 | 14 | 'medium': [ np.s_[h//2 - 1 : h//2 + 1, :], 15 | np.s_[:, w//2 - 1 : w//2 + 1]], 16 | 17 | 'hard': [ np.s_[h//3-2: h//3, :], 18 | np.s_[2* h//3: 2* h//3 + 2, :], 19 | 20 | np.s_[:, w//3-2: w//3], 21 | np.s_[:, 2* h//3: 2* h//3 + 2]], 22 | } 23 | 24 | return road_blocks[difficulty] 25 | 26 | def goal_reached(place_i, curr, finish_points): 27 | return curr in finish_points[:place_i] + finish_points[place_i+1:] 28 | 29 | 30 | def get_add_mat(dims, grid, difficulty): 31 | h,w = dims 32 | 33 | road_dir = grid.copy() 34 | junction = np.zeros_like(grid) 35 | 36 | if difficulty == 'medium': 37 | arrival_points = [ (0, w//2-1), # TOP 38 | (h-1,w//2), # BOTTOM 39 | (h//2, 0), # LEFT 40 | (h//2-1,w-1)] # RIGHT 41 | 42 | finish_points = [ (0, w//2), # TOP 43 | (h-1,w//2-1), # BOTTOM 44 | (h//2-1, 0), # LEFT 45 | (h//2,w-1)] # RIGHT 46 | 47 | # mark road direction 48 | road_dir[h//2, :] = 2 49 | road_dir[h//2 - 1, :] = 3 50 | road_dir[:, w//2 ] = 4 51 | 52 | # mark the Junction 53 | junction[h//2-1:h//2+1,w//2-1:w//2+1 ] =1 54 | 55 | elif difficulty =='hard': 56 | arrival_points = [ (0, w//3-2), # TOP-left 57 | (0,2*w//3), # TOP-right 58 | 59 | (h//3-1, 0), # LEFT-top 60 | (2*h//3+1,0), # LEFT-bottom 61 | 62 | (h-1,w//3-1), # BOTTOM-left 63 | (h-1,2*w//3+1), # BOTTOM-right 64 | 65 | (h//3-2, w-1), # RIGHT-top 66 | (2*h//3,w-1)] # RIGHT-bottom 67 | 68 | 69 | finish_points = [ (0, w//3-1), # TOP-left 70 | (0,2*w//3+1), # TOP-right 71 | 72 | (h//3-2, 0), # LEFT-top 73 | (2*h//3,0), # LEFT-bottom 74 | 75 | (h-1,w//3-2), # BOTTOM-left 76 | (h-1,2*w//3), # BOTTOM-right 77 | 78 | (h//3-1, w-1), # RIGHT-top 79 | (2*h//3+1,w-1)] # RIGHT-bottom 80 | 81 | # mark road direction 82 | road_dir[h//3-1, :] = 2 83 | road_dir[2*h//3, :] = 3 84 | road_dir[2*h//3 + 1, :] = 4 85 | 86 | road_dir[:, w//3-2 ] = 5 87 | road_dir[:, w//3-1 ] = 6 88 | road_dir[:, 2*w//3 ] = 7 89 | road_dir[:, 2*w//3 +1] = 8 90 | 91 | # mark the Junctions 92 | junction[h//3-2:h//3, w//3-2:w//3 ] = 1 93 | junction[2*h//3:2*h//3+2, w//3-2:w//3 ] = 1 94 | 95 | junction[h//3-2:h//3, 2*w//3:2*w//3+2 ] = 1 96 | junction[2*h//3:2*h//3+2, 2*w//3:2*w//3+2 ] = 1 97 | 98 | return arrival_points, finish_points, road_dir, junction 99 | 100 | 101 | def next_move(curr, turn, turn_step, start, grid, road_dir, junction, visited): 102 | h,w = grid.shape 103 | turn_completed = False 104 | turn_prog = False 105 | neigh =[] 106 | for m in move: 107 | # check lane while taking left turn 108 | n = (curr[0] + m[0], curr[1] + m[1]) 109 | if 0 <= n[0] <= h-1 and 0 <= n[1] <= w-1 and grid[n] and n not in visited: 110 | # On Junction, use turns 111 | if junction[n] == junction[curr] == 1: 112 | if (turn == 0 or turn == 2) and ((n[0] == start[0]) or (n[1] == start[1])): 113 | # Straight on junction for either left or straight 114 | neigh.append(n) 115 | if turn == 2: 116 | turn_prog = True 117 | 118 | # left from junction 119 | elif turn == 2 and turn_step ==1: 120 | neigh.append(n) 121 | turn_prog = True 122 | 123 | else: 124 | # End of path 125 | pass 126 | 127 | # Completing left turn on junction 128 | elif junction[curr] and not junction[n] and turn ==2 and turn_step==2 \ 129 | and (abs(start[0] - n[0]) ==2 or abs(start[1] - n[1]) ==2): 130 | neigh.append(n) 131 | turn_completed =True 132 | 133 | # junction seen, get onto it; 134 | elif (junction[n] and not junction[curr]): 135 | neigh.append(n) 136 | 137 | # right from junction 138 | elif turn == 1 and not junction[n] and junction[curr]: 
139 | neigh.append(n) 140 | turn_completed =True 141 | 142 | # Straight from jucntion 143 | elif turn == 0 and junction[curr] and road_dir[n] == road_dir[start]: 144 | neigh.append(n) 145 | turn_completed = True 146 | 147 | # keep going no decision to make; 148 | elif road_dir[n] == road_dir[curr] and not junction[curr]: 149 | neigh.append(n) 150 | 151 | if neigh: 152 | return neigh[0], turn_prog, turn_completed 153 | if len(neigh) != 1: 154 | raise RuntimeError("next move should be of len 1. Reached ambiguous situation.") 155 | 156 | 157 | 158 | def get_routes(dims, grid, difficulty): 159 | ''' 160 | returns 161 | - routes: type list of list 162 | list for each arrival point of list of routes from that arrival point. 163 | ''' 164 | grid.dtype = int 165 | h,w = dims 166 | 167 | assert difficulty == 'medium' or difficulty == 'hard' 168 | 169 | arrival_points, finish_points, road_dir, junction = get_add_mat(dims, grid, difficulty) 170 | 171 | n_turn1 = 3 # 0 - straight, 1-right, 2-left 172 | n_turn2 = 1 if difficulty == 'medium' else 3 173 | 174 | 175 | routes=[] 176 | # routes for each arrival point 177 | for i in range(len(arrival_points)): 178 | paths = [] 179 | # turn 1 180 | for turn_1 in range(n_turn1): 181 | # turn 2 182 | for turn_2 in range(n_turn2): 183 | total_turns = 0 184 | curr_turn = turn_1 185 | path = [] 186 | visited = set() 187 | current = arrival_points[i] 188 | path.append(current) 189 | start = current 190 | turn_step = 0 191 | # "start" 192 | while not goal_reached(i, current, finish_points): 193 | visited.add(current) 194 | current, turn_prog, turn_completed = next_move(current, curr_turn, turn_step, start, grid, road_dir, junction, visited) 195 | if curr_turn == 2 and turn_prog: 196 | turn_step +=1 197 | if turn_completed: 198 | total_turns += 1 199 | curr_turn = turn_2 200 | turn_step = 0 201 | start = current 202 | # keep going straight till the exit if 2 turns made already. 
203 | if total_turns == 2: 204 | curr_turn = 0 205 | path.append(current) 206 | paths.append(path) 207 | # early stopping, if first turn leads to exit 208 | if total_turns == 1: 209 | break 210 | routes.append(paths) 211 | return routes 212 | -------------------------------------------------------------------------------- /figures/dynamics_131.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hsvgbkhgbv/SQDDPG/33eef74b3cbe207a340d65a65d6ff5be34acc86e/figures/dynamics_131.pdf -------------------------------------------------------------------------------- /figures/dynamics_135.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hsvgbkhgbv/SQDDPG/33eef74b3cbe207a340d65a65d6ff5be34acc86e/figures/dynamics_135.pdf -------------------------------------------------------------------------------- /figures/dynamics_136.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hsvgbkhgbv/SQDDPG/33eef74b3cbe207a340d65a65d6ff5be34acc86e/figures/dynamics_136.pdf -------------------------------------------------------------------------------- /figures/dynamics_19.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hsvgbkhgbv/SQDDPG/33eef74b3cbe207a340d65a65d6ff5be34acc86e/figures/dynamics_19.pdf -------------------------------------------------------------------------------- /figures/dynamics_38.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hsvgbkhgbv/SQDDPG/33eef74b3cbe207a340d65a65d6ff5be34acc86e/figures/dynamics_38.png -------------------------------------------------------------------------------- /figures/easy_reward.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hsvgbkhgbv/SQDDPG/33eef74b3cbe207a340d65a65d6ff5be34acc86e/figures/easy_reward.pdf -------------------------------------------------------------------------------- /figures/easy_road.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hsvgbkhgbv/SQDDPG/33eef74b3cbe207a340d65a65d6ff5be34acc86e/figures/easy_road.pdf -------------------------------------------------------------------------------- /figures/easy_success.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hsvgbkhgbv/SQDDPG/33eef74b3cbe207a340d65a65d6ff5be34acc86e/figures/easy_success.pdf -------------------------------------------------------------------------------- /figures/hard_reward.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hsvgbkhgbv/SQDDPG/33eef74b3cbe207a340d65a65d6ff5be34acc86e/figures/hard_reward.pdf -------------------------------------------------------------------------------- /figures/hard_road.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hsvgbkhgbv/SQDDPG/33eef74b3cbe207a340d65a65d6ff5be34acc86e/figures/hard_road.pdf -------------------------------------------------------------------------------- /figures/hard_success.pdf: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/hsvgbkhgbv/SQDDPG/33eef74b3cbe207a340d65a65d6ff5be34acc86e/figures/hard_success.pdf -------------------------------------------------------------------------------- /figures/medium_reward.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hsvgbkhgbv/SQDDPG/33eef74b3cbe207a340d65a65d6ff5be34acc86e/figures/medium_reward.pdf -------------------------------------------------------------------------------- /figures/medium_road.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hsvgbkhgbv/SQDDPG/33eef74b3cbe207a340d65a65d6ff5be34acc86e/figures/medium_road.pdf -------------------------------------------------------------------------------- /figures/medium_success.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hsvgbkhgbv/SQDDPG/33eef74b3cbe207a340d65a65d6ff5be34acc86e/figures/medium_success.pdf -------------------------------------------------------------------------------- /figures/simple_spread_mean_reward.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hsvgbkhgbv/SQDDPG/33eef74b3cbe207a340d65a65d6ff5be34acc86e/figures/simple_spread_mean_reward.png -------------------------------------------------------------------------------- /figures/simple_tag_turn.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hsvgbkhgbv/SQDDPG/33eef74b3cbe207a340d65a65d6ff5be34acc86e/figures/simple_tag_turn.png -------------------------------------------------------------------------------- /figures/venn.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hsvgbkhgbv/SQDDPG/33eef74b3cbe207a340d65a65d6ff5be34acc86e/figures/venn.png -------------------------------------------------------------------------------- /learning_algorithms/actor_critic.py: -------------------------------------------------------------------------------- 1 | from learning_algorithms.rl_algorithms import * 2 | import torch 3 | from utilities.util import * 4 | 5 | 6 | 7 | class ActorCritic(ReinforcementLearning): 8 | 9 | def __init__(self, args): 10 | super(ActorCritic, self).__init__('Actor_Critic', args) 11 | 12 | def __call__(self, batch, behaviour_net): 13 | return self.get_loss(batch, behaviour_net) 14 | 15 | def get_loss(self, batch, behaviour_net, target_net=None): 16 | # TODO: fix policy params update 17 | batch_size = len(batch.state) 18 | n = self.args.agent_num 19 | # collect the transition data 20 | rewards, last_step, done, actions, state, next_state = behaviour_net.unpack_data(batch) 21 | # construct the computational graph 22 | action_out = behaviour_net.policy(state) 23 | values = behaviour_net.value(state, actions) 24 | if self.args.q_func: 25 | values = torch.sum(values*actions, dim=-1) 26 | values = values.contiguous().view(-1, n) 27 | if target_net == None: 28 | next_action_out = behaviour_net.policy(next_state) 29 | else: 30 | next_action_out = target_net.policy(next_state) 31 | next_actions = select_action(self.args, next_action_out, status='train') 32 | next_values = behaviour_net.value(next_state, next_actions) 33 | if self.args.q_func: 34 | next_values = torch.sum(next_values*next_actions, dim=-1) 35 | next_values = next_values.contiguous().view(-1, n) 36 | returns = 
cuda_wrapper(torch.zeros((batch_size, n), dtype=torch.float), self.cuda_) 37 | # calculate the advantages 38 | assert values.size() == next_values.size() 39 | assert returns.size() == values.size() 40 | for i in reversed(range(rewards.size(0))): 41 | if last_step[i]: 42 | next_return = 0 if done[i] else next_values[i].detach() 43 | else: 44 | next_return = next_values[i].detach() 45 | returns[i] = rewards[i] + self.args.gamma * next_return 46 | deltas = returns - values 47 | advantages = values.detach() 48 | # advantages = advantages.contiguous().view(-1, 1) 49 | if self.args.normalize_advantages: 50 | advantages = batchnorm(advantages) 51 | # construct the action loss and the value loss 52 | log_prob_a = multinomials_log_density(actions, action_out).contiguous().view(-1,n) 53 | assert log_prob_a.size() == advantages.size() 54 | action_loss = -advantages * log_prob_a 55 | action_loss = action_loss.mean(dim=0) 56 | value_loss = deltas.pow(2).mean(dim=0) 57 | return action_loss, value_loss, action_out 58 | -------------------------------------------------------------------------------- /learning_algorithms/ddpg.py: -------------------------------------------------------------------------------- 1 | from learning_algorithms.rl_algorithms import * 2 | from utilities.util import * 3 | 4 | 5 | 6 | class DDPG(ReinforcementLearning): 7 | 8 | def __init__(self, args): 9 | super(DDPG, self).__init__('DDPG', args) 10 | 11 | def __call__(self, batch, behaviour_net, target_net): 12 | return self.get_loss(batch, behaviour_net, target_net) 13 | 14 | def get_loss(self, batch, behaviour_net, target_net): 15 | # TODO: fix policy params update 16 | batch_size = len(batch.state) 17 | n = self.args.agent_num 18 | # collect the transition data 19 | rewards, last_step, done, actions, state, next_state = behaviour_net.unpack_data(batch) 20 | # construct the computational graph 21 | # do the argmax action on the action loss 22 | action_out = behaviour_net.policy(state) 23 | actions_ = select_action(self.args, action_out, status='train', exploration=False) 24 | values_ = behaviour_net.value(state, actions_).contiguous().view(-1, n) 25 | # do the exploration action on the value loss 26 | values = behaviour_net.value(state, actions).contiguous().view(-1, n) 27 | # do the argmax action on the next value loss 28 | next_action_out = target_net.policy(next_state) 29 | next_actions_ = select_action(self.args, next_action_out, status='train', exploration=False) 30 | next_values_ = target_net.value(next_state, next_actions_.detach()).contiguous().view(-1, n) 31 | returns = cuda_wrapper(torch.zeros((batch_size, n), dtype=torch.float), self.cuda_) 32 | assert values_.size() == next_values_.size() 33 | assert returns.size() == values.size() 34 | for i in reversed(range(rewards.size(0))): 35 | if last_step[i]: 36 | next_return = 0 if done[i] else next_values_[i].detach() 37 | else: 38 | next_return = next_values_[i].detach() 39 | returns[i] = rewards[i] + self.args.gamma * next_return 40 | deltas = returns - values 41 | advantages = values_ 42 | if self.args.normalize_advantages: 43 | advantages = batchnorm(advantages) 44 | action_loss = -advantages 45 | action_loss = action_loss.mean(dim=0) 46 | value_loss = deltas.pow(2).mean(dim=0) 47 | return action_loss, value_loss, action_out 48 | -------------------------------------------------------------------------------- /learning_algorithms/rl_algorithms.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import torch 3 | 
from torch import optim 4 | import torch.nn as nn 5 | from utilities.util import * 6 | 7 | 8 | 9 | class ReinforcementLearning(object): 10 | 11 | def __init__(self, name, args): 12 | self.name = name 13 | self.args = args 14 | self.cuda_ = torch.cuda.is_available() and self.args.cuda 15 | 16 | def __str__(self): 17 | print (self.name) 18 | 19 | def __call__(self): 20 | raise NotImplementedError() 21 | 22 | def get_loss(self): 23 | raise NotImplementedError() 24 | -------------------------------------------------------------------------------- /models/coma_fc.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import torch.nn as nn 3 | import numpy as np 4 | from utilities.util import * 5 | from models.model import Model 6 | from collections import namedtuple 7 | 8 | 9 | 10 | class COMAFC(Model): 11 | 12 | def __init__(self, args, target_net=None): 13 | super(COMAFC, self).__init__(args) 14 | self.construct_model() 15 | self.apply(self.init_weights) 16 | if target_net != None: 17 | self.target_net = target_net 18 | self.reload_params_to_target() 19 | self.Transition = namedtuple('Transition', ('state', 'action', 'reward', 'next_state', 'done', 'last_step')) 20 | 21 | def construct_policy_net(self): 22 | action_dicts = [] 23 | if self.args.shared_parameters: 24 | l1 = nn.Linear(self.obs_dim, self.hid_dim) 25 | l2 = nn.Linear(self.hid_dim, self.hid_dim) 26 | a = nn.Linear(self.hid_dim, self.act_dim) 27 | for i in range(self.n_): 28 | action_dicts.append(nn.ModuleDict( {'layer_1': l1,\ 29 | 'layer_2': l2,\ 30 | 'action_head': a 31 | } 32 | ) 33 | ) 34 | else: 35 | for i in range(self.n_): 36 | action_dicts.append(nn.ModuleDict( {'layer_1': nn.Linear(self.obs_dim, self.hid_dim),\ 37 | 'layer_2': nn.Linear(self.hid_dim, self.hid_dim),\ 38 | 'action_head': nn.Linear(self.hid_dim, self.act_dim) 39 | } 40 | ) 41 | ) 42 | self.action_dicts = nn.ModuleList(action_dicts) 43 | 44 | def construct_value_net(self): 45 | value_dicts = [] 46 | if self.args.shared_parameters: 47 | l1 = nn.Linear((self.n_+1)*self.obs_dim+(self.n_-1)*self.act_dim, self.hid_dim) 48 | l2 = nn.Linear(self.hid_dim, self.hid_dim) 49 | v = nn.Linear(self.hid_dim, self.act_dim) 50 | for i in range(self.n_): 51 | value_dicts.append(nn.ModuleDict( {'layer_1': l1,\ 52 | 'layer_2': l2,\ 53 | 'value_head': v 54 | } 55 | ) 56 | ) 57 | else: 58 | for i in range(self.n_): 59 | value_dicts.append(nn.ModuleDict( {'layer_1': nn.Linear((self.n_+1)*self.obs_dim+(self.n_-1)*self.act_dim, self.hid_dim),\ 60 | 'layer_2': nn.Linear(self.hid_dim, self.hid_dim),\ 61 | 'value_head': nn.Linear(self.hid_dim, self.act_dim) 62 | } 63 | ) 64 | ) 65 | self.value_dicts = nn.ModuleList(value_dicts) 66 | 67 | def construct_model(self): 68 | self.construct_value_net() 69 | self.construct_policy_net() 70 | 71 | def policy(self, obs, schedule=None, last_act=None, last_hid=None, info={}, stat={}): 72 | actions = [] 73 | for i in range(self.n_): 74 | h = torch.relu( self.action_dicts[i]['layer_1'](obs[:, i, :]) ) 75 | h = torch.relu( self.action_dicts[i]['layer_2'](h) ) 76 | a = self.action_dicts[i]['action_head'](h) 77 | actions.append(a) 78 | actions = torch.stack(actions, dim=1) 79 | return actions 80 | 81 | def value(self, obs, act): 82 | batch_size = obs.size(0) 83 | obs_own = obs.clone() 84 | obs = obs.unsqueeze(1).expand(batch_size, self.n_, self.n_, self.obs_dim) # shape = (b, n, o) -> (b, 1, n, o) -> (b, n, n, o) 85 | obs = obs.contiguous().view(batch_size, self.n_, -1) # shape = (b, n, o*n) 86 | inp = 
torch.cat((obs, obs_own), dim=-1) # shape = (b, n, o*n+o) 87 | values = [] 88 | for i in range(self.n_): 89 | # other people actions 90 | act_other = torch.cat((act[:,:i,:].view(batch_size,-1),act[:,i+1:,:].view(batch_size,-1)),dim=-1) 91 | h = torch.relu( self.value_dicts[i]['layer_1'](torch.cat((inp[:, i, :], act_other),dim=-1)) ) 92 | h = torch.relu( self.value_dicts[i]['layer_2'](h) ) 93 | v = self.value_dicts[i]['value_head'](h) 94 | values.append(v) 95 | values = torch.stack(values, dim=1) 96 | return values 97 | 98 | 99 | def get_loss(self, batch): 100 | batch_size = len(batch.state) 101 | rewards, last_step, done, actions, state, next_state = self.unpack_data(batch) 102 | action_out = self.policy(state) # (b,n,a) action probability 103 | values = self.value(state, actions) # (b,n,a) action value 104 | baselines = torch.sum(values*torch.softmax(action_out, dim=-1), dim=-1) # the only difference to ActorCritic is this baseline (b,n) 105 | values = torch.sum(values*actions, dim=-1) # (b,n) 106 | if self.args.target: 107 | next_action_out = self.target_net.policy(next_state, last_act=actions) 108 | else: 109 | next_action_out = self.policy(next_state, last_act=actions) 110 | next_actions = select_action(self.args, next_action_out, status='train', exploration=False) 111 | if self.args.target: 112 | next_values = self.target_net.value(next_state, next_actions) 113 | else: 114 | next_values = self.value(next_state, next_actions) 115 | next_values = torch.sum(next_values*next_actions, dim=-1) # b*n 116 | # calculate the advantages 117 | returns = cuda_wrapper(torch.zeros((batch_size, self.n_), dtype=torch.float), self.cuda_) 118 | assert values.size() == next_values.size() 119 | assert returns.size() == values.size() 120 | for i in reversed(range(rewards.size(0))): 121 | if last_step[i]: 122 | next_return = 0 if done[i] else next_values[i].detach() 123 | else: 124 | next_return = next_values[i].detach() 125 | returns[i] = rewards[i] + self.args.gamma * next_return 126 | # value loss 127 | deltas = returns - values 128 | value_loss = deltas.pow(2).mean(dim=0) 129 | # actio loss 130 | advantages = ( values - baselines ).detach() 131 | if self.args.normalize_advantages: 132 | advantages = batchnorm(advantages) 133 | log_prob = multinomials_log_density(actions, action_out).contiguous().view(-1, self.n_) 134 | assert log_prob.size() == advantages.size() 135 | action_loss = - advantages * log_prob 136 | action_loss = action_loss.mean(dim=0) 137 | return action_loss, value_loss, action_out 138 | -------------------------------------------------------------------------------- /models/independent_ac.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import torch.nn as nn 3 | import numpy as np 4 | from utilities.util import * 5 | from models.model import Model 6 | from collections import namedtuple 7 | from learning_algorithms.actor_critic import * 8 | 9 | 10 | 11 | class IndependentAC(Model): 12 | 13 | def __init__(self, args, target_net=None): 14 | super(IndependentAC, self).__init__(args) 15 | self.construct_model() 16 | self.apply(self.init_weights) 17 | if target_net != None: 18 | self.target_net = target_net 19 | self.reload_params_to_target() 20 | self.Transition = namedtuple('Transition', ('state', 'action', 'reward', 'next_state', 'done', 'last_step')) 21 | self.rl = ActorCritic(self.args) 22 | 23 | 24 | def construct_policy_net(self): 25 | # TODO: fix policy params update 26 | action_dicts = [] 27 | if self.args.shared_parameters: 28 
| l1 = nn.Linear(self.obs_dim, self.hid_dim) 29 | l2 = nn.Linear(self.hid_dim, self.hid_dim) 30 | a = nn.Linear(self.hid_dim, self.act_dim) 31 | for i in range(self.n_): 32 | action_dicts.append(nn.ModuleDict( {'layer_1': l1,\ 33 | 'layer_2': l2,\ 34 | 'action_head': a 35 | } 36 | ) 37 | ) 38 | else: 39 | for i in range(self.n_): 40 | action_dicts.append(nn.ModuleDict( {'layer_1': nn.Linear(self.obs_dim, self.hid_dim),\ 41 | 'layer_2': nn.Linear(self.hid_dim, self.hid_dim),\ 42 | 'action_head': nn.Linear(self.hid_dim, self.act_dim) 43 | } 44 | ) 45 | ) 46 | self.action_dicts = nn.ModuleList(action_dicts) 47 | 48 | def construct_value_net(self): 49 | # TODO: policy params update 50 | value_dicts = [] 51 | if self.args.shared_parameters: 52 | l1 = nn.Linear(self.obs_dim, self.hid_dim ) 53 | l2 = nn.Linear(self.hid_dim, self.hid_dim) 54 | v = nn.Linear(self.hid_dim, self.act_dim) 55 | for i in range(self.n_): 56 | value_dicts.append(nn.ModuleDict( {'layer_1': l1,\ 57 | 'layer_2': l2,\ 58 | 'value_head': v 59 | } 60 | ) 61 | ) 62 | else: 63 | for i in range(self.n_): 64 | value_dicts.append(nn.ModuleDict( {'layer_1': nn.Linear(self.obs_dim, self.hid_dim ),\ 65 | 'layer_2': nn.Linear(self.hid_dim, self.hid_dim),\ 66 | 'value_head': nn.Linear(self.hid_dim, self.act_dim) 67 | } 68 | ) 69 | ) 70 | self.value_dicts = nn.ModuleList(value_dicts) 71 | 72 | def construct_model(self): 73 | self.construct_value_net() 74 | self.construct_policy_net() 75 | 76 | def policy(self, obs, schedule=None, last_act=None, last_hid=None, info={}, stat={}): 77 | # TODO: policy params update 78 | actions = [] 79 | for i in range(self.n_): 80 | h = torch.relu( self.action_dicts[i]['layer_1'](obs[:, i, :]) ) 81 | h = torch.relu( self.action_dicts[i]['layer_2'](h) ) 82 | a = self.action_dicts[i]['action_head'](h) 83 | actions.append(a) 84 | actions = torch.stack(actions, dim=1) 85 | return actions 86 | 87 | 88 | def value(self, obs, act=None): 89 | # TODO: policy params update 90 | values = [] 91 | for i in range(self.n_): 92 | h = torch.relu( self.value_dicts[i]['layer_1'](obs[:,i,:]) ) 93 | h = torch.relu( self.value_dicts[i]['layer_2'](h) ) 94 | v = self.value_dicts[i]['value_head'](h) 95 | values.append(v) 96 | values = torch.stack(values, dim=1) 97 | return values 98 | 99 | def get_loss(self, batch): 100 | action_loss, value_loss, log_p_a = self.rl.get_loss(batch, self, self.target_net) 101 | return action_loss, value_loss, log_p_a 102 | -------------------------------------------------------------------------------- /models/independent_ddpg.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import torch.nn as nn 3 | import numpy as np 4 | from utilities.util import * 5 | from models.model import Model 6 | from learning_algorithms.ddpg import * 7 | from collections import namedtuple 8 | 9 | 10 | 11 | class IndependentDDPG(Model): 12 | 13 | def __init__(self, args, target_net=None): 14 | super(IndependentDDPG, self).__init__(args) 15 | self.construct_model() 16 | self.apply(self.init_weights) 17 | if target_net != None: 18 | self.target_net = target_net 19 | self.reload_params_to_target() 20 | self.Transition = namedtuple('Transition', ('state', 'action', 'reward', 'next_state', 'done', 'last_step')) 21 | self.rl = DDPG(self.args) 22 | 23 | def construct_policy_net(self): 24 | action_dicts = [] 25 | if self.args.shared_parameters: 26 | l1 = nn.Linear(self.obs_dim, self.hid_dim) 27 | l2 = nn.Linear(self.hid_dim, self.hid_dim) 28 | a = nn.Linear(self.hid_dim, 
self.act_dim) 29 | for i in range(self.n_): 30 | action_dicts.append(nn.ModuleDict( {'layer_1': l1,\ 31 | 'layer_2': l2,\ 32 | 'action_head': a 33 | } 34 | ) 35 | ) 36 | else: 37 | for i in range(self.n_): 38 | action_dicts.append(nn.ModuleDict( {'layer_1': nn.Linear(self.obs_dim, self.hid_dim),\ 39 | 'layer_2': nn.Linear(self.hid_dim, self.hid_dim),\ 40 | 'action_head': nn.Linear(self.hid_dim, self.act_dim) 41 | } 42 | ) 43 | ) 44 | self.action_dicts = nn.ModuleList(action_dicts) 45 | 46 | def construct_value_net(self): 47 | value_dicts = [] 48 | if self.args.shared_parameters: 49 | l1 = nn.Linear(self.obs_dim+self.act_dim, self.hid_dim ) 50 | l2 = nn.Linear(self.hid_dim, self.hid_dim) 51 | v = nn.Linear(self.hid_dim, 1) 52 | for i in range(self.n_): 53 | value_dicts.append(nn.ModuleDict( {'layer_1': l1,\ 54 | 'layer_2': l2,\ 55 | 'value_head': v 56 | } 57 | ) 58 | ) 59 | else: 60 | for i in range(self.n_): 61 | value_dicts.append(nn.ModuleDict( {'layer_1': nn.Linear(self.obs_dim+self.act_dim, self.hid_dim ),\ 62 | 'layer_2': nn.Linear(self.hid_dim, self.hid_dim),\ 63 | 'value_head': nn.Linear(self.hid_dim, 1) 64 | } 65 | ) 66 | ) 67 | self.value_dicts = nn.ModuleList(value_dicts) 68 | 69 | def construct_model(self): 70 | self.construct_value_net() 71 | self.construct_policy_net() 72 | 73 | def policy(self, obs, schedule=None, last_act=None, last_hid=None, info={}, stat={}): 74 | actions = [] 75 | for i in range(self.n_): 76 | h = torch.relu( self.action_dicts[i]['layer_1'](obs[:, i, :]) ) 77 | h = torch.relu( self.action_dicts[i]['layer_2'](h) ) 78 | a = self.action_dicts[i]['action_head'](h) 79 | actions.append(a) 80 | actions = torch.stack(actions, dim=1) 81 | return actions 82 | 83 | def value(self, obs, act): 84 | values = [] 85 | for i in range(self.n_): 86 | h = torch.relu( self.value_dicts[i]['layer_1']( torch.cat((obs[:,i,:],act[:,i,:]),dim=-1))) 87 | h = torch.relu( self.value_dicts[i]['layer_2'](h) ) 88 | v = self.value_dicts[i]['value_head'](h) 89 | values.append(v) 90 | values = torch.stack(values, dim=1) 91 | return values 92 | 93 | def get_loss(self, batch): 94 | action_loss, value_loss, log_p_a = self.rl.get_loss(batch, self, self.target_net) 95 | return action_loss, value_loss, log_p_a 96 | -------------------------------------------------------------------------------- /models/maddpg.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import torch.nn as nn 3 | import numpy as np 4 | from utilities.util import * 5 | from models.model import Model 6 | from learning_algorithms.ddpg import * 7 | from collections import namedtuple 8 | 9 | 10 | 11 | class MADDPG(Model): 12 | 13 | def __init__(self, args, target_net=None): 14 | super(MADDPG, self).__init__(args) 15 | self.construct_model() 16 | self.apply(self.init_weights) 17 | if target_net != None: 18 | self.target_net = target_net 19 | self.reload_params_to_target() 20 | self.Transition = namedtuple('Transition', ('state', 'action', 'reward', 'next_state', 'done', 'last_step')) 21 | 22 | def construct_policy_net(self): 23 | # TODO: fix policy params update 24 | action_dicts = [] 25 | if self.args.shared_parameters: 26 | l1 = nn.Linear(self.obs_dim, self.hid_dim) 27 | l2 = nn.Linear(self.hid_dim, self.hid_dim) 28 | a = nn.Linear(self.hid_dim, self.act_dim) 29 | for i in range(self.n_): 30 | action_dicts.append(nn.ModuleDict( {'layer_1': l1,\ 31 | 'layer_2': l2,\ 32 | 'action_head': a 33 | } 34 | ) 35 | ) 36 | else: 37 | for i in range(self.n_): 38 | 
action_dicts.append(nn.ModuleDict( {'layer_1': nn.Linear(self.obs_dim, self.hid_dim),\ 39 | 'layer_2': nn.Linear(self.hid_dim, self.hid_dim),\ 40 | 'action_head': nn.Linear(self.hid_dim, self.act_dim) 41 | } 42 | ) 43 | ) 44 | self.action_dicts = nn.ModuleList(action_dicts) 45 | 46 | def construct_value_net(self): 47 | # TODO: policy params update 48 | value_dicts = [] 49 | if self.args.shared_parameters: 50 | l1 = nn.Linear( (self.obs_dim+self.act_dim)*self.n_, self.hid_dim ) 51 | l2 = nn.Linear(self.hid_dim, self.hid_dim) 52 | v = nn.Linear(self.hid_dim, 1) 53 | for i in range(self.n_): 54 | value_dicts.append(nn.ModuleDict( {'layer_1': l1,\ 55 | 'layer_2': l2,\ 56 | 'value_head': v 57 | } 58 | ) 59 | ) 60 | else: 61 | for i in range(self.n_): 62 | value_dicts.append(nn.ModuleDict( {'layer_1': nn.Linear( (self.obs_dim+self.act_dim)*self.n_, self.hid_dim ),\ 63 | 'layer_2': nn.Linear(self.hid_dim, self.hid_dim),\ 64 | 'value_head': nn.Linear(self.hid_dim, 1) 65 | } 66 | ) 67 | ) 68 | self.value_dicts = nn.ModuleList(value_dicts) 69 | 70 | def construct_model(self): 71 | self.construct_value_net() 72 | self.construct_policy_net() 73 | 74 | def policy(self, obs, schedule=None, last_act=None, last_hid=None, info={}, stat={}): 75 | # TODO: policy params update 76 | actions = [] 77 | for i in range(self.n_): 78 | h = torch.relu( self.action_dicts[i]['layer_1'](obs[:, i, :]) ) 79 | h = torch.relu( self.action_dicts[i]['layer_2'](h) ) 80 | a = self.action_dicts[i]['action_head'](h) 81 | actions.append(a) 82 | actions = torch.stack(actions, dim=1) 83 | return actions 84 | 85 | def value(self, obs, act): 86 | # TODO: policy params update 87 | values = [] 88 | for i in range(self.n_): 89 | h = torch.relu( self.value_dicts[i]['layer_1']( torch.cat( ( obs.contiguous().view( -1, np.prod(obs.size()[1:]) ), act.contiguous().view( -1, np.prod(act.size()[1:]) ) ), dim=-1 ) ) ) 90 | h = torch.relu( self.value_dicts[i]['layer_2'](h) ) 91 | v = self.value_dicts[i]['value_head'](h) 92 | values.append(v) 93 | values = torch.stack(values, dim=1) 94 | return values 95 | 96 | def get_loss(self, batch): 97 | # TODO: fix policy params update 98 | batch_size = len(batch.state) 99 | # collect the transition data 100 | rewards, last_step, done, actions, state, next_state = self.unpack_data(batch) 101 | # construct the computational graph 102 | # do the argmax action on the action loss 103 | action_out = self.policy(state) 104 | actions_ = select_action(self.args, action_out, status='train', exploration=False) 105 | values_ = self.value(state, actions_).contiguous().view(-1, self.n_) 106 | # do the exploration action on the value loss 107 | values = self.value(state, actions).contiguous().view(-1, self.n_) 108 | # do the argmax action on the next value loss 109 | next_action_out = self.target_net.policy(next_state) 110 | next_actions_ = select_action(self.args, next_action_out, status='train', exploration=False) 111 | next_values_ = self.target_net.value(next_state, next_actions_.detach()).contiguous().view(-1, self.n_) 112 | returns = cuda_wrapper(torch.zeros((batch_size, self.n_), dtype=torch.float), self.cuda_) 113 | assert values_.size() == next_values_.size() 114 | assert returns.size() == values.size() 115 | for i in reversed(range(rewards.size(0))): 116 | if last_step[i]: 117 | next_return = 0 if done[i] else next_values_[i].detach() 118 | else: 119 | next_return = next_values_[i].detach() 120 | returns[i] = rewards[i] + self.args.gamma * next_return 121 | deltas = returns - values 122 | advantages = values_ 123 
| # advantages = advantages.contiguous().view(-1, 1) 124 | if self.args.normalize_advantages: 125 | advantages = batchnorm(advantages) 126 | action_loss = -advantages 127 | action_loss = action_loss.mean(dim=0) 128 | value_loss = deltas.pow(2).mean(dim=0) 129 | return action_loss, value_loss, action_out 130 | -------------------------------------------------------------------------------- /models/model.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import torch.nn as nn 3 | import numpy as np 4 | from utilities.util import * 5 | 6 | 7 | 8 | class Model(nn.Module): 9 | 10 | def __init__(self, args): 11 | super(Model, self).__init__() 12 | self.args = args 13 | self.cuda_ = torch.cuda.is_available() and self.args.cuda 14 | self.n_ = self.args.agent_num 15 | self.hid_dim = self.args.hid_size 16 | self.obs_dim = self.args.obs_size 17 | self.act_dim = self.args.action_dim 18 | 19 | def reload_params_to_target(self): 20 | self.target_net.action_dicts.load_state_dict( self.action_dicts.state_dict() ) 21 | self.target_net.value_dicts.load_state_dict( self.value_dicts.state_dict() ) 22 | 23 | def update_target(self): 24 | for name, param in self.target_net.action_dicts.state_dict().items(): 25 | update_params = (1 - self.args.target_lr) * param + self.args.target_lr * self.action_dicts.state_dict()[name] 26 | self.target_net.action_dicts.state_dict()[name].copy_(update_params) 27 | for name, param in self.target_net.value_dicts.state_dict().items(): 28 | update_params = (1 - self.args.target_lr) * param + self.args.target_lr * self.value_dicts.state_dict()[name] 29 | self.target_net.value_dicts.state_dict()[name].copy_(update_params) 30 | 31 | def transition_update(self, trainer, trans, stat): 32 | if self.args.replay: 33 | trainer.replay_buffer.add_experience(trans) 34 | replay_cond = trainer.steps>self.args.replay_warmup\ 35 | and len(trainer.replay_buffer.buffer)>=self.args.batch_size\ 36 | and trainer.steps%self.args.behaviour_update_freq==0 37 | if replay_cond: 38 | for _ in range(self.args.critic_update_times): 39 | trainer.value_replay_process(stat) 40 | trainer.action_replay_process(stat) 41 | # TODO: hard code 42 | # clear replay buffer for on policy algorithm 43 | if self.__class__.__name__ in ["COMAFC","MFAC","IndependentAC"] : 44 | trainer.replay_buffer.clear() 45 | else: 46 | trans_cond = trainer.steps%self.args.behaviour_update_freq==0 47 | if trans_cond: 48 | for _ in range(self.args.critic_update_times): 49 | trainer.value_replay_process(stat) 50 | trainer.action_transition_process(stat, trans) 51 | if self.args.target: 52 | target_cond = trainer.steps%self.args.target_update_freq==0 53 | if target_cond: 54 | self.update_target() 55 | 56 | def episode_update(self, trainer, episode, stat): 57 | if self.args.replay: 58 | trainer.replay_buffer.add_experience(episode) 59 | replay_cond = trainer.episodes>self.args.replay_warmup\ 60 | and len(trainer.replay_buffer.buffer)>=self.args.batch_size\ 61 | and trainer.episodes%self.args.behaviour_update_freq==0 62 | if replay_cond: 63 | for _ in range(self.args.critic_update_times): 64 | trainer.value_replay_process(stat) 65 | trainer.action_replay_process(stat) 66 | else: 67 | episode = self.Transition(*zip(*episode)) 68 | episode_cond = trainer.episodes%self.args.behaviour_update_freq==0 69 | if episode_cond: 70 | for _ in range(self.args.critic_update_times): 71 | trainer.value_replay_process(stat) 72 | trainer.action_transition_process(stat) 73 | 74 | def construct_model(self): 75 | 
raise NotImplementedError() 76 | 77 | def get_agent_mask(self, batch_size, info): 78 | ''' 79 | define the getter of agent mask to confirm the living agent 80 | ''' 81 | if 'alive_mask' in info: 82 | agent_mask = torch.from_numpy(info['alive_mask']) 83 | num_agents_alive = agent_mask.sum() 84 | else: 85 | agent_mask = torch.ones(self.n_) 86 | num_agents_alive = self.n_ 87 | # shape = (1, 1, n) 88 | agent_mask = agent_mask.view(1, 1, self.n_) 89 | # shape = (batch_size, n ,n, 1) 90 | agent_mask = cuda_wrapper(agent_mask.expand(batch_size, self.n_, self.n_).unsqueeze(-1), self.cuda_) 91 | return num_agents_alive, agent_mask 92 | 93 | def policy(self, obs, last_act=None, last_hid=None, gate=None, info={}, stat={}): 94 | raise NotImplementedError() 95 | 96 | def value(self, obs, act): 97 | raise NotImplementedError() 98 | 99 | def construct_policy_net(self): 100 | raise NotImplementedError() 101 | 102 | def construct_value_net(self): 103 | raise NotImplementedError() 104 | 105 | def init_weights(self, m): 106 | ''' 107 | initialize the weights of parameters 108 | ''' 109 | if type(m) == nn.Linear: 110 | m.weight.data.normal_(0, self.args.init_std) 111 | 112 | def get_loss(self): 113 | raise NotImplementedError() 114 | 115 | def credit_assignment_demo(self, obs, act): 116 | assert isinstance(obs, np.ndarray) 117 | assert isinstance(act, np.ndarray) 118 | obs = cuda_wrapper(torch.tensor(obs).float(), self.cuda_) 119 | act = cuda_wrapper(torch.tensor(act).float(), self.cuda_) 120 | values = self.value(obs, act) 121 | return values 122 | 123 | 124 | def train_process(self, stat, trainer): 125 | info = {} 126 | state = trainer.env.reset() 127 | if self.args.reward_record_type == 'episode_mean_step': 128 | trainer.mean_reward = 0 129 | trainer.mean_success = 0 130 | 131 | for t in range(self.args.max_steps): 132 | state_ = cuda_wrapper(prep_obs(state).contiguous().view(1, self.n_, self.obs_dim), self.cuda_) 133 | start_step = True if t == 0 else False 134 | state_ = cuda_wrapper(prep_obs(state).contiguous().view(1, self.n_, self.obs_dim), self.cuda_) 135 | action_out = self.policy(state_, info=info, stat=stat) 136 | action = select_action(self.args, action_out, status='train', info=info) 137 | _, actual = translate_action(self.args, action, trainer.env) 138 | next_state, reward, done, debug = trainer.env.step(actual) 139 | if isinstance(done, list): done = np.sum(done) 140 | done_ = done or t==self.args.max_steps-1 141 | trans = self.Transition(state, 142 | action.cpu().numpy(), 143 | np.array(reward), 144 | next_state, 145 | done, 146 | done_ 147 | ) 148 | self.transition_update(trainer, trans, stat) 149 | success = debug['success'] if 'success' in debug else 0.0 150 | trainer.steps += 1 151 | if self.args.reward_record_type == 'mean_step': 152 | trainer.mean_reward = trainer.mean_reward + 1/trainer.steps*(np.mean(reward) - trainer.mean_reward) 153 | trainer.mean_success = trainer.mean_success + 1/trainer.steps*(success - trainer.mean_success) 154 | elif self.args.reward_record_type == 'episode_mean_step': 155 | trainer.mean_reward = trainer.mean_reward + 1/(t+1)*(np.mean(reward) - trainer.mean_reward) 156 | trainer.mean_success = trainer.mean_success + 1/(t+1)*(success - trainer.mean_success) 157 | else: 158 | raise RuntimeError('Please enter a correct reward record type, e.g. 
mean_step or episode_mean_step.') 159 | stat['mean_reward'] = trainer.mean_reward 160 | stat['mean_success'] = trainer.mean_success 161 | if done_: 162 | break 163 | state = next_state 164 | stat['turn'] = t + 1 165 | trainer.episodes += 1 166 | 167 | 168 | def unpack_data(self, batch): 169 | batch_size = len(batch.state) 170 | rewards = cuda_wrapper(torch.tensor(batch.reward, dtype=torch.float), self.cuda_) 171 | last_step = cuda_wrapper(torch.tensor(batch.last_step, dtype=torch.float).contiguous().view(-1, 1), self.cuda_) 172 | done = cuda_wrapper(torch.tensor(batch.done, dtype=torch.float).contiguous().view(-1, 1), self.cuda_) 173 | actions = cuda_wrapper(torch.tensor(np.stack(list(zip(*batch.action))[0], axis=0), dtype=torch.float), self.cuda_) 174 | state = cuda_wrapper(prep_obs(list(zip(batch.state))), self.cuda_) 175 | next_state = cuda_wrapper(prep_obs(list(zip(batch.next_state))), self.cuda_) 176 | return (rewards, last_step, done, actions, state, next_state) 177 | -------------------------------------------------------------------------------- /models/random.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import torch.nn as nn 3 | import numpy as np 4 | from utilities.util import * 5 | from models.model import Model 6 | from learning_algorithms.ddpg import * 7 | from collections import namedtuple 8 | 9 | 10 | 11 | class RandomAgent(Model): 12 | 13 | def __init__(self, args): 14 | super(RandomAgent, self).__init__(args) 15 | self.args = args 16 | 17 | def policy(self, obs, schedule=None, last_act=None, last_hid=None, info={}, stat={}): 18 | actions = [] 19 | tensor = torch.cuda.FloatTensor if self.args.cuda else torch.FloatTensor 20 | actions = tensor([[1.0]*self.act_dim]*self.n_) 21 | return actions 22 | -------------------------------------------------------------------------------- /test.py: -------------------------------------------------------------------------------- 1 | import torch 2 | from utilities.tester import * 3 | from arguments import * 4 | import argparse 5 | 6 | 7 | parser = argparse.ArgumentParser(description='Test rl agent.') 8 | parser.add_argument('--save-model-dir', type=str, nargs='?', default='./model_save/', help='Please input the directory of saving model.') 9 | parser.add_argument('--render', action='store_true', help='Please input the flag to control the render.') 10 | parser.add_argument('--episodes', type=int, default=10, help='Please input the number of test episodes') 11 | 12 | argv = parser.parse_args() 13 | 14 | model = Model[model_name] 15 | 16 | strategy = Strategy[model_name] 17 | 18 | if argv.save_model_dir[-1] is '/': 19 | save_path = argv.save_model_dir 20 | else: 21 | save_path = argv.save_model_dir+'/' 22 | 23 | PATH = save_path + log_name + '/model.pt' 24 | 25 | if args.target: 26 | target_net = model(args) 27 | behaviour_net = model(args, target_net) 28 | else: 29 | behaviour_net = model(args) 30 | 31 | checkpoint = torch.load(PATH, map_location='cpu') if not args.cuda else torch.load(PATH) 32 | behaviour_net.load_state_dict(checkpoint['model_state_dict']) 33 | 34 | if strategy == 'pg': 35 | test = PGTester(env(), behaviour_net, args) 36 | elif strategy == 'q': 37 | raise NotImplementedError('This needs to be implemented.') 38 | else: 39 | raise RuntimeError('Please input the correct strategy, e.g. 
pg or q.') 40 | 41 | print(args) 42 | test.run_game(episodes=argv.episodes, render=argv.render) 43 | test.print_info() 44 | -------------------------------------------------------------------------------- /test.sh: -------------------------------------------------------------------------------- 1 | # !/bin/bash 2 | 3 | EXP_NAME="spread_sqddpg" 4 | 5 | cp ./args/$EXP_NAME.py arguments.py 6 | python -u test.py 7 | -------------------------------------------------------------------------------- /train.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | from utilities.trainer import * 3 | import torch 4 | from arguments import * 5 | import os 6 | from utilities.util import * 7 | from utilities.logger import Logger 8 | import argparse 9 | 10 | 11 | 12 | parser = argparse.ArgumentParser(description='Test rl agent.') 13 | parser.add_argument('--save-path', type=str, nargs='?', default='./', help='Please input the directory of saving model.') 14 | argv = parser.parse_args() 15 | 16 | 17 | 18 | if argv.save_path[-1] is '/': 19 | save_path = argv.save_path 20 | else: 21 | save_path = argv.save_path+'/' 22 | 23 | # create save folders 24 | if 'model_save' not in os.listdir(save_path): 25 | os.mkdir(save_path+'model_save') 26 | if 'tensorboard' not in os.listdir(save_path): 27 | os.mkdir(save_path+'tensorboard') 28 | if log_name not in os.listdir(save_path+'model_save/'): 29 | os.mkdir(save_path+'model_save/'+log_name) 30 | if log_name not in os.listdir(save_path+'tensorboard/'): 31 | os.mkdir(save_path+'tensorboard/'+log_name) 32 | else: 33 | path = save_path+'tensorboard/'+log_name 34 | for f in os.listdir(path): 35 | file_path = os.path.join(path,f) 36 | if os.path.isfile(file_path): 37 | os.remove(file_path) 38 | 39 | logger = Logger(save_path+'tensorboard/' + log_name) 40 | 41 | model = Model[model_name] 42 | 43 | strategy = Strategy[model_name] 44 | 45 | print ( '{}\n'.format(args) ) 46 | 47 | if strategy == 'pg': 48 | train = PGTrainer(args, model, env(), logger, args.online) 49 | elif strategy == 'q': 50 | raise NotImplementedError('This needs to be implemented.') 51 | else: 52 | raise RuntimeError('Please input the correct strategy, e.g. pg or q.') 53 | 54 | stat = dict() 55 | 56 | for i in range(args.train_episodes_num): 57 | train.run(stat) 58 | train.logging(stat) 59 | if i%args.save_model_freq == args.save_model_freq-1: 60 | train.print_info(stat) 61 | torch.save({'model_state_dict': train.behaviour_net.state_dict()}, save_path+'model_save/'+log_name+'/model.pt') 62 | print ('The model is saved!\n') 63 | with open(save_path+'model_save/'+log_name +'/log.txt', 'w+') as file: 64 | file.write(str(args)+'\n') 65 | file.write(str(i)) 66 | -------------------------------------------------------------------------------- /train.sh: -------------------------------------------------------------------------------- 1 | # !/bin/bash 2 | # sh train.sh 3 | 4 | EXP_NAME="simple_tag_sqddpg" 5 | ALIAS="" 6 | export CUDA_DEVICE_ORDER=PCI_BUS_ID 7 | export CUDA_VISIBLE_DEVICES=0 8 | 9 | if [ ! -d "./model_save" ] 10 | then 11 | mkdir ./model_save 12 | fi 13 | 14 | mkdir ./model_save/$EXP_NAME$ALIAS 15 | cp ./args/$EXP_NAME.py arguments.py 16 | python -u train.py > ./model_save/$EXP_NAME$ALIAS/exp.out & 17 | echo $! 
> ./model_save/$EXP_NAME$ALIAS/exp.pid 18 | -------------------------------------------------------------------------------- /utilities/gym_wrapper.py: -------------------------------------------------------------------------------- 1 | from gym import spaces 2 | 3 | 4 | class GymWrapper(object): 5 | 6 | def __init__(self, env): 7 | self.env = env 8 | self.obs_space = self.env.observation_space 9 | self.act_space = self.env.action_space 10 | 11 | def __call__(self): 12 | return self.env 13 | 14 | def get_num_of_agents(self): 15 | return self.env.n 16 | 17 | def get_shape_of_obs(self): 18 | obs_shapes = [] 19 | for obs in self.obs_space: 20 | if isinstance(obs, spaces.Box): 21 | obs_shapes.append(obs.shape) 22 | assert len(self.obs_space) == len(obs_shapes) 23 | return obs_shapes 24 | 25 | def get_output_shape_of_act(self): 26 | act_shapes = [] 27 | for act in self.act_space: 28 | if isinstance(act, spaces.Discrete): 29 | act_shapes.append(act.n) 30 | elif isinstance(act, spaces.MultiDiscrete): 31 | act_shapes.append(act.high - act.low + 1) 32 | elif isinstance(act, spaces.Box): 33 | assert len(act.shape) == 1 34 | act_shapes.append(act.shape) 35 | return act_shapes 36 | 37 | def get_dtype_of_obs(self): 38 | return [obs.dtype for obs in self.obs_space] 39 | 40 | def get_input_shape_of_act(self): 41 | act_shapes = [] 42 | for act in self.act_space: 43 | if isinstance(act, spaces.Discrete): 44 | act_shapes.append(act.n) 45 | elif isinstance(act, spaces.MultiDiscrete): 46 | act_shapes.append(act.shape) 47 | elif isinstance(act, spaces.Box): 48 | assert len(act.shape) == 1 49 | act_shapes.append(act.shape) 50 | return act_shapes 51 | -------------------------------------------------------------------------------- /utilities/inspector.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | 3 | 4 | 5 | def inspector(args): 6 | if args.model_name == 'maddpg': 7 | assert args.replay is True 8 | assert args.q_func is True 9 | assert args.target is True 10 | assert args.gumbel_softmax is True 11 | assert args.epsilon_softmax is False 12 | assert args.online is True 13 | elif args.model_name == 'independent_ac': 14 | assert args.replay is True 15 | assert args.q_func is True 16 | assert args.target is True 17 | assert args.online is True 18 | assert args.gumbel_softmax is False 19 | assert args.epsilon_softmax is False 20 | elif args.model_name == 'independent_ddpg': 21 | assert args.replay is True 22 | assert args.q_func is False 23 | assert args.target is True 24 | assert args.online is True 25 | assert args.gumbel_softmax is True 26 | assert args.epsilon_softmax is False 27 | elif args.model_name == 'sqddpg': 28 | assert args.replay is True 29 | assert args.q_func is True 30 | assert args.target is True 31 | assert args.gumbel_softmax is True 32 | assert args.epsilon_softmax is False 33 | assert args.online is True 34 | assert hasattr(args, 'sample_size') 35 | elif args.model_name == 'coma_fc': 36 | assert args.replay is True 37 | assert args.q_func is True 38 | assert args.target is True 39 | assert args.online is True 40 | assert args.continuous is False 41 | assert args.gumbel_softmax is False 42 | assert args.epsilon_softmax is False 43 | else: 44 | raise NotImplementedError('The model is not added!') 45 | -------------------------------------------------------------------------------- /utilities/logger.py: -------------------------------------------------------------------------------- 1 | # Code referenced from 
https://gist.github.com/gyglim/1f8dfb1b5c82627ae3efcfbbadb9f514 2 | import tensorflow as tf 3 | import numpy as np 4 | import scipy.misc 5 | try: 6 | from StringIO import StringIO # Python 2.7 7 | except ImportError: 8 | from io import BytesIO # Python 3.x 9 | 10 | 11 | class Logger(object): 12 | 13 | def __init__(self, log_dir): 14 | """Create a summary writer logging to log_dir.""" 15 | self.writer = tf.summary.FileWriter(log_dir) 16 | 17 | def scalar_summary(self, tag, value, step): 18 | """Log a scalar variable.""" 19 | summary = tf.Summary(value=[tf.Summary.Value(tag=tag, simple_value=value)]) 20 | self.writer.add_summary(summary, step) 21 | 22 | def image_summary(self, tag, images, step): 23 | """Log a list of images.""" 24 | 25 | img_summaries = [] 26 | for i, img in enumerate(images): 27 | # Write the image to a string 28 | try: 29 | s = StringIO() 30 | except: 31 | s = BytesIO() 32 | scipy.misc.toimage(img).save(s, format="png") 33 | 34 | # Create an Image object 35 | img_sum = tf.Summary.Image(encoded_image_string=s.getvalue(), 36 | height=img.shape[0], 37 | width=img.shape[1]) 38 | # Create a Summary value 39 | img_summaries.append(tf.Summary.Value(tag='%s/%d' % (tag, i), image=img_sum)) 40 | 41 | # Create and write Summary 42 | summary = tf.Summary(value=img_summaries) 43 | self.writer.add_summary(summary, step) 44 | 45 | def hist_summary(self, tag, values, step, bins=1000): 46 | """Log a histogram of the tensor of values.""" 47 | 48 | # Create a histogram using numpy 49 | counts, bin_edges = np.histogram(values, bins=bins) 50 | 51 | # Fill the fields of the histogram proto 52 | hist = tf.HistogramProto() 53 | hist.min = float(np.min(values)) 54 | hist.max = float(np.max(values)) 55 | hist.num = int(np.prod(values.shape)) 56 | hist.sum = float(np.sum(values)) 57 | hist.sum_squares = float(np.sum(values**2)) 58 | 59 | # Drop the start of the first bin 60 | bin_edges = bin_edges[1:] 61 | 62 | # Add bin edges and counts 63 | for edge in bin_edges: 64 | hist.bucket_limit.append(edge) 65 | for c in counts: 66 | hist.bucket.append(c) 67 | 68 | # Create and write Summary 69 | summary = tf.Summary(value=[tf.Summary.Value(tag=tag, histo=hist)]) 70 | self.writer.add_summary(summary, step) 71 | self.writer.flush() 72 | -------------------------------------------------------------------------------- /utilities/replay_buffer.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | 3 | 4 | class TransReplayBuffer(object): 5 | 6 | def __init__(self, size): 7 | self.size = size 8 | self.buffer = [] 9 | 10 | def get_single(self, index): 11 | return self.buffer[index] 12 | 13 | def offset(self): 14 | self.buffer.pop(0) 15 | 16 | def get_batch(self, batch_size): 17 | length = len(self.buffer) 18 | indices = np.random.choice(length, batch_size, replace=False) 19 | batch_buffer = [self.buffer[i] for i in indices] 20 | return batch_buffer 21 | 22 | def add_experience(self, trans): 23 | est_len = 1 + len(self.buffer) 24 | if est_len > self.size: 25 | self.offset() 26 | self.buffer.append(trans) 27 | 28 | def clear(self): 29 | self.buffer = [] 30 | 31 | 32 | 33 | class EpisodeReplayBuffer(object): 34 | 35 | def __init__(self, size): 36 | self.size = size 37 | self.buffer = [] 38 | 39 | def get_single(self, index): 40 | return self.buffer[index] 41 | 42 | def offset(self): 43 | self.buffer.pop(0) 44 | 45 | def get_batch(self, batch_size): 46 | length = len(self.buffer) 47 | indices = np.random.choice(length, batch_size, replace=False) 48 | 
batch_buffer = [] 49 | for i in indices: 50 | batch_buffer.extend(self.buffer[i]) 51 | return batch_buffer 52 | 53 | def add_experience(self, episode): 54 | est_len = 1 + len(self.buffer) 55 | if est_len > self.size: 56 | self.offset() 57 | self.buffer.append(episode) 58 | -------------------------------------------------------------------------------- /utilities/tester.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import torch 3 | from utilities.util import * 4 | import time 5 | import signal 6 | import sys 7 | 8 | 9 | 10 | class PGTester(object): 11 | 12 | def __init__(self, env, behaviour_net, args): 13 | self.env = env 14 | self.behaviour_net = behaviour_net.cuda().eval() if args.cuda else behaviour_net.eval() 15 | self.args = args 16 | self.cuda_ = self.args.cuda and torch.cuda.is_available() 17 | 18 | def action_logits(self, state, schedule, last_action, last_hidden, info): 19 | return self.behaviour_net.policy(state, schedule=schedule, last_act=last_action, last_hid=last_hidden, info=info) 20 | 21 | def run_step(self, state, schedule, last_action, last_hidden, info={}): 22 | state = cuda_wrapper(prep_obs(state).contiguous().view(1, self.args.agent_num, self.args.obs_size), cuda=self.cuda_) 23 | if self.args.model_name in ['schednet']: 24 | weight = self.behaviour_net.weight_generator(state).detach() 25 | schedule, _ = self.behaviour_net.weight_based_scheduler(weight, exploration=False) 26 | action_out = self.action_logits(state, schedule, last_action, last_hidden, info) 27 | action = select_action(self.args, action_out, status='test') 28 | _, actual = translate_action(self.args, action, self.env) 29 | next_state, reward, done, debug = self.env.step(actual) 30 | success = debug['success'] if 'success' in debug else 0.0 31 | disp = 'The rewards of agents are:' 32 | for r in reward: 33 | disp += ' '+str(r)[:7] 34 | print (disp+'.') 35 | return next_state, action, done, reward, success 36 | 37 | def run_game(self, episodes, render): 38 | action = cuda_wrapper(torch.zeros((1, self.args.agent_num, self.args.action_dim)), cuda=self.cuda_) 39 | info = {} 40 | # set up a flag to control the exit of the program 41 | if render and self.env.name in ['traffic_junction','predator_prey']: 42 | signal.signal(signal.SIGINT, self.signal_handler) 43 | self.env.init_curses() 44 | if self.args.model_name in ['coma', 'ic3net']: 45 | self.behaviour_net.init_hidden(batch_size=1) 46 | last_hidden = self.behaviour_net.get_hidden() 47 | else: 48 | last_hidden = None 49 | if self.args.model_name in ['ic3net']: 50 | gate = self.behaviour_net.gate(last_hidden[:, :, :self.args.hid_size]) 51 | schedule = self.behaviour_net.schedule(gate) 52 | else: 53 | schedule = None 54 | self.all_reward = [] 55 | self.all_turn = [] 56 | self.all_success = [] # special for traffic junction 57 | for ep in range(episodes): 58 | print ('The episode {} starts!'.format(ep)) 59 | episode_reward = [] 60 | episode_success = [] 61 | state = self.env.reset() 62 | t = 0 63 | while True: 64 | if render: 65 | self.env.render() 66 | state, action, done, reward, success = self.run_step(state, schedule, action, last_hidden, info=info) 67 | if self.args.model_name in ['coma']: 68 | last_hidden = self.behaviour_net.get_hidden() 69 | episode_reward.append(np.mean(reward)) 70 | episode_success.append(success) 71 | if render: 72 | time.sleep(0.01) 73 | if np.all(done) or t==self.args.max_steps-1: 74 | print ('The episode {} is finished!'.format(ep)) 75 | 
self.all_reward.append(np.mean(episode_reward)) 76 | self.all_success.append(np.mean(episode_success)) 77 | self.all_turn.append(t+1) 78 | break 79 | t += 1 80 | 81 | def signal_handler(self, signal, frame): 82 | print('You pressed Ctrl+C! Exiting gracefully.') 83 | self.env.exit_render() 84 | sys.exit(0) 85 | 86 | def print_info(self): 87 | episodes = len(self.all_reward) 88 | print("\n"+"="*10+ " REUSLTS "+ "="*10) 89 | print ('Episode: {:4d}'.format(episodes)) 90 | print('Mean Reward: {:2.4f}/{:2.4f}'.format(np.mean(self.all_reward),np.std(self.all_reward))) 91 | print('Mean Success: {:2.4f}/{:2.4f}'.format(np.mean(self.all_success),np.std(self.all_success))) 92 | print('Mean Turn: {:2.4f}/{:2.4f}'.format(np.mean(self.all_turn),np.std(self.all_turn))) 93 | -------------------------------------------------------------------------------- /utilities/trainer.py: -------------------------------------------------------------------------------- 1 | from collections import namedtuple 2 | import numpy as np 3 | import torch 4 | from torch import optim 5 | import torch.nn as nn 6 | from utilities.util import * 7 | from utilities.replay_buffer import * 8 | from utilities.inspector import * 9 | from arguments import * 10 | from utilities.logger import Logger 11 | 12 | 13 | 14 | class PGTrainer(object): 15 | 16 | def __init__(self, args, model, env, logger, online): 17 | self.args = args 18 | self.cuda_ = self.args.cuda and torch.cuda.is_available() 19 | self.logger = logger 20 | self.online = online 21 | inspector(self.args) 22 | if self.args.target: 23 | target_net = model(self.args).cuda() if self.cuda_ else model(self.args) 24 | self.behaviour_net = model(self.args, target_net).cuda() if self.cuda_ else model(self.args, target_net) 25 | else: 26 | self.behaviour_net = model(self.args).cuda() if self.cuda_ else model(self.args) 27 | if self.args.replay: 28 | if self.online: 29 | self.replay_buffer = TransReplayBuffer(int(self.args.replay_buffer_size)) 30 | else: 31 | self.replay_buffer = EpisodeReplayBuffer(int(self.args.replay_buffer_size)) 32 | self.env = env 33 | self.action_optimizers = [] 34 | for action_dict in self.behaviour_net.action_dicts: 35 | self.action_optimizers.append(optim.Adam(action_dict.parameters(), lr=args.policy_lrate)) 36 | self.value_optimizers = [] 37 | for value_dict in self.behaviour_net.value_dicts: 38 | self.value_optimizers.append(optim.Adam(value_dict.parameters(), lr=args.value_lrate)) 39 | self.init_action = cuda_wrapper( torch.zeros(1, self.args.agent_num, self.args.action_dim), cuda=self.cuda_ ) 40 | self.steps = 0 41 | self.episodes = 0 42 | self.mean_reward = 0 43 | self.mean_success = 0 44 | self.entr = self.args.entr 45 | self.entr_inc = self.args.entr_inc 46 | 47 | def get_loss(self, batch): 48 | action_loss, value_loss, log_p_a = self.behaviour_net.get_loss(batch) 49 | return action_loss, value_loss, log_p_a 50 | 51 | def action_compute_grad(self, stat, loss, retain_graph): 52 | action_loss, log_p_a = loss 53 | if not self.args.continuous: 54 | if self.entr > 0: 55 | entropy = multinomial_entropy(log_p_a) 56 | action_loss -= self.entr * entropy 57 | stat['entropy'] = entropy.item() 58 | action_loss.backward(retain_graph=retain_graph) 59 | 60 | def value_compute_grad(self, value_loss, retain_graph): 61 | value_loss.backward(retain_graph=retain_graph) 62 | 63 | def grad_clip(self, params): 64 | for param in params: 65 | param.grad.data.clamp_(-1, 1) 66 | 67 | def action_replay_process(self, stat): 68 | batch = 
self.replay_buffer.get_batch(self.args.batch_size) 69 | batch = self.behaviour_net.Transition(*zip(*batch)) 70 | self.action_transition_process(stat, batch) 71 | 72 | def value_replay_process(self, stat): 73 | batch = self.replay_buffer.get_batch(self.args.batch_size) 74 | batch = self.behaviour_net.Transition(*zip(*batch)) 75 | self.value_transition_process(stat, batch) 76 | 77 | def action_transition_process(self, stat, trans): 78 | action_loss, value_loss, log_p_a = self.get_loss(trans) 79 | policy_grads = [] 80 | for i in range(self.args.agent_num): 81 | retain_graph = False if i == self.args.agent_num-1 else True 82 | action_optimizer = self.action_optimizers[i] 83 | action_optimizer.zero_grad() 84 | self.action_compute_grad(stat, (action_loss[i], log_p_a[:, i, :]), retain_graph) 85 | grad = [] 86 | for pp in action_optimizer.param_groups[0]['params']: 87 | grad.append(pp.grad.clone()) 88 | policy_grads.append(grad) 89 | policy_grad_norms = [] 90 | for action_optimizer, grad in zip(self.action_optimizers, policy_grads): 91 | param = action_optimizer.param_groups[0]['params'] 92 | for i in range(len(param)): 93 | param[i].grad = grad[i] 94 | if self.args.grad_clip: 95 | self.grad_clip(param) 96 | policy_grad_norms.append(get_grad_norm(param)) 97 | action_optimizer.step() 98 | stat['policy_grad_norm'] = np.array(policy_grad_norms).mean() 99 | stat['action_loss'] = action_loss.mean().item() 100 | 101 | def value_transition_process(self, stat, trans): 102 | action_loss, value_loss, log_p_a = self.get_loss(trans) 103 | value_grads = [] 104 | for i in range(self.args.agent_num): 105 | retain_graph = False if i == self.args.agent_num-1 else True 106 | value_optimizer = self.value_optimizers[i] 107 | value_optimizer.zero_grad() 108 | self.value_compute_grad(value_loss[i], retain_graph) 109 | grad = [] 110 | for pp in value_optimizer.param_groups[0]['params']: 111 | grad.append(pp.grad.clone()) 112 | value_grads.append(grad) 113 | value_grad_norms = [] 114 | for value_optimizer, grad in zip(self.value_optimizers, value_grads): 115 | param = value_optimizer.param_groups[0]['params'] 116 | for i in range(len(param)): 117 | param[i].grad = grad[i] 118 | if self.args.grad_clip: 119 | self.grad_clip(param) 120 | value_grad_norms.append(get_grad_norm(param)) 121 | value_optimizer.step() 122 | stat['value_grad_norm'] = np.array(value_grad_norms).mean() 123 | stat['value_loss'] = value_loss.mean().item() 124 | 125 | def run(self, stat): 126 | self.behaviour_net.train_process(stat, self) 127 | self.entr += self.entr_inc 128 | 129 | def logging(self, stat): 130 | for tag, value in stat.items(): 131 | if isinstance(value, np.ndarray): 132 | self.logger.image_summary(tag, value, self.episodes) 133 | else: 134 | self.logger.scalar_summary(tag, value, self.episodes) 135 | 136 | def print_info(self, stat): 137 | action_loss = stat.get('action_loss', 0) 138 | value_loss = stat.get('value_loss', 0) 139 | entropy = stat.get('entropy', 0) 140 | print ('Episode: {:4d}, Mean Reward: {:2.4f}, Action Loss: {:2.4f}, Value Loss is: {:2.4f}, Entropy: {:2.4f}\n'\ 141 | .format(self.episodes, stat['mean_reward'], action_loss+self.entr*entropy, value_loss, entropy)) 142 | -------------------------------------------------------------------------------- /utilities/util.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import torch 3 | from torch.distributions.one_hot_categorical import OneHotCategorical 4 | from torch.distributions.normal import Normal 5 | 6 | 7 
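# The distribution defined below implements the Gumbel-Softmax (Concrete) trick: Gumbel noise -log(-log(U)) is added to the logits and pushed through a temperature-scaled softmax, so rsample() yields a differentiable relaxation of a one-hot categorical sample, while hard_sample() snaps a relaxed sample to an exact one-hot vector.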
| 8 | class GumbelSoftmax(OneHotCategorical): 9 | 10 | def __init__(self, logits, probs=None, temperature=0.1): 11 | super(GumbelSoftmax, self).__init__(logits=logits, probs=probs) 12 | self.eps = 1e-20 13 | self.temperature = temperature 14 | 15 | def sample_gumbel(self): 16 | U = self.logits.clone() 17 | U.uniform_(0, 1) 18 | return -torch.log( -torch.log( U + self.eps ) ) 19 | 20 | def gumbel_softmax_sample(self): 21 | y = self.logits + self.sample_gumbel() 22 | return torch.softmax( y / self.temperature, dim=-1) 23 | 24 | def hard_gumbel_softmax_sample(self): 25 | y = self.gumbel_softmax_sample() 26 | return (torch.max(y, dim=-1, keepdim=True)[0] == y).float() 27 | 28 | def rsample(self): 29 | return self.gumbel_softmax_sample() 30 | 31 | def sample(self): 32 | return self.rsample().detach() 33 | 34 | def hard_sample(self): 35 | return self.hard_gumbel_softmax_sample() 36 | 37 | 38 | 39 | def normal_entropy(mean, std): 40 | return Normal(mean, std).entropy().sum() 41 | 42 | def multinomial_entropy(logits): 43 | assert logits.size(-1) > 1 44 | return GumbelSoftmax(logits=logits).entropy().sum() 45 | 46 | def normal_log_density(x, mean, std): 47 | return Normal(mean, std).log_prob(x) 48 | 49 | def multinomials_log_density(actions, logits): 50 | assert logits.size(-1) > 1 51 | return GumbelSoftmax(logits=logits).log_prob(actions) 52 | 53 | def select_action(args, logits, status='train', exploration=True, info={}): 54 | if args.continuous: 55 | act_mean = logits 56 | act_std = cuda_wrapper(torch.ones_like(act_mean), args.cuda) 57 | if status == 'train': 58 | return Normal(act_mean, act_std).sample() 59 | elif status == 'test': 60 | return act_mean 61 | else: 62 | if status == 'train': 63 | if exploration: 64 | if args.epsilon_softmax: 65 | eps = info['softmax_eps'] 66 | p_a = (1 - eps) * torch.softmax(logits, dim=-1) + eps / logits.size(-1) 67 | return OneHotCategorical(logits=None, probs=p_a).sample() 68 | elif args.gumbel_softmax: 69 | return GumbelSoftmax(logits=logits).sample() 70 | else: 71 | return OneHotCategorical(logits=logits).sample() 72 | else: 73 | if args.gumbel_softmax: 74 | temperature = 1.0 75 | return torch.softmax(logits/temperature, dim=-1) 76 | else: 77 | return OneHotCategorical(logits=logits).sample() 78 | elif status == 'test': 79 | p_a = torch.softmax(logits, dim=-1) 80 | return (p_a == torch.max(p_a, dim=-1, keepdim=True)[0]).float() 81 | 82 | def translate_action(args, action, env): 83 | if not args.continuous: 84 | actual = [act.detach().squeeze().cpu().numpy() for act in torch.unbind(action, 1)] 85 | return action, actual 86 | else: 87 | actions = action.data[0].numpy() 88 | cp_actions = actions.copy() 89 | # clip each action to [-1, 1], then rescale it to the valid range of its action space 90 | for i in range(len(cp_actions)): 91 | low = env.action_space[i].low 92 | high = env.action_space[i].high 93 | cp_actions[i] = max(-1.0, min(cp_actions[i], 1.0)) 94 | cp_actions[i] = 0.5 * (cp_actions[i] + 1.0) * (high - low) + low 95 | return actions, cp_actions 96 | 97 | def prep_obs(state=[]): 98 | state = np.array(state) 99 | if len(state.shape) == 2: 100 | state = np.stack(state, axis=0) 101 | elif len(state.shape) == 4: 102 | state = np.concatenate(state, axis=0) 103 | else: 104 | raise RuntimeError('The shape of the observation is incorrect.') 105 | return torch.tensor(state).float() 106 | 107 | def cuda_wrapper(tensor, cuda): 108 | if isinstance(tensor, torch.Tensor): 109 | return tensor.cuda() if cuda else tensor 110 | else: 111 | raise RuntimeError('Please enter a pytorch tensor, now a 
{} is received.'.format(type(tensor))) 112 | 113 | def batchnorm(batch): 114 | if isinstance(batch, torch.Tensor): 115 | assert batch.size(-1) == 1 116 | return (batch - batch.mean()) / batch.std() 117 | else: 118 | raise RuntimeError('Please enter a pytorch tensor, now a {} is received.'.format(type(batch))) 119 | 120 | def get_grad_norm(params): 121 | grad_norms = [] 122 | for param in params: 123 | grad_norms.append(torch.norm(param.grad).item()) 124 | return np.mean(grad_norms) 125 | 126 | def merge_dict(stat, key, value): 127 | if key in stat.keys(): 128 | stat[key] += value 129 | else: 130 | stat[key] = value 131 | 132 | def unpack_data(args, batch): 133 | batch_size = len(batch.state) 134 | n = args.agent_num 135 | action_dim = args.action_dim 136 | cuda = torch.cuda.is_available() and args.cuda 137 | rewards = cuda_wrapper(torch.tensor(batch.reward, dtype=torch.float), cuda) 138 | last_step = cuda_wrapper(torch.tensor(batch.last_step, dtype=torch.float).contiguous().view(-1, 1), cuda) 139 | done = cuda_wrapper(torch.tensor(batch.done, dtype=torch.float).contiguous().view(-1, 1), cuda) 140 | actions = cuda_wrapper(torch.tensor(np.stack(list(zip(*batch.action))[0], axis=0), dtype=torch.float), cuda) 141 | last_actions = cuda_wrapper(torch.tensor(np.stack(list(zip(*batch.last_action))[0], axis=0), dtype=torch.float), cuda) 142 | state = cuda_wrapper(prep_obs(list(zip(batch.state))), cuda) 143 | next_state = cuda_wrapper(prep_obs(list(zip(batch.next_state))), cuda) 144 | return (rewards, last_step, done, actions, last_actions, state, next_state) 145 | 146 | def n_step(rewards, last_step, done, next_values, n_step, args): 147 | cuda = torch.cuda.is_available() and args.cuda 148 | returns = cuda_wrapper(torch.zeros_like(rewards), cuda=cuda) 149 | i = rewards.size(0)-1 150 | while i >= 0: 151 | if last_step[i]: 152 | next_return = 0 if done[i] else next_values[i].detach() 153 | for j in reversed(range(i-n_step+1, i+1)): 154 | returns[j] = rewards[j] + args.gamma * next_return 155 | next_return = returns[j] 156 | i -= n_step 157 | continue 158 | else: 159 | next_return = next_values[i+n_step-1].detach() 160 | for j in reversed(range(n_step)): 161 | g = rewards[i+j] + args.gamma * next_return 162 | next_return = g 163 | returns[i] = g.detach() 164 | i -= 1 165 | return returns 166 | --------------------------------------------------------------------------------
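
For reference, the `n_step` helper at the bottom of `utilities/util.py` walks a batch of transitions backwards in time: at the last recorded step of an episode it bootstraps from zero when the episode actually terminated (or from the critic's estimate when it was merely truncated), and elsewhere it bootstraps from the critic value `n_step` steps ahead. Below is a minimal sketch of how it can be called; the toy rewards, masks, and values are made up for illustration, and it assumes the repository root is on `PYTHONPATH`.
```python
import torch
from argparse import Namespace
from utilities.util import n_step  # assumption: run from the repository root

# A toy batch of 4 consecutive transitions from one 2-agent episode.
args = Namespace(gamma=0.99, cuda=False)
rewards     = torch.ones(4, 2)                        # per-agent reward at each step
last_step   = torch.tensor([[0.], [0.], [0.], [1.]])  # only the 4th transition closes the episode
done        = torch.tensor([[0.], [0.], [0.], [1.]])  # ... and the episode really terminated there
next_values = torch.zeros(4, 2)                       # critic estimates V(s') used for bootstrapping

returns = n_step(rewards, last_step, done, next_values, 2, args)  # 2-step returns, shape (4, 2)
print(returns)
```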