├── .gitignore
├── README.md
├── args
│ ├── simple_spread_coma_fc.py
│ ├── simple_spread_independent_ac.py
│ ├── simple_spread_independent_ddpg.py
│ ├── simple_spread_maddpg.py
│ ├── simple_spread_sqddpg.py
│ ├── simple_tag_coma_fc.py
│ ├── simple_tag_independent_ac.py
│ ├── simple_tag_independent_ddpg.py
│ ├── simple_tag_maddpg.py
│ ├── simple_tag_sqddpg.py
│ ├── traffic_junction_coma_fc.py
│ ├── traffic_junction_independent_ac.py
│ ├── traffic_junction_independent_ddpg.py
│ ├── traffic_junction_maddpg.py
│ └── traffic_junction_sqddpg.py
├── aux.py
├── environments
│ ├── multiagent_particle_envs
│ │ ├── .gitignore
│ │ ├── LICENSE.txt
│ │ ├── README.md
│ │ ├── bin
│ │ │ ├── __init__.py
│ │ │ └── interactive.py
│ │ ├── make_env.py
│ │ ├── multiagent
│ │ │ ├── __init__.py
│ │ │ ├── core.py
│ │ │ ├── environment.py
│ │ │ ├── multi_discrete.py
│ │ │ ├── policy.py
│ │ │ ├── rendering.py
│ │ │ ├── scenario.py
│ │ │ └── scenarios
│ │ │ │ ├── __init__.py
│ │ │ │ ├── simple.py
│ │ │ │ ├── simple_adversary.py
│ │ │ │ ├── simple_crypto.py
│ │ │ │ ├── simple_push.py
│ │ │ │ ├── simple_reference.py
│ │ │ │ ├── simple_speaker_listener.py
│ │ │ │ ├── simple_spread.py
│ │ │ │ ├── simple_tag.py
│ │ │ │ └── simple_world_comm.py
│ │ └── setup.py
│ ├── predator_prey_env.py
│ ├── traffic_helper.py
│ └── traffic_junction_env.py
├── figures
│ ├── dynamics_131.pdf
│ ├── dynamics_135.pdf
│ ├── dynamics_136.pdf
│ ├── dynamics_19.pdf
│ ├── dynamics_38.png
│ ├── easy_reward.pdf
│ ├── easy_road.pdf
│ ├── easy_success.pdf
│ ├── hard_reward.pdf
│ ├── hard_road.pdf
│ ├── hard_success.pdf
│ ├── medium_reward.pdf
│ ├── medium_road.pdf
│ ├── medium_success.pdf
│ ├── simple_spread_mean_reward.png
│ ├── simple_tag_turn.png
│ └── venn.png
├── learning_algorithms
│ ├── actor_critic.py
│ ├── ddpg.py
│ └── rl_algorithms.py
├── models
│ ├── coma_fc.py
│ ├── independent_ac.py
│ ├── independent_ddpg.py
│ ├── maddpg.py
│ ├── model.py
│ ├── random.py
│ └── sqddpg.py
├── test.py
├── test.sh
├── train.py
├── train.sh
└── utilities
  ├── gym_wrapper.py
  ├── inspector.py
  ├── logger.py
  ├── replay_buffer.py
  ├── tester.py
  ├── trainer.py
  └── util.py
/.gitignore:
--------------------------------------------------------------------------------
1 | logs/
2 | model_save/
3 | __pycache__/
4 | learning_algorithms/__pycache__/
5 | models/__pycache__/
6 | utilities/__pycache__/
7 | environments/__pycache__/
8 | tensorboard/
9 | arguments.py
10 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Shapley Q-value: A Local Reward Approach to Solve Global Reward Games
2 |
3 | | :exclamation: News |
4 | |:-----------------------------------------|
5 | |The JAX version of SQDDPG has been implemented in [the repository of SHAQ](https://github.com/hsvgbkhgbv/shapley-q-learning) under the framework of PyMARL, so as to adapt to SMAC and related environments.|
6 |
7 | ## Dependencies
8 | This project implements the Shapley Q-value deep deterministic policy gradient (SQDDPG) algorithm proposed in the paper accepted by AAAI 2020 (Oral): https://arxiv.org/abs/1907.05707, and demonstrates the experiments in comparison with Independent DDPG, Independent A2C, MADDPG and COMA.
9 |
10 | The code runs on Ubuntu 18.04 with Python (3.5.4) and PyTorch (1.0).
11 |
12 | We suggest installing Anaconda 3 with Python (3.5.4): https://www.anaconda.com/download/.
13 | To enable the experimental environments, please install OpenAI Gym (0.10.5) and Numpy (1.14.5).
14 | To use Tensorboard to monitor the training process, please install Tensorflow (r1.14).
15 | After installing the dependencies mentioned above, open a terminal and execute the following bash commands:
16 | ```bash
17 | cd SQDDPG/environments/multiagent_particle_envs/
18 | pip install -e .
19 | ```
20 |
21 | Now, the dependencies for running the code are installed.
22 |
23 | ## Running Code for Experiments
24 | The experiments on Cooperative Navigation and Prey-and-Predator mentioned in the paper are based on the environments from https://github.com/openai/multiagent-particle-envs, i.e., simple_spread and simple_tag. For convenience, we merged this repository into our framework, with slight modifications to the simple_tag scenario.
25 |
26 | The Traffic Junction environment used in the third experiment is from https://github.com/IC3Net/IC3Net/tree/master/ic3net-envs/ic3net_envs. For convenience, we also include it in our framework.
27 |
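For reference, each argument file under `args` constructs its environment directly. The minimal sketch below mirrors how `args/simple_spread_sqddpg.py` builds and wraps the particle environment:

```python
from multiagent.environment import MultiAgentEnv
import multiagent.scenarios as scenarios
from utilities.gym_wrapper import GymWrapper

# load the scenario and build the particle world (as the files under args/ do)
scenario = scenarios.load("simple_spread.py").Scenario()
world = scenario.make_world()

# wrap the environment so the trainer can interact with it like a Gym env
env = MultiAgentEnv(world, scenario.reset_world, scenario.reward,
                    scenario.observation, info_callback=None, shared_viewer=True)
env = GymWrapper(env)
```
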
28 | ### Training
29 | To ease training, we provide an argument file for each combination of experiment and method under the directory `args`, together with a bash script that launches an experiment with the chosen arguments.
30 |
31 | For example, to run the simple_tag experiment with SQDDPG, edit the file `simple_tag_sqddpg.py` to change the hyperparameters. Then edit `train.sh`, setting the variable `EXP_NAME` to `"simple_tag_sqddpg"` and the variable `CUDA_VISIBLE_DEVICES` to the index of the GPU you would like to use, e.g. 0 here:
32 | ```bash
33 | # !/bin/bash
34 | # sh train.sh
35 |
36 | EXP_NAME="simple_tag_sqddpg"
37 | ALIAS=""
38 | export CUDA_DEVICE_ORDER=PCI_BUS_ID
39 | export CUDA_VISIBLE_DEVICES=0
40 |
41 | if [ ! -d "./model_save" ]
42 | then
43 | mkdir ./model_save
44 | fi
45 |
46 | mkdir ./model_save/$EXP_NAME$ALIAS
47 | cp ./args/$EXP_NAME.py arguments.py
48 | python -u train.py > ./model_save/$EXP_NAME$ALIAS/exp.out &
49 | echo $! > ./model_save/$EXP_NAME$ALIAS/exp.pid
50 | ```
51 |
52 | If necessary, we can also edit the variable `ALIAS` to distinguish runs with different hyperparameters.
53 | Now, we only need to launch the experiment via the bash script:
54 | ```bash
55 | source train.sh
56 | ```
57 |
58 | ### Testing
59 | For testing, we provide a script called `test.py`, which accepts the following arguments:
60 | ```bash
61 | --save-model-dir # the path where the trained model is saved
62 | --render # whether visualization is needed
63 | --episodes # the number of episodes to run for the test
64 | ```
65 |
66 | ### Experimental Results
67 |
68 |
69 | #### Cooperative Navigation
70 |
71 |
72 | Mean reward per episode during training in Cooperative Navigation. SQDDPG(n) indicates SQDDPG with a sample size (i.e., M in Eq. 8 of the paper) of n. In the rest of the experiments, since only SQDDPG with a sample size of 1 is run, we simply use SQDDPG to denote SQDDPG(1).
73 |
74 |
75 | #### Prey-and-Predator
76 |
77 |
78 | Turns taken to capture the prey per episode during training in Prey-and-Predator. SQDDPG uses a sample size of 1 in this experiment.
79 |
80 |
81 |
82 |
83 | Credit assignment to each predator for a fixed trajectory. The leftmost figure records a trajectory sampled by an expert policy: the square marks each agent's initial position, the circle its final position, and the dots along the trajectory its intermediate positions. The other figures show the normalized credit assignments generated by the different MARL algorithms for this trajectory. SQDDPG uses a sample size of 1 in this experiment.
84 |
85 |
86 | #### Traffic Junction
87 | | Difficulty | IA2C | IDDPG | COMA | MADDPG | SQDDPG |
88 | |------------|--------|--------|--------|------------|------------|
89 | | Easy | 65.01% | 93.08% | 93.01% | **93.72%** | 93.26% |
90 | | Medium | 67.51% | 84.16% | 82.48% | 87.92% | **88.98%** |
91 | | Hard | 60.89% | 64.99% | 85.33% | 84.21% | **87.04%** |
92 |
93 | Success rate on Traffic Junction, tested with 20, 40 and 60 steps per episode in the easy, medium and hard versions respectively. The results are obtained by running each algorithm for 1000 episodes after training.
94 |
95 | ## Extension of the Framework
96 | This framework can easily be extended with extra environments implemented in OpenAI Gym or new multi-agent algorithms implemented in PyTorch. To add a new algorithm, inherit the base class in `models/model.py` and implement the following functions (a minimal skeleton is sketched after the list):
97 | ```python
98 | construct_model(self)
99 | policy(self, obs, last_act=None, last_hid=None, gate=None, info={}, stat={})
100 | value(self, obs, act)
101 | construct_policy_net(self)
102 | construct_value_net(self)
103 | get_loss(self)
104 | ```
105 |
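As a reference, here is a minimal skeleton of such a class. This is only a sketch: the base-class name `Model`, its constructor signature, and the use of `self.args` are assumptions to be checked against `models/model.py` and existing algorithms such as `models/sqddpg.py`.

```python
import torch
import torch.nn as nn

from models.model import Model  # assumed base class name; check models/model.py


class MyAlgo(Model):
    """Sketch of a new algorithm; the method signatures follow the list above."""

    def __init__(self, args):
        super(MyAlgo, self).__init__(args)  # assumes the base constructor takes args
        self.args = args
        self.construct_model()

    def construct_model(self):
        self.construct_policy_net()
        self.construct_value_net()

    def construct_policy_net(self):
        # per-agent policy head; sizes come from the hyperparameters in Args
        self.policy_net = nn.Linear(self.args.obs_size, self.args.action_dim)

    def construct_value_net(self):
        # critic over observation-action pairs (centralised or not, per algorithm)
        self.value_net = nn.Linear(self.args.obs_size + self.args.action_dim, 1)

    def policy(self, obs, last_act=None, last_hid=None, gate=None, info={}, stat={}):
        # return the action logits (or means, for continuous actions)
        return self.policy_net(obs)

    def value(self, obs, act):
        return self.value_net(torch.cat([obs, act], dim=-1))

    def get_loss(self):
        # compute and return the policy and value losses for the trainer
        raise NotImplementedError
```
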
106 | After implementing the class for your own method, register the algorithm in `aux.py`. For example, if the algorithm is called sqddpg and the corresponding class is called `SQDDPG`, the registration looks as follows:
107 | ```python
108 | sqddpgArgs = namedtuple( 'sqddpgArgs', ['sample_size'] ) # define the exclusive hyperparameters of this algorithm
109 | Model = dict(...,
110 | ...,
111 | ...,
112 | ...,
113 | sqddpg=SQDDPG
114 | ) # register the handle of the corresponding class of this algorithm
115 | AuxArgs = dict(...,
116 | ...,
117 | ...,
118 | ...,
119 | sqddpg=sqddpgArgs
120 | ) # register the exclusive args of this algorithm
121 | Strategy=dict(...,
122 | ...,
123 | ...,
124 | ...,
125 | sqddpg='pg'
126 | ) # register the training strategy of this algorithm, e.g., 'pg' or 'q'
127 | ```
128 |
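Once registered, the entry is looked up by `model_name` in the argument file. The snippet below mirrors how the existing files under `args` (e.g. `args/simple_tag_sqddpg.py`) instantiate the exclusive arguments and merge them with the shared `Args`:

```python
from collections import namedtuple
from aux import Args, AuxArgs

model_name = 'sqddpg'

# instantiate the exclusive arguments registered in aux.py, e.g. sample_size=1
aux_args = AuxArgs[model_name](1)

# merge the shared Args fields with the algorithm-specific fields
MergeArgs = namedtuple('MergeArgs', Args._fields + AuxArgs[model_name]._fields)
# args = MergeArgs(*(args + aux_args))  # `args` is the filled-in Args namedtuple
```
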
129 | Moreover, you can optionally define restrictions for your algorithm in `utilities/inspector.py` to catch mis-defined hyperparameters, e.g.:
130 | ```python
131 | if ... ...:
132 | ... ... ... ...
133 | elif args.model_name == 'sqddpg':
134 | assert args.replay is True
135 | assert args.q_func is True
136 | assert args.target is True
137 | assert args.gumbel_softmax is True
138 | assert args.epsilon_softmax is False
139 | assert args.online is True
140 | assert hasattr(args, 'sample_size')
141 | ```
142 |
143 | Finally, you can additionally add auxiliary functions to the directory `utilities`.
144 |
145 | At present, this framework only supports policy gradient methods. Support for value-based methods is under test and will be available soon.
146 |
147 | ## Citation
148 | If you use the framework or part of the work mentioned in the paper, please cite:
149 | ```
150 | @article{Wang_2020,
151 | title={Shapley Q-Value: A Local Reward Approach to Solve Global Reward Games},
152 | volume={34},
153 | ISSN={2159-5399},
154 | url={http://dx.doi.org/10.1609/aaai.v34i05.6220},
155 | DOI={10.1609/aaai.v34i05.6220},
156 | number={05},
157 | journal={Proceedings of the AAAI Conference on Artificial Intelligence},
158 | publisher={Association for the Advancement of Artificial Intelligence (AAAI)},
159 | author={Wang, Jianhong and Zhang, Yuan and Kim, Tae-Kyun and Gu, Yunjie},
160 | year={2020},
161 | month={Apr},
162 | pages={7285–7292}
163 | }
164 | ```
165 |
--------------------------------------------------------------------------------
/args/simple_spread_coma_fc.py:
--------------------------------------------------------------------------------
1 | from collections import namedtuple
2 | from multiagent.environment import MultiAgentEnv
3 | import multiagent.scenarios as scenario
4 | from utilities.gym_wrapper import *
5 | import numpy as np
6 | from aux import *
7 |
8 |
9 | '''define the model name'''
10 | model_name = 'coma_fc'
11 |
12 | '''define the scenario name'''
13 | scenario_name = 'simple_spread'
14 |
15 | '''define the special property'''
16 | # independentArgs = namedtuple( 'independentArgs', [] )
17 | aux_args = AuxArgs[model_name]()
18 | alias = '_new_1'
19 |
20 | '''load scenario from script'''
21 | scenario = scenario.load(scenario_name+".py").Scenario()
22 |
23 | '''create world'''
24 | world = scenario.make_world()
25 |
26 | '''create multiagent environment'''
27 | env = MultiAgentEnv(world, scenario.reset_world, scenario.reward, scenario.observation, info_callback=None, shared_viewer=True)
28 | env = GymWrapper(env)
29 |
30 | MergeArgs = namedtuple('MergeArgs', Args._fields+AuxArgs[model_name]._fields)
31 |
32 | # under offline trainer if set batch_size=replay_buffer_size=update_freq -> epoch update
33 | args = Args(model_name=model_name,
34 | agent_num=env.get_num_of_agents(),
35 | hid_size=32,
36 | obs_size=np.max(env.get_shape_of_obs()),
37 | continuous=False,
38 | action_dim=np.max(env.get_output_shape_of_act()),
39 | init_std=0.1,
40 | policy_lrate=1e-2,
41 | value_lrate=1e-4,
42 | max_steps=200,
43 | batch_size=100,
44 | gamma=0.9,
45 | normalize_advantages=False,
46 | entr=1e-2,
47 | entr_inc=0.0,
48 | action_num=np.max(env.get_input_shape_of_act()),
49 | q_func=True,
50 | train_episodes_num=int(5e3),
51 | replay=True,
52 | replay_buffer_size=1e4,
53 | replay_warmup=0,
54 | cuda=True,
55 | grad_clip=True,
56 | save_model_freq=10,
57 | target=True,
58 | target_lr=1e-1,
59 | behaviour_update_freq=100,
60 | critic_update_times=10,
61 | target_update_freq=200,
62 | gumbel_softmax=False,
63 | epsilon_softmax=False,
64 | online=True,
65 | reward_record_type='episode_mean_step',
66 | shared_parameters=True
67 | )
68 |
69 | args = MergeArgs(*(args+aux_args))
70 |
71 | log_name = scenario_name + '_' + model_name + alias
72 |
--------------------------------------------------------------------------------
/args/simple_spread_independent_ac.py:
--------------------------------------------------------------------------------
1 | from collections import namedtuple
2 | from multiagent.environment import MultiAgentEnv
3 | import multiagent.scenarios as scenario
4 | from utilities.gym_wrapper import *
5 | import numpy as np
6 | from aux import *
7 |
8 |
9 |
10 | '''define the model name'''
11 | model_name = 'independent_ac'
12 |
13 | '''define the scenario name'''
14 | scenario_name = 'simple_spread'
15 |
16 | '''define the special property'''
17 | # independentArgs = namedtuple( 'independentArgs', [] )
18 | aux_args = AuxArgs[model_name]()
19 | alias = '_new_1'
20 |
21 | '''load scenario from script'''
22 | scenario = scenario.load(scenario_name+".py").Scenario()
23 |
24 | '''create world'''
25 | world = scenario.make_world()
26 |
27 | '''create multiagent environment'''
28 | env = MultiAgentEnv(world, scenario.reset_world, scenario.reward, scenario.observation, info_callback=None, shared_viewer=True)
29 | env = GymWrapper(env)
30 |
31 | MergeArgs = namedtuple('MergeArgs', Args._fields+AuxArgs[model_name]._fields)
32 |
33 | # under offline trainer if set batch_size=replay_buffer_size=update_freq -> epoch update
34 | args = Args(model_name=model_name,
35 | agent_num=env.get_num_of_agents(),
36 | hid_size=32,
37 | obs_size=np.max(env.get_shape_of_obs()),
38 | continuous=False,
39 | action_dim=np.max(env.get_output_shape_of_act()),
40 | init_std=0.1,
41 | policy_lrate=1e-6,
42 | value_lrate=1e-5,
43 | max_steps=200,
44 | batch_size=100,
45 | gamma=0.9,
46 | normalize_advantages=False,
47 | entr=1e-2,
48 | entr_inc=0.0,
49 | action_num=np.max(env.get_input_shape_of_act()),
50 | q_func=True,
51 | train_episodes_num=int(5e3),
52 | replay=True,
53 | replay_buffer_size=1e4,
54 | replay_warmup=0,
55 | cuda=True,
56 | grad_clip=True,
57 | save_model_freq=10,
58 | target=True,
59 | target_lr=1e-1,
60 | behaviour_update_freq=100,
61 | critic_update_times=10,
62 | target_update_freq=200,
63 | gumbel_softmax=False,
64 | epsilon_softmax=False,
65 | online=True,
66 | reward_record_type='episode_mean_step',
67 | shared_parameters=False
68 | )
69 |
70 | args = MergeArgs(*(args+aux_args))
71 |
72 | log_name = scenario_name + '_' + model_name + alias
73 |
--------------------------------------------------------------------------------
/args/simple_spread_independent_ddpg.py:
--------------------------------------------------------------------------------
1 | from collections import namedtuple
2 | from multiagent.environment import MultiAgentEnv
3 | import multiagent.scenarios as scenario
4 | from utilities.gym_wrapper import *
5 | import numpy as np
6 | from aux import *
7 |
8 |
9 |
10 | '''define the model name'''
11 | model_name = 'independent_ddpg'
12 |
13 | '''define the scenario name'''
14 | scenario_name = 'simple_spread'
15 |
16 | '''define the special property'''
17 | # independentArgs = namedtuple( 'independentArgs', [] )
18 | aux_args = AuxArgs[model_name]()
19 | alias = '_new_6'
20 |
21 | '''load scenario from script'''
22 | scenario = scenario.load(scenario_name+".py").Scenario()
23 |
24 | '''create world'''
25 | world = scenario.make_world()
26 |
27 | '''create multiagent environment'''
28 | env = MultiAgentEnv(world, scenario.reset_world, scenario.reward, scenario.observation, info_callback=None, shared_viewer=True)
29 | env = GymWrapper(env)
30 |
31 | MergeArgs = namedtuple('MergeArgs', Args._fields+AuxArgs[model_name]._fields)
32 |
33 | # under offline trainer if set batch_size=replay_buffer_size=update_freq -> epoch update
34 | args = Args(model_name=model_name,
35 | agent_num=env.get_num_of_agents(),
36 | hid_size=32,
37 | obs_size=np.max(env.get_shape_of_obs()),
38 | continuous=False,
39 | action_dim=np.max(env.get_output_shape_of_act()),
40 | init_std=0.1,
41 | policy_lrate=1e-3,
42 | value_lrate=1e-2,
43 | max_steps=200,
44 | batch_size=32,
45 | gamma=0.9,
46 | normalize_advantages=False,
47 | entr=1e-2,
48 | entr_inc=0.0,
49 | action_num=np.max(env.get_input_shape_of_act()),
50 | q_func=False,
51 | train_episodes_num=int(5e3),
52 | replay=True,
53 | replay_buffer_size=1e4,
54 | replay_warmup=0,
55 | cuda=True,
56 | grad_clip=True,
57 | save_model_freq=10,
58 | target=True,
59 | target_lr=1e-1,
60 | behaviour_update_freq=100,
61 | critic_update_times=10,
62 | target_update_freq=200,
63 | gumbel_softmax=True,
64 | epsilon_softmax=False,
65 | online=True,
66 | reward_record_type='episode_mean_step',
67 | shared_parameters=False
68 | )
69 |
70 | args = MergeArgs(*(args+aux_args))
71 |
72 | log_name = scenario_name + '_' + model_name + alias
73 |
--------------------------------------------------------------------------------
/args/simple_spread_maddpg.py:
--------------------------------------------------------------------------------
1 | from collections import namedtuple
2 | from multiagent.environment import MultiAgentEnv
3 | import multiagent.scenarios as scenario
4 | from utilities.gym_wrapper import *
5 | import numpy as np
6 | from aux import *
7 |
8 |
9 |
10 | '''define the model name'''
11 | model_name = 'maddpg'
12 |
13 | '''define the scenario name'''
14 | scenario_name = 'simple_spread'
15 |
16 | '''define the special property'''
17 | # maddpgArgs = namedtuple( 'maddpgArgs', [] )
18 | aux_args = AuxArgs[model_name]()
19 | alias = '_new_3'
20 |
21 | '''load scenario from script'''
22 | scenario = scenario.load(scenario_name+".py").Scenario()
23 |
24 | '''create world'''
25 | world = scenario.make_world()
26 |
27 | '''create multiagent environment'''
28 | env = MultiAgentEnv(world, scenario.reset_world, scenario.reward, scenario.observation, info_callback=None, shared_viewer=True)
29 | env = GymWrapper(env)
30 |
31 | MergeArgs = namedtuple('MergeArgs', Args._fields+AuxArgs[model_name]._fields)
32 |
33 | # under offline trainer if set batch_size=replay_buffer_size=update_freq -> epoch update
34 | args = Args(model_name=model_name,
35 | agent_num=env.get_num_of_agents(),
36 | hid_size=32,
37 | obs_size=np.max(env.get_shape_of_obs()),
38 | continuous=False,
39 | action_dim=np.max(env.get_output_shape_of_act()),
40 | init_std=0.1,
41 | policy_lrate=1e-4,
42 | value_lrate=1e-3,
43 | max_steps=200,
44 | batch_size=32,
45 | gamma=0.9,
46 | normalize_advantages=False,
47 | entr=1e-3,
48 | entr_inc=0.0,
49 | action_num=np.max(env.get_input_shape_of_act()),
50 | q_func=True,
51 | train_episodes_num=int(5e3),
52 | replay=True,
53 | replay_buffer_size=1e4,
54 | replay_warmup=0,
55 | cuda=True,
56 | grad_clip=True,
57 | save_model_freq=10,
58 | target=True,
59 | target_lr=1e-1,
60 | behaviour_update_freq=100,
61 | critic_update_times=10,
62 | target_update_freq=200,
63 | gumbel_softmax=True,
64 | epsilon_softmax=False,
65 | online=True,
66 | reward_record_type='episode_mean_step',
67 | shared_parameters=False
68 | )
69 |
70 | args = MergeArgs(*(args+aux_args))
71 |
72 | log_name = scenario_name + '_' + model_name + alias
73 |
--------------------------------------------------------------------------------
/args/simple_spread_sqddpg.py:
--------------------------------------------------------------------------------
1 | from collections import namedtuple
2 | from multiagent.environment import MultiAgentEnv
3 | import multiagent.scenarios as scenario
4 | from utilities.gym_wrapper import *
5 | import numpy as np
6 | from aux import *
7 |
8 |
9 |
10 | '''define the model name'''
11 | model_name = 'sqddpg'
12 |
13 | '''define the scenario name'''
14 | scenario_name = 'simple_spread'
15 |
16 | '''define the special property'''
17 | # sqddpgArgs = namedtuple( 'sqddpgArgs', ['sample_size'] )
18 | aux_args = AuxArgs[model_name](5)
19 | alias = '_new_sample_12'
20 |
21 | '''load scenario from script'''
22 | scenario = scenario.load(scenario_name+".py").Scenario()
23 |
24 | '''create world'''
25 | world = scenario.make_world()
26 |
27 | '''create multiagent environment'''
28 | env = MultiAgentEnv(world, scenario.reset_world, scenario.reward, scenario.observation, info_callback=None, shared_viewer=True)
29 | env = GymWrapper(env)
30 |
31 | MergeArgs = namedtuple('MergeArgs', Args._fields+AuxArgs[model_name]._fields)
32 |
33 | # under offline trainer if set batch_size=replay_buffer_size=update_freq -> epoch update
34 | args = Args(model_name=model_name,
35 | agent_num=env.get_num_of_agents(),
36 | hid_size=32,
37 | obs_size=np.max(env.get_shape_of_obs()),
38 | continuous=False,
39 | action_dim=np.max(env.get_output_shape_of_act()),
40 | init_std=0.1,
41 | policy_lrate=1e-4,
42 | value_lrate=1e-3,
43 | max_steps=200,
44 | batch_size=32,
45 | gamma=0.9,
46 | normalize_advantages=False,
47 | entr=1e-2,
48 | entr_inc=0.0,
49 | action_num=np.max(env.get_input_shape_of_act()),
50 | q_func=True,
51 | train_episodes_num=int(5e3),
52 | replay=True,
53 | replay_buffer_size=1e4,
54 | replay_warmup=0,
55 | cuda=True,
56 | grad_clip=True,
57 | save_model_freq=10,
58 | target=True,
59 | target_lr=1e-1,
60 | behaviour_update_freq=100,
61 | critic_update_times=10,
62 | target_update_freq=200,
63 | gumbel_softmax=True,
64 | epsilon_softmax=False,
65 | online=True,
66 | reward_record_type='episode_mean_step',
67 | shared_parameters=False
68 | )
69 |
70 | args = MergeArgs(*(args+aux_args))
71 |
72 | log_name = scenario_name + '_' + model_name + alias
73 |
--------------------------------------------------------------------------------
/args/simple_tag_coma_fc.py:
--------------------------------------------------------------------------------
1 | from collections import namedtuple
2 | from multiagent.environment import MultiAgentEnv
3 | import multiagent.scenarios as scenario
4 | from utilities.gym_wrapper import *
5 | import numpy as np
6 | from aux import *
7 |
8 |
9 | '''define the model name'''
10 | model_name = 'coma_fc'
11 |
12 | '''define the scenario name'''
13 | scenario_name = 'simple_tag'
14 |
15 | '''define the special property'''
16 | # independentArgs = namedtuple( 'independentArgs', [] )
17 | aux_args = AuxArgs[model_name]()
18 | alias = ''
19 |
20 | '''load scenario from script'''
21 | scenario = scenario.load(scenario_name+".py").Scenario()
22 |
23 | '''create world'''
24 | world = scenario.make_world()
25 |
26 | '''create multiagent environment'''
27 | env = MultiAgentEnv(world, scenario.reset_world, scenario.reward, scenario.observation, info_callback=None, shared_viewer=True,done_callback=scenario.episode_over)
28 | env = GymWrapper(env)
29 |
30 | MergeArgs = namedtuple('MergeArgs', Args._fields+AuxArgs[model_name]._fields)
31 |
32 | # under offline trainer if set batch_size=replay_buffer_size=update_freq -> epoch update
33 | args = Args(model_name=model_name,
34 | agent_num=env.get_num_of_agents(),
35 | hid_size=128,
36 | obs_size=np.max(env.get_shape_of_obs()),
37 | continuous=False,
38 | action_dim=np.max(env.get_output_shape_of_act()),
39 | init_std=0.1,
40 | policy_lrate=1e-3,
41 | value_lrate=1e-4,
42 | max_steps=200,
43 | batch_size=100,
44 | gamma=0.99,
45 | normalize_advantages=False,
46 | entr=1e-3,
47 | entr_inc=0.0,
48 | action_num=np.max(env.get_input_shape_of_act()),
49 | q_func=True,
50 | train_episodes_num=int(5e3),
51 | replay=True,
52 | replay_buffer_size=1e4,
53 | replay_warmup=0,
54 | cuda=True,
55 | grad_clip=True,
56 | save_model_freq=10,
57 | target=True,
58 | target_lr=1e-1,
59 | behaviour_update_freq=100,
60 | critic_update_times=10,
61 | target_update_freq=200,
62 | gumbel_softmax=False,
63 | epsilon_softmax=False,
64 | online=True,
65 | reward_record_type='episode_mean_step',
66 | shared_parameters=False
67 | )
68 |
69 | args = MergeArgs(*(args+aux_args))
70 |
71 | log_name = scenario_name + '_' + model_name + alias
72 |
--------------------------------------------------------------------------------
/args/simple_tag_independent_ac.py:
--------------------------------------------------------------------------------
1 | from collections import namedtuple
2 | from multiagent.environment import MultiAgentEnv
3 | import multiagent.scenarios as scenario
4 | from utilities.gym_wrapper import *
5 | import numpy as np
6 | from aux import *
7 |
8 |
9 | '''define the model name'''
10 | model_name = 'independent_ac'
11 |
12 | '''define the scenario name'''
13 | scenario_name = 'simple_tag'
14 |
15 | '''define the special property'''
16 | # independentArgs = namedtuple( 'independentArgs', [] )
17 | aux_args = AuxArgs[model_name]()
18 | alias = ''
19 |
20 | '''load scenario from script'''
21 | scenario = scenario.load(scenario_name+".py").Scenario()
22 |
23 | '''create world'''
24 | world = scenario.make_world()
25 |
26 | '''create multiagent environment'''
27 | env = MultiAgentEnv(world, scenario.reset_world, scenario.reward, scenario.observation, info_callback=None, shared_viewer=True,done_callback=scenario.episode_over)
28 | env = GymWrapper(env)
29 |
30 | MergeArgs = namedtuple('MergeArgs', Args._fields+AuxArgs[model_name]._fields)
31 |
32 | # under offline trainer if set batch_size=replay_buffer_size=update_freq -> epoch update
33 | args = Args(model_name=model_name,
34 | agent_num=env.get_num_of_agents(),
35 | hid_size=128,
36 | obs_size=np.max(env.get_shape_of_obs()),
37 | continuous=False,
38 | action_dim=np.max(env.get_output_shape_of_act()),
39 | init_std=0.1,
40 | policy_lrate=1e-3,
41 | value_lrate=1e-4,
42 | max_steps=200,
43 | batch_size=100,
44 | gamma=0.99,
45 | normalize_advantages=False,
46 | entr=1e-3,
47 | entr_inc=0.0,
48 | action_num=np.max(env.get_input_shape_of_act()),
49 | q_func=True,
50 | train_episodes_num=int(5e3),
51 | replay=True,
52 | replay_buffer_size=1e4,
53 | replay_warmup=0,
54 | cuda=True,
55 | grad_clip=True,
56 | save_model_freq=10,
57 | target=True,
58 | target_lr=1e-1,
59 | behaviour_update_freq=100,
60 | critic_update_times=10,
61 | target_update_freq=200,
62 | gumbel_softmax=False,
63 | epsilon_softmax=False,
64 | online=True,
65 | reward_record_type='episode_mean_step',
66 | shared_parameters=False
67 | )
68 |
69 | args = MergeArgs(*(args+aux_args))
70 |
71 | log_name = scenario_name + '_' + model_name + alias
72 |
--------------------------------------------------------------------------------
/args/simple_tag_independent_ddpg.py:
--------------------------------------------------------------------------------
1 | from collections import namedtuple
2 | from multiagent.environment import MultiAgentEnv
3 | import multiagent.scenarios as scenario
4 | from utilities.gym_wrapper import *
5 | import numpy as np
6 | from aux import *
7 |
8 |
9 |
10 | '''define the model name'''
11 | model_name = 'independent_ddpg'
12 |
13 | '''define the scenario name'''
14 | scenario_name = 'simple_tag'
15 |
16 | '''define the special property'''
17 | # independentArgs = namedtuple( 'independentArgs', [] )
18 | aux_args = AuxArgs[model_name]()
19 | alias = ''
20 |
21 | '''load scenario from script'''
22 | scenario = scenario.load(scenario_name+".py").Scenario()
23 |
24 | '''create world'''
25 | world = scenario.make_world()
26 |
27 | '''create multiagent environment'''
28 | env = MultiAgentEnv(world, scenario.reset_world, scenario.reward, scenario.observation, info_callback=None, shared_viewer=True,done_callback=scenario.episode_over)
29 | env = GymWrapper(env)
30 |
31 | MergeArgs = namedtuple('MergeArgs', Args._fields+AuxArgs[model_name]._fields)
32 |
33 | # under offline trainer if set batch_size=replay_buffer_size=update_freq -> epoch update
34 | args = Args(model_name=model_name,
35 | agent_num=env.get_num_of_agents(),
36 | hid_size=128,
37 | obs_size=np.max(env.get_shape_of_obs()),
38 | continuous=False,
39 | action_dim=np.max(env.get_output_shape_of_act()),
40 | init_std=0.1,
41 | policy_lrate=1e-4,
42 | value_lrate=5e-4,
43 | max_steps=200,
44 | batch_size=128,
45 | gamma=0.99,
46 | normalize_advantages=False,
47 | entr=1e-3,
48 | entr_inc=0.0,
49 | action_num=np.max(env.get_input_shape_of_act()),
50 | q_func=False,
51 | train_episodes_num=int(5e3),
52 | replay=True,
53 | replay_buffer_size=1e4,
54 | replay_warmup=0,
55 | cuda=True,
56 | grad_clip=True,
57 | save_model_freq=10,
58 | target=True,
59 | target_lr=1e-1,
60 | behaviour_update_freq=100,
61 | critic_update_times=10,
62 | target_update_freq=200,
63 | gumbel_softmax=True,
64 | epsilon_softmax=False,
65 | online=True,
66 | reward_record_type='episode_mean_step',
67 | shared_parameters=False
68 | )
69 |
70 | args = MergeArgs(*(args+aux_args))
71 |
72 | log_name = scenario_name + '_' + model_name + alias
73 |
--------------------------------------------------------------------------------
/args/simple_tag_maddpg.py:
--------------------------------------------------------------------------------
1 | from collections import namedtuple
2 | from multiagent.environment import MultiAgentEnv
3 | import multiagent.scenarios as scenario
4 | from utilities.gym_wrapper import *
5 | import numpy as np
6 | from aux import *
7 |
8 |
9 |
10 | '''define the model name'''
11 | model_name = 'maddpg'
12 |
13 | '''define the scenario name'''
14 | scenario_name = 'simple_tag'
15 |
16 | '''define the special property'''
17 | # maddpgArgs = namedtuple( 'maddpgArgs', [] )
18 | aux_args = AuxArgs[model_name]()
19 | alias = ''
20 |
21 | '''load scenario from script'''
22 | scenario = scenario.load(scenario_name+".py").Scenario()
23 |
24 | '''create world'''
25 | world = scenario.make_world()
26 |
27 | '''create multiagent environment'''
28 | env = MultiAgentEnv(world, scenario.reset_world, scenario.reward, scenario.observation, info_callback=None, shared_viewer=True,done_callback=scenario.episode_over)
29 | env = GymWrapper(env)
30 |
31 | MergeArgs = namedtuple('MergeArgs', Args._fields+AuxArgs[model_name]._fields)
32 |
33 | # under offline trainer if set batch_size=replay_buffer_size=update_freq -> epoch update
34 | args = Args(model_name=model_name,
35 | agent_num=env.get_num_of_agents(),
36 | hid_size=128,
37 | obs_size=np.max(env.get_shape_of_obs()),
38 | continuous=False,
39 | action_dim=np.max(env.get_output_shape_of_act()),
40 | init_std=0.1,
41 | policy_lrate=1e-4,
42 | value_lrate=5e-4,
43 | max_steps=200,
44 | batch_size=128,
45 | gamma=0.99,
46 | normalize_advantages=False,
47 | entr=1e-3,
48 | entr_inc=0.0,
49 | action_num=np.max(env.get_input_shape_of_act()),
50 | q_func=True,
51 | train_episodes_num=int(5e3),
52 | replay=True,
53 | replay_buffer_size=1e4,
54 | replay_warmup=0,
55 | cuda=True,
56 | grad_clip=True,
57 | save_model_freq=10,
58 | target=True,
59 | target_lr=1e-1,
60 | behaviour_update_freq=100,
61 | critic_update_times=10,
62 | target_update_freq=200,
63 | gumbel_softmax=True,
64 | epsilon_softmax=False,
65 | online=True,
66 | reward_record_type='episode_mean_step',
67 | shared_parameters=False
68 | )
69 |
70 | args = MergeArgs(*(args+aux_args))
71 |
72 | log_name = scenario_name + '_' + model_name + alias
73 |
--------------------------------------------------------------------------------
/args/simple_tag_sqddpg.py:
--------------------------------------------------------------------------------
1 | from collections import namedtuple
2 | from multiagent.environment import MultiAgentEnv
3 | import multiagent.scenarios as scenario
4 | from utilities.gym_wrapper import *
5 | import numpy as np
6 | from aux import *
7 |
8 |
9 |
10 | '''define the model name'''
11 | model_name = 'sqddpg'
12 |
13 | '''define the scenario name'''
14 | scenario_name = 'simple_tag'
15 |
16 | '''define the special property'''
17 | # sqddpgArgs = namedtuple( 'sqddpgArgs', ['sample_size'] )
18 | aux_args = AuxArgs[model_name](1)
19 | alias = ''
20 |
21 | '''load scenario from script'''
22 | scenario = scenario.load(scenario_name+".py").Scenario()
23 |
24 | '''create world'''
25 | world = scenario.make_world()
26 |
27 | '''create multiagent environment'''
28 | env = MultiAgentEnv(world, scenario.reset_world, scenario.reward, scenario.observation, info_callback=None, shared_viewer=True,done_callback=scenario.episode_over)
29 | env = GymWrapper(env)
30 |
31 | MergeArgs = namedtuple('MergeArgs', Args._fields+AuxArgs[model_name]._fields)
32 |
33 | # under offline trainer if set batch_size=replay_buffer_size=update_freq -> epoch update
34 | args = Args(model_name=model_name,
35 | agent_num=env.get_num_of_agents(),
36 | hid_size=128,
37 | obs_size=np.max(env.get_shape_of_obs()),
38 | continuous=False,
39 | action_dim=np.max(env.get_output_shape_of_act()),
40 | init_std=0.1,
41 | policy_lrate=1e-4,
42 | value_lrate=5e-4,
43 | max_steps=200,
44 | batch_size=128,
45 | gamma=0.99,
46 | normalize_advantages=False,
47 | entr=1e-3,
48 | entr_inc=0.0,
49 | action_num=np.max(env.get_input_shape_of_act()),
50 | q_func=True,
51 | train_episodes_num=int(5e3),
52 | replay=True,
53 | replay_buffer_size=1e4,
54 | replay_warmup=0,
55 | cuda=True,
56 | grad_clip=True,
57 | save_model_freq=10,
58 | target=True,
59 | target_lr=1e-1,
60 | behaviour_update_freq=100,
61 | critic_update_times=10,
62 | target_update_freq=200,
63 | gumbel_softmax=True,
64 | epsilon_softmax=False,
65 | online=True,
66 | reward_record_type='episode_mean_step',
67 | shared_parameters=False
68 | )
69 |
70 | args = MergeArgs(*(args+aux_args))
71 |
72 | log_name = scenario_name + '_' + model_name + alias
73 |
--------------------------------------------------------------------------------
/args/traffic_junction_coma_fc.py:
--------------------------------------------------------------------------------
1 | from collections import namedtuple
2 | from utilities.gym_wrapper import *
3 | import numpy as np
4 | from models.coma_fc import *
5 | from aux import *
6 | from environments.traffic_junction_env import TrafficJunctionEnv
7 |
8 |
9 |
10 | '''define the model name'''
11 | model_name = 'coma_fc'
12 |
13 | '''define the special property'''
14 | # independentArgs = namedtuple( 'independentArgs', [] )
15 | aux_args = AuxArgs[model_name]()
16 | alias = '_medium'
17 |
18 | '''define the scenario name'''
19 | scenario_name = 'traffic_junction'
20 |
21 | '''define the environment'''
22 | env = TrafficJunctionEnv()
23 | env = GymWrapper(env)
24 |
25 | MergeArgs = namedtuple('MergeArgs', Args._fields+AuxArgs[model_name]._fields)
26 |
27 | # under offline trainer if set batch_size=replay_buffer_size=update_freq -> epoch update
28 | args = Args(model_name=model_name,
29 | agent_num=env.get_num_of_agents(),
30 | hid_size=128,
31 | obs_size=np.max(env.get_shape_of_obs()),
32 | continuous=False,
33 | action_dim=np.max(env.get_output_shape_of_act()),
34 | init_std=0.1,
35 | policy_lrate=1e-4,
36 | value_lrate=1e-3,
37 | max_steps=50,
38 | batch_size=2,
39 | gamma=0.99,
40 | normalize_advantages=False,
41 | entr=1e-4,
42 | entr_inc=0.0,
43 | action_num=np.max(env.get_input_shape_of_act()),
44 | q_func=True,
45 | train_episodes_num=int(5e3),
46 | replay=True,
47 | replay_buffer_size=2,
48 | replay_warmup=0,
49 | cuda=True,
50 | grad_clip=True,
51 | save_model_freq=100,
52 | target=True,
53 | target_lr=1e-1,
54 | behaviour_update_freq=2,
55 | critic_update_times=10,
56 | target_update_freq=2,
57 | gumbel_softmax=False,
58 | epsilon_softmax=True,
59 | online=False,
60 | reward_record_type='episode_mean_step',
61 | shared_parameters=False
62 | )
63 |
64 | args = MergeArgs(*(args+aux_args))
65 |
66 | log_name = scenario_name + '_' + model_name + alias
67 |
--------------------------------------------------------------------------------
/args/traffic_junction_independent_ac.py:
--------------------------------------------------------------------------------
1 | from collections import namedtuple
2 | from utilities.gym_wrapper import *
3 | import numpy as np
4 | from aux import *
5 | from environments.traffic_junction_env import TrafficJunctionEnv
6 |
7 |
8 |
9 | '''define the model name'''
10 | model_name = 'independent_ac'
11 |
12 | '''define the special property'''
13 | # independentArgs = namedtuple( 'independentArgs', [] )
14 | aux_args = AuxArgs[model_name]()
15 | alias = '_medium'
16 |
17 | '''define the scenario name'''
18 | scenario_name = 'traffic_junction'
19 |
20 | '''define the environment'''
21 | env = TrafficJunctionEnv()
22 | env = GymWrapper(env)
23 |
24 | MergeArgs = namedtuple('MergeArgs', Args._fields+AuxArgs[model_name]._fields)
25 |
26 | # under offline trainer if set batch_size=replay_buffer_size=update_freq -> epoch update
27 | args = Args(model_name=model_name,
28 | agent_num=env.get_num_of_agents(),
29 | hid_size=128,
30 | obs_size=np.max(env.get_shape_of_obs()),
31 | continuous=False,
32 | action_dim=np.max(env.get_output_shape_of_act()),
33 | init_std=0.1,
34 | policy_lrate=1e-4,
35 | value_lrate=1e-3,
36 | max_steps=50,
37 | batch_size=64,
38 | gamma=0.99,
39 | normalize_advantages=False,
40 | entr=1e-4,
41 | entr_inc=0.0,
42 | action_num=np.max(env.get_input_shape_of_act()),
43 | q_func=True,
44 | train_episodes_num=int(5e3),
45 | replay=True,
46 | replay_buffer_size=100,
47 | replay_warmup=0,
48 | cuda=True,
49 | grad_clip=True,
50 | save_model_freq=100,
51 | target=True,
52 | target_lr=1.0,
53 | behaviour_update_freq=25,
54 | critic_update_times=10,
55 | target_update_freq=50,
56 | gumbel_softmax=False,
57 | epsilon_softmax=False,
58 | online=True,
59 | reward_record_type='episode_mean_step',
60 | shared_parameters=False
61 | )
62 |
63 | args = MergeArgs(*(args+aux_args))
64 |
65 | log_name = scenario_name + '_' + model_name + alias
66 |
--------------------------------------------------------------------------------
/args/traffic_junction_independent_ddpg.py:
--------------------------------------------------------------------------------
1 | from collections import namedtuple
2 | from utilities.gym_wrapper import *
3 | import numpy as np
4 | from aux import *
5 | from environments.traffic_junction_env import TrafficJunctionEnv
6 |
7 |
8 |
9 | '''define the model name'''
10 | model_name = 'independent_ddpg'
11 |
12 | '''define the special property'''
13 | # independentArgs = namedtuple( 'independentArgs', [] )
14 | aux_args = AuxArgs[model_name]()
15 | alias = '_medium'
16 |
17 | '''define the scenario name'''
18 | scenario_name = 'traffic_junction'
19 |
20 | '''define the environment'''
21 | env = TrafficJunctionEnv()
22 | env = GymWrapper(env)
23 |
24 | MergeArgs = namedtuple('MergeArgs', Args._fields+AuxArgs[model_name]._fields)
25 |
26 | # under offline trainer if set batch_size=replay_buffer_size=update_freq -> epoch update
27 | args = Args(model_name=model_name,
28 | agent_num=env.get_num_of_agents(),
29 | hid_size=128,
30 | obs_size=np.max(env.get_shape_of_obs()),
31 | continuous=False,
32 | action_dim=np.max(env.get_output_shape_of_act()),
33 | init_std=0.1,
34 | policy_lrate=1e-4,
35 | value_lrate=1e-3,
36 | max_steps=50,
37 | batch_size=64,
38 | gamma=0.99,
39 | normalize_advantages=False,
40 | entr=1e-4,
41 | entr_inc=0.0,
42 | action_num=np.max(env.get_input_shape_of_act()),
43 | q_func=True,
44 | train_episodes_num=int(5e3),
45 | replay=True,
46 | replay_buffer_size=1e4,
47 | replay_warmup=0,
48 | cuda=True,
49 | grad_clip=True,
50 | save_model_freq=100,
51 | target=True,
52 | target_lr=1e-1,
53 | behaviour_update_freq=25,
54 | critic_update_times=10,
55 | target_update_freq=50,
56 | gumbel_softmax=True,
57 | epsilon_softmax=False,
58 | online=True,
59 | reward_record_type='episode_mean_step',
60 | shared_parameters=False
61 | )
62 |
63 | args = MergeArgs(*(args+aux_args))
64 |
65 | log_name = scenario_name + '_' + model_name + alias
66 |
--------------------------------------------------------------------------------
/args/traffic_junction_maddpg.py:
--------------------------------------------------------------------------------
1 | from collections import namedtuple
2 | from utilities.gym_wrapper import *
3 | import numpy as np
4 | from aux import *
5 | from environments.traffic_junction_env import TrafficJunctionEnv
6 |
7 |
8 |
9 | '''define the model name'''
10 | model_name = 'maddpg'
11 |
12 | '''define the special property'''
13 | # maddpgArgs = namedtuple( 'maddpgArgs', [] )
14 | aux_args = AuxArgs[model_name]() # maddpg
15 | alias = '_medium'
16 |
17 | '''define the scenario name'''
18 | scenario_name = 'traffic_junction'
19 |
20 | '''define the environment'''
21 | env = TrafficJunctionEnv()
22 | env = GymWrapper(env)
23 |
24 | MergeArgs = namedtuple('MergeArgs', Args._fields+AuxArgs[model_name]._fields)
25 |
26 | # under offline trainer if set batch_size=replay_buffer_size=update_freq -> epoch update
27 | args = Args(model_name=model_name,
28 | agent_num=env.get_num_of_agents(),
29 | hid_size=128,
30 | obs_size=np.max(env.get_shape_of_obs()),
31 | continuous=False,
32 | action_dim=np.max(env.get_output_shape_of_act()),
33 | init_std=0.1,
34 | policy_lrate=1e-4,
35 | value_lrate=1e-3,
36 | max_steps=50,
37 | batch_size=64,
38 | gamma=0.99,
39 | normalize_advantages=False,
40 | entr=1e-4,
41 | entr_inc=0.0,
42 | action_num=np.max(env.get_input_shape_of_act()),
43 | q_func=True,
44 | train_episodes_num=int(5e3),
45 | replay=True,
46 | replay_buffer_size=1e4,
47 | replay_warmup=0,
48 | cuda=True,
49 | grad_clip=True,
50 | save_model_freq=100,
51 | target=True,
52 | target_lr=1e-1,
53 | behaviour_update_freq=25,
54 | critic_update_times=10,
55 | target_update_freq=50,
56 | gumbel_softmax=True,
57 | epsilon_softmax=False,
58 | online=True,
59 | reward_record_type='episode_mean_step',
60 | shared_parameters=False
61 | )
62 |
63 | args = MergeArgs(*(args+aux_args))
64 |
65 | log_name = scenario_name + '_' + model_name + alias
66 |
--------------------------------------------------------------------------------
/args/traffic_junction_sqddpg.py:
--------------------------------------------------------------------------------
1 | from collections import namedtuple
2 | from utilities.gym_wrapper import *
3 | import numpy as np
4 | from aux import *
5 | from environments.traffic_junction_env import TrafficJunctionEnv
6 |
7 |
8 |
9 | '''define the model name'''
10 | model_name = 'sqddpg'
11 |
12 | '''define the special property'''
13 | # sqddpgArgs = namedtuple( 'sqddpgArgs', ['sample_size'] )
14 | aux_args = AuxArgs[model_name](1) # sqddpg
15 | alias = '_medium'
16 |
17 | '''define the scenario name'''
18 | scenario_name = 'traffic_junction'
19 |
20 | '''define the environment'''
21 | env = TrafficJunctionEnv()
22 | env = GymWrapper(env)
23 |
24 | MergeArgs = namedtuple('MergeArgs', Args._fields+AuxArgs[model_name]._fields)
25 |
26 | # under offline trainer if set batch_size=replay_buffer_size=update_freq -> epoch update
27 | args = Args(model_name=model_name,
28 | agent_num=env.get_num_of_agents(),
29 | hid_size=128,
30 | obs_size=np.max(env.get_shape_of_obs()),
31 | continuous=False,
32 | action_dim=np.max(env.get_output_shape_of_act()),
33 | init_std=0.1,
34 | policy_lrate=1e-4,
35 | value_lrate=1e-3,
36 | max_steps=50,
37 | batch_size=32,
38 | gamma=0.99,
39 | normalize_advantages=False,
40 | entr=1e-4,
41 | entr_inc=0.0,
42 | action_num=np.max(env.get_input_shape_of_act()),
43 | q_func=True,
44 | train_episodes_num=int(5e3),
45 | replay=True,
46 | replay_buffer_size=1e4,
47 | replay_warmup=0,
48 | cuda=True,
49 | grad_clip=True,
50 | save_model_freq=100,
51 | target=True,
52 | target_lr=0.1,
53 | behaviour_update_freq=25,
54 | critic_update_times=10,
55 | target_update_freq=50,
56 | gumbel_softmax=True,
57 | epsilon_softmax=False,
58 | online=True,
59 | reward_record_type='episode_mean_step',
60 | shared_parameters=False
61 | )
62 |
63 | args = MergeArgs(*(args+aux_args))
64 |
65 | log_name = scenario_name + '_' + model_name + alias
66 |
--------------------------------------------------------------------------------
/aux.py:
--------------------------------------------------------------------------------
1 | from collections import namedtuple
2 | from models.maddpg import *
3 | from models.sqddpg import *
4 | from models.independent_ac import *
5 | from models.independent_ddpg import *
6 | from models.coma_fc import *
7 |
8 |
9 |
10 | maddpgArgs = namedtuple( 'maddpgArgs', [] )
11 |
12 | randomArgs = namedtuple( 'randomArgs', [] )
13 |
14 | sqddpgArgs = namedtuple( 'sqddpgArgs', ['sample_size'] )
15 |
16 | independentArgs = namedtuple( 'independentArgs', [] )
17 |
18 | comafcArgs = namedtuple( 'comafcArgs', [] )
19 |
20 |
21 |
22 | Model = dict(maddpg=MADDPG,
23 | sqddpg=SQDDPG,
24 | independent_ac=IndependentAC,
25 | independent_ddpg=IndependentDDPG,
26 | coma_fc=COMAFC
27 | )
28 |
29 |
30 |
31 | AuxArgs = dict(maddpg=maddpgArgs,
32 | sqddpg=sqddpgArgs,
33 | independent_ac=independentArgs,
34 | independent_ddpg=independentArgs,
35 | coma_fc=comafcArgs
36 | )
37 |
38 |
39 |
40 | Strategy=dict(maddpg='pg',
41 | sqddpg='pg',
42 | independent_ac='pg',
43 | independent_ddpg='pg',
44 | coma_fc='pg'
45 | )
46 |
47 |
48 |
49 | Args = namedtuple('Args', ['model_name',
50 | 'agent_num',
51 | 'hid_size',
52 | 'obs_size',
53 | 'continuous',
54 | 'action_dim',
55 | 'init_std',
56 | 'policy_lrate',
57 | 'value_lrate',
58 | 'max_steps',
59 | 'batch_size', # steps<-online/episodes<-offline
60 | 'gamma',
61 | 'normalize_advantages',
62 | 'entr',
63 | 'entr_inc',
64 | 'action_num',
65 | 'q_func',
66 | 'train_episodes_num',
67 | 'replay',
68 | 'replay_buffer_size',
69 | 'replay_warmup',
70 | 'cuda',
71 | 'grad_clip',
72 | 'save_model_freq', # episodes
73 | 'target',
74 | 'target_lr',
75 | 'behaviour_update_freq', # steps<-online/episodes<-offline
76 | 'critic_update_times',
77 | 'target_update_freq', # steps<-online/episodes<-offline
78 | 'gumbel_softmax',
79 | 'epsilon_softmax',
80 | 'online',
81 | 'reward_record_type',
82 | 'shared_parameters' # boolean
83 | ]
84 | )
85 |
--------------------------------------------------------------------------------
/environments/multiagent_particle_envs/.gitignore:
--------------------------------------------------------------------------------
1 | __pycache__/
2 | *.egg-info/
3 | *.pyc
--------------------------------------------------------------------------------
/environments/multiagent_particle_envs/LICENSE.txt:
--------------------------------------------------------------------------------
1 | MIT License
2 |
3 | Copyright (c) 2018 OpenAI
4 |
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 |
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 |
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 |
--------------------------------------------------------------------------------
/environments/multiagent_particle_envs/README.md:
--------------------------------------------------------------------------------
1 | **Status:** Archive (code is provided as-is, no updates expected)
2 |
3 | # Multi-Agent Particle Environment
4 |
5 | A simple multi-agent particle world with a continuous observation and discrete action space, along with some basic simulated physics.
6 | Used in the paper [Multi-Agent Actor-Critic for Mixed Cooperative-Competitive Environments](https://arxiv.org/pdf/1706.02275.pdf).
7 |
8 | ## Getting started:
9 |
10 | - To install, `cd` into the root directory and type `pip install -e .`
11 |
12 | - To interactively view the "moving to landmark" scenario (see others in ./scenarios/):
13 | `bin/interactive.py --scenario simple.py`
14 |
15 | - Known dependencies: Python (3.5.4), OpenAI gym (0.10.5), numpy (1.14.5)
16 |
17 | - To use the environments, look at the code for importing them in `make_env.py`.
18 |
19 | ## Code structure
20 |
21 | - `make_env.py`: contains code for importing a multiagent environment as an OpenAI Gym-like object.
22 |
23 | - `./multiagent/environment.py`: contains code for environment simulation (interaction physics, `_step()` function, etc.)
24 |
25 | - `./multiagent/core.py`: contains classes for various objects (Entities, Landmarks, Agents, etc.) that are used throughout the code.
26 |
27 | - `./multiagent/rendering.py`: used for displaying agent behaviors on the screen.
28 |
29 | - `./multiagent/policy.py`: contains code for interactive policy based on keyboard input.
30 |
31 | - `./multiagent/scenario.py`: contains base scenario object that is extended for all scenarios.
32 |
33 | - `./multiagent/scenarios/`: folder where various scenarios/environments are stored. Scenario code consists of several functions:
34 | 1) `make_world()`: creates all of the entities that inhabit the world (landmarks, agents, etc.), assigns their capabilities (whether they can communicate, or move, or both).
35 | called once at the beginning of each training session
36 | 2) `reset_world()`: resets the world by assigning properties (position, color, etc.) to all entities in the world
37 | called before every episode (including after make_world() before the first episode)
38 | 3) `reward()`: defines the reward function for a given agent
39 | 4) `observation()`: defines the observation space of a given agent
40 | 5) (optional) `benchmark_data()`: provides diagnostic data for policies trained on the environment (e.g. evaluation metrics)
41 |
42 | ### Creating new environments
43 |
44 | You can create new scenarios by implementing the first 4 functions above (`make_world()`, `reset_world()`, `reward()`, and `observation()`).
45 |
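For illustration, a new scenario might follow the shape below. This is only a sketch modelled on the existing scenarios (e.g. `simple.py`); check `multiagent/core.py` for the exact entity attributes.

```python
import numpy as np

from multiagent.core import World, Agent, Landmark
from multiagent.scenario import BaseScenario


class Scenario(BaseScenario):
    def make_world(self):
        world = World()
        world.agents = [Agent() for _ in range(2)]
        world.landmarks = [Landmark() for _ in range(2)]
        for i, agent in enumerate(world.agents):
            agent.name = 'agent %d' % i
            agent.collide = False
            agent.silent = True
        for i, landmark in enumerate(world.landmarks):
            landmark.name = 'landmark %d' % i
            landmark.collide = False
            landmark.movable = False
        self.reset_world(world)
        return world

    def reset_world(self, world):
        # random initial positions, zero velocities and communication states
        for entity in world.agents + world.landmarks:
            entity.state.p_pos = np.random.uniform(-1, +1, world.dim_p)
            entity.state.p_vel = np.zeros(world.dim_p)
        for agent in world.agents:
            agent.state.c = np.zeros(world.dim_c)

    def reward(self, agent, world):
        # e.g. negative distance from the agent to the first landmark
        delta = agent.state.p_pos - world.landmarks[0].state.p_pos
        return -np.sqrt(np.sum(np.square(delta)))

    def observation(self, agent, world):
        # own velocity plus landmark positions relative to the agent
        rel_pos = [lm.state.p_pos - agent.state.p_pos for lm in world.landmarks]
        return np.concatenate([agent.state.p_vel] + rel_pos)
```
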
46 | ## List of environments
47 |
48 |
49 | | Env name in code (name in paper) | Communication? | Competitive? | Notes |
50 | | --- | --- | --- | --- |
51 | | `simple.py` | N | N | Single agent sees landmark position, rewarded based on how close it gets to landmark. Not a multiagent environment -- used for debugging policies. |
52 | | `simple_adversary.py` (Physical deception) | N | Y | 1 adversary (red), N good agents (green), N landmarks (usually N=2). All agents observe position of landmarks and other agents. One landmark is the ‘target landmark’ (colored green). Good agents rewarded based on how close one of them is to the target landmark, but negatively rewarded if the adversary is close to target landmark. Adversary is rewarded based on how close it is to the target, but it doesn’t know which landmark is the target landmark. So good agents have to learn to ‘split up’ and cover all landmarks to deceive the adversary. |
53 | | `simple_crypto.py` (Covert communication) | Y | Y | Two good agents (alice and bob), one adversary (eve). Alice must send a private message to bob over a public channel. Alice and bob are rewarded based on how well bob reconstructs the message, but negatively rewarded if eve can reconstruct the message. Alice and bob have a private key (randomly generated at the beginning of each episode), which they must learn to use to encrypt the message. |
54 | | `simple_push.py` (Keep-away) | N |Y | 1 agent, 1 adversary, 1 landmark. Agent is rewarded based on distance to landmark. Adversary is rewarded if it is close to the landmark, and if the agent is far from the landmark. So the adversary learns to push agent away from the landmark. |
55 | | `simple_reference.py` | Y | N | 2 agents, 3 landmarks of different colors. Each agent wants to get to its target landmark, which is known only by the other agent. Reward is collective. So agents have to learn to communicate the goal of the other agent, and navigate to their landmark. This is the same as the simple_speaker_listener scenario where both agents are simultaneous speakers and listeners. |
56 | | `simple_speaker_listener.py` (Cooperative communication) | Y | N | Same as simple_reference, except one agent is the ‘speaker’ (gray) that does not move (observes goal of other agent), and other agent is the listener (cannot speak, but must navigate to correct landmark).|
57 | | `simple_spread.py` (Cooperative navigation) | N | N | N agents, N landmarks. Agents are rewarded based on how far any agent is from each landmark. Agents are penalized if they collide with other agents. So, agents have to learn to cover all the landmarks while avoiding collisions. |
58 | | `simple_tag.py` (Predator-prey) | N | Y | Predator-prey environment. Good agents (green) are faster and want to avoid being hit by adversaries (red). Adversaries are slower and want to hit good agents. Obstacles (large black circles) block the way. |
59 | | `simple_world_comm.py` | Y | Y | Environment seen in the video accompanying the paper. Same as simple_tag, except (1) there is food (small blue balls) that the good agents are rewarded for being near, (2) we now have ‘forests’ that hide agents inside from being seen from outside; (3) there is a ‘leader adversary’ that can see the agents at all times, and can communicate with the other adversaries to help coordinate the chase. |
60 |
61 | ## Paper citation
62 |
63 | If you used this environment for your experiments or found it helpful, consider citing the following papers:
64 |
65 | Environments in this repo:
66 |
67 | @article{lowe2017multi,
68 | title={Multi-Agent Actor-Critic for Mixed Cooperative-Competitive Environments},
69 | author={Lowe, Ryan and Wu, Yi and Tamar, Aviv and Harb, Jean and Abbeel, Pieter and Mordatch, Igor},
70 | journal={Neural Information Processing Systems (NIPS)},
71 | year={2017}
72 | }
73 |
74 |
75 | Original particle world environment:
76 |
77 | @article{mordatch2017emergence,
78 | title={Emergence of Grounded Compositional Language in Multi-Agent Populations},
79 | author={Mordatch, Igor and Abbeel, Pieter},
80 | journal={arXiv preprint arXiv:1703.04908},
81 | year={2017}
82 | }
83 |
84 |
--------------------------------------------------------------------------------
/environments/multiagent_particle_envs/bin/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hsvgbkhgbv/SQDDPG/33eef74b3cbe207a340d65a65d6ff5be34acc86e/environments/multiagent_particle_envs/bin/__init__.py
--------------------------------------------------------------------------------
/environments/multiagent_particle_envs/bin/interactive.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python
2 | import os,sys
3 | sys.path.insert(1, os.path.join(sys.path[0], '..'))
4 | import argparse
5 |
6 | from multiagent.environment import MultiAgentEnv
7 | from multiagent.policy import InteractivePolicy
8 | import multiagent.scenarios as scenarios
9 |
10 | if __name__ == '__main__':
11 | # parse arguments
12 | parser = argparse.ArgumentParser(description=None)
13 | parser.add_argument('-s', '--scenario', default='simple.py', help='Path of the scenario Python script.')
14 | args = parser.parse_args()
15 |
16 | # load scenario from script
17 | scenario = scenarios.load(args.scenario).Scenario()
18 | # create world
19 | world = scenario.make_world()
20 | # create multiagent environment
21 | env = MultiAgentEnv(world, scenario.reset_world, scenario.reward, scenario.observation, info_callback=None, shared_viewer = False)
22 | # render call to create viewer window (necessary only for interactive policies)
23 | env.render()
24 | # create interactive policies for each agent
25 | policies = [InteractivePolicy(env,i) for i in range(env.n)]
26 | # execution loop
27 | obs_n = env.reset()
28 | while True:
29 | # query for action from each agent's policy
30 | act_n = []
31 | for i, policy in enumerate(policies):
32 | act_n.append(policy.action(obs_n[i]))
33 | # step environment
34 | obs_n, reward_n, done_n, _ = env.step(act_n)
35 | # render all agent views
36 | env.render()
37 | # display rewards
38 | #for agent in env.world.agents:
39 | # print(agent.name + " reward: %0.3f" % env._get_reward(agent))
40 |
--------------------------------------------------------------------------------
/environments/multiagent_particle_envs/make_env.py:
--------------------------------------------------------------------------------
1 | """
2 | Code for creating a multiagent environment with one of the scenarios listed
3 | in ./scenarios/.
4 | Can be called by using, for example:
5 | env = make_env('simple_speaker_listener')
6 | After producing the env object, can be used similarly to an OpenAI gym
7 | environment.
8 |
9 | A policy using this environment must output actions in the form of a list
10 | for all agents. Each element of the list should be a numpy array,
11 | of size (env.world.dim_p + env.world.dim_c, 1). Physical actions precede
12 | communication actions in this array. See environment.py for more details.
13 | """
14 |
15 | def make_env(scenario_name, benchmark=False):
16 | '''
17 | Creates a MultiAgentEnv object as env. This can be used similar to a gym
18 | environment by calling env.reset() and env.step().
19 | Use env.render() to view the environment on the screen.
20 |
21 | Input:
22 |         scenario_name : name of the scenario from ./scenarios/ to be loaded
23 | (without the .py extension)
24 | benchmark : whether you want to produce benchmarking data
25 | (usually only done during evaluation)
26 |
27 | Some useful env properties (see environment.py):
28 | .observation_space : Returns the observation space for each agent
29 | .action_space : Returns the action space for each agent
30 |         .n : Returns the number of agents
31 | '''
32 | from multiagent.environment import MultiAgentEnv
33 | import multiagent.scenarios as scenarios
34 |
35 | # load scenario from script
36 | scenario = scenarios.load(scenario_name + ".py").Scenario()
37 | # create world
38 | world = scenario.make_world()
39 | # create multiagent environment
40 | if benchmark:
41 | env = MultiAgentEnv(world, scenario.reset_world, scenario.reward, scenario.observation, scenario.benchmark_data)
42 | else:
43 | env = MultiAgentEnv(world, scenario.reset_world, scenario.reward, scenario.observation)
44 | return env
45 |
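The docstring above describes the expected action format. As a minimal usage sketch (not part of the repository, and assuming the script is run from the `multiagent_particle_envs` directory so that `make_env` and the `multiagent` package are importable), the following rolls out a few random steps of `simple_spread`, using the same discrete action encoding as `bin/interactive.py`: a 5-dimensional movement vector followed by `dim_c` communication entries.

```python
import numpy as np
from make_env import make_env

env = make_env('simple_spread')
obs_n = env.reset()
for _ in range(25):
    act_n = []
    for _ in range(env.n):
        u = np.zeros(5)                  # slot 0 is the no-move action
        u[np.random.randint(5)] = 1.0    # pick one movement direction at random
        act_n.append(np.concatenate([u, np.zeros(env.world.dim_c)]))
    obs_n, reward_n, done_n, _ = env.step(act_n)
```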
--------------------------------------------------------------------------------
/environments/multiagent_particle_envs/multiagent/__init__.py:
--------------------------------------------------------------------------------
1 | from gym.envs.registration import register
2 |
3 | # Multiagent envs
4 | # ----------------------------------------
5 |
6 | register(
7 | id='MultiagentSimple-v0',
8 | entry_point='multiagent.envs:SimpleEnv',
9 | # FIXME(cathywu) currently has to be exactly max_path_length parameters in
10 | # rllab run script
11 | max_episode_steps=100,
12 | )
13 |
14 | register(
15 | id='MultiagentSimpleSpeakerListener-v0',
16 | entry_point='multiagent.envs:SimpleSpeakerListenerEnv',
17 | max_episode_steps=100,
18 | )
19 |
--------------------------------------------------------------------------------
/environments/multiagent_particle_envs/multiagent/core.py:
--------------------------------------------------------------------------------
1 | import numpy as np
2 |
3 | # physical/external base state of all entities
4 | class EntityState(object):
5 | def __init__(self):
6 | # physical position
7 | self.p_pos = None
8 | # physical velocity
9 | self.p_vel = None
10 |
11 | # state of agents (including communication and internal/mental state)
12 | class AgentState(EntityState):
13 | def __init__(self):
14 | super(AgentState, self).__init__()
15 | # communication utterance
16 | self.c = None
17 |
18 | # action of the agent
19 | class Action(object):
20 | def __init__(self):
21 | # physical action
22 | self.u = None
23 | # communication action
24 | self.c = None
25 |
26 | # properties and state of physical world entity
27 | class Entity(object):
28 | def __init__(self):
29 | # name
30 | self.name = ''
31 | # properties:
32 | self.size = 0.050
33 | # entity can move / be pushed
34 | self.movable = False
35 | # entity collides with others
36 | self.collide = True
37 | # material density (affects mass)
38 | self.density = 25.0
39 | # color
40 | self.color = None
41 | # max speed and accel
42 | self.max_speed = None
43 | self.accel = None
44 | # state
45 | self.state = EntityState()
46 | # mass
47 | self.initial_mass = 1.0
48 |
49 | @property
50 | def mass(self):
51 | return self.initial_mass
52 |
53 | # properties of landmark entities
54 | class Landmark(Entity):
55 | def __init__(self):
56 | super(Landmark, self).__init__()
57 |
58 | # properties of agent entities
59 | class Agent(Entity):
60 | def __init__(self):
61 | super(Agent, self).__init__()
62 | # agents are movable by default
63 | self.movable = True
64 | # cannot send communication signals
65 | self.silent = False
66 | # cannot observe the world
67 | self.blind = False
68 | # physical motor noise amount
69 | self.u_noise = None
70 | # communication noise amount
71 | self.c_noise = None
72 | # control range
73 | self.u_range = 1.0
74 | # state
75 | self.state = AgentState()
76 | # action
77 | self.action = Action()
78 | # script behavior to execute
79 | self.action_callback = None
80 |
81 | # multi-agent world
82 | class World(object):
83 | def __init__(self):
84 | # list of agents and entities (can change at execution-time!)
85 | self.agents = []
86 | self.landmarks = []
87 | # communication channel dimensionality
88 | self.dim_c = 0
89 | # position dimensionality
90 | self.dim_p = 2
91 | # color dimensionality
92 | self.dim_color = 3
93 | # simulation timestep
94 | self.dt = 0.1
95 | # physical damping
96 | self.damping = 0.25
97 | # contact response parameters
98 | self.contact_force = 1e+2
99 | self.contact_margin = 1e-3
100 |
101 | # return all entities in the world
102 | @property
103 | def entities(self):
104 | return self.agents + self.landmarks
105 |
106 | # return all agents controllable by external policies
107 | @property
108 | def policy_agents(self):
109 | return [agent for agent in self.agents if agent.action_callback is None]
110 |
111 | # return all agents controlled by world scripts
112 | @property
113 | def scripted_agents(self):
114 | return [agent for agent in self.agents if agent.action_callback is not None]
115 |
116 | # update state of the world
117 | def step(self):
118 | # set actions for scripted agents
119 | for agent in self.scripted_agents:
120 | agent.action = agent.action_callback(agent, self)
121 | # gather forces applied to entities
122 | p_force = [None] * len(self.entities)
123 | # apply agent physical controls
124 | p_force = self.apply_action_force(p_force)
125 | # apply environment forces
126 | p_force = self.apply_environment_force(p_force)
127 | # integrate physical state
128 | self.integrate_state(p_force)
129 | # update agent state
130 | for agent in self.agents:
131 | self.update_agent_state(agent)
132 |
133 | # gather agent action forces
134 | def apply_action_force(self, p_force):
135 | # set applied forces
136 | for i,agent in enumerate(self.agents):
137 | if agent.movable:
138 | noise = np.random.randn(*agent.action.u.shape) * agent.u_noise if agent.u_noise else 0.0
139 | p_force[i] = agent.action.u + noise
140 | return p_force
141 |
142 | # gather physical forces acting on entities
143 | def apply_environment_force(self, p_force):
144 | # simple (but inefficient) collision response
145 | for a,entity_a in enumerate(self.entities):
146 | for b,entity_b in enumerate(self.entities):
147 | if(b <= a): continue
148 | [f_a, f_b] = self.get_collision_force(entity_a, entity_b)
149 | if(f_a is not None):
150 | if(p_force[a] is None): p_force[a] = 0.0
151 | p_force[a] = f_a + p_force[a]
152 | if(f_b is not None):
153 | if(p_force[b] is None): p_force[b] = 0.0
154 | p_force[b] = f_b + p_force[b]
155 | return p_force
156 |
157 | # integrate physical state
158 | def integrate_state(self, p_force):
159 | for i,entity in enumerate(self.entities):
160 | if not entity.movable: continue
161 | entity.state.p_vel = entity.state.p_vel * (1 - self.damping)
162 | if (p_force[i] is not None):
163 | entity.state.p_vel += (p_force[i] / entity.mass) * self.dt
164 | if entity.max_speed is not None:
165 | speed = np.sqrt(np.square(entity.state.p_vel[0]) + np.square(entity.state.p_vel[1]))
166 | if speed > entity.max_speed:
167 | entity.state.p_vel = entity.state.p_vel / np.sqrt(np.square(entity.state.p_vel[0]) +
168 | np.square(entity.state.p_vel[1])) * entity.max_speed
169 | entity.state.p_pos += entity.state.p_vel * self.dt
170 |
171 | def update_agent_state(self, agent):
172 | # set communication state (directly for now)
173 | if agent.silent:
174 | agent.state.c = np.zeros(self.dim_c)
175 | else:
176 | noise = np.random.randn(*agent.action.c.shape) * agent.c_noise if agent.c_noise else 0.0
177 | agent.state.c = agent.action.c + noise
178 |
179 | # get collision forces for any contact between two entities
180 | def get_collision_force(self, entity_a, entity_b):
181 | if (not entity_a.collide) or (not entity_b.collide):
182 | return [None, None] # not a collider
183 | if (entity_a is entity_b):
184 | return [None, None] # don't collide against itself
185 | # compute actual distance between entities
186 | delta_pos = entity_a.state.p_pos - entity_b.state.p_pos
187 | dist = np.sqrt(np.sum(np.square(delta_pos)))
188 | # minimum allowable distance
189 | dist_min = entity_a.size + entity_b.size
190 | # softmax penetration
191 | k = self.contact_margin
192 | penetration = np.logaddexp(0, -(dist - dist_min)/k)*k
193 | force = self.contact_force * delta_pos / dist * penetration
194 | force_a = +force if entity_a.movable else None
195 | force_b = -force if entity_b.movable else None
196 | return [force_a, force_b]
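To illustrate the pipeline in `World.step()` (scripted actions → action forces → collision forces → Euler integration → agent state update), here is a small sketch, not part of the repository and assuming the `multiagent` package is importable, that drives a bare `World` with two agents directly:

```python
import numpy as np
from multiagent.core import World, Agent

world = World()
world.agents = [Agent() for _ in range(2)]
for i, agent in enumerate(world.agents):
    agent.name = 'agent %d' % i
    agent.silent = True                       # no communication channel used
    agent.state.p_pos = np.array([0.1 * i, 0.0])
    agent.state.p_vel = np.zeros(world.dim_p)
    agent.state.c = np.zeros(world.dim_c)
    agent.action.u = np.array([1.0, 0.0])     # push both agents to the right

world.step()                                  # damping, forces, Euler integration
print([a.state.p_pos for a in world.agents])
```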
--------------------------------------------------------------------------------
/environments/multiagent_particle_envs/multiagent/multi_discrete.py:
--------------------------------------------------------------------------------
1 | # An old version of OpenAI Gym's multi_discrete.py. (Was getting affected by Gym updates)
2 | # (https://github.com/openai/gym/blob/1fb81d4e3fb780ccf77fec731287ba07da35eb84/gym/spaces/multi_discrete.py)
3 |
4 | import numpy as np
5 |
6 | import gym
7 | from gym.spaces import prng
8 |
9 | class MultiDiscrete(gym.Space):
10 | """
11 | - The multi-discrete action space consists of a series of discrete action spaces with different parameters
12 | - It can be adapted to both a Discrete action space or a continuous (Box) action space
13 | - It is useful to represent game controllers or keyboards where each key can be represented as a discrete action space
14 | - It is parametrized by passing an array of arrays containing [min, max] for each discrete action space
15 | where the discrete action space can take any integers from `min` to `max` (both inclusive)
16 |     Note: A value of 0 always needs to represent the NOOP action.
17 | e.g. Nintendo Game Controller
18 | - Can be conceptualized as 3 discrete action spaces:
19 | 1) Arrow Keys: Discrete 5 - NOOP[0], UP[1], RIGHT[2], DOWN[3], LEFT[4] - params: min: 0, max: 4
20 | 2) Button A: Discrete 2 - NOOP[0], Pressed[1] - params: min: 0, max: 1
21 | 3) Button B: Discrete 2 - NOOP[0], Pressed[1] - params: min: 0, max: 1
22 | - Can be initialized as
23 | MultiDiscrete([ [0,4], [0,1], [0,1] ])
24 | """
25 | def __init__(self, array_of_param_array):
26 | self.low = np.array([x[0] for x in array_of_param_array])
27 | self.high = np.array([x[1] for x in array_of_param_array])
28 | self.num_discrete_space = self.low.shape[0]
29 |
30 | def sample(self):
31 |         """ Returns an array with one sample from each discrete action space """
32 | # For each row: round(random .* (max - min) + min, 0)
33 | random_array = prng.np_random.rand(self.num_discrete_space)
34 | return [int(x) for x in np.floor(np.multiply((self.high - self.low + 1.), random_array) + self.low)]
35 | def contains(self, x):
36 | return len(x) == self.num_discrete_space and (np.array(x) >= self.low).all() and (np.array(x) <= self.high).all()
37 |
38 | @property
39 | def shape(self):
40 | return self.num_discrete_space
41 | def __repr__(self):
42 | return "MultiDiscrete" + str(self.num_discrete_space)
43 | def __eq__(self, other):
44 | return np.array_equal(self.low, other.low) and np.array_equal(self.high, other.high)
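A small sketch (an assumption, not part of the repository) of the Nintendo-controller parameterization from the docstring; note that importing this module, and in particular calling `sample()`, requires an older Gym release in which `gym.spaces.prng` still exists, as the header comment points out:

```python
from multiagent.multi_discrete import MultiDiscrete

space = MultiDiscrete([[0, 4], [0, 1], [0, 1]])   # arrow keys, button A, button B
print(space.shape)                 # 3 sub-spaces
print(space.contains([2, 0, 1]))   # True: RIGHT, A released, B pressed
print(space.contains([5, 0, 0]))   # False: 5 is outside the [0, 4] range
```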
--------------------------------------------------------------------------------
/environments/multiagent_particle_envs/multiagent/policy.py:
--------------------------------------------------------------------------------
1 | import numpy as np
2 | from pyglet.window import key
3 |
4 | # individual agent policy
5 | class Policy(object):
6 | def __init__(self):
7 | pass
8 | def action(self, obs):
9 | raise NotImplementedError()
10 |
11 | # interactive policy based on keyboard input
12 | # hard-coded to deal only with movement, not communication
13 | class InteractivePolicy(Policy):
14 | def __init__(self, env, agent_index):
15 | super(InteractivePolicy, self).__init__()
16 | self.env = env
17 | # hard-coded keyboard events
18 | self.move = [False for i in range(4)]
19 | self.comm = [False for i in range(env.world.dim_c)]
20 | # register keyboard events with this environment's window
21 | env.viewers[agent_index].window.on_key_press = self.key_press
22 | env.viewers[agent_index].window.on_key_release = self.key_release
23 |
24 | def action(self, obs):
25 | # ignore observation and just act based on keyboard events
26 | if self.env.discrete_action_input:
27 | u = 0
28 | if self.move[0]: u = 1
29 | if self.move[1]: u = 2
30 | if self.move[2]: u = 4
31 | if self.move[3]: u = 3
32 | else:
33 | u = np.zeros(5) # 5-d because of no-move action
34 | if self.move[0]: u[1] += 1.0
35 | if self.move[1]: u[2] += 1.0
36 | if self.move[3]: u[3] += 1.0
37 | if self.move[2]: u[4] += 1.0
38 | if True not in self.move:
39 | u[0] += 1.0
40 | return np.concatenate([u, np.zeros(self.env.world.dim_c)])
41 |
42 | # keyboard event callbacks
43 | def key_press(self, k, mod):
44 | if k==key.LEFT: self.move[0] = True
45 | if k==key.RIGHT: self.move[1] = True
46 | if k==key.UP: self.move[2] = True
47 | if k==key.DOWN: self.move[3] = True
48 | def key_release(self, k, mod):
49 | if k==key.LEFT: self.move[0] = False
50 | if k==key.RIGHT: self.move[1] = False
51 | if k==key.UP: self.move[2] = False
52 | if k==key.DOWN: self.move[3] = False
53 |
--------------------------------------------------------------------------------
/environments/multiagent_particle_envs/multiagent/rendering.py:
--------------------------------------------------------------------------------
1 | """
2 | 2D rendering framework
3 | """
4 | from __future__ import division
5 | import os
6 | import six
7 | import sys
8 |
9 | if "Apple" in sys.version:
10 | if 'DYLD_FALLBACK_LIBRARY_PATH' in os.environ:
11 | os.environ['DYLD_FALLBACK_LIBRARY_PATH'] += ':/usr/lib'
12 | # (JDS 2016/04/15): avoid bug on Anaconda 2.3.0 / Yosemite
13 |
14 | from gym.utils import reraise
15 | from gym import error
16 |
17 | try:
18 | import pyglet
19 | except ImportError as e:
20 | reraise(suffix="HINT: you can install pyglet directly via 'pip install pyglet'. But if you really just want to install all Gym dependencies and not have to think about it, 'pip install -e .[all]' or 'pip install gym[all]' will do it.")
21 |
22 | try:
23 | from pyglet.gl import *
24 | except ImportError as e:
25 |     reraise(prefix="Error occurred while running `from pyglet.gl import *`",suffix="HINT: make sure you have OpenGL installed. On Ubuntu, you can run 'apt-get install python-opengl'. If you're running on a server, you may need a virtual frame buffer; something like this should work: 'xvfb-run -s \"-screen 0 1400x900x24\" python '")
26 |
27 | import math
28 | import numpy as np
29 |
30 | RAD2DEG = 57.29577951308232
31 |
32 | def get_display(spec):
33 | """Convert a display specification (such as :0) into an actual Display
34 | object.
35 |
36 | Pyglet only supports multiple Displays on Linux.
37 | """
38 | if spec is None:
39 | return None
40 | elif isinstance(spec, six.string_types):
41 | return pyglet.canvas.Display(spec)
42 | else:
43 | raise error.Error('Invalid display specification: {}. (Must be a string like :0 or None.)'.format(spec))
44 |
45 | class Viewer(object):
46 | def __init__(self, width, height, display=None):
47 | display = get_display(display)
48 |
49 | self.width = width
50 | self.height = height
51 |
52 | self.window = pyglet.window.Window(width=width, height=height, display=display)
53 | self.window.on_close = self.window_closed_by_user
54 | self.geoms = []
55 | self.onetime_geoms = []
56 | self.transform = Transform()
57 |
58 | glEnable(GL_BLEND)
59 | # glEnable(GL_MULTISAMPLE)
60 | glEnable(GL_LINE_SMOOTH)
61 | # glHint(GL_LINE_SMOOTH_HINT, GL_DONT_CARE)
62 | glHint(GL_LINE_SMOOTH_HINT, GL_NICEST)
63 | glLineWidth(2.0)
64 | glBlendFunc(GL_SRC_ALPHA, GL_ONE_MINUS_SRC_ALPHA)
65 |
66 | def close(self):
67 | self.window.close()
68 |
69 | def window_closed_by_user(self):
70 | self.close()
71 |
72 | def set_bounds(self, left, right, bottom, top):
73 | assert right > left and top > bottom
74 | scalex = self.width/(right-left)
75 | scaley = self.height/(top-bottom)
76 | self.transform = Transform(
77 | translation=(-left*scalex, -bottom*scaley),
78 | scale=(scalex, scaley))
79 |
80 | def add_geom(self, geom):
81 | self.geoms.append(geom)
82 |
83 | def add_onetime(self, geom):
84 | self.onetime_geoms.append(geom)
85 |
86 | def render(self, return_rgb_array=False):
87 | glClearColor(1,1,1,1)
88 | self.window.clear()
89 | self.window.switch_to()
90 | self.window.dispatch_events()
91 | self.transform.enable()
92 | for geom in self.geoms:
93 | geom.render()
94 | for geom in self.onetime_geoms:
95 | geom.render()
96 | self.transform.disable()
97 | arr = None
98 | if return_rgb_array:
99 | buffer = pyglet.image.get_buffer_manager().get_color_buffer()
100 | image_data = buffer.get_image_data()
101 | arr = np.fromstring(image_data.data, dtype=np.uint8, sep='')
102 | # In https://github.com/openai/gym-http-api/issues/2, we
103 | # discovered that someone using Xmonad on Arch was having
104 | # a window of size 598 x 398, though a 600 x 400 window
105 | # was requested. (Guess Xmonad was preserving a pixel for
106 | # the boundary.) So we use the buffer height/width rather
107 | # than the requested one.
108 | arr = arr.reshape(buffer.height, buffer.width, 4)
109 | arr = arr[::-1,:,0:3]
110 | self.window.flip()
111 | self.onetime_geoms = []
112 | return arr
113 |
114 | # Convenience
115 | def draw_circle(self, radius=10, res=30, filled=True, **attrs):
116 | geom = make_circle(radius=radius, res=res, filled=filled)
117 | _add_attrs(geom, attrs)
118 | self.add_onetime(geom)
119 | return geom
120 |
121 | def draw_polygon(self, v, filled=True, **attrs):
122 | geom = make_polygon(v=v, filled=filled)
123 | _add_attrs(geom, attrs)
124 | self.add_onetime(geom)
125 | return geom
126 |
127 | def draw_polyline(self, v, **attrs):
128 | geom = make_polyline(v=v)
129 | _add_attrs(geom, attrs)
130 | self.add_onetime(geom)
131 | return geom
132 |
133 | def draw_line(self, start, end, **attrs):
134 | geom = Line(start, end)
135 | _add_attrs(geom, attrs)
136 | self.add_onetime(geom)
137 | return geom
138 |
139 | def get_array(self):
140 | self.window.flip()
141 | image_data = pyglet.image.get_buffer_manager().get_color_buffer().get_image_data()
142 | self.window.flip()
143 | arr = np.fromstring(image_data.data, dtype=np.uint8, sep='')
144 | arr = arr.reshape(self.height, self.width, 4)
145 | return arr[::-1,:,0:3]
146 |
147 | def _add_attrs(geom, attrs):
148 | if "color" in attrs:
149 | geom.set_color(*attrs["color"])
150 | if "linewidth" in attrs:
151 | geom.set_linewidth(attrs["linewidth"])
152 |
153 | class Geom(object):
154 | def __init__(self):
155 | self._color=Color((0, 0, 0, 1.0))
156 | self.attrs = [self._color]
157 | def render(self):
158 | for attr in reversed(self.attrs):
159 | attr.enable()
160 | self.render1()
161 | for attr in self.attrs:
162 | attr.disable()
163 | def render1(self):
164 | raise NotImplementedError
165 | def add_attr(self, attr):
166 | self.attrs.append(attr)
167 | def set_color(self, r, g, b, alpha=1):
168 | self._color.vec4 = (r, g, b, alpha)
169 |
170 | class Attr(object):
171 | def enable(self):
172 | raise NotImplementedError
173 | def disable(self):
174 | pass
175 |
176 | class Transform(Attr):
177 | def __init__(self, translation=(0.0, 0.0), rotation=0.0, scale=(1,1)):
178 | self.set_translation(*translation)
179 | self.set_rotation(rotation)
180 | self.set_scale(*scale)
181 | def enable(self):
182 | glPushMatrix()
183 |         glTranslatef(self.translation[0], self.translation[1], 0) # translate to GL location point
184 | glRotatef(RAD2DEG * self.rotation, 0, 0, 1.0)
185 | glScalef(self.scale[0], self.scale[1], 1)
186 | def disable(self):
187 | glPopMatrix()
188 | def set_translation(self, newx, newy):
189 | self.translation = (float(newx), float(newy))
190 | def set_rotation(self, new):
191 | self.rotation = float(new)
192 | def set_scale(self, newx, newy):
193 | self.scale = (float(newx), float(newy))
194 |
195 | class Color(Attr):
196 | def __init__(self, vec4):
197 | self.vec4 = vec4
198 | def enable(self):
199 | glColor4f(*self.vec4)
200 |
201 | class LineStyle(Attr):
202 | def __init__(self, style):
203 | self.style = style
204 | def enable(self):
205 | glEnable(GL_LINE_STIPPLE)
206 | glLineStipple(1, self.style)
207 | def disable(self):
208 | glDisable(GL_LINE_STIPPLE)
209 |
210 | class LineWidth(Attr):
211 | def __init__(self, stroke):
212 | self.stroke = stroke
213 | def enable(self):
214 | glLineWidth(self.stroke)
215 |
216 | class Point(Geom):
217 | def __init__(self):
218 | Geom.__init__(self)
219 | def render1(self):
220 | glBegin(GL_POINTS) # draw point
221 | glVertex3f(0.0, 0.0, 0.0)
222 | glEnd()
223 |
224 | class FilledPolygon(Geom):
225 | def __init__(self, v):
226 | Geom.__init__(self)
227 | self.v = v
228 | def render1(self):
229 | if len(self.v) == 4 : glBegin(GL_QUADS)
230 | elif len(self.v) > 4 : glBegin(GL_POLYGON)
231 | else: glBegin(GL_TRIANGLES)
232 | for p in self.v:
233 | glVertex3f(p[0], p[1],0) # draw each vertex
234 | glEnd()
235 |
236 | color = (self._color.vec4[0] * 0.5, self._color.vec4[1] * 0.5, self._color.vec4[2] * 0.5, self._color.vec4[3] * 0.5)
237 | glColor4f(*color)
238 | glBegin(GL_LINE_LOOP)
239 | for p in self.v:
240 | glVertex3f(p[0], p[1],0) # draw each vertex
241 | glEnd()
242 |
243 | def make_circle(radius=10, res=30, filled=True):
244 | points = []
245 | for i in range(res):
246 | ang = 2*math.pi*i / res
247 | points.append((math.cos(ang)*radius, math.sin(ang)*radius))
248 | if filled:
249 | return FilledPolygon(points)
250 | else:
251 | return PolyLine(points, True)
252 |
253 | def make_polygon(v, filled=True):
254 | if filled: return FilledPolygon(v)
255 | else: return PolyLine(v, True)
256 |
257 | def make_polyline(v):
258 | return PolyLine(v, False)
259 |
260 | def make_capsule(length, width):
261 | l, r, t, b = 0, length, width/2, -width/2
262 | box = make_polygon([(l,b), (l,t), (r,t), (r,b)])
263 | circ0 = make_circle(width/2)
264 | circ1 = make_circle(width/2)
265 | circ1.add_attr(Transform(translation=(length, 0)))
266 | geom = Compound([box, circ0, circ1])
267 | return geom
268 |
269 | class Compound(Geom):
270 | def __init__(self, gs):
271 | Geom.__init__(self)
272 | self.gs = gs
273 | for g in self.gs:
274 | g.attrs = [a for a in g.attrs if not isinstance(a, Color)]
275 | def render1(self):
276 | for g in self.gs:
277 | g.render()
278 |
279 | class PolyLine(Geom):
280 | def __init__(self, v, close):
281 | Geom.__init__(self)
282 | self.v = v
283 | self.close = close
284 | self.linewidth = LineWidth(1)
285 | self.add_attr(self.linewidth)
286 | def render1(self):
287 | glBegin(GL_LINE_LOOP if self.close else GL_LINE_STRIP)
288 | for p in self.v:
289 | glVertex3f(p[0], p[1],0) # draw each vertex
290 | glEnd()
291 | def set_linewidth(self, x):
292 | self.linewidth.stroke = x
293 |
294 | class Line(Geom):
295 | def __init__(self, start=(0.0, 0.0), end=(0.0, 0.0)):
296 | Geom.__init__(self)
297 | self.start = start
298 | self.end = end
299 | self.linewidth = LineWidth(1)
300 | self.add_attr(self.linewidth)
301 |
302 | def render1(self):
303 | glBegin(GL_LINES)
304 | glVertex2f(*self.start)
305 | glVertex2f(*self.end)
306 | glEnd()
307 |
308 | class Image(Geom):
309 | def __init__(self, fname, width, height):
310 | Geom.__init__(self)
311 | self.width = width
312 | self.height = height
313 | img = pyglet.image.load(fname)
314 | self.img = img
315 | self.flip = False
316 | def render1(self):
317 | self.img.blit(-self.width/2, -self.height/2, width=self.width, height=self.height)
318 |
319 | # ================================================================
320 |
321 | class SimpleImageViewer(object):
322 | def __init__(self, display=None):
323 | self.window = None
324 | self.isopen = False
325 | self.display = display
326 | def imshow(self, arr):
327 | if self.window is None:
328 | height, width, channels = arr.shape
329 | self.window = pyglet.window.Window(width=width, height=height, display=self.display)
330 | self.width = width
331 | self.height = height
332 | self.isopen = True
333 |         assert arr.shape == (self.height, self.width, 3), "You passed in an image with the wrong shape"
334 | image = pyglet.image.ImageData(self.width, self.height, 'RGB', arr.tobytes(), pitch=self.width * -3)
335 | self.window.clear()
336 | self.window.switch_to()
337 | self.window.dispatch_events()
338 | image.blit(0,0)
339 | self.window.flip()
340 | def close(self):
341 | if self.isopen:
342 | self.window.close()
343 | self.isopen = False
344 | def __del__(self):
345 | self.close()
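For orientation, a minimal sketch (not part of the repository; it assumes a display is available, pyglet is installed, and a Gym version that still provides `gym.utils.reraise`) that opens a `Viewer` and draws a single filled circle:

```python
from multiagent import rendering

viewer = rendering.Viewer(400, 400)          # 400x400 pixel window
viewer.set_bounds(-1, 1, -1, 1)              # world coordinates in [-1, 1]^2
circle = rendering.make_circle(radius=0.1)   # FilledPolygon approximating a circle
circle.set_color(0.35, 0.35, 0.85)
circle.add_attr(rendering.Transform(translation=(0.0, 0.0)))
viewer.add_geom(circle)
viewer.render()                              # pass return_rgb_array=True to get pixels back
```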
--------------------------------------------------------------------------------
/environments/multiagent_particle_envs/multiagent/scenario.py:
--------------------------------------------------------------------------------
1 | import numpy as np
2 |
3 | # defines scenario upon which the world is built
4 | class BaseScenario(object):
5 | # create elements of the world
6 | def make_world(self):
7 | raise NotImplementedError()
8 | # create initial conditions of the world
9 | def reset_world(self, world):
10 | raise NotImplementedError()
11 |
--------------------------------------------------------------------------------
/environments/multiagent_particle_envs/multiagent/scenarios/__init__.py:
--------------------------------------------------------------------------------
1 | import imp
2 | import os.path as osp
3 |
4 |
5 | def load(name):
6 | pathname = osp.join(osp.dirname(__file__), name)
7 | return imp.load_source('', pathname)
8 |
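`imp` has been deprecated since Python 3.4 and removed in Python 3.12; it still works on the Python 3.5 setup this repository targets, but an equivalent `load()` based on `importlib` might look like the following sketch (an assumption, not part of the repository):

```python
import importlib.util
import os.path as osp


def load(name):
    # load a scenario module (e.g. 'simple_spread.py') from this directory
    pathname = osp.join(osp.dirname(__file__), name)
    spec = importlib.util.spec_from_file_location(osp.splitext(name)[0], pathname)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module
```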
--------------------------------------------------------------------------------
/environments/multiagent_particle_envs/multiagent/scenarios/simple.py:
--------------------------------------------------------------------------------
1 | import numpy as np
2 | from multiagent.core import World, Agent, Landmark
3 | from multiagent.scenario import BaseScenario
4 |
5 | class Scenario(BaseScenario):
6 | def make_world(self):
7 | world = World()
8 | # add agents
9 | world.agents = [Agent() for i in range(1)]
10 | for i, agent in enumerate(world.agents):
11 | agent.name = 'agent %d' % i
12 | agent.collide = False
13 | agent.silent = True
14 | # add landmarks
15 | world.landmarks = [Landmark() for i in range(1)]
16 | for i, landmark in enumerate(world.landmarks):
17 | landmark.name = 'landmark %d' % i
18 | landmark.collide = False
19 | landmark.movable = False
20 | # make initial conditions
21 | self.reset_world(world)
22 | return world
23 |
24 | def reset_world(self, world):
25 | # random properties for agents
26 | for i, agent in enumerate(world.agents):
27 | agent.color = np.array([0.25,0.25,0.25])
28 | # random properties for landmarks
29 | for i, landmark in enumerate(world.landmarks):
30 | landmark.color = np.array([0.75,0.75,0.75])
31 | world.landmarks[0].color = np.array([0.75,0.25,0.25])
32 | # set random initial states
33 | for agent in world.agents:
34 | agent.state.p_pos = np.random.uniform(-1,+1, world.dim_p)
35 | agent.state.p_vel = np.zeros(world.dim_p)
36 | agent.state.c = np.zeros(world.dim_c)
37 | for i, landmark in enumerate(world.landmarks):
38 | landmark.state.p_pos = np.random.uniform(-1,+1, world.dim_p)
39 | landmark.state.p_vel = np.zeros(world.dim_p)
40 |
41 | def reward(self, agent, world):
42 | dist2 = np.sum(np.square(agent.state.p_pos - world.landmarks[0].state.p_pos))
43 | return -dist2
44 |
45 | def observation(self, agent, world):
46 | # get positions of all entities in this agent's reference frame
47 | entity_pos = []
48 | for entity in world.landmarks:
49 | entity_pos.append(entity.state.p_pos - agent.state.p_pos)
50 | return np.concatenate([agent.state.p_vel] + entity_pos)
51 |
--------------------------------------------------------------------------------
/environments/multiagent_particle_envs/multiagent/scenarios/simple_adversary.py:
--------------------------------------------------------------------------------
1 | import numpy as np
2 | from multiagent.core import World, Agent, Landmark
3 | from multiagent.scenario import BaseScenario
4 |
5 |
6 | class Scenario(BaseScenario):
7 |
8 | def make_world(self):
9 | world = World()
10 | # set any world properties first
11 | world.dim_c = 2
12 | num_agents = 3
13 | world.num_agents = num_agents
14 | num_adversaries = 1
15 | num_landmarks = num_agents - 1
16 | # add agents
17 | world.agents = [Agent() for i in range(num_agents)]
18 | for i, agent in enumerate(world.agents):
19 | agent.name = 'agent %d' % i
20 | agent.collide = False
21 | agent.silent = True
22 | agent.adversary = True if i < num_adversaries else False
23 | agent.size = 0.15
24 | # add landmarks
25 | world.landmarks = [Landmark() for i in range(num_landmarks)]
26 | for i, landmark in enumerate(world.landmarks):
27 | landmark.name = 'landmark %d' % i
28 | landmark.collide = False
29 | landmark.movable = False
30 | landmark.size = 0.08
31 | # make initial conditions
32 | self.reset_world(world)
33 | return world
34 |
35 | def reset_world(self, world):
36 | # random properties for agents
37 | world.agents[0].color = np.array([0.85, 0.35, 0.35])
38 | for i in range(1, world.num_agents):
39 | world.agents[i].color = np.array([0.35, 0.35, 0.85])
40 | # random properties for landmarks
41 | for i, landmark in enumerate(world.landmarks):
42 | landmark.color = np.array([0.15, 0.15, 0.15])
43 | # set goal landmark
44 | goal = np.random.choice(world.landmarks)
45 | goal.color = np.array([0.15, 0.65, 0.15])
46 | for agent in world.agents:
47 | agent.goal_a = goal
48 | # set random initial states
49 | for agent in world.agents:
50 | agent.state.p_pos = np.random.uniform(-1, +1, world.dim_p)
51 | agent.state.p_vel = np.zeros(world.dim_p)
52 | agent.state.c = np.zeros(world.dim_c)
53 | for i, landmark in enumerate(world.landmarks):
54 | landmark.state.p_pos = np.random.uniform(-1, +1, world.dim_p)
55 | landmark.state.p_vel = np.zeros(world.dim_p)
56 |
57 | def benchmark_data(self, agent, world):
58 | # returns data for benchmarking purposes
59 | if agent.adversary:
60 | return np.sum(np.square(agent.state.p_pos - agent.goal_a.state.p_pos))
61 | else:
62 | dists = []
63 | for l in world.landmarks:
64 | dists.append(np.sum(np.square(agent.state.p_pos - l.state.p_pos)))
65 | dists.append(np.sum(np.square(agent.state.p_pos - agent.goal_a.state.p_pos)))
66 | return tuple(dists)
67 |
68 | # return all agents that are not adversaries
69 | def good_agents(self, world):
70 | return [agent for agent in world.agents if not agent.adversary]
71 |
72 | # return all adversarial agents
73 | def adversaries(self, world):
74 | return [agent for agent in world.agents if agent.adversary]
75 |
76 | def reward(self, agent, world):
77 |         # Adversaries and good agents use different reward functions
78 | return self.adversary_reward(agent, world) if agent.adversary else self.agent_reward(agent, world)
79 |
80 | def agent_reward(self, agent, world):
81 | # Rewarded based on how close any good agent is to the goal landmark, and how far the adversary is from it
82 | shaped_reward = True
83 | shaped_adv_reward = True
84 |
85 | # Calculate negative reward for adversary
86 | adversary_agents = self.adversaries(world)
87 | if shaped_adv_reward: # distance-based adversary reward
88 | adv_rew = sum([np.sqrt(np.sum(np.square(a.state.p_pos - a.goal_a.state.p_pos))) for a in adversary_agents])
89 | else: # proximity-based adversary reward (binary)
90 | adv_rew = 0
91 | for a in adversary_agents:
92 | if np.sqrt(np.sum(np.square(a.state.p_pos - a.goal_a.state.p_pos))) < 2 * a.goal_a.size:
93 | adv_rew -= 5
94 |
95 | # Calculate positive reward for agents
96 | good_agents = self.good_agents(world)
97 | if shaped_reward: # distance-based agent reward
98 | pos_rew = -min(
99 | [np.sqrt(np.sum(np.square(a.state.p_pos - a.goal_a.state.p_pos))) for a in good_agents])
100 | else: # proximity-based agent reward (binary)
101 | pos_rew = 0
102 | if min([np.sqrt(np.sum(np.square(a.state.p_pos - a.goal_a.state.p_pos))) for a in good_agents]) \
103 | < 2 * agent.goal_a.size:
104 | pos_rew += 5
105 | pos_rew -= min(
106 | [np.sqrt(np.sum(np.square(a.state.p_pos - a.goal_a.state.p_pos))) for a in good_agents])
107 | return pos_rew + adv_rew
108 |
109 | def adversary_reward(self, agent, world):
110 | # Rewarded based on proximity to the goal landmark
111 | shaped_reward = True
112 | if shaped_reward: # distance-based reward
113 | return -np.sum(np.square(agent.state.p_pos - agent.goal_a.state.p_pos))
114 | else: # proximity-based reward (binary)
115 | adv_rew = 0
116 | if np.sqrt(np.sum(np.square(agent.state.p_pos - agent.goal_a.state.p_pos))) < 2 * agent.goal_a.size:
117 | adv_rew += 5
118 | return adv_rew
119 |
120 |
121 | def observation(self, agent, world):
122 | # get positions of all entities in this agent's reference frame
123 | entity_pos = []
124 | for entity in world.landmarks:
125 | entity_pos.append(entity.state.p_pos - agent.state.p_pos)
126 | # entity colors
127 | entity_color = []
128 | for entity in world.landmarks:
129 | entity_color.append(entity.color)
130 | # communication of all other agents
131 | other_pos = []
132 | for other in world.agents:
133 | if other is agent: continue
134 | other_pos.append(other.state.p_pos - agent.state.p_pos)
135 |
136 | if not agent.adversary:
137 | return np.concatenate([agent.goal_a.state.p_pos - agent.state.p_pos] + entity_pos + other_pos)
138 | else:
139 | return np.concatenate(entity_pos + other_pos)
140 |
--------------------------------------------------------------------------------
/environments/multiagent_particle_envs/multiagent/scenarios/simple_crypto.py:
--------------------------------------------------------------------------------
1 | """
2 | Scenario:
3 | 1 speaker, 2 listeners (one of which is an adversary). Good agents are rewarded if the listener can reconstruct the
4 | speaker's goal color while the adversary cannot; the adversary is rewarded for reconstructing the goal color itself.
5 | """
6 |
7 |
8 | import numpy as np
9 | from multiagent.core import World, Agent, Landmark
10 | from multiagent.scenario import BaseScenario
11 | import random
12 |
13 |
14 | class CryptoAgent(Agent):
15 | def __init__(self):
16 | super(CryptoAgent, self).__init__()
17 | self.key = None
18 |
19 | class Scenario(BaseScenario):
20 |
21 | def make_world(self):
22 | world = World()
23 | # set any world properties first
24 | num_agents = 3
25 | num_adversaries = 1
26 | num_landmarks = 2
27 | world.dim_c = 4
28 | # add agents
29 | world.agents = [CryptoAgent() for i in range(num_agents)]
30 | for i, agent in enumerate(world.agents):
31 | agent.name = 'agent %d' % i
32 | agent.collide = False
33 | agent.adversary = True if i < num_adversaries else False
34 | agent.speaker = True if i == 2 else False
35 | agent.movable = False
36 | # add landmarks
37 | world.landmarks = [Landmark() for i in range(num_landmarks)]
38 | for i, landmark in enumerate(world.landmarks):
39 | landmark.name = 'landmark %d' % i
40 | landmark.collide = False
41 | landmark.movable = False
42 | # make initial conditions
43 | self.reset_world(world)
44 | return world
45 |
46 |
47 | def reset_world(self, world):
48 | # random properties for agents
49 | for i, agent in enumerate(world.agents):
50 | agent.color = np.array([0.25, 0.25, 0.25])
51 | if agent.adversary:
52 | agent.color = np.array([0.75, 0.25, 0.25])
53 | agent.key = None
54 | # random properties for landmarks
55 | color_list = [np.zeros(world.dim_c) for i in world.landmarks]
56 | for i, color in enumerate(color_list):
57 | color[i] += 1
58 | for color, landmark in zip(color_list, world.landmarks):
59 | landmark.color = color
60 | # set goal landmark
61 | goal = np.random.choice(world.landmarks)
62 | world.agents[1].color = goal.color
63 | world.agents[2].key = np.random.choice(world.landmarks).color
64 |
65 | for agent in world.agents:
66 | agent.goal_a = goal
67 |
68 | # set random initial states
69 | for agent in world.agents:
70 | agent.state.p_pos = np.random.uniform(-1, +1, world.dim_p)
71 | agent.state.p_vel = np.zeros(world.dim_p)
72 | agent.state.c = np.zeros(world.dim_c)
73 | for i, landmark in enumerate(world.landmarks):
74 | landmark.state.p_pos = np.random.uniform(-1, +1, world.dim_p)
75 | landmark.state.p_vel = np.zeros(world.dim_p)
76 |
77 |
78 | def benchmark_data(self, agent, world):
79 | # returns data for benchmarking purposes
80 | return (agent.state.c, agent.goal_a.color)
81 |
82 | # return all agents that are not adversaries
83 | def good_listeners(self, world):
84 | return [agent for agent in world.agents if not agent.adversary and not agent.speaker]
85 |
86 | # return all agents that are not adversaries
87 | def good_agents(self, world):
88 | return [agent for agent in world.agents if not agent.adversary]
89 |
90 | # return all adversarial agents
91 | def adversaries(self, world):
92 | return [agent for agent in world.agents if agent.adversary]
93 |
94 | def reward(self, agent, world):
95 | return self.adversary_reward(agent, world) if agent.adversary else self.agent_reward(agent, world)
96 |
97 | def agent_reward(self, agent, world):
98 | # Agents rewarded if Bob can reconstruct message, but adversary (Eve) cannot
99 | good_listeners = self.good_listeners(world)
100 | adversaries = self.adversaries(world)
101 | good_rew = 0
102 | adv_rew = 0
103 | for a in good_listeners:
104 | if (a.state.c == np.zeros(world.dim_c)).all():
105 | continue
106 | else:
107 | good_rew -= np.sum(np.square(a.state.c - agent.goal_a.color))
108 | for a in adversaries:
109 | if (a.state.c == np.zeros(world.dim_c)).all():
110 | continue
111 | else:
112 | adv_l1 = np.sum(np.square(a.state.c - agent.goal_a.color))
113 | adv_rew += adv_l1
114 | return adv_rew + good_rew
115 |
116 | def adversary_reward(self, agent, world):
117 | # Adversary (Eve) is rewarded if it can reconstruct original goal
118 | rew = 0
119 | if not (agent.state.c == np.zeros(world.dim_c)).all():
120 | rew -= np.sum(np.square(agent.state.c - agent.goal_a.color))
121 | return rew
122 |
123 |
124 | def observation(self, agent, world):
125 | # goal color
126 | goal_color = np.zeros(world.dim_color)
127 | if agent.goal_a is not None:
128 | goal_color = agent.goal_a.color
129 |
130 | # get positions of all entities in this agent's reference frame
131 | entity_pos = []
132 | for entity in world.landmarks:
133 | entity_pos.append(entity.state.p_pos - agent.state.p_pos)
134 | # communication of all other agents
135 | comm = []
136 | for other in world.agents:
137 | if other is agent or (other.state.c is None) or not other.speaker: continue
138 | comm.append(other.state.c)
139 |
140 | confer = np.array([0])
141 |
142 | if world.agents[2].key is None:
143 | confer = np.array([1])
144 | key = np.zeros(world.dim_c)
145 | goal_color = np.zeros(world.dim_c)
146 | else:
147 | key = world.agents[2].key
148 |
149 | prnt = False
150 | # speaker
151 | if agent.speaker:
152 | if prnt:
153 | print('speaker')
154 | print(agent.state.c)
155 | print(np.concatenate([goal_color] + [key] + [confer] + [np.random.randn(1)]))
156 | return np.concatenate([goal_color] + [key])
157 | # listener
158 | if not agent.speaker and not agent.adversary:
159 | if prnt:
160 | print('listener')
161 | print(agent.state.c)
162 | print(np.concatenate([key] + comm + [confer]))
163 | return np.concatenate([key] + comm)
164 | if not agent.speaker and agent.adversary:
165 | if prnt:
166 | print('adversary')
167 | print(agent.state.c)
168 | print(np.concatenate(comm + [confer]))
169 | return np.concatenate(comm)
170 |
--------------------------------------------------------------------------------
/environments/multiagent_particle_envs/multiagent/scenarios/simple_push.py:
--------------------------------------------------------------------------------
1 | import numpy as np
2 | from multiagent.core import World, Agent, Landmark
3 | from multiagent.scenario import BaseScenario
4 |
5 | class Scenario(BaseScenario):
6 | def make_world(self):
7 | world = World()
8 | # set any world properties first
9 | world.dim_c = 2
10 | num_agents = 2
11 | num_adversaries = 1
12 | num_landmarks = 2
13 | # add agents
14 | world.agents = [Agent() for i in range(num_agents)]
15 | for i, agent in enumerate(world.agents):
16 | agent.name = 'agent %d' % i
17 | agent.collide = True
18 | agent.silent = True
19 | if i < num_adversaries:
20 | agent.adversary = True
21 | else:
22 | agent.adversary = False
23 | # add landmarks
24 | world.landmarks = [Landmark() for i in range(num_landmarks)]
25 | for i, landmark in enumerate(world.landmarks):
26 | landmark.name = 'landmark %d' % i
27 | landmark.collide = False
28 | landmark.movable = False
29 | # make initial conditions
30 | self.reset_world(world)
31 | return world
32 |
33 | def reset_world(self, world):
34 | # random properties for landmarks
35 | for i, landmark in enumerate(world.landmarks):
36 | landmark.color = np.array([0.1, 0.1, 0.1])
37 | landmark.color[i + 1] += 0.8
38 | landmark.index = i
39 | # set goal landmark
40 | goal = np.random.choice(world.landmarks)
41 | for i, agent in enumerate(world.agents):
42 | agent.goal_a = goal
43 | agent.color = np.array([0.25, 0.25, 0.25])
44 | if agent.adversary:
45 | agent.color = np.array([0.75, 0.25, 0.25])
46 | else:
47 | j = goal.index
48 | agent.color[j + 1] += 0.5
49 | # set random initial states
50 | for agent in world.agents:
51 | agent.state.p_pos = np.random.uniform(-1, +1, world.dim_p)
52 | agent.state.p_vel = np.zeros(world.dim_p)
53 | agent.state.c = np.zeros(world.dim_c)
54 | for i, landmark in enumerate(world.landmarks):
55 | landmark.state.p_pos = np.random.uniform(-1, +1, world.dim_p)
56 | landmark.state.p_vel = np.zeros(world.dim_p)
57 |
58 | def reward(self, agent, world):
59 |         # Adversaries and good agents use different reward functions
60 | return self.adversary_reward(agent, world) if agent.adversary else self.agent_reward(agent, world)
61 |
62 | def agent_reward(self, agent, world):
63 |         # the reward is the negative distance to the goal
64 | return -np.sqrt(np.sum(np.square(agent.state.p_pos - agent.goal_a.state.p_pos)))
65 |
66 | def adversary_reward(self, agent, world):
67 | # keep the nearest good agents away from the goal
68 | agent_dist = [np.sqrt(np.sum(np.square(a.state.p_pos - a.goal_a.state.p_pos))) for a in world.agents if not a.adversary]
69 | pos_rew = min(agent_dist)
70 | #nearest_agent = world.good_agents[np.argmin(agent_dist)]
71 | #neg_rew = np.sqrt(np.sum(np.square(nearest_agent.state.p_pos - agent.state.p_pos)))
72 | neg_rew = np.sqrt(np.sum(np.square(agent.goal_a.state.p_pos - agent.state.p_pos)))
73 | #neg_rew = sum([np.sqrt(np.sum(np.square(a.state.p_pos - agent.state.p_pos))) for a in world.good_agents])
74 | return pos_rew - neg_rew
75 |
76 | def observation(self, agent, world):
77 | # get positions of all entities in this agent's reference frame
78 | entity_pos = []
79 | for entity in world.landmarks: # world.entities:
80 | entity_pos.append(entity.state.p_pos - agent.state.p_pos)
81 | # entity colors
82 | entity_color = []
83 | for entity in world.landmarks: # world.entities:
84 | entity_color.append(entity.color)
85 | # communication of all other agents
86 | comm = []
87 | other_pos = []
88 | for other in world.agents:
89 | if other is agent: continue
90 | comm.append(other.state.c)
91 | other_pos.append(other.state.p_pos - agent.state.p_pos)
92 | if not agent.adversary:
93 | return np.concatenate([agent.state.p_vel] + [agent.goal_a.state.p_pos - agent.state.p_pos] + [agent.color] + entity_pos + entity_color + other_pos)
94 | else:
95 | #other_pos = list(reversed(other_pos)) if random.uniform(0,1) > 0.5 else other_pos # randomize position of other agents in adversary network
96 | return np.concatenate([agent.state.p_vel] + entity_pos + other_pos)
97 |
--------------------------------------------------------------------------------
/environments/multiagent_particle_envs/multiagent/scenarios/simple_reference.py:
--------------------------------------------------------------------------------
1 | import numpy as np
2 | from multiagent.core import World, Agent, Landmark
3 | from multiagent.scenario import BaseScenario
4 |
5 | class Scenario(BaseScenario):
6 | def make_world(self):
7 | world = World()
8 | # set any world properties first
9 | world.dim_c = 10
10 | world.collaborative = True # whether agents share rewards
11 | # add agents
12 | world.agents = [Agent() for i in range(2)]
13 | for i, agent in enumerate(world.agents):
14 | agent.name = 'agent %d' % i
15 | agent.collide = False
16 | # add landmarks
17 | world.landmarks = [Landmark() for i in range(3)]
18 | for i, landmark in enumerate(world.landmarks):
19 | landmark.name = 'landmark %d' % i
20 | landmark.collide = False
21 | landmark.movable = False
22 | # make initial conditions
23 | self.reset_world(world)
24 | return world
25 |
26 | def reset_world(self, world):
27 | # assign goals to agents
28 | for agent in world.agents:
29 | agent.goal_a = None
30 | agent.goal_b = None
31 | # want other agent to go to the goal landmark
32 | world.agents[0].goal_a = world.agents[1]
33 | world.agents[0].goal_b = np.random.choice(world.landmarks)
34 | world.agents[1].goal_a = world.agents[0]
35 | world.agents[1].goal_b = np.random.choice(world.landmarks)
36 | # random properties for agents
37 | for i, agent in enumerate(world.agents):
38 | agent.color = np.array([0.25,0.25,0.25])
39 | # random properties for landmarks
40 | world.landmarks[0].color = np.array([0.75,0.25,0.25])
41 | world.landmarks[1].color = np.array([0.25,0.75,0.25])
42 | world.landmarks[2].color = np.array([0.25,0.25,0.75])
43 | # special colors for goals
44 | world.agents[0].goal_a.color = world.agents[0].goal_b.color
45 | world.agents[1].goal_a.color = world.agents[1].goal_b.color
46 | # set random initial states
47 | for agent in world.agents:
48 | agent.state.p_pos = np.random.uniform(-1,+1, world.dim_p)
49 | agent.state.p_vel = np.zeros(world.dim_p)
50 | agent.state.c = np.zeros(world.dim_c)
51 | for i, landmark in enumerate(world.landmarks):
52 | landmark.state.p_pos = np.random.uniform(-1,+1, world.dim_p)
53 | landmark.state.p_vel = np.zeros(world.dim_p)
54 |
55 | def reward(self, agent, world):
56 | if agent.goal_a is None or agent.goal_b is None:
57 | return 0.0
58 | dist2 = np.sum(np.square(agent.goal_a.state.p_pos - agent.goal_b.state.p_pos))
59 | return -dist2
60 |
61 | def observation(self, agent, world):
62 | # goal color
63 | goal_color = [np.zeros(world.dim_color), np.zeros(world.dim_color)]
64 | if agent.goal_b is not None:
65 | goal_color[1] = agent.goal_b.color
66 |
67 | # get positions of all entities in this agent's reference frame
68 | entity_pos = []
69 | for entity in world.landmarks:
70 | entity_pos.append(entity.state.p_pos - agent.state.p_pos)
71 | # entity colors
72 | entity_color = []
73 | for entity in world.landmarks:
74 | entity_color.append(entity.color)
75 | # communication of all other agents
76 | comm = []
77 | for other in world.agents:
78 | if other is agent: continue
79 | comm.append(other.state.c)
80 | return np.concatenate([agent.state.p_vel] + entity_pos + [goal_color[1]] + comm)
81 |
--------------------------------------------------------------------------------
/environments/multiagent_particle_envs/multiagent/scenarios/simple_speaker_listener.py:
--------------------------------------------------------------------------------
1 | import numpy as np
2 | from multiagent.core import World, Agent, Landmark
3 | from multiagent.scenario import BaseScenario
4 |
5 | class Scenario(BaseScenario):
6 | def make_world(self):
7 | world = World()
8 | # set any world properties first
9 | world.dim_c = 3
10 | num_landmarks = 3
11 | world.collaborative = True
12 | # add agents
13 | world.agents = [Agent() for i in range(2)]
14 | for i, agent in enumerate(world.agents):
15 | agent.name = 'agent %d' % i
16 | agent.collide = False
17 | agent.size = 0.075
18 | # speaker
19 | world.agents[0].movable = False
20 | # listener
21 | world.agents[1].silent = True
22 | # add landmarks
23 | world.landmarks = [Landmark() for i in range(num_landmarks)]
24 | for i, landmark in enumerate(world.landmarks):
25 | landmark.name = 'landmark %d' % i
26 | landmark.collide = False
27 | landmark.movable = False
28 | landmark.size = 0.04
29 | # make initial conditions
30 | self.reset_world(world)
31 | return world
32 |
33 | def reset_world(self, world):
34 | # assign goals to agents
35 | for agent in world.agents:
36 | agent.goal_a = None
37 | agent.goal_b = None
38 | # want listener to go to the goal landmark
39 | world.agents[0].goal_a = world.agents[1]
40 | world.agents[0].goal_b = np.random.choice(world.landmarks)
41 | # random properties for agents
42 | for i, agent in enumerate(world.agents):
43 | agent.color = np.array([0.25,0.25,0.25])
44 | # random properties for landmarks
45 | world.landmarks[0].color = np.array([0.65,0.15,0.15])
46 | world.landmarks[1].color = np.array([0.15,0.65,0.15])
47 | world.landmarks[2].color = np.array([0.15,0.15,0.65])
48 | # special colors for goals
49 | world.agents[0].goal_a.color = world.agents[0].goal_b.color + np.array([0.45, 0.45, 0.45])
50 | # set random initial states
51 | for agent in world.agents:
52 | agent.state.p_pos = np.random.uniform(-1,+1, world.dim_p)
53 | agent.state.p_vel = np.zeros(world.dim_p)
54 | agent.state.c = np.zeros(world.dim_c)
55 | for i, landmark in enumerate(world.landmarks):
56 | landmark.state.p_pos = np.random.uniform(-1,+1, world.dim_p)
57 | landmark.state.p_vel = np.zeros(world.dim_p)
58 |
59 | def benchmark_data(self, agent, world):
60 | # returns data for benchmarking purposes
61 |         return self.reward(agent, world)
62 |
63 | def reward(self, agent, world):
64 | # squared distance from listener to landmark
65 | a = world.agents[0]
66 | dist2 = np.sum(np.square(a.goal_a.state.p_pos - a.goal_b.state.p_pos))
67 | return -dist2
68 |
69 | def observation(self, agent, world):
70 | # goal color
71 | goal_color = np.zeros(world.dim_color)
72 | if agent.goal_b is not None:
73 | goal_color = agent.goal_b.color
74 |
75 | # get positions of all entities in this agent's reference frame
76 | entity_pos = []
77 | for entity in world.landmarks:
78 | entity_pos.append(entity.state.p_pos - agent.state.p_pos)
79 |
80 | # communication of all other agents
81 | comm = []
82 | for other in world.agents:
83 | if other is agent or (other.state.c is None): continue
84 | comm.append(other.state.c)
85 |
86 | # speaker
87 | if not agent.movable:
88 | return np.concatenate([goal_color])
89 | # listener
90 | if agent.silent:
91 | return np.concatenate([agent.state.p_vel] + entity_pos + comm)
92 |
93 |
--------------------------------------------------------------------------------
/environments/multiagent_particle_envs/multiagent/scenarios/simple_spread.py:
--------------------------------------------------------------------------------
1 | import numpy as np
2 | from multiagent.core import World, Agent, Landmark
3 | from multiagent.scenario import BaseScenario
4 |
5 |
6 | class Scenario(BaseScenario):
7 | def make_world(self):
8 | world = World()
9 | # set any world properties first
10 | world.dim_c = 2
11 | num_agents = 3
12 | num_landmarks = 3
13 | world.collaborative = True
14 | # add agents
15 | world.agents = [Agent() for i in range(num_agents)]
16 | for i, agent in enumerate(world.agents):
17 | agent.name = 'agent %d' % i
18 | agent.collide = True
19 | agent.silent = True
20 | agent.size = 0.15
21 | # add landmarks
22 | world.landmarks = [Landmark() for i in range(num_landmarks)]
23 | for i, landmark in enumerate(world.landmarks):
24 | landmark.name = 'landmark %d' % i
25 | landmark.collide = False
26 | landmark.movable = False
27 | # make initial conditions
28 | self.reset_world(world)
29 | return world
30 |
31 | def reset_world(self, world):
32 | # random properties for agents
33 | for i, agent in enumerate(world.agents):
34 | agent.color = np.array([0.35, 0.35, 0.85])
35 | # random properties for landmarks
36 | for i, landmark in enumerate(world.landmarks):
37 | landmark.color = np.array([0.25, 0.25, 0.25])
38 | # set random initial states
39 | for agent in world.agents:
40 | agent.state.p_pos = np.random.uniform(-1, +1, world.dim_p)
41 | agent.state.p_vel = np.zeros(world.dim_p)
42 | agent.state.c = np.zeros(world.dim_c)
43 | for i, landmark in enumerate(world.landmarks):
44 | landmark.state.p_pos = np.random.uniform(-1, +1, world.dim_p)
45 | landmark.state.p_vel = np.zeros(world.dim_p)
46 |
47 | def benchmark_data(self, agent, world):
48 | rew = 0
49 | collisions = 0
50 | occupied_landmarks = 0
51 | min_dists = 0
52 | for l in world.landmarks:
53 | dists = [np.sqrt(np.sum(np.square(a.state.p_pos - l.state.p_pos))) for a in world.agents]
54 | min_dists += min(dists)
55 | rew -= min(dists)
56 | if min(dists) < 0.1:
57 | occupied_landmarks += 1
58 | if agent.collide:
59 | for a in world.agents:
60 | if self.is_collision(a, agent):
61 | rew -= 1
62 | collisions += 1
63 | return (rew, collisions, min_dists, occupied_landmarks)
64 |
65 |
66 | def is_collision(self, agent1, agent2):
67 | delta_pos = agent1.state.p_pos - agent2.state.p_pos
68 | dist = np.sqrt(np.sum(np.square(delta_pos)))
69 | dist_min = agent1.size + agent2.size
70 | return True if dist < dist_min else False
71 |
72 | def reward(self, agent, world):
73 | # Agents are rewarded based on minimum agent distance to each landmark, penalized for collisions
74 | rew = 0
75 | for l in world.landmarks:
76 | dists = [np.sqrt(np.sum(np.square(a.state.p_pos - l.state.p_pos))) for a in world.agents]
77 | rew -= min(dists)
78 | if agent.collide:
79 | for a in world.agents:
80 | if self.is_collision(a, agent):
81 | rew -= 1
82 | return rew
83 |
84 | def observation(self, agent, world):
85 | # get positions of all entities in this agent's reference frame
86 | entity_pos = []
87 | for entity in world.landmarks: # world.entities:
88 | entity_pos.append(entity.state.p_pos - agent.state.p_pos)
89 | # entity colors
90 | entity_color = []
91 | for entity in world.landmarks: # world.entities:
92 | entity_color.append(entity.color)
93 | # communication of all other agents
94 | comm = []
95 | other_pos = []
96 | for other in world.agents:
97 | if other is agent: continue
98 | comm.append(other.state.c)
99 | other_pos.append(other.state.p_pos - agent.state.p_pos)
100 | return np.concatenate([agent.state.p_vel] + [agent.state.p_pos] + entity_pos + other_pos + comm)
101 |
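A quick sanity check (not part of the repository, assuming the `multiagent` package is importable) of the shared reward defined above: the distance term vanishes when every landmark has an agent sitting on it, and the collision loop determines the rest. Note that the loop also compares each agent against itself, so a collision penalty of at least -1 is always incurred:

```python
from multiagent.scenarios.simple_spread import Scenario

scenario = Scenario()
world = scenario.make_world()
# place each agent exactly on one landmark -> the min-distance term is 0
for agent, landmark in zip(world.agents, world.landmarks):
    agent.state.p_pos = landmark.state.p_pos.copy()
print(scenario.reward(world.agents[0], world))   # <= -1: self-collision plus any overlaps
```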
--------------------------------------------------------------------------------
/environments/multiagent_particle_envs/multiagent/scenarios/simple_tag.py:
--------------------------------------------------------------------------------
1 | '''
2 | COPY this file into the multiagent (OpenAI) repo to replace the old one, then install again
3 | '''
4 | import numpy as np
5 | from multiagent.core import World, Agent, Landmark, Action
6 | from multiagent.scenario import BaseScenario
7 |
8 | # By Yuan Zhang:
9 | def random_action(agent,world):
10 | action = Action()
11 | action.u = np.zeros(world.dim_p)
12 | action.c = np.zeros(world.dim_c)
13 | random_action = np.random.choice(5)
14 | # process discrete action
15 | if random_action == 1: action.u[0] = -1.0
16 | if random_action == 2: action.u[0] = +1.0
17 | if random_action == 3: action.u[1] = -1.0
18 | if random_action == 4: action.u[1] = +1.0
19 |
20 | # accel of prey
21 | sensitivity = 5.0
22 | if agent.accel is not None:
23 | sensitivity = agent.accel
24 | action.u *= sensitivity
25 | return action
26 |
27 | class Scenario(BaseScenario):
28 | def make_world(self):
29 | world = World()
30 | # By Yuan Zhang
31 | world.collaborative = True
32 | # set any world properties first
33 | world.dim_c = 2
34 | num_good_agents = 1
35 | num_adversaries = 3
36 | num_agents = num_adversaries + num_good_agents
37 | num_landmarks = 2
38 | # add agents
39 | world.agents = [Agent() for i in range(num_agents)]
40 | for i, agent in enumerate(world.agents):
41 | agent.name = 'agent %d' % i
42 | agent.collide = True
43 | agent.silent = True
44 | agent.adversary = True if i < num_adversaries else False
45 | agent.size = 0.075 if agent.adversary else 0.05
46 | agent.accel = 3.0 if agent.adversary else 4.0 # 3.0 4.0
47 | #agent.accel = 20.0 if agent.adversary else 25.0
48 | agent.max_speed = 1.0 if agent.adversary else 1.3 # 1.0 1.3
49 | # By Yuan Zhang:
50 | agent.action_callback = random_action if not agent.adversary else None
51 |
52 | # add landmarks
53 | world.landmarks = [Landmark() for i in range(num_landmarks)]
54 | for i, landmark in enumerate(world.landmarks):
55 | landmark.name = 'landmark %d' % i
56 | landmark.collide = True
57 | landmark.movable = False
58 | landmark.size = 0.2 # 0.2
59 | landmark.boundary = False
60 | # make initial conditions
61 | self.reset_world(world)
62 | self.done = False
63 | return world
64 |
65 |
66 | def reset_world(self, world):
67 | self.done = False
68 | # random properties for agents
69 | for i, agent in enumerate(world.agents):
70 | agent.color = np.array([0.35, 0.85, 0.35]) if not agent.adversary else np.array([0.85, 0.35, 0.35])
71 | # random properties for landmarks
72 | for i, landmark in enumerate(world.landmarks):
73 | landmark.color = np.array([0.25, 0.25, 0.25])
74 | # set random initial states
75 | for agent in world.agents:
76 | agent.state.p_pos = np.random.uniform(-1, +1, world.dim_p)
77 | agent.state.p_vel = np.zeros(world.dim_p)
78 | agent.state.c = np.zeros(world.dim_c)
79 | for i, landmark in enumerate(world.landmarks):
80 | if not landmark.boundary:
81 | landmark.state.p_pos = np.random.uniform(-0.9, +0.9, world.dim_p)
82 | landmark.state.p_vel = np.zeros(world.dim_p)
83 |
84 |
85 | def benchmark_data(self, agent, world):
86 | # returns data for benchmarking purposes
87 | if agent.adversary:
88 | collisions = 0
89 | for a in self.good_agents(world):
90 | if self.is_collision(a, agent):
91 | collisions += 1
92 | return collisions
93 | else:
94 | return 0
95 |
96 |
97 | def is_collision(self, agent1, agent2):
98 | delta_pos = agent1.state.p_pos - agent2.state.p_pos
99 | dist = np.sqrt(np.sum(np.square(delta_pos)))
100 | dist_min = agent1.size + agent2.size
101 | return True if dist < dist_min else False
102 |
103 | # return all agents that are not adversaries
104 | def good_agents(self, world):
105 | return [agent for agent in world.agents if not agent.adversary]
106 |
107 | # return all adversarial agents
108 | def adversaries(self, world):
109 | return [agent for agent in world.agents if agent.adversary]
110 |
111 |
112 | def reward(self, agent, world):
113 | # Agents are rewarded based on minimum agent distance to each landmark
114 | main_reward = self.adversary_reward(agent, world) if agent.adversary else self.agent_reward(agent, world)
115 | return main_reward
116 |
117 | def agent_reward(self, agent, world):
118 | # Agents are negatively rewarded if caught by adversaries
119 | rew = 0
120 | shape = False
121 | adversaries = self.adversaries(world)
122 | if shape: # reward can optionally be shaped (increased reward for increased distance from adversary)
123 | for adv in adversaries:
124 | rew += 0.1 * np.sqrt(np.sum(np.square(agent.state.p_pos - adv.state.p_pos)))
125 | if agent.collide:
126 | for a in adversaries:
127 | if self.is_collision(a, agent):
128 | rew -= 10
129 |
130 | # agents are penalized for exiting the screen, so that they can be caught by the adversaries
131 | def bound(x):
132 | if x < 0.9:
133 | return 0
134 | if x < 1.0:
135 | return (x - 0.9) * 10
136 | return min(np.exp(2 * x - 2), 10)
137 | for p in range(world.dim_p):
138 | x = abs(agent.state.p_pos[p])
139 | rew -= bound(x)
140 |
141 | return rew
142 |
143 | def adversary_reward(self, agent, world):
144 | # Adversaries are rewarded for collisions with agents
145 | rew = 0
146 | shape = True # False
147 | agents = self.good_agents(world)
148 | adversaries = self.adversaries(world)
149 | if shape: # reward can optionally be shaped (decreased reward for increased distance from agents)
150 | for adv in adversaries:
151 | rew -= 0.1 * min([np.sqrt(np.sum(np.square(a.state.p_pos - adv.state.p_pos))) for a in agents])
152 | if agent.collide:
153 | for ag in agents:
154 | for adv in adversaries:
155 | if self.is_collision(ag, adv):
156 | rew += 10
157 | self.done = True
158 | return rew
159 |
160 | def observation(self, agent, world):
161 | # get positions of all entities in this agent's reference frame
162 | entity_pos = []
163 | for entity in world.landmarks:
164 | if not entity.boundary:
165 | entity_pos.append(entity.state.p_pos - agent.state.p_pos)
166 | # communication of all other agents
167 | comm = []
168 | other_pos = []
169 | other_vel = []
170 | for other in world.agents:
171 | if other is agent: continue
172 | comm.append(other.state.c)
173 | other_pos.append(other.state.p_pos - agent.state.p_pos)
174 | if not other.adversary:
175 | other_vel.append(other.state.p_vel)
176 | return np.concatenate([agent.state.p_vel] + [agent.state.p_pos] + entity_pos + other_pos + other_vel)
177 |
178 | def episode_over(self, agent, world):
179 | return self.done
180 |
--------------------------------------------------------------------------------
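The bound() penalty in agent_reward() of simple_tag.py above keeps the prey inside the arena: it is zero up to |x| = 0.9, ramps linearly to 1 at |x| = 1.0, and then grows exponentially, capped at 10. A few spot checks at arbitrary example positions:

import numpy as np

# bound() copied from agent_reward() above, evaluated at arbitrary example positions.
def bound(x):
    if x < 0.9:
        return 0
    if x < 1.0:
        return (x - 0.9) * 10
    return min(np.exp(2 * x - 2), 10)

for x in (0.5, 0.95, 1.2):
    print(x, bound(x))  # 0, ~0.5, ~1.49
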
/environments/multiagent_particle_envs/multiagent/scenarios/simple_world_comm.py:
--------------------------------------------------------------------------------
1 | import numpy as np
2 | from multiagent.core import World, Agent, Landmark
3 | from multiagent.scenario import BaseScenario
4 |
5 |
6 | class Scenario(BaseScenario):
7 | def make_world(self):
8 | world = World()
9 | # set any world properties first
10 | world.dim_c = 4
11 | #world.damping = 1
12 | num_good_agents = 2
13 | num_adversaries = 4
14 | num_agents = num_adversaries + num_good_agents
15 | num_landmarks = 1
16 | num_food = 2
17 | num_forests = 2
18 | # add agents
19 | world.agents = [Agent() for i in range(num_agents)]
20 | for i, agent in enumerate(world.agents):
21 | agent.name = 'agent %d' % i
22 | agent.collide = True
23 | agent.leader = True if i == 0 else False
24 | agent.silent = True if i > 0 else False
25 | agent.adversary = True if i < num_adversaries else False
26 | agent.size = 0.075 if agent.adversary else 0.045
27 | agent.accel = 3.0 if agent.adversary else 4.0
28 | #agent.accel = 20.0 if agent.adversary else 25.0
29 | agent.max_speed = 1.0 if agent.adversary else 1.3
30 | # add landmarks
31 | world.landmarks = [Landmark() for i in range(num_landmarks)]
32 | for i, landmark in enumerate(world.landmarks):
33 | landmark.name = 'landmark %d' % i
34 | landmark.collide = True
35 | landmark.movable = False
36 | landmark.size = 0.2
37 | landmark.boundary = False
38 | world.food = [Landmark() for i in range(num_food)]
39 | for i, landmark in enumerate(world.food):
40 | landmark.name = 'food %d' % i
41 | landmark.collide = False
42 | landmark.movable = False
43 | landmark.size = 0.03
44 | landmark.boundary = False
45 | world.forests = [Landmark() for i in range(num_forests)]
46 | for i, landmark in enumerate(world.forests):
47 | landmark.name = 'forest %d' % i
48 | landmark.collide = False
49 | landmark.movable = False
50 | landmark.size = 0.3
51 | landmark.boundary = False
52 | world.landmarks += world.food
53 | world.landmarks += world.forests
54 | #world.landmarks += self.set_boundaries(world) # world boundaries now penalized with negative reward
55 | # make initial conditions
56 | self.reset_world(world)
57 | return world
58 |
59 | def set_boundaries(self, world):
60 | boundary_list = []
61 | landmark_size = 1
62 | edge = 1 + landmark_size
63 | num_landmarks = int(edge * 2 / landmark_size)
64 | for x_pos in [-edge, edge]:
65 | for i in range(num_landmarks):
66 | l = Landmark()
67 | l.state.p_pos = np.array([x_pos, -1 + i * landmark_size])
68 | boundary_list.append(l)
69 |
70 | for y_pos in [-edge, edge]:
71 | for i in range(num_landmarks):
72 | l = Landmark()
73 | l.state.p_pos = np.array([-1 + i * landmark_size, y_pos])
74 | boundary_list.append(l)
75 |
76 | for i, l in enumerate(boundary_list):
77 | l.name = 'boundary %d' % i
78 | l.collide = True
79 | l.movable = False
80 | l.boundary = True
81 | l.color = np.array([0.75, 0.75, 0.75])
82 | l.size = landmark_size
83 | l.state.p_vel = np.zeros(world.dim_p)
84 |
85 | return boundary_list
86 |
87 |
88 | def reset_world(self, world):
89 | # random properties for agents
90 | for i, agent in enumerate(world.agents):
91 | agent.color = np.array([0.45, 0.95, 0.45]) if not agent.adversary else np.array([0.95, 0.45, 0.45])
92 | agent.color -= np.array([0.3, 0.3, 0.3]) if agent.leader else np.array([0, 0, 0])
93 | # random properties for landmarks
94 | for i, landmark in enumerate(world.landmarks):
95 | landmark.color = np.array([0.25, 0.25, 0.25])
96 | for i, landmark in enumerate(world.food):
97 | landmark.color = np.array([0.15, 0.15, 0.65])
98 | for i, landmark in enumerate(world.forests):
99 | landmark.color = np.array([0.6, 0.9, 0.6])
100 | # set random initial states
101 | for agent in world.agents:
102 | agent.state.p_pos = np.random.uniform(-1, +1, world.dim_p)
103 | agent.state.p_vel = np.zeros(world.dim_p)
104 | agent.state.c = np.zeros(world.dim_c)
105 | for i, landmark in enumerate(world.landmarks):
106 | landmark.state.p_pos = np.random.uniform(-0.9, +0.9, world.dim_p)
107 | landmark.state.p_vel = np.zeros(world.dim_p)
108 | for i, landmark in enumerate(world.food):
109 | landmark.state.p_pos = np.random.uniform(-0.9, +0.9, world.dim_p)
110 | landmark.state.p_vel = np.zeros(world.dim_p)
111 | for i, landmark in enumerate(world.forests):
112 | landmark.state.p_pos = np.random.uniform(-0.9, +0.9, world.dim_p)
113 | landmark.state.p_vel = np.zeros(world.dim_p)
114 |
115 | def benchmark_data(self, agent, world):
116 | if agent.adversary:
117 | collisions = 0
118 | for a in self.good_agents(world):
119 | if self.is_collision(a, agent):
120 | collisions += 1
121 | return collisions
122 | else:
123 | return 0
124 |
125 |
126 | def is_collision(self, agent1, agent2):
127 | delta_pos = agent1.state.p_pos - agent2.state.p_pos
128 | dist = np.sqrt(np.sum(np.square(delta_pos)))
129 | dist_min = agent1.size + agent2.size
130 | return True if dist < dist_min else False
131 |
132 |
133 | # return all agents that are not adversaries
134 | def good_agents(self, world):
135 | return [agent for agent in world.agents if not agent.adversary]
136 |
137 | # return all adversarial agents
138 | def adversaries(self, world):
139 | return [agent for agent in world.agents if agent.adversary]
140 |
141 |
142 | def reward(self, agent, world):
143 | # Agents are rewarded based on minimum agent distance to each landmark
144 | #boundary_reward = -10 if self.outside_boundary(agent) else 0
145 | main_reward = self.adversary_reward(agent, world) if agent.adversary else self.agent_reward(agent, world)
146 | return main_reward
147 |
148 | def outside_boundary(self, agent):
149 | if agent.state.p_pos[0] > 1 or agent.state.p_pos[0] < -1 or agent.state.p_pos[1] > 1 or agent.state.p_pos[1] < -1:
150 | return True
151 | else:
152 | return False
153 |
154 |
155 | def agent_reward(self, agent, world):
156 | # Agents are rewarded based on minimum agent distance to each landmark
157 | rew = 0
158 | shape = False
159 | adversaries = self.adversaries(world)
160 | if shape:
161 | for adv in adversaries:
162 | rew += 0.1 * np.sqrt(np.sum(np.square(agent.state.p_pos - adv.state.p_pos)))
163 | if agent.collide:
164 | for a in adversaries:
165 | if self.is_collision(a, agent):
166 | rew -= 5
167 | def bound(x):
168 | if x < 0.9:
169 | return 0
170 | if x < 1.0:
171 | return (x - 0.9) * 10
172 | return min(np.exp(2 * x - 2), 10) # 1 + (x - 1) * (x - 1)
173 |
174 | for p in range(world.dim_p):
175 | x = abs(agent.state.p_pos[p])
176 | rew -= 2 * bound(x)
177 |
178 | for food in world.food:
179 | if self.is_collision(agent, food):
180 | rew += 2
181 | rew += 0.05 * min([np.sqrt(np.sum(np.square(food.state.p_pos - agent.state.p_pos))) for food in world.food])
182 |
183 | return rew
184 |
185 | def adversary_reward(self, agent, world):
186 | # Agents are rewarded based on minimum agent distance to each landmark
187 | rew = 0
188 | shape = True
189 | agents = self.good_agents(world)
190 | adversaries = self.adversaries(world)
191 | if shape:
192 | rew -= 0.1 * min([np.sqrt(np.sum(np.square(a.state.p_pos - agent.state.p_pos))) for a in agents])
193 | if agent.collide:
194 | for ag in agents:
195 | for adv in adversaries:
196 | if self.is_collision(ag, adv):
197 | rew += 5
198 | return rew
199 |
200 |
201 | def observation2(self, agent, world):
202 | # get positions of all entities in this agent's reference frame
203 | entity_pos = []
204 | for entity in world.landmarks:
205 | if not entity.boundary:
206 | entity_pos.append(entity.state.p_pos - agent.state.p_pos)
207 |
208 | food_pos = []
209 | for entity in world.food:
210 | if not entity.boundary:
211 | food_pos.append(entity.state.p_pos - agent.state.p_pos)
212 | # communication of all other agents
213 | comm = []
214 | other_pos = []
215 | other_vel = []
216 | for other in world.agents:
217 | if other is agent: continue
218 | comm.append(other.state.c)
219 | other_pos.append(other.state.p_pos - agent.state.p_pos)
220 | if not other.adversary:
221 | other_vel.append(other.state.p_vel)
222 | return np.concatenate([agent.state.p_vel] + [agent.state.p_pos] + entity_pos + other_pos + other_vel)
223 |
224 | def observation(self, agent, world):
225 | # get positions of all entities in this agent's reference frame
226 | entity_pos = []
227 | for entity in world.landmarks:
228 | if not entity.boundary:
229 | entity_pos.append(entity.state.p_pos - agent.state.p_pos)
230 |
231 | in_forest = [np.array([-1]), np.array([-1])]
232 | inf1 = False
233 | inf2 = False
234 | if self.is_collision(agent, world.forests[0]):
235 | in_forest[0] = np.array([1])
236 |             inf1 = True
237 | if self.is_collision(agent, world.forests[1]):
238 | in_forest[1] = np.array([1])
239 | inf2 = True
240 |
241 | food_pos = []
242 | for entity in world.food:
243 | if not entity.boundary:
244 | food_pos.append(entity.state.p_pos - agent.state.p_pos)
245 | # communication of all other agents
246 | comm = []
247 | other_pos = []
248 | other_vel = []
249 | for other in world.agents:
250 | if other is agent: continue
251 | comm.append(other.state.c)
252 | oth_f1 = self.is_collision(other, world.forests[0])
253 | oth_f2 = self.is_collision(other, world.forests[1])
254 | if (inf1 and oth_f1) or (inf2 and oth_f2) or (not inf1 and not oth_f1 and not inf2 and not oth_f2) or agent.leader: #without forest vis
255 | other_pos.append(other.state.p_pos - agent.state.p_pos)
256 | if not other.adversary:
257 | other_vel.append(other.state.p_vel)
258 | else:
259 | other_pos.append([0, 0])
260 | if not other.adversary:
261 | other_vel.append([0, 0])
262 |
263 | # to tell the pred when the prey are in the forest
264 | prey_forest = []
265 | ga = self.good_agents(world)
266 | for a in ga:
267 | if any([self.is_collision(a, f) for f in world.forests]):
268 | prey_forest.append(np.array([1]))
269 | else:
270 | prey_forest.append(np.array([-1]))
271 | # to tell leader when pred are in forest
272 | prey_forest_lead = []
273 | for f in world.forests:
274 | if any([self.is_collision(a, f) for a in ga]):
275 | prey_forest_lead.append(np.array([1]))
276 | else:
277 | prey_forest_lead.append(np.array([-1]))
278 |
279 | comm = [world.agents[0].state.c]
280 |
281 | if agent.adversary and not agent.leader:
282 | return np.concatenate([agent.state.p_vel] + [agent.state.p_pos] + entity_pos + other_pos + other_vel + in_forest + comm)
283 | if agent.leader:
284 | return np.concatenate(
285 | [agent.state.p_vel] + [agent.state.p_pos] + entity_pos + other_pos + other_vel + in_forest + comm)
286 | else:
287 | return np.concatenate([agent.state.p_vel] + [agent.state.p_pos] + entity_pos + other_pos + in_forest + other_vel)
288 |
289 |
290 |
--------------------------------------------------------------------------------
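The visibility test inside observation() of simple_world_comm.py above is easy to misread, so here is a hedged restatement as a standalone predicate (the function name and boolean arguments are illustrative, not part of the file): another agent's relative position (and, for prey, its velocity) is revealed only if both agents share a forest, neither of them is in any forest, or the observer is the leader; otherwise zeros are appended instead.

# Illustrative restatement of the visibility condition in observation() above.
def visible(inf1, inf2, oth_f1, oth_f2, is_leader):
    return ((inf1 and oth_f1) or (inf2 and oth_f2)
            or (not inf1 and not oth_f1 and not inf2 and not oth_f2)
            or is_leader)

print(visible(True, False, True, False, False))   # both in forest 0 -> visible
print(visible(True, False, False, False, False))  # observer hidden alone -> masked
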
/environments/multiagent_particle_envs/setup.py:
--------------------------------------------------------------------------------
1 | from setuptools import setup, find_packages
2 |
3 | setup(name='multiagent',
4 | version='0.0.1',
5 | description='Multi-Agent Goal-Driven Communication Environment',
6 | url='https://github.com/openai/multiagent-public',
7 | author='Igor Mordatch',
8 | author_email='mordatch@openai.com',
9 | packages=find_packages(),
10 | include_package_data=True,
11 | zip_safe=False,
12 | install_requires=['gym', 'numpy-stl']
13 | )
14 |
--------------------------------------------------------------------------------
/environments/traffic_helper.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python
2 | # -*- coding: utf-8 -*-
3 | import numpy as np
4 |
5 | move = [(-1,0),(1,0),(0,-1),(0,1)]
6 |
7 | def get_road_blocks(w, h, difficulty):
8 |
9 | # assuming 1 is the lane width for each direction.
10 | road_blocks = {
11 | 'easy': [ np.s_[h//2, :],
12 | np.s_[:, w//2]],
13 |
14 | 'medium': [ np.s_[h//2 - 1 : h//2 + 1, :],
15 | np.s_[:, w//2 - 1 : w//2 + 1]],
16 |
17 | 'hard': [ np.s_[h//3-2: h//3, :],
18 | np.s_[2* h//3: 2* h//3 + 2, :],
19 |
20 | np.s_[:, w//3-2: w//3],
21 | np.s_[:, 2* h//3: 2* h//3 + 2]],
22 | }
23 |
24 | return road_blocks[difficulty]
25 |
26 | def goal_reached(place_i, curr, finish_points):
27 | return curr in finish_points[:place_i] + finish_points[place_i+1:]
28 |
29 |
30 | def get_add_mat(dims, grid, difficulty):
31 | h,w = dims
32 |
33 | road_dir = grid.copy()
34 | junction = np.zeros_like(grid)
35 |
36 | if difficulty == 'medium':
37 | arrival_points = [ (0, w//2-1), # TOP
38 | (h-1,w//2), # BOTTOM
39 | (h//2, 0), # LEFT
40 | (h//2-1,w-1)] # RIGHT
41 |
42 | finish_points = [ (0, w//2), # TOP
43 | (h-1,w//2-1), # BOTTOM
44 | (h//2-1, 0), # LEFT
45 | (h//2,w-1)] # RIGHT
46 |
47 | # mark road direction
48 | road_dir[h//2, :] = 2
49 | road_dir[h//2 - 1, :] = 3
50 | road_dir[:, w//2 ] = 4
51 |
52 | # mark the Junction
53 | junction[h//2-1:h//2+1,w//2-1:w//2+1 ] =1
54 |
55 | elif difficulty =='hard':
56 | arrival_points = [ (0, w//3-2), # TOP-left
57 | (0,2*w//3), # TOP-right
58 |
59 | (h//3-1, 0), # LEFT-top
60 | (2*h//3+1,0), # LEFT-bottom
61 |
62 | (h-1,w//3-1), # BOTTOM-left
63 | (h-1,2*w//3+1), # BOTTOM-right
64 |
65 | (h//3-2, w-1), # RIGHT-top
66 | (2*h//3,w-1)] # RIGHT-bottom
67 |
68 |
69 | finish_points = [ (0, w//3-1), # TOP-left
70 | (0,2*w//3+1), # TOP-right
71 |
72 | (h//3-2, 0), # LEFT-top
73 | (2*h//3,0), # LEFT-bottom
74 |
75 | (h-1,w//3-2), # BOTTOM-left
76 | (h-1,2*w//3), # BOTTOM-right
77 |
78 | (h//3-1, w-1), # RIGHT-top
79 | (2*h//3+1,w-1)] # RIGHT-bottom
80 |
81 | # mark road direction
82 | road_dir[h//3-1, :] = 2
83 | road_dir[2*h//3, :] = 3
84 | road_dir[2*h//3 + 1, :] = 4
85 |
86 | road_dir[:, w//3-2 ] = 5
87 | road_dir[:, w//3-1 ] = 6
88 | road_dir[:, 2*w//3 ] = 7
89 | road_dir[:, 2*w//3 +1] = 8
90 |
91 | # mark the Junctions
92 | junction[h//3-2:h//3, w//3-2:w//3 ] = 1
93 | junction[2*h//3:2*h//3+2, w//3-2:w//3 ] = 1
94 |
95 | junction[h//3-2:h//3, 2*w//3:2*w//3+2 ] = 1
96 | junction[2*h//3:2*h//3+2, 2*w//3:2*w//3+2 ] = 1
97 |
98 | return arrival_points, finish_points, road_dir, junction
99 |
100 |
101 | def next_move(curr, turn, turn_step, start, grid, road_dir, junction, visited):
102 | h,w = grid.shape
103 | turn_completed = False
104 | turn_prog = False
105 | neigh =[]
106 | for m in move:
107 | # check lane while taking left turn
108 | n = (curr[0] + m[0], curr[1] + m[1])
109 | if 0 <= n[0] <= h-1 and 0 <= n[1] <= w-1 and grid[n] and n not in visited:
110 | # On Junction, use turns
111 | if junction[n] == junction[curr] == 1:
112 | if (turn == 0 or turn == 2) and ((n[0] == start[0]) or (n[1] == start[1])):
113 | # Straight on junction for either left or straight
114 | neigh.append(n)
115 | if turn == 2:
116 | turn_prog = True
117 |
118 | # left from junction
119 | elif turn == 2 and turn_step ==1:
120 | neigh.append(n)
121 | turn_prog = True
122 |
123 | else:
124 | # End of path
125 | pass
126 |
127 | # Completing left turn on junction
128 | elif junction[curr] and not junction[n] and turn ==2 and turn_step==2 \
129 | and (abs(start[0] - n[0]) ==2 or abs(start[1] - n[1]) ==2):
130 | neigh.append(n)
131 | turn_completed =True
132 |
133 | # junction seen, get onto it;
134 | elif (junction[n] and not junction[curr]):
135 | neigh.append(n)
136 |
137 | # right from junction
138 | elif turn == 1 and not junction[n] and junction[curr]:
139 | neigh.append(n)
140 | turn_completed =True
141 |
142 |             # Straight from junction
143 | elif turn == 0 and junction[curr] and road_dir[n] == road_dir[start]:
144 | neigh.append(n)
145 | turn_completed = True
146 |
147 | # keep going no decision to make;
148 | elif road_dir[n] == road_dir[curr] and not junction[curr]:
149 | neigh.append(n)
150 |
151 | if neigh:
152 | return neigh[0], turn_prog, turn_completed
153 |     # neigh is empty at this point: no legal next cell was found
154 |     raise RuntimeError("next move should be of len 1. Reached ambiguous situation.")
155 |
156 |
157 |
158 | def get_routes(dims, grid, difficulty):
159 | '''
160 | returns
161 | - routes: type list of list
162 | list for each arrival point of list of routes from that arrival point.
163 | '''
164 | grid.dtype = int
165 | h,w = dims
166 |
167 | assert difficulty == 'medium' or difficulty == 'hard'
168 |
169 | arrival_points, finish_points, road_dir, junction = get_add_mat(dims, grid, difficulty)
170 |
171 | n_turn1 = 3 # 0 - straight, 1-right, 2-left
172 | n_turn2 = 1 if difficulty == 'medium' else 3
173 |
174 |
175 | routes=[]
176 | # routes for each arrival point
177 | for i in range(len(arrival_points)):
178 | paths = []
179 | # turn 1
180 | for turn_1 in range(n_turn1):
181 | # turn 2
182 | for turn_2 in range(n_turn2):
183 | total_turns = 0
184 | curr_turn = turn_1
185 | path = []
186 | visited = set()
187 | current = arrival_points[i]
188 | path.append(current)
189 | start = current
190 | turn_step = 0
191 | # "start"
192 | while not goal_reached(i, current, finish_points):
193 | visited.add(current)
194 | current, turn_prog, turn_completed = next_move(current, curr_turn, turn_step, start, grid, road_dir, junction, visited)
195 | if curr_turn == 2 and turn_prog:
196 | turn_step +=1
197 | if turn_completed:
198 | total_turns += 1
199 | curr_turn = turn_2
200 | turn_step = 0
201 | start = current
202 | # keep going straight till the exit if 2 turns made already.
203 | if total_turns == 2:
204 | curr_turn = 0
205 | path.append(current)
206 | paths.append(path)
207 | # early stopping, if first turn leads to exit
208 | if total_turns == 1:
209 | break
210 | routes.append(paths)
211 | return routes
212 |
--------------------------------------------------------------------------------
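get_road_blocks() in traffic_helper.py above returns numpy slice objects that mark the drivable cells of the junction grid. A small, self-contained illustration of the 'medium' layout on an 8x8 grid (the 8x8 size is only an example; the actual dimensions are configured in traffic_junction_env.py):

import numpy as np

# Arbitrary 8x8 example; the 'medium' slices below mirror get_road_blocks() above.
h = w = 8
grid = np.zeros((h, w), dtype=int)
for block in (np.s_[h//2 - 1 : h//2 + 1, :], np.s_[:, w//2 - 1 : w//2 + 1]):
    grid[block] = 1
print(grid)
print(int(grid.sum()), 'road cells')  # 28 = two full rows + two full columns - 4 shared junction cells
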
/figures/dynamics_131.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hsvgbkhgbv/SQDDPG/33eef74b3cbe207a340d65a65d6ff5be34acc86e/figures/dynamics_131.pdf
--------------------------------------------------------------------------------
/figures/dynamics_135.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hsvgbkhgbv/SQDDPG/33eef74b3cbe207a340d65a65d6ff5be34acc86e/figures/dynamics_135.pdf
--------------------------------------------------------------------------------
/figures/dynamics_136.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hsvgbkhgbv/SQDDPG/33eef74b3cbe207a340d65a65d6ff5be34acc86e/figures/dynamics_136.pdf
--------------------------------------------------------------------------------
/figures/dynamics_19.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hsvgbkhgbv/SQDDPG/33eef74b3cbe207a340d65a65d6ff5be34acc86e/figures/dynamics_19.pdf
--------------------------------------------------------------------------------
/figures/dynamics_38.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hsvgbkhgbv/SQDDPG/33eef74b3cbe207a340d65a65d6ff5be34acc86e/figures/dynamics_38.png
--------------------------------------------------------------------------------
/figures/easy_reward.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hsvgbkhgbv/SQDDPG/33eef74b3cbe207a340d65a65d6ff5be34acc86e/figures/easy_reward.pdf
--------------------------------------------------------------------------------
/figures/easy_road.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hsvgbkhgbv/SQDDPG/33eef74b3cbe207a340d65a65d6ff5be34acc86e/figures/easy_road.pdf
--------------------------------------------------------------------------------
/figures/easy_success.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hsvgbkhgbv/SQDDPG/33eef74b3cbe207a340d65a65d6ff5be34acc86e/figures/easy_success.pdf
--------------------------------------------------------------------------------
/figures/hard_reward.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hsvgbkhgbv/SQDDPG/33eef74b3cbe207a340d65a65d6ff5be34acc86e/figures/hard_reward.pdf
--------------------------------------------------------------------------------
/figures/hard_road.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hsvgbkhgbv/SQDDPG/33eef74b3cbe207a340d65a65d6ff5be34acc86e/figures/hard_road.pdf
--------------------------------------------------------------------------------
/figures/hard_success.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hsvgbkhgbv/SQDDPG/33eef74b3cbe207a340d65a65d6ff5be34acc86e/figures/hard_success.pdf
--------------------------------------------------------------------------------
/figures/medium_reward.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hsvgbkhgbv/SQDDPG/33eef74b3cbe207a340d65a65d6ff5be34acc86e/figures/medium_reward.pdf
--------------------------------------------------------------------------------
/figures/medium_road.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hsvgbkhgbv/SQDDPG/33eef74b3cbe207a340d65a65d6ff5be34acc86e/figures/medium_road.pdf
--------------------------------------------------------------------------------
/figures/medium_success.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hsvgbkhgbv/SQDDPG/33eef74b3cbe207a340d65a65d6ff5be34acc86e/figures/medium_success.pdf
--------------------------------------------------------------------------------
/figures/simple_spread_mean_reward.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hsvgbkhgbv/SQDDPG/33eef74b3cbe207a340d65a65d6ff5be34acc86e/figures/simple_spread_mean_reward.png
--------------------------------------------------------------------------------
/figures/simple_tag_turn.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hsvgbkhgbv/SQDDPG/33eef74b3cbe207a340d65a65d6ff5be34acc86e/figures/simple_tag_turn.png
--------------------------------------------------------------------------------
/figures/venn.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hsvgbkhgbv/SQDDPG/33eef74b3cbe207a340d65a65d6ff5be34acc86e/figures/venn.png
--------------------------------------------------------------------------------
/learning_algorithms/actor_critic.py:
--------------------------------------------------------------------------------
1 | from learning_algorithms.rl_algorithms import *
2 | import torch
3 | from utilities.util import *
4 |
5 |
6 |
7 | class ActorCritic(ReinforcementLearning):
8 |
9 | def __init__(self, args):
10 | super(ActorCritic, self).__init__('Actor_Critic', args)
11 |
12 | def __call__(self, batch, behaviour_net):
13 | return self.get_loss(batch, behaviour_net)
14 |
15 | def get_loss(self, batch, behaviour_net, target_net=None):
16 | # TODO: fix policy params update
17 | batch_size = len(batch.state)
18 | n = self.args.agent_num
19 | # collect the transition data
20 | rewards, last_step, done, actions, state, next_state = behaviour_net.unpack_data(batch)
21 | # construct the computational graph
22 | action_out = behaviour_net.policy(state)
23 | values = behaviour_net.value(state, actions)
24 | if self.args.q_func:
25 | values = torch.sum(values*actions, dim=-1)
26 | values = values.contiguous().view(-1, n)
27 | if target_net == None:
28 | next_action_out = behaviour_net.policy(next_state)
29 | else:
30 | next_action_out = target_net.policy(next_state)
31 | next_actions = select_action(self.args, next_action_out, status='train')
32 | next_values = behaviour_net.value(next_state, next_actions)
33 | if self.args.q_func:
34 | next_values = torch.sum(next_values*next_actions, dim=-1)
35 | next_values = next_values.contiguous().view(-1, n)
36 | returns = cuda_wrapper(torch.zeros((batch_size, n), dtype=torch.float), self.cuda_)
37 | # calculate the advantages
38 | assert values.size() == next_values.size()
39 | assert returns.size() == values.size()
40 | for i in reversed(range(rewards.size(0))):
41 | if last_step[i]:
42 | next_return = 0 if done[i] else next_values[i].detach()
43 | else:
44 | next_return = next_values[i].detach()
45 | returns[i] = rewards[i] + self.args.gamma * next_return
46 | deltas = returns - values
47 | advantages = values.detach()
48 | # advantages = advantages.contiguous().view(-1, 1)
49 | if self.args.normalize_advantages:
50 | advantages = batchnorm(advantages)
51 | # construct the action loss and the value loss
52 | log_prob_a = multinomials_log_density(actions, action_out).contiguous().view(-1,n)
53 | assert log_prob_a.size() == advantages.size()
54 | action_loss = -advantages * log_prob_a
55 | action_loss = action_loss.mean(dim=0)
56 | value_loss = deltas.pow(2).mean(dim=0)
57 | return action_loss, value_loss, action_out
58 |
--------------------------------------------------------------------------------
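The backward loop in get_loss() of actor_critic.py above builds a one-step bootstrapped target, reward + gamma * next value, and zeroes the bootstrap when an episode terminates on its last step. A toy numerical check with made-up rewards and values, following the (batch, agent) layout used above:

import torch

gamma = 0.99
rewards     = torch.tensor([[1.0, 0.5], [0.0, 0.2]])  # (batch=2, n=2), made-up numbers
next_values = torch.tensor([[2.0, 1.0], [3.0, 0.5]])
done        = torch.tensor([0.0, 1.0])                # second sample ends its episode
last_step   = torch.tensor([0.0, 1.0])

returns = torch.zeros_like(rewards)
for i in reversed(range(rewards.size(0))):
    if last_step[i]:
        next_return = 0 if done[i] else next_values[i]
    else:
        next_return = next_values[i]
    returns[i] = rewards[i] + gamma * next_return
print(returns)  # [[2.98, 1.49], [0.00, 0.20]]
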
/learning_algorithms/ddpg.py:
--------------------------------------------------------------------------------
1 | from learning_algorithms.rl_algorithms import *
2 | from utilities.util import *
3 |
4 |
5 |
6 | class DDPG(ReinforcementLearning):
7 |
8 | def __init__(self, args):
9 | super(DDPG, self).__init__('DDPG', args)
10 |
11 | def __call__(self, batch, behaviour_net, target_net):
12 | return self.get_loss(batch, behaviour_net, target_net)
13 |
14 | def get_loss(self, batch, behaviour_net, target_net):
15 | # TODO: fix policy params update
16 | batch_size = len(batch.state)
17 | n = self.args.agent_num
18 | # collect the transition data
19 | rewards, last_step, done, actions, state, next_state = behaviour_net.unpack_data(batch)
20 | # construct the computational graph
21 | # do the argmax action on the action loss
22 | action_out = behaviour_net.policy(state)
23 | actions_ = select_action(self.args, action_out, status='train', exploration=False)
24 | values_ = behaviour_net.value(state, actions_).contiguous().view(-1, n)
25 | # do the exploration action on the value loss
26 | values = behaviour_net.value(state, actions).contiguous().view(-1, n)
27 | # do the argmax action on the next value loss
28 | next_action_out = target_net.policy(next_state)
29 | next_actions_ = select_action(self.args, next_action_out, status='train', exploration=False)
30 | next_values_ = target_net.value(next_state, next_actions_.detach()).contiguous().view(-1, n)
31 | returns = cuda_wrapper(torch.zeros((batch_size, n), dtype=torch.float), self.cuda_)
32 | assert values_.size() == next_values_.size()
33 | assert returns.size() == values.size()
34 | for i in reversed(range(rewards.size(0))):
35 | if last_step[i]:
36 | next_return = 0 if done[i] else next_values_[i].detach()
37 | else:
38 | next_return = next_values_[i].detach()
39 | returns[i] = rewards[i] + self.args.gamma * next_return
40 | deltas = returns - values
41 | advantages = values_
42 | if self.args.normalize_advantages:
43 | advantages = batchnorm(advantages)
44 | action_loss = -advantages
45 | action_loss = action_loss.mean(dim=0)
46 | value_loss = deltas.pow(2).mean(dim=0)
47 | return action_loss, value_loss, action_out
48 |
--------------------------------------------------------------------------------
/learning_algorithms/rl_algorithms.py:
--------------------------------------------------------------------------------
1 | import numpy as np
2 | import torch
3 | from torch import optim
4 | import torch.nn as nn
5 | from utilities.util import *
6 |
7 |
8 |
9 | class ReinforcementLearning(object):
10 |
11 | def __init__(self, name, args):
12 | self.name = name
13 | self.args = args
14 | self.cuda_ = torch.cuda.is_available() and self.args.cuda
15 |
16 | def __str__(self):
17 |         return self.name
18 |
19 | def __call__(self):
20 | raise NotImplementedError()
21 |
22 | def get_loss(self):
23 | raise NotImplementedError()
24 |
--------------------------------------------------------------------------------
/models/coma_fc.py:
--------------------------------------------------------------------------------
1 | import torch
2 | import torch.nn as nn
3 | import numpy as np
4 | from utilities.util import *
5 | from models.model import Model
6 | from collections import namedtuple
7 |
8 |
9 |
10 | class COMAFC(Model):
11 |
12 | def __init__(self, args, target_net=None):
13 | super(COMAFC, self).__init__(args)
14 | self.construct_model()
15 | self.apply(self.init_weights)
16 | if target_net != None:
17 | self.target_net = target_net
18 | self.reload_params_to_target()
19 | self.Transition = namedtuple('Transition', ('state', 'action', 'reward', 'next_state', 'done', 'last_step'))
20 |
21 | def construct_policy_net(self):
22 | action_dicts = []
23 | if self.args.shared_parameters:
24 | l1 = nn.Linear(self.obs_dim, self.hid_dim)
25 | l2 = nn.Linear(self.hid_dim, self.hid_dim)
26 | a = nn.Linear(self.hid_dim, self.act_dim)
27 | for i in range(self.n_):
28 | action_dicts.append(nn.ModuleDict( {'layer_1': l1,\
29 | 'layer_2': l2,\
30 | 'action_head': a
31 | }
32 | )
33 | )
34 | else:
35 | for i in range(self.n_):
36 | action_dicts.append(nn.ModuleDict( {'layer_1': nn.Linear(self.obs_dim, self.hid_dim),\
37 | 'layer_2': nn.Linear(self.hid_dim, self.hid_dim),\
38 | 'action_head': nn.Linear(self.hid_dim, self.act_dim)
39 | }
40 | )
41 | )
42 | self.action_dicts = nn.ModuleList(action_dicts)
43 |
44 | def construct_value_net(self):
45 | value_dicts = []
46 | if self.args.shared_parameters:
47 | l1 = nn.Linear((self.n_+1)*self.obs_dim+(self.n_-1)*self.act_dim, self.hid_dim)
48 | l2 = nn.Linear(self.hid_dim, self.hid_dim)
49 | v = nn.Linear(self.hid_dim, self.act_dim)
50 | for i in range(self.n_):
51 | value_dicts.append(nn.ModuleDict( {'layer_1': l1,\
52 | 'layer_2': l2,\
53 | 'value_head': v
54 | }
55 | )
56 | )
57 | else:
58 | for i in range(self.n_):
59 | value_dicts.append(nn.ModuleDict( {'layer_1': nn.Linear((self.n_+1)*self.obs_dim+(self.n_-1)*self.act_dim, self.hid_dim),\
60 | 'layer_2': nn.Linear(self.hid_dim, self.hid_dim),\
61 | 'value_head': nn.Linear(self.hid_dim, self.act_dim)
62 | }
63 | )
64 | )
65 | self.value_dicts = nn.ModuleList(value_dicts)
66 |
67 | def construct_model(self):
68 | self.construct_value_net()
69 | self.construct_policy_net()
70 |
71 | def policy(self, obs, schedule=None, last_act=None, last_hid=None, info={}, stat={}):
72 | actions = []
73 | for i in range(self.n_):
74 | h = torch.relu( self.action_dicts[i]['layer_1'](obs[:, i, :]) )
75 | h = torch.relu( self.action_dicts[i]['layer_2'](h) )
76 | a = self.action_dicts[i]['action_head'](h)
77 | actions.append(a)
78 | actions = torch.stack(actions, dim=1)
79 | return actions
80 |
81 | def value(self, obs, act):
82 | batch_size = obs.size(0)
83 | obs_own = obs.clone()
84 | obs = obs.unsqueeze(1).expand(batch_size, self.n_, self.n_, self.obs_dim) # shape = (b, n, o) -> (b, 1, n, o) -> (b, n, n, o)
85 | obs = obs.contiguous().view(batch_size, self.n_, -1) # shape = (b, n, o*n)
86 | inp = torch.cat((obs, obs_own), dim=-1) # shape = (b, n, o*n+o)
87 | values = []
88 | for i in range(self.n_):
89 | # other people actions
90 | act_other = torch.cat((act[:,:i,:].view(batch_size,-1),act[:,i+1:,:].view(batch_size,-1)),dim=-1)
91 | h = torch.relu( self.value_dicts[i]['layer_1'](torch.cat((inp[:, i, :], act_other),dim=-1)) )
92 | h = torch.relu( self.value_dicts[i]['layer_2'](h) )
93 | v = self.value_dicts[i]['value_head'](h)
94 | values.append(v)
95 | values = torch.stack(values, dim=1)
96 | return values
97 |
98 |
99 | def get_loss(self, batch):
100 | batch_size = len(batch.state)
101 | rewards, last_step, done, actions, state, next_state = self.unpack_data(batch)
102 | action_out = self.policy(state) # (b,n,a) action probability
103 | values = self.value(state, actions) # (b,n,a) action value
104 | baselines = torch.sum(values*torch.softmax(action_out, dim=-1), dim=-1) # the only difference to ActorCritic is this baseline (b,n)
105 | values = torch.sum(values*actions, dim=-1) # (b,n)
106 | if self.args.target:
107 | next_action_out = self.target_net.policy(next_state, last_act=actions)
108 | else:
109 | next_action_out = self.policy(next_state, last_act=actions)
110 | next_actions = select_action(self.args, next_action_out, status='train', exploration=False)
111 | if self.args.target:
112 | next_values = self.target_net.value(next_state, next_actions)
113 | else:
114 | next_values = self.value(next_state, next_actions)
115 | next_values = torch.sum(next_values*next_actions, dim=-1) # b*n
116 | # calculate the advantages
117 | returns = cuda_wrapper(torch.zeros((batch_size, self.n_), dtype=torch.float), self.cuda_)
118 | assert values.size() == next_values.size()
119 | assert returns.size() == values.size()
120 | for i in reversed(range(rewards.size(0))):
121 | if last_step[i]:
122 | next_return = 0 if done[i] else next_values[i].detach()
123 | else:
124 | next_return = next_values[i].detach()
125 | returns[i] = rewards[i] + self.args.gamma * next_return
126 | # value loss
127 | deltas = returns - values
128 | value_loss = deltas.pow(2).mean(dim=0)
129 |         # action loss
130 | advantages = ( values - baselines ).detach()
131 | if self.args.normalize_advantages:
132 | advantages = batchnorm(advantages)
133 | log_prob = multinomials_log_density(actions, action_out).contiguous().view(-1, self.n_)
134 | assert log_prob.size() == advantages.size()
135 | action_loss = - advantages * log_prob
136 | action_loss = action_loss.mean(dim=0)
137 | return action_loss, value_loss, action_out
138 |
--------------------------------------------------------------------------------
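The key line in get_loss() of coma_fc.py above is the counterfactual baseline: the expected Q-value under the current policy, i.e. the sum over actions of pi(a|s) * Q(s, a), subtracted from the Q-value of the action actually taken. A toy single-agent illustration with made-up numbers:

import torch

q      = torch.tensor([1.0, 2.0, 0.5])   # Q(s, a) for three discrete actions (made up)
logits = torch.tensor([0.1, 0.7, 0.2])
pi     = torch.softmax(logits, dim=-1)
action = torch.tensor([0.0, 1.0, 0.0])   # one-hot of the taken action

baseline  = (q * pi).sum()               # expected Q under the policy
advantage = (q * action).sum() - baseline
print(baseline.item(), advantage.item()) # ~1.32 and ~0.68: the taken action beats the average
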
/models/independent_ac.py:
--------------------------------------------------------------------------------
1 | import torch
2 | import torch.nn as nn
3 | import numpy as np
4 | from utilities.util import *
5 | from models.model import Model
6 | from collections import namedtuple
7 | from learning_algorithms.actor_critic import *
8 |
9 |
10 |
11 | class IndependentAC(Model):
12 |
13 | def __init__(self, args, target_net=None):
14 | super(IndependentAC, self).__init__(args)
15 | self.construct_model()
16 | self.apply(self.init_weights)
17 | if target_net != None:
18 | self.target_net = target_net
19 | self.reload_params_to_target()
20 | self.Transition = namedtuple('Transition', ('state', 'action', 'reward', 'next_state', 'done', 'last_step'))
21 | self.rl = ActorCritic(self.args)
22 |
23 |
24 | def construct_policy_net(self):
25 | # TODO: fix policy params update
26 | action_dicts = []
27 | if self.args.shared_parameters:
28 | l1 = nn.Linear(self.obs_dim, self.hid_dim)
29 | l2 = nn.Linear(self.hid_dim, self.hid_dim)
30 | a = nn.Linear(self.hid_dim, self.act_dim)
31 | for i in range(self.n_):
32 | action_dicts.append(nn.ModuleDict( {'layer_1': l1,\
33 | 'layer_2': l2,\
34 | 'action_head': a
35 | }
36 | )
37 | )
38 | else:
39 | for i in range(self.n_):
40 | action_dicts.append(nn.ModuleDict( {'layer_1': nn.Linear(self.obs_dim, self.hid_dim),\
41 | 'layer_2': nn.Linear(self.hid_dim, self.hid_dim),\
42 | 'action_head': nn.Linear(self.hid_dim, self.act_dim)
43 | }
44 | )
45 | )
46 | self.action_dicts = nn.ModuleList(action_dicts)
47 |
48 | def construct_value_net(self):
49 | # TODO: policy params update
50 | value_dicts = []
51 | if self.args.shared_parameters:
52 | l1 = nn.Linear(self.obs_dim, self.hid_dim )
53 | l2 = nn.Linear(self.hid_dim, self.hid_dim)
54 | v = nn.Linear(self.hid_dim, self.act_dim)
55 | for i in range(self.n_):
56 | value_dicts.append(nn.ModuleDict( {'layer_1': l1,\
57 | 'layer_2': l2,\
58 | 'value_head': v
59 | }
60 | )
61 | )
62 | else:
63 | for i in range(self.n_):
64 | value_dicts.append(nn.ModuleDict( {'layer_1': nn.Linear(self.obs_dim, self.hid_dim ),\
65 | 'layer_2': nn.Linear(self.hid_dim, self.hid_dim),\
66 | 'value_head': nn.Linear(self.hid_dim, self.act_dim)
67 | }
68 | )
69 | )
70 | self.value_dicts = nn.ModuleList(value_dicts)
71 |
72 | def construct_model(self):
73 | self.construct_value_net()
74 | self.construct_policy_net()
75 |
76 | def policy(self, obs, schedule=None, last_act=None, last_hid=None, info={}, stat={}):
77 | # TODO: policy params update
78 | actions = []
79 | for i in range(self.n_):
80 | h = torch.relu( self.action_dicts[i]['layer_1'](obs[:, i, :]) )
81 | h = torch.relu( self.action_dicts[i]['layer_2'](h) )
82 | a = self.action_dicts[i]['action_head'](h)
83 | actions.append(a)
84 | actions = torch.stack(actions, dim=1)
85 | return actions
86 |
87 |
88 | def value(self, obs, act=None):
89 | # TODO: policy params update
90 | values = []
91 | for i in range(self.n_):
92 | h = torch.relu( self.value_dicts[i]['layer_1'](obs[:,i,:]) )
93 | h = torch.relu( self.value_dicts[i]['layer_2'](h) )
94 | v = self.value_dicts[i]['value_head'](h)
95 | values.append(v)
96 | values = torch.stack(values, dim=1)
97 | return values
98 |
99 | def get_loss(self, batch):
100 | action_loss, value_loss, log_p_a = self.rl.get_loss(batch, self, self.target_net)
101 | return action_loss, value_loss, log_p_a
102 |
--------------------------------------------------------------------------------
/models/independent_ddpg.py:
--------------------------------------------------------------------------------
1 | import torch
2 | import torch.nn as nn
3 | import numpy as np
4 | from utilities.util import *
5 | from models.model import Model
6 | from learning_algorithms.ddpg import *
7 | from collections import namedtuple
8 |
9 |
10 |
11 | class IndependentDDPG(Model):
12 |
13 | def __init__(self, args, target_net=None):
14 | super(IndependentDDPG, self).__init__(args)
15 | self.construct_model()
16 | self.apply(self.init_weights)
17 | if target_net != None:
18 | self.target_net = target_net
19 | self.reload_params_to_target()
20 | self.Transition = namedtuple('Transition', ('state', 'action', 'reward', 'next_state', 'done', 'last_step'))
21 | self.rl = DDPG(self.args)
22 |
23 | def construct_policy_net(self):
24 | action_dicts = []
25 | if self.args.shared_parameters:
26 | l1 = nn.Linear(self.obs_dim, self.hid_dim)
27 | l2 = nn.Linear(self.hid_dim, self.hid_dim)
28 | a = nn.Linear(self.hid_dim, self.act_dim)
29 | for i in range(self.n_):
30 | action_dicts.append(nn.ModuleDict( {'layer_1': l1,\
31 | 'layer_2': l2,\
32 | 'action_head': a
33 | }
34 | )
35 | )
36 | else:
37 | for i in range(self.n_):
38 | action_dicts.append(nn.ModuleDict( {'layer_1': nn.Linear(self.obs_dim, self.hid_dim),\
39 | 'layer_2': nn.Linear(self.hid_dim, self.hid_dim),\
40 | 'action_head': nn.Linear(self.hid_dim, self.act_dim)
41 | }
42 | )
43 | )
44 | self.action_dicts = nn.ModuleList(action_dicts)
45 |
46 | def construct_value_net(self):
47 | value_dicts = []
48 | if self.args.shared_parameters:
49 | l1 = nn.Linear(self.obs_dim+self.act_dim, self.hid_dim )
50 | l2 = nn.Linear(self.hid_dim, self.hid_dim)
51 | v = nn.Linear(self.hid_dim, 1)
52 | for i in range(self.n_):
53 | value_dicts.append(nn.ModuleDict( {'layer_1': l1,\
54 | 'layer_2': l2,\
55 | 'value_head': v
56 | }
57 | )
58 | )
59 | else:
60 | for i in range(self.n_):
61 | value_dicts.append(nn.ModuleDict( {'layer_1': nn.Linear(self.obs_dim+self.act_dim, self.hid_dim ),\
62 | 'layer_2': nn.Linear(self.hid_dim, self.hid_dim),\
63 | 'value_head': nn.Linear(self.hid_dim, 1)
64 | }
65 | )
66 | )
67 | self.value_dicts = nn.ModuleList(value_dicts)
68 |
69 | def construct_model(self):
70 | self.construct_value_net()
71 | self.construct_policy_net()
72 |
73 | def policy(self, obs, schedule=None, last_act=None, last_hid=None, info={}, stat={}):
74 | actions = []
75 | for i in range(self.n_):
76 | h = torch.relu( self.action_dicts[i]['layer_1'](obs[:, i, :]) )
77 | h = torch.relu( self.action_dicts[i]['layer_2'](h) )
78 | a = self.action_dicts[i]['action_head'](h)
79 | actions.append(a)
80 | actions = torch.stack(actions, dim=1)
81 | return actions
82 |
83 | def value(self, obs, act):
84 | values = []
85 | for i in range(self.n_):
86 | h = torch.relu( self.value_dicts[i]['layer_1']( torch.cat((obs[:,i,:],act[:,i,:]),dim=-1)))
87 | h = torch.relu( self.value_dicts[i]['layer_2'](h) )
88 | v = self.value_dicts[i]['value_head'](h)
89 | values.append(v)
90 | values = torch.stack(values, dim=1)
91 | return values
92 |
93 | def get_loss(self, batch):
94 | action_loss, value_loss, log_p_a = self.rl.get_loss(batch, self, self.target_net)
95 | return action_loss, value_loss, log_p_a
96 |
--------------------------------------------------------------------------------
/models/maddpg.py:
--------------------------------------------------------------------------------
1 | import torch
2 | import torch.nn as nn
3 | import numpy as np
4 | from utilities.util import *
5 | from models.model import Model
6 | from learning_algorithms.ddpg import *
7 | from collections import namedtuple
8 |
9 |
10 |
11 | class MADDPG(Model):
12 |
13 | def __init__(self, args, target_net=None):
14 | super(MADDPG, self).__init__(args)
15 | self.construct_model()
16 | self.apply(self.init_weights)
17 | if target_net != None:
18 | self.target_net = target_net
19 | self.reload_params_to_target()
20 | self.Transition = namedtuple('Transition', ('state', 'action', 'reward', 'next_state', 'done', 'last_step'))
21 |
22 | def construct_policy_net(self):
23 | # TODO: fix policy params update
24 | action_dicts = []
25 | if self.args.shared_parameters:
26 | l1 = nn.Linear(self.obs_dim, self.hid_dim)
27 | l2 = nn.Linear(self.hid_dim, self.hid_dim)
28 | a = nn.Linear(self.hid_dim, self.act_dim)
29 | for i in range(self.n_):
30 | action_dicts.append(nn.ModuleDict( {'layer_1': l1,\
31 | 'layer_2': l2,\
32 | 'action_head': a
33 | }
34 | )
35 | )
36 | else:
37 | for i in range(self.n_):
38 | action_dicts.append(nn.ModuleDict( {'layer_1': nn.Linear(self.obs_dim, self.hid_dim),\
39 | 'layer_2': nn.Linear(self.hid_dim, self.hid_dim),\
40 | 'action_head': nn.Linear(self.hid_dim, self.act_dim)
41 | }
42 | )
43 | )
44 | self.action_dicts = nn.ModuleList(action_dicts)
45 |
46 | def construct_value_net(self):
47 | # TODO: policy params update
48 | value_dicts = []
49 | if self.args.shared_parameters:
50 | l1 = nn.Linear( (self.obs_dim+self.act_dim)*self.n_, self.hid_dim )
51 | l2 = nn.Linear(self.hid_dim, self.hid_dim)
52 | v = nn.Linear(self.hid_dim, 1)
53 | for i in range(self.n_):
54 | value_dicts.append(nn.ModuleDict( {'layer_1': l1,\
55 | 'layer_2': l2,\
56 | 'value_head': v
57 | }
58 | )
59 | )
60 | else:
61 | for i in range(self.n_):
62 | value_dicts.append(nn.ModuleDict( {'layer_1': nn.Linear( (self.obs_dim+self.act_dim)*self.n_, self.hid_dim ),\
63 | 'layer_2': nn.Linear(self.hid_dim, self.hid_dim),\
64 | 'value_head': nn.Linear(self.hid_dim, 1)
65 | }
66 | )
67 | )
68 | self.value_dicts = nn.ModuleList(value_dicts)
69 |
70 | def construct_model(self):
71 | self.construct_value_net()
72 | self.construct_policy_net()
73 |
74 | def policy(self, obs, schedule=None, last_act=None, last_hid=None, info={}, stat={}):
75 | # TODO: policy params update
76 | actions = []
77 | for i in range(self.n_):
78 | h = torch.relu( self.action_dicts[i]['layer_1'](obs[:, i, :]) )
79 | h = torch.relu( self.action_dicts[i]['layer_2'](h) )
80 | a = self.action_dicts[i]['action_head'](h)
81 | actions.append(a)
82 | actions = torch.stack(actions, dim=1)
83 | return actions
84 |
85 | def value(self, obs, act):
86 | # TODO: policy params update
87 | values = []
88 | for i in range(self.n_):
89 | h = torch.relu( self.value_dicts[i]['layer_1']( torch.cat( ( obs.contiguous().view( -1, np.prod(obs.size()[1:]) ), act.contiguous().view( -1, np.prod(act.size()[1:]) ) ), dim=-1 ) ) )
90 | h = torch.relu( self.value_dicts[i]['layer_2'](h) )
91 | v = self.value_dicts[i]['value_head'](h)
92 | values.append(v)
93 | values = torch.stack(values, dim=1)
94 | return values
95 |
96 | def get_loss(self, batch):
97 | # TODO: fix policy params update
98 | batch_size = len(batch.state)
99 | # collect the transition data
100 | rewards, last_step, done, actions, state, next_state = self.unpack_data(batch)
101 | # construct the computational graph
102 | # do the argmax action on the action loss
103 | action_out = self.policy(state)
104 | actions_ = select_action(self.args, action_out, status='train', exploration=False)
105 | values_ = self.value(state, actions_).contiguous().view(-1, self.n_)
106 | # do the exploration action on the value loss
107 | values = self.value(state, actions).contiguous().view(-1, self.n_)
108 | # do the argmax action on the next value loss
109 | next_action_out = self.target_net.policy(next_state)
110 | next_actions_ = select_action(self.args, next_action_out, status='train', exploration=False)
111 | next_values_ = self.target_net.value(next_state, next_actions_.detach()).contiguous().view(-1, self.n_)
112 | returns = cuda_wrapper(torch.zeros((batch_size, self.n_), dtype=torch.float), self.cuda_)
113 | assert values_.size() == next_values_.size()
114 | assert returns.size() == values.size()
115 | for i in reversed(range(rewards.size(0))):
116 | if last_step[i]:
117 | next_return = 0 if done[i] else next_values_[i].detach()
118 | else:
119 | next_return = next_values_[i].detach()
120 | returns[i] = rewards[i] + self.args.gamma * next_return
121 | deltas = returns - values
122 | advantages = values_
123 | # advantages = advantages.contiguous().view(-1, 1)
124 | if self.args.normalize_advantages:
125 | advantages = batchnorm(advantages)
126 | action_loss = -advantages
127 | action_loss = action_loss.mean(dim=0)
128 | value_loss = deltas.pow(2).mean(dim=0)
129 | return action_loss, value_loss, action_out
130 |
--------------------------------------------------------------------------------
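In value() of maddpg.py above, every agent's critic consumes the flattened observations and actions of all agents, which is why layer_1 has input size (obs_dim + act_dim) * n. A quick shape sketch with hypothetical sizes:

import torch

b, n, obs_dim, act_dim = 4, 3, 18, 5      # hypothetical batch and agent sizes
obs = torch.randn(b, n, obs_dim)
act = torch.randn(b, n, act_dim)
inp = torch.cat((obs.contiguous().view(b, -1),
                 act.contiguous().view(b, -1)), dim=-1)
print(inp.shape)  # torch.Size([4, 69]) == (b, (obs_dim + act_dim) * n)
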
/models/model.py:
--------------------------------------------------------------------------------
1 | import torch
2 | import torch.nn as nn
3 | import numpy as np
4 | from utilities.util import *
5 |
6 |
7 |
8 | class Model(nn.Module):
9 |
10 | def __init__(self, args):
11 | super(Model, self).__init__()
12 | self.args = args
13 | self.cuda_ = torch.cuda.is_available() and self.args.cuda
14 | self.n_ = self.args.agent_num
15 | self.hid_dim = self.args.hid_size
16 | self.obs_dim = self.args.obs_size
17 | self.act_dim = self.args.action_dim
18 |
19 | def reload_params_to_target(self):
20 | self.target_net.action_dicts.load_state_dict( self.action_dicts.state_dict() )
21 | self.target_net.value_dicts.load_state_dict( self.value_dicts.state_dict() )
22 |
23 | def update_target(self):
24 | for name, param in self.target_net.action_dicts.state_dict().items():
25 | update_params = (1 - self.args.target_lr) * param + self.args.target_lr * self.action_dicts.state_dict()[name]
26 | self.target_net.action_dicts.state_dict()[name].copy_(update_params)
27 | for name, param in self.target_net.value_dicts.state_dict().items():
28 | update_params = (1 - self.args.target_lr) * param + self.args.target_lr * self.value_dicts.state_dict()[name]
29 | self.target_net.value_dicts.state_dict()[name].copy_(update_params)
30 |
31 | def transition_update(self, trainer, trans, stat):
32 | if self.args.replay:
33 | trainer.replay_buffer.add_experience(trans)
34 | replay_cond = trainer.steps>self.args.replay_warmup\
35 | and len(trainer.replay_buffer.buffer)>=self.args.batch_size\
36 | and trainer.steps%self.args.behaviour_update_freq==0
37 | if replay_cond:
38 | for _ in range(self.args.critic_update_times):
39 | trainer.value_replay_process(stat)
40 | trainer.action_replay_process(stat)
41 | # TODO: hard code
42 | # clear replay buffer for on policy algorithm
43 | if self.__class__.__name__ in ["COMAFC","MFAC","IndependentAC"] :
44 | trainer.replay_buffer.clear()
45 | else:
46 | trans_cond = trainer.steps%self.args.behaviour_update_freq==0
47 | if trans_cond:
48 | for _ in range(self.args.critic_update_times):
49 | trainer.value_replay_process(stat)
50 | trainer.action_transition_process(stat, trans)
51 | if self.args.target:
52 | target_cond = trainer.steps%self.args.target_update_freq==0
53 | if target_cond:
54 | self.update_target()
55 |
56 | def episode_update(self, trainer, episode, stat):
57 | if self.args.replay:
58 | trainer.replay_buffer.add_experience(episode)
59 | replay_cond = trainer.episodes>self.args.replay_warmup\
60 | and len(trainer.replay_buffer.buffer)>=self.args.batch_size\
61 | and trainer.episodes%self.args.behaviour_update_freq==0
62 | if replay_cond:
63 | for _ in range(self.args.critic_update_times):
64 | trainer.value_replay_process(stat)
65 | trainer.action_replay_process(stat)
66 | else:
67 | episode = self.Transition(*zip(*episode))
68 | episode_cond = trainer.episodes%self.args.behaviour_update_freq==0
69 | if episode_cond:
70 | for _ in range(self.args.critic_update_times):
71 | trainer.value_replay_process(stat)
72 | trainer.action_transition_process(stat)
73 |
74 | def construct_model(self):
75 | raise NotImplementedError()
76 |
77 | def get_agent_mask(self, batch_size, info):
78 | '''
79 | define the getter of agent mask to confirm the living agent
80 | '''
81 | if 'alive_mask' in info:
82 | agent_mask = torch.from_numpy(info['alive_mask'])
83 | num_agents_alive = agent_mask.sum()
84 | else:
85 | agent_mask = torch.ones(self.n_)
86 | num_agents_alive = self.n_
87 | # shape = (1, 1, n)
88 | agent_mask = agent_mask.view(1, 1, self.n_)
89 | # shape = (batch_size, n ,n, 1)
90 | agent_mask = cuda_wrapper(agent_mask.expand(batch_size, self.n_, self.n_).unsqueeze(-1), self.cuda_)
91 | return num_agents_alive, agent_mask
92 |
93 | def policy(self, obs, last_act=None, last_hid=None, gate=None, info={}, stat={}):
94 | raise NotImplementedError()
95 |
96 | def value(self, obs, act):
97 | raise NotImplementedError()
98 |
99 | def construct_policy_net(self):
100 | raise NotImplementedError()
101 |
102 | def construct_value_net(self):
103 | raise NotImplementedError()
104 |
105 | def init_weights(self, m):
106 | '''
107 | initialize the weights of parameters
108 | '''
109 | if type(m) == nn.Linear:
110 | m.weight.data.normal_(0, self.args.init_std)
111 |
112 | def get_loss(self):
113 | raise NotImplementedError()
114 |
115 | def credit_assignment_demo(self, obs, act):
116 | assert isinstance(obs, np.ndarray)
117 | assert isinstance(act, np.ndarray)
118 | obs = cuda_wrapper(torch.tensor(obs).float(), self.cuda_)
119 | act = cuda_wrapper(torch.tensor(act).float(), self.cuda_)
120 | values = self.value(obs, act)
121 | return values
122 |
123 |
124 | def train_process(self, stat, trainer):
125 | info = {}
126 | state = trainer.env.reset()
127 |         if self.args.reward_record_type == 'episode_mean_step':
128 | trainer.mean_reward = 0
129 | trainer.mean_success = 0
130 |
131 | for t in range(self.args.max_steps):
132 | state_ = cuda_wrapper(prep_obs(state).contiguous().view(1, self.n_, self.obs_dim), self.cuda_)
133 | start_step = True if t == 0 else False
134 | state_ = cuda_wrapper(prep_obs(state).contiguous().view(1, self.n_, self.obs_dim), self.cuda_)
135 | action_out = self.policy(state_, info=info, stat=stat)
136 | action = select_action(self.args, action_out, status='train', info=info)
137 | _, actual = translate_action(self.args, action, trainer.env)
138 | next_state, reward, done, debug = trainer.env.step(actual)
139 | if isinstance(done, list): done = np.sum(done)
140 | done_ = done or t==self.args.max_steps-1
141 | trans = self.Transition(state,
142 | action.cpu().numpy(),
143 | np.array(reward),
144 | next_state,
145 | done,
146 | done_
147 | )
148 | self.transition_update(trainer, trans, stat)
149 | success = debug['success'] if 'success' in debug else 0.0
150 | trainer.steps += 1
151 |             if self.args.reward_record_type == 'mean_step':
152 | trainer.mean_reward = trainer.mean_reward + 1/trainer.steps*(np.mean(reward) - trainer.mean_reward)
153 | trainer.mean_success = trainer.mean_success + 1/trainer.steps*(success - trainer.mean_success)
154 |             elif self.args.reward_record_type == 'episode_mean_step':
155 | trainer.mean_reward = trainer.mean_reward + 1/(t+1)*(np.mean(reward) - trainer.mean_reward)
156 | trainer.mean_success = trainer.mean_success + 1/(t+1)*(success - trainer.mean_success)
157 | else:
158 | raise RuntimeError('Please enter a correct reward record type, e.g. mean_step or episode_mean_step.')
159 | stat['mean_reward'] = trainer.mean_reward
160 | stat['mean_success'] = trainer.mean_success
161 | if done_:
162 | break
163 | state = next_state
164 | stat['turn'] = t + 1
165 | trainer.episodes += 1
166 |
167 |
168 | def unpack_data(self, batch):
169 | batch_size = len(batch.state)
170 | rewards = cuda_wrapper(torch.tensor(batch.reward, dtype=torch.float), self.cuda_)
171 | last_step = cuda_wrapper(torch.tensor(batch.last_step, dtype=torch.float).contiguous().view(-1, 1), self.cuda_)
172 | done = cuda_wrapper(torch.tensor(batch.done, dtype=torch.float).contiguous().view(-1, 1), self.cuda_)
173 | actions = cuda_wrapper(torch.tensor(np.stack(list(zip(*batch.action))[0], axis=0), dtype=torch.float), self.cuda_)
174 | state = cuda_wrapper(prep_obs(list(zip(batch.state))), self.cuda_)
175 | next_state = cuda_wrapper(prep_obs(list(zip(batch.next_state))), self.cuda_)
176 | return (rewards, last_step, done, actions, state, next_state)
177 |
--------------------------------------------------------------------------------
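The `Model` base class above drives the training loop but leaves several hooks abstract (`construct_policy_net`, `construct_value_net`, `policy`, `value`, `get_loss`). The sketch below is not part of the repository; it is a hypothetical minimal subclass (`IllustrativeModel`, with tiny linear heads) intended only to show which hooks a concrete model in `models/` is expected to fill in. The attribute names `self.n_`, `self.obs_dim`, `self.act_dim`, `self.action_dicts` and `self.value_dicts` follow the usage visible in `models/model.py`, `models/random.py` and `utilities/trainer.py`.

```python
import torch
import torch.nn as nn

from models.model import Model


class IllustrativeModel(Model):
    '''Hypothetical example; the real models live in models/*.py.'''

    def construct_policy_net(self):
        # one small policy head per agent (the real models use deeper networks)
        self.action_dicts = nn.ModuleList(
            [nn.Linear(self.obs_dim, self.act_dim) for _ in range(self.n_)])

    def construct_value_net(self):
        # one critic head per agent over that agent's (obs, act) pair
        self.value_dicts = nn.ModuleList(
            [nn.Linear(self.obs_dim + self.act_dim, 1) for _ in range(self.n_)])

    def construct_model(self):
        self.construct_value_net()
        self.construct_policy_net()

    def policy(self, obs, last_act=None, last_hid=None, gate=None, info={}, stat={}):
        # obs: (batch, n, obs_dim) -> per-agent action logits (or means)
        return torch.stack(
            [head(obs[:, i, :]) for i, head in enumerate(self.action_dicts)], dim=1)

    # value() and get_loss() would also need to be overridden; they are omitted here.
```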
/models/random.py:
--------------------------------------------------------------------------------
1 | import torch
2 | import torch.nn as nn
3 | import numpy as np
4 | from utilities.util import *
5 | from models.model import Model
6 | from learning_algorithms.ddpg import *
7 | from collections import namedtuple
8 |
9 |
10 |
11 | class RandomAgent(Model):
12 |
13 | def __init__(self, args):
14 | super(RandomAgent, self).__init__(args)
15 | self.args = args
16 |
17 | def policy(self, obs, schedule=None, last_act=None, last_hid=None, info={}, stat={}):
18 | actions = []
19 | tensor = torch.cuda.FloatTensor if self.args.cuda else torch.FloatTensor
20 | actions = tensor([[1.0]*self.act_dim]*self.n_)
21 | return actions
22 |
--------------------------------------------------------------------------------
/test.py:
--------------------------------------------------------------------------------
1 | import torch
2 | from utilities.tester import *
3 | from arguments import *
4 | import argparse
5 |
6 |
7 | parser = argparse.ArgumentParser(description='Test rl agent.')
8 | parser.add_argument('--save-model-dir', type=str, nargs='?', default='./model_save/', help='Please input the directory where the model is saved.')
9 | parser.add_argument('--render', action='store_true', help='Set this flag to render the environment during testing.')
10 | parser.add_argument('--episodes', type=int, default=10, help='Please input the number of test episodes.')
11 |
12 | argv = parser.parse_args()
13 |
14 | model = Model[model_name]
15 |
16 | strategy = Strategy[model_name]
17 |
18 | if argv.save_model_dir[-1] == '/':
19 | save_path = argv.save_model_dir
20 | else:
21 | save_path = argv.save_model_dir+'/'
22 |
23 | PATH = save_path + log_name + '/model.pt'
24 |
25 | if args.target:
26 | target_net = model(args)
27 | behaviour_net = model(args, target_net)
28 | else:
29 | behaviour_net = model(args)
30 |
31 | checkpoint = torch.load(PATH, map_location='cpu') if not args.cuda else torch.load(PATH)
32 | behaviour_net.load_state_dict(checkpoint['model_state_dict'])
33 |
34 | if strategy == 'pg':
35 | test = PGTester(env(), behaviour_net, args)
36 | elif strategy == 'q':
37 | raise NotImplementedError('This needs to be implemented.')
38 | else:
39 | raise RuntimeError('Please input the correct strategy, e.g. pg or q.')
40 |
41 | print(args)
42 | test.run_game(episodes=argv.episodes, render=argv.render)
43 | test.print_info()
44 |
--------------------------------------------------------------------------------
/test.sh:
--------------------------------------------------------------------------------
1 | #!/bin/bash
2 |
3 | EXP_NAME="spread_sqddpg"
4 |
5 | cp ./args/$EXP_NAME.py arguments.py
6 | python -u test.py
7 |
--------------------------------------------------------------------------------
/train.py:
--------------------------------------------------------------------------------
1 | import numpy as np
2 | from utilities.trainer import *
3 | import torch
4 | from arguments import *
5 | import os
6 | from utilities.util import *
7 | from utilities.logger import Logger
8 | import argparse
9 |
10 |
11 |
12 | parser = argparse.ArgumentParser(description='Train rl agent.')
13 | parser.add_argument('--save-path', type=str, nargs='?', default='./', help='Please input the directory for saving the model and tensorboard logs.')
14 | argv = parser.parse_args()
15 |
16 |
17 |
18 | if argv.save_path[-1] == '/':
19 | save_path = argv.save_path
20 | else:
21 | save_path = argv.save_path+'/'
22 |
23 | # create save folders
24 | if 'model_save' not in os.listdir(save_path):
25 | os.mkdir(save_path+'model_save')
26 | if 'tensorboard' not in os.listdir(save_path):
27 | os.mkdir(save_path+'tensorboard')
28 | if log_name not in os.listdir(save_path+'model_save/'):
29 | os.mkdir(save_path+'model_save/'+log_name)
30 | if log_name not in os.listdir(save_path+'tensorboard/'):
31 | os.mkdir(save_path+'tensorboard/'+log_name)
32 | else:
33 | path = save_path+'tensorboard/'+log_name
34 | for f in os.listdir(path):
35 | file_path = os.path.join(path,f)
36 | if os.path.isfile(file_path):
37 | os.remove(file_path)
38 |
39 | logger = Logger(save_path+'tensorboard/' + log_name)
40 |
41 | model = Model[model_name]
42 |
43 | strategy = Strategy[model_name]
44 |
45 | print ( '{}\n'.format(args) )
46 |
47 | if strategy == 'pg':
48 | train = PGTrainer(args, model, env(), logger, args.online)
49 | elif strategy == 'q':
50 | raise NotImplementedError('This needs to be implemented.')
51 | else:
52 | raise RuntimeError('Please input the correct strategy, e.g. pg or q.')
53 |
54 | stat = dict()
55 |
56 | for i in range(args.train_episodes_num):
57 | train.run(stat)
58 | train.logging(stat)
59 | if i%args.save_model_freq == args.save_model_freq-1:
60 | train.print_info(stat)
61 | torch.save({'model_state_dict': train.behaviour_net.state_dict()}, save_path+'model_save/'+log_name+'/model.pt')
62 | print ('The model is saved!\n')
63 | with open(save_path+'model_save/'+log_name +'/log.txt', 'w+') as file:
64 | file.write(str(args)+'\n')
65 | file.write(str(i))
66 |
--------------------------------------------------------------------------------
/train.sh:
--------------------------------------------------------------------------------
1 | #!/bin/bash
2 | # sh train.sh
3 |
4 | EXP_NAME="simple_tag_sqddpg"
5 | ALIAS=""
6 | export CUDA_DEVICE_ORDER=PCI_BUS_ID
7 | export CUDA_VISIBLE_DEVICES=0
8 |
9 | if [ ! -d "./model_save" ]
10 | then
11 | mkdir ./model_save
12 | fi
13 |
14 | mkdir ./model_save/$EXP_NAME$ALIAS
15 | cp ./args/$EXP_NAME.py arguments.py
16 | python -u train.py > ./model_save/$EXP_NAME$ALIAS/exp.out &
17 | echo $! > ./model_save/$EXP_NAME$ALIAS/exp.pid
18 |
--------------------------------------------------------------------------------
/utilities/gym_wrapper.py:
--------------------------------------------------------------------------------
1 | from gym import spaces
2 |
3 |
4 | class GymWrapper(object):
5 |
6 | def __init__(self, env):
7 | self.env = env
8 | self.obs_space = self.env.observation_space
9 | self.act_space = self.env.action_space
10 |
11 | def __call__(self):
12 | return self.env
13 |
14 | def get_num_of_agents(self):
15 | return self.env.n
16 |
17 | def get_shape_of_obs(self):
18 | obs_shapes = []
19 | for obs in self.obs_space:
20 | if isinstance(obs, spaces.Box):
21 | obs_shapes.append(obs.shape)
22 | assert len(self.obs_space) == len(obs_shapes)
23 | return obs_shapes
24 |
25 | def get_output_shape_of_act(self):
26 | act_shapes = []
27 | for act in self.act_space:
28 | if isinstance(act, spaces.Discrete):
29 | act_shapes.append(act.n)
30 | elif isinstance(act, spaces.MultiDiscrete):
31 | act_shapes.append(act.high - act.low + 1)
32 |             elif isinstance(act, spaces.Box):
33 |                 assert len(act.shape) == 1
34 | act_shapes.append(act.shape)
35 | return act_shapes
36 |
37 | def get_dtype_of_obs(self):
38 | return [obs.dtype for obs in self.obs_space]
39 |
40 | def get_input_shape_of_act(self):
41 | act_shapes = []
42 | for act in self.act_space:
43 | if isinstance(act, spaces.Discrete):
44 | act_shapes.append(act.n)
45 | elif isinstance(act, spaces.MultiDiscrete):
46 | act_shapes.append(act.shape)
47 |             elif isinstance(act, spaces.Box):
48 |                 assert len(act.shape) == 1
49 | act_shapes.append(act.shape)
50 | return act_shapes
51 |
--------------------------------------------------------------------------------
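`GymWrapper` is a thin adapter that exposes per-agent shape information of a multi-agent gym environment. The snippet below is an illustrative usage sketch, not part of the repository; it assumes that `make_env.py` from `environments/multiagent_particle_envs` is on the Python path and exposes a `make_env(scenario_name)` helper.

```python
from make_env import make_env              # assumed helper from multiagent_particle_envs
from utilities.gym_wrapper import GymWrapper

env = GymWrapper(make_env('simple_spread'))
print(env.get_num_of_agents())             # number of agents in the scenario
print(env.get_shape_of_obs())              # one observation shape per agent
print(env.get_input_shape_of_act())        # per-agent action sizes fed to the policy
raw_env = env()                            # __call__ returns the wrapped gym environment
```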
/utilities/inspector.py:
--------------------------------------------------------------------------------
1 | import numpy as np
2 |
3 |
4 |
5 | def inspector(args):
6 |     if args.model_name == 'maddpg':
7 | assert args.replay is True
8 | assert args.q_func is True
9 | assert args.target is True
10 | assert args.gumbel_softmax is True
11 | assert args.epsilon_softmax is False
12 | assert args.online is True
13 |     elif args.model_name == 'independent_ac':
14 | assert args.replay is True
15 | assert args.q_func is True
16 | assert args.target is True
17 | assert args.online is True
18 | assert args.gumbel_softmax is False
19 | assert args.epsilon_softmax is False
20 |     elif args.model_name == 'independent_ddpg':
21 | assert args.replay is True
22 | assert args.q_func is False
23 | assert args.target is True
24 | assert args.online is True
25 | assert args.gumbel_softmax is True
26 | assert args.epsilon_softmax is False
27 |     elif args.model_name == 'sqddpg':
28 | assert args.replay is True
29 | assert args.q_func is True
30 | assert args.target is True
31 | assert args.gumbel_softmax is True
32 | assert args.epsilon_softmax is False
33 | assert args.online is True
34 | assert hasattr(args, 'sample_size')
35 |     elif args.model_name == 'coma_fc':
36 | assert args.replay is True
37 | assert args.q_func is True
38 | assert args.target is True
39 | assert args.online is True
40 | assert args.continuous is False
41 | assert args.gumbel_softmax is False
42 | assert args.epsilon_softmax is False
43 | else:
44 |         raise NotImplementedError('The model {} is not supported by the inspector!'.format(args.model_name))
45 |
--------------------------------------------------------------------------------
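`inspector()` only asserts that the flags copied from `args/<exp>.py` into `arguments.py` are consistent with the chosen model. Below is a minimal sketch (not part of the repository), using `argparse.Namespace` as a stand-in for the real arguments module, of the flags the `sqddpg` branch expects.

```python
from argparse import Namespace
from utilities.inspector import inspector

# flag values mirroring the sqddpg branch of inspector() above
args = Namespace(model_name='sqddpg', replay=True, q_func=True, target=True,
                 gumbel_softmax=True, epsilon_softmax=False, online=True,
                 sample_size=10)
inspector(args)   # passes silently; an inconsistent flag raises AssertionError
```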
/utilities/logger.py:
--------------------------------------------------------------------------------
1 | # Code referenced from https://gist.github.com/gyglim/1f8dfb1b5c82627ae3efcfbbadb9f514
2 | import tensorflow as tf
3 | import numpy as np
4 | import scipy.misc
5 | try:
6 | from StringIO import StringIO # Python 2.7
7 | except ImportError:
8 | from io import BytesIO # Python 3.x
9 |
10 |
11 | class Logger(object):
12 |
13 | def __init__(self, log_dir):
14 | """Create a summary writer logging to log_dir."""
15 | self.writer = tf.summary.FileWriter(log_dir)
16 |
17 | def scalar_summary(self, tag, value, step):
18 | """Log a scalar variable."""
19 | summary = tf.Summary(value=[tf.Summary.Value(tag=tag, simple_value=value)])
20 | self.writer.add_summary(summary, step)
21 |
22 | def image_summary(self, tag, images, step):
23 | """Log a list of images."""
24 |
25 | img_summaries = []
26 | for i, img in enumerate(images):
27 | # Write the image to a string
28 | try:
29 | s = StringIO()
30 | except:
31 | s = BytesIO()
32 | scipy.misc.toimage(img).save(s, format="png")
33 |
34 | # Create an Image object
35 | img_sum = tf.Summary.Image(encoded_image_string=s.getvalue(),
36 | height=img.shape[0],
37 | width=img.shape[1])
38 | # Create a Summary value
39 | img_summaries.append(tf.Summary.Value(tag='%s/%d' % (tag, i), image=img_sum))
40 |
41 | # Create and write Summary
42 | summary = tf.Summary(value=img_summaries)
43 | self.writer.add_summary(summary, step)
44 |
45 | def hist_summary(self, tag, values, step, bins=1000):
46 | """Log a histogram of the tensor of values."""
47 |
48 | # Create a histogram using numpy
49 | counts, bin_edges = np.histogram(values, bins=bins)
50 |
51 | # Fill the fields of the histogram proto
52 | hist = tf.HistogramProto()
53 | hist.min = float(np.min(values))
54 | hist.max = float(np.max(values))
55 | hist.num = int(np.prod(values.shape))
56 | hist.sum = float(np.sum(values))
57 | hist.sum_squares = float(np.sum(values**2))
58 |
59 | # Drop the start of the first bin
60 | bin_edges = bin_edges[1:]
61 |
62 | # Add bin edges and counts
63 | for edge in bin_edges:
64 | hist.bucket_limit.append(edge)
65 | for c in counts:
66 | hist.bucket.append(c)
67 |
68 | # Create and write Summary
69 | summary = tf.Summary(value=[tf.Summary.Value(tag=tag, histo=hist)])
70 | self.writer.add_summary(summary, step)
71 | self.writer.flush()
72 |
--------------------------------------------------------------------------------
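`Logger` wraps the TensorFlow 1.x summary writer used by the trainer. A minimal usage sketch (not part of the repository), assuming TensorFlow r1.x is installed and `./tensorboard/demo` is a scratch directory:

```python
from utilities.logger import Logger

logger = Logger('./tensorboard/demo')      # hypothetical log directory
for step, reward in enumerate([0.1, 0.3, 0.2]):
    logger.scalar_summary('mean_reward', reward, step)
logger.writer.flush()                      # make sure the events are written to disk
```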
/utilities/replay_buffer.py:
--------------------------------------------------------------------------------
1 | import numpy as np
2 |
3 |
4 | class TransReplayBuffer(object):
5 |
6 | def __init__(self, size):
7 | self.size = size
8 | self.buffer = []
9 |
10 | def get_single(self, index):
11 | return self.buffer[index]
12 |
13 | def offset(self):
14 | self.buffer.pop(0)
15 |
16 | def get_batch(self, batch_size):
17 | length = len(self.buffer)
18 | indices = np.random.choice(length, batch_size, replace=False)
19 | batch_buffer = [self.buffer[i] for i in indices]
20 | return batch_buffer
21 |
22 | def add_experience(self, trans):
23 | est_len = 1 + len(self.buffer)
24 | if est_len > self.size:
25 | self.offset()
26 | self.buffer.append(trans)
27 |
28 | def clear(self):
29 | self.buffer = []
30 |
31 |
32 |
33 | class EpisodeReplayBuffer(object):
34 |
35 | def __init__(self, size):
36 | self.size = size
37 | self.buffer = []
38 |
39 | def get_single(self, index):
40 | return self.buffer[index]
41 |
42 | def offset(self):
43 | self.buffer.pop(0)
44 |
45 | def get_batch(self, batch_size):
46 | length = len(self.buffer)
47 | indices = np.random.choice(length, batch_size, replace=False)
48 | batch_buffer = []
49 | for i in indices:
50 | batch_buffer.extend(self.buffer[i])
51 | return batch_buffer
52 |
53 | def add_experience(self, episode):
54 | est_len = 1 + len(self.buffer)
55 | if est_len > self.size:
56 | self.offset()
57 | self.buffer.append(episode)
58 |
--------------------------------------------------------------------------------
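Both buffers evict the oldest entry once `size` is exceeded and sample without replacement. A minimal sketch (not part of the repository) with a toy `Transition`; the real `Transition` namedtuple is defined by each model:

```python
from collections import namedtuple
from utilities.replay_buffer import TransReplayBuffer

Transition = namedtuple('Transition',
                        ('state', 'action', 'reward', 'next_state', 'done', 'last_step'))

buffer = TransReplayBuffer(size=3)
for t in range(5):
    buffer.add_experience(Transition(t, 0, 1.0, t + 1, False, False))

print(len(buffer.buffer))     # 3: the two oldest transitions were evicted
batch = buffer.get_batch(2)   # 2 transitions sampled without replacement
```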
/utilities/tester.py:
--------------------------------------------------------------------------------
1 | import numpy as np
2 | import torch
3 | from utilities.util import *
4 | import time
5 | import signal
6 | import sys
7 |
8 |
9 |
10 | class PGTester(object):
11 |
12 | def __init__(self, env, behaviour_net, args):
13 | self.env = env
14 | self.behaviour_net = behaviour_net.cuda().eval() if args.cuda else behaviour_net.eval()
15 | self.args = args
16 | self.cuda_ = self.args.cuda and torch.cuda.is_available()
17 |
18 | def action_logits(self, state, schedule, last_action, last_hidden, info):
19 | return self.behaviour_net.policy(state, schedule=schedule, last_act=last_action, last_hid=last_hidden, info=info)
20 |
21 | def run_step(self, state, schedule, last_action, last_hidden, info={}):
22 | state = cuda_wrapper(prep_obs(state).contiguous().view(1, self.args.agent_num, self.args.obs_size), cuda=self.cuda_)
23 | if self.args.model_name in ['schednet']:
24 | weight = self.behaviour_net.weight_generator(state).detach()
25 | schedule, _ = self.behaviour_net.weight_based_scheduler(weight, exploration=False)
26 | action_out = self.action_logits(state, schedule, last_action, last_hidden, info)
27 | action = select_action(self.args, action_out, status='test')
28 | _, actual = translate_action(self.args, action, self.env)
29 | next_state, reward, done, debug = self.env.step(actual)
30 | success = debug['success'] if 'success' in debug else 0.0
31 | disp = 'The rewards of agents are:'
32 | for r in reward:
33 | disp += ' '+str(r)[:7]
34 | print (disp+'.')
35 | return next_state, action, done, reward, success
36 |
37 | def run_game(self, episodes, render):
38 | action = cuda_wrapper(torch.zeros((1, self.args.agent_num, self.args.action_dim)), cuda=self.cuda_)
39 | info = {}
40 | # set up a flag to control the exit of the program
41 | if render and self.env.name in ['traffic_junction','predator_prey']:
42 | signal.signal(signal.SIGINT, self.signal_handler)
43 | self.env.init_curses()
44 | if self.args.model_name in ['coma', 'ic3net']:
45 | self.behaviour_net.init_hidden(batch_size=1)
46 | last_hidden = self.behaviour_net.get_hidden()
47 | else:
48 | last_hidden = None
49 | if self.args.model_name in ['ic3net']:
50 | gate = self.behaviour_net.gate(last_hidden[:, :, :self.args.hid_size])
51 | schedule = self.behaviour_net.schedule(gate)
52 | else:
53 | schedule = None
54 | self.all_reward = []
55 | self.all_turn = []
56 | self.all_success = [] # special for traffic junction
57 | for ep in range(episodes):
58 | print ('The episode {} starts!'.format(ep))
59 | episode_reward = []
60 | episode_success = []
61 | state = self.env.reset()
62 | t = 0
63 | while True:
64 | if render:
65 | self.env.render()
66 | state, action, done, reward, success = self.run_step(state, schedule, action, last_hidden, info=info)
67 | if self.args.model_name in ['coma']:
68 | last_hidden = self.behaviour_net.get_hidden()
69 | episode_reward.append(np.mean(reward))
70 | episode_success.append(success)
71 | if render:
72 | time.sleep(0.01)
73 | if np.all(done) or t==self.args.max_steps-1:
74 | print ('The episode {} is finished!'.format(ep))
75 | self.all_reward.append(np.mean(episode_reward))
76 | self.all_success.append(np.mean(episode_success))
77 | self.all_turn.append(t+1)
78 | break
79 | t += 1
80 |
81 | def signal_handler(self, signal, frame):
82 | print('You pressed Ctrl+C! Exiting gracefully.')
83 | self.env.exit_render()
84 | sys.exit(0)
85 |
86 | def print_info(self):
87 | episodes = len(self.all_reward)
88 |         print("\n" + "="*10 + " RESULTS " + "="*10)
89 | print ('Episode: {:4d}'.format(episodes))
90 | print('Mean Reward: {:2.4f}/{:2.4f}'.format(np.mean(self.all_reward),np.std(self.all_reward)))
91 | print('Mean Success: {:2.4f}/{:2.4f}'.format(np.mean(self.all_success),np.std(self.all_success)))
92 | print('Mean Turn: {:2.4f}/{:2.4f}'.format(np.mean(self.all_turn),np.std(self.all_turn)))
93 |
--------------------------------------------------------------------------------
/utilities/trainer.py:
--------------------------------------------------------------------------------
1 | from collections import namedtuple
2 | import numpy as np
3 | import torch
4 | from torch import optim
5 | import torch.nn as nn
6 | from utilities.util import *
7 | from utilities.replay_buffer import *
8 | from utilities.inspector import *
9 | from arguments import *
10 | from utilities.logger import Logger
11 |
12 |
13 |
14 | class PGTrainer(object):
15 |
16 | def __init__(self, args, model, env, logger, online):
17 | self.args = args
18 | self.cuda_ = self.args.cuda and torch.cuda.is_available()
19 | self.logger = logger
20 | self.online = online
21 | inspector(self.args)
22 | if self.args.target:
23 | target_net = model(self.args).cuda() if self.cuda_ else model(self.args)
24 | self.behaviour_net = model(self.args, target_net).cuda() if self.cuda_ else model(self.args, target_net)
25 | else:
26 | self.behaviour_net = model(self.args).cuda() if self.cuda_ else model(self.args)
27 | if self.args.replay:
28 | if self.online:
29 | self.replay_buffer = TransReplayBuffer(int(self.args.replay_buffer_size))
30 | else:
31 | self.replay_buffer = EpisodeReplayBuffer(int(self.args.replay_buffer_size))
32 | self.env = env
33 | self.action_optimizers = []
34 | for action_dict in self.behaviour_net.action_dicts:
35 | self.action_optimizers.append(optim.Adam(action_dict.parameters(), lr=args.policy_lrate))
36 | self.value_optimizers = []
37 | for value_dict in self.behaviour_net.value_dicts:
38 | self.value_optimizers.append(optim.Adam(value_dict.parameters(), lr=args.value_lrate))
39 | self.init_action = cuda_wrapper( torch.zeros(1, self.args.agent_num, self.args.action_dim), cuda=self.cuda_ )
40 | self.steps = 0
41 | self.episodes = 0
42 | self.mean_reward = 0
43 | self.mean_success = 0
44 | self.entr = self.args.entr
45 | self.entr_inc = self.args.entr_inc
46 |
47 | def get_loss(self, batch):
48 | action_loss, value_loss, log_p_a = self.behaviour_net.get_loss(batch)
49 | return action_loss, value_loss, log_p_a
50 |
51 | def action_compute_grad(self, stat, loss, retain_graph):
52 | action_loss, log_p_a = loss
53 | if not self.args.continuous:
54 | if self.entr > 0:
55 | entropy = multinomial_entropy(log_p_a)
56 | action_loss -= self.entr * entropy
57 | stat['entropy'] = entropy.item()
58 | action_loss.backward(retain_graph=retain_graph)
59 |
60 | def value_compute_grad(self, value_loss, retain_graph):
61 | value_loss.backward(retain_graph=retain_graph)
62 |
63 | def grad_clip(self, params):
64 | for param in params:
65 | param.grad.data.clamp_(-1, 1)
66 |
67 | def action_replay_process(self, stat):
68 | batch = self.replay_buffer.get_batch(self.args.batch_size)
69 | batch = self.behaviour_net.Transition(*zip(*batch))
70 | self.action_transition_process(stat, batch)
71 |
72 | def value_replay_process(self, stat):
73 | batch = self.replay_buffer.get_batch(self.args.batch_size)
74 | batch = self.behaviour_net.Transition(*zip(*batch))
75 | self.value_transition_process(stat, batch)
76 |
77 | def action_transition_process(self, stat, trans):
78 | action_loss, value_loss, log_p_a = self.get_loss(trans)
79 | policy_grads = []
80 | for i in range(self.args.agent_num):
81 | retain_graph = False if i == self.args.agent_num-1 else True
82 | action_optimizer = self.action_optimizers[i]
83 | action_optimizer.zero_grad()
84 | self.action_compute_grad(stat, (action_loss[i], log_p_a[:, i, :]), retain_graph)
85 | grad = []
86 | for pp in action_optimizer.param_groups[0]['params']:
87 | grad.append(pp.grad.clone())
88 | policy_grads.append(grad)
89 | policy_grad_norms = []
90 | for action_optimizer, grad in zip(self.action_optimizers, policy_grads):
91 | param = action_optimizer.param_groups[0]['params']
92 | for i in range(len(param)):
93 | param[i].grad = grad[i]
94 | if self.args.grad_clip:
95 | self.grad_clip(param)
96 | policy_grad_norms.append(get_grad_norm(param))
97 | action_optimizer.step()
98 | stat['policy_grad_norm'] = np.array(policy_grad_norms).mean()
99 | stat['action_loss'] = action_loss.mean().item()
100 |
101 | def value_transition_process(self, stat, trans):
102 | action_loss, value_loss, log_p_a = self.get_loss(trans)
103 | value_grads = []
104 | for i in range(self.args.agent_num):
105 | retain_graph = False if i == self.args.agent_num-1 else True
106 | value_optimizer = self.value_optimizers[i]
107 | value_optimizer.zero_grad()
108 | self.value_compute_grad(value_loss[i], retain_graph)
109 | grad = []
110 | for pp in value_optimizer.param_groups[0]['params']:
111 | grad.append(pp.grad.clone())
112 | value_grads.append(grad)
113 | value_grad_norms = []
114 | for value_optimizer, grad in zip(self.value_optimizers, value_grads):
115 | param = value_optimizer.param_groups[0]['params']
116 | for i in range(len(param)):
117 | param[i].grad = grad[i]
118 | if self.args.grad_clip:
119 | self.grad_clip(param)
120 | value_grad_norms.append(get_grad_norm(param))
121 | value_optimizer.step()
122 | stat['value_grad_norm'] = np.array(value_grad_norms).mean()
123 | stat['value_loss'] = value_loss.mean().item()
124 |
125 | def run(self, stat):
126 | self.behaviour_net.train_process(stat, self)
127 | self.entr += self.entr_inc
128 |
129 | def logging(self, stat):
130 | for tag, value in stat.items():
131 | if isinstance(value, np.ndarray):
132 | self.logger.image_summary(tag, value, self.episodes)
133 | else:
134 | self.logger.scalar_summary(tag, value, self.episodes)
135 |
136 | def print_info(self, stat):
137 | action_loss = stat.get('action_loss', 0)
138 | value_loss = stat.get('value_loss', 0)
139 | entropy = stat.get('entropy', 0)
140 |         print ('Episode: {:4d}, Mean Reward: {:2.4f}, Action Loss: {:2.4f}, Value Loss: {:2.4f}, Entropy: {:2.4f}\n'\
141 | .format(self.episodes, stat['mean_reward'], action_loss+self.entr*entropy, value_loss, entropy))
142 |
--------------------------------------------------------------------------------
/utilities/util.py:
--------------------------------------------------------------------------------
1 | import numpy as np
2 | import torch
3 | from torch.distributions.one_hot_categorical import OneHotCategorical
4 | from torch.distributions.normal import Normal
5 |
6 |
7 |
8 | class GumbelSoftmax(OneHotCategorical):
9 |
10 | def __init__(self, logits, probs=None, temperature=0.1):
11 | super(GumbelSoftmax, self).__init__(logits=logits, probs=probs)
12 | self.eps = 1e-20
13 | self.temperature = temperature
14 |
15 | def sample_gumbel(self):
16 | U = self.logits.clone()
17 | U.uniform_(0, 1)
18 | return -torch.log( -torch.log( U + self.eps ) )
19 |
20 | def gumbel_softmax_sample(self):
21 | y = self.logits + self.sample_gumbel()
22 | return torch.softmax( y / self.temperature, dim=-1)
23 |
24 | def hard_gumbel_softmax_sample(self):
25 | y = self.gumbel_softmax_sample()
26 | return (torch.max(y, dim=-1, keepdim=True)[0] == y).float()
27 |
28 | def rsample(self):
29 | return self.gumbel_softmax_sample()
30 |
31 | def sample(self):
32 | return self.rsample().detach()
33 |
34 | def hard_sample(self):
35 | return self.hard_gumbel_softmax_sample()
36 |
37 |
38 |
39 | def normal_entropy(mean, std):
40 | return Normal(mean, std).entropy().sum()
41 |
42 | def multinomial_entropy(logits):
43 | assert logits.size(-1) > 1
44 | return GumbelSoftmax(logits=logits).entropy().sum()
45 |
46 | def normal_log_density(x, mean, std):
47 | return Normal(mean, std).log_prob(x)
48 |
49 | def multinomials_log_density(actions, logits):
50 | assert logits.size(-1) > 1
51 | return GumbelSoftmax(logits=logits).log_prob(actions)
52 |
53 | def select_action(args, logits, status='train', exploration=True, info={}):
54 | if args.continuous:
55 | act_mean = logits
56 | act_std = cuda_wrapper(torch.ones_like(act_mean), args.cuda)
57 |         if status == 'train':
58 | return Normal(act_mean, act_std).sample()
59 |         elif status == 'test':
60 | return act_mean
61 | else:
62 |         if status == 'train':
63 | if exploration:
64 | if args.epsilon_softmax:
65 | eps = info['softmax_eps']
66 | p_a = (1 - eps) * torch.softmax(logits, dim=-1) + eps / logits.size(-1)
67 | return OneHotCategorical(logits=None, probs=p_a).sample()
68 | elif args.gumbel_softmax:
69 | return GumbelSoftmax(logits=logits).sample()
70 | else:
71 | return OneHotCategorical(logits=logits).sample()
72 | else:
73 | if args.gumbel_softmax:
74 | temperature = 1.0
75 | return torch.softmax(logits/temperature, dim=-1)
76 | else:
77 | return OneHotCategorical(logits=logits).sample()
78 |         elif status == 'test':
79 | p_a = torch.softmax(logits, dim=-1)
80 | return (p_a == torch.max(p_a, dim=-1, keepdim=True)[0]).float()
81 |
82 | def translate_action(args, action, env):
83 | if not args.continuous:
84 | actual = [act.detach().squeeze().cpu().numpy() for act in torch.unbind(action, 1)]
85 | return action, actual
86 | else:
87 | actions = action.data[0].numpy()
88 | cp_actions = actions.copy()
89 | # clip and scale action to correct range
90 | for i in range(len(cp_actions)):
91 |             low = env.action_space[i].low
92 |             high = env.action_space[i].high
93 | cp_actions[i] = max(-1.0, min(cp_actions[i], 1.0))
94 | cp_actions[i] = 0.5 * (cp_actions[i] + 1.0) * (high - low) + low
95 | return actions, cp_actions
96 |
97 | def prep_obs(state=[]):
98 | state = np.array(state)
99 | if len(state.shape) == 2:
100 | state = np.stack(state, axis=0)
101 | elif len(state.shape) == 4:
102 | state = np.concatenate(state, axis=0)
103 | else:
104 | raise RuntimeError('The shape of the observation is incorrect.')
105 | return torch.tensor(state).float()
106 |
107 | def cuda_wrapper(tensor, cuda):
108 | if isinstance(tensor, torch.Tensor):
109 | return tensor.cuda() if cuda else tensor
110 | else:
111 |         raise RuntimeError('Please enter a PyTorch tensor; a {} was received instead.'.format(type(tensor)))
112 |
113 | def batchnorm(batch):
114 | if isinstance(batch, torch.Tensor):
115 | assert batch.size(-1) == 1
116 | return (batch - batch.mean()) / batch.std()
117 | else:
118 |         raise RuntimeError('Please enter a PyTorch tensor; a {} was received instead.'.format(type(batch)))
119 |
120 | def get_grad_norm(params):
121 | grad_norms = []
122 | for param in params:
123 | grad_norms.append(torch.norm(param.grad).item())
124 | return np.mean(grad_norms)
125 |
126 | def merge_dict(stat, key, value):
127 | if key in stat.keys():
128 | stat[key] += value
129 | else:
130 | stat[key] = value
131 |
132 | def unpack_data(args, batch):
133 | batch_size = len(batch.state)
134 | n = args.agent_num
135 | action_dim = args.action_dim
136 | cuda = torch.cuda.is_available() and args.cuda
137 | rewards = cuda_wrapper(torch.tensor(batch.reward, dtype=torch.float), cuda)
138 | last_step = cuda_wrapper(torch.tensor(batch.last_step, dtype=torch.float).contiguous().view(-1, 1), cuda)
139 | done = cuda_wrapper(torch.tensor(batch.done, dtype=torch.float).contiguous().view(-1, 1), cuda)
140 | actions = cuda_wrapper(torch.tensor(np.stack(list(zip(*batch.action))[0], axis=0), dtype=torch.float), cuda)
141 | last_actions = cuda_wrapper(torch.tensor(np.stack(list(zip(*batch.last_action))[0], axis=0), dtype=torch.float), cuda)
142 | state = cuda_wrapper(prep_obs(list(zip(batch.state))), cuda)
143 | next_state = cuda_wrapper(prep_obs(list(zip(batch.next_state))), cuda)
144 | return (rewards, last_step, done, actions, last_actions, state, next_state)
145 |
146 | def n_step(rewards, last_step, done, next_values, n_step, args):
147 | cuda = torch.cuda.is_available() and args.cuda
148 | returns = cuda_wrapper(torch.zeros_like(rewards), cuda=cuda)
149 | i = rewards.size(0)-1
150 | while i >= 0:
151 | if last_step[i]:
152 | next_return = 0 if done[i] else next_values[i].detach()
153 | for j in reversed(range(i-n_step+1, i+1)):
154 | returns[j] = rewards[j] + args.gamma * next_return
155 | next_return = returns[j]
156 | i -= n_step
157 | continue
158 | else:
159 | next_return = next_values[i+n_step-1].detach()
160 | for j in reversed(range(n_step)):
161 | g = rewards[i+j] + args.gamma * next_return
162 | next_return = g
163 | returns[i] = g.detach()
164 | i -= 1
165 | return returns
166 |
--------------------------------------------------------------------------------
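The `GumbelSoftmax` distribution defined in `utilities/util.py` is what `select_action` draws from when `args.gumbel_softmax` is set. A minimal sketch (not part of the repository) of the relaxed and hard samples it produces:

```python
import torch
from utilities.util import GumbelSoftmax

logits = torch.randn(1, 3, 5)    # (batch, agents, actions)
dist = GumbelSoftmax(logits=logits, temperature=0.1)

soft = dist.rsample()            # relaxed one-hot sample; each row sums to 1
hard = dist.hard_sample()        # exact one-hot taken at the argmax of a relaxed sample
print(soft.sum(dim=-1))          # ~1.0 per agent
print(hard.sum(dim=-1))          # exactly 1.0 per agent
```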