├── .gitignore
├── README.md
├── es.py
├── evaluation.py
├── linear.py
├── models
│   ├── test_BipedalWalker_v1.pkl
│   ├── test_BipedalWalker_v2.pkl
│   ├── test_BipedalWalker_v3.pkl
│   ├── test_BipedalWalker_v4.pkl
│   ├── test_BipedalWalker_v5.0.pkl
│   ├── test_BipedalWalker_v5.1.pkl
│   ├── test_BipedalWalker_v5.2.pkl
│   ├── test_BipedalWalker_v5.3.pkl
│   ├── test_BipedalWalker_v6.0.pkl
│   ├── test_BipedalWalker_v6.1.pkl
│   ├── test_BipedalWalker_v6.2.pkl
│   ├── test_CartPole_v1.pkl
│   ├── test_LunarLanderCont_v1.pkl
│   ├── test_LunarLander_v3.pkl
│   └── test_MountainCarCont_v1.pkl
├── plot.py
├── plots
│   ├── algo_code.png
│   ├── gifs
│   │   ├── best_bipedal_walker.gif
│   │   ├── best_lunar.gif
│   │   ├── best_lunar_cont.gif
│   │   ├── best_mountain_car_cont.gif
│   │   └── best_pole.gif
│   ├── test_BipedalWalker_v1.png
│   ├── test_BipedalWalker_v2.png
│   ├── test_BipedalWalker_v3.png
│   ├── test_BipedalWalker_v4.png
│   ├── test_BipedalWalker_v5.0.png
│   ├── test_BipedalWalker_v5.1.png
│   ├── test_BipedalWalker_v5.2.png
│   ├── test_BipedalWalker_v5.3.png
│   ├── test_BipedalWalker_v6.0.png
│   ├── test_BipedalWalker_v6.1.png
│   ├── test_BipedalWalker_v6.2.png
│   ├── test_CartPole_v1.png
│   ├── test_LunarLanderCont_v1.png
│   ├── test_LunarLander_v1.png
│   ├── test_LunarLander_v3.png
│   └── test_MountainCarCont_v1.png
├── tests
│   ├── bipedal_walker.py
│   ├── cart_pole.py
│   ├── lunar_lander.py
│   ├── lunar_lander_cont.py
│   └── mountain_car_cont.py
└── training.py
/.gitignore:
--------------------------------------------------------------------------------
1 | .vscode
2 | __pycache__
3 | .DS_Store
4 | tmp
5 | videos
6 | logs
7 | utils.py
8 | papers
9 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Evolution Strategies OpenAI
2 |
3 | This implementation is strictly for educational purposes and is not distributed (unlike in the paper), but it works.
4 |
5 | ## Example
6 |
7 | ```python
8 | from training import run_experiment, render_policy
9 |
10 | example_config = {
11 | "experiment_name": "test_BipedalWalker_v0",
12 | "plot_path": "plots/",
13 | "model_path": "models/", # optional
14 | "log_path": "logs/", # optional
15 | "init_model": "models/test_BipedalWalker_v5.0.pkl", # optional
16 | "env": "BipedalWalker-v3",
17 | "n_sessions": 128,
18 | "env_steps": 1600,
19 | "population_size": 256,
20 | "learning_rate": 0.06,
21 | "noise_std": 0.1,
22 | "noise_decay": 0.99, # optional
23 | "lr_decay": 1.0, # optional
24 | "decay_step": 20, # optional
25 | "eval_step": 10,
26 | "hidden_sizes": (40, 40)
27 | }
28 |
29 | policy = run_experiment(example_config, n_jobs=4, verbose=True)
30 |
31 | # to render the policy's performance
32 | render_policy("models/test_BipedalWalker_v6.1.pkl", "BipedalWalker-v3", n_videos=10)
33 | ```
34 |
35 | ## Implemented
36 |
37 | - [x] OpenAI ES algorithm [Algorithm 1] (update rule sketched below).
38 | - [x] Z-normalization fitness shaping (not rank-based).
39 | - [x] Parallelization with joblib.
40 | - [x] Training for 6 OpenAI gym envs (3 solved).
41 | - [x] Simple three-layer net as a policy example.
42 | - [x] [Learning rate & noise std decay.](https://towardsdatascience.com/learning-rate-schedules-and-adaptive-learning-rate-methods-for-deep-learning-2c8f433990d1)
43 |
44 | 
45 |
46 | ## Experiments
47 |
48 | ### CartPole
49 |
50 | Solved quickly and easily, especially if the population size is increased. However, the learning rate needs care: it is better to keep it small, and the same goes for the noise std, since this task requires almost no exploration; it is enough to collect plenty of reward feedback for the natural gradient estimate. An example config is shown below.
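
As a concrete starting point, these are the hyperparameters used in `tests/cart_pole.py`; treat them as one setting that worked here rather than tuned optima:

```python
# Config from tests/cart_pole.py: note the small learning rate and noise std.
cartpole_config = {
    "experiment_name": "test_CartPole_v2",
    "plot_path": "plots/",
    "model_path": "models/",
    "env": "CartPole-v0",
    "n_sessions": 64,
    "env_steps": 200,
    "population_size": 256,
    "learning_rate": 0.01,  # keep it small
    "noise_std": 0.05,      # little exploration is needed here
    "hidden_sizes": (64, 64)
}
```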
51 |
52 |
53 |
54 |
55 |
56 |
57 | ### LunarLander
58 |
59 | As in the previous task, the algorithm does well here. It is again important to set a small learning rate, but to slightly increase the noise std.
60 |
61 |
62 |
63 |
64 |
65 |
66 | ### LunarLanderContinuous
67 |
68 | The continuous env is solved much faster and better, probably thanks to the denser reward. It is also interesting that here the agent has learned to land faster: it does not fire the engines immediately, but only right before landing.
69 |
70 |
71 |
72 |
73 |
74 |
75 | ### MountainCarContinuous
76 |
77 | Not solved yet.
78 |
79 | In the discrete version of the env, the main problem is the sparse reward, which is only given at the very end, if the car climbs the hill. Since a randomly initialized agent does not manage to do so within the 200-step limit, the natural gradient estimate turns out to be zero and training gets stuck. Solution: remove the 200-step limit and wait for the random agent to climb the mountain on its own, collecting the first reward :). However, this is not quite fair. A sketch of this workaround is shown below.
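
A minimal sketch of lifting the step limit, assuming the old `gym` API used in this repo (the `_max_episode_steps` attribute belongs to gym's `TimeLimit` wrapper and may differ in newer versions):

```python
# Sketch: lift the 200-step TimeLimit cap so a random agent can eventually reach the flag.
import gym

env = gym.make("MountainCar-v0")
env._max_episode_steps = 10_000  # or unwrap with env.env to drop the TimeLimit wrapper

# eval_policy from evaluation.py can then run longer rollouts:
# total_reward = eval_policy(policy, env, n_steps=10_000)
```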
80 |
81 | In the continuous env, the main problem is the lack of exploration. The agent quickly (sooner than it would climb the hill) realizes that the best strategy is to stand still and get a reward of 0, which is much higher than the reward it gets while moving.
82 |
83 |
84 |
85 |
86 |
87 |
88 | Possible solution: novelty search. As a novelty function one could take the velocity, velocity * x_coord, or x_coord at the end of the episode. [Reward shaping](https://people.eecs.berkeley.edu/~pabbeel/cs287-fa09/readings/NgHaradaRussell-shaping-ICML1999.pdf) may improve convergence for DQN/CEM methods, but here it does not produce better results. A sketch of a possible novelty score is shown below.
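
A hypothetical sketch of such a novelty score (note that `OpenAIES_NSR` in `es.py` is still an empty TODO, so none of this exists in the repo yet):

```python
# Hypothetical novelty score: mean distance from an episode's final x_coord
# to its k nearest neighbours in an archive of previously seen behaviors.
import numpy as np

def novelty_score(behavior, archive, k=10):
    if not archive:
        return 0.0
    dists = np.sort(np.abs(np.asarray(archive) - behavior))
    return float(dists[:k].mean())

# archive.append(final_x_coord) after each evaluated episode, then mix novelty
# into the fitness, e.g. reward + w * novelty_score(final_x_coord, archive)
```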
89 |
90 | ### BipedalWalker
91 |
92 | Not solved yet. More iterations are needed.
93 |
94 |
95 |
96 |
97 |
98 |
99 |
105 |
107 |
108 |
109 |
110 |
115 | ## References
116 |
117 | [Evolution Strategies as a Scalable Alternative to Reinforcement Learning](https://arxiv.org/abs/1703.03864) (Tim Salimans, Jonathan Ho, Xi Chen, Ilya Sutskever)
118 |
--------------------------------------------------------------------------------
/es.py:
--------------------------------------------------------------------------------
1 | import numpy as np
2 |
3 | from copy import deepcopy
4 |
5 |
6 | class OpenAiES:
7 | def __init__(self, model, learning_rate, noise_std, \
8 | noise_decay=1.0, lr_decay=1.0, decay_step=50, norm_rewards=True):
9 | self.model = model
10 |
11 | self._lr = learning_rate
12 | self._noise_std = noise_std
13 |
14 | self.noise_decay = noise_decay
15 | self.lr_decay = lr_decay
16 | self.decay_step = decay_step
17 |
18 | self.norm_rewards = norm_rewards
19 |
20 | self._population = None
21 | self._count = 0
22 |
23 | @property
24 | def noise_std(self):
25 | step_decay = np.power(self.noise_decay, np.floor((1 + self._count) / self.decay_step))
26 |
27 | return self._noise_std * step_decay
28 |
29 | @property
30 | def lr(self):
31 | step_decay = np.power(self.lr_decay, np.floor((1 + self._count) / self.decay_step))
32 |
33 | return self._lr * step_decay
34 |
35 | def generate_population(self, npop=50):
36 | self._population = []
37 |
38 | for i in range(npop):
39 | new_model = deepcopy(self.model)
40 | new_model.E = []
41 |
42 | for i, layer in enumerate(new_model.W):
43 | noise = np.random.randn(layer.shape[0], layer.shape[1])
44 |
45 | new_model.E.append(noise)
46 | new_model.W[i] = new_model.W[i] + self.noise_std * noise
47 | self._population.append(new_model)
48 |
49 | return self._population
50 |
51 | def update_population(self, rewards):
52 | if self._population is None:
53 |             raise ValueError("population is None, generate & evaluate it first")
54 |
55 | # z-normalization (?) - works better, but slower
56 | if self.norm_rewards:
57 | rewards = (rewards - rewards.mean()) / (rewards.std() + 1e-5)
58 |
59 | for i, layer in enumerate(self.model.W):
60 | w_updates = np.zeros_like(layer)
61 |
62 | for j, model in enumerate(self._population):
63 | w_updates = w_updates + (model.E[i] * rewards[j])
64 |
65 | # SGD weights update
66 | self.model.W[i] = self.model.W[i] + (self.lr / (len(rewards) * self.noise_std)) * w_updates
67 |
68 | self._count = self._count + 1
69 |
70 | def get_model(self):
71 | return self.model
72 |
73 |
74 | class OpenAIES_NSR:
75 |     # TODO: novelty search
76 | def __init__(self):
77 | pass
--------------------------------------------------------------------------------
/evaluation.py:
--------------------------------------------------------------------------------
1 | import numpy as np
2 |
3 | from joblib import delayed
4 |
5 | CONTINUOUS_ENVS = ('LunarLanderContinuous', "MountainCarContinuous", "BipedalWalker")
6 |
7 | def eval_policy(policy, env, n_steps=200):
8 | try:
9 | env_name = env.spec._env_name
10 | except AttributeError:
11 | env_name = env._env_name
12 |
13 | total_reward = 0
14 |
15 | obs = env.reset()
16 | for i in range(n_steps):
17 | if env_name in CONTINUOUS_ENVS:
18 | action = policy.predict(np.array(obs).reshape(1, -1), scale="tanh")
19 | else:
20 | action = policy.predict(np.array(obs).reshape(1, -1), scale="softmax")
21 |
22 | new_obs, reward, done, _ = env.step(action)
23 |
24 | total_reward = total_reward + reward
25 | obs = new_obs
26 |
27 | if done:
28 | break
29 |
30 | return total_reward
31 |
32 |
33 | # for parallel
34 | eval_policy_delayed = delayed(eval_policy)
--------------------------------------------------------------------------------
/linear.py:
--------------------------------------------------------------------------------
1 | import pickle
2 |
3 | import numpy as np
4 |
5 |
6 | def ReLU(x):
7 | return np.maximum(0, x)
8 |
9 |
10 | def softmax(x):
11 | x_exp = np.exp(x - np.max(x))
12 | return x_exp / x_exp.sum()
13 |
14 |
15 | def tanh(x):
16 | return np.tanh(x)
17 |
18 |
19 | class ThreeLayerNetwork:
20 | def __init__(self, in_features, out_features, hidden_sizes=(32, 32)):
21 | self.in_features = in_features
22 | self.out_features = out_features
23 | self.hidden_sizes = hidden_sizes
24 |
25 | self.W = self._init_layers()
26 |
27 | # TODO: init weights from model -> load_model(self, path)
28 |
29 | def _init_layers(self):
30 | layer1_dim, layer2_dim = self.hidden_sizes
31 |
32 | # +1 to dims for bias trick & He weight init
33 | W1 = np.random.randn(self.in_features + 1, layer1_dim + 1) * np.sqrt(2 / (self.in_features + 1))
34 | W2 = np.random.randn(layer1_dim + 1, layer2_dim + 1) * np.sqrt(2 / (layer1_dim + 1))
35 | W3 = np.random.randn(layer2_dim + 1, self.out_features) * np.sqrt(2 / (layer2_dim + 1))
36 |
37 | return [W1, W2, W3]
38 |
39 | @staticmethod
40 | def from_model(path):
41 | with open(path, "rb") as file:
42 | model = pickle.load(file)
43 |
44 | assert isinstance(model, ThreeLayerNetwork), "init model is not instance of ThreeLayerNetwork class"
45 |
46 | return model
47 |
48 | def forward(self, X):
49 | bias = np.ones((X.shape[0], 1))
50 | X_bias = np.hstack((X, bias))
51 |
52 | output = ReLU(ReLU(X_bias @ self.W[0]) @ self.W[1]) @ self.W[2]
53 |
54 | return output
55 |
56 | def predict(self, X, scale="softmax"):
57 | X_norm = (X - X.mean()) / (X.std() + 1e-5)
58 |
59 | raw_output = self.forward(X_norm)
60 |
61 | if scale == "tanh":
62 | return tanh(raw_output)[0]
63 | elif scale == "softmax":
64 | prob = softmax(raw_output)[0]
65 | # TODO: action choice more about agent than model
66 | return np.random.choice(self.out_features, p=prob)
67 |
68 | return raw_output[0]
69 |
70 |
71 | if __name__ == "__main__":
72 | model = ThreeLayerNetwork(4, 4)
73 | data = np.random.randn(1, 4)
74 |
75 | prediction = model.predict(data, scale="tanh")
76 |
77 | print(prediction)
78 |
79 |
--------------------------------------------------------------------------------
/models/test_BipedalWalker_v1.pkl:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Howuhh/evolution_strategies_openai/8e9c369b5df94a4afeb6773f686fca1298a69285/models/test_BipedalWalker_v1.pkl
--------------------------------------------------------------------------------
/models/test_BipedalWalker_v2.pkl:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Howuhh/evolution_strategies_openai/8e9c369b5df94a4afeb6773f686fca1298a69285/models/test_BipedalWalker_v2.pkl
--------------------------------------------------------------------------------
/models/test_BipedalWalker_v3.pkl:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Howuhh/evolution_strategies_openai/8e9c369b5df94a4afeb6773f686fca1298a69285/models/test_BipedalWalker_v3.pkl
--------------------------------------------------------------------------------
/models/test_BipedalWalker_v4.pkl:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Howuhh/evolution_strategies_openai/8e9c369b5df94a4afeb6773f686fca1298a69285/models/test_BipedalWalker_v4.pkl
--------------------------------------------------------------------------------
/models/test_BipedalWalker_v5.0.pkl:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Howuhh/evolution_strategies_openai/8e9c369b5df94a4afeb6773f686fca1298a69285/models/test_BipedalWalker_v5.0.pkl
--------------------------------------------------------------------------------
/models/test_BipedalWalker_v5.1.pkl:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Howuhh/evolution_strategies_openai/8e9c369b5df94a4afeb6773f686fca1298a69285/models/test_BipedalWalker_v5.1.pkl
--------------------------------------------------------------------------------
/models/test_BipedalWalker_v5.2.pkl:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Howuhh/evolution_strategies_openai/8e9c369b5df94a4afeb6773f686fca1298a69285/models/test_BipedalWalker_v5.2.pkl
--------------------------------------------------------------------------------
/models/test_BipedalWalker_v5.3.pkl:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Howuhh/evolution_strategies_openai/8e9c369b5df94a4afeb6773f686fca1298a69285/models/test_BipedalWalker_v5.3.pkl
--------------------------------------------------------------------------------
/models/test_BipedalWalker_v6.0.pkl:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Howuhh/evolution_strategies_openai/8e9c369b5df94a4afeb6773f686fca1298a69285/models/test_BipedalWalker_v6.0.pkl
--------------------------------------------------------------------------------
/models/test_BipedalWalker_v6.1.pkl:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Howuhh/evolution_strategies_openai/8e9c369b5df94a4afeb6773f686fca1298a69285/models/test_BipedalWalker_v6.1.pkl
--------------------------------------------------------------------------------
/models/test_BipedalWalker_v6.2.pkl:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Howuhh/evolution_strategies_openai/8e9c369b5df94a4afeb6773f686fca1298a69285/models/test_BipedalWalker_v6.2.pkl
--------------------------------------------------------------------------------
/models/test_CartPole_v1.pkl:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Howuhh/evolution_strategies_openai/8e9c369b5df94a4afeb6773f686fca1298a69285/models/test_CartPole_v1.pkl
--------------------------------------------------------------------------------
/models/test_LunarLanderCont_v1.pkl:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Howuhh/evolution_strategies_openai/8e9c369b5df94a4afeb6773f686fca1298a69285/models/test_LunarLanderCont_v1.pkl
--------------------------------------------------------------------------------
/models/test_LunarLander_v3.pkl:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Howuhh/evolution_strategies_openai/8e9c369b5df94a4afeb6773f686fca1298a69285/models/test_LunarLander_v3.pkl
--------------------------------------------------------------------------------
/models/test_MountainCarCont_v1.pkl:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Howuhh/evolution_strategies_openai/8e9c369b5df94a4afeb6773f686fca1298a69285/models/test_MountainCarCont_v1.pkl
--------------------------------------------------------------------------------
/plot.py:
--------------------------------------------------------------------------------
1 | import numpy as np
2 | import matplotlib.pyplot as plt
3 |
4 |
5 | def plot_rewards(mean_rewards, std_rewards, config):
6 | best_mean = np.array(mean_rewards)
7 | best_std = np.array(std_rewards)
8 |
9 | stats = (
10 | f"""
11 | n_sessions: {config["n_sessions"]}
12 | population_size: {config["population_size"]}
13 | lr: {config["learning_rate"]}
14 | noise_std: {config["noise_std"]}
15 | env_steps: {config["env_steps"]}
16 | """
17 | ) # TODO: add hidden size info on plot
18 |
19 |     fig, ax = plt.subplots(figsize=(12, 8))
20 |     fig.subplots_adjust(top=0.7)  # leave room above the axes for the stats text
21 | plt.text(0.35, 1.25, stats, transform=ax.transAxes)
22 | plt.title(f"{config['env']}: {config['experiment_name']}")
23 | plt.plot(np.arange(best_mean.shape[0]), best_mean)
24 | plt.fill_between(np.arange(best_mean.shape[0]), best_mean + best_std, best_mean - best_std, alpha=0.5)
25 | plt.xlabel(f"weights updates (mod {config.get('eval_step', '2')})")
26 | plt.ylabel("reward")
27 | plt.savefig(f"{config['plot_path']}{config['experiment_name']}.png")
--------------------------------------------------------------------------------
/plots/algo_code.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Howuhh/evolution_strategies_openai/8e9c369b5df94a4afeb6773f686fca1298a69285/plots/algo_code.png
--------------------------------------------------------------------------------
/plots/gifs/best_bipedal_walker.gif:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Howuhh/evolution_strategies_openai/8e9c369b5df94a4afeb6773f686fca1298a69285/plots/gifs/best_bipedal_walker.gif
--------------------------------------------------------------------------------
/plots/gifs/best_lunar.gif:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Howuhh/evolution_strategies_openai/8e9c369b5df94a4afeb6773f686fca1298a69285/plots/gifs/best_lunar.gif
--------------------------------------------------------------------------------
/plots/gifs/best_lunar_cont.gif:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Howuhh/evolution_strategies_openai/8e9c369b5df94a4afeb6773f686fca1298a69285/plots/gifs/best_lunar_cont.gif
--------------------------------------------------------------------------------
/plots/gifs/best_mountain_car_cont.gif:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Howuhh/evolution_strategies_openai/8e9c369b5df94a4afeb6773f686fca1298a69285/plots/gifs/best_mountain_car_cont.gif
--------------------------------------------------------------------------------
/plots/gifs/best_pole.gif:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Howuhh/evolution_strategies_openai/8e9c369b5df94a4afeb6773f686fca1298a69285/plots/gifs/best_pole.gif
--------------------------------------------------------------------------------
/plots/test_BipedalWalker_v1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Howuhh/evolution_strategies_openai/8e9c369b5df94a4afeb6773f686fca1298a69285/plots/test_BipedalWalker_v1.png
--------------------------------------------------------------------------------
/plots/test_BipedalWalker_v2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Howuhh/evolution_strategies_openai/8e9c369b5df94a4afeb6773f686fca1298a69285/plots/test_BipedalWalker_v2.png
--------------------------------------------------------------------------------
/plots/test_BipedalWalker_v3.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Howuhh/evolution_strategies_openai/8e9c369b5df94a4afeb6773f686fca1298a69285/plots/test_BipedalWalker_v3.png
--------------------------------------------------------------------------------
/plots/test_BipedalWalker_v4.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Howuhh/evolution_strategies_openai/8e9c369b5df94a4afeb6773f686fca1298a69285/plots/test_BipedalWalker_v4.png
--------------------------------------------------------------------------------
/plots/test_BipedalWalker_v5.0.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Howuhh/evolution_strategies_openai/8e9c369b5df94a4afeb6773f686fca1298a69285/plots/test_BipedalWalker_v5.0.png
--------------------------------------------------------------------------------
/plots/test_BipedalWalker_v5.1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Howuhh/evolution_strategies_openai/8e9c369b5df94a4afeb6773f686fca1298a69285/plots/test_BipedalWalker_v5.1.png
--------------------------------------------------------------------------------
/plots/test_BipedalWalker_v5.2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Howuhh/evolution_strategies_openai/8e9c369b5df94a4afeb6773f686fca1298a69285/plots/test_BipedalWalker_v5.2.png
--------------------------------------------------------------------------------
/plots/test_BipedalWalker_v5.3.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Howuhh/evolution_strategies_openai/8e9c369b5df94a4afeb6773f686fca1298a69285/plots/test_BipedalWalker_v5.3.png
--------------------------------------------------------------------------------
/plots/test_BipedalWalker_v6.0.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Howuhh/evolution_strategies_openai/8e9c369b5df94a4afeb6773f686fca1298a69285/plots/test_BipedalWalker_v6.0.png
--------------------------------------------------------------------------------
/plots/test_BipedalWalker_v6.1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Howuhh/evolution_strategies_openai/8e9c369b5df94a4afeb6773f686fca1298a69285/plots/test_BipedalWalker_v6.1.png
--------------------------------------------------------------------------------
/plots/test_BipedalWalker_v6.2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Howuhh/evolution_strategies_openai/8e9c369b5df94a4afeb6773f686fca1298a69285/plots/test_BipedalWalker_v6.2.png
--------------------------------------------------------------------------------
/plots/test_CartPole_v1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Howuhh/evolution_strategies_openai/8e9c369b5df94a4afeb6773f686fca1298a69285/plots/test_CartPole_v1.png
--------------------------------------------------------------------------------
/plots/test_LunarLanderCont_v1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Howuhh/evolution_strategies_openai/8e9c369b5df94a4afeb6773f686fca1298a69285/plots/test_LunarLanderCont_v1.png
--------------------------------------------------------------------------------
/plots/test_LunarLander_v1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Howuhh/evolution_strategies_openai/8e9c369b5df94a4afeb6773f686fca1298a69285/plots/test_LunarLander_v1.png
--------------------------------------------------------------------------------
/plots/test_LunarLander_v3.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Howuhh/evolution_strategies_openai/8e9c369b5df94a4afeb6773f686fca1298a69285/plots/test_LunarLander_v3.png
--------------------------------------------------------------------------------
/plots/test_MountainCarCont_v1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Howuhh/evolution_strategies_openai/8e9c369b5df94a4afeb6773f686fca1298a69285/plots/test_MountainCarCont_v1.png
--------------------------------------------------------------------------------
/tests/bipedal_walker.py:
--------------------------------------------------------------------------------
1 | import sys
2 | import pickle
3 |
4 | sys.path.append('..')
5 |
6 | from training import run_experiment
7 |
8 | # the task is considered solved when the average score is 300+ over 100 consecutive random trials.
9 | def test():
10 | test_config = {
11 | "experiment_name": "test_BipedalWalker_v6.2",
12 | "plot_path": "../plots/",
13 | "model_path": "../models/",
14 | "log_path": "../logs/",
15 | "init_model": "../models/test_BipedalWalker_v6.1.pkl",
16 | "env": "BipedalWalker-v3",
17 | "n_sessions": 250,
18 | "env_steps": 1300,
19 | "population_size": 128,
20 | "learning_rate": 0.065,
21 | "noise_std": 0.07783,
22 | "noise_decay": 0.995,
23 | "decay_step": 20,
24 | "eval_step": 10,
25 | "hidden_sizes": (64, 40) # sizes from https://designrl.github.io/
26 | }
27 |
28 | policy = run_experiment(test_config, n_jobs=4)
29 |
30 |
31 | if __name__ == "__main__":
32 | test()
--------------------------------------------------------------------------------
/tests/cart_pole.py:
--------------------------------------------------------------------------------
1 | import sys
2 | import pickle
3 |
4 | sys.path.append('..')
5 | from training import run_experiment
6 |
7 |
8 | def test():
9 | test_config = {
10 | "experiment_name": "test_CartPole_v2",
11 | "plot_path": "../plots/",
12 | "model_path": "../models/",
13 | "env": "CartPole-v0",
14 | "n_sessions": 64,
15 | "env_steps": 200,
16 | "population_size": 256,
17 | "learning_rate": 0.01,
18 | "noise_std": 0.05,
19 | "hidden_sizes": (64, 64)
20 | }
21 | policy = run_experiment(test_config)
22 |
23 | # TODO: not easy, need a change
24 | with open(f"{test_config['model_path']}{test_config['experiment_name']}.pkl", "wb") as file:
25 | pickle.dump(policy, file)
26 |
27 |
28 | if __name__ == "__main__":
29 | test()
--------------------------------------------------------------------------------
/tests/lunar_lander.py:
--------------------------------------------------------------------------------
1 | import sys
2 | import pickle
3 |
4 | sys.path.append('..')
5 |
6 | from training import run_experiment
7 |
8 | # TODO: next is LunarLanderContinuous-v2
9 | def test():
10 | test_config = {
11 | "experiment_name": "test_LunarLander_v4",
12 | "plot_path": "../plots/",
13 | "model_path": "../models/",
14 | "env": "LunarLander-v2",
15 | "n_sessions": 512,
16 | "env_steps": 500,
17 | "population_size": 256,
18 | "learning_rate": 0.01,
19 | "noise_std": 0.075,
20 | "hidden_sizes": (64, 64)
21 | }
22 |
23 | policy = run_experiment(test_config)
24 |
25 | with open(f"{test_config['model_path']}{test_config['experiment_name']}.pkl", "wb") as file:
26 | pickle.dump(policy, file)
27 |
28 |
29 | if __name__ == "__main__":
30 | test()
--------------------------------------------------------------------------------
/tests/lunar_lander_cont.py:
--------------------------------------------------------------------------------
1 | import sys
2 | import pickle
3 |
4 | sys.path.append('..')
5 | from training import run_experiment
6 |
7 |
8 | # TODO: parallel & continuous
9 | def test():
10 | test_config = {
11 | "experiment_name": "test_LunarLanderCont_v2",
12 | "plot_path": "../plots/",
13 | "model_path": "../models/",
14 | "env": "LunarLanderContinuous-v2",
15 | "n_sessions": 512,
16 | "env_steps": 500,
17 | "population_size": 256,
18 | "learning_rate": 0.01,
19 | "noise_std": 0.075,
20 | "hidden_sizes": (64, 64)
21 | }
22 | policy = run_experiment(test_config)
23 |
24 | with open(f"{test_config['model_path']}{test_config['experiment_name']}.pkl", "wb") as file:
25 | pickle.dump(policy, file)
26 |
27 |
28 | if __name__ == "__main__":
29 | test()
30 |
--------------------------------------------------------------------------------
/tests/mountain_car_cont.py:
--------------------------------------------------------------------------------
1 | import sys
2 | import pickle
3 |
4 | sys.path.append('..')
5 |
6 | from training import run_experiment
7 |
8 | # MountainCar-v0 defines "solving" as getting average reward of -110.0 over 100 consecutive trials.
9 | # TODO: wait for novelty search
10 | def test():
11 | test_config = {
12 | "experiment_name": "test_MountainCarCont_v2",
13 | "plot_path": "../plots/",
14 | "model_path": "../models/",
15 | "env": "MountainCarContinuous-v0",
16 | "n_sessions": 128,
17 | "env_steps": 200,
18 | "population_size": 256,
19 | "learning_rate": 0.1,
20 | "noise_std": 0.5,
21 | "hidden_sizes": (32, 32)
22 | }
23 |
24 | policy = run_experiment(test_config, n_jobs=4)
25 |
26 | with open(f"{test_config['model_path']}{test_config['experiment_name']}.pkl", "wb") as file:
27 | pickle.dump(policy, file)
28 |
29 |
30 | if __name__ == "__main__":
31 | test()
32 |
--------------------------------------------------------------------------------
/training.py:
--------------------------------------------------------------------------------
1 | import gym
2 | import pickle
3 | import uuid
4 |
5 | import numpy as np
6 |
7 | from tqdm import tqdm
8 | from joblib import Parallel
9 | from collections import defaultdict
10 |
11 | from gym import wrappers
12 |
13 | from linear import ThreeLayerNetwork
14 | from es import OpenAiES
15 | from plot import plot_rewards
16 | from evaluation import eval_policy_delayed, eval_policy
17 |
18 | # env: (n_states, n_actions)
19 | ENV_INFO = {
20 | "CartPole-v0": (4, 2),
21 | "LunarLander-v2": (8, 4),
22 | "LunarLanderContinuous-v2": (8, 2),
23 | "MountainCar-v0": (2, 3),
24 | "MountainCarContinuous-v0": (2, 1),
25 | "CarRacing-v0": (96*96*3, 3), # TODO: wrap env to prep pixels & discrete actions
26 | "BipedalWalker-v3": (24, 4)
27 | }
28 |
29 |
30 | def train_loop(policy, env, config, n_jobs=1, verbose=True):
31 | es = OpenAiES(
32 | model=policy,
33 | learning_rate=config["learning_rate"],
34 | noise_std=config["noise_std"],
35 | noise_decay=config.get("noise_decay", 1.0),
36 | lr_decay=config.get("lr_decay", 1.0),
37 | decay_step=config.get("decay_step", 50)
38 | )
39 |
40 | log = defaultdict(list)
41 | for session in tqdm(range(config["n_sessions"])):
42 | population = es.generate_population(config["population_size"])
43 |
44 | rewards_jobs = (eval_policy_delayed(new_policy, env, config["env_steps"]) for new_policy in population)
45 | rewards = np.array(Parallel(n_jobs=n_jobs)(rewards_jobs))
46 |
47 | es.update_population(rewards)
48 |
49 | # populations stats
50 | log["pop_mean_rewards"].append(np.mean(rewards))
51 | log["pop_std_rewards"].append(np.std(rewards))
52 |
53 | # best policy stats
54 | if session % config.get("eval_step", 2) == 0:
55 | best_policy = es.get_model()
56 |
57 | best_rewards = np.zeros(10)
58 | for i in range(10):
59 | best_rewards[i] = eval_policy(best_policy, env, config["env_steps"])
60 |
61 | if verbose:
62 | # TODO: add timestamp
63 | print(f"Session: {session}")
64 | print(f"Mean reward: {round(np.mean(rewards), 4)}", f"std: {round(np.std(rewards), 3)}")
65 | print(f"lr: {round(es.lr, 5)}, noise_std: {round(es.noise_std, 5)}")
66 |
67 | log["best_mean_rewards"].append(np.mean(best_rewards))
68 | log["best_std_rewards"].append(np.std(best_rewards))
69 |
70 | return log
71 |
72 |
73 | def run_experiment(config, n_jobs=4, verbose=True):
74 | env = gym.make(config["env"])
75 | env._env_name = env.spec._env_name
76 |
77 | n_states, n_actions = ENV_INFO[config["env"]]
78 |
79 | if config.get("init_model", None):
80 | policy = ThreeLayerNetwork.from_model(config["init_model"])
81 |
82 | assert policy.in_features == n_states, "not correct policy input dims"
83 | assert policy.out_features == n_actions, "not correct policy output dims"
84 | else:
85 | policy = ThreeLayerNetwork(
86 | in_features=n_states,
87 | out_features=n_actions,
88 | hidden_sizes=config["hidden_sizes"]
89 | )
90 | # TODO: save model on KeyboardInterrupt exception
91 | log = train_loop(policy, env, config, n_jobs, verbose)
92 |
93 | if config.get("log_path", None):
94 | with open(f"{config['log_path']}{config['experiment_name']}.pkl", "wb") as file:
95 | pickle.dump(log, file)
96 |
97 | if config.get("model_path", None):
98 | with open(f"{config['model_path']}{config['experiment_name']}.pkl", "wb") as file:
99 | pickle.dump(policy, file)
100 |
101 | plot_rewards(log["best_mean_rewards"], log["best_std_rewards"], config)
102 |
103 | return policy
104 |
105 |
106 | def render_policy(model_path, env_name, n_videos=1):
107 | with open(model_path, "rb") as file:
108 | policy = pickle.load(file)
109 |
110 | model_name = model_path.split("/")[-1].split(".")[0]
111 |
112 | for i in range(n_videos):
113 | env = gym.make(env_name)
114 | env = wrappers.Monitor(env, f'videos/{model_name}/' + str(uuid.uuid4()), force=True)
115 |
116 | print(eval_policy(policy, env, n_steps=1600))
117 | env.close()
118 |
119 |
120 | if __name__ == "__main__":
121 | # TODO: analyse population stat from logs
122 | render_policy("models/test_BipedalWalker_v6.1.pkl", "BipedalWalker-v3")
123 |
124 |
--------------------------------------------------------------------------------