├── .gitignore ├── README.md ├── es.py ├── evaluation.py ├── linear.py ├── models ├── test_BipedalWalker_v1.pkl ├── test_BipedalWalker_v2.pkl ├── test_BipedalWalker_v3.pkl ├── test_BipedalWalker_v4.pkl ├── test_BipedalWalker_v5.0.pkl ├── test_BipedalWalker_v5.1.pkl ├── test_BipedalWalker_v5.2.pkl ├── test_BipedalWalker_v5.3.pkl ├── test_BipedalWalker_v6.0.pkl ├── test_BipedalWalker_v6.1.pkl ├── test_BipedalWalker_v6.2.pkl ├── test_CartPole_v1.pkl ├── test_LunarLanderCont_v1.pkl ├── test_LunarLander_v3.pkl └── test_MountainCarCont_v1.pkl ├── plot.py ├── plots ├── algo_code.png ├── gifs │ ├── best_bipedal_walker.gif │ ├── best_lunar.gif │ ├── best_lunar_cont.gif │ ├── best_mountain_car_cont.gif │ └── best_pole.gif ├── test_BipedalWalker_v1.png ├── test_BipedalWalker_v2.png ├── test_BipedalWalker_v3.png ├── test_BipedalWalker_v4.png ├── test_BipedalWalker_v5.0.png ├── test_BipedalWalker_v5.1.png ├── test_BipedalWalker_v5.2.png ├── test_BipedalWalker_v5.3.png ├── test_BipedalWalker_v6.0.png ├── test_BipedalWalker_v6.1.png ├── test_BipedalWalker_v6.2.png ├── test_CartPole_v1.png ├── test_LunarLanderCont_v1.png ├── test_LunarLander_v1.png ├── test_LunarLander_v3.png └── test_MountainCarCont_v1.png ├── tests ├── bipedal_walker.py ├── cart_pole.py ├── lunar_lander.py ├── lunar_lander_cont.py └── mountain_car_cont.py └── training.py /.gitignore: -------------------------------------------------------------------------------- 1 | .vscode 2 | __pycache__ 3 | .DS_Store 4 | tmp 5 | videos 6 | logs 7 | utils.py 8 | papers 9 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Evolution Strategies OpenAI 2 | 3 | Implementation is strictly for educational purposes and not distributed (as in paper), but it works. 4 | 5 | ## Example 6 | 7 | ```python 8 | from training import run_experiment, render_policy 9 | 10 | example_config = { 11 | "experiment_name": "test_BipedalWalker_v0", 12 | "plot_path": "plots/", 13 | "model_path": "models/", # optional 14 | "log_path": "logs/", # optional 15 | "init_model": "models/test_BipedalWalker_v5.0.pkl", # optional 16 | "env": "BipedalWalker-v3", 17 | "n_sessions": 128, 18 | "env_steps": 1600, 19 | "population_size": 256, 20 | "learning_rate": 0.06, 21 | "noise_std": 0.1, 22 | "noise_decay": 0.99, # optional 23 | "lr_decay": 1.0, # optional 24 | "decay_step": 20, # optional 25 | "eval_step": 10, 26 | "hidden_sizes": (40, 40) 27 | } 28 | 29 | policy = run_experiment(example_config, n_jobs=4, verbose=True) 30 | 31 | # to render policy perfomance 32 | render_policy(model_path, env_name, n_videos=10) 33 | ``` 34 | 35 | ## Implemented 36 | 37 | - [x] OpenAI ES algorithm [Algorithm 1]. 38 | - [x] Z-normalization fitness shaping (not rank-based). 39 | - [x] Parallelization with joblib. 40 | - [x] Training for 6 OpenAI gym envs (3 solved). 41 | - [x] Simple three layer net as policy example. 42 | - [x] [Learning rate & noise std decay.](https://towardsdatascience.com/learning-rate-schedules-and-adaptive-learning-rate-methods-for-deep-learning-2c8f433990d1) 43 | 44 | ![Algorithm_1](plots/algo_code.png) 45 | 46 | ## Experiments 47 | 48 | ### CartPole 49 | 50 | Solved quickly and easily, especially if the population size is increased. 
However, it is necessary to keep the learning rate in check: it is better to set it lower, and the same goes for the noise std. In this task there is no need to explore much; it is enough to get plenty of reward feedback for the natural gradient estimate. 51 | 52 | 53 | 54 | 55 | 56 |
57 | ### LunarLander 58 | 59 | As in the previous task, the algorithm does well. It is again important to set a small learning rate, but to slightly increase the noise std. 60 | 61 | 62 | 63 | 64 | 65 |
66 | ### LunarLanderContinuous 67 | 68 | The continuous env is solved much faster and better, probably thanks to the denser reward. It is also interesting that here the agent has learned to land faster: it does not fire the engines right away, but only just before landing. 69 | 70 | 71 | 72 | 73 | 74 |
75 | ### MountainCarContinuous 76 | 77 | Can't solve it yet. 78 | 79 | In the discrete version of the env, the main problem is the sparse reward, which is only given at the very end, after the car climbs the hill. Since an agent with random weights does not manage to do so within the 200-step limit, the natural gradient estimate turns out to be zero and training gets stuck. A workaround: remove the 200-step limit and wait until the random agent climbs the mountain on its own and gets its first reward :). However, this is not quite fair. 80 | 81 | In the continuous env, the main problem is the lack of exploration. The agent quickly (faster than it could learn to climb the hill) figures out that the best strategy is to stand still and collect a reward of 0, which is much higher than the negative reward it gets while moving. 82 | 83 | 84 | 85 | 86 | 87 |
88 | Possible solution: novelty search (rough sketches are given below, after the BipedalWalker section). As a novelty function one could take the velocity, velocity * x_coord, or the x_coord at the end of the episode. [Reward shaping](https://people.eecs.berkeley.edu/~pabbeel/cs287-fa09/readings/NgHaradaRussell-shaping-ICML1999.pdf) may improve convergence for DQN/CEM methods, but in this case it did not produce better results. 89 |
90 | ### BipedalWalker 91 | 92 | Not solved yet. More iterations are needed. 93 | 94 | 95 | 96 | 97 |
98 | 99 | 105 | 107 | 108 | 109 | 110 | 115 | ## References 116 | 117 | [Evolution Strategies as a Scalable Alternative to Reinforcement Learning](https://arxiv.org/abs/1703.03864) (Tim Salimans, Jonathan Ho, Xi Chen, Ilya Sutskever) 118 | -------------------------------------------------------------------------------- /es.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | 3 | from copy import deepcopy 4 | 5 | 6 | class OpenAiES: 7 | def __init__(self, model, learning_rate, noise_std, \ 8 | noise_decay=1.0, lr_decay=1.0, decay_step=50, norm_rewards=True): 9 | self.model = model 10 | 11 | self._lr = learning_rate 12 | self._noise_std = noise_std 13 | 14 | self.noise_decay = noise_decay 15 | self.lr_decay = lr_decay 16 | self.decay_step = decay_step 17 | 18 | self.norm_rewards = norm_rewards 19 | 20 | self._population = None 21 | self._count = 0 22 | 23 | @property 24 | def noise_std(self): 25 | step_decay = np.power(self.noise_decay, np.floor((1 + self._count) / self.decay_step)) 26 | 27 | return self._noise_std * step_decay 28 | 29 | @property 30 | def lr(self): 31 | step_decay = np.power(self.lr_decay, np.floor((1 + self._count) / self.decay_step)) 32 | 33 | return self._lr * step_decay 34 | 35 | def generate_population(self, npop=50): 36 | self._population = [] 37 | 38 | for i in range(npop): 39 | new_model = deepcopy(self.model) 40 | new_model.E = [] 41 | 42 | for i, layer in enumerate(new_model.W): 43 | noise = np.random.randn(layer.shape[0], layer.shape[1]) 44 | 45 | new_model.E.append(noise) 46 | new_model.W[i] = new_model.W[i] + self.noise_std * noise 47 | self._population.append(new_model) 48 | 49 | return self._population 50 | 51 | def update_population(self, rewards): 52 | if self._population is None: 53 | raise ValueError("populations is none, generate & eval it first") 54 | 55 | # z-normalization (?) 
- works better, but slower 56 | if self.norm_rewards: 57 | rewards = (rewards - rewards.mean()) / (rewards.std() + 1e-5) 58 | 59 | for i, layer in enumerate(self.model.W): 60 | w_updates = np.zeros_like(layer) 61 | 62 | for j, model in enumerate(self._population): 63 | w_updates = w_updates + (model.E[i] * rewards[j]) 64 | 65 | # SGD weights update 66 | self.model.W[i] = self.model.W[i] + (self.lr / (len(rewards) * self.noise_std)) * w_updates 67 | 68 | self._count = self._count + 1 69 | 70 | def get_model(self): 71 | return self.model 72 | 73 | 74 | class OpenAIES_NSR: 75 | # TODO: novelity search 76 | def __init__(self): 77 | pass -------------------------------------------------------------------------------- /evaluation.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | 3 | from joblib import delayed 4 | 5 | CONTINUOUS_ENVS = ('LunarLanderContinuous', "MountainCarContinuous", "BipedalWalker") 6 | 7 | def eval_policy(policy, env, n_steps=200): 8 | try: 9 | env_name = env.spec._env_name 10 | except AttributeError: 11 | env_name = env._env_name 12 | 13 | total_reward = 0 14 | 15 | obs = env.reset() 16 | for i in range(n_steps): 17 | if env_name in CONTINUOUS_ENVS: 18 | action = policy.predict(np.array(obs).reshape(1, -1), scale="tanh") 19 | else: 20 | action = policy.predict(np.array(obs).reshape(1, -1), scale="softmax") 21 | 22 | new_obs, reward, done, _ = env.step(action) 23 | 24 | total_reward = total_reward + reward 25 | obs = new_obs 26 | 27 | if done: 28 | break 29 | 30 | return total_reward 31 | 32 | 33 | # for parallel 34 | eval_policy_delayed = delayed(eval_policy) -------------------------------------------------------------------------------- /linear.py: -------------------------------------------------------------------------------- 1 | import pickle 2 | 3 | import numpy as np 4 | 5 | 6 | def ReLU(x): 7 | return np.maximum(0, x) 8 | 9 | 10 | def softmax(x): 11 | x_exp = np.exp(x - np.max(x)) 12 | return x_exp / x_exp.sum() 13 | 14 | 15 | def tanh(x): 16 | return np.tanh(x) 17 | 18 | 19 | class ThreeLayerNetwork: 20 | def __init__(self, in_features, out_features, hidden_sizes=(32, 32)): 21 | self.in_features = in_features 22 | self.out_features = out_features 23 | self.hidden_sizes = hidden_sizes 24 | 25 | self.W = self._init_layers() 26 | 27 | # TODO: init weights from model -> load_model(self, path) 28 | 29 | def _init_layers(self): 30 | layer1_dim, layer2_dim = self.hidden_sizes 31 | 32 | # +1 to dims for bias trick & He weight init 33 | W1 = np.random.randn(self.in_features + 1, layer1_dim + 1) * np.sqrt(2 / (self.in_features + 1)) 34 | W2 = np.random.randn(layer1_dim + 1, layer2_dim + 1) * np.sqrt(2 / (layer1_dim + 1)) 35 | W3 = np.random.randn(layer2_dim + 1, self.out_features) * np.sqrt(2 / (layer2_dim + 1)) 36 | 37 | return [W1, W2, W3] 38 | 39 | @staticmethod 40 | def from_model(path): 41 | with open(path, "rb") as file: 42 | model = pickle.load(file) 43 | 44 | assert isinstance(model, ThreeLayerNetwork), "init model is not instance of ThreeLayerNetwork class" 45 | 46 | return model 47 | 48 | def forward(self, X): 49 | bias = np.ones((X.shape[0], 1)) 50 | X_bias = np.hstack((X, bias)) 51 | 52 | output = ReLU(ReLU(X_bias @ self.W[0]) @ self.W[1]) @ self.W[2] 53 | 54 | return output 55 | 56 | def predict(self, X, scale="softmax"): 57 | X_norm = (X - X.mean()) / (X.std() + 1e-5) 58 | 59 | raw_output = self.forward(X_norm) 60 | 61 | if scale == "tanh": 62 | return tanh(raw_output)[0] 63 | elif scale == "softmax": 
64 | prob = softmax(raw_output)[0] 65 | # TODO: action choice more about agent than model 66 | return np.random.choice(self.out_features, p=prob) 67 | 68 | return raw_output[0] 69 | 70 | 71 | if __name__ == "__main__": 72 | model = ThreeLayerNetwork(4, 4) 73 | data = np.random.randn(1, 4) 74 | 75 | prediction = model.predict(data, scale="tanh") 76 | 77 | print(prediction) 78 | 79 | -------------------------------------------------------------------------------- /models/test_BipedalWalker_v1.pkl: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Howuhh/evolution_strategies_openai/8e9c369b5df94a4afeb6773f686fca1298a69285/models/test_BipedalWalker_v1.pkl -------------------------------------------------------------------------------- /models/test_BipedalWalker_v2.pkl: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Howuhh/evolution_strategies_openai/8e9c369b5df94a4afeb6773f686fca1298a69285/models/test_BipedalWalker_v2.pkl -------------------------------------------------------------------------------- /models/test_BipedalWalker_v3.pkl: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Howuhh/evolution_strategies_openai/8e9c369b5df94a4afeb6773f686fca1298a69285/models/test_BipedalWalker_v3.pkl -------------------------------------------------------------------------------- /models/test_BipedalWalker_v4.pkl: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Howuhh/evolution_strategies_openai/8e9c369b5df94a4afeb6773f686fca1298a69285/models/test_BipedalWalker_v4.pkl -------------------------------------------------------------------------------- /models/test_BipedalWalker_v5.0.pkl: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Howuhh/evolution_strategies_openai/8e9c369b5df94a4afeb6773f686fca1298a69285/models/test_BipedalWalker_v5.0.pkl -------------------------------------------------------------------------------- /models/test_BipedalWalker_v5.1.pkl: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Howuhh/evolution_strategies_openai/8e9c369b5df94a4afeb6773f686fca1298a69285/models/test_BipedalWalker_v5.1.pkl -------------------------------------------------------------------------------- /models/test_BipedalWalker_v5.2.pkl: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Howuhh/evolution_strategies_openai/8e9c369b5df94a4afeb6773f686fca1298a69285/models/test_BipedalWalker_v5.2.pkl -------------------------------------------------------------------------------- /models/test_BipedalWalker_v5.3.pkl: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Howuhh/evolution_strategies_openai/8e9c369b5df94a4afeb6773f686fca1298a69285/models/test_BipedalWalker_v5.3.pkl -------------------------------------------------------------------------------- /models/test_BipedalWalker_v6.0.pkl: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Howuhh/evolution_strategies_openai/8e9c369b5df94a4afeb6773f686fca1298a69285/models/test_BipedalWalker_v6.0.pkl -------------------------------------------------------------------------------- 
/models/test_BipedalWalker_v6.1.pkl: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Howuhh/evolution_strategies_openai/8e9c369b5df94a4afeb6773f686fca1298a69285/models/test_BipedalWalker_v6.1.pkl -------------------------------------------------------------------------------- /models/test_BipedalWalker_v6.2.pkl: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Howuhh/evolution_strategies_openai/8e9c369b5df94a4afeb6773f686fca1298a69285/models/test_BipedalWalker_v6.2.pkl -------------------------------------------------------------------------------- /models/test_CartPole_v1.pkl: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Howuhh/evolution_strategies_openai/8e9c369b5df94a4afeb6773f686fca1298a69285/models/test_CartPole_v1.pkl -------------------------------------------------------------------------------- /models/test_LunarLanderCont_v1.pkl: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Howuhh/evolution_strategies_openai/8e9c369b5df94a4afeb6773f686fca1298a69285/models/test_LunarLanderCont_v1.pkl -------------------------------------------------------------------------------- /models/test_LunarLander_v3.pkl: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Howuhh/evolution_strategies_openai/8e9c369b5df94a4afeb6773f686fca1298a69285/models/test_LunarLander_v3.pkl -------------------------------------------------------------------------------- /models/test_MountainCarCont_v1.pkl: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Howuhh/evolution_strategies_openai/8e9c369b5df94a4afeb6773f686fca1298a69285/models/test_MountainCarCont_v1.pkl -------------------------------------------------------------------------------- /plot.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import matplotlib.pyplot as plt 3 | 4 | 5 | def plot_rewards(mean_rewards, std_rewards, config): 6 | best_mean = np.array(mean_rewards) 7 | best_std = np.array(std_rewards) 8 | 9 | stats = ( 10 | f""" 11 | n_sessions: {config["n_sessions"]} 12 | population_size: {config["population_size"]} 13 | lr: {config["learning_rate"]} 14 | noise_std: {config["noise_std"]} 15 | env_steps: {config["env_steps"]} 16 | """ 17 | ) # TODO: add hidden size info on plot 18 | 19 | fig, ax = plt.subplots() 20 | plt.figure(figsize=(12, 8)) 21 | plt.text(0.35, 1.25, stats, transform=ax.transAxes) 22 | plt.title(f"{config['env']}: {config['experiment_name']}") 23 | plt.plot(np.arange(best_mean.shape[0]), best_mean) 24 | plt.fill_between(np.arange(best_mean.shape[0]), best_mean + best_std, best_mean - best_std, alpha=0.5) 25 | plt.xlabel(f"weights updates (mod {config.get('eval_step', '2')})") 26 | plt.ylabel("reward") 27 | plt.savefig(f"{config['plot_path']}{config['experiment_name']}.png") -------------------------------------------------------------------------------- /plots/algo_code.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Howuhh/evolution_strategies_openai/8e9c369b5df94a4afeb6773f686fca1298a69285/plots/algo_code.png -------------------------------------------------------------------------------- 
/plots/gifs/best_bipedal_walker.gif: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Howuhh/evolution_strategies_openai/8e9c369b5df94a4afeb6773f686fca1298a69285/plots/gifs/best_bipedal_walker.gif -------------------------------------------------------------------------------- /plots/gifs/best_lunar.gif: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Howuhh/evolution_strategies_openai/8e9c369b5df94a4afeb6773f686fca1298a69285/plots/gifs/best_lunar.gif -------------------------------------------------------------------------------- /plots/gifs/best_lunar_cont.gif: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Howuhh/evolution_strategies_openai/8e9c369b5df94a4afeb6773f686fca1298a69285/plots/gifs/best_lunar_cont.gif -------------------------------------------------------------------------------- /plots/gifs/best_mountain_car_cont.gif: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Howuhh/evolution_strategies_openai/8e9c369b5df94a4afeb6773f686fca1298a69285/plots/gifs/best_mountain_car_cont.gif -------------------------------------------------------------------------------- /plots/gifs/best_pole.gif: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Howuhh/evolution_strategies_openai/8e9c369b5df94a4afeb6773f686fca1298a69285/plots/gifs/best_pole.gif -------------------------------------------------------------------------------- /plots/test_BipedalWalker_v1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Howuhh/evolution_strategies_openai/8e9c369b5df94a4afeb6773f686fca1298a69285/plots/test_BipedalWalker_v1.png -------------------------------------------------------------------------------- /plots/test_BipedalWalker_v2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Howuhh/evolution_strategies_openai/8e9c369b5df94a4afeb6773f686fca1298a69285/plots/test_BipedalWalker_v2.png -------------------------------------------------------------------------------- /plots/test_BipedalWalker_v3.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Howuhh/evolution_strategies_openai/8e9c369b5df94a4afeb6773f686fca1298a69285/plots/test_BipedalWalker_v3.png -------------------------------------------------------------------------------- /plots/test_BipedalWalker_v4.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Howuhh/evolution_strategies_openai/8e9c369b5df94a4afeb6773f686fca1298a69285/plots/test_BipedalWalker_v4.png -------------------------------------------------------------------------------- /plots/test_BipedalWalker_v5.0.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Howuhh/evolution_strategies_openai/8e9c369b5df94a4afeb6773f686fca1298a69285/plots/test_BipedalWalker_v5.0.png -------------------------------------------------------------------------------- /plots/test_BipedalWalker_v5.1.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/Howuhh/evolution_strategies_openai/8e9c369b5df94a4afeb6773f686fca1298a69285/plots/test_BipedalWalker_v5.1.png -------------------------------------------------------------------------------- /plots/test_BipedalWalker_v5.2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Howuhh/evolution_strategies_openai/8e9c369b5df94a4afeb6773f686fca1298a69285/plots/test_BipedalWalker_v5.2.png -------------------------------------------------------------------------------- /plots/test_BipedalWalker_v5.3.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Howuhh/evolution_strategies_openai/8e9c369b5df94a4afeb6773f686fca1298a69285/plots/test_BipedalWalker_v5.3.png -------------------------------------------------------------------------------- /plots/test_BipedalWalker_v6.0.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Howuhh/evolution_strategies_openai/8e9c369b5df94a4afeb6773f686fca1298a69285/plots/test_BipedalWalker_v6.0.png -------------------------------------------------------------------------------- /plots/test_BipedalWalker_v6.1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Howuhh/evolution_strategies_openai/8e9c369b5df94a4afeb6773f686fca1298a69285/plots/test_BipedalWalker_v6.1.png -------------------------------------------------------------------------------- /plots/test_BipedalWalker_v6.2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Howuhh/evolution_strategies_openai/8e9c369b5df94a4afeb6773f686fca1298a69285/plots/test_BipedalWalker_v6.2.png -------------------------------------------------------------------------------- /plots/test_CartPole_v1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Howuhh/evolution_strategies_openai/8e9c369b5df94a4afeb6773f686fca1298a69285/plots/test_CartPole_v1.png -------------------------------------------------------------------------------- /plots/test_LunarLanderCont_v1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Howuhh/evolution_strategies_openai/8e9c369b5df94a4afeb6773f686fca1298a69285/plots/test_LunarLanderCont_v1.png -------------------------------------------------------------------------------- /plots/test_LunarLander_v1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Howuhh/evolution_strategies_openai/8e9c369b5df94a4afeb6773f686fca1298a69285/plots/test_LunarLander_v1.png -------------------------------------------------------------------------------- /plots/test_LunarLander_v3.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Howuhh/evolution_strategies_openai/8e9c369b5df94a4afeb6773f686fca1298a69285/plots/test_LunarLander_v3.png -------------------------------------------------------------------------------- /plots/test_MountainCarCont_v1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Howuhh/evolution_strategies_openai/8e9c369b5df94a4afeb6773f686fca1298a69285/plots/test_MountainCarCont_v1.png 
-------------------------------------------------------------------------------- /tests/bipedal_walker.py: -------------------------------------------------------------------------------- 1 | import sys 2 | import pickle 3 | 4 | sys.path.append('..') 5 | 6 | from training import run_experiment 7 | 8 | # solving the task as getting an average score of 300+ over 100 consecutive random trials. 9 | def test(): 10 | test_config = { 11 | "experiment_name": "test_BipedalWalker_v6.2", 12 | "plot_path": "../plots/", 13 | "model_path": "../models/", 14 | "log_path": "../logs/", 15 | "init_model": "../models/test_BipedalWalker_v6.1.pkl", 16 | "env": "BipedalWalker-v3", 17 | "n_sessions": 250, 18 | "env_steps": 1300, 19 | "population_size": 128, 20 | "learning_rate": 0.065, 21 | "noise_std": 0.07783, 22 | "noise_decay": 0.995, 23 | "decay_step": 20, 24 | "eval_step": 10, 25 | "hidden_sizes": (64, 40) # sizes from https://designrl.github.io/ 26 | } 27 | 28 | policy = run_experiment(test_config, n_jobs=4) 29 | 30 | 31 | if __name__ == "__main__": 32 | test() -------------------------------------------------------------------------------- /tests/cart_pole.py: -------------------------------------------------------------------------------- 1 | import sys 2 | import pickle 3 | 4 | sys.path.append('..') 5 | from training import run_experiment 6 | 7 | 8 | def test(): 9 | test_config = { 10 | "experiment_name": "test_CartPole_v2", 11 | "plot_path": "../plots/", 12 | "model_path": "../models/", 13 | "env": "CartPole-v0", 14 | "n_sessions": 64, 15 | "env_steps": 200, 16 | "population_size": 256, 17 | "learning_rate": 0.01, 18 | "noise_std": 0.05, 19 | "hidden_sizes": (64, 64) 20 | } 21 | policy = run_experiment(test_config) 22 | 23 | # TODO: not easy, need a change 24 | with open(f"{test_config['model_path']}{test_config['experiment_name']}.pkl", "wb") as file: 25 | pickle.dump(policy, file) 26 | 27 | 28 | if __name__ == "__main__": 29 | test() -------------------------------------------------------------------------------- /tests/lunar_lander.py: -------------------------------------------------------------------------------- 1 | import sys 2 | import pickle 3 | 4 | sys.path.append('..') 5 | 6 | from training import run_experiment 7 | 8 | # TODO: next is LunarLanderContinuous-v2 9 | def test(): 10 | test_config = { 11 | "experiment_name": "test_LunarLander_v4", 12 | "plot_path": "../plots/", 13 | "model_path": "../models/", 14 | "env": "LunarLander-v2", 15 | "n_sessions": 512, 16 | "env_steps": 500, 17 | "population_size": 256, 18 | "learning_rate": 0.01, 19 | "noise_std": 0.075, 20 | "hidden_sizes": (64, 64) 21 | } 22 | 23 | policy = run_experiment(test_config) 24 | 25 | with open(f"{test_config['model_path']}{test_config['experiment_name']}.pkl", "wb") as file: 26 | pickle.dump(policy, file) 27 | 28 | 29 | if __name__ == "__main__": 30 | test() -------------------------------------------------------------------------------- /tests/lunar_lander_cont.py: -------------------------------------------------------------------------------- 1 | import sys 2 | import pickle 3 | 4 | sys.path.append('..') 5 | from training import run_experiment 6 | 7 | 8 | # TODO: parallel & continuous 9 | def test(): 10 | test_config = { 11 | "experiment_name": "test_LunarLanderCont_v2", 12 | "plot_path": "../plots/", 13 | "model_path": "../models/", 14 | "env": "LunarLanderContinuous-v2", 15 | "n_sessions": 512, 16 | "env_steps": 500, 17 | "population_size": 256, 18 | "learning_rate": 0.01, 19 | "noise_std": 0.075, 20 | 
"hidden_sizes": (64, 64) 21 | } 22 | policy = run_experiment(test_config) 23 | 24 | with open(f"{test_config['model_path']}{test_config['experiment_name']}.pkl", "wb") as file: 25 | pickle.dump(policy, file) 26 | 27 | 28 | if __name__ == "__main__": 29 | test() 30 | -------------------------------------------------------------------------------- /tests/mountain_car_cont.py: -------------------------------------------------------------------------------- 1 | import sys 2 | import pickle 3 | 4 | sys.path.append('..') 5 | 6 | from training import run_experiment 7 | 8 | # MountainCar-v0 defines "solving" as getting average reward of -110.0 over 100 consecutive trials. 9 | # TODO: wait for novelity search 10 | def test(): 11 | test_config = { 12 | "experiment_name": "test_MountainCarCont_v2", 13 | "plot_path": "../plots/", 14 | "model_path": "../models/", 15 | "env": "MountainCarContinuous-v0", 16 | "n_sessions": 128, 17 | "env_steps": 200, 18 | "population_size": 256, 19 | "learning_rate": 0.1, 20 | "noise_std": 0.5, 21 | "hidden_sizes": (32, 32) 22 | } 23 | 24 | policy = run_experiment(test_config, n_jobs=4) 25 | 26 | with open(f"{test_config['model_path']}{test_config['experiment_name']}.pkl", "wb") as file: 27 | pickle.dump(policy, file) 28 | 29 | 30 | if __name__ == "__main__": 31 | test() 32 | -------------------------------------------------------------------------------- /training.py: -------------------------------------------------------------------------------- 1 | import gym 2 | import pickle 3 | import uuid 4 | 5 | import numpy as np 6 | 7 | from tqdm import tqdm 8 | from joblib import Parallel 9 | from collections import defaultdict 10 | 11 | from gym import wrappers 12 | 13 | from linear import ThreeLayerNetwork 14 | from es import OpenAiES 15 | from plot import plot_rewards 16 | from evaluation import eval_policy_delayed, eval_policy 17 | 18 | # env: (n_states, n_actions) 19 | ENV_INFO = { 20 | "CartPole-v0": (4, 2), 21 | "LunarLander-v2": (8, 4), 22 | "LunarLanderContinuous-v2": (8, 2), 23 | "MountainCar-v0": (2, 3), 24 | "MountainCarContinuous-v0": (2, 1), 25 | "CarRacing-v0": (96*96*3, 3), # TODO: wrap env to prep pixels & discrete actions 26 | "BipedalWalker-v3": (24, 4) 27 | } 28 | 29 | 30 | def train_loop(policy, env, config, n_jobs=1, verbose=True): 31 | es = OpenAiES( 32 | model=policy, 33 | learning_rate=config["learning_rate"], 34 | noise_std=config["noise_std"], 35 | noise_decay=config.get("noise_decay", 1.0), 36 | lr_decay=config.get("lr_decay", 1.0), 37 | decay_step=config.get("decay_step", 50) 38 | ) 39 | 40 | log = defaultdict(list) 41 | for session in tqdm(range(config["n_sessions"])): 42 | population = es.generate_population(config["population_size"]) 43 | 44 | rewards_jobs = (eval_policy_delayed(new_policy, env, config["env_steps"]) for new_policy in population) 45 | rewards = np.array(Parallel(n_jobs=n_jobs)(rewards_jobs)) 46 | 47 | es.update_population(rewards) 48 | 49 | # populations stats 50 | log["pop_mean_rewards"].append(np.mean(rewards)) 51 | log["pop_std_rewards"].append(np.std(rewards)) 52 | 53 | # best policy stats 54 | if session % config.get("eval_step", 2) == 0: 55 | best_policy = es.get_model() 56 | 57 | best_rewards = np.zeros(10) 58 | for i in range(10): 59 | best_rewards[i] = eval_policy(best_policy, env, config["env_steps"]) 60 | 61 | if verbose: 62 | # TODO: add timestamp 63 | print(f"Session: {session}") 64 | print(f"Mean reward: {round(np.mean(rewards), 4)}", f"std: {round(np.std(rewards), 3)}") 65 | print(f"lr: {round(es.lr, 5)}, 
noise_std: {round(es.noise_std, 5)}") 66 | 67 | log["best_mean_rewards"].append(np.mean(best_rewards)) 68 | log["best_std_rewards"].append(np.std(best_rewards)) 69 | 70 | return log 71 | 72 | 73 | def run_experiment(config, n_jobs=4, verbose=True): 74 | env = gym.make(config["env"]) 75 | env._env_name = env.spec._env_name 76 | 77 | n_states, n_actions = ENV_INFO[config["env"]] 78 | 79 | if config.get("init_model", None): 80 | policy = ThreeLayerNetwork.from_model(config["init_model"]) 81 | 82 | assert policy.in_features == n_states, "not correct policy input dims" 83 | assert policy.out_features == n_actions, "not correct policy output dims" 84 | else: 85 | policy = ThreeLayerNetwork( 86 | in_features=n_states, 87 | out_features=n_actions, 88 | hidden_sizes=config["hidden_sizes"] 89 | ) 90 | # TODO: save model on KeyboardInterrupt exception 91 | log = train_loop(policy, env, config, n_jobs, verbose) 92 | 93 | if config.get("log_path", None): 94 | with open(f"{config['log_path']}{config['experiment_name']}.pkl", "wb") as file: 95 | pickle.dump(log, file) 96 | 97 | if config.get("model_path", None): 98 | with open(f"{config['model_path']}{config['experiment_name']}.pkl", "wb") as file: 99 | pickle.dump(policy, file) 100 | 101 | plot_rewards(log["best_mean_rewards"], log["best_std_rewards"], config) 102 | 103 | return policy 104 | 105 | 106 | def render_policy(model_path, env_name, n_videos=1): 107 | with open(model_path, "rb") as file: 108 | policy = pickle.load(file) 109 | 110 | model_name = model_path.split("/")[-1].split(".")[0] 111 | 112 | for i in range(n_videos): 113 | env = gym.make(env_name) 114 | env = wrappers.Monitor(env, f'videos/{model_name}/' + str(uuid.uuid4()), force=True) 115 | 116 | print(eval_policy(policy, env, n_steps=1600)) 117 | env.close() 118 | 119 | 120 | if __name__ == "__main__": 121 | # TODO: analyse population stat from logs 122 | render_policy("models/test_BipedalWalker_v6.1.pkl", "BipedalWalker-v3") 123 | 124 | --------------------------------------------------------------------------------