├── .gitignore
├── README.md
├── es.py
├── evaluation.py
├── linear.py
├── models
│   ├── test_BipedalWalker_v1.pkl
│   ├── test_BipedalWalker_v2.pkl
│   ├── test_BipedalWalker_v3.pkl
│   ├── test_BipedalWalker_v4.pkl
│   ├── test_BipedalWalker_v5.0.pkl
│   ├── test_BipedalWalker_v5.1.pkl
│   ├── test_BipedalWalker_v5.2.pkl
│   ├── test_BipedalWalker_v5.3.pkl
│   ├── test_BipedalWalker_v6.0.pkl
│   ├── test_BipedalWalker_v6.1.pkl
│   ├── test_BipedalWalker_v6.2.pkl
│   ├── test_CartPole_v1.pkl
│   ├── test_LunarLanderCont_v1.pkl
│   ├── test_LunarLander_v3.pkl
│   └── test_MountainCarCont_v1.pkl
├── plot.py
├── plots
│   ├── algo_code.png
│   ├── gifs
│   │   ├── best_bipedal_walker.gif
│   │   ├── best_lunar.gif
│   │   ├── best_lunar_cont.gif
│   │   ├── best_mountain_car_cont.gif
│   │   └── best_pole.gif
│   ├── test_BipedalWalker_v1.png
│   ├── test_BipedalWalker_v2.png
│   ├── test_BipedalWalker_v3.png
│   ├── test_BipedalWalker_v4.png
│   ├── test_BipedalWalker_v5.0.png
│   ├── test_BipedalWalker_v5.1.png
│   ├── test_BipedalWalker_v5.2.png
│   ├── test_BipedalWalker_v5.3.png
│   ├── test_BipedalWalker_v6.0.png
│   ├── test_BipedalWalker_v6.1.png
│   ├── test_BipedalWalker_v6.2.png
│   ├── test_CartPole_v1.png
│   ├── test_LunarLanderCont_v1.png
│   ├── test_LunarLander_v1.png
│   ├── test_LunarLander_v3.png
│   └── test_MountainCarCont_v1.png
├── tests
│   ├── bipedal_walker.py
│   ├── cart_pole.py
│   ├── lunar_lander.py
│   ├── lunar_lander_cont.py
│   └── mountain_car_cont.py
└── training.py
/.gitignore:
--------------------------------------------------------------------------------
1 | .vscode
2 | __pycache__
3 | .DS_Store
4 | tmp
5 | videos
6 | logs
7 | utils.py
8 | papers
9 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Evolution Strategies OpenAI
2 |
3 | This implementation is strictly for educational purposes and is not distributed (unlike in the paper), but it works.
4 |
5 | ## Example
6 |
7 | ```python
8 | from training import run_experiment, render_policy
9 |
10 | example_config = {
11 | "experiment_name": "test_BipedalWalker_v0",
12 | "plot_path": "plots/",
13 | "model_path": "models/", # optional
14 | "log_path": "logs/", # optional
15 | "init_model": "models/test_BipedalWalker_v5.0.pkl", # optional
16 | "env": "BipedalWalker-v3",
17 | "n_sessions": 128,
18 | "env_steps": 1600,
19 | "population_size": 256,
20 | "learning_rate": 0.06,
21 | "noise_std": 0.1,
22 | "noise_decay": 0.99, # optional
23 | "lr_decay": 1.0, # optional
24 | "decay_step": 20, # optional
25 | "eval_step": 10,
26 | "hidden_sizes": (40, 40)
27 | }
28 |
29 | policy = run_experiment(example_config, n_jobs=4, verbose=True)
30 |
31 | # to render the policy's performance
32 | render_policy("models/test_BipedalWalker_v6.1.pkl", "BipedalWalker-v3", n_videos=10)
33 | ```
34 |
35 | ## Implemented
36 |
37 | - [x] OpenAI ES algorithm [Algorithm 1] (update rule sketched below).
38 | - [x] Z-normalization fitness shaping (not rank-based).
39 | - [x] Parallelization with joblib.
40 | - [x] Training for 6 OpenAI gym envs (3 solved).
41 | - [x] Simple three-layer net as a policy example.
42 | - [x] [Learning rate & noise std decay.](https://towardsdatascience.com/learning-rate-schedules-and-adaptive-learning-rate-methods-for-deep-learning-2c8f433990d1)
43 |
44 | 
45 |
46 | ## Experiments
47 |
48 | ### CartPole
49 |
50 | Solved quickly and easily, especially if the population size is increased. However, the learning rate needs care: it is better to keep it small, and the same goes for the noise std, since this task requires almost no exploration; it is enough to collect plenty of reward feedback for the natural gradient estimate. An example config is shown below.
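
As a concrete starting point, these are the hyperparameters used in `tests/cart_pole.py`; treat them as one setting that worked here rather than tuned optima:

```python
# Config from tests/cart_pole.py: note the small learning rate and noise std.
cartpole_config = {
    "experiment_name": "test_CartPole_v2",
    "plot_path": "plots/",
    "model_path": "models/",
    "env": "CartPole-v0",
    "n_sessions": 64,
    "env_steps": 200,
    "population_size": 256,
    "learning_rate": 0.01,  # keep it small
    "noise_std": 0.05,      # little exploration is needed here
    "hidden_sizes": (64, 64)
}
```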
51 |
52 |
53 |
54 |
55 |
56 |
57 | ### LunarLander
58 |
59 | As in the previous task, the algorithm does well here. It is again important to set a small learning rate, but to slightly increase the noise std.
60 |
61 |
62 |
63 |
64 |
65 |
66 | ### LunarLanderContinuous
67 |
68 | The continuous env is solved much faster and better, probably thanks to the denser reward. It is also interesting that here the agent has learned to land faster: it does not fire the engines immediately, but only right before landing.
69 |
70 |
71 |
72 |
73 |
74 |
75 | ### MountainCarContinuous
76 |
77 | Not solved yet.
78 |
79 | In the discrete version of the env, the main problem is the sparse reward, which is only given at the very end, if the car climbs the hill. Since a randomly initialized agent does not manage to do so within the 200-step limit, the natural gradient estimate turns out to be zero and training gets stuck. Solution: remove the 200-step limit and wait for the random agent to climb the mountain on its own, collecting the first reward :). However, this is not quite fair. A sketch of this workaround is shown below.
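
A minimal sketch of lifting the step limit, assuming the old `gym` API used in this repo (the `_max_episode_steps` attribute belongs to gym's `TimeLimit` wrapper and may differ in newer versions):

```python
# Sketch: lift the 200-step TimeLimit cap so a random agent can eventually reach the flag.
import gym

env = gym.make("MountainCar-v0")
env._max_episode_steps = 10_000  # or unwrap with env.env to drop the TimeLimit wrapper

# eval_policy from evaluation.py can then run longer rollouts:
# total_reward = eval_policy(policy, env, n_steps=10_000)
```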
80 |
81 | In the continuous env, the main problem is the lack of exploration. The agent quickly (sooner than it would climb the hill) realizes that the best strategy is to stand still and get a reward of 0, which is much higher than the reward it gets while moving.
82 |
83 |
84 |
85 |
86 |
87 |
88 | Possible solution: novelty search. As a novelty function one could take the velocity, velocity * x_coord, or x_coord at the end of the episode. [Reward shaping](https://people.eecs.berkeley.edu/~pabbeel/cs287-fa09/readings/NgHaradaRussell-shaping-ICML1999.pdf) may improve convergence for DQN/CEM methods, but here it does not produce better results. A sketch of a possible novelty score is shown below.
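
A hypothetical sketch of such a novelty score (note that `OpenAIES_NSR` in `es.py` is still an empty TODO, so none of this exists in the repo yet):

```python
# Hypothetical novelty score: mean distance from an episode's final x_coord
# to its k nearest neighbours in an archive of previously seen behaviors.
import numpy as np

def novelty_score(behavior, archive, k=10):
    if not archive:
        return 0.0
    dists = np.sort(np.abs(np.asarray(archive) - behavior))
    return float(dists[:k].mean())

# archive.append(final_x_coord) after each evaluated episode, then mix novelty
# into the fitness, e.g. reward + w * novelty_score(final_x_coord, archive)
```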
89 |
90 | ### BipedalWalker
91 |
92 | Not solved yet. More iterations are needed.
93 |
94 |
95 |
96 |
97 |
98 |
99 |
105 |
107 |
108 |
109 |
110 |
115 | ## References
116 |
117 | [Evolution Strategies as a Scalable Alternative to Reinforcement Learning](https://arxiv.org/abs/1703.03864) (Tim Salimans, Jonathan Ho, Xi Chen, Ilya Sutskever)
118 |
--------------------------------------------------------------------------------
/es.py:
--------------------------------------------------------------------------------
1 | import numpy as np
2 |
3 | from copy import deepcopy
4 |
5 |
6 | class OpenAiES:
7 | def __init__(self, model, learning_rate, noise_std, \
8 | noise_decay=1.0, lr_decay=1.0, decay_step=50, norm_rewards=True):
9 | self.model = model
10 |
11 | self._lr = learning_rate
12 | self._noise_std = noise_std
13 |
14 | self.noise_decay = noise_decay
15 | self.lr_decay = lr_decay
16 | self.decay_step = decay_step
17 |
18 | self.norm_rewards = norm_rewards
19 |
20 | self._population = None
21 | self._count = 0
22 |
23 | @property
24 | def noise_std(self):
25 | step_decay = np.power(self.noise_decay, np.floor((1 + self._count) / self.decay_step))
26 |
27 | return self._noise_std * step_decay
28 |
29 | @property
30 | def lr(self):
31 | step_decay = np.power(self.lr_decay, np.floor((1 + self._count) / self.decay_step))
32 |
33 | return self._lr * step_decay
34 |
35 | def generate_population(self, npop=50):
36 | self._population = []
37 |
38 | for i in range(npop):
39 | new_model = deepcopy(self.model)
40 | new_model.E = []
41 |
42 | for i, layer in enumerate(new_model.W):
43 | noise = np.random.randn(layer.shape[0], layer.shape[1])
44 |
45 | new_model.E.append(noise)
46 | new_model.W[i] = new_model.W[i] + self.noise_std * noise
47 | self._population.append(new_model)
48 |
49 | return self._population
50 |
51 | def update_population(self, rewards):
52 | if self._population is None:
53 |             raise ValueError("population is None, generate & evaluate it first")
54 |
55 | # z-normalization (?) - works better, but slower
56 | if self.norm_rewards:
57 | rewards = (rewards - rewards.mean()) / (rewards.std() + 1e-5)
58 |
59 | for i, layer in enumerate(self.model.W):
60 | w_updates = np.zeros_like(layer)
61 |
62 | for j, model in enumerate(self._population):
63 | w_updates = w_updates + (model.E[i] * rewards[j])
64 |
65 | # SGD weights update
66 | self.model.W[i] = self.model.W[i] + (self.lr / (len(rewards) * self.noise_std)) * w_updates
67 |
68 | self._count = self._count + 1
69 |
70 | def get_model(self):
71 | return self.model
72 |
73 |
74 | class OpenAIES_NSR:
75 |     # TODO: novelty search
76 | def __init__(self):
77 | pass
--------------------------------------------------------------------------------
/evaluation.py:
--------------------------------------------------------------------------------
1 | import numpy as np
2 |
3 | from joblib import delayed
4 |
5 | CONTINUOUS_ENVS = ('LunarLanderContinuous', "MountainCarContinuous", "BipedalWalker")
6 |
7 | def eval_policy(policy, env, n_steps=200):
8 | try:
9 | env_name = env.spec._env_name
10 | except AttributeError:
11 | env_name = env._env_name
12 |
13 | total_reward = 0
14 |
15 | obs = env.reset()
16 | for i in range(n_steps):
17 | if env_name in CONTINUOUS_ENVS:
18 | action = policy.predict(np.array(obs).reshape(1, -1), scale="tanh")
19 | else:
20 | action = policy.predict(np.array(obs).reshape(1, -1), scale="softmax")
21 |
22 | new_obs, reward, done, _ = env.step(action)
23 |
24 | total_reward = total_reward + reward
25 | obs = new_obs
26 |
27 | if done:
28 | break
29 |
30 | return total_reward
31 |
32 |
33 | # for parallel
34 | eval_policy_delayed = delayed(eval_policy)
--------------------------------------------------------------------------------
/linear.py:
--------------------------------------------------------------------------------
1 | import pickle
2 |
3 | import numpy as np
4 |
5 |
6 | def ReLU(x):
7 | return np.maximum(0, x)
8 |
9 |
10 | def softmax(x):
11 | x_exp = np.exp(x - np.max(x))
12 | return x_exp / x_exp.sum()
13 |
14 |
15 | def tanh(x):
16 | return np.tanh(x)
17 |
18 |
19 | class ThreeLayerNetwork:
20 | def __init__(self, in_features, out_features, hidden_sizes=(32, 32)):
21 | self.in_features = in_features
22 | self.out_features = out_features
23 | self.hidden_sizes = hidden_sizes
24 |
25 | self.W = self._init_layers()
26 |
27 | # TODO: init weights from model -> load_model(self, path)
28 |
29 | def _init_layers(self):
30 | layer1_dim, layer2_dim = self.hidden_sizes
31 |
32 | # +1 to dims for bias trick & He weight init
33 | W1 = np.random.randn(self.in_features + 1, layer1_dim + 1) * np.sqrt(2 / (self.in_features + 1))
34 | W2 = np.random.randn(layer1_dim + 1, layer2_dim + 1) * np.sqrt(2 / (layer1_dim + 1))
35 | W3 = np.random.randn(layer2_dim + 1, self.out_features) * np.sqrt(2 / (layer2_dim + 1))
36 |
37 | return [W1, W2, W3]
38 |
39 | @staticmethod
40 | def from_model(path):
41 | with open(path, "rb") as file:
42 | model = pickle.load(file)
43 |
44 | assert isinstance(model, ThreeLayerNetwork), "init model is not instance of ThreeLayerNetwork class"
45 |
46 | return model
47 |
48 | def forward(self, X):
49 | bias = np.ones((X.shape[0], 1))
50 | X_bias = np.hstack((X, bias))
51 |
52 | output = ReLU(ReLU(X_bias @ self.W[0]) @ self.W[1]) @ self.W[2]
53 |
54 | return output
55 |
56 | def predict(self, X, scale="softmax"):
57 | X_norm = (X - X.mean()) / (X.std() + 1e-5)
58 |
59 | raw_output = self.forward(X_norm)
60 |
61 | if scale == "tanh":
62 | return tanh(raw_output)[0]
63 | elif scale == "softmax":
64 | prob = softmax(raw_output)[0]
65 | # TODO: action choice more about agent than model
66 | return np.random.choice(self.out_features, p=prob)
67 |
68 | return raw_output[0]
69 |
70 |
71 | if __name__ == "__main__":
72 | model = ThreeLayerNetwork(4, 4)
73 | data = np.random.randn(1, 4)
74 |
75 | prediction = model.predict(data, scale="tanh")
76 |
77 | print(prediction)
78 |
79 |
--------------------------------------------------------------------------------
/models/test_BipedalWalker_v1.pkl:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Howuhh/evolution_strategies_openai/8e9c369b5df94a4afeb6773f686fca1298a69285/models/test_BipedalWalker_v1.pkl
--------------------------------------------------------------------------------
/models/test_BipedalWalker_v2.pkl:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Howuhh/evolution_strategies_openai/8e9c369b5df94a4afeb6773f686fca1298a69285/models/test_BipedalWalker_v2.pkl
--------------------------------------------------------------------------------
/models/test_BipedalWalker_v3.pkl:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Howuhh/evolution_strategies_openai/8e9c369b5df94a4afeb6773f686fca1298a69285/models/test_BipedalWalker_v3.pkl
--------------------------------------------------------------------------------
/models/test_BipedalWalker_v4.pkl:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Howuhh/evolution_strategies_openai/8e9c369b5df94a4afeb6773f686fca1298a69285/models/test_BipedalWalker_v4.pkl
--------------------------------------------------------------------------------
/models/test_BipedalWalker_v5.0.pkl:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Howuhh/evolution_strategies_openai/8e9c369b5df94a4afeb6773f686fca1298a69285/models/test_BipedalWalker_v5.0.pkl
--------------------------------------------------------------------------------
/models/test_BipedalWalker_v5.1.pkl:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Howuhh/evolution_strategies_openai/8e9c369b5df94a4afeb6773f686fca1298a69285/models/test_BipedalWalker_v5.1.pkl
--------------------------------------------------------------------------------
/models/test_BipedalWalker_v5.2.pkl:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Howuhh/evolution_strategies_openai/8e9c369b5df94a4afeb6773f686fca1298a69285/models/test_BipedalWalker_v5.2.pkl
--------------------------------------------------------------------------------
/models/test_BipedalWalker_v5.3.pkl:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Howuhh/evolution_strategies_openai/8e9c369b5df94a4afeb6773f686fca1298a69285/models/test_BipedalWalker_v5.3.pkl
--------------------------------------------------------------------------------
/models/test_BipedalWalker_v6.0.pkl:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Howuhh/evolution_strategies_openai/8e9c369b5df94a4afeb6773f686fca1298a69285/models/test_BipedalWalker_v6.0.pkl
--------------------------------------------------------------------------------
/models/test_BipedalWalker_v6.1.pkl:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Howuhh/evolution_strategies_openai/8e9c369b5df94a4afeb6773f686fca1298a69285/models/test_BipedalWalker_v6.1.pkl
--------------------------------------------------------------------------------
/models/test_BipedalWalker_v6.2.pkl:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Howuhh/evolution_strategies_openai/8e9c369b5df94a4afeb6773f686fca1298a69285/models/test_BipedalWalker_v6.2.pkl
--------------------------------------------------------------------------------
/models/test_CartPole_v1.pkl:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Howuhh/evolution_strategies_openai/8e9c369b5df94a4afeb6773f686fca1298a69285/models/test_CartPole_v1.pkl
--------------------------------------------------------------------------------
/models/test_LunarLanderCont_v1.pkl:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Howuhh/evolution_strategies_openai/8e9c369b5df94a4afeb6773f686fca1298a69285/models/test_LunarLanderCont_v1.pkl
--------------------------------------------------------------------------------
/models/test_LunarLander_v3.pkl:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Howuhh/evolution_strategies_openai/8e9c369b5df94a4afeb6773f686fca1298a69285/models/test_LunarLander_v3.pkl
--------------------------------------------------------------------------------
/models/test_MountainCarCont_v1.pkl:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Howuhh/evolution_strategies_openai/8e9c369b5df94a4afeb6773f686fca1298a69285/models/test_MountainCarCont_v1.pkl
--------------------------------------------------------------------------------
/plot.py:
--------------------------------------------------------------------------------
1 | import numpy as np
2 | import matplotlib.pyplot as plt
3 |
4 |
5 | def plot_rewards(mean_rewards, std_rewards, config):
6 | best_mean = np.array(mean_rewards)
7 | best_std = np.array(std_rewards)
8 |
9 | stats = (
10 | f"""
11 | n_sessions: {config["n_sessions"]}
12 | population_size: {config["population_size"]}
13 | lr: {config["learning_rate"]}
14 | noise_std: {config["noise_std"]}
15 | env_steps: {config["env_steps"]}
16 | """
17 | ) # TODO: add hidden size info on plot
18 |
19 |     fig, ax = plt.subplots(figsize=(12, 8))
20 |     fig.subplots_adjust(top=0.7)  # leave room above the axes for the stats text
21 | plt.text(0.35, 1.25, stats, transform=ax.transAxes)
22 | plt.title(f"{config['env']}: {config['experiment_name']}")
23 | plt.plot(np.arange(best_mean.shape[0]), best_mean)
24 | plt.fill_between(np.arange(best_mean.shape[0]), best_mean + best_std, best_mean - best_std, alpha=0.5)
25 | plt.xlabel(f"weights updates (mod {config.get('eval_step', '2')})")
26 | plt.ylabel("reward")
27 | plt.savefig(f"{config['plot_path']}{config['experiment_name']}.png")
--------------------------------------------------------------------------------
/plots/algo_code.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Howuhh/evolution_strategies_openai/8e9c369b5df94a4afeb6773f686fca1298a69285/plots/algo_code.png
--------------------------------------------------------------------------------
/plots/gifs/best_bipedal_walker.gif:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Howuhh/evolution_strategies_openai/8e9c369b5df94a4afeb6773f686fca1298a69285/plots/gifs/best_bipedal_walker.gif
--------------------------------------------------------------------------------
/plots/gifs/best_lunar.gif:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Howuhh/evolution_strategies_openai/8e9c369b5df94a4afeb6773f686fca1298a69285/plots/gifs/best_lunar.gif
--------------------------------------------------------------------------------
/plots/gifs/best_lunar_cont.gif:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Howuhh/evolution_strategies_openai/8e9c369b5df94a4afeb6773f686fca1298a69285/plots/gifs/best_lunar_cont.gif
--------------------------------------------------------------------------------
/plots/gifs/best_mountain_car_cont.gif:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Howuhh/evolution_strategies_openai/8e9c369b5df94a4afeb6773f686fca1298a69285/plots/gifs/best_mountain_car_cont.gif
--------------------------------------------------------------------------------
/plots/gifs/best_pole.gif:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Howuhh/evolution_strategies_openai/8e9c369b5df94a4afeb6773f686fca1298a69285/plots/gifs/best_pole.gif
--------------------------------------------------------------------------------
/plots/test_BipedalWalker_v1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Howuhh/evolution_strategies_openai/8e9c369b5df94a4afeb6773f686fca1298a69285/plots/test_BipedalWalker_v1.png
--------------------------------------------------------------------------------
/plots/test_BipedalWalker_v2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Howuhh/evolution_strategies_openai/8e9c369b5df94a4afeb6773f686fca1298a69285/plots/test_BipedalWalker_v2.png
--------------------------------------------------------------------------------
/plots/test_BipedalWalker_v3.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Howuhh/evolution_strategies_openai/8e9c369b5df94a4afeb6773f686fca1298a69285/plots/test_BipedalWalker_v3.png
--------------------------------------------------------------------------------
/plots/test_BipedalWalker_v4.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Howuhh/evolution_strategies_openai/8e9c369b5df94a4afeb6773f686fca1298a69285/plots/test_BipedalWalker_v4.png
--------------------------------------------------------------------------------
/plots/test_BipedalWalker_v5.0.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Howuhh/evolution_strategies_openai/8e9c369b5df94a4afeb6773f686fca1298a69285/plots/test_BipedalWalker_v5.0.png
--------------------------------------------------------------------------------
/plots/test_BipedalWalker_v5.1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Howuhh/evolution_strategies_openai/8e9c369b5df94a4afeb6773f686fca1298a69285/plots/test_BipedalWalker_v5.1.png
--------------------------------------------------------------------------------
/plots/test_BipedalWalker_v5.2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Howuhh/evolution_strategies_openai/8e9c369b5df94a4afeb6773f686fca1298a69285/plots/test_BipedalWalker_v5.2.png
--------------------------------------------------------------------------------
/plots/test_BipedalWalker_v5.3.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Howuhh/evolution_strategies_openai/8e9c369b5df94a4afeb6773f686fca1298a69285/plots/test_BipedalWalker_v5.3.png
--------------------------------------------------------------------------------
/plots/test_BipedalWalker_v6.0.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Howuhh/evolution_strategies_openai/8e9c369b5df94a4afeb6773f686fca1298a69285/plots/test_BipedalWalker_v6.0.png
--------------------------------------------------------------------------------
/plots/test_BipedalWalker_v6.1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Howuhh/evolution_strategies_openai/8e9c369b5df94a4afeb6773f686fca1298a69285/plots/test_BipedalWalker_v6.1.png
--------------------------------------------------------------------------------
/plots/test_BipedalWalker_v6.2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Howuhh/evolution_strategies_openai/8e9c369b5df94a4afeb6773f686fca1298a69285/plots/test_BipedalWalker_v6.2.png
--------------------------------------------------------------------------------
/plots/test_CartPole_v1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Howuhh/evolution_strategies_openai/8e9c369b5df94a4afeb6773f686fca1298a69285/plots/test_CartPole_v1.png
--------------------------------------------------------------------------------
/plots/test_LunarLanderCont_v1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Howuhh/evolution_strategies_openai/8e9c369b5df94a4afeb6773f686fca1298a69285/plots/test_LunarLanderCont_v1.png
--------------------------------------------------------------------------------
/plots/test_LunarLander_v1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Howuhh/evolution_strategies_openai/8e9c369b5df94a4afeb6773f686fca1298a69285/plots/test_LunarLander_v1.png
--------------------------------------------------------------------------------
/plots/test_LunarLander_v3.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Howuhh/evolution_strategies_openai/8e9c369b5df94a4afeb6773f686fca1298a69285/plots/test_LunarLander_v3.png
--------------------------------------------------------------------------------
/plots/test_MountainCarCont_v1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Howuhh/evolution_strategies_openai/8e9c369b5df94a4afeb6773f686fca1298a69285/plots/test_MountainCarCont_v1.png
--------------------------------------------------------------------------------
/tests/bipedal_walker.py:
--------------------------------------------------------------------------------
1 | import sys
2 | import pickle
3 |
4 | sys.path.append('..')
5 |
6 | from training import run_experiment
7 |
8 | # the task is considered solved when the average score is 300+ over 100 consecutive random trials.
9 | def test():
10 | test_config = {
11 | "experiment_name": "test_BipedalWalker_v6.2",
12 | "plot_path": "../plots/",
13 | "model_path": "../models/",
14 | "log_path": "../logs/",
15 | "init_model": "../models/test_BipedalWalker_v6.1.pkl",
16 | "env": "BipedalWalker-v3",
17 | "n_sessions": 250,
18 | "env_steps": 1300,
19 | "population_size": 128,
20 | "learning_rate": 0.065,
21 | "noise_std": 0.07783,
22 | "noise_decay": 0.995,
23 | "decay_step": 20,
24 | "eval_step": 10,
25 | "hidden_sizes": (64, 40) # sizes from https://designrl.github.io/
26 | }
27 |
28 | policy = run_experiment(test_config, n_jobs=4)
29 |
30 |
31 | if __name__ == "__main__":
32 | test()
--------------------------------------------------------------------------------
/tests/cart_pole.py:
--------------------------------------------------------------------------------
1 | import sys
2 | import pickle
3 |
4 | sys.path.append('..')
5 | from training import run_experiment
6 |
7 |
8 | def test():
9 | test_config = {
10 | "experiment_name": "test_CartPole_v2",
11 | "plot_path": "../plots/",
12 | "model_path": "../models/",
13 | "env": "CartPole-v0",
14 | "n_sessions": 64,
15 | "env_steps": 200,
16 | "population_size": 256,
17 | "learning_rate": 0.01,
18 | "noise_std": 0.05,
19 | "hidden_sizes": (64, 64)
20 | }
21 | policy = run_experiment(test_config)
22 |
23 | # TODO: not easy, need a change
24 | with open(f"{test_config['model_path']}{test_config['experiment_name']}.pkl", "wb") as file:
25 | pickle.dump(policy, file)
26 |
27 |
28 | if __name__ == "__main__":
29 | test()
--------------------------------------------------------------------------------
/tests/lunar_lander.py:
--------------------------------------------------------------------------------
1 | import sys
2 | import pickle
3 |
4 | sys.path.append('..')
5 |
6 | from training import run_experiment
7 |
8 | # TODO: next is LunarLanderContinuous-v2
9 | def test():
10 | test_config = {
11 | "experiment_name": "test_LunarLander_v4",
12 | "plot_path": "../plots/",
13 | "model_path": "../models/",
14 | "env": "LunarLander-v2",
15 | "n_sessions": 512,
16 | "env_steps": 500,
17 | "population_size": 256,
18 | "learning_rate": 0.01,
19 | "noise_std": 0.075,
20 | "hidden_sizes": (64, 64)
21 | }
22 |
23 | policy = run_experiment(test_config)
24 |
25 | with open(f"{test_config['model_path']}{test_config['experiment_name']}.pkl", "wb") as file:
26 | pickle.dump(policy, file)
27 |
28 |
29 | if __name__ == "__main__":
30 | test()
--------------------------------------------------------------------------------
/tests/lunar_lander_cont.py:
--------------------------------------------------------------------------------
1 | import sys
2 | import pickle
3 |
4 | sys.path.append('..')
5 | from training import run_experiment
6 |
7 |
8 | # TODO: parallel & continuous
9 | def test():
10 | test_config = {
11 | "experiment_name": "test_LunarLanderCont_v2",
12 | "plot_path": "../plots/",
13 | "model_path": "../models/",
14 | "env": "LunarLanderContinuous-v2",
15 | "n_sessions": 512,
16 | "env_steps": 500,
17 | "population_size": 256,
18 | "learning_rate": 0.01,
19 | "noise_std": 0.075,
20 | "hidden_sizes": (64, 64)
21 | }
22 | policy = run_experiment(test_config)
23 |
24 | with open(f"{test_config['model_path']}{test_config['experiment_name']}.pkl", "wb") as file:
25 | pickle.dump(policy, file)
26 |
27 |
28 | if __name__ == "__main__":
29 | test()
30 |
--------------------------------------------------------------------------------
/tests/mountain_car_cont.py:
--------------------------------------------------------------------------------
1 | import sys
2 | import pickle
3 |
4 | sys.path.append('..')
5 |
6 | from training import run_experiment
7 |
8 | # MountainCar-v0 defines "solving" as getting average reward of -110.0 over 100 consecutive trials.
9 | # TODO: wait for novelty search
10 | def test():
11 | test_config = {
12 | "experiment_name": "test_MountainCarCont_v2",
13 | "plot_path": "../plots/",
14 | "model_path": "../models/",
15 | "env": "MountainCarContinuous-v0",
16 | "n_sessions": 128,
17 | "env_steps": 200,
18 | "population_size": 256,
19 | "learning_rate": 0.1,
20 | "noise_std": 0.5,
21 | "hidden_sizes": (32, 32)
22 | }
23 |
24 | policy = run_experiment(test_config, n_jobs=4)
25 |
26 | with open(f"{test_config['model_path']}{test_config['experiment_name']}.pkl", "wb") as file:
27 | pickle.dump(policy, file)
28 |
29 |
30 | if __name__ == "__main__":
31 | test()
32 |
--------------------------------------------------------------------------------
/training.py:
--------------------------------------------------------------------------------
1 | import gym
2 | import pickle
3 | import uuid
4 |
5 | import numpy as np
6 |
7 | from tqdm import tqdm
8 | from joblib import Parallel
9 | from collections import defaultdict
10 |
11 | from gym import wrappers
12 |
13 | from linear import ThreeLayerNetwork
14 | from es import OpenAiES
15 | from plot import plot_rewards
16 | from evaluation import eval_policy_delayed, eval_policy
17 |
18 | # env: (n_states, n_actions)
19 | ENV_INFO = {
20 | "CartPole-v0": (4, 2),
21 | "LunarLander-v2": (8, 4),
22 | "LunarLanderContinuous-v2": (8, 2),
23 | "MountainCar-v0": (2, 3),
24 | "MountainCarContinuous-v0": (2, 1),
25 | "CarRacing-v0": (96*96*3, 3), # TODO: wrap env to prep pixels & discrete actions
26 | "BipedalWalker-v3": (24, 4)
27 | }
28 |
29 |
30 | def train_loop(policy, env, config, n_jobs=1, verbose=True):
31 | es = OpenAiES(
32 | model=policy,
33 | learning_rate=config["learning_rate"],
34 | noise_std=config["noise_std"],
35 | noise_decay=config.get("noise_decay", 1.0),
36 | lr_decay=config.get("lr_decay", 1.0),
37 | decay_step=config.get("decay_step", 50)
38 | )
39 |
40 | log = defaultdict(list)
41 | for session in tqdm(range(config["n_sessions"])):
42 | population = es.generate_population(config["population_size"])
43 |
44 | rewards_jobs = (eval_policy_delayed(new_policy, env, config["env_steps"]) for new_policy in population)
45 | rewards = np.array(Parallel(n_jobs=n_jobs)(rewards_jobs))
46 |
47 | es.update_population(rewards)
48 |
49 | # populations stats
50 | log["pop_mean_rewards"].append(np.mean(rewards))
51 | log["pop_std_rewards"].append(np.std(rewards))
52 |
53 | # best policy stats
54 | if session % config.get("eval_step", 2) == 0:
55 | best_policy = es.get_model()
56 |
57 | best_rewards = np.zeros(10)
58 | for i in range(10):
59 | best_rewards[i] = eval_policy(best_policy, env, config["env_steps"])
60 |
61 | if verbose:
62 | # TODO: add timestamp
63 | print(f"Session: {session}")
64 | print(f"Mean reward: {round(np.mean(rewards), 4)}", f"std: {round(np.std(rewards), 3)}")
65 | print(f"lr: {round(es.lr, 5)}, noise_std: {round(es.noise_std, 5)}")
66 |
67 | log["best_mean_rewards"].append(np.mean(best_rewards))
68 | log["best_std_rewards"].append(np.std(best_rewards))
69 |
70 | return log
71 |
72 |
73 | def run_experiment(config, n_jobs=4, verbose=True):
74 | env = gym.make(config["env"])
75 | env._env_name = env.spec._env_name
76 |
77 | n_states, n_actions = ENV_INFO[config["env"]]
78 |
79 | if config.get("init_model", None):
80 | policy = ThreeLayerNetwork.from_model(config["init_model"])
81 |
82 | assert policy.in_features == n_states, "not correct policy input dims"
83 | assert policy.out_features == n_actions, "not correct policy output dims"
84 | else:
85 | policy = ThreeLayerNetwork(
86 | in_features=n_states,
87 | out_features=n_actions,
88 | hidden_sizes=config["hidden_sizes"]
89 | )
90 | # TODO: save model on KeyboardInterrupt exception
91 | log = train_loop(policy, env, config, n_jobs, verbose)
92 |
93 | if config.get("log_path", None):
94 | with open(f"{config['log_path']}{config['experiment_name']}.pkl", "wb") as file:
95 | pickle.dump(log, file)
96 |
97 | if config.get("model_path", None):
98 | with open(f"{config['model_path']}{config['experiment_name']}.pkl", "wb") as file:
99 | pickle.dump(policy, file)
100 |
101 | plot_rewards(log["best_mean_rewards"], log["best_std_rewards"], config)
102 |
103 | return policy
104 |
105 |
106 | def render_policy(model_path, env_name, n_videos=1):
107 | with open(model_path, "rb") as file:
108 | policy = pickle.load(file)
109 |
110 | model_name = model_path.split("/")[-1].split(".")[0]
111 |
112 | for i in range(n_videos):
113 | env = gym.make(env_name)
114 | env = wrappers.Monitor(env, f'videos/{model_name}/' + str(uuid.uuid4()), force=True)
115 |
116 | print(eval_policy(policy, env, n_steps=1600))
117 | env.close()
118 |
119 |
120 | if __name__ == "__main__":
121 | # TODO: analyse population stat from logs
122 | render_policy("models/test_BipedalWalker_v6.1.pkl", "BipedalWalker-v3")
123 |
124 |
--------------------------------------------------------------------------------