├── LICENSE ├── README.md ├── a3c ├── README.md ├── play.py ├── resources │ ├── average-scores.png │ └── sample-game.gif ├── sample-weights │ └── model-Breakout-v0-91750000.h5 └── train.py ├── q-learning-1-step ├── README.md ├── play.py ├── resources │ ├── after-12h-training.gif │ ├── after-18h-training.gif │ └── after-6h-training.gif ├── sample-weights │ ├── model-12h.h5 │ ├── model-18h.h5 │ └── model-6h.h5 └── train.py └── q-learning-n-step ├── README.md ├── play.py └── train.py /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2018 Grzegorz Opoka 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | ### Variation of Asynchronous RL in Keras (Theano backend) + OpenAI gym [1-step Q-learning, n-step Q-learning, A3C] 2 | This is a simple variation of [asynchronous reinforcement learning](http://arxiv.org/pdf/1602.01783v1.pdf) written in Python with Keras (Theano backend). Instead of many threads training at the same time, there are many processes generating experience for a single agent to learn from. 3 | 4 | ### Explanation 5 | Several processes (tested with 4; more should work even better for the Q-learning methods) generate experience and send it to a shared queue. The queue is bounded in length (tested with 256) to stop individual processes from generating too much experience with outdated weights. The learning process draws samples from the queue in batches and trains on them. In A3C the network weights are swapped relatively often to keep them up to date.
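In code, this boils down to a bounded `multiprocessing` queue shared between a few actor processes and a single learner process. Below is a minimal, self-contained sketch of just that pattern -- the agent, environment and training step are replaced by dummy stand-ins and the constants are only illustrative; the real implementations live in each method's `train.py`:

```python
import random
import time
from multiprocessing import Manager, Pool

QUEUE_SIZE = 256   # bounded, so actors cannot run far ahead of the learner
BATCH_SIZE = 20
ACTORS = 4


def actor_proc(mem_queue, weight_dict):
    # Each actor plays with the most recently published weights and pushes
    # experience tuples into the shared queue (put() blocks when it is full).
    while True:
        weights = weight_dict.get('weights', 0)
        dummy_experience = (random.random(), random.randrange(4), weights)
        mem_queue.put(dummy_experience)


def learner_proc(mem_queue, weight_dict):
    # The single learner pulls samples in batches, "trains", and publishes
    # updated weights for the actors to pick up.
    step = 0
    while True:
        batch = [mem_queue.get() for _ in range(BATCH_SIZE)]
        time.sleep(0.01)               # stand-in for one gradient step
        step += 1
        weight_dict['weights'] = step  # stand-in for updated network weights
        print('trained on batch %d (%d samples)' % (step, len(batch)))


if __name__ == '__main__':
    manager = Manager()
    weight_dict = manager.dict()
    mem_queue = manager.Queue(QUEUE_SIZE)

    pool = Pool(ACTORS + 1)
    try:
        for _ in range(ACTORS):
            pool.apply_async(actor_proc, (mem_queue, weight_dict))
        pool.apply_async(learner_proc, (mem_queue, weight_dict))
        pool.close()
        pool.join()
    except KeyboardInterrupt:
        pool.terminate()
        pool.join()
```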
6 | 7 | ### Currently implemented and working methods 8 | * [1-step Q-learning](https://github.com/Grzego/async-rl/tree/master/q-learning-1-step) 9 | * [n-step Q-learning](https://github.com/Grzego/async-rl/tree/master/q-learning-n-step) 10 | * [A3C](https://github.com/Grzego/async-rl/tree/master/a3c) 11 | 12 | ### Requirements 13 | * [Python 3.4/Python 3.5](https://www.python.org/downloads/) 14 | * [Keras](http://keras.io/) 15 | * [Theano](http://deeplearning.net/software/theano/) ([Tensorflow](https://www.tensorflow.org/) would probably work too) 16 | * [OpenAI Gym (atari-py)](https://gym.openai.com/) 17 | * `pip3 install scikit-image h5py scipy` 18 | 19 | ### Sample game (A3C) 20 | ![](https://github.com/Grzego/async-rl/blob/master/a3c/resources/sample-game.gif?raw=true) 21 | 22 | #### Feedback 23 | Because I'm a newbie in Reinforcement Learning and Deep Learning, feedback is very welcome :) 24 | 25 | ### Note 26 | * Weights were trained with the Theano backend, so loading them in Tensorflow may be a little problematic due to the different convolutional kernel conventions. 27 | * If training halts after a few seconds, don't worry; it's probably because Keras lazily compiles the Theano functions. It should resume quickly. 28 | * Each process sets its own Theano compilation directory, so compilation can take a very long time at the beginning (this can be disabled with `--th_comp_fix=False`). 29 | 30 | ### Useful resources 31 | * [Asynchronous RL in Tensorflow + Keras + OpenAI's Gym](https://github.com/coreylynch/async-rl) 32 | * [Replicating "Asynchronous Methods for Deep Reinforcement Learning"](https://github.com/muupan/async-rl) 33 | * [David Silver's "Deep Reinforcement Learning" lecture](http://videolectures.net/rldm2015_silver_reinforcement_learning/) 34 | * [Nervana's Demystifying Deep Reinforcement Learning blog post](http://www.nervanasys.com/demystifying-deep-reinforcement-learning/) 35 | * [Asynchronous Methods for Deep Reinforcement Learning](http://arxiv.org/pdf/1602.01783v1.pdf) 36 | * [Playing Atari with Deep Reinforcement Learning](http://arxiv.org/pdf/1312.5602v1.pdf) 37 | 38 | -------------------------------------------------------------------------------- /a3c/README.md: -------------------------------------------------------------------------------- 1 | #### Usage 2 | 3 | To start training simply type: 4 | ``` 5 | python train.py --game=Breakout-v0 --processes=16 6 | ``` 7 | 8 | To resume training from saved model (ex. `model-Breakout-v0-1250000.h5`): 9 | ``` 10 | python train.py --game=Breakout-v0 --processes=16 --checkpoint=1250000 11 | ``` 12 | 13 | To see how it plays: 14 | ``` 15 | python play.py --model=model-file.h5 --game=Breakout-v0 16 | ``` 17 | 18 | ### Results 19 | 20 | This method works really well. The graph below shows the average score over 10 games, evaluated every 1 million frames. Training took about 24 hours; I was able to process ~57k frames per minute. Final weights can be found in the `sample-weights` folder.
21 | 22 | ![](https://github.com/Grzego/async-rl/blob/master/a3c/resources/average-scores.png?raw=true) 23 | 24 | ### Sample game 25 | 26 | ![](https://github.com/Grzego/async-rl/blob/master/a3c/resources/sample-game.gif?raw=true) -------------------------------------------------------------------------------- /a3c/play.py: -------------------------------------------------------------------------------- 1 | from keras.models import * 2 | from keras.layers import * 3 | from keras.optimizers import RMSprop 4 | import gym 5 | from scipy.misc import imresize 6 | from skimage.color import rgb2gray 7 | import numpy as np 8 | import argparse 9 | 10 | 11 | def build_network(input_shape, output_shape): 12 | state = Input(shape=input_shape) 13 | h = Conv2D(16, kernel_size=(8, 8), strides=(4, 4), activation='relu', data_format='channels_first')(state) 14 | h = Conv2D(32, kernel_size=(4, 4), strides=(2, 2), activation='relu', data_format='channels_first')(h) 15 | h = Flatten()(h) 16 | h = Dense(256, activation='relu')(h) 17 | 18 | value = Dense(1, activation='linear')(h) 19 | policy = Dense(output_shape, activation='softmax')(h) 20 | 21 | value_network = Model(inputs=state, outputs=value) 22 | policy_network = Model(inputs=state, outputs=policy) 23 | 24 | adventage = Input(shape=(1,)) 25 | train_network = Model(inputs=state, outputs=[value, policy]) 26 | 27 | return value_network, policy_network, train_network, adventage 28 | 29 | 30 | class ActingAgent(object): 31 | def __init__(self, action_space, screen=(84, 84)): 32 | self.screen = screen 33 | self.input_depth = 1 34 | self.past_range = 3 35 | self.replay_size = 32 36 | self.observation_shape = (self.input_depth * self.past_range,) + self.screen 37 | 38 | _, self.policy, self.load_net, _ = build_network(self.observation_shape, action_space.n) 39 | 40 | self.load_net.compile(optimizer=RMSprop(clipnorm=1.), loss='mse') # clipnorm=1. 41 | 42 | self.action_space = action_space 43 | self.observations = np.zeros((self.input_depth * self.past_range,) + screen) 44 | 45 | def init_episode(self, observation): 46 | for _ in range(self.past_range): 47 | self.save_observation(observation) 48 | 49 | def choose_action(self, observation): 50 | self.save_observation(observation) 51 | policy = self.policy.predict(self.observations[None, ...])[0] 52 | policy /= np.sum(policy) # renormalize; float32 softmax output may not sum exactly to 1 53 | return np.random.choice(np.arange(self.action_space.n), p=policy) 54 | 55 | def save_observation(self, observation): 56 | self.observations = np.roll(self.observations, -self.input_depth, axis=0) 57 | self.observations[-self.input_depth:, ...] = self.transform_screen(observation) 58 | 59 | def transform_screen(self, data): 60 | return rgb2gray(imresize(data, self.screen))[None, ...]
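# A note on the ActingAgent above: the network input is a rolling stack of the
# last `past_range` (3) frames, each resized to 84x84 and converted to
# grayscale in `transform_screen`, and `choose_action` samples from the softmax
# policy output (np.random.choice with p=policy) rather than taking the argmax,
# so evaluation play stays stochastic.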
61 | 62 | 63 | parser = argparse.ArgumentParser(description='Evaluation of model') 64 | parser.add_argument('--game', default='Breakout-v0', help='Name of openai gym environment', dest='game') 65 | parser.add_argument('--evaldir', default=None, help='Directory to save evaluation', dest='evaldir') 66 | parser.add_argument('--model', help='File with weights for model', dest='model') 67 | 68 | 69 | def main(): 70 | args = parser.parse_args() 71 | # ----- 72 | env = gym.make(args.game) 73 | if args.evaldir: 74 | env.monitor.start(args.evaldir) 75 | # ----- 76 | agent = ActingAgent(env.action_space) 77 | 78 | model_file = args.model 79 | 80 | agent.load_net.load_weights(model_file) 81 | 82 | game = 1 83 | for _ in range(10): 84 | done = False 85 | episode_reward = 0 86 | noops = 0 87 | 88 | # init game 89 | observation = env.reset() 90 | agent.init_episode(observation) 91 | # play one game 92 | print('Game #%8d; ' % (game,), end='') 93 | while not done: 94 | env.render() 95 | action = agent.choose_action(observation) 96 | observation, reward, done, _ = env.step(action) 97 | episode_reward += reward 98 | # ---- 99 | if action == 0: 100 | noops += 1 101 | else: 102 | noops = 0 103 | if noops > 100: 104 | break 105 | print('Reward %4d; ' % (episode_reward,)) 106 | game += 1 107 | # ----- 108 | if args.evaldir: 109 | env.monitor.close() 110 | 111 | 112 | if __name__ == "__main__": 113 | main() 114 | -------------------------------------------------------------------------------- /a3c/resources/average-scores.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Grzego/async-rl/b2b31f4c2d170531fabf7ddc6bc6f6c8e4e4ae31/a3c/resources/average-scores.png -------------------------------------------------------------------------------- /a3c/resources/sample-game.gif: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Grzego/async-rl/b2b31f4c2d170531fabf7ddc6bc6f6c8e4e4ae31/a3c/resources/sample-game.gif -------------------------------------------------------------------------------- /a3c/sample-weights/model-Breakout-v0-91750000.h5: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Grzego/async-rl/b2b31f4c2d170531fabf7ddc6bc6f6c8e4e4ae31/a3c/sample-weights/model-Breakout-v0-91750000.h5 -------------------------------------------------------------------------------- /a3c/train.py: -------------------------------------------------------------------------------- 1 | from scipy.misc import imresize 2 | from skimage.color import rgb2gray 3 | from multiprocessing import * 4 | from collections import deque 5 | import gym 6 | import numpy as np 7 | import h5py 8 | import argparse 9 | 10 | # ----- 11 | parser = argparse.ArgumentParser(description='Training model') 12 | parser.add_argument('--game', default='Breakout-v0', help='OpenAI gym environment name', dest='game', type=str) 13 | parser.add_argument('--processes', default=4, help='Number of processes that generate experience for agent', 14 | dest='processes', type=int) 15 | parser.add_argument('--lr', default=0.001, help='Learning rate', dest='learning_rate', type=float) 16 | parser.add_argument('--steps', default=80000000, help='Number of frames to decay learning rate', dest='steps', type=int) 17 | parser.add_argument('--batch_size', default=20, help='Batch size to use during training', dest='batch_size', type=int) 18 | parser.add_argument('--swap_freq', default=100, 
help='Number of frames before swapping network weights', 19 | dest='swap_freq', type=int) 20 | parser.add_argument('--checkpoint', default=0, help='Frame to resume training', dest='checkpoint', type=int) 21 | parser.add_argument('--save_freq', default=250000, help='Number of frames before saving weights', dest='save_freq', 22 | type=int) 23 | parser.add_argument('--queue_size', default=256, help='Size of queue holding agent experience', dest='queue_size', 24 | type=int) 25 | parser.add_argument('--n_step', default=5, help='Number of steps', dest='n_step', type=int) 26 | parser.add_argument('--reward_scale', default=1., dest='reward_scale', type=float) 27 | parser.add_argument('--beta', default=0.01, dest='beta', type=float) 28 | # ----- 29 | args = parser.parse_args() 30 | 31 | 32 | # ----- 33 | 34 | 35 | def build_network(input_shape, output_shape): 36 | from keras.models import Model 37 | from keras.layers import Input, Conv2D, Flatten, Dense 38 | # ----- 39 | state = Input(shape=input_shape) 40 | h = Conv2D(16, kernel_size=(8, 8), strides=(4, 4), activation='relu', data_format='channels_first')(state) 41 | h = Conv2D(32, kernel_size=(4, 4), strides=(2, 2), activation='relu', data_format='channels_first')(h) 42 | h = Flatten()(h) 43 | h = Dense(256, activation='relu')(h) 44 | 45 | value = Dense(1, activation='linear', name='value')(h) 46 | policy = Dense(output_shape, activation='softmax', name='policy')(h) 47 | 48 | value_network = Model(inputs=state, outputs=value) 49 | policy_network = Model(inputs=state, outputs=policy) 50 | 51 | adventage = Input(shape=(1,)) 52 | train_network = Model(inputs=[state, adventage], outputs=[value, policy]) 53 | 54 | return value_network, policy_network, train_network, adventage 55 | 56 | 57 | def policy_loss(adventage=0., beta=0.01): 58 | from keras import backend as K 59 | 60 | def loss(y_true, y_pred): 61 | return -K.sum(K.log(K.sum(y_true * y_pred, axis=-1) + K.epsilon()) * K.flatten(adventage)) + \ 62 | beta * K.sum(y_pred * K.log(y_pred + K.epsilon())) 63 | 64 | return loss 65 | 66 | 67 | def value_loss(): 68 | from keras import backend as K 69 | 70 | def loss(y_true, y_pred): 71 | return 0.5 * K.sum(K.square(y_true - y_pred)) 72 | 73 | return loss 74 | 75 | 76 | # ----- 77 | 78 | class LearningAgent(object): 79 | def __init__(self, action_space, batch_size=32, screen=(84, 84), swap_freq=200): 80 | from keras.optimizers import RMSprop 81 | # ----- 82 | self.screen = screen 83 | self.input_depth = 1 84 | self.past_range = 3 85 | self.observation_shape = (self.input_depth * self.past_range,) + self.screen 86 | self.batch_size = batch_size 87 | 88 | _, _, self.train_net, adventage = build_network(self.observation_shape, action_space.n) 89 | 90 | self.train_net.compile(optimizer=RMSprop(epsilon=0.1, rho=0.99), 91 | loss=[value_loss(), policy_loss(adventage, args.beta)]) 92 | 93 | self.pol_loss = deque(maxlen=25) 94 | self.val_loss = deque(maxlen=25) 95 | self.values = deque(maxlen=25) 96 | self.entropy = deque(maxlen=25) 97 | self.swap_freq = swap_freq 98 | self.swap_counter = self.swap_freq 99 | self.unroll = np.arange(self.batch_size) 100 | self.targets = np.zeros((self.batch_size, action_space.n)) 101 | self.counter = 0 102 | 103 | def learn(self, last_observations, actions, rewards, learning_rate=0.001): 104 | import keras.backend as K 105 | K.set_value(self.train_net.optimizer.lr, learning_rate) 106 | frames = len(last_observations) 107 | self.counter += frames 108 | # ----- 109 | values, policy = self.train_net.predict([last_observations, 
self.unroll]) 110 | # ----- 111 | self.targets.fill(0.) 112 | adventage = rewards - values.flatten() 113 | self.targets[self.unroll, actions] = 1. 114 | # ----- 115 | loss = self.train_net.train_on_batch([last_observations, adventage], [rewards, self.targets]) 116 | entropy = np.mean(-policy * np.log(policy + 0.00000001)) 117 | self.pol_loss.append(loss[2]) 118 | self.val_loss.append(loss[1]) 119 | self.entropy.append(entropy) 120 | self.values.append(np.mean(values)) 121 | min_val, max_val, avg_val = min(self.values), max(self.values), np.mean(self.values) 122 | print('\rFrames: %8d; Policy-Loss: %10.6f; Avg: %10.6f ' 123 | '--- Value-Loss: %10.6f; Avg: %10.6f ' 124 | '--- Entropy: %7.6f; Avg: %7.6f ' 125 | '--- V-value; Min: %6.3f; Max: %6.3f; Avg: %6.3f' % ( 126 | self.counter, 127 | loss[2], np.mean(self.pol_loss), 128 | loss[1], np.mean(self.val_loss), 129 | entropy, np.mean(self.entropy), 130 | min_val, max_val, avg_val), end='') 131 | # ----- 132 | self.swap_counter -= frames 133 | if self.swap_counter < 0: 134 | self.swap_counter += self.swap_freq 135 | return True 136 | return False 137 | 138 | 139 | def learn_proc(mem_queue, weight_dict): 140 | import os 141 | pid = os.getpid() 142 | os.environ['THEANO_FLAGS'] = 'floatX=float32,device=gpu,nvcc.fastmath=False,lib.cnmem=0.3,' + \ 143 | 'compiledir=th_comp_learn' 144 | # ----- 145 | print(' %5d> Learning process' % (pid,)) 146 | # ----- 147 | save_freq = args.save_freq 148 | learning_rate = args.learning_rate 149 | batch_size = args.batch_size 150 | checkpoint = args.checkpoint 151 | steps = args.steps 152 | # ----- 153 | env = gym.make(args.game) 154 | agent = LearningAgent(env.action_space, batch_size=args.batch_size, swap_freq=args.swap_freq) 155 | # ----- 156 | if checkpoint > 0: 157 | print(' %5d> Loading weights from file' % (pid,)) 158 | agent.train_net.load_weights('model-%s-%d.h5' % (args.game, checkpoint,)) 159 | # ----- 160 | print(' %5d> Setting weights in dict' % (pid,)) 161 | weight_dict['update'] = 0 162 | weight_dict['weights'] = agent.train_net.get_weights() 163 | # ----- 164 | last_obs = np.zeros((batch_size,) + agent.observation_shape) 165 | actions = np.zeros(batch_size, dtype=np.int32) 166 | rewards = np.zeros(batch_size) 167 | # ----- 168 | idx = 0 169 | agent.counter = checkpoint 170 | save_counter = checkpoint % save_freq + save_freq 171 | while True: 172 | # ----- 173 | last_obs[idx, ...], actions[idx], rewards[idx] = mem_queue.get() 174 | idx = (idx + 1) % batch_size 175 | if idx == 0: 176 | lr = max(0.00000001, (steps - agent.counter) / steps * learning_rate) 177 | updated = agent.learn(last_obs, actions, rewards, learning_rate=lr) 178 | if updated: 179 | # print(' %5d> Updating weights in dict' % (pid,)) 180 | weight_dict['weights'] = agent.train_net.get_weights() 181 | weight_dict['update'] += 1 182 | # ----- 183 | save_counter -= 1 184 | if save_counter < 0: 185 | save_counter += save_freq 186 | agent.train_net.save_weights('model-%s-%d.h5' % (args.game, agent.counter,), overwrite=True) 187 | 188 | 189 | class ActingAgent(object): 190 | def __init__(self, action_space, screen=(84, 84), n_step=8, discount=0.99): 191 | self.screen = screen 192 | self.input_depth = 1 193 | self.past_range = 3 194 | self.observation_shape = (self.input_depth * self.past_range,) + self.screen 195 | 196 | self.value_net, self.policy_net, self.load_net, adv = build_network(self.observation_shape, action_space.n) 197 | 198 | self.value_net.compile(optimizer='rmsprop', loss='mse') 199 | 
self.policy_net.compile(optimizer='rmsprop', loss='categorical_crossentropy') 200 | self.load_net.compile(optimizer='rmsprop', loss='mse', loss_weights=[0.5, 1.]) # dummy loss 201 | 202 | self.action_space = action_space 203 | self.observations = np.zeros(self.observation_shape) 204 | self.last_observations = np.zeros_like(self.observations) 205 | # ----- 206 | self.n_step_observations = deque(maxlen=n_step) 207 | self.n_step_actions = deque(maxlen=n_step) 208 | self.n_step_rewards = deque(maxlen=n_step) 209 | self.n_step = n_step 210 | self.discount = discount 211 | self.counter = 0 212 | 213 | def init_episode(self, observation): 214 | for _ in range(self.past_range): 215 | self.save_observation(observation) 216 | 217 | def reset(self): 218 | self.counter = 0 219 | self.n_step_observations.clear() 220 | self.n_step_actions.clear() 221 | self.n_step_rewards.clear() 222 | 223 | def sars_data(self, action, reward, observation, terminal, mem_queue): 224 | self.save_observation(observation) 225 | reward = np.clip(reward, -1., 1.) 226 | # reward /= args.reward_scale 227 | # ----- 228 | self.n_step_observations.appendleft(self.last_observations) 229 | self.n_step_actions.appendleft(action) 230 | self.n_step_rewards.appendleft(reward) 231 | # ----- 232 | self.counter += 1 233 | if terminal or self.counter >= self.n_step: 234 | r = 0. 235 | if not terminal: 236 | r = self.value_net.predict(self.observations[None, ...])[0] 237 | for i in range(self.counter): 238 | r = self.n_step_rewards[i] + self.discount * r 239 | mem_queue.put((self.n_step_observations[i], self.n_step_actions[i], r)) 240 | self.reset() 241 | 242 | def choose_action(self): 243 | policy = self.policy_net.predict(self.observations[None, ...])[0] 244 | return np.random.choice(np.arange(self.action_space.n), p=policy) 245 | 246 | def save_observation(self, observation): 247 | self.last_observations = self.observations[...] 248 | self.observations = np.roll(self.observations, -self.input_depth, axis=0) 249 | self.observations[-self.input_depth:, ...] = self.transform_screen(observation) 250 | 251 | def transform_screen(self, data): 252 | return rgb2gray(imresize(data, self.screen))[None, ...] 
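# How ActingAgent.sars_data (above) fills the experience queue: once `n_step`
# transitions have been collected (or the episode ends), discounted returns are
# computed backwards through the stored transitions as
#     R = reward + discount * R,
# starting from R = value_net(latest observation) when the segment did not end
# in a terminal state, and from R = 0 when it did. For example, with
# n_step = 3, discount = 0.99, rewards [0, 0, 1] (oldest first) and a bootstrap
# value of 0.5, the newest transition is stored with return 1 + 0.99 * 0.5 =
# 1.495, the middle one with 0.99 * 1.495 ~= 1.480, and the oldest with
# 0.99 * 1.480 ~= 1.465.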
253 | 254 | 255 | def generate_experience_proc(mem_queue, weight_dict, no): 256 | import os 257 | pid = os.getpid() 258 | os.environ['THEANO_FLAGS'] = 'floatX=float32,device=gpu,nvcc.fastmath=True,lib.cnmem=0,' + \ 259 | 'compiledir=th_comp_act_' + str(no) 260 | # ----- 261 | print(' %5d> Process started' % (pid,)) 262 | # ----- 263 | frames = 0 264 | batch_size = args.batch_size 265 | # ----- 266 | env = gym.make(args.game) 267 | agent = ActingAgent(env.action_space, n_step=args.n_step) 268 | 269 | if frames > 0: 270 | print(' %5d> Loaded weights from file' % (pid,)) 271 | agent.load_net.load_weights('model-%s-%d.h5' % (args.game, frames)) 272 | else: 273 | import time 274 | while 'weights' not in weight_dict: 275 | time.sleep(0.1) 276 | agent.load_net.set_weights(weight_dict['weights']) 277 | print(' %5d> Loaded weights from dict' % (pid,)) 278 | 279 | best_score = 0 280 | avg_score = deque([0], maxlen=25) 281 | 282 | last_update = 0 283 | while True: 284 | done = False 285 | episode_reward = 0 286 | op_last, op_count = 0, 0 287 | observation = env.reset() 288 | agent.init_episode(observation) 289 | 290 | # ----- 291 | while not done: 292 | frames += 1 293 | action = agent.choose_action() 294 | observation, reward, done, _ = env.step(action) 295 | episode_reward += reward 296 | best_score = max(best_score, episode_reward) 297 | # ----- 298 | agent.sars_data(action, reward, observation, done, mem_queue) 299 | # ----- 300 | op_count = 0 if op_last != action else op_count + 1 301 | done = done or op_count >= 100 302 | op_last = action 303 | # ----- 304 | if frames % 2000 == 0: 305 | print(' %5d> Best: %4d; Avg: %6.2f; Max: %4d' % ( 306 | pid, best_score, np.mean(avg_score), np.max(avg_score))) 307 | if frames % batch_size == 0: 308 | update = weight_dict.get('update', 0) 309 | if update > last_update: 310 | last_update = update 311 | # print(' %5d> Getting weights from dict' % (pid,)) 312 | agent.load_net.set_weights(weight_dict['weights']) 313 | # ----- 314 | avg_score.append(episode_reward) 315 | 316 | 317 | def init_worker(): 318 | import signal 319 | signal.signal(signal.SIGINT, signal.SIG_IGN) 320 | 321 | 322 | def main(): 323 | manager = Manager() 324 | weight_dict = manager.dict() 325 | mem_queue = manager.Queue(args.queue_size) 326 | 327 | pool = Pool(args.processes + 1, init_worker) 328 | 329 | try: 330 | for i in range(args.processes): 331 | pool.apply_async(generate_experience_proc, (mem_queue, weight_dict, i)) 332 | 333 | pool.apply_async(learn_proc, (mem_queue, weight_dict)) 334 | 335 | pool.close() 336 | pool.join() 337 | 338 | except KeyboardInterrupt: 339 | pool.terminate() 340 | pool.join() 341 | 342 | 343 | if __name__ == "__main__": 344 | main() 345 | -------------------------------------------------------------------------------- /q-learning-1-step/README.md: -------------------------------------------------------------------------------- 1 | ### Usage 2 | 3 | To start training simply type (I recommend running in terminal with maximum width, due to lots of output data): 4 | ``` 5 | python train.py --game=Breakout-v0 --processes=16 6 | ``` 7 | 8 | To resume training from saved model (ex. `model-1250000.h5`): 9 | ``` 10 | python train.py --game=Breakout-v0 --processes=16 --checkpoint=1250000 11 | ``` 12 | 13 | To see how it plays: 14 | ``` 15 | python play.py --model=sample-weights/model-18h.h5 --game=Breakout-v0 16 | ``` 17 | 18 | ### Samples (old version) 19 | I tested it once and it worked quite well. 
(Intel i7-4700MQ and NVidia GTX 765M) 20 | 21 | Sample games after 6h, 12h and 18h of training. 22 | 23 | ![](https://raw.githubusercontent.com/Grzego/async-rl/master/q-learning-1-step/resources/after-6h-training.gif?token=AFhQOQQq2JlswCS_p1XjU6WrKn3pQ4dvks5XbsV9wA%3D%3D) 24 | ![](https://raw.githubusercontent.com/Grzego/async-rl/master/q-learning-1-step/resources/after-12h-training.gif?token=AFhQOXkCZbPO9SrOXXu5_3_-P0ftrfSsks5XbsWiwA%3D%3D) 25 | ![](https://raw.githubusercontent.com/Grzego/async-rl/master/q-learning-1-step/resources/after-18h-training.gif?token=AFhQOR-kTbupToKnNRenZCWiBEtZBmvhks5XbsWjwA%3D%3D) -------------------------------------------------------------------------------- /q-learning-1-step/play.py: -------------------------------------------------------------------------------- 1 | import gym 2 | from scipy.misc import imresize 3 | from skimage.color import rgb2gray 4 | import numpy as np 5 | import argparse 6 | 7 | 8 | def build_network(input_shape, output_shape): 9 | from keras.models import Model 10 | from keras.layers import Input, Conv2D, Flatten, Dense 11 | 12 | x = Input(shape=input_shape) 13 | h = Conv2D(16, kernel_size=(8, 8), strides=(4, 4), activation='relu', data_format='channels_first')(x) 14 | h = Conv2D(32, kernel_size=(4, 4), strides=(2, 2), activation='relu', data_format='channels_first')(h) 15 | h = Flatten()(h) 16 | h = Dense(256, activation='relu')(h) 17 | v = Dense(output_shape, activation='linear')(h) 18 | return Model(inputs=x, outputs=v) 19 | 20 | 21 | class ActingAgent(object): 22 | def __init__(self, action_space, screen=(84, 84)): 23 | self.screen = screen 24 | self.input_depth = 1 25 | self.past_range = 3 26 | self.replay_size = 32 27 | self.observation_shape = (self.input_depth * self.past_range,) + self.screen 28 | 29 | self.action_value = build_network(self.observation_shape, action_space.n) 30 | self.action_value.compile(optimizer='rmsprop', loss='mse') 31 | 32 | self.action_space = action_space 33 | self.observations = np.zeros((self.input_depth * self.past_range,) + screen) 34 | 35 | def init_episode(self, observation): 36 | for _ in range(self.past_range): 37 | self.save_observation(observation) 38 | 39 | def choose_action(self, observation, epsilon=0.0): 40 | self.save_observation(observation) 41 | if np.random.random() < epsilon: 42 | return self.action_space.sample() 43 | else: 44 | return np.argmax(self.action_value.predict(self.observations[None, ...])) 45 | 46 | def save_observation(self, observation): 47 | self.observations = np.roll(self.observations, -self.input_depth, axis=0) 48 | self.observations[-self.input_depth:, ...] = self.transform_screen(observation) 49 | 50 | def transform_screen(self, data): 51 | return rgb2gray(imresize(data, self.screen))[None, ...] 
52 | 53 | 54 | parser = argparse.ArgumentParser(description='Evaluation of model') 55 | parser.add_argument('--game', default='Breakout-v0', help='Name of openai gym environment', dest='game') 56 | parser.add_argument('--evaldir', default=None, help='Directory to save evaluation', dest='evaldir') 57 | parser.add_argument('--model', help='File with weights for model', dest='model') 58 | parser.add_argument('--eps', default=0., help='Epsilon value', dest='eps', type=float) 59 | 60 | 61 | def main(): 62 | args = parser.parse_args() 63 | # ----- 64 | env = gym.make(args.game) 65 | if args.evaldir: 66 | env.monitor.start(args.evaldir) 67 | # ----- 68 | agent = ActingAgent(env.action_space) 69 | 70 | model_file = args.model 71 | epsilon = args.eps 72 | 73 | agent.action_value.load_weights(model_file) 74 | 75 | game = 1 76 | for _ in range(10): 77 | done = False 78 | episode_reward = 0 79 | noops = 0 80 | 81 | # init game 82 | observation = env.reset() 83 | agent.init_episode(observation) 84 | # play one game 85 | print('Game #%8d; ' % (game,), end='') 86 | while not done: 87 | env.render() 88 | action = agent.choose_action(observation, epsilon=epsilon) 89 | observation, reward, done, _ = env.step(action) 90 | episode_reward += reward 91 | # ---- 92 | if action == 0: 93 | noops += 1 94 | else: 95 | noops = 0 96 | if noops > 100: 97 | break 98 | print('Reward %4d; ' % (episode_reward,)) 99 | game += 1 100 | # ----- 101 | if args.evaldir: 102 | env.monitor.close() 103 | 104 | 105 | if __name__ == "__main__": 106 | main() 107 | -------------------------------------------------------------------------------- /q-learning-1-step/resources/after-12h-training.gif: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Grzego/async-rl/b2b31f4c2d170531fabf7ddc6bc6f6c8e4e4ae31/q-learning-1-step/resources/after-12h-training.gif -------------------------------------------------------------------------------- /q-learning-1-step/resources/after-18h-training.gif: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Grzego/async-rl/b2b31f4c2d170531fabf7ddc6bc6f6c8e4e4ae31/q-learning-1-step/resources/after-18h-training.gif -------------------------------------------------------------------------------- /q-learning-1-step/resources/after-6h-training.gif: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Grzego/async-rl/b2b31f4c2d170531fabf7ddc6bc6f6c8e4e4ae31/q-learning-1-step/resources/after-6h-training.gif -------------------------------------------------------------------------------- /q-learning-1-step/sample-weights/model-12h.h5: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Grzego/async-rl/b2b31f4c2d170531fabf7ddc6bc6f6c8e4e4ae31/q-learning-1-step/sample-weights/model-12h.h5 -------------------------------------------------------------------------------- /q-learning-1-step/sample-weights/model-18h.h5: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Grzego/async-rl/b2b31f4c2d170531fabf7ddc6bc6f6c8e4e4ae31/q-learning-1-step/sample-weights/model-18h.h5 -------------------------------------------------------------------------------- /q-learning-1-step/sample-weights/model-6h.h5: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/Grzego/async-rl/b2b31f4c2d170531fabf7ddc6bc6f6c8e4e4ae31/q-learning-1-step/sample-weights/model-6h.h5 -------------------------------------------------------------------------------- /q-learning-1-step/train.py: -------------------------------------------------------------------------------- 1 | from scipy.misc import imresize 2 | from skimage.color import rgb2gray 3 | from multiprocessing import * 4 | from collections import deque 5 | import queue 6 | import gym 7 | import numpy as np 8 | import argparse 9 | 10 | # ----- 11 | parser = argparse.ArgumentParser(description='Training model') 12 | parser.add_argument('--game', default='Breakout-v0', help='OpenAI gym environment name', dest='game', type=str) 13 | parser.add_argument('--processes', default=4, help='Number of processes that generate experience for agent', 14 | dest='processes', type=int) 15 | parser.add_argument('--lr', default=0.0001, help='Learning rate', dest='learning_rate', type=float) 16 | parser.add_argument('--batch_size', default=20, help='Batch size to use during training', dest='batch_size', type=int) 17 | parser.add_argument('--swap_freq', default=10000, help='Number of frames before swapping network weights', 18 | dest='swap_freq', type=int) 19 | parser.add_argument('--checkpoint', default=0, help='Iteration to resume training', dest='checkpoint', type=int) 20 | parser.add_argument('--save_freq', default=250000, help='Number of frame before saving weights', dest='save_freq', 21 | type=int) 22 | parser.add_argument('--eps_decay', default=4000000, 23 | help='Number of frames needed to decay epsilon to the lowest value', dest='eps_decay', type=int) 24 | parser.add_argument('--lr_decay', default=80000000, 25 | help='Number of frames needed to decay lr to the lowest value', dest='lr_decay', type=int) 26 | parser.add_argument('--queue_size', default=256, help='Size of queue holding agent experience', dest='queue_size', 27 | type=int) 28 | parser.add_argument('--th_comp_fix', default=True, 29 | help='Sets different Theano compiledir for each process', dest='th_fix', type=bool) 30 | # ----- 31 | args = parser.parse_args() 32 | 33 | 34 | # ----- 35 | 36 | 37 | def build_network(input_shape, output_shape): 38 | from keras.models import Model 39 | from keras.layers import Input, Conv2D, Flatten, Dense 40 | 41 | x = Input(shape=input_shape) 42 | h = Conv2D(16, kernel_size=(8, 8), strides=(4, 4), activation='relu', data_format='channels_first')(x) 43 | h = Conv2D(32, kernel_size=(4, 4), strides=(2, 2), activation='relu', data_format='channels_first')(h) 44 | h = Flatten()(h) 45 | h = Dense(256, activation='relu')(h) 46 | v = Dense(output_shape, activation='linear')(h) 47 | return Model(inputs=x, outputs=v) 48 | 49 | 50 | class LearningAgent(object): 51 | def __init__(self, action_space, batch_size=32, screen=(84, 84), swap_freq=200): 52 | from keras.optimizers import RMSprop 53 | # ----- 54 | self.screen = screen 55 | self.input_depth = 1 56 | self.past_range = 3 57 | self.observation_shape = (self.input_depth * self.past_range,) + self.screen 58 | self.batch_size = batch_size 59 | 60 | self.action_value = build_network(self.observation_shape, action_space.n) 61 | self.action_value_freeze = build_network(self.observation_shape, action_space.n) 62 | 63 | self.action_value.compile(optimizer='rmsprop', loss='mse') 64 | self.action_value_freeze.compile(optimizer='rmsprop', loss='mse') 65 | 66 | self.losses = deque(maxlen=25) 67 | self.q_values = deque(maxlen=25) 68 | self.swap_freq = swap_freq 69 | 
self.swap_counter = self.swap_freq 70 | self.unroll = np.arange(self.batch_size) 71 | self.frames = 0 72 | 73 | def learn(self, last_observations, actions, rewards, observations, not_terminals, discount=0.99, 74 | learning_rate=0.001): 75 | self.action_value.optimizer.lr.set_value(learning_rate) 76 | frames = len(last_observations) 77 | self.frames += frames 78 | # ----- 79 | targets = self.action_value.predict_on_batch(last_observations) 80 | q_values = self.action_value_freeze.predict_on_batch(observations) 81 | # ----- 82 | # equation = rewards + not_terminals * discount * np.argmax(q_values) 83 | rewards = np.clip(rewards, -1., 1.) 84 | equation = not_terminals 85 | equation *= np.max(q_values, axis=1) 86 | equation *= discount 87 | targets[self.unroll, actions] = rewards + equation 88 | # ----- 89 | loss = self.action_value.train_on_batch(last_observations, targets) 90 | self.losses.append(loss) 91 | self.q_values.append(np.mean(targets)) 92 | print( 93 | '\rFrames: %8d; Lr: %8.7f; Loss: %7.4f; Min: %7.4f; Max: %7.4f; Avg: %7.4f --- Q-value; Min: %7.4f; Max: %7.4f; Avg: %7.4f' % ( 94 | self.frames, learning_rate, loss, min(self.losses), max(self.losses), np.mean(self.losses), 95 | np.min(self.q_values), np.max(self.q_values), np.mean(self.q_values)), end='') 96 | self.swap_counter -= frames 97 | if self.swap_counter < 0: 98 | self.swap_counter += self.swap_freq 99 | self.action_value_freeze.set_weights(self.action_value.get_weights()) 100 | return True 101 | return False 102 | 103 | 104 | def learn_proc(global_frame, mem_queue, weight_dict): 105 | import os 106 | pid = os.getpid() 107 | if args.th_fix: 108 | os.environ['THEANO_FLAGS'] = 'floatX=float32,device=gpu,nvcc.fastmath=False,lib.cnmem=0,' + \ 109 | 'compiledir=th_comp_learn' 110 | # ----- 111 | save_freq = args.save_freq 112 | learning_rate = args.learning_rate 113 | batch_size = args.batch_size 114 | checkpoint = args.checkpoint 115 | lr_decay = args.lr_decay 116 | # ----- 117 | env = gym.make(args.game) 118 | agent = LearningAgent(env.action_space, batch_size=args.batch_size, swap_freq=args.swap_freq) 119 | # ----- 120 | if checkpoint > 0: 121 | agent.action_value.load_weights('model-%d.h5' % (checkpoint,)) 122 | agent.action_value_freeze.set_weights(agent.action_value.get_weights()) 123 | print(' %5d> Setting weights in dict' % (pid,)) 124 | # ----- 125 | weight_dict['update'] = 0 126 | weight_dict['weights'] = agent.action_value.get_weights() 127 | # ----- 128 | last_obs = np.zeros((batch_size,) + agent.observation_shape) 129 | actions = np.zeros(batch_size, dtype=np.int32) 130 | rewards = np.zeros(batch_size) 131 | obs = np.zeros((batch_size,) + agent.observation_shape) 132 | not_term = np.zeros(batch_size) 133 | # ----- 134 | index = 0 135 | agent.frames = checkpoint 136 | save_counter = checkpoint % save_freq + save_freq 137 | while True: 138 | last_obs[index, ...], actions[index], rewards[index], obs[index, ...], not_term[index] = mem_queue.get() 139 | # ----- 140 | index = (index + 1) % batch_size 141 | if index == 0: 142 | lr = max(0.00000001, learning_rate * (1. 
- agent.frames * batch_size / lr_decay)) 143 | updated = agent.learn(last_obs, actions, rewards, obs, not_term, learning_rate=lr) 144 | global_frame.value = agent.frames 145 | if updated: 146 | # print(' %5d> Updating weights in dict' % (pid,)) 147 | weight_dict['weights'] = agent.action_value_freeze.get_weights() 148 | weight_dict['update'] += 1 149 | # ----- 150 | save_counter -= 1 151 | if save_counter < 0: 152 | save_counter += save_freq 153 | agent.action_value_freeze.save_weights('model-%d.h5' % (agent.frames,), overwrite=True) 154 | 155 | 156 | class ActingAgent(object): 157 | def __init__(self, action_space, screen=(84, 84)): 158 | from keras.optimizers import RMSprop 159 | # ----- 160 | self.screen = screen 161 | self.input_depth = 1 162 | self.past_range = 3 163 | self.observation_shape = (self.input_depth * self.past_range,) + self.screen 164 | 165 | self.action_value = build_network(self.observation_shape, action_space.n) 166 | self.action_value.compile(optimizer='rmsprop', loss='mse') 167 | 168 | self.action_space = action_space 169 | self.observations = np.zeros(self.observation_shape) 170 | self.last_observations = np.zeros_like(self.observations) 171 | 172 | def init_episode(self, observation): 173 | for _ in range(self.past_range): 174 | self.save_observation(observation) 175 | 176 | def sars_data(self, action, reward, observation, not_terminal): 177 | self.save_observation(observation) 178 | return self.last_observations, action, reward, self.observations, not_terminal 179 | 180 | def choose_action(self, epsilon=0.0): 181 | if np.random.random() < epsilon: 182 | return self.action_space.sample() 183 | else: 184 | return np.argmax(self.action_value.predict(self.observations[None, ...])) 185 | 186 | def save_observation(self, observation): 187 | self.last_observations = self.observations[...] 188 | self.observations = np.roll(self.observations, -self.input_depth, axis=0) 189 | self.observations[-self.input_depth:, ...] = self.transform_screen(observation) 190 | 191 | def transform_screen(self, data): 192 | return rgb2gray(imresize(data, self.screen))[None, ...] 193 | 194 | 195 | def generate_experience_proc(global_frame, mem_queue, weight_dict, no, epsilon): 196 | import os 197 | pid = os.getpid() 198 | if args.th_fix: 199 | os.environ['THEANO_FLAGS'] = 'floatX=float32,device=gpu,nvcc.fastmath=True,lib.cnmem=0,' + \ 200 | 'compiledir=th_comp_act_' + str(no) 201 | # ----- 202 | print(' %5d> Process started with %6.3f' % (pid, epsilon)) 203 | # ----- 204 | env = gym.make(args.game) 205 | agent = ActingAgent(env.action_space) 206 | 207 | if args.checkpoint > 0: 208 | print(' %5d> Loaded weights from file' % (pid,)) 209 | agent.action_value.load_weights('model-%d.h5' % (args.checkpoint,)) 210 | else: 211 | import time 212 | while 'weights' not in weight_dict: 213 | time.sleep(0.1) 214 | agent.action_value.set_weights(weight_dict['weights']) 215 | print(' %5d> Loaded weights from dict' % (pid,)) 216 | 217 | best_score, last_update, frames = 0, 0, 0 218 | avg_score = deque(maxlen=20) 219 | stop_decay = global_frame.value > args.eps_decay 220 | 221 | while True: 222 | done = False 223 | episode_reward, noops, last_op = 0, 0, 0 224 | observation = env.reset() 225 | agent.init_episode(observation) 226 | 227 | # ----- 228 | while not done: 229 | frames += 1 230 | if not stop_decay: 231 | frame_tmp = global_frame.value 232 | decayed_epsilon = max(epsilon, epsilon + (1. 
- epsilon) * ( 233 | args.eps_decay - frame_tmp) / args.eps_decay) 234 | stop_decay = frame_tmp > args.eps_decay 235 | # ----- 236 | action = agent.choose_action(decayed_epsilon) 237 | observation, reward, done, _ = env.step(action) 238 | episode_reward += reward 239 | best_score = max(best_score, episode_reward) 240 | # ----- 241 | if action == last_op: 242 | noops += 1 243 | else: 244 | last_op, noops = action, 0 245 | # ----- 246 | if noops > 100: 247 | break 248 | # ----- 249 | mem_queue.put(agent.sars_data(action, reward, observation, not done)) 250 | # ----- 251 | if frames % 2000 == 0: 252 | print(' %5d> Epsilon: %9.6f; Best: %4d; Avg: %6.2f' % ( 253 | pid, decayed_epsilon, best_score, np.mean(avg_score))) 254 | if frames % args.batch_size == 0: 255 | update = weight_dict.get('update', 0) 256 | if update > last_update: 257 | last_update = update 258 | # print(' %5d> Getting weights from dict' % (pid,)) 259 | agent.action_value.set_weights(weight_dict['weights']) 260 | # ----- 261 | avg_score.append(episode_reward) 262 | 263 | 264 | def init_worker(): 265 | import signal 266 | signal.signal(signal.SIGINT, signal.SIG_IGN) 267 | 268 | 269 | def main(): 270 | manager = Manager() 271 | weight_dict = manager.dict() 272 | global_frame = manager.Value('i', args.checkpoint) 273 | mem_queue = manager.Queue(args.queue_size) 274 | 275 | eps = [0.1, 0.01, 0.5] 276 | pool = Pool(args.processes + 1, init_worker) 277 | 278 | try: 279 | for i in range(args.processes): 280 | pool.apply_async(generate_experience_proc, 281 | args=(global_frame, mem_queue, weight_dict, i, eps[i % len(eps)])) 282 | 283 | pool.apply_async(learn_proc, args=(global_frame, mem_queue, weight_dict)) 284 | 285 | pool.close() 286 | pool.join() 287 | 288 | except KeyboardInterrupt: 289 | pool.terminate() 290 | pool.join() 291 | 292 | 293 | if __name__ == "__main__": 294 | main() 295 | -------------------------------------------------------------------------------- /q-learning-n-step/README.md: -------------------------------------------------------------------------------- 1 | ### Usage 2 | 3 | To start training simply type: 4 | ``` 5 | python train.py --game=Breakout-v0 --processes=16 --n_step=5 6 | ``` 7 | 8 | To resume training from saved model (ex. 
`model-1250000.h5`): 9 | ``` 10 | python train.py --game=Breakout-v0 --processes=16 --checkpoint=1250000 11 | ``` 12 | 13 | To see how it plays: 14 | ``` 15 | python play.py --model=model-file.h5 --game=Breakout-v0 16 | ``` -------------------------------------------------------------------------------- /q-learning-n-step/play.py: -------------------------------------------------------------------------------- 1 | import gym 2 | from scipy.misc import imresize 3 | from skimage.color import rgb2gray 4 | import numpy as np 5 | import argparse 6 | 7 | 8 | def build_network(input_shape, output_shape): 9 | from keras.models import Model 10 | from keras.layers import Input, Conv2D, Flatten, Dense 11 | 12 | x = Input(shape=input_shape) 13 | h = Conv2D(16, kernel_size=(8, 8), strides=(4, 4), activation='relu', data_format='channels_first')(x) 14 | h = Conv2D(32, kernel_size=(4, 4), strides=(2, 2), activation='relu', data_format='channels_first')(h) 15 | h = Flatten()(h) 16 | h = Dense(256, activation='relu')(h) 17 | v = Dense(output_shape, activation='linear')(h) 18 | return Model(inputs=x, outputs=v) 19 | 20 | 21 | class ActingAgent(object): 22 | def __init__(self, action_space, screen=(84, 84)): 23 | from keras.optimizers import RMSprop 24 | 25 | self.screen = screen 26 | self.input_depth = 1 27 | self.past_range = 3 28 | self.replay_size = 32 29 | self.observation_shape = (self.input_depth * self.past_range,) + self.screen 30 | 31 | self.action_value = build_network(self.observation_shape, action_space.n) 32 | self.action_value.compile(optimizer=RMSprop(clipnorm=1.), loss='mse') # clipnorm=1. 33 | 34 | self.action_space = action_space 35 | self.observations = np.zeros((self.input_depth * self.past_range,) + screen) 36 | 37 | def init_episode(self, observation): 38 | for _ in range(self.past_range): 39 | self.save_observation(observation) 40 | 41 | def choose_action(self, observation, epsilon=0.0): 42 | self.save_observation(observation) 43 | if np.random.random() < epsilon: 44 | return self.action_space.sample() 45 | else: 46 | return np.argmax(self.action_value.predict(self.observations[None, ...])) 47 | 48 | def save_observation(self, observation): 49 | self.observations = np.roll(self.observations, -self.input_depth, axis=0) 50 | self.observations[-self.input_depth:, ...] = self.transform_screen(observation) 51 | 52 | def transform_screen(self, data): 53 | return rgb2gray(imresize(data, self.screen))[None, ...] 
54 | 55 | 56 | parser = argparse.ArgumentParser(description='Evaluation of model') 57 | parser.add_argument('--game', default='Breakout-v0', help='Name of openai gym environment', dest='game') 58 | parser.add_argument('--evaldir', default=None, help='Directory to save evaluation', dest='evaldir') 59 | parser.add_argument('--model', help='File with weights for model', dest='model') 60 | parser.add_argument('--eps', default=0., help='Epsilon value', dest='eps', type=float) 61 | 62 | 63 | def main(): 64 | args = parser.parse_args() 65 | # ----- 66 | env = gym.make(args.game) 67 | if args.evaldir: 68 | env.monitor.start(args.evaldir) 69 | # ----- 70 | agent = ActingAgent(env.action_space) 71 | 72 | model_file = args.model 73 | epsilon = args.eps 74 | 75 | agent.action_value.load_weights(model_file) 76 | 77 | game = 1 78 | for _ in range(10): 79 | done = False 80 | episode_reward = 0 81 | noops = 0 82 | 83 | # init game 84 | observation = env.reset() 85 | agent.init_episode(observation) 86 | # play one game 87 | print('Game #%8d; ' % (game,), end='') 88 | while not done: 89 | env.render() 90 | action = agent.choose_action(observation, epsilon=epsilon) 91 | observation, reward, done, _ = env.step(action) 92 | episode_reward += reward 93 | # ---- 94 | if action == 0: 95 | noops += 1 96 | else: 97 | noops = 0 98 | if noops > 100: 99 | break 100 | print('Reward %4d; ' % (episode_reward,)) 101 | game += 1 102 | # ----- 103 | if args.evaldir: 104 | env.monitor.close() 105 | 106 | 107 | if __name__ == "__main__": 108 | main() 109 | -------------------------------------------------------------------------------- /q-learning-n-step/train.py: -------------------------------------------------------------------------------- 1 | from scipy.misc import imresize 2 | from skimage.color import rgb2gray 3 | from multiprocessing import * 4 | from collections import deque 5 | import gym 6 | import numpy as np 7 | import argparse 8 | 9 | # ----- 10 | parser = argparse.ArgumentParser(description='Training model') 11 | parser.add_argument('--game', default='Breakout-v0', help='OpenAI gym environment name', dest='game', type=str) 12 | parser.add_argument('--processes', default=4, help='Number of processes that generate experience for agent', 13 | dest='processes', type=int) 14 | parser.add_argument('--lr', default=0.0001, help='Learning rate', dest='learning_rate', type=float) 15 | parser.add_argument('--batch_size', default=20, help='Batch size to use during training', dest='batch_size', type=int) 16 | parser.add_argument('--swap_freq', default=10000, help='Number of frames before swapping network weights', 17 | dest='swap_freq', type=int) 18 | parser.add_argument('--checkpoint', default=0, help='Iteration to resume training', dest='checkpoint', type=int) 19 | parser.add_argument('--save_freq', default=250000, help='Number of frame before saving weights', dest='save_freq', 20 | type=int) 21 | parser.add_argument('--eps_decay', default=4000000, 22 | help='Number of frames needed to decay epsilon to the lowest value', dest='eps_decay', type=int) 23 | parser.add_argument('--lr_decay', default=80000000, 24 | help='Number of frames needed to decay lr to the lowest value', dest='lr_decay', type=int) 25 | parser.add_argument('--queue_size', default=256, help='Size of queue holding agent experience', dest='queue_size', 26 | type=int) 27 | parser.add_argument('--n_step', default=5, help='Number of steps in Q-learning', dest='n_step', type=int) 28 | parser.add_argument('--th_comp_fix', default=True, 29 | help='Sets different 
Theano compiledir for each process', dest='th_fix', type=bool) 30 | # ----- 31 | args = parser.parse_args() 32 | 33 | 34 | # ----- 35 | 36 | 37 | def build_network(input_shape, output_shape): 38 | from keras.models import Model 39 | from keras.layers import Input, Conv2D, Flatten, Dense 40 | 41 | x = Input(shape=input_shape) 42 | h = Conv2D(16, kernel_size=(8, 8), strides=(4, 4), activation='relu', data_format='channels_first')(x) 43 | h = Conv2D(32, kernel_size=(4, 4), strides=(2, 2), activation='relu', data_format='channels_first')(h) 44 | h = Flatten()(h) 45 | h = Dense(256, activation='relu')(h) 46 | v = Dense(output_shape, activation='linear')(h) 47 | return Model(inputs=x, outputs=v) 48 | 49 | 50 | # ----- 51 | 52 | class LearningAgent(object): 53 | def __init__(self, action_space, batch_size=32, screen=(84, 84), swap_freq=200): 54 | from keras.optimizers import RMSprop 55 | # ----- 56 | self.screen = screen 57 | self.input_depth = 1 58 | self.past_range = 3 59 | self.observation_shape = (self.input_depth * self.past_range,) + self.screen 60 | self.batch_size = batch_size 61 | 62 | self.action_value = build_network(self.observation_shape, action_space.n) 63 | self.action_value.compile(optimizer=RMSprop(clipnorm=1.), loss='mse') 64 | 65 | self.losses = deque(maxlen=25) 66 | self.q_values = deque(maxlen=25) 67 | self.swap_freq = swap_freq 68 | self.swap_counter = self.swap_freq 69 | self.unroll = np.arange(self.batch_size) 70 | self.frames = 0 71 | 72 | def learn(self, last_observations, actions, rewards, learning_rate=0.001): 73 | self.action_value.optimizer.lr.set_value(learning_rate) 74 | frames = len(last_observations) 75 | self.frames += frames 76 | # ----- 77 | targets = self.action_value.predict_on_batch(last_observations) 78 | # ----- 79 | targets[self.unroll, actions] = rewards 80 | # ----- 81 | loss = self.action_value.train_on_batch(last_observations, targets) 82 | self.losses.append(loss) 83 | self.q_values.append(np.mean(targets)) 84 | print('\rIter: %8d; Lr: %8.7f; Loss: %7.4f; Min: %7.4f; Max: %7.4f; Avg: %7.4f --- Q-value; Min: %7.4f; Max: %7.4f; Avg: %7.4f' % ( 85 | self.frames, learning_rate, loss, min(self.losses), max(self.losses), np.mean(self.losses), 86 | np.min(self.q_values), np.max(self.q_values), np.mean(self.q_values)), end='') 87 | self.swap_counter -= frames 88 | if self.swap_counter < 0: 89 | self.swap_counter += self.swap_freq 90 | return True 91 | return False 92 | 93 | 94 | def learn_proc(global_frame, mem_queue, weight_dict): 95 | import os 96 | pid = os.getpid() 97 | if args.th_fix: 98 | os.environ['THEANO_FLAGS'] = 'floatX=float32,device=gpu,nvcc.fastmath=False,lib.cnmem=0,' + \ 99 | 'compiledir=th_comp_learn' 100 | # ----- 101 | save_freq = args.save_freq 102 | learning_rate = args.learning_rate 103 | batch_size = args.batch_size 104 | checkpoint = args.checkpoint 105 | lr_decay = args.lr_decay 106 | # ----- 107 | env = gym.make(args.game) 108 | agent = LearningAgent(env.action_space, batch_size=args.batch_size, swap_freq=args.swap_freq) 109 | # ----- 110 | if checkpoint > 0: 111 | print(' %5d> Loading weights from file' % (pid,)) 112 | agent.action_value.load_weights('model-%d.h5' % (checkpoint,)) 113 | # ----- 114 | weight_dict['update'] = 0 115 | weight_dict['weights'] = agent.action_value.get_weights() 116 | print(' %5d> Setting weights in dict' % (pid,)) 117 | # ----- 118 | last_obs = np.zeros((batch_size,) + agent.observation_shape) 119 | actions = np.zeros(batch_size, dtype=np.int32) 120 | rewards = np.zeros(batch_size) 121 | # ----- 122 | 
idx = 0 123 | agent.frames = checkpoint 124 | save_counter = checkpoint % save_freq + save_freq 125 | while True: 126 | # ----- 127 | last_obs[idx, ...], actions[idx], rewards[idx] = mem_queue.get() 128 | idx = (idx + 1) % batch_size 129 | if idx == 0: 130 | lr = max(0.000000001, learning_rate * (1. - agent.frames / lr_decay)) 131 | updated = agent.learn(last_obs, actions, rewards, learning_rate=lr) 132 | global_frame.value = agent.frames 133 | if updated: 134 | # print(' %5d> Updating weights in dict' % (pid,)) 135 | weight_dict['weights'] = agent.action_value.get_weights() 136 | weight_dict['update'] += 1 137 | # ----- 138 | save_counter -= 1 139 | if save_counter % save_freq == 0: 140 | agent.action_value.save_weights('model-%d.h5' % (agent.frames,), overwrite=True) 141 | 142 | 143 | class ActingAgent(object): 144 | def __init__(self, action_space, screen=(84, 84), n_step=8, discount=0.99): 145 | from keras.optimizers import RMSprop 146 | # ----- 147 | self.screen = screen 148 | self.input_depth = 1 149 | self.past_range = 3 150 | self.observation_shape = (self.input_depth * self.past_range,) + self.screen 151 | 152 | self.action_value = build_network(self.observation_shape, action_space.n) 153 | self.action_value.compile(optimizer=RMSprop(clipnorm=1.), loss='mse') # clipnorm=1. 154 | 155 | self.action_space = action_space 156 | self.observations = np.zeros(self.observation_shape) 157 | self.last_observations = np.zeros_like(self.observations) 158 | # ----- 159 | self.n_step_observations = deque(maxlen=n_step) 160 | self.n_step_actions = deque(maxlen=n_step) 161 | self.n_step_rewards = deque(maxlen=n_step) 162 | self.n_step = n_step 163 | self.discount = discount 164 | self.counter = 0 165 | 166 | def init_episode(self, observation): 167 | for _ in range(self.past_range): 168 | self.save_observation(observation) 169 | 170 | def reset(self): 171 | self.counter = 0 172 | self.n_step_observations.clear() 173 | self.n_step_actions.clear() 174 | self.n_step_rewards.clear() 175 | 176 | def sars_data(self, action, reward, observation, terminal, mem_queue): 177 | self.save_observation(observation) 178 | reward = np.clip(reward, -1., 1.) 179 | # ----- 180 | self.n_step_observations.appendleft(self.last_observations) 181 | self.n_step_actions.appendleft(action) 182 | self.n_step_rewards.appendleft(reward) 183 | # ----- 184 | self.counter += 1 185 | if terminal or self.counter >= self.n_step: 186 | r = 0. 187 | if not terminal: 188 | r = np.max(self.action_value.predict(self.observations[None, ...])) 189 | for i in range(self.counter): 190 | r = self.n_step_rewards[i] + self.discount * r 191 | mem_queue.put((self.n_step_observations[i], self.n_step_actions[i], r)) 192 | self.reset() 193 | 194 | def choose_action(self, epsilon=0.0): 195 | if np.random.random() < epsilon: 196 | return self.action_space.sample() 197 | else: 198 | return np.argmax(self.action_value.predict(self.observations[None, ...])) 199 | 200 | def save_observation(self, observation): 201 | self.last_observations = self.observations[...] 202 | self.observations = np.roll(self.observations, -self.input_depth, axis=0) 203 | self.observations[-self.input_depth:, ...] = self.transform_screen(observation) 204 | 205 | def transform_screen(self, data): 206 | return rgb2gray(imresize(data, self.screen))[None, ...] 
207 | 208 | 209 | def generate_experience_proc(global_frame, mem_queue, weight_dict, no, epsilon): 210 | import os 211 | pid = os.getpid() 212 | if args.th_fix: 213 | os.environ['THEANO_FLAGS'] = 'floatX=float32,device=gpu,nvcc.fastmath=True,lib.cnmem=0,' + \ 214 | 'compiledir=th_comp_act_' + str(no) 215 | # ----- 216 | batch_size = args.batch_size 217 | # ----- 218 | print(' %5d> Process started with %6.3f' % (pid, epsilon)) 219 | # ----- 220 | env = gym.make(args.game) 221 | agent = ActingAgent(env.action_space, n_step=args.n_step) 222 | 223 | if args.checkpoint > 0: 224 | print(' %5d> Loaded weights from file' % (pid,)) 225 | agent.action_value.load_weights('model-%d.h5' % (args.checkpoint,)) 226 | else: 227 | import time 228 | while 'weights' not in weight_dict: 229 | time.sleep(0.1) 230 | agent.action_value.set_weights(weight_dict['weights']) 231 | print(' %5d> Loaded weights from dict' % (pid,)) 232 | 233 | best_score, last_update, frames = 0, 0, 0 234 | avg_score = deque(maxlen=20) 235 | stop_decay = global_frame.value > args.eps_decay 236 | 237 | while True: 238 | done = False 239 | episode_reward = 0 240 | last_op, op_count = 0, 0 241 | observation = env.reset() 242 | agent.init_episode(observation) 243 | 244 | # ----- 245 | while not done: 246 | frames += 1 247 | if not stop_decay: 248 | frame_tmp = global_frame.value 249 | decayed_epsilon = max(epsilon, epsilon + (1. - epsilon) * ( 250 | args.eps_decay - frame_tmp) / args.eps_decay) 251 | stop_decay = frame_tmp > args.eps_decay 252 | # ----- 253 | action = agent.choose_action(decayed_epsilon) 254 | observation, reward, done, _ = env.step(action) 255 | episode_reward += reward 256 | best_score = max(best_score, episode_reward) 257 | # ----- 258 | agent.sars_data(action, reward, observation, done, mem_queue) 259 | # ----- 260 | if action == last_op: 261 | op_count += 1 262 | else: 263 | op_count, last_op = 0, action 264 | # ----- 265 | if op_count > 100: 266 | agent.reset() # reset agent memory 267 | break 268 | # ----- 269 | if frames % 2000 == 0: 270 | print(' %5d> Epsilon: %9.6f; Best score: %4d; Avg: %9.3f' % ( 271 | pid, decayed_epsilon, best_score, np.mean(avg_score))) 272 | if frames % batch_size == 0: 273 | update = weight_dict.get('update', 0) 274 | if update > last_update: 275 | last_update = update 276 | # print(' %5d> Getting weights from dict' % (pid,)) 277 | agent.action_value.set_weights(weight_dict['weights']) 278 | # ----- 279 | avg_score.append(episode_reward) 280 | 281 | 282 | def init_worker(): 283 | import signal 284 | signal.signal(signal.SIGINT, signal.SIG_IGN) 285 | 286 | 287 | def main(): 288 | manager = Manager() 289 | weight_dict = manager.dict() 290 | global_frame = manager.Value('i', args.checkpoint) 291 | mem_queue = manager.Queue(args.queue_size) 292 | 293 | eps = [0.1, 0.01, 0.5] 294 | pool = Pool(args.processes + 1, init_worker) 295 | 296 | try: 297 | for i in range(args.processes): 298 | pool.apply_async(generate_experience_proc, 299 | args=(global_frame, mem_queue, weight_dict, i, eps[i % len(eps)])) 300 | 301 | pool.apply_async(learn_proc, args=(global_frame, mem_queue, weight_dict)) 302 | 303 | pool.close() 304 | pool.join() 305 | 306 | except KeyboardInterrupt: 307 | pool.terminate() 308 | pool.join() 309 | 310 | 311 | if __name__ == "__main__": 312 | main() 313 | --------------------------------------------------------------------------------