├── main.py
├── README.md
├── networks
│   ├── critic.py
│   └── actor.py
└── ddpg.py

/main.py:
--------------------------------------------------------------------------------
from ddpg import DDPG as Agent


def loop(agent, world):
    """Runs one interaction step and trains the agent on it."""
    state = world.get_state()
    action = agent.get_action(state)
    next_state, reward, done = world.act(action)
    agent.remember(state, action, reward, done, next_state)
    agent.train()


def main():
    # `World` is a user-supplied environment wrapper; it is expected to expose
    # state_size, action_size, get_state() and act(action), as used in loop().
    world = World()
    agent = Agent(state_size=world.state_size, action_size=world.action_size)
    while True:
        loop(agent, world)


if __name__ == '__main__':
    main()
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
DDPG: Deep Deterministic Policy Gradients
=========================================

A clean Python implementation of an agent for continuous-control reinforcement
learning using Deep Deterministic Policy Gradients.

![](https://github.com/rmst/ddpg/raw/master/readme/ipend.gif?raw=true) ![](https://github.com/rmst/ddpg/raw/master/readme/reacher.gif?raw=true) ![](https://github.com/rmst/ddpg/raw/master/readme/pend.gif?raw=true)
[![DDPG on TORCS](http://img.youtube.com/vi/Tb5gASEJIRM/0.jpg)](http://www.youtube.com/watch?v=Tb5gASEJIRM "Video Title")

# Overview:

DDPG is a reinforcement learning algorithm that uses deep neural networks to
approximate the policy and value functions. If you are interested in how the
algorithm works in detail, you can read the original DDPG paper here:

[Continuous control with deep reinforcement learning](https://arxiv.org/pdf/1509.02971v5.pdf)

The algorithm consists of two networks, an Actor and a Critic, which
approximate the policy and value functions of a reinforcement learning problem.

The name DDPG, or Deep Deterministic Policy Gradients, refers to how the
networks are trained. The value function is trained with ordinary
mean-squared-error regression and backpropagation, while the Actor network is
trained with gradients obtained from the Critic network. You can read the
fascinating original paper on deterministic policy gradients here:

[Deterministic Policy Gradient Algorithms](http://www.jmlr.org/proceedings/papers/v32/silver14.pdf)

DDPG is useful because, in very few lines of code (this project is roughly 150
lines excluding comments), it can learn a control policy for many different
agents, including ones with complex configurations, continuous actions, and
high-dimensional state spaces (e.g. image data). Even better, the same code can
be used to train a humanoid robot, a drone, a car, or any other robotic
configuration you can think of, making this project highly reusable.

## Actor Network

The actor network approximates the policy function:

    A(s) -> a

where s represents a state and a represents an action.

## Critic Network

The critic network approximates the value function:

    C(s, a) -> q

where s represents a state, a represents an action, and q represents the
value of the given state-action pair.
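
For a concrete picture of what these two functions look like as neural
networks, here is a minimal, illustrative sketch using the Keras functional
API. It is not the exact architecture in `networks/actor.py` and
`networks/critic.py` (those use the older Keras 1 `merge` call and different
output activations), and the state/action sizes below are placeholders:

```python
from keras.layers import Input, Dense, Concatenate
from keras.models import Model

state_size, action_size = 3, 1  # placeholder dimensions

# Actor: A(s) -> a
state_in = Input(shape=(state_size,))
h = Dense(300, activation='relu')(state_in)
h = Dense(600, activation='relu')(h)
action_out = Dense(action_size, activation='tanh')(h)  # assumes actions in [-1, 1]
actor = Model(state_in, action_out)

# Critic: C(s, a) -> q
s_in = Input(shape=(state_size,))
a_in = Input(shape=(action_size,))
h = Dense(300, activation='relu')(s_in)
h = Concatenate()([h, Dense(300, activation='linear')(a_in)])
h = Dense(600, activation='relu')(h)
q_out = Dense(1, activation='linear')(h)
critic = Model([s_in, a_in], q_out)
critic.compile(loss='mse', optimizer='adam')
```

As in the DDPG paper and in `networks/critic.py`, the critic feeds the action
in after the first state layer rather than at the input.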

# Future:
Some ideas for future improvements are as follows:

* Support for convolutional neural networks, in order to handle
  high-dimensional states (such as raw pixels from a camera)
* Support for recurrent networks, for sequential input data

Let me know of any other ideas you may have :)

# Thanks to:

This project was inspired by:
* The paper [Continuous control with deep reinforcement learning](https://arxiv.org/pdf/1509.02971v5.pdf) by Lillicrap et al.
* This Python implementation, [DDPG](https://github.com/rmst/ddpg)
* Ben Lau's article on [Using Keras and Deep Deterministic Policy Gradient to play TORCS](https://yanpanlau.github.io/2016/10/11/Torcs-Keras.html)
--------------------------------------------------------------------------------
/networks/critic.py:
--------------------------------------------------------------------------------
from keras.layers import Dense, Input, merge
from keras.models import Model
from keras.optimizers import Adam
import keras.backend as keras_backend
import tensorflow


class CriticNetwork(object):
    def __init__(self, tensorflow_session, state_size, action_size,
                 hidden_units=(300, 600), learning_rate=0.0001, batch_size=64,
                 tau=0.001):
        """
        Constructor for the Critic network.

        :param tensorflow_session: The tensorflow session.
            See https://www.tensorflow.org for more information on tensorflow
            sessions.
        :param state_size: An integer denoting the dimensionality of the states
            in the current problem
        :param action_size: An integer denoting the dimensionality of the
            actions in the current problem
        :param hidden_units: An iterable defining the number of hidden units in
            each layer. Soon to be deprecated. default: (300, 600)
        :param learning_rate: A float denoting the speed at which the network
            will learn. default: 0.0001
        :param batch_size: An integer denoting the batch size. default: 64
        :param tau: A float denoting the rate at which the target model will
            track the main model. Formally, the tracking function is defined as:

                target_weights = tau * main_weights + (1 - tau) * target_weights

            For more explanation on how and why this happens, please refer to
            the DDPG paper:

            Lillicrap, Hunt, Pritzel, Heess, Erez, Tassa, Silver, & Wierstra.
            Continuous Control with Deep Reinforcement Learning. arXiv preprint
            arXiv:1509.02971, 2015.

            default: 0.001
        """
        # Store parameters
        self._tensorflow_session = tensorflow_session
        self._state_size = state_size
        self._action_size = action_size
        self._batch_size = batch_size
        self._tau = tau
        self._learning_rate = learning_rate
        self._hidden = hidden_units

        # Let tensorflow and keras work together
        keras_backend.set_session(tensorflow_session)

        # Generate the main model
        self._model, self._state_input, self._action_input = \
            self._generate_model()
        # Generate a carbon copy of the model so that we avoid divergence
        self._target_model, self._target_state_input, self._target_action_input = \
            self._generate_model()
        # Gradients of the Q value with respect to the action, for the policy
        # update
        self._action_gradients = tensorflow.gradients(self._model.output,
                                                      self._action_input)
        self._tensorflow_session.run(tensorflow.initialize_all_variables())
    def get_gradients(self, states, actions):
        """
        Returns the gradients of the critic's Q value with respect to the
        given actions, evaluated at the given states.

        :param states: The states at which to evaluate the gradients
        :param actions: The actions at which to evaluate the gradients
        :return: The gradients of Q with respect to the actions
        """
        return self._tensorflow_session.run(self._action_gradients, feed_dict={
            self._state_input: states,
            self._action_input: actions
        })[0]

    def train(self, states, actions, q_targets):
        """
        Trains the main network on a batch of (state, action) pairs and their
        corresponding Q-value targets.

        :param states: The states in the sampled batch
        :param actions: The actions in the sampled batch
        :param q_targets: The Q-value targets to regress towards
        :return: None
        """
        # Single gradient-descent step on the mean-squared error between the
        # predicted and target Q values.
        self._model.train_on_batch([states, actions], q_targets)

    def train_target_model(self):
        """
        Updates the weights of the target network to slowly track the main
        network.

        The speed at which the target network tracks the main network is
        defined by tau, given in the constructor to this class. Formally,
        the tracking function is defined as:

            target_weights = tau * main_weights + (1 - tau) * target_weights

        :return: None
        """
        main_weights = self._model.get_weights()
        target_weights = self._target_model.get_weights()
        target_weights = [self._tau * main_weight + (1 - self._tau) *
                          target_weight for main_weight, target_weight in
                          zip(main_weights, target_weights)]
        self._target_model.set_weights(target_weights)

    def _generate_model(self):
        """
        Generates the model based on the hyperparameters defined in the
        constructor.

        :return: a tuple containing references to the model, state input layer,
            and action input layer
        """
        state_input_layer = Input(shape=[self._state_size])
        action_input_layer = Input(shape=[self._action_size])
        s_layer = Dense(self._hidden[0], activation='relu')(state_input_layer)
        # The action branch must match the width of the state branch so the
        # two can be summed below.
        a_layer = Dense(self._hidden[1], activation='linear')(action_input_layer)
        hidden = Dense(self._hidden[1], activation='linear')(s_layer)
        hidden = merge([hidden, a_layer], mode='sum')
        hidden = Dense(self._hidden[1], activation='relu')(hidden)
        output_layer = Dense(1, activation='linear')(hidden)
        model = Model(input=[state_input_layer, action_input_layer],
                      output=output_layer)
        model.compile(loss='mse', optimizer=Adam(lr=self._learning_rate))
        return model, state_input_layer, action_input_layer
--------------------------------------------------------------------------------
/networks/actor.py:
--------------------------------------------------------------------------------
from __future__ import print_function, division
from keras.layers import Dense, Input
from keras.models import Model
import tensorflow
import keras.backend as keras_backend


class Actor(object):
    """
    Object representing the actor network, which approximates the function:

        u(s) -> a

    where u (the Greek letter mu) is the deterministic policy mapping from
    states s to actions a.
    """
    def __init__(self, tensorflow_session, state_size, action_size,
                 hidden_units=(300, 600), learning_rate=0.0001, batch_size=64,
                 tau=0.001):
        """
        Constructor for the Actor network

        :param tensorflow_session: The tensorflow session.
            See https://www.tensorflow.org for more information on tensorflow
            sessions.
        :param state_size: An integer denoting the dimensionality of the states
            in the current problem
        :param action_size: An integer denoting the dimensionality of the
            actions in the current problem
        :param hidden_units: An iterable defining the number of hidden units in
            each layer. Soon to be deprecated. default: (300, 600)
        :param learning_rate: A float denoting the speed at which the network
            will learn. default: 0.0001
        :param batch_size: An integer denoting the batch size. default: 64
        :param tau: A float denoting the rate at which the target model will
            track the main model. Formally, the tracking function is defined as:

                target_weights = tau * main_weights + (1 - tau) * target_weights

            For more explanation on how and why this happens, please refer to
            the DDPG paper:

            Lillicrap, Hunt, Pritzel, Heess, Erez, Tassa, Silver, & Wierstra.
            Continuous Control with Deep Reinforcement Learning. arXiv preprint
            arXiv:1509.02971, 2015.

            default: 0.001
        """
        # Store parameters
        self._tensorflow_session = tensorflow_session
        self._state_size = state_size
        self._action_size = action_size
        self._batch_size = batch_size
        self._tau = tau
        self._learning_rate = learning_rate
        self._hidden = hidden_units

        # Let tensorflow and keras work together
        keras_backend.set_session(tensorflow_session)

        # Generate the main model
        self._model, self._model_weights, self._model_input = \
            self._generate_model()
        # Generate a carbon copy of the model so that we avoid divergence
        self._target_model, self._target_weights, self._target_state = \
            self._generate_model()

        # Generate tensors to hold the gradients for our Policy Gradient update
        self._action_gradients = tensorflow.placeholder(tensorflow.float32,
                                                        [None, action_size])
        self._parameter_gradients = tensorflow.gradients(self._model.output,
                                                         self._model_weights,
                                                         -self._action_gradients)
        self._gradients = zip(self._parameter_gradients, self._model_weights)

        # Define the optimisation function
        self._optimize = tensorflow.train.AdamOptimizer(learning_rate)\
            .apply_gradients(self._gradients)

        # And initialise all tensorflow variables
        self._tensorflow_session.run(tensorflow.initialize_all_variables())

    def train(self, states, action_gradients):
        """
        Updates the weights of the main network by applying the deterministic
        policy gradient: the gradients of the critic's Q value with respect to
        the actions are back-propagated through the actor's parameters.

        :param states: The states used as input to the network
        :param action_gradients: The gradients of the critic's Q value with
            respect to the actions taken in those states
        :return: None
        """
        self._tensorflow_session.run(self._optimize, feed_dict={
            self._model_input: states,
            self._action_gradients: action_gradients
        })

    def train_target_model(self):
        """
        Updates the weights of the target network to slowly track the main
        network.

        The speed at which the target network tracks the main network is
        defined by tau, given in the constructor to this class. Formally,
        the tracking function is defined as:

            target_weights = tau * main_weights + (1 - tau) * target_weights

        :return: None
        """
        main_weights = self._model.get_weights()
        target_weights = self._target_model.get_weights()
        target_weights = [self._tau * main_weight + (1 - self._tau) *
                          target_weight for main_weight, target_weight in
                          zip(main_weights, target_weights)]
        self._target_model.set_weights(target_weights)
    def _generate_model(self):
        """
        Generates the model based on the hyperparameters defined in the
        constructor.

        :return: a tuple containing references to the model, its trainable
            weights, and its input layer
        """
        input_layer = Input(shape=[self._state_size])
        layer = Dense(self._hidden[0], activation='relu')(input_layer)
        layer = Dense(self._hidden[1], activation='relu')(layer)
        output_layer = Dense(self._action_size, activation='sigmoid')(layer)
        model = Model(input=input_layer, output=output_layer)
        return model, model.trainable_weights, input_layer
--------------------------------------------------------------------------------
/ddpg.py:
--------------------------------------------------------------------------------
from __future__ import absolute_import

from collections import deque
import random

import numpy
import tensorflow

from networks.actor import Actor
from networks.critic import CriticNetwork as Critic


class DDPG(object):
    def __init__(self, state_size, action_size, actor_hidden_units=(300, 600),
                 actor_learning_rate=0.0001, critic_hidden_units=(300, 600),
                 critic_learning_rate=0.001, batch_size=64, discount=0.99,
                 memory_size=10000, tau=0.001):
        """
        Constructs a DDPG Agent with the given parameters

        :param state_size: Int denoting the world's state dimensionality
        :param action_size: Int denoting the world's action dimensionality
        :param actor_hidden_units: Tuple(Int) denoting the actor's hidden layer
            sizes. Each element in the tuple represents a layer in the Actor
            network and the Int denotes the number of neurons in the layer.
        :param actor_learning_rate: Float denoting the learning rate of the
            Actor network. Best to be some small number close to 0.
        :param critic_hidden_units: Tuple(Int) denoting the critic's hidden
            layer sizes. Each element in the tuple represents a layer in the
            Critic network and the Int denotes the number of neurons in the
            layer.
        :param critic_learning_rate: Float denoting the learning rate of the
            Critic network. Best to be some small number close to 0.
        :param batch_size: Int denoting the batch size for training.
        :param discount: Float denoting the discount (gamma) given to future
            potential rewards when calculating q values
        :param memory_size: Int denoting the number of (state, action, reward)
            transitions that the agent will remember
        :param tau: Float denoting the rate at which the target networks track
            the main networks
        """

        self._discount = discount
        self._batch_size = batch_size
        self._memory_size = memory_size

        tensorflow_session = self._generate_tensorflow_session()

        self._actor = Actor(tensorflow_session=tensorflow_session,
                            state_size=state_size, action_size=action_size,
                            hidden_units=actor_hidden_units,
                            learning_rate=actor_learning_rate,
                            batch_size=batch_size, tau=tau)

        self._critic = Critic(tensorflow_session=tensorflow_session,
                              state_size=state_size, action_size=action_size,
                              hidden_units=critic_hidden_units,
                              learning_rate=critic_learning_rate,
                              batch_size=batch_size, tau=tau)

        self._memory = deque()

    def _generate_tensorflow_session(self):
        """
        Generates and returns the tensorflow session

        :return: the Tensorflow Session
        """
        config = tensorflow.ConfigProto()
        config.gpu_options.allow_growth = True
        return tensorflow.Session(config=config)

    def get_action(self, state):
        """
        Returns the best action predicted by the agent given the current state.

        :param state: numpy array denoting the current state.
        :return: numpy array denoting the predicted action.
        """
        return self._actor._model.predict(state)

    def train(self):
        """
        Trains the DDPG Agent from its current memory.

        Please note that the agent must have gone through more steps than the
        specified batch size before this method will do anything.

        :return: None
        """
        if len(self._memory) > self._batch_size:
            self._train()

    def _train(self):
        """
        Helper method for train. Takes care of sampling, training and
        updating both the actor and critic networks.

        :return: None
        """
        states, actions, rewards, done, next_states = self._get_sample()
        self._train_critic(states, actions, next_states, done, rewards)
        self._train_actor(states)
        self._update_target_models()

    def _get_sample(self):
        """
        Draws a random sample of size self._batch_size from the agent's current
        memory.

        :return: Tuple of numpy arrays denoting the sampled states, actions,
            rewards, done flags, and next states.
        """
        sample = random.sample(self._memory, self._batch_size)
        states, actions, rewards, done, next_states = zip(*sample)
        # Stack the sampled tuples into numpy arrays so they can be fed to the
        # networks directly.
        return (numpy.asarray(states), numpy.asarray(actions),
                numpy.asarray(rewards), numpy.asarray(done),
                numpy.asarray(next_states))

    def _train_critic(self, states, actions, next_states, done, rewards):
        """
        Trains the critic network

            C(s, a) -> q

        :param states: List of the states to train the network with
        :param actions: List of the actions to train the network with
        :param next_states: List of the t+1 states to train the network with
        :param done: List of booleans denoting whether each step was terminal
        :param rewards: List of rewards used to calculate q_targets

        :return: None
        """
        q_targets = self._get_q_targets(next_states, done, rewards)
        self._critic.train(states, actions, q_targets)

    def _get_q_targets(self, next_states, done, rewards):
        """
        Calculates the q targets with the following formula

            q = r + gamma * next_q

        unless the episode has ended, in which case

            q = r

        :param next_states: List(List(Float)) denoting the t+1 states
        :param done: List(Bool) denoting whether each step was an exit step
        :param rewards: List(Float) denoting the reward given in each step
        :return: The q targets
        """
        # The targets are computed from the target networks, as in the DDPG
        # paper, so that the regression targets move slowly.
        next_actions = self._actor._target_model.predict(next_states)
        next_q_values = self._critic._target_model.predict(
            [next_states, next_actions]).flatten()
        q_targets = [reward if this_done
                     else reward + self._discount * next_q_value
                     for (reward, next_q_value, this_done)
                     in zip(rewards, next_q_values, done)]
        return numpy.reshape(q_targets, (-1, 1))

    def _train_actor(self, states):
        """
        Trains the actor network using the calculated deterministic policy
        gradients.

        :param states: List(List(Float)) denoting the states to train the Actor
            on
        :return: None
        """
        gradients = self._get_gradients(states)
        self._actor.train(states, gradients)

    def _get_gradients(self, states):
        """
        Calculates the deterministic policy gradient term for Actor training.

        :param states: The states to calculate the gradients for.
        :return: The gradients of the critic's Q value with respect to the
            actions chosen by the actor in the given states
        """
        action_for_gradients = self._actor._model.predict(states)
        return self._critic.get_gradients(states, action_for_gradients)
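    # A note on the policy update driven by _get_gradients and _train_actor
    # above: following the deterministic policy gradient theorem (Silver et
    # al., 2014) used by DDPG, the actor parameters theta are moved along
    #
    #     grad_theta J  ~=  grad_a Q(s, a)|a=mu(s)  *  grad_theta mu(s)
    #
    # so the critic supplies grad_a Q (Critic.get_gradients) and the actor
    # back-propagates that signal through its own parameters (Actor.train).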

    def _update_target_models(self):
        """
        Updates the target models to slowly track the main models

        :return: None
        """
        self._critic.train_target_model()
        self._actor.train_target_model()

    def remember(self, state, action, reward, done, next_state):
        """
        Stores the given state, action, reward etc. in the Agent's memory.

        :param state: The state to remember
        :param action: The action to remember
        :param reward: The reward to remember
        :param done: Whether this was a final state
        :param next_state: The next state (if applicable)
        :return: None
        """
        self._memory.append((state, action, reward, done, next_state))
        if len(self._memory) > self._memory_size:
            self._memory.popleft()
--------------------------------------------------------------------------------