├── main.py
├── README.md
├── networks
│   ├── critic.py
│   └── actor.py
└── ddpg.py

/main.py:
--------------------------------------------------------------------------------
from ddpg import DDPG as Agent


def loop(agent, world):
    """Runs one interaction step and trains the agent on it."""
    state = world.get_state()
    action = agent.get_action(state)
    next_state, reward, done = world.act(action)
    agent.remember(state, action, reward, done, next_state)
    agent.train()


def main():
    # `World` is a user-supplied environment wrapper; it is expected to expose
    # state_size, action_size, get_state() and act(action), as used in loop().
    world = World()
    agent = Agent(state_size=world.state_size, action_size=world.action_size)
    while True:
        loop(agent, world)


if __name__ == '__main__':
    main()
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
DDPG: Deep Deterministic Policy Gradients
=========================================

A clean Python implementation of an agent for continuous-control reinforcement
learning using Deep Deterministic Policy Gradients.

![](https://github.com/rmst/ddpg/raw/master/readme/ipend.gif?raw=true) ![](https://github.com/rmst/ddpg/raw/master/readme/reacher.gif?raw=true) ![](https://github.com/rmst/ddpg/raw/master/readme/pend.gif?raw=true)
[![DDPG on TORCS](http://img.youtube.com/vi/Tb5gASEJIRM/0.jpg)](http://www.youtube.com/watch?v=Tb5gASEJIRM "Video Title")

# Overview:

DDPG is a reinforcement learning algorithm that uses deep neural networks to
approximate the policy and value functions. If you are interested in how the
algorithm works in detail, you can read the original DDPG paper here:

[Continuous control with deep reinforcement learning](https://arxiv.org/pdf/1509.02971v5.pdf)

The algorithm consists of two networks, an Actor and a Critic, which
approximate the policy and value functions of a reinforcement learning problem.

The name DDPG, or Deep Deterministic Policy Gradients, refers to how the
networks are trained. The value function is trained with ordinary
mean-squared-error regression and backpropagation, while the Actor network is
trained with gradients obtained from the Critic network. You can read the
fascinating original paper on deterministic policy gradients here:

[Deterministic Policy Gradient Algorithms](http://www.jmlr.org/proceedings/papers/v32/silver14.pdf)

DDPG is useful because, in very few lines of code (this project is roughly 150
lines excluding comments), it can learn a control policy for many different
agents, including ones with complex configurations, continuous actions, and
high-dimensional state spaces (e.g. image data). Even better, the same code can
be used to train a humanoid robot, a drone, a car, or any other robotic
configuration you can think of, making this project highly reusable.

## Actor Network

The actor network approximates the policy function:

    A(s) -> a

where s represents a state and a represents an action.

## Critic Network

The critic network approximates the value function:

    C(s, a) -> q

where s represents a state, a represents an action, and q represents the
value of the given state-action pair.
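
For a concrete picture of what these two functions look like as neural
networks, here is a minimal, illustrative sketch using the Keras functional
API. It is not the exact architecture in `networks/actor.py` and
`networks/critic.py` (those use the older Keras 1 `merge` call and different
output activations), and the state/action sizes below are placeholders:

```python
from keras.layers import Input, Dense, Concatenate
from keras.models import Model

state_size, action_size = 3, 1  # placeholder dimensions

# Actor: A(s) -> a
state_in = Input(shape=(state_size,))
h = Dense(300, activation='relu')(state_in)
h = Dense(600, activation='relu')(h)
action_out = Dense(action_size, activation='tanh')(h)  # assumes actions in [-1, 1]
actor = Model(state_in, action_out)

# Critic: C(s, a) -> q
s_in = Input(shape=(state_size,))
a_in = Input(shape=(action_size,))
h = Dense(300, activation='relu')(s_in)
h = Concatenate()([h, Dense(300, activation='linear')(a_in)])
h = Dense(600, activation='relu')(h)
q_out = Dense(1, activation='linear')(h)
critic = Model([s_in, a_in], q_out)
critic.compile(loss='mse', optimizer='adam')
```

As in the DDPG paper and in `networks/critic.py`, the critic feeds the action
in after the first state layer rather than at the input.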

# Future:
Some ideas for future improvements are as follows:

* Support for convolutional neural networks, in order to handle
  high-dimensional states (such as raw pixels from a camera)
* Support for recurrent networks, for sequential input data

Let me know of any other ideas you may have :)

# Thanks to:

This project was inspired by:
* The paper [Continuous control with deep reinforcement learning](https://arxiv.org/pdf/1509.02971v5.pdf) by Lillicrap et al.
* This Python implementation, [DDPG](https://github.com/rmst/ddpg)
* Ben Lau's article on [Using Keras and Deep Deterministic Policy Gradient to play TORCS](https://yanpanlau.github.io/2016/10/11/Torcs-Keras.html)
--------------------------------------------------------------------------------
/networks/critic.py:
--------------------------------------------------------------------------------
from keras.layers import Dense, Input, merge
from keras.models import Model
from keras.optimizers import Adam
import keras.backend as keras_backend
import tensorflow


class CriticNetwork(object):
    def __init__(self, tensorflow_session, state_size, action_size,
                 hidden_units=(300, 600), learning_rate=0.0001, batch_size=64,
                 tau=0.001):
        """
        Constructor for the Critic network.

        :param tensorflow_session: The tensorflow session.
            See https://www.tensorflow.org for more information on tensorflow
            sessions.
        :param state_size: An integer denoting the dimensionality of the states
            in the current problem
        :param action_size: An integer denoting the dimensionality of the
            actions in the current problem
        :param hidden_units: An iterable defining the number of hidden units in
            each layer. Soon to be deprecated. default: (300, 600)
        :param learning_rate: A float denoting the speed at which the network
            will learn. default: 0.0001
        :param batch_size: An integer denoting the batch size. default: 64
        :param tau: A float denoting the rate at which the target model will
            track the main model. Formally, the tracking function is defined as:

                target_weights = tau * main_weights + (1 - tau) * target_weights

            For more explanation on how and why this happens, please refer to
            the DDPG paper:

            Lillicrap, Hunt, Pritzel, Heess, Erez, Tassa, Silver, & Wierstra.
            Continuous Control with Deep Reinforcement Learning. arXiv preprint
            arXiv:1509.02971, 2015.

            default: 0.001
        """
        # Store parameters
        self._tensorflow_session = tensorflow_session
        self._state_size = state_size
        self._action_size = action_size
        self._batch_size = batch_size
        self._tau = tau
        self._learning_rate = learning_rate
        self._hidden = hidden_units

        # Let tensorflow and keras work together
        keras_backend.set_session(tensorflow_session)

        # Generate the main model
        self._model, self._state_input, self._action_input = \
            self._generate_model()
        # Generate a carbon copy of the model so that we avoid divergence
        self._target_model, self._target_state_input, self._target_action_input = \
            self._generate_model()
        # Gradients of the Q value with respect to the action, for the policy
        # update
        self._action_gradients = tensorflow.gradients(self._model.output,
                                                      self._action_input)
        self._tensorflow_session.run(tensorflow.initialize_all_variables())
    def get_gradients(self, states, actions):
        """
        Returns the gradients of the critic's Q value with respect to the
        given actions, evaluated at the given states.

        :param states: The states at which to evaluate the gradients
        :param actions: The actions at which to evaluate the gradients
        :return: The gradients of Q with respect to the actions
        """
        return self._tensorflow_session.run(self._action_gradients, feed_dict={
            self._state_input: states,
            self._action_input: actions
        })[0]

    def train(self, states, actions, q_targets):
        """
        Trains the main network on a batch of (state, action) pairs and their
        corresponding Q-value targets.

        :param states: The states in the sampled batch
        :param actions: The actions in the sampled batch
        :param q_targets: The Q-value targets to regress towards
        :return: None
        """
        # Single gradient-descent step on the mean-squared error between the
        # predicted and target Q values.
        self._model.train_on_batch([states, actions], q_targets)

    def train_target_model(self):
        """
        Updates the weights of the target network to slowly track the main
        network.

        The speed at which the target network tracks the main network is
        defined by tau, given in the constructor to this class. Formally,
        the tracking function is defined as:

            target_weights = tau * main_weights + (1 - tau) * target_weights

        :return: None
        """
        main_weights = self._model.get_weights()
        target_weights = self._target_model.get_weights()
        target_weights = [self._tau * main_weight + (1 - self._tau) *
                          target_weight for main_weight, target_weight in
                          zip(main_weights, target_weights)]
        self._target_model.set_weights(target_weights)

    def _generate_model(self):
        """
        Generates the model based on the hyperparameters defined in the
        constructor.

        :return: a tuple containing references to the model, state input layer,
            and action input layer
        """
        state_input_layer = Input(shape=[self._state_size])
        action_input_layer = Input(shape=[self._action_size])
        s_layer = Dense(self._hidden[0], activation='relu')(state_input_layer)
        # The action branch must match the width of the state branch so the
        # two can be summed below.
        a_layer = Dense(self._hidden[1], activation='linear')(action_input_layer)
        hidden = Dense(self._hidden[1], activation='linear')(s_layer)
        hidden = merge([hidden, a_layer], mode='sum')
        hidden = Dense(self._hidden[1], activation='relu')(hidden)
        output_layer = Dense(1, activation='linear')(hidden)
        model = Model(input=[state_input_layer, action_input_layer],
                      output=output_layer)
        model.compile(loss='mse', optimizer=Adam(lr=self._learning_rate))
        return model, state_input_layer, action_input_layer
--------------------------------------------------------------------------------
/networks/actor.py:
--------------------------------------------------------------------------------
from __future__ import print_function, division
from keras.layers import Dense, Input
from keras.models import Model
import tensorflow
import keras.backend as keras_backend


class Actor(object):
    """
    Object representing the actor network, which approximates the function:

        u(s) -> a

    where u (the Greek letter mu) is the deterministic policy mapping from
    states s to actions a.
    """
    def __init__(self, tensorflow_session, state_size, action_size,
                 hidden_units=(300, 600), learning_rate=0.0001, batch_size=64,
                 tau=0.001):
        """
        Constructor for the Actor network

        :param tensorflow_session: The tensorflow session.
            See https://www.tensorflow.org for more information on tensorflow
            sessions.
        :param state_size: An integer denoting the dimensionality of the states
            in the current problem
        :param action_size: An integer denoting the dimensionality of the
            actions in the current problem
        :param hidden_units: An iterable defining the number of hidden units in
            each layer. Soon to be deprecated. default: (300, 600)
        :param learning_rate: A float denoting the speed at which the network
            will learn. default: 0.0001
        :param batch_size: An integer denoting the batch size. default: 64
        :param tau: A float denoting the rate at which the target model will
            track the main model. Formally, the tracking function is defined as:

                target_weights = tau * main_weights + (1 - tau) * target_weights

            For more explanation on how and why this happens, please refer to
            the DDPG paper:

            Lillicrap, Hunt, Pritzel, Heess, Erez, Tassa, Silver, & Wierstra.
            Continuous Control with Deep Reinforcement Learning. arXiv preprint
            arXiv:1509.02971, 2015.

            default: 0.001
        """
        # Store parameters
        self._tensorflow_session = tensorflow_session
        self._state_size = state_size
        self._action_size = action_size
        self._batch_size = batch_size
        self._tau = tau
        self._learning_rate = learning_rate
        self._hidden = hidden_units

        # Let tensorflow and keras work together
        keras_backend.set_session(tensorflow_session)

        # Generate the main model
        self._model, self._model_weights, self._model_input = \
            self._generate_model()
        # Generate a carbon copy of the model so that we avoid divergence
        self._target_model, self._target_weights, self._target_state = \
            self._generate_model()

        # Generate tensors to hold the gradients for our Policy Gradient update
        self._action_gradients = tensorflow.placeholder(tensorflow.float32,
                                                        [None, action_size])
        self._parameter_gradients = tensorflow.gradients(self._model.output,
                                                         self._model_weights,
                                                         -self._action_gradients)
        self._gradients = zip(self._parameter_gradients, self._model_weights)

        # Define the optimisation function
        self._optimize = tensorflow.train.AdamOptimizer(learning_rate)\
            .apply_gradients(self._gradients)

        # And initialise all tensorflow variables
        self._tensorflow_session.run(tensorflow.initialize_all_variables())

    def train(self, states, action_gradients):
        """
        Updates the weights of the main network by applying the deterministic
        policy gradient: the gradients of the critic's Q value with respect to
        the actions are back-propagated through the actor's parameters.

        :param states: The states used as input to the network
        :param action_gradients: The gradients of the critic's Q value with
            respect to the actions taken in those states
        :return: None
        """
        self._tensorflow_session.run(self._optimize, feed_dict={
            self._model_input: states,
            self._action_gradients: action_gradients
        })

    def train_target_model(self):
        """
        Updates the weights of the target network to slowly track the main
        network.

        The speed at which the target network tracks the main network is
        defined by tau, given in the constructor to this class. Formally,
        the tracking function is defined as:

            target_weights = tau * main_weights + (1 - tau) * target_weights

        :return: None
        """
        main_weights = self._model.get_weights()
        target_weights = self._target_model.get_weights()
        target_weights = [self._tau * main_weight + (1 - self._tau) *
                          target_weight for main_weight, target_weight in
                          zip(main_weights, target_weights)]
        self._target_model.set_weights(target_weights)
    def _generate_model(self):
        """
        Generates the model based on the hyperparameters defined in the
        constructor.

        :return: a tuple containing references to the model, its trainable
            weights, and its input layer
        """
        input_layer = Input(shape=[self._state_size])
        layer = Dense(self._hidden[0], activation='relu')(input_layer)
        layer = Dense(self._hidden[1], activation='relu')(layer)
        output_layer = Dense(self._action_size, activation='sigmoid')(layer)
        model = Model(input=input_layer, output=output_layer)
        return model, model.trainable_weights, input_layer
--------------------------------------------------------------------------------
/ddpg.py:
--------------------------------------------------------------------------------
from __future__ import absolute_import

from collections import deque
import random

import numpy
import tensorflow

from networks.actor import Actor
from networks.critic import CriticNetwork as Critic


class DDPG(object):
    def __init__(self, state_size, action_size, actor_hidden_units=(300, 600),
                 actor_learning_rate=0.0001, critic_hidden_units=(300, 600),
                 critic_learning_rate=0.001, batch_size=64, discount=0.99,
                 memory_size=10000, tau=0.001):
        """
        Constructs a DDPG Agent with the given parameters

        :param state_size: Int denoting the world's state dimensionality
        :param action_size: Int denoting the world's action dimensionality
        :param actor_hidden_units: Tuple(Int) denoting the actor's hidden layer
            sizes. Each element in the tuple represents a layer in the Actor
            network and the Int denotes the number of neurons in the layer.
        :param actor_learning_rate: Float denoting the learning rate of the
            Actor network. Best to be some small number close to 0.
        :param critic_hidden_units: Tuple(Int) denoting the critic's hidden
            layer sizes. Each element in the tuple represents a layer in the
            Critic network and the Int denotes the number of neurons in the
            layer.
        :param critic_learning_rate: Float denoting the learning rate of the
            Critic network. Best to be some small number close to 0.
        :param batch_size: Int denoting the batch size for training.
        :param discount: Float denoting the discount (gamma) given to future
            potential rewards when calculating q values
        :param memory_size: Int denoting the number of (state, action, reward)
            transitions that the agent will remember
        :param tau: Float denoting the rate at which the target networks track
            the main networks
        """

        self._discount = discount
        self._batch_size = batch_size
        self._memory_size = memory_size

        tensorflow_session = self._generate_tensorflow_session()

        self._actor = Actor(tensorflow_session=tensorflow_session,
                            state_size=state_size, action_size=action_size,
                            hidden_units=actor_hidden_units,
                            learning_rate=actor_learning_rate,
                            batch_size=batch_size, tau=tau)

        self._critic = Critic(tensorflow_session=tensorflow_session,
                              state_size=state_size, action_size=action_size,
                              hidden_units=critic_hidden_units,
                              learning_rate=critic_learning_rate,
                              batch_size=batch_size, tau=tau)

        self._memory = deque()

    def _generate_tensorflow_session(self):
        """
        Generates and returns the tensorflow session

        :return: the Tensorflow Session
        """
        config = tensorflow.ConfigProto()
        config.gpu_options.allow_growth = True
        return tensorflow.Session(config=config)

    def get_action(self, state):
        """
        Returns the best action predicted by the agent given the current state.

        :param state: numpy array denoting the current state.
        :return: numpy array denoting the predicted action.
        """
        return self._actor._model.predict(state)

    def train(self):
        """
        Trains the DDPG Agent from its current memory.

        Please note that the agent must have gone through more steps than the
        specified batch size before this method will do anything.

        :return: None
        """
        if len(self._memory) > self._batch_size:
            self._train()

    def _train(self):
        """
        Helper method for train. Takes care of sampling, training and
        updating both the actor and critic networks.

        :return: None
        """
        states, actions, rewards, done, next_states = self._get_sample()
        self._train_critic(states, actions, next_states, done, rewards)
        self._train_actor(states)
        self._update_target_models()

    def _get_sample(self):
        """
        Draws a random sample of size self._batch_size from the agent's current
        memory.

        :return: Tuple of numpy arrays denoting the sampled states, actions,
            rewards, done flags, and next states.
        """
        sample = random.sample(self._memory, self._batch_size)
        states, actions, rewards, done, next_states = zip(*sample)
        # Stack the sampled tuples into numpy arrays so they can be fed to the
        # networks directly.
        return (numpy.asarray(states), numpy.asarray(actions),
                numpy.asarray(rewards), numpy.asarray(done),
                numpy.asarray(next_states))

    def _train_critic(self, states, actions, next_states, done, rewards):
        """
        Trains the critic network

            C(s, a) -> q

        :param states: List of the states to train the network with
        :param actions: List of the actions to train the network with
        :param next_states: List of the t+1 states to train the network with
        :param done: List of booleans denoting whether each step was terminal
        :param rewards: List of rewards used to calculate q_targets

        :return: None
        """
        q_targets = self._get_q_targets(next_states, done, rewards)
        self._critic.train(states, actions, q_targets)

    def _get_q_targets(self, next_states, done, rewards):
        """
        Calculates the q targets with the following formula

            q = r + gamma * next_q

        unless the episode has ended, in which case

            q = r

        :param next_states: List(List(Float)) denoting the t+1 states
        :param done: List(Bool) denoting whether each step was an exit step
        :param rewards: List(Float) denoting the reward given in each step
        :return: The q targets
        """
        # The targets are computed from the target networks, as in the DDPG
        # paper, so that the regression targets move slowly.
        next_actions = self._actor._target_model.predict(next_states)
        next_q_values = self._critic._target_model.predict(
            [next_states, next_actions]).flatten()
        q_targets = [reward if this_done
                     else reward + self._discount * next_q_value
                     for (reward, next_q_value, this_done)
                     in zip(rewards, next_q_values, done)]
        return numpy.reshape(q_targets, (-1, 1))

    def _train_actor(self, states):
        """
        Trains the actor network using the calculated deterministic policy
        gradients.

        :param states: List(List(Float)) denoting the states to train the Actor
            on
        :return: None
        """
        gradients = self._get_gradients(states)
        self._actor.train(states, gradients)

    def _get_gradients(self, states):
        """
        Calculates the deterministic policy gradient term for Actor training.

        :param states: The states to calculate the gradients for.
        :return: The gradients of the critic's Q value with respect to the
            actions chosen by the actor in the given states
        """
        action_for_gradients = self._actor._model.predict(states)
        return self._critic.get_gradients(states, action_for_gradients)
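    # A note on the policy update driven by _get_gradients and _train_actor
    # above: following the deterministic policy gradient theorem (Silver et
    # al., 2014) used by DDPG, the actor parameters theta are moved along
    #
    #     grad_theta J  ~=  grad_a Q(s, a)|a=mu(s)  *  grad_theta mu(s)
    #
    # so the critic supplies grad_a Q (Critic.get_gradients) and the actor
    # back-propagates that signal through its own parameters (Actor.train).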

    def _update_target_models(self):
        """
        Updates the target models to slowly track the main models

        :return: None
        """
        self._critic.train_target_model()
        self._actor.train_target_model()

    def remember(self, state, action, reward, done, next_state):
        """
        Stores the given state, action, reward etc. in the Agent's memory.

        :param state: The state to remember
        :param action: The action to remember
        :param reward: The reward to remember
        :param done: Whether this was a final state
        :param next_state: The next state (if applicable)
        :return: None
        """
        self._memory.append((state, action, reward, done, next_state))
        if len(self._memory) > self._memory_size:
            self._memory.popleft()
--------------------------------------------------------------------------------