├── _config.yml
├── Misc
│   ├── Graph.png
│   ├── Loss.jpg
│   ├── Plot.png
│   ├── Q-NN.jpg
│   ├── Initial.gif
│   ├── NextGen.gif
│   ├── Q-table.jpg
│   ├── Target.jpg
│   ├── Double Q.png
│   ├── Estimation.jpg
│   └── Q-learning.jpg
├── LICENSE
├── README.md
└── Code source
    └── Lunar_Lander_v2.py
/_config.yml:
--------------------------------------------------------------------------------
1 | theme: jekyll-theme-time-machine
--------------------------------------------------------------------------------
/Misc/Graph.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/anh-nn01/Lunar-Lander-Double-Deep-Q-Networks/HEAD/Misc/Graph.png
--------------------------------------------------------------------------------
/Misc/Loss.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/anh-nn01/Lunar-Lander-Double-Deep-Q-Networks/HEAD/Misc/Loss.jpg
--------------------------------------------------------------------------------
/Misc/Plot.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/anh-nn01/Lunar-Lander-Double-Deep-Q-Networks/HEAD/Misc/Plot.png
--------------------------------------------------------------------------------
/Misc/Q-NN.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/anh-nn01/Lunar-Lander-Double-Deep-Q-Networks/HEAD/Misc/Q-NN.jpg
--------------------------------------------------------------------------------
/Misc/Initial.gif:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/anh-nn01/Lunar-Lander-Double-Deep-Q-Networks/HEAD/Misc/Initial.gif
--------------------------------------------------------------------------------
/Misc/NextGen.gif:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/anh-nn01/Lunar-Lander-Double-Deep-Q-Networks/HEAD/Misc/NextGen.gif
--------------------------------------------------------------------------------
/Misc/Q-table.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/anh-nn01/Lunar-Lander-Double-Deep-Q-Networks/HEAD/Misc/Q-table.jpg
--------------------------------------------------------------------------------
/Misc/Target.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/anh-nn01/Lunar-Lander-Double-Deep-Q-Networks/HEAD/Misc/Target.jpg
--------------------------------------------------------------------------------
/Misc/Double Q.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/anh-nn01/Lunar-Lander-Double-Deep-Q-Networks/HEAD/Misc/Double Q.png
--------------------------------------------------------------------------------
/Misc/Estimation.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/anh-nn01/Lunar-Lander-Double-Deep-Q-Networks/HEAD/Misc/Estimation.jpg
--------------------------------------------------------------------------------
/Misc/Q-learning.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/anh-nn01/Lunar-Lander-Double-Deep-Q-Networks/HEAD/Misc/Q-learning.jpg
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | MIT License
2 |
3 | Copyright (c) 2020 Nhu Nhat Anh
4 |
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 |
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 |
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Lunar-Lander-Double-Deep-Q-Networks
2 | An AI agent that uses Double Deep Q-Learning to teach itself how to land a lunar lander in OpenAI Gym.
3 | # AI-Lunar-Lander-v2: Keras (TF Backend)
4 | A reinforcement learning agent that uses a Deep Q-Network to play Lunar Lander.
5 |
6 |
7 | Algorithm Details and Hyperparameters:
8 | ===============
9 | * Implementation: Keras (TensorFlow backend)
10 | * Algorithm: Double Deep Q-Network (two fully connected networks: a policy network and a target network)
11 | * Each neural network has the same structure: 2 fully connected hidden layers with 128 nodes each (a setup sketch follows this list).
12 | * Optimization algorithm: Adaptive Moment Estimation (Adam)
13 | * Learning rate: **α = 0.0001**
14 | * Discount factor: **γ = 0.99**
15 | * Minimum exploration rate: **ε = 0.1**
16 | * Replay memory size: **10^6**
17 | * Mini batch size: **2^6**
18 |
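As a quick reference, here is a minimal sketch of the network and the constants listed above, mirroring the values used in `Code source/Lunar_Lander_v2.py`; the helper name `build_q_network` is illustrative only and is not part of the actual script.

```python
from collections import deque
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import Adam

def build_q_network(state_size, action_size):
    # two fully connected hidden layers with 128 nodes each, linear Q-value outputs
    model = Sequential([
        Dense(128, input_dim=state_size, activation="relu"),
        Dense(128, activation="relu"),
        Dense(action_size, activation="linear"),
    ])
    model.compile(loss="mse", optimizer=Adam(learning_rate=0.0001))  # alpha = 1e-4
    return model

GAMMA = 0.99                          # discount factor
EPSILON_MIN = 0.1                     # minimum exploration rate
replay_memory = deque(maxlen=10**6)   # replay memory size
BATCH_SIZE = 2**6                     # mini-batch size (64)
```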
19 | **Complete evolution (training process): https://www.youtube.com/watch?v=XopVALk2xb4&t=286s**
20 |
21 |
22 | Description of the problem
23 | ===============
24 |
25 | * The agent has to learn how to land a lunar lander on the Moon's surface safely, quickly, and accurately.
26 | * If the agent just lets the lander fall freely, the lander crashes and the agent receives a large negative reward from the environment.
27 | * If the agent takes too long to land (more than 20 seconds), it fails its objective and receives a negative reward from the environment.
28 | * If the agent lands the lander safely but in the wrong position, it receives either a small negative or a small positive reward, depending on how far the lander is from the landing zone.
29 | * If the agent lands in the landing zone quickly and safely, it succeeds and is awarded a large positive reward (a minimal interaction sketch follows this list).
30 |
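For concreteness, here is a minimal interaction sketch with the `LunarLander-v2` environment, using the classic `gym` API that `Code source/Lunar_Lander_v2.py` targets; the reward returned by `env.step` encodes the outcomes described above.

```python
import gym

env = gym.make("LunarLander-v2")
state = env.reset()
done = False
total_reward = 0
while not done:
    action = env.action_space.sample()            # random policy: the lander usually crashes
    state, reward, done, info = env.step(action)  # reward reflects crashing, drifting, fuel use, landing
    total_reward += reward
print("Episode reward:", total_reward)            # strongly negative for a crash, around +200 for a good landing
```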
31 | Double Deep Q Networks (DDQN):
32 | ===============
33 | * Since the state space is continuous (effectively infinite), the traditional Q-table method does not work on this problem. As a result, we need to combine Q-learning with a neural network for value approximation. The action space, however, remains discrete.
34 |
35 | **Q-learning:**
36 | 
37 |
38 | The equation above is based on the Bellman equation. You can try drawing a small example MDP graph to see intuitively why the Q-learning update converges to the optimal value function, and therefore to the optimal policy.
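In code form, this tabular Q-learning update is the standard rule sketched below, with `alpha` the learning rate and `gamma` the discount factor:

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, done, alpha, gamma):
    """One tabular Q-learning step: move Q[s, a] toward the Bellman target."""
    target = r if done else r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])
```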
39 |
40 | * For Deep Q-learning, we simply use a neural network to approximate the Q-values at each time step, and then update the network so that the estimate Q(s,a) approaches its target (a target-computation sketch appears below):
41 | * 
42 | * 
43 | * 
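Below is a minimal sketch of how these targets are built for a mini-batch, following the same scheme as the `learn()` method in `Code source/Lunar_Lander_v2.py` (a separate target network provides the bootstrapped value); the helper name `dqn_targets` is illustrative only.

```python
import numpy as np

def dqn_targets(policy_net, target_net, states, actions, rewards, next_states, dones, gamma=0.99):
    # actions: 1-D int array; dones: 1-D array of 0/1 floats
    q_next = target_net.predict(next_states)             # Q_target(s', a') for the whole batch
    q_next_max = np.max(q_next, axis=1) * (1.0 - dones)  # terminal states bootstrap 0
    targets = policy_net.predict(states)                 # start from the current estimates
    targets[np.arange(len(actions)), actions] = rewards + gamma * q_next_max
    return targets                                       # fit the policy network on (states, targets)
```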
44 |
45 |
46 |
47 | **Difference between Q-learning and DQN:**
48 | 
49 |
50 | 
51 |
52 | * Purpose of using a Double Deep Q-Network:
53 | * To stabilize the target Q-values and help training converge (a sketch of the Double DQN target appears below).
54 | * Reference: https://arxiv.org/abs/1509.06461
55 |
56 | 
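For reference, the Double DQN target from the paper above decouples action selection (done by the policy/online network) from action evaluation (done by the target network). This is a minimal sketch of that target, not code taken verbatim from this repository:

```python
import numpy as np

def double_dqn_targets(policy_net, target_net, rewards, next_states, dones, gamma=0.99):
    # y = r + gamma * Q_target(s', argmax_a Q_policy(s', a)); terminal states bootstrap 0
    best_actions = np.argmax(policy_net.predict(next_states), axis=1)  # select with the policy network
    q_eval = target_net.predict(next_states)                           # evaluate with the target network
    q_selected = q_eval[np.arange(len(best_actions)), best_actions]
    return rewards + gamma * q_selected * (1.0 - dones)
```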
57 |
58 | Q-learning has been proven mathematically to converge to the optimal policy in the tabular case, and Deep Q-Network approximation has been shown empirically to reach a good policy in a reasonable amount of time.
59 |
60 |
61 | Training Result:
62 | ===============
63 |
64 | **Before training:**
65 |
66 |
67 | **After 800 games:**
68 |
69 |
70 |
71 | **Learning curve:**
72 | 
73 |
74 | * The Blue curve shows the reward the agent earned in each episode.
75 | * The Red curve shows, for each episode on the x-axis, the average reward over the 100 most recent episodes (including that episode).
76 | * From the plot, we see that the Blue curve is much noisier, due to the exploration rate (ε never falls below 0.1) maintained throughout training and to the imperfect value approximation during the first episodes of training.
77 | * Averaging the 100 most recent rewards, however, produces a much smoother curve (a sketch of this running average follows the list).
78 | * From the curve, we can conclude that the agent has successfully learned a good policy and solved the Lunar Lander problem according to the OpenAI criterion: the average reward over 100 consecutive episodes is at least 200.
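The Red curve is a simple running mean over a window of the last 100 episode rewards, computed the same way as in `Code source/Lunar_Lander_v2.py` (here `episode_rewards` stands for the per-episode totals collected during training):

```python
from collections import deque
import numpy as np

window = deque(maxlen=100)                 # keeps only the 100 most recent episode rewards
running_avg = []
for episode_reward in episode_rewards:
    window.append(episode_reward)
    running_avg.append(np.mean(window))    # value plotted as the Red curve
```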
79 |
80 |
81 |
--------------------------------------------------------------------------------
/Code source/Lunar_Lander_v2.py:
--------------------------------------------------------------------------------
1 | import numpy as np
2 |
3 | import gym
4 |
5 | import tensorflow.compat.v1 as tf
6 | from tensorflow.keras import Model, Sequential
7 | from tensorflow.keras.layers import Dense, Embedding, Reshape
8 | from tensorflow.keras.optimizers import Adam
9 |
10 | import matplotlib.pyplot as plt
11 | import random
12 | from collections import deque
13 | import time
14 | tf.disable_v2_behavior() # testing on tensorflow 1
15 |
16 | class Agent:
17 | def __init__(self, env, optimizer, batch_size):
18 | # general info
19 | self.state_size = env.observation_space.shape[0] # number of factors in the state; e.g: velocity, position, etc
20 | self.action_size = env.action_space.n
21 | self.optimizer = optimizer
22 | self.batch_size = batch_size
23 |
24 | # allow large replay exp space
25 | self.replay_exp = deque(maxlen=1000000)
26 |
27 | self.gamma = 0.99
28 | self.epsilon = 1.0 # initialize with high exploration, which will decay later
29 |
30 | # Build Policy Network
31 | self.brain_policy = Sequential()
32 | self.brain_policy.add(Dense(128, input_dim = self.state_size, activation = "relu"))
33 | self.brain_policy.add(Dense(128 , activation = "relu"))
34 | self.brain_policy.add(Dense(self.action_size, activation = "linear"))
35 | self.brain_policy.compile(loss = "mse", optimizer = self.optimizer)
36 |
37 |
38 | # Build Target Network
39 | self.brain_target = Sequential()
40 | self.brain_target.add(Dense(128, input_dim = self.state_size, activation = "relu"))
41 | self.brain_target.add(Dense(128 , activation = "relu"))
42 | self.brain_target.add(Dense(self.action_size, activation = "linear"))
43 | self.brain_target.compile(loss = "mse", optimizer = self.optimizer)
44 |
45 |
46 | self.update_brain_target()
47 |
48 | # add new experience to the replay exp
49 | def memorize_exp(self, state, action, reward, next_state, done):
50 | self.replay_exp.append((state, action, reward, next_state, done))
51 |
52 | """
53 | # agent's brain
54 | def build_model(self):
55 | # a NN with 2 fully connected hidden layers
56 | model = Sequential()
57 | model.add(Dense(128, input_dim = self.state_size, activation = "relu"))
58 | model.add(Dense(128 , activation = "relu"))
59 | model.add(Dense(self.action_size, activation = "linear"))
60 | model.compile(loss = "mse", optimizer = self.optimizer)
61 |
62 | return model
63 | """
64 |
65 | def update_brain_target(self):
66 | self.brain_target.set_weights(self.brain_policy.get_weights()) # copy the policy network's weights into the target network
67 |
68 | def choose_action(self, state):
69 | if np.random.uniform(0.0, 1.0) < self.epsilon: # exploration
70 | action = np.random.choice(self.action_size)
71 | else:
72 | state = np.reshape(state, [1, self.state_size]) # use the agent's own state_size, not the global variable
73 | qhat = self.brain_policy.predict(state) # output Q(s,a) for all a of current state
74 | action = np.argmax(qhat[0]) # predict() returns a batch of Q-value rows; take row [0] for this single state
75 |
76 | return action
77 |
78 | # update params in NN
79 | def learn(self):
80 | """
81 | sample = random.choices(self.replay_exp, k = min(len(self.replay_exp), self.batch_size))
82 |
83 |
84 | states, actions, rewards, next_states, dones = map(list, zip(sample))
85 |
86 | # add exp to replay exp
87 | qhats_next = self.brain_target(next_states)
88 |
89 | # set all value actions of terminal state to 0
90 | qhats_next[dones] = np.zeros((self.action_size))
91 |
92 | q_targets = rewards + self.gamma * np.max(qhats_next, axis=1) # update greedily
93 |
94 | self.brain.update_nn(self.sess, states, actions, q_targets)
95 |
96 | """
97 |
98 | # take a mini-batch from replay experience
99 | cur_batch_size = min(len(self.replay_exp), self.batch_size)
100 | mini_batch = random.sample(self.replay_exp, cur_batch_size)
101 |
102 | # batch data
103 | sample_states = np.ndarray(shape = (cur_batch_size, self.state_size)) # batch arrays sized to the actual mini-batch
104 | sample_actions = np.ndarray(shape = (cur_batch_size, 1))
105 | sample_rewards = np.ndarray(shape = (cur_batch_size, 1))
106 | sample_next_states = np.ndarray(shape = (cur_batch_size, self.state_size))
107 | sample_dones = np.ndarray(shape = (cur_batch_size, 1))
108 |
109 | temp=0
110 | for exp in mini_batch:
111 | sample_states[temp] = exp[0]
112 | sample_actions[temp] = exp[1]
113 | sample_rewards[temp] = exp[2]
114 | sample_next_states[temp] = exp[3]
115 | sample_dones[temp] = exp[4]
116 | temp += 1
117 |
118 |
119 | sample_qhat_next = self.brain_target.predict(sample_next_states)
120 |
121 | # zero out Q-values of terminal next states (no bootstrapping past the end of an episode)
122 | sample_qhat_next = sample_qhat_next * (np.ones(shape = sample_dones.shape) - sample_dones)
123 | # choose max action for each state
124 | sample_qhat_next = np.max(sample_qhat_next, axis=1)
125 |
126 | sample_qhat = self.brain_policy.predict(sample_states)
127 |
128 | for i in range(cur_batch_size):
129 | a = sample_actions[i,0]
130 | sample_qhat[i,int(a)] = sample_rewards[i] + self.gamma * sample_qhat_next[i]
131 |
132 | q_target = sample_qhat
133 |
134 | self.brain_policy.fit(sample_states, q_target, epochs = 1, verbose = 0)
135 |
136 |
137 |
138 | """
139 |
140 | for state, action, reward, next_state, done in mini_batch:
141 | target_Q_s_a = 0 # new target for Q(s,a)
142 | state = np.reshape(state, [1, state_size])
143 | next_state = np.reshape(next_state, [1, state_size])
144 |
145 | # if it is not the terminal state
146 | if not done:
147 | qhat_next = self.brain_target.predict(next_state) # estimate Q(s',a')
148 | target_Q_s_a = reward + self.gamma * np.amax(qhat_next[0]) # because the output is m * n, so we need to consider the dimension [0]
149 | else:
150 | target_Q_s_a = reward
151 |
152 | target_output = self.brain_policy.predict(state) # we will replace target of Q(s,a) for specific a later
153 | target_output[0][action] = target_Q_s_a # new target for state s and action a
154 |
155 | self.brain_policy.fit(state, target_output, epochs = 1, verbose = 0)
156 |
157 | """
158 |
159 |
160 |
161 |
162 | env = gym.make("LunarLander-v2")
163 | optimizer = Adam(learning_rate = 0.0001)
164 |
165 | agent = Agent(env, optimizer, batch_size = 64)
166 | state_size = env.observation_space.shape[0]
167 |
168 | #state = env.reset()
169 |
170 | #print(state.shape)
171 |
172 | # load model
173 | #agent.brain_policy.set_weights(tf.keras.models.load_model('C:/Users/nhunh/.spyder-py3/Model1.h5').get_weights())
174 |
175 | timestep=0
176 | rewards = []
177 | aver_reward = []
178 | aver = deque(maxlen=100)
179 |
180 |
181 | for episode in range(1000):
182 | state = env.reset()
183 | total_reward = 0
184 | done = False
185 |
186 | while not done:
187 | action = agent.choose_action(state)
188 | next_state, reward, done, info = env.step(action)
189 |
190 | env.render()
191 |
192 | total_reward += reward
193 |
194 | agent.memorize_exp(state, action, reward, next_state, done)
195 | agent.learn()
196 |
197 | state = next_state
198 | timestep += 1
199 |
200 |
201 | aver.append(total_reward)
202 | aver_reward.append(np.mean(aver))
203 |
204 | rewards.append(total_reward)
205 |
206 | # update model_target after each episode
207 | agent.update_brain_target()
208 |
209 | agent.epsilon = max(0.1, 0.995 * agent.epsilon) # decaying exploration
210 | print("Episode ", episode, total_reward)
211 |
212 | """
213 | if episode % 50 == 0:
214 | agent.brain_policy.save("C:/Users/nhunh/.spyder-py3/Newest_update.h5")
215 | """
216 |
217 | plt.title("Learning Curve")
218 | plt.xlabel("Episode")
219 | plt.ylabel("Reward")
220 | plt.plot(rewards)
221 |
222 | plt.xlabel("Episode")
223 | plt.ylabel("Reward")
224 | plt.plot(aver_reward, 'r')
225 |
226 | agent.brain_policy.save('C:/Users/nhunh/.spyder-py3/Model1.h5')
227 |
228 |
229 |
--------------------------------------------------------------------------------