├── README.md
└── imgs
├── Q-learning.png
├── RL_process.png
└── off-on-policy.png
/README.md:
--------------------------------------------------------------------------------
1 | # Hugging Face Deep RL Course notes
2 |
3 | Class notes of [The Hugging Face Deep Reinforcement Learning Course 🤗 (v2.0)](https://huggingface.co/deep-rl-course/unit0/introduction)
4 |
5 | ## Syllabus
6 | - [Unit 1: Introduction to Deep Reinforcement Learning](#unit-1-introduction-to-deep-reinforcement-learning)
7 | - [Unit 2: Q-Learning](#unit-2-q-learning)
8 | - [Unit 3: Deep Q-Learning with Atari Games](#unit-3-deep-q-learning-with-atari-games)
9 | - [Bonus Unit: Automatic Hyperparameter Tuning using Optuna](#bonus-unit-automatic-hyperparameter-tuning-using-optuna)
10 |
11 | ## Unit 1: Introduction to Deep Reinforcement Learning
12 |
13 | The idea behind Reinforcement Learning is that an agent (an AI) will learn from the environment by interacting with it (through trial and error) and receiving rewards (negative or positive) as feedback for performing actions.
14 |
15 | Learning from interactions with the environment comes from our natural experiences.
16 |
17 | ![The RL process](imgs/RL_process.png)
18 |
19 | A simple RL process would consist of an agent that:
20 | * receives a **state** $S_0$ from the **Environment**
21 | * based on that **state** $S_0$, the agent takes **action** $A_0$
22 | * **Environment** goes to a **new state** $S_1$
23 | * the **Environment** gives some **reward** $R_1$ to the agent
24 |
25 | This RL loop outputs a sequence $(S_0, A_0, R_1, S_1)$
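
As a concrete illustration, here is a minimal sketch of that loop written against the classic Gym API (the `CartPole-v1` environment and the random action choice are placeholders for illustration, not part of the course):

```python
import gym

env = gym.make("CartPole-v1")     # any Gym environment would do

state = env.reset()               # S0: initial state from the Environment
done = False
cumulative_reward = 0.0

while not done:
    action = env.action_space.sample()                 # A_t (a random policy for now)
    new_state, reward, done, info = env.step(action)   # S_{t+1}, R_{t+1}
    cumulative_reward += reward
    state = new_state                                  # the new state becomes the current state

print(f"Return of this episode: {cumulative_reward}")
```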
26 |
27 | The agent’s goal is to maximize its cumulative reward, called **the expected return**.
28 |
29 | **Reward hypothesis:** all goals can be described as the maximization of the expected return.
30 |
31 | **Markov property:** the agent needs only the current state to decide what action to take, not the history of all the states and actions it took before.
32 |
33 | **Observations/States:**
34 | * **State:** a complete description of the state of the world (there is no hidden information), as in a fully observed environment like a chess board.
35 |
36 | * **Observation:** a partial description of the state, as in a partially observed environment like a Super Mario Bros world.
37 |
38 | **Action Space:** set of all possible actions in an environment
39 | * **Discrete space:** the number of possible actions is finite like in a Super Mario Bros game.
40 | * **Continuous space:** the number of possible actions is infinite, like the actions of a self-driving car (e.g. its steering angle can take any value).
41 |
42 | **Reward:** the only feedback for the agent; thanks to it, the agent knows whether the action taken was good or not. ($\tau$, read Tau, is a trajectory, i.e. a sequence of states and actions.)
43 | * Expected cumulative reward: $\displaystyle R(\tau) = r_{t+1} + r_{t+2} + r_{t+3} + ... = \sum_{k=0}^{\infty} r_{t+k+1}$
44 | * Discounted expected cumulative reward: $\displaystyle R(\tau) = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + ... = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}$. Used because rewards that come sooner are more likely to happen, so we care more about them. The discount rate $\gamma$ is usually between 0.95 and 0.99.
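
For instance, a quick numeric check of the discounted return with $\gamma = 0.95$ (the reward values are made up for illustration):

```python
gamma = 0.95
rewards = [1.0, 0.0, 0.0, 1.0]    # hypothetical r_{t+1}, r_{t+2}, r_{t+3}, r_{t+4}

# R(tau) = sum_k gamma^k * r_{t+k+1}
discounted_return = sum(gamma**k * r for k, r in enumerate(rewards))
print(discounted_return)          # 1.0 + 0.95**3 * 1.0 ≈ 1.857
```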
45 |
46 | **Types of tasks:**
47 | * **Episodic tasks:** we have a starting point and an ending point (a terminal state). This creates an episode: a list of States, Actions, Rewards, and new States (like in a Super Mario level).
48 | * **Continuing tasks:** tasks that continue forever with no terminal state (automated stock trading for example).
49 |
50 | **Exploration/Exploitation trade-off:**
51 | * **Exploration** is exploring the environment by trying random actions in order to find more information about the environment.
52 | * **Exploitation** is exploiting known information to maximize the reward.
53 |
54 | The **Policy** $\pi$ is the brain of an RL agent: it takes a state as input and gives an action as output.
55 | * **Deterministic policy:** $\pi(s) = a$ a policy that given a state will always return the same action.
56 | * **Stochastic policy:** $\pi(a|s) = P(A_t = a|S_t = s)$ outputs a probability distribution over actions.
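
To make the distinction concrete, here is a tiny toy example (the states, actions, and probabilities are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)
n_actions = 2

def deterministic_policy(state):
    # pi(s) = a: the same state always yields the same action
    return 0 if state < 5 else 1

def stochastic_policy(state):
    # pi(a|s): a probability distribution over actions, from which we sample
    probs = np.array([0.7, 0.3]) if state < 5 else np.array([0.2, 0.8])
    return rng.choice(n_actions, p=probs)

print(deterministic_policy(3))    # always 0
print(stochastic_policy(3))       # 0 with probability 0.7, 1 with probability 0.3
```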
57 |
58 | **Policy-based methods vs Value-based methods:**
59 | * **Policy-based methods:** we learn a policy function directly.
60 | * **Value-based methods:** instead of training a policy function, we train a value function that maps a state to the expected value of being at that state, and we construct a policy from this value function.
61 |
62 | ## Unit 2: Q-Learning
63 |
64 | **Value functions:**
65 | * **State-value function for policy $\pi$ :**
66 |
67 | $\displaystyle V_{\pi}(s) = E_{\pi}(G_t|S_t = s) = E_{\pi}(\sum_{k=0}^{\infty} \gamma^k r_{t+k+1}|S_t = s)$.
68 |
69 | This is the expected return if the agent starts at state $s$ and then follows the policy forever after. In other words the value of the state $s$ under the policy $\pi$.
70 |
71 | * **Action-value function for policy $\pi$ :**
72 |
73 | $\displaystyle Q_{\pi}(s, a) = E_{\pi}(G_t|S_t = s, A_t = a) = E_{\pi}(\sum_{k=0}^{\infty} \gamma^k r_{t+k+1}|S_t = s, A_t = a)$.
74 |
75 | This is the expected return if the agent starts at state $s$, takes the action $a$ and then follows the policy forever after. In other words the value of the pair $(s, a)$ under the policy $\pi$.
76 |
77 | **Bellman Equations of value function:**
78 | Each of the value functions satisfy a particular recursive relationship called the **Bellman Equation**.
79 |
80 | * **Bellman Equation of the state-value function :**
81 |
82 | $V_{\pi}(s) = E_{\pi}(R_{t+1} + \gamma V_{\pi}(S_{t+1}) | S_t = s)$.
83 |
84 | * **Bellman Equation of the action-value function :**
85 |
86 | $Q_{\pi}(s, a) = E_{\pi}(R_{t+1} + \gamma Q_{\pi}(S_{t+1}, A_{t+1})|S_t = s, A_t = a)$.
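
As a sanity check, iterating the Bellman equation as an update rule (policy evaluation) converges to the state values. The tiny MDP below is an invented example, not from the course:

```python
import numpy as np

# Deterministic 3-state chain under a fixed policy: s0 -> s1 -> s2 (terminal),
# with a reward of 1 for entering s2 and 0 otherwise.
gamma = 0.9
transitions = {0: (1, 0.0), 1: (2, 1.0), 2: (2, 0.0)}   # state -> (next_state, reward)
V = np.zeros(3)

for _ in range(50):                       # repeated Bellman backups
    for s, (s_next, r) in transitions.items():
        V[s] = r + gamma * V[s_next]      # V(s) = E[R_{t+1} + gamma * V(S_{t+1})]

print(V)   # ≈ [0.9, 1.0, 0.0]
```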
87 |
88 | **Monte Carlo vs TD Learning:**
89 |
90 | * **Monte Carlo:** we update the value function from **a complete episode**, and so we use the actual accurate discounted return of this episode.
91 |
92 | $V(S_{t}) \leftarrow V(S_{t}) + \alpha [G_{t} - V(S_{t})]$.
93 |
94 | * **TD Learning:** we update the value function after **a single step**, so we replace $G_{t}$, which we don't have yet, with an estimated return called the TD target.
95 |
96 | $V(S_{t}) \leftarrow V(S_{t}) + \alpha [R_{t+1} + \gamma V(S_{t+1}) - V(S_{t})]$.
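
In code, the two updates look like this (a minimal sketch with a tabular value function stored as a NumPy array; `alpha` is the learning rate):

```python
import numpy as np

alpha, gamma = 0.1, 0.99
V = np.zeros(16)   # e.g. one value per state of a 4x4 grid

def monte_carlo_update(V, state, G):
    # Monte Carlo: wait until the episode ends, then use the actual return G_t
    V[state] += alpha * (G - V[state])

def td_update(V, state, reward, new_state):
    # TD(0): bootstrap with the estimated return R_{t+1} + gamma * V(S_{t+1})
    td_target = reward + gamma * V[new_state]
    V[state] += alpha * (td_target - V[state])
```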
97 |
98 | **Link between Value and Policy:** $\pi^{*}(s) = \arg\max_{a} Q^{*}(s, a)$
99 |
100 | #### **Code**
101 |
102 | Remember we have two policies since Q-Learning is an **off-policy** algorithm. This means we're using a **different policy for acting and updating the value function**.
103 |
104 | - Epsilon-greedy policy (acting policy)
105 | - Greedy policy (updating policy)
106 |
107 | The greedy policy will also be the final policy we use once the Q-learning agent is trained. The greedy policy selects the action with the highest value for the current state from the Q-table.
108 |
109 |
110 | ![Off-policy vs on-policy](imgs/off-on-policy.png)
111 |
112 | ![Q-Learning](imgs/Q-learning.png)
113 |
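The snippets below use the `FrozenLake-v1` environment and assume the following imports (and a classic Gym version whose `env.step` returns four values, which matches the code as written):

```python
import random

import gym            # classic Gym API; newer Gymnasium releases return 5 values from env.step
import numpy as np
from tqdm import tqdm
```
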
114 | ```python
115 | # Training parameters
116 | n_training_episodes = 10000 # Total training episodes
117 | learning_rate = 0.7 # Learning rate
118 |
119 | # Evaluation parameters
120 | n_eval_episodes = 100 # Total number of test episodes
121 |
122 | # Environment parameters
123 | env_id = "FrozenLake-v1" # Name of the environment
124 | max_steps = 99 # Max steps per episode
125 | gamma = 0.95 # Discounting rate
126 | eval_seed = [] # The evaluation seed of the environment
127 |
128 | # Exploration parameters
129 | max_epsilon = 1.0 # Exploration probability at start
130 | min_epsilon = 0.05 # Minimum exploration probability
131 | decay_rate = 0.0005 # Exponential decay rate for exploration prob
132 | ```
133 |
134 | ```python
135 | def initialize_Q_table(env):
136 | num_possible_states = env.observation_space.n
137 |     num_possible_actions = env.action_space.n
138 |
139 |     Qtable = np.zeros((num_possible_states, num_possible_actions))  # one row per state, one column per action
140 | return Qtable
141 | ```
142 |
143 | ```python
144 | def greedy_policy(Qtable, state):
145 | # Exploitation: take the action with the highest state, action value
146 | action = np.argmax(Qtable[state])
147 |
148 | return action
149 |
150 |
151 | def epsilon_greedy_policy(Qtable, state, epsilon):
152 |     # Randomly generate a float between 0 and 1
153 |     random_num = random.uniform(0, 1)
154 |
155 |     if random_num > epsilon:  # exploitation
156 |         action = greedy_policy(Qtable, state)
157 |     else:  # exploration (env is the global environment created below)
158 |         action = env.action_space.sample()
159 |
160 | return action
161 | ```
162 |
163 | ```python
164 | def train(n_training_episodes, min_epsilon, max_epsilon, decay_rate, env, max_steps, Qtable):
165 | """
166 | For episode in the total of training episodes:
167 |
168 | Reduce epsilon (since we need less and less exploration)
169 | Reset the environment
170 |
171 | For step in max timesteps:
172 | Choose the action At using epsilon greedy policy
173 | Take the action (a) and observe the outcome state(s') and reward (r)
174 | Update the Q-value using Bellman equation Q(s,a) + lr [R(s,a) + gamma * max Q(s',a') - Q(s,a)]
175 | If done, finish the episode
176 | Our next state is the new state
177 | """
178 |
179 | for episode in tqdm(range(n_training_episodes)):
180 |
181 | # Reduce epsilon (because we need less and less exploration)
182 | epsilon = min_epsilon + (max_epsilon - min_epsilon)*np.exp(-decay_rate*episode)
183 |
184 | state = env.reset()
185 | step = 0
186 | done = False
187 |
188 | for step in range(max_steps):
189 |
190 | action = epsilon_greedy_policy(Qtable, state, epsilon)
191 |
192 | new_state, reward, done, info = env.step(action)
193 |
194 | # Update Q(s,a):= Q(s,a) + lr [R(s,a) + gamma * max Q(s',a') - Q(s,a)]
195 | # NB: Qtable[new_state][greedy_policy(Qtable, new_state)] = np.max(Qtable[new_state])
196 | Qtable[state][action] += learning_rate * (reward + gamma * np.max(Qtable[new_state]) - Qtable[state][action])
197 |
198 | if done:
199 | break
200 |
201 | # Our next state is the new state
202 | state = new_state
203 |
204 | return Qtable
205 | ```
206 |
207 | ```python
208 | env = gym.make('FrozenLake-v1', map_name="4x4", is_slippery=False)
209 | Qtable = initialize_Q_table(env)
210 | Qtable = train(n_training_episodes, min_epsilon, max_epsilon, decay_rate, env, max_steps, Qtable)
211 | ```
212 | ## Unit 3: Deep Q-Learning with Atari Games
213 |
214 | ## Bonus Unit: Automatic Hyperparameter Tuning using Optuna
215 |
--------------------------------------------------------------------------------
/imgs/Q-learning.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Elameri/huggingface-deep-rl-class-notes/df71bb03e79873cc0f59979f5b0b34af1af4e562/imgs/Q-learning.png
--------------------------------------------------------------------------------
/imgs/RL_process.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Elameri/huggingface-deep-rl-class-notes/df71bb03e79873cc0f59979f5b0b34af1af4e562/imgs/RL_process.png
--------------------------------------------------------------------------------
/imgs/off-on-policy.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Elameri/huggingface-deep-rl-class-notes/df71bb03e79873cc0f59979f5b0b34af1af4e562/imgs/off-on-policy.png
--------------------------------------------------------------------------------