# Hugging Face Deep RL Course notes

Class notes for [The Hugging Face Deep Reinforcement Learning Course 🤗 (v2.0)](https://huggingface.co/deep-rl-course/unit0/introduction)

## Syllabus
- [Unit 1: Introduction to Deep Reinforcement Learning](#unit-1-introduction-to-deep-reinforcement-learning)
- [Unit 2: Q-Learning](#unit-2-q-learning)
- [Unit 3: Deep Q-Learning with Atari Games](#unit-3-deep-q-learning-with-atari-games)
- [Bonus Unit: Automatic Hyperparameter Tuning using Optuna](#bonus-unit-automatic-hyperparameter-tuning-using-optuna)

## Unit 1: Introduction to Deep Reinforcement Learning

The idea behind Reinforcement Learning is that an agent (an AI) learns from the environment by interacting with it (through trial and error) and receiving rewards (negative or positive) as feedback for its actions.

Learning from interactions with the environment comes from our natural experiences.

![RL process](imgs/RL_process.png)

A simple RL process consists of an agent that:
* receives a **state** $S_0$ from the **Environment**
* based on that **state** $S_0$, takes an **action** $A_0$
* the **Environment** moves to a **new state** $S_1$
* the **Environment** gives some **reward** $R_1$ to the agent

This RL loop outputs a sequence $(S_0, A_0, R_1, S_1)$.

The agent's goal is to maximize its cumulative reward, called **the expected return**.

**Reward hypothesis:** all goals can be described as the maximization of the expected return.

**Markov property:** the agent needs only the current state to decide what action to take, not the history of all the states and actions it took before.

**Observations/States:**
* **State:** a complete description of the state of the world (there is no hidden information), as in a fully observed environment like a chess board.

* **Observation:** a partial description of the state, as in a partially observed environment like a Super Mario Bros level.

**Action Space:** the set of all possible actions in an environment.
* **Discrete space:** the number of possible actions is finite, like in a Super Mario Bros game.
* **Continuous space:** the number of possible actions is infinite, like the possible actions of a self-driving car.

**Reward:** the only feedback for the agent; thanks to it, the agent knows whether the action it took was good or not. ($\tau$, read "tau", is a trajectory: a sequence of states and actions)
* Expected cumulative reward: $\displaystyle R(\tau) = r_{t+1} + r_{t+2} + r_{t+3} + ... = \sum_{k=0}^{\infty} r_{t+k+1}$
* Discounted expected cumulative reward: $\displaystyle R(\tau) = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + ... = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}$. Used because we care more about rewards that come sooner, since they are more likely to happen. The discount rate $\gamma$ is usually between 0.95 and 0.99. (See the short sketch below.)
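As a quick sanity check on the discounted return formula, here is a minimal Python sketch (the helper `discounted_return` and the reward list are made up for illustration, not part of the course code):

```python
# Minimal sketch: computing the discounted return of a trajectory's rewards.
# `rewards` holds r_{t+1}, r_{t+2}, ...; `gamma` is the discount rate.
def discounted_return(rewards, gamma=0.95):
    return sum(gamma**k * r for k, r in enumerate(rewards))

# With gamma = 0.95, later rewards contribute less:
# 1 + 0.95*1 + 0.95^2*1 ≈ 2.8525
print(discounted_return([1, 1, 1], gamma=0.95))
```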
**Types of tasks:**
* **Episodic tasks:** there is a starting point and an ending point (a terminal state). This creates an episode: a list of States, Actions, Rewards, and new States (like a Super Mario level).
* **Continuing tasks:** tasks that continue forever, with no terminal state (automated stock trading, for example).

**Exploration/Exploitation trade-off:**
* **Exploration** is exploring the environment by trying random actions in order to gather more information about the environment.
* **Exploitation** is exploiting known information to maximize the reward.

The **Policy** $\pi$ is the brain of an RL agent: it takes a state as input and gives an action as output.
* **Deterministic policy:** $\pi(s) = a$, a policy that, given a state, always returns the same action.
* **Stochastic policy:** $\pi(a|s) = P(A_t = a|S_t = s)$ outputs a probability distribution over actions.

**Policy-based methods vs value-based methods:**
* **Policy-based methods:** we learn a policy function directly.
* **Value-based methods:** instead of training a policy function, we train a value function that maps a state to the expected value of being at that state, and we derive a policy from this value function.

## Unit 2: Q-Learning

**Value functions:**
* **State-value function for policy $\pi$ :**

$\displaystyle V_{\pi}(s) = E_{\pi}(G_t|S_t = s) = E_{\pi}(\sum_{k=0}^{\infty} \gamma^k r_{t+k+1}|S_t = s)$

This is the expected return if the agent starts at state $s$ and then follows the policy forever after. In other words, the value of the state $s$ under the policy $\pi$.

* **Action-value function for policy $\pi$ :**

$\displaystyle Q_{\pi}(s, a) = E_{\pi}(G_t|S_t = s, A_t = a) = E_{\pi}(\sum_{k=0}^{\infty} \gamma^k r_{t+k+1}|S_t = s, A_t = a)$

This is the expected return if the agent starts at state $s$, takes the action $a$ and then follows the policy forever after. In other words, the value of the pair $(s, a)$ under the policy $\pi$.

**Bellman equations of the value functions:**
Each value function satisfies a particular recursive relationship called the **Bellman equation**.

* **Bellman equation of the state-value function:**

$V_{\pi}(s) = E_{\pi}(R_{t+1} + \gamma V_{\pi}(S_{t+1}) | S_t = s)$

* **Bellman equation of the action-value function:**

$Q_{\pi}(s, a) = E_{\pi}(R_{t+1} + \gamma Q_{\pi}(S_{t+1}, A_{t+1})|S_t = s, A_t = a)$

**Monte Carlo vs TD Learning:**

* **Monte Carlo:** we update the value function from **a complete episode**, so we use the actual discounted return $G_t$ of the episode.

$V(S_{t}) \leftarrow V(S_{t}) + \alpha [G_{t} - V(S_{t})]$

* **TD Learning:** we update the value function after **a single step**, replacing $G_{t}$ (which we don't have yet) with an estimated return called the **TD target**. (Both updates are sketched in code below.)

$V(S_{t}) \leftarrow V(S_{t}) + \alpha [R_{t+1} + \gamma V(S_{t+1}) - V(S_{t})]$

**Link between value and policy:** $\pi^{*}(s) = \arg\max_{a} Q^{*}(s, a)$, i.e. the optimal policy acts greedily with respect to the optimal action-value function.
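To make the Monte Carlo and TD updates above concrete, here is a minimal sketch of the two state-value updates on a made-up episode (the states, rewards, and function names are illustrative, not from the course):

```python
# Minimal sketch contrasting the Monte Carlo and TD(0) state-value updates.
# `V` is a dict mapping states to value estimates; the episode is made up.

alpha, gamma = 0.1, 0.95

def mc_update(V, states, rewards):
    """Monte Carlo: wait for the full episode, then use the actual return G_t."""
    G = 0.0
    # Walk the episode backwards to accumulate the discounted return.
    for state, reward in zip(reversed(states), reversed(rewards)):
        G = reward + gamma * G
        V[state] = V[state] + alpha * (G - V[state])

def td_update(V, state, reward, next_state):
    """TD(0): update after one step using the TD target R_{t+1} + gamma * V(S_{t+1})."""
    td_target = reward + gamma * V[next_state]
    V[state] = V[state] + alpha * (td_target - V[state])

V = {"s0": 0.0, "s1": 0.0, "s2": 0.0}
td_update(V, "s0", reward=1.0, next_state="s1")     # one-step update
mc_update(V, ["s0", "s1", "s2"], [1.0, 0.0, 2.0])   # whole-episode update
```

Q-learning, covered in the code below, applies the same TD idea to the action-value function $Q(s, a)$.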
#### **Code**

Remember that we have two policies, since Q-Learning is an **off-policy** algorithm. This means we use a **different policy for acting and for updating the value function**:

- Epsilon-greedy policy (acting policy)
- Greedy policy (updating policy)

The greedy policy will also be the final policy we use once the Q-learning agent is trained. The greedy policy is used to select an action from the Q-table.

![Q-Learning](imgs/off-on-policy.png)

![Q-Learning](imgs/Q-learning.png)

```python
import random

import gym
import numpy as np
from tqdm import tqdm

# Note: this code uses the older gym API (env.reset() returns the state,
# env.step() returns 4 values).

# Training parameters
n_training_episodes = 10000  # Total training episodes
learning_rate = 0.7          # Learning rate

# Evaluation parameters
n_eval_episodes = 100        # Total number of test episodes

# Environment parameters
env_id = "FrozenLake-v1"     # Name of the environment
max_steps = 99               # Max steps per episode
gamma = 0.95                 # Discount rate
eval_seed = []               # The evaluation seeds of the environment

# Exploration parameters
max_epsilon = 1.0            # Exploration probability at start
min_epsilon = 0.05           # Minimum exploration probability
decay_rate = 0.0005          # Exponential decay rate for the exploration probability
```

```python
def initialize_Q_table(env):
    num_possible_states = env.observation_space.n
    num_possible_actions = env.action_space.n

    # Q-table of shape (n_states, n_actions), initialized with zeros
    Qtable = np.zeros((num_possible_states, num_possible_actions))
    return Qtable
```

```python
def greedy_policy(Qtable, state):
    # Exploitation: take the action with the highest state-action value
    action = np.argmax(Qtable[state])

    return action


def epsilon_greedy_policy(Qtable, state, epsilon):
    # Randomly generate a number between 0 and 1
    random_num = random.uniform(0, 1)

    if random_num > epsilon:  # exploitation
        action = greedy_policy(Qtable, state)
    else:  # exploration (uses the global `env` defined below)
        action = env.action_space.sample()

    return action
```

```python
def train(n_training_episodes, min_epsilon, max_epsilon, decay_rate, env, max_steps, Qtable):
    """
    For episode in the total of training episodes:

        Reduce epsilon (since we need less and less exploration)
        Reset the environment

        For step in max timesteps:
            Choose the action At using the epsilon-greedy policy
            Take the action (a) and observe the outcome state (s') and reward (r)
            Update the Q-value: Q(s,a) := Q(s,a) + lr [R(s,a) + gamma * max Q(s',a') - Q(s,a)]
            If done, finish the episode
            Our next state is the new state
    """

    for episode in tqdm(range(n_training_episodes)):

        # Reduce epsilon (because we need less and less exploration)
        epsilon = min_epsilon + (max_epsilon - min_epsilon) * np.exp(-decay_rate * episode)

        state = env.reset()
        done = False

        for step in range(max_steps):

            action = epsilon_greedy_policy(Qtable, state, epsilon)

            new_state, reward, done, info = env.step(action)

            # Update Q(s,a) := Q(s,a) + lr [R(s,a) + gamma * max Q(s',a') - Q(s,a)]
            # NB: Qtable[new_state][greedy_policy(Qtable, new_state)] == np.max(Qtable[new_state])
            Qtable[state][action] += learning_rate * (reward + gamma * np.max(Qtable[new_state]) - Qtable[state][action])

            if done:
                break

            # Our next state is the new state
            state = new_state

    return Qtable
```

```python
env = gym.make(env_id, map_name="4x4", is_slippery=False)
Qtable = initialize_Q_table(env)
Qtable = train(n_training_episodes, min_epsilon, max_epsilon, decay_rate, env, max_steps, Qtable)
```
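The parameters above also define `n_eval_episodes` and `eval_seed`, but the notes stop after training. A minimal evaluation sketch under the same assumptions (older gym API, greedy policy at test time; the function name `evaluate_agent` is illustrative):

```python
def evaluate_agent(env, max_steps, n_eval_episodes, Qtable, seed):
    """Run the greedy policy for n_eval_episodes and report the mean/std of episode rewards."""
    episode_rewards = []
    for episode in range(n_eval_episodes):
        # Use a fixed seed per episode if one is provided (for reproducible evaluation)
        if seed:
            state = env.reset(seed=seed[episode])
        else:
            state = env.reset()
        total_reward = 0

        for step in range(max_steps):
            action = greedy_policy(Qtable, state)  # always exploit during evaluation
            new_state, reward, done, info = env.step(action)
            total_reward += reward
            if done:
                break
            state = new_state

        episode_rewards.append(total_reward)

    return np.mean(episode_rewards), np.std(episode_rewards)


mean_reward, std_reward = evaluate_agent(env, max_steps, n_eval_episodes, Qtable, eval_seed)
print(f"Mean reward: {mean_reward:.2f} +/- {std_reward:.2f}")
```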
## Unit 3: Deep Q-Learning with Atari Games

## Bonus Unit: Automatic Hyperparameter Tuning using Optuna