├── 1 - Table of contents.pdf
├── 2 - Overview of this book.pdf
├── 2 - Preface.pdf
├── 3 - Chapter 1 Basic Concepts.pdf
├── 3 - Chapter 10 Actor-Critic Methods.pdf
├── 3 - Chapter 2 State Values and Bellman Equation.pdf
├── 3 - Chapter 3 Optimal State Values and Bellman Optimality Equation.pdf
├── 3 - Chapter 4 Value Iteration and Policy Iteration.pdf
├── 3 - Chapter 5 Monte Carlo Methods.pdf
├── 3 - Chapter 6 Stochastic Approximation.pdf
├── 3 - Chapter 7 Temporal-Difference Methods.pdf
├── 3 - Chapter 8 Value Function Methods.pdf
├── 3 - Chapter 9 Policy Gradient Methods.pdf
├── 4 - Appendix.pdf
├── Book-all-in-one.pdf
├── Code for grid world
    ├── README.md
    ├── matlab_version
    │   ├── figure_plot.m
    │   ├── main_example.m
    │   ├── policy_offline_Q_learning.jpg
    │   ├── policy_offline_Q_learning.pdf
    │   ├── trajectory_Bellman_Equation.jpg
    │   ├── trajectory_Bellman_Equation.pdf
    │   ├── trajectory_Bellman_Equation_dotted.jpg
    │   ├── trajectory_Q_learning.jpg
    │   └── trajectory_Q_learning.pdf
    └── python_version
    │   ├── examples
    │       ├── __pycache__
    │       │   └── arguments.cpython-311.pyc
    │       ├── arguments.py
    │       └── example_grid_world.py
    │   ├── plots
    │       ├── sample1.png
    │       ├── sample2.png
    │       ├── sample3.png
    │       └── sample4.png
    │   └── src
    │       ├── __pycache__
    │           ├── grid_world.cpython-311.pyc
    │           ├── grid_world.cpython-38.pyc
    │           └── utils.cpython-311.pyc
    │       └── grid_world.py
├── Figure_ChineseBookCover.png
├── Figure_EnglishLectureVideo.png
├── Figure_chapterMap.png
├── Lecture slides
    ├── Readme.md
    ├── slidesContinuouslyUpdated
    │   ├── L1-Basic concepts.pdf
    │   ├── L10-Actor Critic.pdf
    │   ├── L2-Bellman equation.pdf
    │   ├── L3-Bellman optimality equation.pdf
    │   ├── L4-Value iteration and policy iteration.pdf
    │   ├── L5-Monte Carlo methods.pdf
    │   ├── L6-Stochastic approximation.pdf
    │   ├── L7-Temporal-Difference Learning.pdf
    │   ├── L8-Value function methods.pdf.pdf
    │   └── L9-Policy gradient methods.pdf
    └── slidesForMyLectureVideos
    │   ├── L1-basic concepts.pdf
    │   ├── L10_Actor Critic.pdf
    │   ├── L2-Bellman equation.pdf
    │   ├── L3-Bellman optimality equation.pdf
    │   ├── L4-Value iteration and policy iteration.pdf
    │   ├── L5-MC.pdf
    │   ├── L6-Stochastic approximation and stochastic gradient descent.pdf
    │   ├── L7-Temporal-difference learning.pdf
    │   ├── L8_Value function approximation.pdf
    │   └── L9_Policy gradient.pdf
├── Readme.md
└── springerBookCover.png


/1 - Table of contents.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MathFoundationRL/Book-Mathematical-Foundation-of-Reinforcement-Learning/8b70409e752fee2d68772f1c235b6c5726b563d4/1 - Table of contents.pdf


--------------------------------------------------------------------------------
/2 - Overview of this book.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MathFoundationRL/Book-Mathematical-Foundation-of-Reinforcement-Learning/8b70409e752fee2d68772f1c235b6c5726b563d4/2 - Overview of this book.pdf


--------------------------------------------------------------------------------
/2 - Preface.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MathFoundationRL/Book-Mathematical-Foundation-of-Reinforcement-Learning/8b70409e752fee2d68772f1c235b6c5726b563d4/2 - Preface.pdf


--------------------------------------------------------------------------------
/3 - Chapter 1 Basic Concepts.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MathFoundationRL/Book-Mathematical-Foundation-of-Reinforcement-Learning/8b70409e752fee2d68772f1c235b6c5726b563d4/3 - Chapter 1 Basic Concepts.pdf


--------------------------------------------------------------------------------
/3 - Chapter 10 Actor-Critic Methods.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MathFoundationRL/Book-Mathematical-Foundation-of-Reinforcement-Learning/8b70409e752fee2d68772f1c235b6c5726b563d4/3 - Chapter 10 Actor-Critic Methods.pdf


--------------------------------------------------------------------------------
/3 - Chapter 2 State Values and Bellman Equation.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MathFoundationRL/Book-Mathematical-Foundation-of-Reinforcement-Learning/8b70409e752fee2d68772f1c235b6c5726b563d4/3 - Chapter 2 State Values and Bellman Equation.pdf


--------------------------------------------------------------------------------
/3 - Chapter 3 Optimal State Values and Bellman Optimality Equation.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MathFoundationRL/Book-Mathematical-Foundation-of-Reinforcement-Learning/8b70409e752fee2d68772f1c235b6c5726b563d4/3 - Chapter 3 Optimal State Values and Bellman Optimality Equation.pdf


--------------------------------------------------------------------------------
/3 - Chapter 4 Value Iteration and Policy Iteration.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MathFoundationRL/Book-Mathematical-Foundation-of-Reinforcement-Learning/8b70409e752fee2d68772f1c235b6c5726b563d4/3 - Chapter 4 Value Iteration and Policy Iteration.pdf


--------------------------------------------------------------------------------
/3 - Chapter 5 Monte Carlo Methods.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MathFoundationRL/Book-Mathematical-Foundation-of-Reinforcement-Learning/8b70409e752fee2d68772f1c235b6c5726b563d4/3 - Chapter 5 Monte Carlo Methods.pdf


--------------------------------------------------------------------------------
/3 - Chapter 6 Stochastic Approximation.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MathFoundationRL/Book-Mathematical-Foundation-of-Reinforcement-Learning/8b70409e752fee2d68772f1c235b6c5726b563d4/3 - Chapter 6 Stochastic Approximation.pdf


--------------------------------------------------------------------------------
/3 - Chapter 7 Temporal-Difference Methods.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MathFoundationRL/Book-Mathematical-Foundation-of-Reinforcement-Learning/8b70409e752fee2d68772f1c235b6c5726b563d4/3 - Chapter 7 Temporal-Difference Methods.pdf


--------------------------------------------------------------------------------
/3 - Chapter 8 Value Function Methods.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MathFoundationRL/Book-Mathematical-Foundation-of-Reinforcement-Learning/8b70409e752fee2d68772f1c235b6c5726b563d4/3 - Chapter 8 Value Function Methods.pdf


--------------------------------------------------------------------------------
/3 - Chapter 9 Policy Gradient Methods.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MathFoundationRL/Book-Mathematical-Foundation-of-Reinforcement-Learning/8b70409e752fee2d68772f1c235b6c5726b563d4/3 - Chapter 9 Policy Gradient Methods.pdf


--------------------------------------------------------------------------------
/4 - Appendix.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MathFoundationRL/Book-Mathematical-Foundation-of-Reinforcement-Learning/8b70409e752fee2d68772f1c235b6c5726b563d4/4 - Appendix.pdf


--------------------------------------------------------------------------------
/Book-all-in-one.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MathFoundationRL/Book-Mathematical-Foundation-of-Reinforcement-Learning/8b70409e752fee2d68772f1c235b6c5726b563d4/Book-all-in-one.pdf


--------------------------------------------------------------------------------
/Code for grid world/README.md:
--------------------------------------------------------------------------------
  1 | # Code for the Grid-World Environment 
  2 | 
  3 | ## Overview
  4 | 
  5 | We added the code for the grid-world environment in my book. Interested readers can develop and test their own algorithms in this environment. Both Python and MATLAB versions are provided.
  6 | 
  7 | Please note that we do not provide the code of all the algorithms involved in the book. That is because they are the homework for the students in offline teaching: the students need to develop their own algorithms using the provided environment. Nevertheless, there are third-party implementations of some algorithms. Interested readers can check the links on the home page of the book.
  8 | 
  9 | I need to thank my PhD students, Yize Mi and Jianan Li, who are also the Teaching Assistants of my offline teaching. They contributed greatly to the code.
 10 | 
 11 | You are welcome to provide any feedback about the code such as bugs if detected.
 12 | 
 13 | ----
 14 | 
 15 | ## Python Version
 16 | 
 17 | ### Requirements
 18 | 
 19 | - We support Python 3.7, 3.8, 3.9,  3.10 and 3.11. Make sure the following packages are installed: `numpy` and `matplotlib`.
 20 | 
 21 | 
 22 | ### How to Run the Default Example
 23 | 
 24 | To run the example, please follow the procedures :
 25 | 
 26 | 1. Change the directory to the file `examples/`
 27 | 
 28 | ```bash
 29 | cd examples
 30 | ```
 31 | 
 32 | 2. Run the script:
 33 | 
 34 | ```bash
 35 | python example_grid_world.py
 36 | ```
 37 | 
 38 | You will see an animation as shown below:
 39 | 
 40 | - The blue star denotes the agent's current position within the grid world.
 41 | - The arrows on each grid illustrate the policy for that state. 
 42 | - The green line traces the agent's historical trajectory. 
 43 | - Obstacles are marked with yellow grids. 
 44 | - The target state is indicated by a blue grid. 
 45 | - The numerical values displayed on each grid represent the state values, which are initially generated as random numbers between 0 and 10. You may need to design your own algorithms to calculate these state values later on. 
 46 | - The horizontal number list above the grid world represents the horizontal coordinates (x-axis) of each grid.
 47 | - The vertical number list on the left side represents their vertical coordinates (y-axis).
 48 | 
 49 | ![](python_version/plots/sample4.png)
 50 | 
 51 | ### Customize the Parameters of the Grid World Environment
 52 | 
 53 | If you would like to customize your own grid world environment, please open `examples/arguments.py` and then change the following arguments:
 54 | 
 55 | "**env-size**", "**start-state**", "**target-state**", "**forbidden-states**", "**reward-target**", "**reward-forbidden**", "**reward-step**":
 56 | 
 57 | - "env-size" is represented as a tuple, where the first element represents the column index (horizontal coordinate), and the second element represents the row index (vertical coordinate).
 58 | 
 59 | - "start-state" denotes where the agent starts.
 60 | 
 61 | - "target-state" denotes the position of the target. 
 62 | 
 63 | - "forbidden-states" denotes the positions of obstacles. 
 64 | 
 65 | - "reward-target", "reward-forbidden" and "reward-step" represent the reward when reaching the target, the reward when entering a forbidden area, and the reward for each step, respectively.  
 66 | 
 67 | An example is shown below:
 68 | 
 69 | To specify the target state, modify the default value in the following sentence:
 70 | 
 71 | ```python
 72 | parser.add_argument("--target-state", type=Union[list, tuple, np.ndarray], default=(4,4))
 73 | ```
 74 | 
 75 | Please note that the coordinate system used for all states within the environment—such as the start state, target state, and forbidden states—adheres to the conventional Python setup. In this system, the point `(0, 0)` is commonly designated as the origin of coordinates.
 76 | 
 77 | 
 78 | 
 79 | If you want to save figures in each step, please modify the "debug" argument to  "True":
 80 | 
 81 | ```bash
 82 | parser.add_argument("--debug", type=bool, default=True)
 83 | ```
 84 | 
 85 | 
 86 | 
 87 | ### Create an Instance
 88 | 
 89 | If you would like to use the grid world environment to test your own RL algorithms, it is necessary to create an instance. The procedure for creating an instance and interacting with it can be found in `examples/example_grid_world.py`:
 90 | 
 91 | ```python
 92 | from src.grid_world import GridWorld
 93 | 
 94 |  	env = GridWorld()
 95 |     state = env.reset()               
 96 |     for t in range(20):
 97 |         env.render()
 98 |         action = np.random.choice(env.action_space)
 99 |         next_state, reward, done, info = env.step(action)
100 |         print(f"Step: {t}, Action: {action}, Next state: {next_state+(np.array([1,1]))}, Reward: {reward}, Done: {done}")
101 | 
102 | ```
103 | 
104 | ![](python_version/plots/sample1.png)
105 | 
106 | - The policy is constructed as a matrix form shown below, which can be designed to be deterministic or stochastic. The example is a stochastic version:
107 | 
108 | 
109 |  ```python
110 |      # Add policy
111 |      policy_matrix=np.random.rand(env.num_states,len(env.action_space))                                       
112 |      policy_matrix /= policy_matrix.sum(axis=1)[:, np.newaxis] 
113 |  ```
114 | 
115 | - Moreover, to change the shape of the arrows, you can open `src/grid_world.py`:
116 | 
117 | 
118 |  ```python
119 | self.ax.add_patch(patches.FancyArrow(x, y, dx=(0.1+action_probability/2)*dx, dy=(0.1+action_probability/2)*dy, color=self.color_policy, width=0.001, head_width=0.05))   
120 |  ```
121 | 
122 | 
123 | 
124 | ![](python_version/plots/sample2.png)
125 | 
126 | -  To add state value to each grid:
127 | 
128 | 
129 | ```python
130 | values = np.random.uniform(0,10,(env.num_states,))
131 | env.add_state_values(values)
132 | ```
133 | 
134 | ![](python_version/plots/sample3.png)
135 | 
136 | - To render the environment:
137 | 
138 | 
139 | ```python
140 | env.render(animation_interval=3)    # the figure will stop for 3 seconds
141 | ```
142 | 
143 | ------
144 | 
145 | ## MATLAB Version
146 | 
147 | ### Requirements
148 | 
149 | - MATLAB >= R2020a, in order to implement the function *exportgraphics()*.
150 | 
151 | ### How to Run the Default Example
152 | 
153 | Please start the m-file `main.m`. 
154 | 
155 | Four figures will be generated: 
156 | 
157 | The first figure is to show the policy: The length of an arrow is proportional to the probability of choosing this action, and the circle represents the agent would stay still. The meanings of other graphics and colors in this visualization are consistent with those used in Python.
158 | 
159 | <img src="matlab_version/policy_offline_Q_learning.jpg" alt="policy_offline_Q_learning" style="zoom:67%;" />
160 | 
161 | The shape of the arrow can be customized in `figure_plot_1.m`
162 | 
163 | ```matlab
164 | function drawPolicyArrow(kk, ind, i_bias, j_bias, kk_new, ratio, greenColor, action)
165 |     % Obtain the action vector
166 |     action = action{kk};
167 | 
168 |     % For the non-moving action, draw a circle
169 |     if action(1) == 0 && action(2) == 0  % Assuming the fifth action is to stay
170 |         plot(i_bias(ind), j_bias(ind), 'o', 'MarkerSize', 8, 'linewidth', 2, 'color', greenColor);
171 |         return;
172 |     else
173 |         arrow = annotation('arrow', 'Position', [i_bias(ind), j_bias(ind), ratio * kk_new * action(1), - ratio * kk_new * action(2)], 'LineStyle', '-', 'Color', greenColor, 'LineWidth', 2);
174 |         arrow.Parent = gca;
175 |     end
176 | end
177 | ```
178 | 
179 | The second and the third figures are used to draw the trajectory in two different manners: The former is for the trajectory generated by a stochastic policy. The latter is provided to show the deterministic trajectory. 
180 | 
181 | <img src="matlab_version/trajectory_Q_learning.jpg" alt="trajectory_Q_learning" style="zoom:67%;" />
182 | 
183 | <img src="matlab_version/trajectory_Bellman_Equation_dotted.jpg" alt="trajectory_Bellman_Equation_dotted" style="zoom:67%;" />
184 | 
185 | The fourth figure is used to show the state value for each state. 
186 | 
187 | <img src="matlab_version/trajectory_Bellman_Equation.jpg" alt="trajectory_Bellman_Equation" style="zoom:67%;" />
188 | 
189 | ### Code Description
190 | 
191 | - The main reinforcement learning algorithm is shown below:
192 | 
193 | 
194 | ```matlab
195 | for step = 1:episode_length
196 |     action = stochastic_policy(state_history(step, :), action_space, policy, x_length, y_length);   
197 |     % Calculate the new state and reward
198 |     [new_state, reward] = next_state_and_reward(state_history(step, :), action, x_length, y_length, final_state, obstacle_state, reward_forbidden, reward_target, reward_step);
199 |     % Update state and reward history
200 |     state_history(step+1, :) = new_state;
201 |     reward_history(step) = reward;
202 | end
203 | ```
204 | 
205 | - The policy is shown as:
206 | 
207 | 
208 | ```matlab
209 | function action = stochastic_policy(state, action_space, policy, x_length, y_length)
210 |     % Extract the action space and policy for a specific state
211 |     state_1d = x_length * (state(2)-1) + state(1); 
212 |     actions = action_space{state_1d};
213 |     policy_i = policy(state_1d, :);
214 | 
215 |     % Ensure the sum of policy probabilities is 1
216 |     assert(sum(policy_i) == 1, 'The sum of policy probabilities must be 1.');
217 |     
218 |     % Generate a random index based on policy probabilities
219 |     action_index = randsrc(1, 1, [1:length(actions); policy_i]);
220 |     
221 |     % Select an action
222 |     action = actions{action_index};
223 | end
224 | ```
225 | 
226 | - The state transition function is shown below:
227 | 
228 | 
229 | ```matlab
230 | function [new_state, reward] = next_state_and_reward(state, action, x_length, y_length, target_state, obstacle_state, reward_forbidden, reward_target, reward_step)
231 |     new_x = state(1) + action(1);
232 |     new_y = state(2) + action(2);
233 |     new_state = [new_x, new_y];
234 | 
235 |     % Check if the new state is out of bounds
236 |     if new_x < 1 || new_x > x_length || new_y < 1 || new_y > y_length
237 |         new_state = state;
238 |         reward = reward_forbidden;
239 |     elseif ismember(new_state, obstacle_state, 'rows')
240 |         % If the new state is an obstacle
241 |         reward = reward_forbidden;
242 |     elseif isequal(new_state, target_state)
243 |         % If the new state is the target state
244 |         reward = reward_target;
245 |     else
246 |          % If the new state is a normal cell
247 |         reward = reward_step;
248 |     end
249 | end
250 | ```
251 | 


--------------------------------------------------------------------------------
/Code for grid world/matlab_version/figure_plot.m:
--------------------------------------------------------------------------------
  1 | % by the Intelligent Unmanned Systems Laboratory, Westlake University, 2024
  2 | 
  3 | function figure_plot(x_length, y_length, agent_state,final_state, obstacle_state, state_value, state_number, episode_length,state_update_2d,policy, action)
  4 |     %% Inverse y coordinate
  5 |  
  6 |     
  7 |     xa_used = agent_state(:, 1) + 0.5;
  8 |     ya_used = y_length+1-agent_state(:, 2) + 0.5;
  9 |     
 10 |     
 11 |     state_space=x_length*y_length;
 12 |     
 13 |     
 14 |     xf = final_state(:, 1);
 15 |     yf = y_length+1-final_state(:, 2); 
 16 |     
 17 |     
 18 |     
 19 |     xo = obstacle_state(:, 1);
 20 |     yo = y_length+1-obstacle_state(:, 2);
 21 |     
 22 |     
 23 |     
 24 |     xs = state_update_2d(:, 1); 
 25 |     ys = state_update_2d(:, 2);
 26 |     
 27 |     state_update = (ys-1) * x_length + xs; 
 28 |                                                         
 29 |     
 30 |     
 31 |     %%
 32 |     
 33 |     greenColor=[0.4660 0.6740 0.1880]*0.8;
 34 |     
 35 |     
 36 |     
 37 |     % Initialize the figure
 38 |     figure();
 39 |     
 40 |     
 41 |     % Add labels on the axes
 42 |     addAxisLabels(x_length, y_length);
 43 |     
 44 |     % Draw the grid, state values, and policy arrows
 45 |     r = drawGridStateValuesAndPolicy(x_length, y_length, state_number, state_value, policy, greenColor, action);
 46 |     
 47 |     % Color the obstacles and the final state
 48 |     colorObstacles(xo, yo, r);
 49 |     colorFinalState(xf, yf, r);
 50 |     
 51 |     % Draw the agent
 52 |     agent = plot(xa_used, ya_used, '*', 'markersize', 15, 'linewidth', 2, 'color', 'b');  
 53 |     hold on;
 54 |     
 55 |     axis equal
 56 |     axis off
 57 |     exportgraphics(gca,'policy_offline_Q_learning.pdf') 
 58 |     
 59 |     
 60 |     
 61 |     % Initialize the figure
 62 |     figure();
 63 |     
 64 |     % Add labels on the axes
 65 |     addAxisLabels(x_length, y_length);
 66 |     
 67 |     % Draw the grid and add state values
 68 |     r = drawGridAndStateValues(x_length, y_length, state_value);
 69 |     
 70 |     % Color the obstacles and the final state
 71 |     colorObstacles(xo, yo, r);
 72 |     colorFinalState(xf, yf, r);
 73 |     
 74 |     % Compute the de-normalized states
 75 |     for i = 1:state_space
 76 |         state_two_dimension_new(i, :) = de_normalized_state(state_number(i), x_length, y_length);
 77 |     end
 78 |     
 79 |     
 80 |     % Draw the agent
 81 |     agent = plot(xa_used, ya_used, '*', 'markersize', 15, 'linewidth', 2, 'color', 'b');
 82 |     hold on;
 83 |     
 84 |     
 85 |     % Set axis properties and export the figure
 86 |     axis equal;
 87 |     axis off;
 88 |     exportgraphics(gca, 'trajectory_Bellman_Equation.pdf');
 89 |     
 90 |     
 91 |     
 92 |     
 93 |     
 94 |     
 95 |     % Initialize the figure
 96 |     figure();
 97 |     
 98 |     % Add labels on the axes
 99 |     addAxisLabels(x_length, y_length);
100 |     
101 |     % Draw the grid and add state values
102 |     r= drawGridAndStateValues(x_length, y_length, state_value);
103 |     
104 |     % Draw state transitions
105 |     for i=1:state_space
106 |         state_two_dimension_new(i,:)=de_normalized_state(state_number(i),x_length,y_length);
107 |     end
108 |     drawStateTransitions(state_space, state_update, state_two_dimension_new, episode_length);
109 |     
110 |     % Color the obstacles and the final state
111 |     
112 |     
113 |     colorObstacles(xo, yo, r);
114 |     colorFinalState(xf, yf, r);
115 |     
116 |     
117 |     % Draw the agent
118 |     agent = plot(xa_used, ya_used, '*', 'markersize', 15, 'linewidth', 2, 'color', 'b');
119 |     hold on;
120 |     
121 |     
122 |     % Set axis properties and export the figure
123 |     axis equal;
124 |     axis off;
125 |     exportgraphics(gca, 'trajectory_Q_learning.pdf');
126 |     
127 |     
128 |     
129 |     
130 |     % Initialize the figure
131 |     figure();
132 |     
133 |     % Add labels on the axes
134 |     addAxisLabels(x_length, y_length);
135 |     
136 |     % Draw the grid and add state values
137 |     r = drawGridAndStateValues(x_length, y_length, state_value);
138 |     
139 |     % Color the obstacles and the final state
140 |     colorObstacles(xo, yo, r);
141 |     colorFinalState(xf, yf, r);
142 |     
143 |     % Compute the de-normalized states
144 |     for i = 1:state_space
145 |         state_two_dimension_new(i, :) = de_normalized_state(state_number(i), x_length, y_length);
146 |     end
147 |     
148 |     % Draw transitions between states
149 |     for i = 1:episode_length - 1
150 |         line([state_two_dimension_new(state_update(i), 1) + 0.5, state_two_dimension_new(state_update(i + 1), 1) + 0.5], ...
151 |              [state_two_dimension_new(state_update(i), 2) + 0.5, state_two_dimension_new(state_update(i + 1), 2) + 0.5], ...
152 |              'Color', 'black', 'LineStyle', '--');
153 |         hold on;
154 |     end
155 |     
156 |     % Draw the agent
157 |     agent = plot(xa_used, ya_used, '*', 'markersize', 15, 'linewidth', 2, 'color', 'b');
158 |     hold on;
159 |     
160 |     
161 |     % Set axis properties and export the figure
162 |     axis equal;
163 |     axis off;
164 |     exportgraphics(gca, 'trajectory_Bellman_Equation.pdf');
165 |     
166 |     % Function definitions would be the same as provided previously
167 | 
168 | 
169 | end
170 | 
171 | 
172 | 
173 | function o=de_normalized_state(each_state,x_length,y_length)
174 | 
175 |          o=[mod(each_state-1,x_length),y_length-1-fix((each_state-1)/(x_length))]+[1,1];
176 | end
177 | 
178 | 
179 | 
180 | 
181 | function addAxisLabels(x_length, y_length)
182 |     for i = 1:x_length
183 |         text(i + 0.5, y_length + 1.1, num2str(i));
184 |     end
185 |     for j = y_length:-1:1
186 |         text(0.9, j + 0.5, num2str(y_length - j + 1));
187 |     end
188 | end
189 | 
190 | function r= drawGridStateValuesAndPolicy(x_length, y_length, state_number, state_value, policy, greenColor, action)
191 |     ind = 0;
192 |     ratio = 0.5; % adjust the length of arrow
193 |     state_coordinate = zeros(x_length * y_length, 2); % Initialize state_coordinate
194 |     for j = y_length:-1:1       
195 |         for i = 1:x_length      
196 |             r(i, j) = rectangle('Position', [i j 1 1]);
197 |             ind = ind + 1;
198 |             state_coordinate(state_number(ind), :) = [i, j];
199 |             text(i + 0.4, j + 0.5, ['s', num2str(ind)]);
200 |             hold on;
201 |             
202 |             % Calculate bias
203 |             i_bias(ind) = state_coordinate(state_number(ind), 1) + 0.5;
204 |             j_bias(ind) = state_coordinate(state_number(ind), 2) + 0.5;
205 |             
206 |             % Draw policy arrows or state markers
207 |             for kk = 1:size(policy, 2)
208 |                 if policy(state_number(ind), kk) ~= 0
209 |                     kk_new = policy(state_number(ind), kk) / 2 + 0.5;
210 |                     drawPolicyArrow(kk, ind, i_bias, j_bias, kk_new, ratio, greenColor, action);                
211 |                 end
212 |             end
213 |         end
214 |     end
215 | end
216 | 
217 | 
218 | function drawPolicyArrow(kk, ind, i_bias, j_bias, kk_new, ratio, greenColor, action)
219 |     % Obtain the action vector
220 |     action = action{kk};
221 | 
222 |     % For the non-moving action, draw a circle to represent the stay state
223 |     if action(1) == 0 && action(2) == 0  % Assuming the fifth action is to stay
224 |         plot(i_bias(ind), j_bias(ind), 'o', 'MarkerSize', 8, 'linewidth', 2, 'color', greenColor);
225 |         return;
226 |     else
227 |         % Draw an arrow to represent the moving action; note that '-' used when drawing the y-axis arrow ensures consistency with the inverse y-coordinate handling.
228 |         arrow = annotation('arrow', 'Position', [i_bias(ind), j_bias(ind), ratio * kk_new * action(1), - ratio * kk_new * action(2)], 'LineStyle', '-', 'Color', greenColor, 'LineWidth', 2);
229 |         arrow.Parent = gca;
230 |     end
231 | end
232 | 
233 | 
234 | % Function to draw the grid and add state values
235 | function r = drawGridAndStateValues(x_length, y_length, state_value)
236 |     ind = 0;
237 |     for j = y_length:-1:1       
238 |         for i = 1:x_length       
239 |             r(i, j) = rectangle('Position', [i j 1 1]);
240 |             ind = ind + 1;
241 |             text(i + 0.4, j + 0.5, num2str(round(state_value(ind), 2)));
242 |             hold on;           
243 |         end
244 |     end
245 | end
246 | 
247 | % Function to color the obstacles
248 | function colorObstacles(xo, yo, r)
249 |     for i = 1:length(xo)
250 |         r(xo(i), yo(i)).FaceColor = [0.9290 0.6940 0.1250];
251 |     end
252 | end
253 | 
254 | % Function to color the final state
255 | function colorFinalState(xf, yf, r)
256 |     r(xf, yf).FaceColor = [0.3010 0.7450 0.9330];
257 | end
258 | 
259 | % Function to draw state transitions
260 | function drawStateTransitions(state_space, state_update, state_two_dimension_new, episode_length)
261 |     for i = 1:episode_length - 1
262 |         if state_two_dimension_new(state_update(i), 2) ~= state_two_dimension_new(state_update(i + 1), 2)       
263 |             line([state_two_dimension_new(state_update(i), 1) + 0.5, state_two_dimension_new(state_update(i), 1) + 0.5 + 0.03 * randn(1), state_two_dimension_new(state_update(i + 1), 1) + 0.5 + 0.03 * randn(1), state_two_dimension_new(state_update(i + 1), 1) + 0.5], ...
264 |                  [state_two_dimension_new(state_update(i), 2) + 0.5, state_two_dimension_new(state_update(i), 2) + 0.25 + 0.03 * randn(1), state_two_dimension_new(state_update(i + 1), 2) + 0.75 + 0.03 * randn(1), state_two_dimension_new(state_update(i + 1), 2) + 0.5], ...
265 |                  'Color', 'green');
266 |         elseif state_two_dimension_new(state_update(i), 1) ~= state_two_dimension_new(state_update(i + 1), 1)       
267 |             line([state_two_dimension_new(state_update(i), 1) + 0.5, state_two_dimension_new(state_update(i), 1) + 0.25 + 0.03 * randn(1), state_two_dimension_new(state_update(i + 1), 1) + 0.75 + 0.03 * randn(1), state_two_dimension_new(state_update(i + 1), 1) + 0.5], ...
268 |                  [state_two_dimension_new(state_update(i), 2) + 0.5, state_two_dimension_new(state_update(i), 2) + 0.5 + 0.03 * randn(1), state_two_dimension_new(state_update(i + 1), 2) + 0.5 + 0.03 * randn(1), state_two_dimension_new(state_update(i + 1), 2) + 0.5], ...
269 |                  'Color', 'green');
270 |         end
271 |         hold on;
272 |     end
273 | end
274 | 
275 | 
276 | 


--------------------------------------------------------------------------------
/Code for grid world/matlab_version/main_example.m:
--------------------------------------------------------------------------------
  1 | % by the Intelligent Unmanned Systems Laboratory, Westlake University, 2024
  2 | 
  3 | clear 
  4 | close all
  5 | 
  6 | % Initialize environment parameters
  7 | agent_state = [1, 1];
  8 | final_state = [3, 3];
  9 | obstacle_state = [1, 3; 2, 1; 1, 2];
 10 | x_length = 3;
 11 | y_length = 4;
 12 | state_space = x_length * y_length; 
 13 | state=1:state_space;
 14 | state_value=ones(state_space,1);
 15 | 
 16 | reward_forbidden = -1;
 17 | reward_target = 1;
 18 | reward_step = 0;
 19 | 
 20 | % Define actions: up, right, down, left, stay
 21 | actions = {[0, -1], [1, 0], [0, 1], [-1, 0], [0, 0]};
 22 | 
 23 | % Initialize a cell array to store the action space for each state
 24 | action_space = cell(state_space, 1);
 25 | 
 26 | % Populate the action space
 27 | for i = 1:state_space       
 28 |     action_space{i} = actions;
 29 | end
 30 | 
 31 | number_of_action=5;
 32 | 
 33 | policy=zeros(state_space, number_of_action); % policy can be deterministic or stochastic, shown as follows:
 34 | 
 35 | 
 36 | 
 37 | 
 38 | % stochastic policy 
 39 | 
 40 | for i=1:state_space
 41 |     policy(i,:)=.2;          
 42 | end
 43 | % policy(3,2)=0; policy(3,4)=.4;
 44 | % policy(5,5)=0; policy(5,3)=.4;
 45 | policy(7,3)=1; policy(7,4)= 0; policy(7,2)= 0; policy(7,1)= 0; policy(7,5)= 0;
 46 | % policy(6,2)=0; policy(6,3) = 1; policy(6,4) = 0; policy(6,5) = 0; policy (6,1) = 0;
 47 | 
 48 | 
 49 | 
 50 | 
 51 | % Initialize the episode
 52 | episode_length = 1000;
 53 | 
 54 | state_history = zeros(episode_length, 2); 
 55 | reward_history = zeros(episode_length, 1);  
 56 | 
 57 | % Set the initial state
 58 | state_history(1, :) = agent_state;
 59 | 
 60 | for step = 1:episode_length
 61 |     action = stochastic_policy(state_history(step, :), action_space, policy, x_length, y_length);   
 62 |     % Calculate the new state and reward
 63 |     [new_state, reward] = next_state_and_reward(state_history(step, :), action, x_length, y_length, final_state, obstacle_state, reward_forbidden, reward_target, reward_step);
 64 |     % Update state and reward history
 65 |     state_history(step+1, :) = new_state;
 66 |     reward_history(step) = reward;
 67 | end
 68 | 
 69 | figure_plot(x_length, y_length, agent_state, final_state, obstacle_state, state_value, state, episode_length, state_history, policy, actions);
 70 | 
 71 | %%   useful function 
 72 | function [new_state, reward] = next_state_and_reward(state, action, x_length, y_length, target_state, obstacle_state, reward_forbidden, reward_target, reward_step)
 73 |     new_x = state(1) + action(1);
 74 |     new_y = state(2) + action(2);
 75 |     new_state = [new_x, new_y];
 76 | 
 77 |     % Check if the new state is out of bounds
 78 |     if new_x < 1 || new_x > x_length || new_y < 1 || new_y > y_length
 79 |         new_state = state;
 80 |         reward = reward_forbidden;
 81 |     elseif ismember(new_state, obstacle_state, 'rows')
 82 |         % If the new state is an obstacle
 83 |         reward = reward_forbidden;
 84 |     elseif isequal(new_state, target_state)
 85 |         % If the new state is the target state
 86 |         reward = reward_target;
 87 |     else
 88 |          % If the new state is a normal cell
 89 |         reward = reward_step;
 90 |     end
 91 | end
 92 | 
 93 | function action = stochastic_policy(state, action_space, policy, x_length, y_length)
 94 |     % Extract the action space and policy for a specific state
 95 |     state_1d = x_length * (state(2)-1) + state(1); 
 96 |     actions = action_space{state_1d};
 97 |     policy_i = policy(state_1d, :);
 98 | 
 99 |     % Ensure the sum of policy probabilities is 1
100 |     assert(sum(policy_i) == 1, 'The sum of policy probabilities must be 1.');
101 |     
102 |     % Generate a random index based on policy probabilities
103 |     action_index = randsrc(1, 1, [1:length(actions); policy_i]);
104 |     
105 |     % Select an action
106 |     action = actions{action_index};
107 | end
108 | 


--------------------------------------------------------------------------------
/Code for grid world/matlab_version/policy_offline_Q_learning.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MathFoundationRL/Book-Mathematical-Foundation-of-Reinforcement-Learning/8b70409e752fee2d68772f1c235b6c5726b563d4/Code for grid world/matlab_version/policy_offline_Q_learning.jpg


--------------------------------------------------------------------------------
/Code for grid world/matlab_version/policy_offline_Q_learning.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MathFoundationRL/Book-Mathematical-Foundation-of-Reinforcement-Learning/8b70409e752fee2d68772f1c235b6c5726b563d4/Code for grid world/matlab_version/policy_offline_Q_learning.pdf


--------------------------------------------------------------------------------
/Code for grid world/matlab_version/trajectory_Bellman_Equation.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MathFoundationRL/Book-Mathematical-Foundation-of-Reinforcement-Learning/8b70409e752fee2d68772f1c235b6c5726b563d4/Code for grid world/matlab_version/trajectory_Bellman_Equation.jpg


--------------------------------------------------------------------------------
/Code for grid world/matlab_version/trajectory_Bellman_Equation.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MathFoundationRL/Book-Mathematical-Foundation-of-Reinforcement-Learning/8b70409e752fee2d68772f1c235b6c5726b563d4/Code for grid world/matlab_version/trajectory_Bellman_Equation.pdf


--------------------------------------------------------------------------------
/Code for grid world/matlab_version/trajectory_Bellman_Equation_dotted.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MathFoundationRL/Book-Mathematical-Foundation-of-Reinforcement-Learning/8b70409e752fee2d68772f1c235b6c5726b563d4/Code for grid world/matlab_version/trajectory_Bellman_Equation_dotted.jpg


--------------------------------------------------------------------------------
/Code for grid world/matlab_version/trajectory_Q_learning.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MathFoundationRL/Book-Mathematical-Foundation-of-Reinforcement-Learning/8b70409e752fee2d68772f1c235b6c5726b563d4/Code for grid world/matlab_version/trajectory_Q_learning.jpg


--------------------------------------------------------------------------------
/Code for grid world/matlab_version/trajectory_Q_learning.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MathFoundationRL/Book-Mathematical-Foundation-of-Reinforcement-Learning/8b70409e752fee2d68772f1c235b6c5726b563d4/Code for grid world/matlab_version/trajectory_Q_learning.pdf


--------------------------------------------------------------------------------
/Code for grid world/python_version/examples/__pycache__/arguments.cpython-311.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MathFoundationRL/Book-Mathematical-Foundation-of-Reinforcement-Learning/8b70409e752fee2d68772f1c235b6c5726b563d4/Code for grid world/python_version/examples/__pycache__/arguments.cpython-311.pyc


--------------------------------------------------------------------------------
/Code for grid world/python_version/examples/arguments.py:
--------------------------------------------------------------------------------
 1 | __credits__ = ["Intelligent Unmanned Systems Laboratory at Westlake University."]
 2 | '''
 3 | Specify parameters of the env
 4 | '''
 5 | from typing import Union
 6 | import numpy as np
 7 | import argparse
 8 | 
 9 | parser = argparse.ArgumentParser("Grid World Environment")
10 | 
11 | ## ==================== User settings ===================='''
12 | # specify the number of columns and rows of the grid world
13 | parser.add_argument("--env-size", type=Union[list, tuple, np.ndarray], default=(5,5) )   
14 | 
15 | # specify the start state
16 | parser.add_argument("--start-state", type=Union[list, tuple, np.ndarray], default=(2,2))
17 | 
18 | # specify the target state
19 | parser.add_argument("--target-state", type=Union[list, tuple, np.ndarray], default=(4,4))
20 | 
21 | # sepcify the forbidden states
22 | parser.add_argument("--forbidden-states", type=list, default=[ (2, 1), (3, 3), (1, 3)] )
23 | 
24 | # sepcify the reward when reaching target
25 | parser.add_argument("--reward-target", type=float, default = 10)
26 | 
27 | # sepcify the reward when entering into forbidden area
28 | parser.add_argument("--reward-forbidden", type=float, default = -5)
29 | 
30 | # sepcify the reward for each step
31 | parser.add_argument("--reward-step", type=float, default = -1)
32 | ## ==================== End of User settings ====================
33 | 
34 | 
35 | ## ==================== Advanced Settings ====================
36 | parser.add_argument("--action-space", type=list, default=[(0, 1), (1, 0), (0, -1), (-1, 0), (0, 0)] )  # down, right, up, left, stay           
37 | parser.add_argument("--debug", type=bool, default=False)
38 | parser.add_argument("--animation-interval", type=float, default = 0.2)
39 | ## ==================== End of Advanced settings ====================
40 | 
41 | 
42 | args = parser.parse_args()     
43 | def validate_environment_parameters(env_size, start_state, target_state, forbidden_states):
44 |     if not (isinstance(env_size, tuple) or isinstance(env_size, list) or isinstance(env_size, np.ndarray)) and len(env_size) != 2:
45 |         raise ValueError("Invalid environment size. Expected a tuple (rows, cols) with positive dimensions.")
46 |     
47 |     for i in range(2):
48 |         assert start_state[i] < env_size[i]
49 |         assert target_state[i] < env_size[i]
50 |         for j in range(len(forbidden_states)):
51 |             assert forbidden_states[j][i] < env_size[i]
52 | try:
53 |     validate_environment_parameters(args.env_size, args.start_state, args.target_state, args.forbidden_states)
54 | except ValueError as e:
55 |     print("Error:", e)


--------------------------------------------------------------------------------
/Code for grid world/python_version/examples/example_grid_world.py:
--------------------------------------------------------------------------------
 1 | 
 2 | import sys
 3 | sys.path.append("..")
 4 | from src.grid_world import GridWorld
 5 | import random
 6 | import numpy as np
 7 | 
 8 | # Example usage:
 9 | if __name__ == "__main__":             
10 |     env = GridWorld()
11 |     state = env.reset()               
12 |     for t in range(1000):
13 |         env.render()
14 |         action = random.choice(env.action_space)
15 |         next_state, reward, done, info = env.step(action)
16 |         print(f"Step: {t}, Action: {action}, State: {next_state+(np.array([1,1]))}, Reward: {reward}, Done: {done}")
17 |         # if done:
18 |         #     break
19 |     
20 |     # Add policy
21 |     policy_matrix=np.random.rand(env.num_states,len(env.action_space))                                            
22 |     policy_matrix /= policy_matrix.sum(axis=1)[:, np.newaxis]  # make the sum of elements in each row to be 1
23 | 
24 |     env.add_policy(policy_matrix)
25 | 
26 |     
27 |     # Add state values
28 |     values = np.random.uniform(0,10,(env.num_states,))
29 |     env.add_state_values(values)
30 | 
31 |     # Render the environment
32 |     env.render(animation_interval=2)


--------------------------------------------------------------------------------
/Code for grid world/python_version/plots/sample1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MathFoundationRL/Book-Mathematical-Foundation-of-Reinforcement-Learning/8b70409e752fee2d68772f1c235b6c5726b563d4/Code for grid world/python_version/plots/sample1.png


--------------------------------------------------------------------------------
/Code for grid world/python_version/plots/sample2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MathFoundationRL/Book-Mathematical-Foundation-of-Reinforcement-Learning/8b70409e752fee2d68772f1c235b6c5726b563d4/Code for grid world/python_version/plots/sample2.png


--------------------------------------------------------------------------------
/Code for grid world/python_version/plots/sample3.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MathFoundationRL/Book-Mathematical-Foundation-of-Reinforcement-Learning/8b70409e752fee2d68772f1c235b6c5726b563d4/Code for grid world/python_version/plots/sample3.png


--------------------------------------------------------------------------------
/Code for grid world/python_version/plots/sample4.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MathFoundationRL/Book-Mathematical-Foundation-of-Reinforcement-Learning/8b70409e752fee2d68772f1c235b6c5726b563d4/Code for grid world/python_version/plots/sample4.png


--------------------------------------------------------------------------------
/Code for grid world/python_version/src/__pycache__/grid_world.cpython-311.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MathFoundationRL/Book-Mathematical-Foundation-of-Reinforcement-Learning/8b70409e752fee2d68772f1c235b6c5726b563d4/Code for grid world/python_version/src/__pycache__/grid_world.cpython-311.pyc


--------------------------------------------------------------------------------
/Code for grid world/python_version/src/__pycache__/grid_world.cpython-38.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MathFoundationRL/Book-Mathematical-Foundation-of-Reinforcement-Learning/8b70409e752fee2d68772f1c235b6c5726b563d4/Code for grid world/python_version/src/__pycache__/grid_world.cpython-38.pyc


--------------------------------------------------------------------------------
/Code for grid world/python_version/src/__pycache__/utils.cpython-311.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MathFoundationRL/Book-Mathematical-Foundation-of-Reinforcement-Learning/8b70409e752fee2d68772f1c235b6c5726b563d4/Code for grid world/python_version/src/__pycache__/utils.cpython-311.pyc


--------------------------------------------------------------------------------
/Code for grid world/python_version/src/grid_world.py:
--------------------------------------------------------------------------------
  1 | __credits__ = ["Intelligent Unmanned Systems Laboratory at Westlake University."]
  2 | 
  3 | import sys    
  4 | sys.path.append("..")         
  5 | import numpy as np
  6 | import matplotlib.pyplot as plt
  7 | import matplotlib.patches as patches          
  8 | from examples.arguments import args           
  9 | 
 10 | class GridWorld():
 11 | 
 12 |     def __init__(self, env_size=args.env_size, 
 13 |                  start_state=args.start_state, 
 14 |                  target_state=args.target_state, 
 15 |                  forbidden_states=args.forbidden_states):
 16 | 
 17 |         self.env_size = env_size
 18 |         self.num_states = env_size[0] * env_size[1]
 19 |         self.start_state = start_state
 20 |         self.target_state = target_state
 21 |         self.forbidden_states = forbidden_states
 22 | 
 23 |         self.agent_state = start_state
 24 |         self.action_space = args.action_space          
 25 |         self.reward_target = args.reward_target
 26 |         self.reward_forbidden = args.reward_forbidden
 27 |         self.reward_step = args.reward_step
 28 | 
 29 |         self.canvas = None
 30 |         self.animation_interval = args.animation_interval
 31 | 
 32 | 
 33 |         self.color_forbid = (0.9290,0.6940,0.125)
 34 |         self.color_target = (0.3010,0.7450,0.9330)
 35 |         self.color_policy = (0.4660,0.6740,0.1880)
 36 |         self.color_trajectory = (0, 1, 0)
 37 |         self.color_agent = (0,0,1)
 38 | 
 39 | 
 40 | 
 41 |     def reset(self):
 42 |         self.agent_state = self.start_state
 43 |         self.traj = [self.agent_state] 
 44 |         return self.agent_state, {}
 45 | 
 46 | 
 47 |     def step(self, action):
 48 |         assert action in self.action_space, "Invalid action"
 49 | 
 50 |         next_state, reward  = self._get_next_state_and_reward(self.agent_state, action)
 51 |         done = self._is_done(next_state)
 52 | 
 53 |         x_store = next_state[0] + 0.03 * np.random.randn()
 54 |         y_store = next_state[1] + 0.03 * np.random.randn()
 55 |         state_store = tuple(np.array((x_store,  y_store)) + 0.2 * np.array(action))
 56 |         state_store_2 = (next_state[0], next_state[1])
 57 | 
 58 |         self.agent_state = next_state
 59 | 
 60 |         self.traj.append(state_store)   
 61 |         self.traj.append(state_store_2)
 62 |         return self.agent_state, reward, done, {}   
 63 |     
 64 |         
 65 |     def _get_next_state_and_reward(self, state, action):
 66 |         x, y = state
 67 |         new_state = tuple(np.array(state) + np.array(action))
 68 |         if y + 1 > self.env_size[1] - 1 and action == (0,1):    # down
 69 |             y = self.env_size[1] - 1
 70 |             reward = self.reward_forbidden  
 71 |         elif x + 1 > self.env_size[0] - 1 and action == (1,0):  # right
 72 |             x = self.env_size[0] - 1
 73 |             reward = self.reward_forbidden  
 74 |         elif y - 1 < 0 and action == (0,-1):   # up
 75 |             y = 0
 76 |             reward = self.reward_forbidden  
 77 |         elif x - 1 < 0 and action == (-1, 0):  # left
 78 |             x = 0
 79 |             reward = self.reward_forbidden 
 80 |         elif new_state == self.target_state:  # stay
 81 |             x, y = self.target_state
 82 |             reward = self.reward_target
 83 |         elif new_state in self.forbidden_states:  # stay
 84 |             x, y = state
 85 |             reward = self.reward_forbidden        
 86 |         else:
 87 |             x, y = new_state
 88 |             reward = self.reward_step
 89 |             
 90 |         return (x, y), reward
 91 |         
 92 | 
 93 |     def _is_done(self, state):
 94 |         return state == self.target_state
 95 |     
 96 | 
 97 |     def render(self, animation_interval=args.animation_interval):
 98 |         if self.canvas is None:
 99 |             plt.ion()                             
100 |             self.canvas, self.ax = plt.subplots()   
101 |             self.ax.set_xlim(-0.5, self.env_size[0] - 0.5)
102 |             self.ax.set_ylim(-0.5, self.env_size[1] - 0.5)
103 |             self.ax.xaxis.set_ticks(np.arange(-0.5, self.env_size[0], 1))     
104 |             self.ax.yaxis.set_ticks(np.arange(-0.5, self.env_size[1], 1))     
105 |             self.ax.grid(True, linestyle="-", color="gray", linewidth="1", axis='both')          
106 |             self.ax.set_aspect('equal')
107 |             self.ax.invert_yaxis()                           
108 |             self.ax.xaxis.set_ticks_position('top')           
109 |             
110 |             idx_labels_x = [i for i in range(self.env_size[0])]
111 |             idx_labels_y = [i for i in range(self.env_size[1])]
112 |             for lb in idx_labels_x:
113 |                 self.ax.text(lb, -0.75, str(lb+1), size=10, ha='center', va='center', color='black')           
114 |             for lb in idx_labels_y:
115 |                 self.ax.text(-0.75, lb, str(lb+1), size=10, ha='center', va='center', color='black')
116 |             self.ax.tick_params(bottom=False, left=False, right=False, top=False, labelbottom=False, labelleft=False,labeltop=False)   
117 | 
118 |             self.target_rect = patches.Rectangle( (self.target_state[0]-0.5, self.target_state[1]-0.5), 1, 1, linewidth=1, edgecolor=self.color_target, facecolor=self.color_target)
119 |             self.ax.add_patch(self.target_rect)     
120 | 
121 |             for forbidden_state in self.forbidden_states:
122 |                 rect = patches.Rectangle((forbidden_state[0]-0.5, forbidden_state[1]-0.5), 1, 1, linewidth=1, edgecolor=self.color_forbid, facecolor=self.color_forbid)
123 |                 self.ax.add_patch(rect)
124 | 
125 |             self.agent_star, = self.ax.plot([], [], marker = '*', color=self.color_agent, markersize=20, linewidth=0.5) 
126 |             self.traj_obj, = self.ax.plot([], [], color=self.color_trajectory, linewidth=0.5)
127 | 
128 |         # self.agent_circle.center = (self.agent_state[0], self.agent_state[1])
129 |         self.agent_star.set_data([self.agent_state[0]],[self.agent_state[1]])       
130 |         traj_x, traj_y = zip(*self.traj)         
131 |         self.traj_obj.set_data(traj_x, traj_y)
132 | 
133 |         plt.draw()
134 |         plt.pause(animation_interval)
135 |         if args.debug:
136 |             input('press Enter to continue...')     
137 | 
138 | 
139 |  
140 |     def add_policy(self, policy_matrix):                  
141 |         for state, state_action_group in enumerate(policy_matrix):    
142 |             x = state % self.env_size[0]
143 |             y = state // self.env_size[0]
144 |             for i, action_probability in enumerate(state_action_group):
145 |                 if action_probability !=0:
146 |                     dx, dy = self.action_space[i]
147 |                     if (dx, dy) != (0,0):
148 |                         self.ax.add_patch(patches.FancyArrow(x, y, dx=(0.1+action_probability/2)*dx, dy=(0.1+action_probability/2)*dy, color=self.color_policy, width=0.001, head_width=0.05))
149 |                     else:
150 |                         self.ax.add_patch(patches.Circle((x, y), radius=0.07, facecolor=self.color_policy, edgecolor=self.color_policy, linewidth=1, fill=False))
151 |     
152 |     def add_state_values(self, values, precision=1):
153 |         '''
154 |             values: iterable
155 |         '''
156 |         values = np.round(values, precision)
157 |         for i, value in enumerate(values):
158 |             x = i % self.env_size[0]
159 |             y = i // self.env_size[0]
160 |             self.ax.text(x, y, str(value), ha='center', va='center', fontsize=10, color='black')


--------------------------------------------------------------------------------
/Figure_ChineseBookCover.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MathFoundationRL/Book-Mathematical-Foundation-of-Reinforcement-Learning/8b70409e752fee2d68772f1c235b6c5726b563d4/Figure_ChineseBookCover.png


--------------------------------------------------------------------------------
/Figure_EnglishLectureVideo.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MathFoundationRL/Book-Mathematical-Foundation-of-Reinforcement-Learning/8b70409e752fee2d68772f1c235b6c5726b563d4/Figure_EnglishLectureVideo.png


--------------------------------------------------------------------------------
/Figure_chapterMap.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MathFoundationRL/Book-Mathematical-Foundation-of-Reinforcement-Learning/8b70409e752fee2d68772f1c235b6c5726b563d4/Figure_chapterMap.png


--------------------------------------------------------------------------------
/Lecture slides/Readme.md:
--------------------------------------------------------------------------------
 1 | My lecture slides are put into two folders.
 2 | 
 3 | - The folder "slidesForMyLectureVideos" contains all **the slides that I used to record my lecture videos**.
 4 | 
 5 | - The folder "slidesContinuouslyUpdated" contains **the slides that I updated continuously**.
 6 | 
 7 | The slides in the two folders are very similar, but there are some minor differences, such as typo correction and content adjustment.
 8 | 
 9 | **If you are not studying my online lecture videos, I suggest you check the slides in the slidesContinuouslyUpdated folder since they have been improved continuously.**
10 | 
11 | 


--------------------------------------------------------------------------------
/Lecture slides/slidesContinuouslyUpdated/L1-Basic concepts.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MathFoundationRL/Book-Mathematical-Foundation-of-Reinforcement-Learning/8b70409e752fee2d68772f1c235b6c5726b563d4/Lecture slides/slidesContinuouslyUpdated/L1-Basic concepts.pdf


--------------------------------------------------------------------------------
/Lecture slides/slidesContinuouslyUpdated/L10-Actor Critic.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MathFoundationRL/Book-Mathematical-Foundation-of-Reinforcement-Learning/8b70409e752fee2d68772f1c235b6c5726b563d4/Lecture slides/slidesContinuouslyUpdated/L10-Actor Critic.pdf


--------------------------------------------------------------------------------
/Lecture slides/slidesContinuouslyUpdated/L2-Bellman equation.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MathFoundationRL/Book-Mathematical-Foundation-of-Reinforcement-Learning/8b70409e752fee2d68772f1c235b6c5726b563d4/Lecture slides/slidesContinuouslyUpdated/L2-Bellman equation.pdf


--------------------------------------------------------------------------------
/Lecture slides/slidesContinuouslyUpdated/L3-Bellman optimality equation.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MathFoundationRL/Book-Mathematical-Foundation-of-Reinforcement-Learning/8b70409e752fee2d68772f1c235b6c5726b563d4/Lecture slides/slidesContinuouslyUpdated/L3-Bellman optimality equation.pdf


--------------------------------------------------------------------------------
/Lecture slides/slidesContinuouslyUpdated/L4-Value iteration and policy iteration.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MathFoundationRL/Book-Mathematical-Foundation-of-Reinforcement-Learning/8b70409e752fee2d68772f1c235b6c5726b563d4/Lecture slides/slidesContinuouslyUpdated/L4-Value iteration and policy iteration.pdf


--------------------------------------------------------------------------------
/Lecture slides/slidesContinuouslyUpdated/L5-Monte Carlo methods.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MathFoundationRL/Book-Mathematical-Foundation-of-Reinforcement-Learning/8b70409e752fee2d68772f1c235b6c5726b563d4/Lecture slides/slidesContinuouslyUpdated/L5-Monte Carlo methods.pdf


--------------------------------------------------------------------------------
/Lecture slides/slidesContinuouslyUpdated/L6-Stochastic approximation.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MathFoundationRL/Book-Mathematical-Foundation-of-Reinforcement-Learning/8b70409e752fee2d68772f1c235b6c5726b563d4/Lecture slides/slidesContinuouslyUpdated/L6-Stochastic approximation.pdf


--------------------------------------------------------------------------------
/Lecture slides/slidesContinuouslyUpdated/L7-Temporal-Difference Learning.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MathFoundationRL/Book-Mathematical-Foundation-of-Reinforcement-Learning/8b70409e752fee2d68772f1c235b6c5726b563d4/Lecture slides/slidesContinuouslyUpdated/L7-Temporal-Difference Learning.pdf


--------------------------------------------------------------------------------
/Lecture slides/slidesContinuouslyUpdated/L8-Value function methods.pdf.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MathFoundationRL/Book-Mathematical-Foundation-of-Reinforcement-Learning/8b70409e752fee2d68772f1c235b6c5726b563d4/Lecture slides/slidesContinuouslyUpdated/L8-Value function methods.pdf.pdf


--------------------------------------------------------------------------------
/Lecture slides/slidesContinuouslyUpdated/L9-Policy gradient methods.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MathFoundationRL/Book-Mathematical-Foundation-of-Reinforcement-Learning/8b70409e752fee2d68772f1c235b6c5726b563d4/Lecture slides/slidesContinuouslyUpdated/L9-Policy gradient methods.pdf


--------------------------------------------------------------------------------
/Lecture slides/slidesForMyLectureVideos/L1-basic concepts.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MathFoundationRL/Book-Mathematical-Foundation-of-Reinforcement-Learning/8b70409e752fee2d68772f1c235b6c5726b563d4/Lecture slides/slidesForMyLectureVideos/L1-basic concepts.pdf


--------------------------------------------------------------------------------
/Lecture slides/slidesForMyLectureVideos/L10_Actor Critic.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MathFoundationRL/Book-Mathematical-Foundation-of-Reinforcement-Learning/8b70409e752fee2d68772f1c235b6c5726b563d4/Lecture slides/slidesForMyLectureVideos/L10_Actor Critic.pdf


--------------------------------------------------------------------------------
/Lecture slides/slidesForMyLectureVideos/L2-Bellman equation.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MathFoundationRL/Book-Mathematical-Foundation-of-Reinforcement-Learning/8b70409e752fee2d68772f1c235b6c5726b563d4/Lecture slides/slidesForMyLectureVideos/L2-Bellman equation.pdf


--------------------------------------------------------------------------------
/Lecture slides/slidesForMyLectureVideos/L3-Bellman optimality equation.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MathFoundationRL/Book-Mathematical-Foundation-of-Reinforcement-Learning/8b70409e752fee2d68772f1c235b6c5726b563d4/Lecture slides/slidesForMyLectureVideos/L3-Bellman optimality equation.pdf


--------------------------------------------------------------------------------
/Lecture slides/slidesForMyLectureVideos/L4-Value iteration and policy iteration.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MathFoundationRL/Book-Mathematical-Foundation-of-Reinforcement-Learning/8b70409e752fee2d68772f1c235b6c5726b563d4/Lecture slides/slidesForMyLectureVideos/L4-Value iteration and policy iteration.pdf


--------------------------------------------------------------------------------
/Lecture slides/slidesForMyLectureVideos/L5-MC.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MathFoundationRL/Book-Mathematical-Foundation-of-Reinforcement-Learning/8b70409e752fee2d68772f1c235b6c5726b563d4/Lecture slides/slidesForMyLectureVideos/L5-MC.pdf


--------------------------------------------------------------------------------
/Lecture slides/slidesForMyLectureVideos/L6-Stochastic approximation and stochastic gradient descent.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MathFoundationRL/Book-Mathematical-Foundation-of-Reinforcement-Learning/8b70409e752fee2d68772f1c235b6c5726b563d4/Lecture slides/slidesForMyLectureVideos/L6-Stochastic approximation and stochastic gradient descent.pdf


--------------------------------------------------------------------------------
/Lecture slides/slidesForMyLectureVideos/L7-Temporal-difference learning.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MathFoundationRL/Book-Mathematical-Foundation-of-Reinforcement-Learning/8b70409e752fee2d68772f1c235b6c5726b563d4/Lecture slides/slidesForMyLectureVideos/L7-Temporal-difference learning.pdf


--------------------------------------------------------------------------------
/Lecture slides/slidesForMyLectureVideos/L8_Value function approximation.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MathFoundationRL/Book-Mathematical-Foundation-of-Reinforcement-Learning/8b70409e752fee2d68772f1c235b6c5726b563d4/Lecture slides/slidesForMyLectureVideos/L8_Value function approximation.pdf


--------------------------------------------------------------------------------
/Lecture slides/slidesForMyLectureVideos/L9_Policy gradient.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MathFoundationRL/Book-Mathematical-Foundation-of-Reinforcement-Learning/8b70409e752fee2d68772f1c235b6c5726b563d4/Lecture slides/slidesForMyLectureVideos/L9_Policy gradient.pdf


--------------------------------------------------------------------------------
/Readme.md:
--------------------------------------------------------------------------------
  1 | # (May 2025) 9,000+ stars!
  2 | This textbook has received 9,000+ stars! Glad that it is helpful to many readers.
  3 | 
  4 | # (Mar 2025) English lecture videos completed!
  5 | 
  6 | [![](./Figure_EnglishLectureVideo.png)](https://youtube.com/playlist?list=PLEhdbSEZZbDaFWPX4gehhwB9vJZJ1DNm8&si=B6mRR7vxBAjRAm_F)
  7 | 
  8 | **My English open course is online now.** You can click the above figure or the [link here](https://youtube.com/playlist?list=PLEhdbSEZZbDaFWPX4gehhwB9vJZJ1DNm8&si=D1T4pcyHsMxj6CzB) to jump to our YouTube channel. You can also click the following links to be directed to specific lecture videos. You are warmly welcome to check out the English videos to help your learning.
  9 | 
 10 | - [Overview of Reinforcement Learning in 30 Minutes](https://www.youtube.com/watch?v=ZHMWHr9811U&list=PLEhdbSEZZbDaFWPX4gehhwB9vJZJ1DNm8&index=1)
 11 | - [L1: Basic Concepts (P1-State, action, policy, ...)](https://www.youtube.com/watch?v=zJHtM5dN69g&list=PLEhdbSEZZbDaFWPX4gehhwB9vJZJ1DNm8&index=2)
 12 | - [L1: Basic Concepts (P2-Reward,return, Markov decision process)](https://www.youtube.com/watch?v=repVl3_GYCI&list=PLEhdbSEZZbDaFWPX4gehhwB9vJZJ1DNm8&index=3)
 13 | - [L2: Bellman Equation (P1-Motivating examples)](https://www.youtube.com/watch?v=XCzWrlgZCwc&list=PLEhdbSEZZbDaFWPX4gehhwB9vJZJ1DNm8&index=4)
 14 | - [L2: Bellman Equation (P2-State value)](https://www.youtube.com/watch?v=DSvi3xEN13I&list=PLEhdbSEZZbDaFWPX4gehhwB9vJZJ1DNm8&index=5)
 15 | - [L2: Bellman Equation (P3-Bellman equation-Derivation)](https://www.youtube.com/watch?v=eNtId8yPWkA&list=PLEhdbSEZZbDaFWPX4gehhwB9vJZJ1DNm8&index=6)
 16 | - [L2: Bellman Equation (P4-Matrix-vector form and solution)](https://www.youtube.com/watch?v=EtCfBG_eP2w&list=PLEhdbSEZZbDaFWPX4gehhwB9vJZJ1DNm8&index=7)
 17 | - [L2: Bellman Equation (P5-Action value)](https://www.youtube.com/watch?v=zJo2sLDzfcU&list=PLEhdbSEZZbDaFWPX4gehhwB9vJZJ1DNm8&index=8)
 18 | - [L3: Bellman Optimality Equation (P1-Motivating example)](https://www.youtube.com/watch?v=lXKY_Hyg4SQ&list=PLEhdbSEZZbDaFWPX4gehhwB9vJZJ1DNm8&index=9)
 19 | - [L3: Bellman Optimality Equation (P2-Optimal policy)](https://www.youtube.com/watch?v=BxyjdHhK8a8&list=PLEhdbSEZZbDaFWPX4gehhwB9vJZJ1DNm8&index=10)
 20 | - [L3: Bellman Optimality Equation (P3-More on BOE)](https://www.youtube.com/watch?v=FXftTCKotC8&list=PLEhdbSEZZbDaFWPX4gehhwB9vJZJ1DNm8&index=11)
 21 | - [L3: Bellman Optimality Equation (P4-Interesting properties)](https://www.youtube.com/watch?v=a--bck2ow9s&list=PLEhdbSEZZbDaFWPX4gehhwB9vJZJ1DNm8&index=12)
 22 | - [L4: Value Iteration and Policy Iteration (P1-Value iteration)](https://www.youtube.com/watch?v=wMAVmLDIvQU&list=PLEhdbSEZZbDaFWPX4gehhwB9vJZJ1DNm8&index=13)
 23 | - [L4: Value Iteration and Policy Iteration (P2-Policy iteration)](https://www.youtube.com/watch?v=Pka6Om0nYQ8&list=PLEhdbSEZZbDaFWPX4gehhwB9vJZJ1DNm8&index=14)
 24 | - [L4: Value Iteration and Policy Iteration (P3-Truncated policy iteration)](https://www.youtube.com/watch?v=tUjPFPD3Vc8&list=PLEhdbSEZZbDaFWPX4gehhwB9vJZJ1DNm8&index=15)
 25 | - [L5: Monte Carlo Learning (P1-Motivating examples)](https://www.youtube.com/watch?v=DO1yXinAV_Q&list=PLEhdbSEZZbDaFWPX4gehhwB9vJZJ1DNm8&index=16)
 26 | - [L5: Monte Carlo Learning (P2-MC Basic-introduction)](https://www.youtube.com/watch?v=6ShisunU0zs&list=PLEhdbSEZZbDaFWPX4gehhwB9vJZJ1DNm8&index=17)
 27 | - [L5: Monte Carlo Learning (P3-MC Basic-examples)](https://www.youtube.com/watch?v=axA0yns9FxU&list=PLEhdbSEZZbDaFWPX4gehhwB9vJZJ1DNm8&index=18)
 28 | - [L5: Monte Carlo Learning (P4-MC Exploring Starts)](https://www.youtube.com/watch?v=Qt8OMHPkLqg&list=PLEhdbSEZZbDaFWPX4gehhwB9vJZJ1DNm8&index=19)
 29 | - [L5: Monte Carlo Learning (P5-MC Epsilon-Greedy-introduction)](https://www.youtube.com/watch?v=dM3fYE630pY&list=PLEhdbSEZZbDaFWPX4gehhwB9vJZJ1DNm8&index=20)
 30 | - [L5: Monte Carlo Learning (P6-MC Epsilon-Greedy-examples)](https://www.youtube.com/watch?v=x6X_5ePT9gQ&list=PLEhdbSEZZbDaFWPX4gehhwB9vJZJ1DNm8&index=21)
 31 | - [L6: Stochastic Approximation and SGD (P1-Motivating example)](https://www.youtube.com/watch?v=1bMgejvWoAo&list=PLEhdbSEZZbDaFWPX4gehhwB9vJZJ1DNm8&index=22)
 32 | - [L6: Stochastic Approximation and SGD (P2-RM algorithm: introduction)](https://www.youtube.com/watch?v=1FTGcNUUnCE&list=PLEhdbSEZZbDaFWPX4gehhwB9vJZJ1DNm8&index=23)
 33 | - [L6: Stochastic Approximation and SGD (P3-RM algorithm: convergence)](https://www.youtube.com/watch?v=juNDoAFEre4&list=PLEhdbSEZZbDaFWPX4gehhwB9vJZJ1DNm8&index=24)
 34 | - [L6: Stochastic Approximation and SGD (P4-SGD algorithm: introduction)](https://www.youtube.com/watch?v=EZO7Iadp5m4&list=PLEhdbSEZZbDaFWPX4gehhwB9vJZJ1DNm8&index=25)
 35 | - [L6: Stochastic Approximation and SGD (P5-SGD algorithm: examples)](https://www.youtube.com/watch?v=BsxU_4qvvNA&list=PLEhdbSEZZbDaFWPX4gehhwB9vJZJ1DNm8&index=26)
 36 | - [L6: Stochastic Approximation and SGD (P6-SGD algorithm: properties)](https://www.youtube.com/watch?v=fWxX9YuEHjE&list=PLEhdbSEZZbDaFWPX4gehhwB9vJZJ1DNm8&index=27)
 37 | - [L6: Stochastic Approximation and SGD (P7-SGD algorithm: comparison)](https://www.youtube.com/watch?v=yNEV2cLKuzU&list=PLEhdbSEZZbDaFWPX4gehhwB9vJZJ1DNm8&index=28)
 38 | - [L7: Temporal-Difference Learning (P1-Motivating example)](https://www.youtube.com/watch?v=u1X-7XX3dtI&list=PLEhdbSEZZbDaFWPX4gehhwB9vJZJ1DNm8&index=29)
 39 | - [L7: Temporal-Difference Learning (P2-TD algorithm: introduction)](https://www.youtube.com/watch?v=XiCUsc7CCE0&list=PLEhdbSEZZbDaFWPX4gehhwB9vJZJ1DNm8&index=30)
 40 | - [L7: Temporal-Difference Learning (P3-TD algorithm: convergence)](https://www.youtube.com/watch?v=faWg8M91-Oo&list=PLEhdbSEZZbDaFWPX4gehhwB9vJZJ1DNm8&index=31)
 41 | - [L7: Temporal-Difference Learning (P4-Sarsa)](https://www.youtube.com/watch?v=jYwQufkBUPo&list=PLEhdbSEZZbDaFWPX4gehhwB9vJZJ1DNm8&index=32)
 42 | - [L7: Temporal-Difference Learning (P5-Expected Sarsa & n-step Sarsa)](https://www.youtube.com/watch?v=0kKzQbWZOlk&list=PLEhdbSEZZbDaFWPX4gehhwB9vJZJ1DNm8&index=33)
 43 | - [L7: Temporal-Difference Learning (P6-Q-learning: introduction)](https://www.youtube.com/watch?v=4BvYR2hm730&list=PLEhdbSEZZbDaFWPX4gehhwB9vJZJ1DNm8&index=34)
 44 | - [L7: Temporal-Difference Learning (P7-Q-learning: pseudo code)](https://www.youtube.com/watch?v=I0YhlOIFF4s&list=PLEhdbSEZZbDaFWPX4gehhwB9vJZJ1DNm8&index=35)
 45 | - [L7: Temporal-Difference Learning (P8-Unified viewpoint and summary)](https://www.youtube.com/watch?v=3t74lvk1GBM&list=PLEhdbSEZZbDaFWPX4gehhwB9vJZJ1DNm8&index=36)
 46 | - [L8: Value Function Approximation (P1-Motivating example–curve fitting)](https://www.youtube.com/watch?v=uJXcI8fcdWc&list=PLEhdbSEZZbDaFWPX4gehhwB9vJZJ1DNm8&index=37)
 47 | - [L8: Value Function Approximation (P2-Objective function)](https://www.youtube.com/watch?v=Z3HI1TfpJP0&list=PLEhdbSEZZbDaFWPX4gehhwB9vJZJ1DNm8&index=38)
 48 | - [L8: Value Function Approximation (P3-Optimization algorithm)](https://www.youtube.com/watch?v=piBDwrKt0uU&list=PLEhdbSEZZbDaFWPX4gehhwB9vJZJ1DNm8&index=39)
 49 | - [L8: Value Function Approximation (P4-illustrative examples and analysis)](https://www.youtube.com/watch?v=VFyBNEZxMMs&list=PLEhdbSEZZbDaFWPX4gehhwB9vJZJ1DNm8&index=40)
 50 | - [L8: Value Function Approximation (P5-Sarsa and Q-learning)](https://www.youtube.com/watch?v=C-HtY4-W_zw&list=PLEhdbSEZZbDaFWPX4gehhwB9vJZJ1DNm8&index=41)
 51 | - [L8: Value Function Approximation (P6-DQN–basic idea)](https://www.youtube.com/watch?v=lZCcbZbqVSQ&list=PLEhdbSEZZbDaFWPX4gehhwB9vJZJ1DNm8&index=42)
 52 | - [L8: Value Function Approximation (P7-DQN–experience replay)](https://www.youtube.com/watch?v=rynEdAdebi0&list=PLEhdbSEZZbDaFWPX4gehhwB9vJZJ1DNm8&index=43)
 53 | - [L8: Value Function Approximation (P8-DQN–implementation and example)](https://www.youtube.com/watch?v=vQHuCHjd6hA&list=PLEhdbSEZZbDaFWPX4gehhwB9vJZJ1DNm8&index=44)
 54 | - [L9: Policy Gradient Methods (P1-Basic idea)](https://www.youtube.com/watch?v=mtFHOj83QSo&list=PLEhdbSEZZbDaFWPX4gehhwB9vJZJ1DNm8&index=45)
 55 | - [L9: Policy Gradient Methods (P2-Metric 1–Average value)](https://www.youtube.com/watch?v=la8jQc3hX1M&list=PLEhdbSEZZbDaFWPX4gehhwB9vJZJ1DNm8&index=46)
 56 | - [L9: Policy Gradient Methods (P3-Metric 2–Average reward)](https://www.youtube.com/watch?v=8RZ_rQFe69E&list=PLEhdbSEZZbDaFWPX4gehhwB9vJZJ1DNm8&index=47)
 57 | - [L9: Policy Gradient Methods (P4-Gradients of the metrics)](https://www.youtube.com/watch?v=MvmtPXur3Ls&list=PLEhdbSEZZbDaFWPX4gehhwB9vJZJ1DNm8&index=48)
 58 | - [L9: Policy Gradient Methods (P5-Gradient-based algorithms & REINFORCE)](https://www.youtube.com/watch?v=1DQnnUC8ng8&list=PLEhdbSEZZbDaFWPX4gehhwB9vJZJ1DNm8&index=49)
 59 | - [L10: Actor-Critic Methods (P1-The simplest Actor-Critic)](https://www.youtube.com/watch?v=kjCZAT5Wh80&list=PLEhdbSEZZbDaFWPX4gehhwB9vJZJ1DNm8&index=50)
 60 | - [L10: Actor-Critic Methods (P2-Advantage Actor-Critic)](https://www.youtube.com/watch?v=vZVXJJcZNEM&list=PLEhdbSEZZbDaFWPX4gehhwB9vJZJ1DNm8&index=51)
 61 | - [L10: Actor-Critic Methods (P3-Importance sampling & off-policy Actor-Critic)](https://www.youtube.com/watch?v=TfO5mnsiGKc&list=PLEhdbSEZZbDaFWPX4gehhwB9vJZJ1DNm8&index=52)
 62 | - [L10: Actor-Critic Methods (P4-Deterministic Actor-Critic)](https://www.youtube.com/watch?v=dTjz1RNtic4&list=PLEhdbSEZZbDaFWPX4gehhwB9vJZJ1DNm8&index=53)
 63 | - [L10: Actor-Critic Methods (P5-Summary and goodbye!)](https://www.youtube.com/watch?v=npvnnKcXoBs&list=PLEhdbSEZZbDaFWPX4gehhwB9vJZJ1DNm8&index=54)
 64 | 
 65 | ***
 66 | ***
 67 | 
 68 | # Why a new book on reinforcement learning?
 69 | 
 70 | This book aims to provide a **mathematical but friendly** introduction to the fundamental concepts, basic problems, and classic algorithms in reinforcement learning. Some essential features of this book are highlighted as follows.
 71 | 
 72 | - The book introduces reinforcement learning from a mathematical point of view. Hopefully, readers will not only know the procedure of an algorithm but also understand why it was designed in the first place and why it works effectively.
 73 | 
 74 | - The depth of the mathematics is carefully controlled to an adequate level. The mathematics is also presented in a carefully designed manner to ensure that the book is friendly to read. Readers can selectively read the materials presented in gray boxes according to their interests.
 75 | 
 76 | - Many illustrative examples are given to help readers better understand the topics. All the examples in this book are based on a grid world task, which is easy to understand and helpful for illustrating concepts and algorithms.
 77 | 
 78 | - When introducing an algorithm, the book aims to separate its core idea from complications that may be distracting. In this way, readers can better grasp the core idea of an algorithm.
 79 | 
 80 | - The contents of the book are coherently organized. Each chapter is built based on the preceding chapter and lays a necessary foundation for the subsequent one.
 81 | 
 82 | [![Book cover](./springerBookCover.png)](https://link.springer.com/book/9789819739431)
 83 | ![Book cover](./Figure_ChineseBookCover.png)
 84 | 
 85 | # Contents
 86 | 
 87 | The topics addressed in the book are shown in the figure below. This book contains ten chapters, which can be classified into two parts: the first part is about basic tools, and the second part is about algorithms. The ten chapters are highly correlated. In general, it is necessary to study the earlier chapters first before the later ones.
 88 | 
 89 | ![The map of this book](./Figure_chapterMap.png)
 90 | 
 91 | 
 92 | # Lecture videos 
 93 | 
 94 | By combining the book with my lecture videos, I believe you can study better. 
 95 | 
 96 | - **Chinese lecture videos:** You can check the [Bilibili channel](https://space.bilibili.com/2044042934) or the [Youtube channel](https://www.youtube.com/channel/UCztGtS5YYiNv8x3pj9hLVgg/playlists).
 97 | - **English lecture videos:** The English lecture videos have been uploaded to YouTube. Please see the links and details in another part of this document.
 98 | 
 99 | The lecture videos have received **1,500,000+ views** over the Internet and received very good feedback!
100 | 
101 | # Readership
102 | 
103 | This book is designed for senior undergraduate students, graduate students, researchers, and practitioners interested in reinforcement learning.
104 | 
105 | It does not require readers to have any background in reinforcement learning because it starts by introducing the most basic concepts. If the reader already has some background in reinforcement learning, I believe the book can help them understand some topics more deeply or provide different perspectives.
106 | 
107 | This book, however, requires the reader to have some knowledge of probability theory and linear algebra. Some basics of the required mathematics are also included in the appendix of this book.
108 | 
109 | # About the author
110 | You can find my info on my homepage https://www.shiyuzhao.net (GoogleSite) and my research group website https://shiyuzhao.westlake.edu.cn
111 | 
112 | I have been teaching a graduate-level course on reinforcement learning since 2019. Along with teaching, I have been preparing this book as the lecture notes for my students. 
113 | 
114 | I sincerely hope this book can help readers smoothly enter the exciting field of reinforcement learning.
115 | 
116 | # Citation
117 | 
118 | ```
119 | @book{zhao2025RLBook,
120 |   title={Mathematical Foundations of Reinforcement Learning},
121 |   author={S. Zhao},
122 |   year={2025},
123 |   publisher={Springer Nature Press and Tsinghua University Press}
124 | }
125 | ```
126 | 
127 | # Third-party code and materials
128 | 
129 | Many enthusiastic readers sent me the source code or notes that they developed when they studied this book. If you create any materials based on course, you are welcome to write an email. I am happy to share the links here and hope they may be helpful to other readers. I must emphasize that I have not verified the code. If you have any questions, you can directly contact the developers. 
130 | 
131 | **Code**
132 | 
133 | *Python:*
134 | - https://github.com/zhoubay/Code-for-Mathematical-Foundations-of-Reinforcement-Learning (Mar 2025, by Xibin ZHOU)
135 | 
136 | - https://github.com/10-OASIS-01/minrl (Feb 2025)
137 | 
138 | - https://github.com/SupermanCaozh/The_Coding_Foundation_in_Reinforcement_Learning  (by Zehong Cao, Aug 2024)
139 | 
140 | - https://github.com/ziwenhahaha/Code-of-RL-Beginning by RLGamer (Mar 2024)
141 |   - Videos for code explanation: https://www.bilibili.com/video/BV1fW421w7NH
142 | 
143 | - https://github.com/jwk1rose/RL_Learning by Wenkang Ji (Feb 2024)
144 | 
145 | *R:*
146 | 
147 | - https://github.com/NewbieToEverything/Code-Mathmatical-Foundation-of-Reinforcement-Learning
148 | 
149 | *C++:*
150 | 
151 | - https://github.com/purundong/test_rl
152 | 
153 | 
154 | **Study notes**
155 | 
156 | *English:*
157 | 
158 | - https://lyk-love.cn/tags/reinforcement-learning/ 
159 | by a graduate student from UC Davis
160 | 
161 | *Chinese:* 
162 | 
163 | - https://zhuanlan.zhihu.com/p/692207843 
164 | 
165 | - https://blog.csdn.net/qq_64671439/category_12540921.html
166 | 
167 | - http://t.csdnimg.cn/EH4rj
168 | 
169 | - https://blog.csdn.net/LvGreat/article/details/135454738
170 | 
171 | - https://xinzhe.blog.csdn.net/article/details/129452000  
172 | 
173 | - https://blog.csdn.net/v20000727/article/details/136870879?spm=1001.2014.3001.5502
174 | 
175 | - https://blog.csdn.net/m0_64952374/category_12883361.html
176 | 
177 | There are also many others notes made by many other readers on the Internet. I am not able to put them all here. You are welcome to recommend to me if you find a good one.
178 | 
179 | **Bilibili videos made based on my course**
180 | 
181 | - https://www.bilibili.com/video/BV1fW421w7NH
182 | 
183 | - https://www.bilibili.com/video/BV1Ne411m7GX
184 |   
185 | - https://www.bilibili.com/video/BV1HX4y1H7uR
186 |   
187 | - https://www.bilibili.com/video/BV1TgzsYDEnP
188 |   
189 | - https://www.bilibili.com/video/BV1CQ4y1J7zu
190 | 
191 | # Update history 
192 | 
193 | **(Apr 2025) 8,000+ stars!**
194 | This textbook has received 8,000+ stars! Glad that it is helpful to many readers.
195 | 
196 | **(Mar 2025) 7,000+ stars!**
197 | This textbook has received 7,000+ stars! Glad that it is helpful to many readers.
198 | 
199 | **(Feb 2025) 5,000+ stars**
200 | This textbook has received 5,000+ stars! Glad that it is helpful to many readers.
201 | 
202 | **(Dec 2024) 4,000+ stars**
203 | This textbook has received 4,000+ stars! Glad that it is helpful to many readers.
204 | 
205 | **(Oct 2024) Book cover**
206 | 
207 | The design of the book cover is finished. The book will be officially published by Springer early next year. It has been published by Tsinghua University Press.
208 | 
209 | 
210 | **(Sep 2024) Minor update before printing by Springer**
211 | 
212 | I revised some very minor places that readers may hardly notice. It is supposed to be the final version before printing by Springer. 
213 | 
214 | **(Aug 2024) 3000 Stars and more code**
215 | 
216 | The book has received 3000+ stars, which is a great achievement to me. Thanks to everyone. Hope it really helped you.
217 | 
218 | I also received more code implementation from enthusiastic readers. For example, this [GitHub page](https://github.com/SupermanCaozh/The_Coding_Foundation_in_Reinforcement_Learning) provided Python implementation of almost all examples in my book. On the one hand, I am very glad to see that. On the other hand, I am a little worried that my students in my offline class may use the code to do their homework:-). Overall, I am happy because it indicates that the book and open course are really helpful to the readers; Otherwise, they would not bother to develop the code by themselves:-)
219 | 
220 | **(Jun 2024) Minor update before printing**
221 | 
222 | This is the fourth version of the book draft. It is supposed to be the final one before the book is officially published. Specifically, when proofreading the book manuscript, I detected some very minor issues. Together with some reported by enthusiastic readers, they have been revised in this version.
223 | 
224 | **(Apr 2024) Code for the Grid-World Environment**
225 | 
226 | We added the code for the grid-world environment in my book. Interested readers can develop and test their own algorithms in this environment. Both Python and MATLAB versions are provided.
227 | 
228 | Please note that we do not provide the code of all the algorithms involved in the book. That is because they are the homework for the students in offline teaching: the students need to develop their own algorithms using the provided environment. Nevertheless, there are third-party implementations of some algorithms. Interested readers can check the links on the home page of the book.
229 | 
230 | I need to thank my PhD students, Yize Mi and Jianan Li, who are also the Teaching Assistants of my offline teaching. They contributed greatly to the code.
231 | 
232 | You are welcome to provide any feedback about the code such as bugs if detected.
233 | 
234 | **(Mar 2024) 2K stars**
235 | 
236 | The book has received 2K stars. I also received many positive evaluations of the book from many readers. Very glad that it can be helpful. 
237 | 
238 | **(Mar 2024) Minor update**
239 | 
240 | The third version of the draft of the book is online now.
241 | 
242 | Compared to the second version, the third version is improved in the sense that some minor typos have been corrected. Here, I would like to thank the readers who sent me their feedback. 
243 | 
244 | **(Sep 2023) 1000+ stars**
245 | 
246 | The book received 1000+ stars! Thank everybody!
247 | 
248 | **(Aug 2023) Major update - second version**
249 | 
250 | *The second version of the draft of the book is online now!!*
251 | 
252 | Compared to the first version, which was online one year ago, the second version has been improved in various ways. For example, we replotted most of the figures, reorganized some contents to make them clearer, corrected some typos, and added Chapter 10, which was not included in the first version. 
253 | 
254 | I put the first draft of this book online in August 2022. Up to now, I have received valuable feedback from many readers worldwide. I want to express my gratitude to these readers.
255 | 
256 | **(Nov 2022) Will be jointly published**
257 | 
258 | This book will be published *jointly by Springer Nature and Tsinghua University Press*. It will probably be printed in the second half of 2023.
259 | 
260 | I have received some comments and suggestions about this book from some readers. Thanks a lot, and I appreciate it. I am still collecting feedback and will probably revise the draft in several months. Your feedback can make this book more helpful for other readers!
261 | 
262 | **(Oct 2022) Lecture notes and vidoes**
263 | 
264 | The *lecture slides* have been uploaded in the folder "Lecture slides."
265 | 
266 | The *lecture videos* (in Chinese) are online. Please check our Bilibili channel https://space.bilibili.com/2044042934 or the Youtube channel https://www.youtube.com/channel/UCztGtS5YYiNv8x3pj9hLVgg/playlists
267 | 
268 | **(Aug 2022) First draft**
269 | 
270 | The first draft of the book is online.
271 | 


--------------------------------------------------------------------------------
/springerBookCover.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MathFoundationRL/Book-Mathematical-Foundation-of-Reinforcement-Learning/8b70409e752fee2d68772f1c235b6c5726b563d4/springerBookCover.png


--------------------------------------------------------------------------------