# Multi-Objective Reinforcement Learning

<br>

---

<br>

# Introduction

<br>

---

<br>

## RL Problem

- The learning agent is not explicitly told which action to take
- The learning agent determines the best action to maximize long-term rewards and executes it
- The selected action causes the current state of the environment to transition to its successor state
- The agent receives a scalar reward signal that evaluates the effect of this state transition
- The agent learns optimal or near-optimal action policies from such interactions in order to maximize some notion of long-term objective

![](https://i.imgur.com/fNXVuXZ.png)

<br>
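
A minimal, purely illustrative sketch of this interaction loop; the toy environment and the random action choice below are assumptions, not part of any particular RL library:

```python
import random

class SimpleEnv:
    """Toy two-state environment: action 1 reaches the goal state."""
    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        self.state = 1 if action == 1 else 0
        reward = 1.0 if self.state == 1 else 0.0   # scalar reward evaluating the transition
        done = self.state == 1
        return self.state, reward, done

env = SimpleEnv()
state, done = env.reset(), False
while not done:
    action = random.choice([0, 1])           # the agent is not told which action to take
    state, reward, done = env.step(action)   # the environment transitions and emits a reward
```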

---

<br>

### Challenges of RL

Despite many advances in RL theory and algorithms, one remaining challenge is to scale up to larger and more complex problems.
The scaling problem for sequential decision-making mainly includes the following aspects:

1. Large or continuous state or action spaces
1. Hierarchically organized tasks and sub-tasks
1. Solving several tasks with different rewards simultaneously

**Multi-objective reinforcement learning (MORL) problem**

<br>

---

<br>

### Multi-Objective Reinforcement Learning (MORL)

A combination of multi-objective optimization (MOO) and RL techniques for solving sequential decision-making problems with multiple conflicting objectives:

1. Obtain action policies that optimize two or more objectives at the same time
1. Each objective has its own associated reward signal
1. The reward is not a scalar value but a vector
1. Combine the objectives if they are related
1. Optimize the objectives separately if they are completely unrelated
1. Make a trade-off among the conflicting objectives

<br>
65 | 66 | --- 67 | 68 |
69 | 70 | ### Multi-Objective Optimization (MOO) Strategies 71 | > Multi-objective to Single-objective Strategy 72 | 73 | 1. To optimize a scalar value 74 | * Weighted sum method 75 | * Constraint method 76 | * Sequential method 77 | * Max-min method 78 | 79 | ![](https://i.imgur.com/ZLL6EDA.png) 80 | 81 |

> Pareto Strategy

1. Vector-valued utilities
1. Non-inferior and alternative solutions
1. Constitute the Pareto front

![](https://i.imgur.com/MgEaF3f.png)
![](https://i.imgur.com/5SElsBk.png)

<br>
93 | 94 | --- 95 | 96 |

### MORL Algorithms

> Single-policy Approaches

Find the best single policy according to preferences specified by a user or derived from the problem domain

<br>
105 | 106 | > Multiple-policy Approaches 107 | 108 | Find a set of policies that approximate the Pareto front 109 | 110 |
111 | 112 | --- 113 | 114 |

# Background

<br>
119 | 120 | --- 121 | 122 |
123 | 124 | ## Markov Decision Process (MDP) Models 125 | 126 | > A sequential decision-making problem can be formulated as an MDP 127 | 128 | S: The state space of a finite set of states 129 | 130 | A: The action space of a finite set of actions 131 | 132 | R: The reward function 133 | 134 | P: The matrix of state transition probability 135 | 136 |
137 | 138 | > Objective functions 139 | 140 | 1. Discounted reward criteria 141 | 1. Average reward criteria 142 | 143 | ![](https://i.imgur.com/b5E0sVi.png) 144 | 145 |
146 | 147 | --- 148 | 149 |
150 | 151 | ## MDP Objective Functions 152 | 153 | > Discounted Reward Criteria 154 | > 155 | ![](https://i.imgur.com/3IUIPZJ.png) 156 | 157 | > Average Reward Criteria 158 | > 159 | ![](https://i.imgur.com/OcIPSiN.png) 160 | 161 |
162 | 163 | --- 164 | 165 |
166 | 167 | ## Basic RL Algorithms 168 | 169 | 1. RL algorithms integrate the techniques of Monte Carlo, stochastic approximation, and function approximation to obtain approximate solutions of MDPs 170 | 1. As a central mechanism of RL, temporal-difference (TD) learning can be viewed as a combination of Monte Carlo and DP 171 | 1. TD algorithms can learn the value functions using state transition data without model information 172 | - Similar to Monte Carlo methods 173 | 1. TD methods can update the current estimation of value functions partially based on previous learned results 174 | - Similar to DP 175 | 176 |
177 | 178 | > Discounted Reward Criteria Q-Learning Algorithm 179 | 180 | ![](https://i.imgur.com/tlGXEUV.png) 181 | 182 |
183 | 184 | > Average Reward Criteria R-Learning Algorithm 185 | 186 | ![](https://i.imgur.com/GqSppMn.png) 187 | 188 |
189 | 190 | --- 191 | 192 |

## MOO Problems

Maximize all the elements of the objective vector, in the sense of Pareto optimality or of a weighted scalarization, while satisfying the constraint functions

![](https://i.imgur.com/zxDNq3O.png)

<br>
201 | 202 | --- 203 | 204 |

## MOO Optimal Solutions

> Multi-objective to Single-objective Strategy

Solutions can be obtained by solving a single-objective optimization (SOO) problem

> Pareto Dominance and Pareto Front

Find all the non-dominated solutions instead of the dominated ones
Practically, find a set of solutions that approximates the real Pareto front

![](https://i.imgur.com/SZEybJ8.png)

<br>
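
A brute-force sketch of Pareto dominance and of filtering a finite candidate set down to its non-dominated solutions (maximization in every objective); the candidate points are illustrative:

```python
import numpy as np

def dominates(u, v):
    """u dominates v if u is at least as good in every objective and strictly better in one."""
    u, v = np.asarray(u), np.asarray(v)
    return bool(np.all(u >= v) and np.any(u > v))

def pareto_front(points):
    """Keep only the non-dominated points (a brute-force O(n^2) filter)."""
    return [p for i, p in enumerate(points)
            if not any(dominates(q, p) for j, q in enumerate(points) if j != i)]

candidates = [(1.0, 4.0), (2.0, 3.0), (1.5, 3.5), (0.5, 0.5)]
print(pareto_front(candidates))   # (0.5, 0.5) is dominated and dropped
```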
220 | 221 | --- 222 | 223 |

# MORL Problem

<br>
228 | 229 | --- 230 | 231 |
232 | 233 | ## Basic Architecture 234 | 235 | Maximize the Pareto optimality or the weighted scalar of all the elements and satisfy the constraint functions 236 | 237 | ![](https://i.imgur.com/DPj9uOM.png) 238 | 239 | Vectored state-action value function 240 | 241 | ![](https://i.imgur.com/7i0qJRK.png) 242 | 243 |
244 | 245 | --- 246 | 247 |
248 | 249 | ## Major Research Topics 250 | 251 | 1. MORL is a highly interdisciplinary field and it refers to the integration of MOO methods and RL techniques to solve sequential decision making problems with multiple conflicting objectives 252 | 1. The related disciplines of MORL include artificial intelligence, decision and optimization theory, operations research, control theory, and so on 253 | 1. MORL suitably represents the designer’s preferences or ensure the optimization priority with some policies in the Pareto front 254 | 1. Design efficient MORL algorithms 255 | 256 |
257 | 258 | --- 259 | 260 |
261 | 262 | # Representative Approaches to MORL 263 | 264 |
265 | 266 | --- 267 | 268 |
269 | 270 | ## MORL Approaches 271 | 272 | > Single-policy Approaches 273 | 274 | 1. Weighted sum approach 275 | 1. W-learning 276 | 1. Analytic hierarchy process (AHP) approach 277 | 1. Ranking approach 278 | 1. Geometric approach 279 | 280 |
281 | 282 | > Multiple-policy Approaches 283 | 284 | 1. Convex hull approach 285 | 1. Varying parameter approach 286 | 287 |
288 | 289 | --- 290 | 291 |
292 | 293 | ## Weighted Sum Approach 294 | 295 | > GM-Sarsa(0) 296 | 297 | 1. Sum up the Q-values for all the objectives to estimate the combined Q-function 298 | 1. The updates are based on the actually selected actions rather than the best action determined by the value function 299 | 1. Has smaller errors between the estimated Q-values and the true Q-values 300 | 301 | ![](https://i.imgur.com/Po5vLN4.png) 302 | 303 |
304 | 305 | > Weighted sum approach 306 | 307 | ![](https://i.imgur.com/xglNRgY.png) 308 | 309 |
310 | 311 | --- 312 | 313 |
314 | 315 | ## W-Learning Approach 316 | 317 | > Top-Q method to compute W values 318 | 319 | * Assign the W value as the highest Q-value among all the objectives in the current state 320 | 321 | ![](https://i.imgur.com/aUZBUbz.png) 322 | 323 | * Synthetic objective function for the Top-Q approach 324 | 325 | ![](https://i.imgur.com/R0RWYQw.png) 326 | 327 | * The objective with the highest Q-value may have similar priorities for different actions, while other objectives cannot be satisfied due to their low action values 328 | * A change in reward scaling or the design of reward functions may greatly influence the results of the winner-take-all contest 329 | 330 |
331 | 332 | > W-Learning 333 | 334 | * All the W values, except the highest W value, are updated after the action of each step is selected and executed 335 | 336 | ![](https://i.imgur.com/Wxc9UNh.png) 337 | 338 |
339 | 340 | > Negotiated W-Learning 341 | 342 | * Explicitly find that if an objective is not 343 | * preferred to determine the next action 344 | * Might lose the most long-term reward 345 | 346 | ![](https://i.imgur.com/0dZKUjT.png) 347 | 348 |
349 | 350 | --- 351 | 352 |
353 | 354 | ## Analytic Hierarchy Process (AHP) Approach 355 | 356 | * The designer of MORL algorithms may not have enough prior knowledge about the optimization problem 357 | * The degree of relative importance between two objectives can be quantified by L grades, and a scalar value is defined for each grade 358 | * Requires a lot of prior knowledge of the problem domain 359 | 360 | Relative importance matrix 361 | 362 | ![](https://i.imgur.com/4WoDZU4.png) 363 | 364 | Importance factor 365 | 366 | ![](https://i.imgur.com/17KRCkG.png) 367 | 368 | Value of improvement 369 | 370 | ![](https://i.imgur.com/FWJR3vM.png) 371 | 372 | Fuzzy inference system: compute the goodness of 𝑎~𝑝~ relative to 𝑎~𝑞~ 373 | 374 |
375 | 376 | --- 377 | 378 |
379 | 380 | ## Ranking Approach 381 | 382 | * Also called the sequential approach or the threshold approach 383 | * Ensure the effectiveness of the subordinate objective 384 | * Threshold values were specified for some objectives in order to put the constraints on the objectives 385 | 386 | ![](https://i.imgur.com/1psjKrM.png) 387 | 388 | ![](https://i.imgur.com/sX9UgY3.png) 389 | 390 |
391 | 392 | --- 393 | 394 |
395 | 396 | ## Geometric Approach 397 | 398 | * Deal with dynamic unknown Markovian environments with long-term average reward vectors 399 | * Assume that actions of other agents may influence the dynamics of the environment and the game is irreducible or ergodic 400 | 401 | > Multiple directions RL (MDRL) and single direction RL (SDRL) 402 | 403 | Approximate a desired target set in a multidimensional objective space 404 | 405 | ![](https://i.imgur.com/NXUotPp.png) 406 | 407 |
408 | 409 | --- 410 | 411 |
412 | 413 | ## Convex Hull Approach 414 | 415 | * Simultaneously learn optimal policies for all linear preference assignments in the objective space 416 | * Can find the optimal policy for any linear preference function 417 | * Since multiple policies are learned at once, the integrated RL algorithms should be off-policy algorithms 418 | 419 | Definition 1: Translation and scaling operations 420 | 421 | ![](https://i.imgur.com/c9kteMr.png) 422 | 423 | Definition 2: Summing two convex hulls 424 | 425 | ![](https://i.imgur.com/2oZoLCC.png) 426 | 427 |
428 | 429 | ![](https://i.imgur.com/krUuxse.png) 430 | 431 |
432 | 433 | --- 434 | 435 |
436 | 437 | ## Varying Parameter Approach 438 | 439 | * A multiple-policy approach can be realized by performing multiple runs with different parameters, objective thresholds, and orderings in any single-policy algorithm 440 | 441 | > Policy gradient methods and the idea of varying parameters 442 | 443 | * Estimate multiple policy gradients for each objective 444 | * Vary the weights of the objective gradients to find multiple policies 445 | 446 |
447 | 448 | --- 449 | 450 |
451 | 452 | ## Summary 453 | 454 | > Multi-objective fitted Q-iteration (FQI) 455 | 456 | Find control policies for all the linear combinations of preferences assigned to the objectives in a single training procedure 457 | 458 | ![](https://i.imgur.com/2qNRsIK.png) 459 | 460 |
461 | 462 | --- 463 | 464 |
465 | 466 | # Important Directions of Recent Research on MORL 467 | 468 |
469 | 470 | --- 471 | 472 |
473 | 474 | ## Further Development of MORL Approaches 475 | 476 | > To obtain suitable representations of the preferences and improve the efficiency of MORL algorithms 477 | 478 | * Estimation of distribution algorithms (EDA) 479 | * Incorporate the notions in evolutionary MOO 480 | * Acquire various strategies by a single run 481 | * Learning classifier system 482 | * The choice of action-selection policies can greatly affect the performance of the learning system 483 | * α-domination strategy 484 | * Use a goal-directed bias based on the achievement level of each evaluation 485 | * Parallel genetic algorithm (PGA) 486 | * Evolve a neuro-controller 487 | * Perturbation stochastic approximation (SPSA) was used to improve the convergence 488 | * Adaptive margins 489 | 490 | > To obtain Pareto optimal policies in large or continuous spaces 491 | 492 | * Multi-agent framework 493 | * ie. Traffic signal control 494 | * Fitted Q-iteration (FQI) 495 | * Approximate the Pareto front 496 | * Consistency multi-objective dynamic programming 497 | 498 |
499 | 500 | The small scale of previous MORL problems may not verify the algorithm’s performance in dealing with a wide range of different problem settings, and the algorithm implementations always require much prior knowledge about the problem domain 501 | 502 | --------------------------------------------------------------------------------