# Multi-Objective Reinforcement Learning

<br>

---

<br>

# Introduction

<br>

---

<br>

## RL Problem

- The learning agent is not explicitly told which action to take
- The learning agent determines the best action to maximize long-term rewards and executes it
- The selected action causes the current state of the environment to transition to its successor state
- The agent receives a scalar reward signal that evaluates the effect of this state transition
- The agent learns optimal or near-optimal action policies from such interactions in order to maximize some notion of long-term objective

![](https://i.imgur.com/fNXVuXZ.png)

<br>
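
A minimal, purely illustrative sketch of this interaction loop; the toy environment and the random action choice below are assumptions, not part of any particular RL library:

```python
import random

class SimpleEnv:
    """Toy two-state environment: action 1 reaches the goal state."""
    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        self.state = 1 if action == 1 else 0
        reward = 1.0 if self.state == 1 else 0.0   # scalar reward evaluating the transition
        done = self.state == 1
        return self.state, reward, done

env = SimpleEnv()
state, done = env.reset(), False
while not done:
    action = random.choice([0, 1])           # the agent is not told which action to take
    state, reward, done = env.step(action)   # the environment transitions and emits a reward
```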

---

<br>

### Challenges of RL

Despite many advances in RL theory and algorithms, one remaining challenge is to scale up to larger and more complex problems.
The scaling problem for sequential decision-making mainly includes the following aspects:

1. Large or continuous state or action spaces
1. Hierarchically organized tasks and sub-tasks
1. Solving several tasks with different rewards simultaneously

**Multi-objective reinforcement learning (MORL) problem**

<br>

---

<br>

### Multi-Objective Reinforcement Learning (MORL)

A combination of multi-objective optimization (MOO) and RL techniques for solving sequential decision-making problems with multiple conflicting objectives:

1. Obtain action policies that optimize two or more objectives at the same time
1. Each objective has its own associated reward signal
1. The reward is not a scalar value but a vector
1. Combine the objectives if they are related
1. Optimize the objectives separately if they are completely unrelated
1. Make a trade-off among the conflicting objectives

<br>
65 | 66 | --- 67 | 68 |
69 | 70 | ### Multi-Objective Optimization (MOO) Strategies 71 | > Multi-objective to Single-objective Strategy 72 | 73 | 1. To optimize a scalar value 74 | * Weighted sum method 75 | * Constraint method 76 | * Sequential method 77 | * Max-min method 78 | 79 | ![](https://i.imgur.com/ZLL6EDA.png) 80 | 81 |

> Pareto Strategy

1. Vector-valued utilities
1. Non-inferior and alternative solutions
1. Constitute the Pareto front

![](https://i.imgur.com/MgEaF3f.png)
![](https://i.imgur.com/5SElsBk.png)

<br>
93 | 94 | --- 95 | 96 |

### MORL Algorithms

> Single-policy Approaches

Find the best single policy according to preferences specified by a user or derived from the problem domain

<br>
105 | 106 | > Multiple-policy Approaches 107 | 108 | Find a set of policies that approximate the Pareto front 109 | 110 |
111 | 112 | --- 113 | 114 |

# Background

<br>
119 | 120 | --- 121 | 122 |
123 | 124 | ## Markov Decision Process (MDP) Models 125 | 126 | > A sequential decision-making problem can be formulated as an MDP 127 | 128 | S: The state space of a finite set of states 129 | 130 | A: The action space of a finite set of actions 131 | 132 | R: The reward function 133 | 134 | P: The matrix of state transition probability 135 | 136 |
137 | 138 | > Objective functions 139 | 140 | 1. Discounted reward criteria 141 | 1. Average reward criteria 142 | 143 | ![](https://i.imgur.com/b5E0sVi.png) 144 | 145 |
146 | 147 | --- 148 | 149 |
150 | 151 | ## MDP Objective Functions 152 | 153 | > Discounted Reward Criteria 154 | > 155 | ![](https://i.imgur.com/3IUIPZJ.png) 156 | 157 | > Average Reward Criteria 158 | > 159 | ![](https://i.imgur.com/OcIPSiN.png) 160 | 161 |
162 | 163 | --- 164 | 165 |
166 | 167 | ## Basic RL Algorithms 168 | 169 | 1. RL algorithms integrate the techniques of Monte Carlo, stochastic approximation, and function approximation to obtain approximate solutions of MDPs 170 | 1. As a central mechanism of RL, temporal-difference (TD) learning can be viewed as a combination of Monte Carlo and DP 171 | 1. TD algorithms can learn the value functions using state transition data without model information 172 | - Similar to Monte Carlo methods 173 | 1. TD methods can update the current estimation of value functions partially based on previous learned results 174 | - Similar to DP 175 | 176 |
177 | 178 | > Discounted Reward Criteria Q-Learning Algorithm 179 | 180 | ![](https://i.imgur.com/tlGXEUV.png) 181 | 182 |
183 | 184 | > Average Reward Criteria R-Learning Algorithm 185 | 186 | ![](https://i.imgur.com/GqSppMn.png) 187 | 188 |
189 | 190 | --- 191 | 192 |

## MOO Problems

Maximize all the elements of the objective vector, in the sense of Pareto optimality or of a weighted scalarization, while satisfying the constraint functions

![](https://i.imgur.com/zxDNq3O.png)

<br>
201 | 202 | --- 203 | 204 |

## MOO Optimal Solutions

> Multi-objective to Single-objective Strategy

Solutions can be obtained by solving a single-objective optimization (SOO) problem

> Pareto Dominance and Pareto Front

Find all the non-dominated solutions instead of the dominated ones
Practically, find a set of solutions that approximates the real Pareto front

![](https://i.imgur.com/SZEybJ8.png)

<br>
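
A brute-force sketch of Pareto dominance and of filtering a finite candidate set down to its non-dominated solutions (maximization in every objective); the candidate points are illustrative:

```python
import numpy as np

def dominates(u, v):
    """u dominates v if u is at least as good in every objective and strictly better in one."""
    u, v = np.asarray(u), np.asarray(v)
    return bool(np.all(u >= v) and np.any(u > v))

def pareto_front(points):
    """Keep only the non-dominated points (a brute-force O(n^2) filter)."""
    return [p for i, p in enumerate(points)
            if not any(dominates(q, p) for j, q in enumerate(points) if j != i)]

candidates = [(1.0, 4.0), (2.0, 3.0), (1.5, 3.5), (0.5, 0.5)]
print(pareto_front(candidates))   # (0.5, 0.5) is dominated and dropped
```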
220 | 221 | --- 222 | 223 |

# MORL Problem

<br>
228 | 229 | --- 230 | 231 |
232 | 233 | ## Basic Architecture 234 | 235 | Maximize the Pareto optimality or the weighted scalar of all the elements and satisfy the constraint functions 236 | 237 | ![](https://i.imgur.com/DPj9uOM.png) 238 | 239 | Vectored state-action value function 240 | 241 | ![](https://i.imgur.com/7i0qJRK.png) 242 | 243 |
244 | 245 | --- 246 | 247 |
248 | 249 | ## Major Research Topics 250 | 251 | 1. MORL is a highly interdisciplinary field and it refers to the integration of MOO methods and RL techniques to solve sequential decision making problems with multiple conflicting objectives 252 | 1. The related disciplines of MORL include artificial intelligence, decision and optimization theory, operations research, control theory, and so on 253 | 1. MORL suitably represents the designer’s preferences or ensure the optimization priority with some policies in the Pareto front 254 | 1. Design efficient MORL algorithms 255 | 256 |
257 | 258 | --- 259 | 260 |
261 | 262 | # Representative Approaches to MORL 263 | 264 |
265 | 266 | --- 267 | 268 |
269 | 270 | ## MORL Approaches 271 | 272 | > Single-policy Approaches 273 | 274 | 1. Weighted sum approach 275 | 1. W-learning 276 | 1. Analytic hierarchy process (AHP) approach 277 | 1. Ranking approach 278 | 1. Geometric approach 279 | 280 |
281 | 282 | > Multiple-policy Approaches 283 | 284 | 1. Convex hull approach 285 | 1. Varying parameter approach 286 | 287 |
288 | 289 | --- 290 | 291 |
292 | 293 | ## Weighted Sum Approach 294 | 295 | > GM-Sarsa(0) 296 | 297 | 1. Sum up the Q-values for all the objectives to estimate the combined Q-function 298 | 1. The updates are based on the actually selected actions rather than the best action determined by the value function 299 | 1. Has smaller errors between the estimated Q-values and the true Q-values 300 | 301 | ![](https://i.imgur.com/Po5vLN4.png) 302 | 303 |
304 | 305 | > Weighted sum approach 306 | 307 | ![](https://i.imgur.com/xglNRgY.png) 308 | 309 |
310 | 311 | --- 312 | 313 |
314 | 315 | ## W-Learning Approach 316 | 317 | > Top-Q method to compute W values 318 | 319 | * Assign the W value as the highest Q-value among all the objectives in the current state 320 | 321 | ![](https://i.imgur.com/aUZBUbz.png) 322 | 323 | * Synthetic objective function for the Top-Q approach 324 | 325 | ![](https://i.imgur.com/R0RWYQw.png) 326 | 327 | * The objective with the highest Q-value may have similar priorities for different actions, while other objectives cannot be satisfied due to their low action values 328 | * A change in reward scaling or the design of reward functions may greatly influence the results of the winner-take-all contest 329 | 330 |
331 | 332 | > W-Learning 333 | 334 | * All the W values, except the highest W value, are updated after the action of each step is selected and executed 335 | 336 | ![](https://i.imgur.com/Wxc9UNh.png) 337 | 338 |
339 | 340 | > Negotiated W-Learning 341 | 342 | * Explicitly find that if an objective is not 343 | * preferred to determine the next action 344 | * Might lose the most long-term reward 345 | 346 | ![](https://i.imgur.com/0dZKUjT.png) 347 | 348 |
349 | 350 | --- 351 | 352 |
353 | 354 | ## Analytic Hierarchy Process (AHP) Approach 355 | 356 | * The designer of MORL algorithms may not have enough prior knowledge about the optimization problem 357 | * The degree of relative importance between two objectives can be quantified by L grades, and a scalar value is defined for each grade 358 | * Requires a lot of prior knowledge of the problem domain 359 | 360 | Relative importance matrix 361 | 362 | ![](https://i.imgur.com/4WoDZU4.png) 363 | 364 | Importance factor 365 | 366 | ![](https://i.imgur.com/17KRCkG.png) 367 | 368 | Value of improvement 369 | 370 | ![](https://i.imgur.com/FWJR3vM.png) 371 | 372 | Fuzzy inference system: compute the goodness of 𝑎~𝑝~ relative to 𝑎~𝑞~ 373 | 374 |
375 | 376 | --- 377 | 378 |
379 | 380 | ## Ranking Approach 381 | 382 | * Also called the sequential approach or the threshold approach 383 | * Ensure the effectiveness of the subordinate objective 384 | * Threshold values were specified for some objectives in order to put the constraints on the objectives 385 | 386 | ![](https://i.imgur.com/1psjKrM.png) 387 | 388 | ![](https://i.imgur.com/sX9UgY3.png) 389 | 390 |
391 | 392 | --- 393 | 394 |
395 | 396 | ## Geometric Approach 397 | 398 | * Deal with dynamic unknown Markovian environments with long-term average reward vectors 399 | * Assume that actions of other agents may influence the dynamics of the environment and the game is irreducible or ergodic 400 | 401 | > Multiple directions RL (MDRL) and single direction RL (SDRL) 402 | 403 | Approximate a desired target set in a multidimensional objective space 404 | 405 | ![](https://i.imgur.com/NXUotPp.png) 406 | 407 |
408 | 409 | --- 410 | 411 |
412 | 413 | ## Convex Hull Approach 414 | 415 | * Simultaneously learn optimal policies for all linear preference assignments in the objective space 416 | * Can find the optimal policy for any linear preference function 417 | * Since multiple policies are learned at once, the integrated RL algorithms should be off-policy algorithms 418 | 419 | Definition 1: Translation and scaling operations 420 | 421 | ![](https://i.imgur.com/c9kteMr.png) 422 | 423 | Definition 2: Summing two convex hulls 424 | 425 | ![](https://i.imgur.com/2oZoLCC.png) 426 | 427 |
428 | 429 | ![](https://i.imgur.com/krUuxse.png) 430 | 431 |
432 | 433 | --- 434 | 435 |
436 | 437 | ## Varying Parameter Approach 438 | 439 | * A multiple-policy approach can be realized by performing multiple runs with different parameters, objective thresholds, and orderings in any single-policy algorithm 440 | 441 | > Policy gradient methods and the idea of varying parameters 442 | 443 | * Estimate multiple policy gradients for each objective 444 | * Vary the weights of the objective gradients to find multiple policies 445 | 446 |
447 | 448 | --- 449 | 450 |
451 | 452 | ## Summary 453 | 454 | > Multi-objective fitted Q-iteration (FQI) 455 | 456 | Find control policies for all the linear combinations of preferences assigned to the objectives in a single training procedure 457 | 458 | ![](https://i.imgur.com/2qNRsIK.png) 459 | 460 |
461 | 462 | --- 463 | 464 |
465 | 466 | # Important Directions of Recent Research on MORL 467 | 468 |
469 | 470 | --- 471 | 472 |
473 | 474 | ## Further Development of MORL Approaches 475 | 476 | > To obtain suitable representations of the preferences and improve the efficiency of MORL algorithms 477 | 478 | * Estimation of distribution algorithms (EDA) 479 | * Incorporate the notions in evolutionary MOO 480 | * Acquire various strategies by a single run 481 | * Learning classifier system 482 | * The choice of action-selection policies can greatly affect the performance of the learning system 483 | * α-domination strategy 484 | * Use a goal-directed bias based on the achievement level of each evaluation 485 | * Parallel genetic algorithm (PGA) 486 | * Evolve a neuro-controller 487 | * Perturbation stochastic approximation (SPSA) was used to improve the convergence 488 | * Adaptive margins 489 | 490 | > To obtain Pareto optimal policies in large or continuous spaces 491 | 492 | * Multi-agent framework 493 | * ie. Traffic signal control 494 | * Fitted Q-iteration (FQI) 495 | * Approximate the Pareto front 496 | * Consistency multi-objective dynamic programming 497 | 498 |
499 | 500 | The small scale of previous MORL problems may not verify the algorithm’s performance in dealing with a wide range of different problem settings, and the algorithm implementations always require much prior knowledge about the problem domain 501 | 502 | --------------------------------------------------------------------------------