1 |
2 |
# Multi-Objective Reinforcement Learning
4 |
5 |
6 |
7 | ---
8 |
9 |
10 |
11 | # Introduction
12 |
13 |
14 |
15 | ---
16 |
17 |
18 |
19 | ## RL problem
20 |
- The learning agent is not explicitly told which action to take
- The learning agent determines the best action to maximize long-term rewards and executes it
- The selected action causes the current state of the environment to transition to its successor state
- The agent receives a scalar reward signal that evaluates the effect of this state transition
- The agent learns optimal or near-optimal action policies from such interactions in order to maximize some notion of long-term objective
26 |
27 | 
28 |
29 |
30 |
31 | ---
32 |
33 |
34 |
35 |
36 | ### Challenges of RL
37 |
Despite many advances in RL theory and algorithms, one remaining challenge is scaling up to larger and more complex problems.
39 | The scaling problem for sequential decision-making mainly includes the following aspects:
40 |
41 | 1. Large or continuous state or action space
42 | 1. Hierarchically organized tasks and sub-tasks
1. Several tasks with different rewards that must be solved simultaneously
44 |
45 | **Multi-objective reinforcement learning (MORL) problem**
46 |
47 |
48 |
49 | ---
50 |
51 |
52 |
53 | ### Multi-Objective Reinforcement Learning (MORL)
54 |
A combination of multi-objective optimization (MOO) and RL techniques for solving sequential decision-making problems with multiple conflicting objectives
56 |
1. Obtain action policies that optimize two or more objectives at the same time
58 | 1. Each objective has its own associated reward signal
59 | 1. The reward is not a scalar value but a vector
60 | 1. Combine the objectives if they are related
61 | 1. Optimize the objectives separately if they are completely unrelated
62 | 1. Make a trade-off among the conflicting objectives
63 |
64 |
65 |
66 | ---
67 |
68 |
69 |
70 | ### Multi-Objective Optimization (MOO) Strategies
71 | > Multi-objective to Single-objective Strategy
72 |
1. Convert the objectives into a single scalar value to optimize
74 | * Weighted sum method
75 | * Constraint method
76 | * Sequential method
77 | * Max-min method
78 |
79 | 
80 |
81 |
82 |
83 | > Pareto Strategy
84 |
85 | 1. Vector-valued utilities
86 | 1. Non-inferior and alternative solutions
87 | 1. Constitute the Pareto front
88 |
89 | 
90 | 
91 |
92 |
93 |
94 | ---
95 |
96 |
97 |
98 | ### MORL algorithms
99 |
100 | > Single-policy Approaches
101 |
Find the best single policy for preferences specified by a user or derived from the problem domain
103 |
104 |
105 |
106 | > Multiple-policy Approaches
107 |
108 | Find a set of policies that approximate the Pareto front
109 |
110 |
111 |
112 | ---
113 |
114 |
115 |
116 | # Backgrounds
117 |
118 |
119 |
120 | ---
121 |
122 |
123 |
124 | ## Markov Decision Process (MDP) Models
125 |
126 | > A sequential decision-making problem can be formulated as an MDP
127 |
S: The state space, a finite set of states

A: The action space, a finite set of actions

R: The reward function

P: The state transition probability matrix
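
A minimal container for such a finite MDP might look as follows (shapes and the `FiniteMDP` name are illustrative assumptions, not from the source):

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class FiniteMDP:
    n_states: int        # |S|: number of states
    n_actions: int       # |A|: number of actions
    R: np.ndarray        # reward function, shape (|S|, |A|)
    P: np.ndarray        # state transition probabilities, shape (|S|, |A|, |S|)
    gamma: float = 0.95  # discount factor (used by the discounted reward criterion)
```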
135 |
136 |
137 |
138 | > Objective functions
139 |
140 | 1. Discounted reward criteria
141 | 1. Average reward criteria
142 |
143 | 
144 |
145 |
146 |
147 | ---
148 |
149 |
150 |
151 | ## MDP Objective Functions
152 |
153 | > Discounted Reward Criteria
154 | >
155 | 
156 |
157 | > Average Reward Criteria
158 | >
159 | 
160 |
161 |
162 |
163 | ---
164 |
165 |
166 |
167 | ## Basic RL Algorithms
168 |
1. RL algorithms integrate the techniques of Monte Carlo methods, stochastic approximation, and function approximation to obtain approximate solutions of MDPs
1. As a central mechanism of RL, temporal-difference (TD) learning can be viewed as a combination of Monte Carlo methods and dynamic programming (DP)
1. TD algorithms can learn the value functions from state transition data without model information
   - Similar to Monte Carlo methods
1. TD methods can update the current estimate of the value functions partly based on previously learned results
   - Similar to DP
175 |
176 |
177 |
178 | > Discounted Reward Criteria Q-Learning Algorithm
179 |
180 | 
181 |
182 |
183 |
184 | > Average Reward Criteria R-Learning Algorithm
185 |
186 | 
187 |
188 |
189 |
190 | ---
191 |
192 |
193 |
194 | ## MOO Problems
195 |
Maximize the objectives in the Pareto sense, or a weighted scalarization of all the objectives, while satisfying the constraint functions
197 |
198 | 
199 |
200 |
201 |
202 | ---
203 |
204 |
205 |
206 | ## MOO Optimal Solutions
207 |
208 | > Multi-objective to Single-objective Strategy
209 |
Solutions can be obtained by solving a single-objective optimization (SOO) problem
211 |
212 | > Pareto Dominance and Pareto Front
213 |
Find all the non-dominated solutions instead of the dominated ones
In practice, find a set of solutions that approximates the true Pareto front
216 |
217 | 
218 |
219 |
220 |
221 | ---
222 |
223 |
224 |
225 | # MORL Problem
226 |
227 |
228 |
229 | ---
230 |
231 |
232 |
233 | ## Basic Architecture
234 |
The MORL agent maximizes the objectives in the Pareto sense, or a weighted scalarization of all the objectives, while satisfying the constraint functions
236 |
237 | 
238 |
Vector-valued state-action value function
240 |
241 | 
242 |
243 |
244 |
245 | ---
246 |
247 |
248 |
249 | ## Major Research Topics
250 |
1. MORL is a highly interdisciplinary field that integrates MOO methods and RL techniques to solve sequential decision-making problems with multiple conflicting objectives
1. The related disciplines of MORL include artificial intelligence, decision and optimization theory, operations research, control theory, and so on
1. Suitably represent the designer's preferences, or ensure the optimization priority, with some policies in the Pareto front
1. Design efficient MORL algorithms
255 |
256 |
257 |
258 | ---
259 |
260 |
261 |
262 | # Representative Approaches to MORL
263 |
264 |
265 |
266 | ---
267 |
268 |
269 |
270 | ## MORL Approaches
271 |
272 | > Single-policy Approaches
273 |
274 | 1. Weighted sum approach
275 | 1. W-learning
276 | 1. Analytic hierarchy process (AHP) approach
277 | 1. Ranking approach
278 | 1. Geometric approach
279 |
280 |
281 |
282 | > Multiple-policy Approaches
283 |
284 | 1. Convex hull approach
285 | 1. Varying parameter approach
286 |
287 |
288 |
289 | ---
290 |
291 |
292 |
293 | ## Weighted Sum Approach
294 |
295 | > GM-Sarsa(0)
296 |
297 | 1. Sum up the Q-values for all the objectives to estimate the combined Q-function
298 | 1. The updates are based on the actually selected actions rather than the best action determined by the value function
1. Yields smaller errors between the estimated Q-values and the true Q-values
300 |
301 | 
302 |
303 |
304 |
305 | > Weighted sum approach
306 |
307 | 
308 |
309 |
310 |
311 | ---
312 |
313 |
314 |
315 | ## W-Learning Approach
316 |
317 | > Top-Q method to compute W values
318 |
319 | * Assign the W value as the highest Q-value among all the objectives in the current state
320 |
321 | 
322 |
323 | * Synthetic objective function for the Top-Q approach
324 |
325 | 
326 |
327 | * The objective with the highest Q-value may have similar priorities for different actions, while other objectives cannot be satisfied due to their low action values
328 | * A change in reward scaling or the design of reward functions may greatly influence the results of the winner-take-all contest
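
A sketch of the Top-Q winner-take-all selection, assuming one Q-table per objective in a list `Q_list` (illustrative names):

```python
import numpy as np

def top_q_action(Q_list, s):
    """W_i(s) is the highest Q-value of objective i in state s; the objective with
    the largest W value wins the contest and its greedy action is executed."""
    W = [np.max(Q[s]) for Q in Q_list]        # per-objective W values
    winner = int(np.argmax(W))                # winner-take-all contest
    return int(np.argmax(Q_list[winner][s]))  # greedy action of the winning objective
```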
329 |
330 |
331 |
332 | > W-Learning
333 |
334 | * All the W values, except the highest W value, are updated after the action of each step is selected and executed
335 |
336 | 
337 |
338 |
339 |
340 | > Negotiated W-Learning
341 |
* Explicitly find the objective that would lose the most long-term reward if it is not preferred when determining the next action
345 |
346 | 
347 |
348 |
349 |
350 | ---
351 |
352 |
353 |
354 | ## Analytic Hierarchy Process (AHP) Approach
355 |
356 | * The designer of MORL algorithms may not have enough prior knowledge about the optimization problem
357 | * The degree of relative importance between two objectives can be quantified by L grades, and a scalar value is defined for each grade
358 | * Requires a lot of prior knowledge of the problem domain
359 |
360 | Relative importance matrix
361 |
362 | 
363 |
364 | Importance factor
365 |
366 | 
367 |
368 | Value of improvement
369 |
370 | 
371 |
372 | Fuzzy inference system: compute the goodness of 𝑎~𝑝~ relative to 𝑎~𝑞~
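
As a generic illustration of the first ingredient only (standard AHP-style weighting, not necessarily the exact computation used in the MORL AHP approach), importance factors can be derived from a pairwise relative-importance matrix like so:

```python
import numpy as np

# Hypothetical pairwise relative-importance matrix C, where C[i, j] quantifies how
# much more important objective i is than objective j (one of the L grades).
C = np.array([[1.0,   3.0],
              [1/3.0, 1.0]])

# AHP-style importance factors: normalize each column, then average across the rows.
col_normalized = C / C.sum(axis=0)
importance = col_normalized.mean(axis=1)
print(importance)  # array([0.75, 0.25]) for the matrix above
```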
373 |
374 |
375 |
376 | ---
377 |
378 |
379 |
380 | ## Ranking Approach
381 |
382 | * Also called the sequential approach or the threshold approach
383 | * Ensure the effectiveness of the subordinate objective
* Threshold values are specified for some objectives in order to place constraints on them
385 |
386 | 
387 |
388 | 
389 |
390 |
391 |
392 | ---
393 |
394 |
395 |
396 | ## Geometric Approach
397 |
398 | * Deal with dynamic unknown Markovian environments with long-term average reward vectors
* Assume that actions of other agents may influence the dynamics of the environment and that the game is irreducible or ergodic
400 |
401 | > Multiple directions RL (MDRL) and single direction RL (SDRL)
402 |
403 | Approximate a desired target set in a multidimensional objective space
404 |
405 | 
406 |
407 |
408 |
409 | ---
410 |
411 |
412 |
413 | ## Convex Hull Approach
414 |
415 | * Simultaneously learn optimal policies for all linear preference assignments in the objective space
416 | * Can find the optimal policy for any linear preference function
417 | * Since multiple policies are learned at once, the integrated RL algorithms should be off-policy algorithms
418 |
419 | Definition 1: Translation and scaling operations
420 |
421 | 
422 |
423 | Definition 2: Summing two convex hulls
424 |
425 | 
426 |
427 |
428 |
429 | 
430 |
431 |
432 |
433 | ---
434 |
435 |
436 |
437 | ## Varying Parameter Approach
438 |
* A multiple-policy approach can be realized by performing multiple runs of any single-policy algorithm with different parameters, objective thresholds, and orderings
440 |
441 | > Policy gradient methods and the idea of varying parameters
442 |
443 | * Estimate multiple policy gradients for each objective
444 | * Vary the weights of the objective gradients to find multiple policies
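
This idea can be sketched as an outer loop over preference weights that re-runs a single-policy learner for each setting; `train_single_policy` is a placeholder for any scalarized MORL algorithm (two objectives assumed):

```python
import numpy as np

def varying_parameter_search(train_single_policy, n_weights=11):
    """Obtain multiple policies by sweeping the preference weights of a
    single-policy method (two objectives assumed for illustration)."""
    policies = []
    for w1 in np.linspace(0.0, 1.0, n_weights):
        w = np.array([w1, 1.0 - w1])                  # one preference setting
        policies.append((w, train_single_policy(w)))  # one full run per setting
    return policies                                   # a set of trade-off policies
```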
445 |
446 |
447 |
448 | ---
449 |
450 |
451 |
452 | ## Summary
453 |
454 | > Multi-objective fitted Q-iteration (FQI)
455 |
456 | Find control policies for all the linear combinations of preferences assigned to the objectives in a single training procedure
457 |
458 | 
459 |
460 |
461 |
462 | ---
463 |
464 |
465 |
466 | # Important Directions of Recent Research on MORL
467 |
468 |
469 |
470 | ---
471 |
472 |
473 |
474 | ## Further Development of MORL Approaches
475 |
476 | > To obtain suitable representations of the preferences and improve the efficiency of MORL algorithms
477 |
478 | * Estimation of distribution algorithms (EDA)
479 | * Incorporate the notions in evolutionary MOO
480 | * Acquire various strategies by a single run
481 | * Learning classifier system
482 | * The choice of action-selection policies can greatly affect the performance of the learning system
483 | * α-domination strategy
484 | * Use a goal-directed bias based on the achievement level of each evaluation
485 | * Parallel genetic algorithm (PGA)
486 | * Evolve a neuro-controller
    * Simultaneous perturbation stochastic approximation (SPSA) was used to improve convergence
488 | * Adaptive margins
489 |
490 | > To obtain Pareto optimal policies in large or continuous spaces
491 |
492 | * Multi-agent framework
    * e.g., traffic signal control
494 | * Fitted Q-iteration (FQI)
495 | * Approximate the Pareto front
496 | * Consistency multi-objective dynamic programming
497 |
498 |
499 |
The small scale of previous MORL benchmark problems may not be enough to verify an algorithm's performance across a wide range of problem settings, and the algorithm implementations often require much prior knowledge about the problem domain
501 |
502 |