├── .config
├── README.md
├── baby-cr-5.14.patch
├── baby-dl-5.14.patch
├── baby-hrrn-5.14.patch
├── baby-mlfq-5.14.patch
├── baby-rr-5.14.patch
└── baby-vrt-5.14.patch

/README.md:
--------------------------------------------------------------------------------
1 | # Baby-CPU-Scheduler
2 | 
3 | This is a very basic and lightweight, yet very performant, CPU scheduler.
4 | You can use it for learning purposes as a baseline CPU scheduler on
5 | Linux. Note that many features are disabled to keep this scheduler as
6 | simple as possible.
7 | 
8 | Baby Scheduler is very lightweight and powerful for
9 | normal usage. I am using the deadline flavor (`baby-dl`) as my main scheduler, and sometimes I
10 | switch back to CacULE for testing. Throughput in Baby Scheduler is
11 | higher due to the task load balancer that I made for it. Load
12 | balancing is done by only one CPU, CPU0, which scans
13 | all other CPUs and moves one task on every tick. The balancing depends
14 | only on the number of tasks (i.e. no load, weight, or other factors).
15 | 
16 | 
17 | Baby Scheduler is only 1036 LOC, of which 254 LOC are dependent
18 | functions copied from CFS without any changes, just to let the
19 | scheduler compile and run. You can find all of Baby's code in
20 | `bs.c`, `bs.h`, and `fair_numa.h`.
21 | 
22 | ## Flavors
23 | Currently there are three flavors of Baby Scheduler:
24 | * Deadline Scheduling (dl) - main
25 | * Virtual Run Time Scheduling (vrt)
26 | * Round Robin Scheduling (rr)
27 | 
28 | All three flavors share the same task load-balancing method.
29 | They differ only in the strategy for picking the next task to run, plus some other minor differences.
30 | 
31 | ### Deadline Scheduling
32 | Baby's main flavor is deadline scheduling. The scheduler picks the task with the earliest deadline.
33 | A new task gets `deadline = now() + 1.180ms`. The task with the earliest deadline is picked and runs
34 | on the CPU until it sleeps or another task has an earlier deadline. While a task is running, its
35 | deadline is updated only once it has passed, compared with the current time (now). Basically,
36 | we do not want to update its deadline with `now() + 1.180ms` on every tick; otherwise we end up in
37 | what I call the horse-and-carrot situation. I also use the deadline as a slice: we do not want to
38 | keep preempting tasks whose deadlines are very close to each other. The best solution is to give the
39 | running task a minimum/maximum slice so that its next updated deadline is not too close to the
40 | competing task's. This saves some context switching. In any case, our deadline/slice is not too high; it is only 1.180ms.
41 | 
42 | ### Virtual Run Time Scheduling
43 | The scheduler picks the next task with the
44 | least vruntime; however, all of CFS's load/weight handling for task priority is
45 | replaced with a simpler mechanism. Task priorities are injected into
46 | vruntime: a NICE0-priority task has a vruntime = real_exec_time,
47 | while a NICE-20 task has a vruntime < real_exec_time, such that the NICE-20 task
48 | will run 20 more milliseconds than a NICE0 one in a race time of 40ms.
49 | See the equation in kernel/sched/bs.c:convert_to_vruntime.
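
For concreteness, below is a standalone C sketch of that conversion, modeled on `convert_to_vruntime()` as it appears in `kernel/sched/bs.c` of the hrrn patch further down in this repo (the vrt flavor refers to the same function). It assumes the `HZ_803` default that the patches introduce; the `main()` demo is illustrative only and is not part of the kernel code.

```c
#include <stdio.h>
#include <stdint.h>

/* Assumed tick rate: the patches make HZ_803 the default. */
#define HZ        803
#define HZ_PERIOD (1000000000ULL / HZ)     /* ns per tick (~1.245 ms)      */
#define RACE_TIME 40000000ULL              /* 40 ms "race" window, in ns   */
#define FACTOR    (RACE_TIME / HZ_PERIOD)  /* ~32 ticks per race window    */

/*
 * Userspace model of convert_to_vruntime(): a nice-0 task is charged its
 * real execution time, a negative-nice task is charged less, and a
 * positive-nice task is charged more.
 */
static uint64_t convert_to_vruntime(uint64_t delta_ns, int nice)
{
	int64_t prio_diff;

	if (nice == 0)
		return delta_ns;

	prio_diff = (int64_t)nice * 1000000;   /* nice * 1 ms, in ns          */
	prio_diff /= (int64_t)FACTOR;          /* spread over the race window */

	if ((int64_t)(delta_ns + prio_diff) < 0)
		return 1;

	return delta_ns + prio_diff;
}

int main(void)
{
	uint64_t vr0 = 0, vr20 = 0;

	/* Account one race window (~32 ticks) for a nice 0 and a nice -20 task. */
	for (uint64_t i = 0; i < FACTOR; i++) {
		vr0  += convert_to_vruntime(HZ_PERIOD, 0);
		vr20 += convert_to_vruntime(HZ_PERIOD, -20);
	}

	printf("nice   0 charged: %llu ns\n", (unsigned long long)vr0);
	printf("nice -20 charged: %llu ns\n", (unsigned long long)vr20);
	return 0;
}
```

Over the 40ms window the NICE-20 task accumulates roughly 20ms less vruntime than the NICE0 task, which is where the "20 more milliseconds ... in a race time of 40ms" above comes from.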
50 | 
51 | ### Round Robin Scheduling
52 | 
53 | ## Patch and Compile
54 | ### Patch
55 | First, you need to patch the kernel source with one of the flavors of Baby Scheduler. See an example of patching the Linux kernel [here](https://github.com/hamadmarri/cacule-cpu-scheduler#how-to-apply-the-patch). You can use the same method to patch with Baby instead of CacULE.
56 | 
57 | 1. Download the Linux kernel (https://www.kernel.org/) that is the same version as the patch (i.e. if the patch file name is baby-5.14.patch, then download https://cdn.kernel.org/pub/linux/kernel/v5.x/linux-5.14.9.tar.xz)
58 | 2. Extract the kernel source
59 | 3. Download the Baby patch file and place it inside the just-extracted kernel source folder
60 | 4. cd linux-(version)
61 | 5. patch -p1 < baby-5.14.patch
62 | 
63 | 
64 | ### Configure
65 | Baby uses periodic HZ only, and you need to disable other scheduler features such as `CGROUP_SCHED` and stat/debugging features.
66 | 
67 | Run `make menuconfig`
68 | and choose `CONFIG_HZ_PERIODIC`.
69 | 
70 | You should see this when you run `cat .config | grep -i hz`:
71 | 
72 | ```
73 | CONFIG_HZ_PERIODIC=y
74 | # CONFIG_NO_HZ_IDLE is not set
75 | # CONFIG_NO_HZ_FULL is not set
76 | # CONFIG_NO_HZ is not set
77 | ```
78 | 
79 | Then disable the following:
80 | * CONFIG_EXPERT
81 | * CONFIG_DEBUG_KERNEL
82 | * CONFIG_SCHED_DEBUG
83 | * CONFIG_SCHEDSTATS
84 | * NO_HZ
85 | * SCHED_AUTOGROUP
86 | * CGROUP_SCHED
87 | * UCLAMP_TASK
88 | * SCHED_CORE
89 | 
90 | 
91 | Make sure that `CONFIG_BS_SCHED` is selected (it appears at the top when running `make menuconfig`).
92 | Confirm by running `cat .config | grep -i bs_sched`:
93 | ```
94 | CONFIG_BS_SCHED=y
95 | ```
96 | 
97 | Now compile: `make` \
98 | Then install the modules: `sudo make modules_install` \
99 | Then install the kernel: `sudo make install` \
100 | Reboot and choose the new kernel.
101 | 
102 | To confirm that Baby is currently running:
103 | ```
104 | $ dmesg | grep "Baby CPU"
105 | Baby CPU scheduler (dl) v5.14 by Hamad Al Marri.
106 | ```
107 | 
--------------------------------------------------------------------------------
/baby-hrrn-5.14.patch:
--------------------------------------------------------------------------------
1 | diff --git a/include/linux/sched.h b/include/linux/sched.h
2 | index f6935787e7e8..2d668efc68f5 100644
3 | --- a/include/linux/sched.h
4 | +++ b/include/linux/sched.h
5 | @@ -462,6 +462,15 @@ struct sched_statistics {
6 | #endif
7 | };
8 | 
9 | +#ifdef CONFIG_BS_SCHED
10 | +struct bs_node {
11 | + struct bs_node* next;
12 | + struct bs_node* prev;
13 | + u64 hrrn_start_time;
14 | + u64 vruntime;
15 | +};
16 | +#endif
17 | +
18 | struct sched_entity {
19 | /* For load-balancing: */
20 | struct load_weight load;
21 | @@ -469,6 +478,10 @@ struct sched_entity {
22 | struct list_head group_node;
23 | unsigned int on_rq;
24 | 
25 | +#ifdef CONFIG_BS_SCHED
26 | + struct bs_node bs_node;
27 | +#endif
28 | +
29 | u64 exec_start;
30 | u64 sum_exec_runtime;
31 | u64 vruntime;
32 | diff --git a/init/Kconfig b/init/Kconfig
33 | index 55f9f7738ebb..6948dd798f43 100644
34 | --- a/init/Kconfig
35 | +++ b/init/Kconfig
36 | @@ -105,6 +105,34 @@ config THREAD_INFO_IN_TASK
37 | One subtle change that will be needed is to use try_get_task_stack()
38 | and put_task_stack() in save_thread_stack_tsk() and get_wchan().
39 | 
40 | +config BS_SCHED
41 | + bool "Baby Scheduler"
42 | + default y
43 | + select TICK_CPU_ACCOUNTING
44 | + select PREEMPT
45 | + help
46 | + This is a very basic and lightweight yet very performant CPU scheduler.
47 | + You can use it for learning purposes as a base ground CPU scheduler on
48 | + Linux. Notice that many features are disabled to make this scheduler as
49 | + simple as possible. Baby Scheduler is very lightweight and powerful for
50 | + normal usage. 
I am using it as my main scheduler and sometimes I 51 | + switch back to CacULE for testing. The throughput in Baby Scheduler is 52 | + higher due to the task loadbalancer that I made for Baby Scheduler. The 53 | + loadbalancing is done via only one CPU which is CPU0 in which CPU0 scan 54 | + all other CPUs and move one task in every tick. The balancing is only 55 | + depending on the number of tasks (i.e. no load, weight or other factors). 56 | + Baby scheduler is only 1036 LOC where 254 LOC of it are just dependent 57 | + functions that are copied from CFS without any changes to let the 58 | + scheduler compile and run. You can find all Baby code is reduced in 59 | + bs.c, bs.h, and numa_fair.h. Baby scheduler picks next task that has 60 | + least vruntime, however, all CFS load/weight for task priority are 61 | + replaced with a simpler mechanism. Tasks priorities are injected in 62 | + vruntime where NICE0 priority task has a vruntime = real_exec_time, 63 | + but NICE-20 task has a vruntime < real_exec_time in which NICE-20 task 64 | + will run 20 more milliseconds than NICE0 one in a race time of 40ms. 65 | + See the equation in kernel/sched/bs.c:convert_to_vruntime. 66 | + 67 | + 68 | menu "General setup" 69 | 70 | config BROKEN 71 | @@ -789,6 +817,7 @@ menu "Scheduler features" 72 | config UCLAMP_TASK 73 | bool "Enable utilization clamping for RT/FAIR tasks" 74 | depends on CPU_FREQ_GOV_SCHEDUTIL 75 | + depends on !BS_SCHED 76 | help 77 | This feature enables the scheduler to track the clamped utilization 78 | of each CPU based on RUNNABLE tasks scheduled on that CPU. 79 | @@ -954,6 +983,7 @@ config CGROUP_WRITEBACK 80 | 81 | menuconfig CGROUP_SCHED 82 | bool "CPU controller" 83 | + depends on !BS_SCHED 84 | default n 85 | help 86 | This feature lets CPU scheduler recognize task groups and control CPU 87 | @@ -1231,6 +1261,8 @@ config CHECKPOINT_RESTORE 88 | 89 | config SCHED_AUTOGROUP 90 | bool "Automatic process group scheduling" 91 | + default n 92 | + depends on !BS_SCHED 93 | select CGROUPS 94 | select CGROUP_SCHED 95 | select FAIR_GROUP_SCHED 96 | @@ -1420,6 +1452,7 @@ config BPF 97 | menuconfig EXPERT 98 | bool "Configure standard kernel features (expert users)" 99 | # Unhide debug options, to make the on-by-default options visible 100 | + depends on !BS_SCHED 101 | select DEBUG_KERNEL 102 | help 103 | This option allows certain base kernel options and settings 104 | diff --git a/kernel/Kconfig.hz b/kernel/Kconfig.hz 105 | index 38ef6d06888e..0897412321fa 100644 106 | --- a/kernel/Kconfig.hz 107 | +++ b/kernel/Kconfig.hz 108 | @@ -5,7 +5,7 @@ 109 | 110 | choice 111 | prompt "Timer frequency" 112 | - default HZ_250 113 | + default HZ_803 114 | help 115 | Allows the configuration of the timer frequency. It is customary 116 | to have the timer interrupt run at 1000 Hz but 100 Hz may be more 117 | @@ -40,6 +40,9 @@ choice 118 | on SMP and NUMA systems and exactly dividing by both PAL and 119 | NTSC frame rates for video and multimedia work. 
120 | 121 | + config HZ_803 122 | + bool "803 HZ" 123 | + 124 | config HZ_1000 125 | bool "1000 HZ" 126 | help 127 | @@ -53,6 +56,7 @@ config HZ 128 | default 100 if HZ_100 129 | default 250 if HZ_250 130 | default 300 if HZ_300 131 | + default 803 if HZ_803 132 | default 1000 if HZ_1000 133 | 134 | config SCHED_HRTICK 135 | diff --git a/kernel/Kconfig.preempt b/kernel/Kconfig.preempt 136 | index 5876e30c5740..f5a195470121 100644 137 | --- a/kernel/Kconfig.preempt 138 | +++ b/kernel/Kconfig.preempt 139 | @@ -2,7 +2,7 @@ 140 | 141 | choice 142 | prompt "Preemption Model" 143 | - default PREEMPT_NONE 144 | + default PREEMPT 145 | 146 | config PREEMPT_NONE 147 | bool "No Forced Preemption (Server)" 148 | @@ -103,6 +103,7 @@ config PREEMPT_DYNAMIC 149 | config SCHED_CORE 150 | bool "Core Scheduling for SMT" 151 | depends on SCHED_SMT 152 | + depends on !BS_SCHED 153 | help 154 | This option permits Core Scheduling, a means of coordinated task 155 | selection across SMT siblings. When enabled -- see 156 | diff --git a/kernel/sched/Makefile b/kernel/sched/Makefile 157 | index 978fcfca5871..464b134de739 100644 158 | --- a/kernel/sched/Makefile 159 | +++ b/kernel/sched/Makefile 160 | @@ -23,7 +23,7 @@ CFLAGS_core.o := $(PROFILING) -fno-omit-frame-pointer 161 | endif 162 | 163 | obj-y += core.o loadavg.o clock.o cputime.o 164 | -obj-y += idle.o fair.o rt.o deadline.o 165 | +obj-y += idle.o bs.o rt.o deadline.o 166 | obj-y += wait.o wait_bit.o swait.o completion.o 167 | 168 | obj-$(CONFIG_SMP) += cpupri.o cpudeadline.o topology.o stop_task.o pelt.o 169 | diff --git a/kernel/sched/bs.c b/kernel/sched/bs.c 170 | new file mode 100644 171 | index 000000000000..0dcd34ede851 172 | --- /dev/null 173 | +++ b/kernel/sched/bs.c 174 | @@ -0,0 +1,777 @@ 175 | +// SPDX-License-Identifier: GPL-2.0 176 | +/* 177 | + * Basic Scheduler (BS) Class (SCHED_NORMAL/SCHED_BATCH) 178 | + * 179 | + * Copyright (C) 2021, Hamad Al Marri 180 | + */ 181 | +#include "sched.h" 182 | +#include "pelt.h" 183 | +#include "fair_numa.h" 184 | +#include "bs.h" 185 | + 186 | +u64 sched_granularity = 590000ULL; 187 | + 188 | +#define HZ_PERIOD (1000000000 / HZ) 189 | +#define RACE_TIME 40000000 190 | +#define FACTOR (RACE_TIME / HZ_PERIOD) 191 | + 192 | +#define YIELD_MARK(bsn) ((bsn)->vruntime |= 0x8000000000000000ULL) 193 | +#define YIELD_UNMARK(bsn) ((bsn)->vruntime &= 0x7FFFFFFFFFFFFFFFULL) 194 | + 195 | +#define HRRN_MAX_LIFE_NS 5000000000ULL 196 | + 197 | +static u64 convert_to_vruntime(u64 delta, struct sched_entity *se) 198 | +{ 199 | + struct task_struct *p = task_of(se); 200 | + s64 prio_diff; 201 | + 202 | + if (PRIO_TO_NICE(p->prio) == 0) 203 | + return delta; 204 | + 205 | + prio_diff = PRIO_TO_NICE(p->prio) * 1000000; 206 | + prio_diff /= FACTOR; 207 | + 208 | + if ((s64)(delta + prio_diff) < 0) 209 | + return 1; 210 | + 211 | + return delta + prio_diff; 212 | +} 213 | + 214 | +static inline void normalize_lifetime(u64 now, struct bs_node *bsn) 215 | +{ 216 | + u64 life_time, old_hrrn_x; 217 | + s64 diff; 218 | + 219 | + life_time = now - bsn->hrrn_start_time; 220 | + diff = life_time - HRRN_MAX_LIFE_NS; 221 | + 222 | + if (diff > 0) { 223 | + // unmark YIELD. 
No need to check or remark since 224 | + // this normalize action doesn't happen very often 225 | + YIELD_UNMARK(bsn); 226 | + 227 | + // multiply life_time by 1024 for more precision 228 | + old_hrrn_x = (life_time << 7) / ((bsn->vruntime >> 3) | 1); 229 | + 230 | + // reset life to half max_life (i.e ~2.5s) 231 | + bsn->hrrn_start_time = now - (HRRN_MAX_LIFE_NS >> 1); 232 | + 233 | + // avoid division by zero 234 | + if (old_hrrn_x == 0) old_hrrn_x = 1; 235 | + 236 | + // reset vruntime based on old hrrn ratio 237 | + bsn->vruntime = (HRRN_MAX_LIFE_NS << 9) / old_hrrn_x; 238 | + } 239 | +} 240 | + 241 | +static void update_curr(struct cfs_rq *cfs_rq) 242 | +{ 243 | + struct sched_entity *curr = cfs_rq->curr; 244 | + u64 now = sched_clock(); 245 | + u64 delta_exec; 246 | + 247 | + if (unlikely(!curr)) 248 | + return; 249 | + 250 | + delta_exec = now - curr->exec_start; 251 | + if (unlikely((s64)delta_exec <= 0)) 252 | + return; 253 | + 254 | + curr->exec_start = now; 255 | + curr->sum_exec_runtime += delta_exec; 256 | + 257 | + curr->bs_node.vruntime += convert_to_vruntime(delta_exec, curr); 258 | + normalize_lifetime(now, &curr->bs_node); 259 | +} 260 | + 261 | +static void update_curr_fair(struct rq *rq) 262 | +{ 263 | + update_curr(cfs_rq_of(&rq->curr->se)); 264 | +} 265 | + 266 | +static inline u64 calc_hrrn(u64 now, struct bs_node *bsn) 267 | +{ 268 | + u64 l = now - bsn->hrrn_start_time; 269 | + u64 r = bsn->vruntime | 1; 270 | + 271 | + return l / r; 272 | +} 273 | + 274 | +/** 275 | + * Does a have smaller vruntime than b? 276 | + */ 277 | +static inline bool 278 | +entity_before(struct bs_node *a, struct bs_node *b) 279 | +{ 280 | + u64 a_hrrn = calc_hrrn(sched_clock(), a); 281 | + u64 b_hrrn = calc_hrrn(sched_clock(), b); 282 | + 283 | + return (s64)(a_hrrn - b_hrrn) > 0; 284 | +} 285 | + 286 | +static void __enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se) 287 | +{ 288 | + struct bs_node *bsn = &se->bs_node; 289 | + 290 | + bsn->next = bsn->prev = NULL; 291 | + 292 | + // if empty 293 | + if (!cfs_rq->head) { 294 | + cfs_rq->head = bsn; 295 | + } 296 | + else { 297 | + bsn->next = cfs_rq->head; 298 | + cfs_rq->head->prev = bsn; 299 | + cfs_rq->head = bsn; 300 | + } 301 | +} 302 | + 303 | +static void __dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se) 304 | +{ 305 | + struct bs_node *bsn = &se->bs_node; 306 | + struct bs_node *prev, *next; 307 | + 308 | + // if only one se in rq 309 | + if (cfs_rq->head->next == NULL) { 310 | + cfs_rq->head = NULL; 311 | + } 312 | + // if it is the head 313 | + else if (bsn == cfs_rq->head) { 314 | + cfs_rq->head = cfs_rq->head->next; 315 | + cfs_rq->head->prev = NULL; 316 | + } 317 | + // if in the middle 318 | + else { 319 | + prev = bsn->prev; 320 | + next = bsn->next; 321 | + 322 | + prev->next = next; 323 | + if (next) 324 | + next->prev = prev; 325 | + } 326 | +} 327 | + 328 | +static inline void 329 | +enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags) 330 | +{ 331 | + bool curr = cfs_rq->curr == se; 332 | + 333 | + update_curr(cfs_rq); 334 | + 335 | + account_entity_enqueue(cfs_rq, se); 336 | + 337 | + if (!curr) 338 | + __enqueue_entity(cfs_rq, se); 339 | + 340 | + se->on_rq = 1; 341 | +} 342 | + 343 | +static inline void 344 | +dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags) 345 | +{ 346 | + update_curr(cfs_rq); 347 | + 348 | + if (se != cfs_rq->curr) 349 | + __dequeue_entity(cfs_rq, se); 350 | + 351 | + se->on_rq = 0; 352 | + account_entity_dequeue(cfs_rq, 
se); 353 | +} 354 | + 355 | +static void 356 | +enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags) 357 | +{ 358 | + struct sched_entity *se = &p->se; 359 | + struct cfs_rq *cfs_rq = cfs_rq_of(se); 360 | + int idle_h_nr_running = task_has_idle_policy(p); 361 | + 362 | + if (!se->on_rq) { 363 | + enqueue_entity(cfs_rq, se, flags); 364 | + cfs_rq->h_nr_running++; 365 | + cfs_rq->idle_h_nr_running += idle_h_nr_running; 366 | + } 367 | + 368 | + add_nr_running(rq, 1); 369 | +} 370 | + 371 | +static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags) 372 | +{ 373 | + struct sched_entity *se = &p->se; 374 | + struct cfs_rq *cfs_rq = cfs_rq_of(se); 375 | + int idle_h_nr_running = task_has_idle_policy(p); 376 | + 377 | + dequeue_entity(cfs_rq, se, flags); 378 | + 379 | + cfs_rq->h_nr_running--; 380 | + cfs_rq->idle_h_nr_running -= idle_h_nr_running; 381 | + 382 | + sub_nr_running(rq, 1); 383 | +} 384 | + 385 | +static void yield_task_fair(struct rq *rq) 386 | +{ 387 | + struct task_struct *curr = rq->curr; 388 | + struct cfs_rq *cfs_rq = task_cfs_rq(curr); 389 | + 390 | + YIELD_MARK(&curr->se.bs_node); 391 | + 392 | + /* 393 | + * Are we the only task in the tree? 394 | + */ 395 | + if (unlikely(rq->nr_running == 1)) 396 | + return; 397 | + 398 | + if (curr->policy != SCHED_BATCH) { 399 | + update_rq_clock(rq); 400 | + /* 401 | + * Update run-time statistics of the 'current'. 402 | + */ 403 | + update_curr(cfs_rq); 404 | + /* 405 | + * Tell update_rq_clock() that we've just updated, 406 | + * so we don't do microscopic update in schedule() 407 | + * and double the fastpath cost. 408 | + */ 409 | + rq_clock_skip_update(rq); 410 | + } 411 | +} 412 | + 413 | +static bool yield_to_task_fair(struct rq *rq, struct task_struct *p) 414 | +{ 415 | + yield_task_fair(rq); 416 | + return true; 417 | +} 418 | + 419 | +static void 420 | +set_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *se) 421 | +{ 422 | + if (se->on_rq) 423 | + __dequeue_entity(cfs_rq, se); 424 | + 425 | + se->exec_start = sched_clock(); 426 | + cfs_rq->curr = se; 427 | + se->prev_sum_exec_runtime = se->sum_exec_runtime; 428 | +} 429 | + 430 | +static struct sched_entity * 431 | +pick_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr) 432 | +{ 433 | + struct bs_node *bsn = cfs_rq->head; 434 | + struct bs_node *next; 435 | + 436 | + if (!bsn) 437 | + return curr; 438 | + 439 | + next = bsn->next; 440 | + while (next) { 441 | + if (entity_before(next, bsn)) 442 | + bsn = next; 443 | + 444 | + next = next->next; 445 | + } 446 | + 447 | + if (curr && entity_before(&curr->bs_node, bsn)) 448 | + return curr; 449 | + 450 | + return se_of(bsn); 451 | +} 452 | + 453 | +struct task_struct * 454 | +pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf) 455 | +{ 456 | + struct cfs_rq *cfs_rq = &rq->cfs; 457 | + struct sched_entity *se; 458 | + struct task_struct *p; 459 | + int new_tasks; 460 | + 461 | +again: 462 | + if (!sched_fair_runnable(rq)) 463 | + goto idle; 464 | + 465 | + if (prev) 466 | + put_prev_task(rq, prev); 467 | + 468 | + se = pick_next_entity(cfs_rq, NULL); 469 | + set_next_entity(cfs_rq, se); 470 | + 471 | + p = task_of(se); 472 | + 473 | + if (prev) 474 | + YIELD_UNMARK(&prev->se.bs_node); 475 | + 476 | +done: __maybe_unused; 477 | +#ifdef CONFIG_SMP 478 | + /* 479 | + * Move the next running task to the front of 480 | + * the list, so our cfs_tasks list becomes MRU 481 | + * one. 
482 | + */ 483 | + list_move(&p->se.group_node, &rq->cfs_tasks); 484 | +#endif 485 | + 486 | + return p; 487 | + 488 | +idle: 489 | + if (!rf) 490 | + return NULL; 491 | + 492 | + new_tasks = newidle_balance(rq, rf); 493 | + 494 | + /* 495 | + * Because newidle_balance() releases (and re-acquires) rq->lock, it is 496 | + * possible for any higher priority task to appear. In that case we 497 | + * must re-start the pick_next_entity() loop. 498 | + */ 499 | + if (new_tasks < 0) 500 | + return RETRY_TASK; 501 | + 502 | + if (new_tasks > 0) 503 | + goto again; 504 | + 505 | + /* 506 | + * rq is about to be idle, check if we need to update the 507 | + * lost_idle_time of clock_pelt 508 | + */ 509 | + update_idle_rq_clock_pelt(rq); 510 | + 511 | + return NULL; 512 | +} 513 | + 514 | +static struct task_struct *__pick_next_task_fair(struct rq *rq) 515 | +{ 516 | + return pick_next_task_fair(rq, NULL, NULL); 517 | +} 518 | + 519 | +#ifdef CONFIG_SMP 520 | +static struct task_struct *pick_task_fair(struct rq *rq) 521 | +{ 522 | + struct sched_entity *se; 523 | + struct cfs_rq *cfs_rq = &rq->cfs; 524 | + struct sched_entity *curr = cfs_rq->curr; 525 | + 526 | + if (!cfs_rq->nr_running) 527 | + return NULL; 528 | + 529 | + /* When we pick for a remote RQ, we'll not have done put_prev_entity() */ 530 | + if (curr) { 531 | + if (curr->on_rq) 532 | + update_curr(cfs_rq); 533 | + else 534 | + curr = NULL; 535 | + } 536 | + 537 | + se = pick_next_entity(cfs_rq, curr); 538 | + 539 | + return task_of(se); 540 | +} 541 | +#endif 542 | + 543 | +static void put_prev_entity(struct cfs_rq *cfs_rq, struct sched_entity *prev) 544 | +{ 545 | + /* 546 | + * If still on the runqueue then deactivate_task() 547 | + * was not called and update_curr() has to be done: 548 | + */ 549 | + if (prev->on_rq) 550 | + update_curr(cfs_rq); 551 | + 552 | + if (prev->on_rq) 553 | + __enqueue_entity(cfs_rq, prev); 554 | + 555 | + cfs_rq->curr = NULL; 556 | +} 557 | + 558 | +static void put_prev_task_fair(struct rq *rq, struct task_struct *prev) 559 | +{ 560 | + struct sched_entity *se = &prev->se; 561 | + 562 | + put_prev_entity(cfs_rq_of(se), se); 563 | +} 564 | + 565 | +static void set_next_task_fair(struct rq *rq, struct task_struct *p, bool first) 566 | +{ 567 | + struct sched_entity *se = &p->se; 568 | + struct cfs_rq *cfs_rq = cfs_rq_of(se); 569 | + 570 | +#ifdef CONFIG_SMP 571 | + if (task_on_rq_queued(p)) { 572 | + /* 573 | + * Move the next running task to the front of the list, so our 574 | + * cfs_tasks list becomes MRU one. 575 | + */ 576 | + list_move(&se->group_node, &rq->cfs_tasks); 577 | + } 578 | +#endif 579 | + 580 | + set_next_entity(cfs_rq, se); 581 | +} 582 | + 583 | +static void 584 | +check_preempt_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr) 585 | +{ 586 | + if (pick_next_entity(cfs_rq, curr) != curr) 587 | + resched_curr(rq_of(cfs_rq)); 588 | +} 589 | + 590 | +static void 591 | +entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr, int queued) 592 | +{ 593 | + update_curr(cfs_rq); 594 | + 595 | + if (cfs_rq->nr_running > 1) 596 | + check_preempt_tick(cfs_rq, curr); 597 | +} 598 | + 599 | +static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_flags) 600 | +{ 601 | + struct task_struct *curr = rq->curr; 602 | + struct sched_entity *se = &curr->se, *wse = &p->se; 603 | + 604 | + if (unlikely(se == wse)) 605 | + return; 606 | + 607 | + if (test_tsk_need_resched(curr)) 608 | + return; 609 | + 610 | + /* Idle tasks are by definition preempted by non-idle tasks. 
*/ 611 | + if (unlikely(task_has_idle_policy(curr)) && 612 | + likely(!task_has_idle_policy(p))) 613 | + goto preempt; 614 | + 615 | + /* 616 | + * Batch and idle tasks do not preempt non-idle tasks (their preemption 617 | + * is driven by the tick): 618 | + */ 619 | + if (unlikely(p->policy != SCHED_NORMAL) || !sched_feat(WAKEUP_PREEMPTION)) 620 | + return; 621 | + 622 | + /* 623 | + * Lower priority tasks do not preempt higher ones 624 | + */ 625 | + if (p->prio > curr->prio) 626 | + return; 627 | + 628 | + update_curr(cfs_rq_of(se)); 629 | + 630 | + if (entity_before(&wse->bs_node, &se->bs_node)) 631 | + goto preempt; 632 | + 633 | + return; 634 | + 635 | +preempt: 636 | + resched_curr(rq); 637 | +} 638 | + 639 | +#ifdef CONFIG_SMP 640 | +static int 641 | +balance_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf) 642 | +{ 643 | + if (rq->nr_running) 644 | + return 1; 645 | + 646 | + return newidle_balance(rq, rf) != 0; 647 | +} 648 | + 649 | +static int 650 | +select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags) 651 | +{ 652 | + struct rq *rq = cpu_rq(prev_cpu); 653 | + unsigned int min_this = rq->nr_running; 654 | + unsigned int min = rq->nr_running; 655 | + int cpu, new_cpu = prev_cpu; 656 | + 657 | + for_each_online_cpu(cpu) { 658 | + if (cpu_rq(cpu)->nr_running < min) { 659 | + new_cpu = cpu; 660 | + min = cpu_rq(cpu)->nr_running; 661 | + } 662 | + } 663 | + 664 | + if (min == min_this) 665 | + return prev_cpu; 666 | + 667 | + return new_cpu; 668 | +} 669 | + 670 | +static int 671 | +can_migrate_task(struct task_struct *p, int dst_cpu, struct rq *src_rq) 672 | +{ 673 | + if (task_running(src_rq, p)) 674 | + return 0; 675 | + 676 | + /* Disregard pcpu kthreads; they are where they need to be. */ 677 | + if (kthread_is_per_cpu(p)) 678 | + return 0; 679 | + 680 | + if (!cpumask_test_cpu(dst_cpu, p->cpus_ptr)) 681 | + return 0; 682 | + 683 | + return 1; 684 | +} 685 | + 686 | +static void pull_from(struct rq *this_rq, 687 | + struct rq *src_rq, 688 | + struct rq_flags *src_rf, 689 | + struct task_struct *p) 690 | +{ 691 | + struct rq_flags rf; 692 | + 693 | + // detach task 694 | + deactivate_task(src_rq, p, DEQUEUE_NOCLOCK); 695 | + set_task_cpu(p, cpu_of(this_rq)); 696 | + 697 | + // unlock src rq 698 | + rq_unlock(src_rq, src_rf); 699 | + 700 | + // lock this rq 701 | + rq_lock(this_rq, &rf); 702 | + update_rq_clock(this_rq); 703 | + 704 | + activate_task(this_rq, p, ENQUEUE_NOCLOCK); 705 | + check_preempt_curr(this_rq, p, 0); 706 | + 707 | + // unlock this rq 708 | + rq_unlock(this_rq, &rf); 709 | + 710 | + local_irq_restore(src_rf->flags); 711 | +} 712 | + 713 | +static int move_task(struct rq *this_rq, struct rq *src_rq, 714 | + struct rq_flags *src_rf) 715 | +{ 716 | + struct cfs_rq *src_cfs_rq = &src_rq->cfs; 717 | + struct task_struct *p; 718 | + struct bs_node *bsn = src_cfs_rq->head; 719 | + int moved = 0; 720 | + 721 | + while (bsn) { 722 | + p = task_of(se_of(bsn)); 723 | + if (can_migrate_task(p, cpu_of(this_rq), src_rq)) { 724 | + pull_from(this_rq, src_rq, src_rf, p); 725 | + moved = 1; 726 | + break; 727 | + } 728 | + 729 | + bsn = bsn->next; 730 | + } 731 | + 732 | + if (!moved) { 733 | + rq_unlock(src_rq, src_rf); 734 | + local_irq_restore(src_rf->flags); 735 | + } 736 | + 737 | + return moved; 738 | +} 739 | + 740 | +static int newidle_balance(struct rq *this_rq, struct rq_flags *rf) 741 | +{ 742 | + int this_cpu = this_rq->cpu; 743 | + struct rq *src_rq; 744 | + int src_cpu = -1, cpu; 745 | + int pulled_task = 0; 746 | + unsigned 
int max = 0; 747 | + struct rq_flags src_rf; 748 | + 749 | + /* 750 | + * We must set idle_stamp _before_ calling idle_balance(), such that we 751 | + * measure the duration of idle_balance() as idle time. 752 | + */ 753 | + this_rq->idle_stamp = rq_clock(this_rq); 754 | + 755 | + /* 756 | + * Do not pull tasks towards !active CPUs... 757 | + */ 758 | + if (!cpu_active(this_cpu)) 759 | + return 0; 760 | + 761 | + rq_unpin_lock(this_rq, rf); 762 | + raw_spin_unlock(&this_rq->__lock); 763 | + 764 | + for_each_online_cpu(cpu) { 765 | + /* 766 | + * Stop searching for tasks to pull if there are 767 | + * now runnable tasks on this rq. 768 | + */ 769 | + if (this_rq->nr_running > 0) 770 | + goto out; 771 | + 772 | + if (cpu == this_cpu) 773 | + continue; 774 | + 775 | + src_rq = cpu_rq(cpu); 776 | + 777 | + if (src_rq->nr_running < 2) 778 | + continue; 779 | + 780 | + if (src_rq->nr_running > max) { 781 | + max = src_rq->nr_running; 782 | + src_cpu = cpu; 783 | + } 784 | + } 785 | + 786 | + if (src_cpu != -1) { 787 | + src_rq = cpu_rq(src_cpu); 788 | + 789 | + rq_lock_irqsave(src_rq, &src_rf); 790 | + update_rq_clock(src_rq); 791 | + 792 | + if (src_rq->nr_running < 2) { 793 | + rq_unlock(src_rq, &src_rf); 794 | + local_irq_restore(src_rf.flags); 795 | + } else { 796 | + pulled_task = move_task(this_rq, src_rq, &src_rf); 797 | + } 798 | + } 799 | + 800 | +out: 801 | + raw_spin_lock(&this_rq->__lock); 802 | + 803 | + /* 804 | + * While browsing the domains, we released the rq lock, a task could 805 | + * have been enqueued in the meantime. Since we're not going idle, 806 | + * pretend we pulled a task. 807 | + */ 808 | + if (this_rq->cfs.h_nr_running && !pulled_task) 809 | + pulled_task = 1; 810 | + 811 | + /* Is there a task of a high priority class? */ 812 | + if (this_rq->nr_running != this_rq->cfs.h_nr_running) 813 | + pulled_task = -1; 814 | + 815 | + if (pulled_task) 816 | + this_rq->idle_stamp = 0; 817 | + 818 | + rq_repin_lock(this_rq, rf); 819 | + 820 | + return pulled_task; 821 | +} 822 | + 823 | +static inline int on_null_domain(struct rq *rq) 824 | +{ 825 | + return unlikely(!rcu_dereference_sched(rq->sd)); 826 | +} 827 | + 828 | +void trigger_load_balance(struct rq *this_rq) 829 | +{ 830 | + int this_cpu = cpu_of(this_rq); 831 | + int cpu; 832 | + unsigned int max, min; 833 | + struct rq *max_rq, *min_rq, *c_rq; 834 | + struct rq_flags src_rf; 835 | + 836 | + if (this_cpu != 0) 837 | + return; 838 | + 839 | + max = min = this_rq->nr_running; 840 | + max_rq = min_rq = this_rq; 841 | + 842 | + for_each_online_cpu(cpu) { 843 | + c_rq = cpu_rq(cpu); 844 | + 845 | + /* 846 | + * Don't need to rebalance while attached to NULL domain or 847 | + * runqueue CPU is not active 848 | + */ 849 | + if (unlikely(on_null_domain(c_rq) || !cpu_active(cpu))) 850 | + continue; 851 | + 852 | + if (c_rq->nr_running < min) { 853 | + min = c_rq->nr_running; 854 | + min_rq = c_rq; 855 | + } 856 | + 857 | + if (c_rq->nr_running > max) { 858 | + max = c_rq->nr_running; 859 | + max_rq = c_rq; 860 | + } 861 | + } 862 | + 863 | + if (min_rq == max_rq || max - min < 2) 864 | + return; 865 | + 866 | + rq_lock_irqsave(max_rq, &src_rf); 867 | + update_rq_clock(max_rq); 868 | + 869 | + if (max_rq->nr_running < 2) { 870 | + rq_unlock(max_rq, &src_rf); 871 | + local_irq_restore(src_rf.flags); 872 | + return; 873 | + } 874 | + 875 | + move_task(min_rq, max_rq, &src_rf); 876 | +} 877 | + 878 | +void update_group_capacity(struct sched_domain *sd, int cpu) {} 879 | +#endif /* CONFIG_SMP */ 880 | + 881 | +static void 
task_tick_fair(struct rq *rq, struct task_struct *curr, int queued) 882 | +{ 883 | + struct sched_entity *se = &curr->se; 884 | + struct cfs_rq *cfs_rq = cfs_rq_of(se); 885 | + 886 | + entity_tick(cfs_rq, se, queued); 887 | + 888 | + if (static_branch_unlikely(&sched_numa_balancing)) 889 | + task_tick_numa(rq, curr); 890 | +} 891 | + 892 | +static void task_fork_fair(struct task_struct *p) 893 | +{ 894 | + struct cfs_rq *cfs_rq; 895 | + struct sched_entity *curr; 896 | + struct rq *rq = this_rq(); 897 | + struct rq_flags rf; 898 | + 899 | + p->se.bs_node.vruntime = 0; 900 | + 901 | + rq_lock(rq, &rf); 902 | + update_rq_clock(rq); 903 | + 904 | + cfs_rq = task_cfs_rq(current); 905 | + curr = cfs_rq->curr; 906 | + if (curr) 907 | + update_curr(cfs_rq); 908 | + 909 | + rq_unlock(rq, &rf); 910 | +} 911 | + 912 | +/* 913 | + * All the scheduling class methods: 914 | + */ 915 | +DEFINE_SCHED_CLASS(fair) = { 916 | + 917 | + .enqueue_task = enqueue_task_fair, 918 | + .dequeue_task = dequeue_task_fair, 919 | + .yield_task = yield_task_fair, 920 | + .yield_to_task = yield_to_task_fair, 921 | + 922 | + .check_preempt_curr = check_preempt_wakeup, 923 | + 924 | + .pick_next_task = __pick_next_task_fair, 925 | + .put_prev_task = put_prev_task_fair, 926 | + .set_next_task = set_next_task_fair, 927 | + 928 | +#ifdef CONFIG_SMP 929 | + .balance = balance_fair, 930 | + .pick_task = pick_task_fair, 931 | + .select_task_rq = select_task_rq_fair, 932 | + .migrate_task_rq = migrate_task_rq_fair, 933 | + 934 | + .rq_online = rq_online_fair, 935 | + .rq_offline = rq_offline_fair, 936 | + 937 | + .task_dead = task_dead_fair, 938 | + .set_cpus_allowed = set_cpus_allowed_common, 939 | +#endif 940 | + 941 | + .task_tick = task_tick_fair, 942 | + .task_fork = task_fork_fair, 943 | + 944 | + .prio_changed = prio_changed_fair, 945 | + .switched_from = switched_from_fair, 946 | + .switched_to = switched_to_fair, 947 | + 948 | + .get_rr_interval = get_rr_interval_fair, 949 | + 950 | + .update_curr = update_curr_fair, 951 | +}; 952 | diff --git a/kernel/sched/bs.h b/kernel/sched/bs.h 953 | new file mode 100644 954 | index 000000000000..e0466c0e6ec3 955 | --- /dev/null 956 | +++ b/kernel/sched/bs.h 957 | @@ -0,0 +1,149 @@ 958 | + 959 | +/* 960 | + * After fork, child runs first. If set to 0 (default) then 961 | + * parent will (try to) run first. 
962 | + */ 963 | +unsigned int sysctl_sched_child_runs_first __read_mostly; 964 | + 965 | +const_debug unsigned int sysctl_sched_migration_cost = 500000UL; 966 | + 967 | +void __init sched_init_granularity(void) {} 968 | + 969 | +#ifdef CONFIG_SMP 970 | +/* Give new sched_entity start runnable values to heavy its load in infant time */ 971 | +void init_entity_runnable_average(struct sched_entity *se) {} 972 | +void post_init_entity_util_avg(struct task_struct *p) {} 973 | +void update_max_interval(void) {} 974 | +static int newidle_balance(struct rq *this_rq, struct rq_flags *rf); 975 | + 976 | +static void migrate_task_rq_fair(struct task_struct *p, int new_cpu) 977 | +{ 978 | + update_scan_period(p, new_cpu); 979 | +} 980 | + 981 | +static void rq_online_fair(struct rq *rq) {} 982 | +static void rq_offline_fair(struct rq *rq) {} 983 | +static void task_dead_fair(struct task_struct *p) 984 | +{ 985 | + struct cfs_rq *cfs_rq = cfs_rq_of(&p->se); 986 | + unsigned long flags; 987 | + 988 | + raw_spin_lock_irqsave(&cfs_rq->removed.lock, flags); 989 | + ++cfs_rq->removed.nr; 990 | + raw_spin_unlock_irqrestore(&cfs_rq->removed.lock, flags); 991 | +} 992 | + 993 | +#endif /** CONFIG_SMP */ 994 | + 995 | +void init_cfs_rq(struct cfs_rq *cfs_rq) 996 | +{ 997 | + cfs_rq->tasks_timeline = RB_ROOT_CACHED; 998 | +#ifdef CONFIG_SMP 999 | + raw_spin_lock_init(&cfs_rq->removed.lock); 1000 | +#endif 1001 | +} 1002 | + 1003 | +__init void init_sched_fair_class(void) {} 1004 | + 1005 | +void reweight_task(struct task_struct *p, int prio) {} 1006 | + 1007 | +static inline struct sched_entity *se_of(struct bs_node *bsn) 1008 | +{ 1009 | + return container_of(bsn, struct sched_entity, bs_node); 1010 | +} 1011 | + 1012 | +#ifdef CONFIG_SCHED_SMT 1013 | +DEFINE_STATIC_KEY_FALSE(sched_smt_present); 1014 | +EXPORT_SYMBOL_GPL(sched_smt_present); 1015 | + 1016 | +static inline void set_idle_cores(int cpu, int val) 1017 | +{ 1018 | + struct sched_domain_shared *sds; 1019 | + 1020 | + sds = rcu_dereference(per_cpu(sd_llc_shared, cpu)); 1021 | + if (sds) 1022 | + WRITE_ONCE(sds->has_idle_cores, val); 1023 | +} 1024 | + 1025 | +static inline bool test_idle_cores(int cpu, bool def) 1026 | +{ 1027 | + struct sched_domain_shared *sds; 1028 | + 1029 | + sds = rcu_dereference(per_cpu(sd_llc_shared, cpu)); 1030 | + if (sds) 1031 | + return READ_ONCE(sds->has_idle_cores); 1032 | + 1033 | + return def; 1034 | +} 1035 | + 1036 | +void __update_idle_core(struct rq *rq) 1037 | +{ 1038 | + int core = cpu_of(rq); 1039 | + int cpu; 1040 | + 1041 | + rcu_read_lock(); 1042 | + if (test_idle_cores(core, true)) 1043 | + goto unlock; 1044 | + 1045 | + for_each_cpu(cpu, cpu_smt_mask(core)) { 1046 | + if (cpu == core) 1047 | + continue; 1048 | + 1049 | + if (!available_idle_cpu(cpu)) 1050 | + goto unlock; 1051 | + } 1052 | + 1053 | + set_idle_cores(core, 1); 1054 | +unlock: 1055 | + rcu_read_unlock(); 1056 | +} 1057 | +#endif 1058 | + 1059 | +static void 1060 | +account_entity_enqueue(struct cfs_rq *cfs_rq, struct sched_entity *se) 1061 | +{ 1062 | +#ifdef CONFIG_SMP 1063 | + if (entity_is_task(se)) { 1064 | + struct rq *rq = rq_of(cfs_rq); 1065 | + 1066 | + account_numa_enqueue(rq, task_of(se)); 1067 | + list_add(&se->group_node, &rq->cfs_tasks); 1068 | + } 1069 | +#endif 1070 | + cfs_rq->nr_running++; 1071 | +} 1072 | + 1073 | +static void 1074 | +account_entity_dequeue(struct cfs_rq *cfs_rq, struct sched_entity *se) 1075 | +{ 1076 | +#ifdef CONFIG_SMP 1077 | + if (entity_is_task(se)) { 1078 | + account_numa_dequeue(rq_of(cfs_rq), 
task_of(se)); 1079 | + list_del_init(&se->group_node); 1080 | + } 1081 | +#endif 1082 | + cfs_rq->nr_running--; 1083 | +} 1084 | + 1085 | + 1086 | +static void 1087 | +prio_changed_fair(struct rq *rq, struct task_struct *p, int oldprio) {} 1088 | + 1089 | +static void switched_from_fair(struct rq *rq, struct task_struct *p) {} 1090 | + 1091 | +static void switched_to_fair(struct rq *rq, struct task_struct *p) 1092 | +{ 1093 | + if (task_on_rq_queued(p)) { 1094 | + /* 1095 | + * We were most likely switched from sched_rt, so 1096 | + * kick off the schedule if running, otherwise just see 1097 | + * if we can still preempt the current task. 1098 | + */ 1099 | + resched_curr(rq); 1100 | + } 1101 | +} 1102 | + 1103 | +static unsigned int get_rr_interval_fair(struct rq *rq, struct task_struct *task) 1104 | +{ 1105 | + return 0; 1106 | +} 1107 | diff --git a/kernel/sched/core.c b/kernel/sched/core.c 1108 | index 399c37c95392..4566e232d9c3 100644 1109 | --- a/kernel/sched/core.c 1110 | +++ b/kernel/sched/core.c 1111 | @@ -4228,6 +4228,10 @@ void wake_up_new_task(struct task_struct *p) 1112 | update_rq_clock(rq); 1113 | post_init_entity_util_avg(p); 1114 | 1115 | +#ifdef CONFIG_BS_SCHED 1116 | + p->se.bs_node.hrrn_start_time = sched_clock(); 1117 | +#endif 1118 | + 1119 | activate_task(rq, p, ENQUEUE_NOCLOCK); 1120 | trace_sched_wakeup_new(p); 1121 | check_preempt_curr(rq, p, WF_FORK); 1122 | @@ -8995,6 +8999,10 @@ void __init sched_init(void) 1123 | 1124 | wait_bit_init(); 1125 | 1126 | +#ifdef CONFIG_BS_SCHED 1127 | + printk(KERN_INFO "Baby CPU scheduler (hrrn) v5.14 by Hamad Al Marri."); 1128 | +#endif 1129 | + 1130 | #ifdef CONFIG_FAIR_GROUP_SCHED 1131 | ptr += 2 * nr_cpu_ids * sizeof(void **); 1132 | #endif 1133 | diff --git a/kernel/sched/fair_numa.h b/kernel/sched/fair_numa.h 1134 | new file mode 100644 1135 | index 000000000000..a564478ec3f7 1136 | --- /dev/null 1137 | +++ b/kernel/sched/fair_numa.h 1138 | @@ -0,0 +1,1968 @@ 1139 | + 1140 | +#ifdef CONFIG_NUMA_BALANCING 1141 | + 1142 | +unsigned int sysctl_numa_balancing_scan_period_min = 1000; 1143 | +unsigned int sysctl_numa_balancing_scan_period_max = 60000; 1144 | + 1145 | +/* Portion of address space to scan in MB */ 1146 | +unsigned int sysctl_numa_balancing_scan_size = 256; 1147 | + 1148 | +/* Scan @scan_size MB every @scan_period after an initial @scan_delay in ms */ 1149 | +unsigned int sysctl_numa_balancing_scan_delay = 1000; 1150 | + 1151 | +struct numa_group { 1152 | + refcount_t refcount; 1153 | + 1154 | + spinlock_t lock; /* nr_tasks, tasks */ 1155 | + int nr_tasks; 1156 | + pid_t gid; 1157 | + int active_nodes; 1158 | + 1159 | + struct rcu_head rcu; 1160 | + unsigned long total_faults; 1161 | + unsigned long max_faults_cpu; 1162 | + /* 1163 | + * Faults_cpu is used to decide whether memory should move 1164 | + * towards the CPU. As a consequence, these stats are weighted 1165 | + * more by CPU use than by memory faults. 1166 | + */ 1167 | + unsigned long *faults_cpu; 1168 | + unsigned long faults[]; 1169 | +}; 1170 | + 1171 | +/* 1172 | + * For functions that can be called in multiple contexts that permit reading 1173 | + * ->numa_group (see struct task_struct for locking rules). 
1174 | + */ 1175 | +static struct numa_group *deref_task_numa_group(struct task_struct *p) 1176 | +{ 1177 | + return rcu_dereference_check(p->numa_group, p == current || 1178 | + (lockdep_is_held(__rq_lockp(task_rq(p))) && !READ_ONCE(p->on_cpu))); 1179 | +} 1180 | + 1181 | +static struct numa_group *deref_curr_numa_group(struct task_struct *p) 1182 | +{ 1183 | + return rcu_dereference_protected(p->numa_group, p == current); 1184 | +} 1185 | + 1186 | +static inline unsigned long group_faults_priv(struct numa_group *ng); 1187 | +static inline unsigned long group_faults_shared(struct numa_group *ng); 1188 | + 1189 | +static unsigned int task_nr_scan_windows(struct task_struct *p) 1190 | +{ 1191 | + unsigned long rss = 0; 1192 | + unsigned long nr_scan_pages; 1193 | + 1194 | + /* 1195 | + * Calculations based on RSS as non-present and empty pages are skipped 1196 | + * by the PTE scanner and NUMA hinting faults should be trapped based 1197 | + * on resident pages 1198 | + */ 1199 | + nr_scan_pages = sysctl_numa_balancing_scan_size << (20 - PAGE_SHIFT); 1200 | + rss = get_mm_rss(p->mm); 1201 | + if (!rss) 1202 | + rss = nr_scan_pages; 1203 | + 1204 | + rss = round_up(rss, nr_scan_pages); 1205 | + return rss / nr_scan_pages; 1206 | +} 1207 | + 1208 | +/* For sanity's sake, never scan more PTEs than MAX_SCAN_WINDOW MB/sec. */ 1209 | +#define MAX_SCAN_WINDOW 2560 1210 | + 1211 | +static unsigned int task_scan_min(struct task_struct *p) 1212 | +{ 1213 | + unsigned int scan_size = READ_ONCE(sysctl_numa_balancing_scan_size); 1214 | + unsigned int scan, floor; 1215 | + unsigned int windows = 1; 1216 | + 1217 | + if (scan_size < MAX_SCAN_WINDOW) 1218 | + windows = MAX_SCAN_WINDOW / scan_size; 1219 | + floor = 1000 / windows; 1220 | + 1221 | + scan = sysctl_numa_balancing_scan_period_min / task_nr_scan_windows(p); 1222 | + return max_t(unsigned int, floor, scan); 1223 | +} 1224 | + 1225 | +static unsigned int task_scan_max(struct task_struct *p) 1226 | +{ 1227 | + unsigned long smin = task_scan_min(p); 1228 | + unsigned long smax; 1229 | + struct numa_group *ng; 1230 | + 1231 | + /* Watch for min being lower than max due to floor calculations */ 1232 | + smax = sysctl_numa_balancing_scan_period_max / task_nr_scan_windows(p); 1233 | + 1234 | + /* Scale the maximum scan period with the amount of shared memory. */ 1235 | + ng = deref_curr_numa_group(p); 1236 | + if (ng) { 1237 | + unsigned long shared = group_faults_shared(ng); 1238 | + unsigned long private = group_faults_priv(ng); 1239 | + unsigned long period = smax; 1240 | + 1241 | + period *= refcount_read(&ng->refcount); 1242 | + period *= shared + 1; 1243 | + period /= private + shared + 1; 1244 | + 1245 | + smax = max(smax, period); 1246 | + } 1247 | + 1248 | + return max(smin, smax); 1249 | +} 1250 | + 1251 | +static void account_numa_enqueue(struct rq *rq, struct task_struct *p) 1252 | +{ 1253 | + rq->nr_numa_running += (p->numa_preferred_nid != NUMA_NO_NODE); 1254 | + rq->nr_preferred_running += (p->numa_preferred_nid == task_node(p)); 1255 | +} 1256 | + 1257 | +static void account_numa_dequeue(struct rq *rq, struct task_struct *p) 1258 | +{ 1259 | + rq->nr_numa_running -= (p->numa_preferred_nid != NUMA_NO_NODE); 1260 | + rq->nr_preferred_running -= (p->numa_preferred_nid == task_node(p)); 1261 | +} 1262 | + 1263 | +/* Shared or private faults. 
*/ 1264 | +#define NR_NUMA_HINT_FAULT_TYPES 2 1265 | + 1266 | +/* Memory and CPU locality */ 1267 | +#define NR_NUMA_HINT_FAULT_STATS (NR_NUMA_HINT_FAULT_TYPES * 2) 1268 | + 1269 | +/* Averaged statistics, and temporary buffers. */ 1270 | +#define NR_NUMA_HINT_FAULT_BUCKETS (NR_NUMA_HINT_FAULT_STATS * 2) 1271 | + 1272 | +pid_t task_numa_group_id(struct task_struct *p) 1273 | +{ 1274 | + struct numa_group *ng; 1275 | + pid_t gid = 0; 1276 | + 1277 | + rcu_read_lock(); 1278 | + ng = rcu_dereference(p->numa_group); 1279 | + if (ng) 1280 | + gid = ng->gid; 1281 | + rcu_read_unlock(); 1282 | + 1283 | + return gid; 1284 | +} 1285 | + 1286 | +/* 1287 | + * The averaged statistics, shared & private, memory & CPU, 1288 | + * occupy the first half of the array. The second half of the 1289 | + * array is for current counters, which are averaged into the 1290 | + * first set by task_numa_placement. 1291 | + */ 1292 | +static inline int task_faults_idx(enum numa_faults_stats s, int nid, int priv) 1293 | +{ 1294 | + return NR_NUMA_HINT_FAULT_TYPES * (s * nr_node_ids + nid) + priv; 1295 | +} 1296 | + 1297 | +static inline unsigned long task_faults(struct task_struct *p, int nid) 1298 | +{ 1299 | + if (!p->numa_faults) 1300 | + return 0; 1301 | + 1302 | + return p->numa_faults[task_faults_idx(NUMA_MEM, nid, 0)] + 1303 | + p->numa_faults[task_faults_idx(NUMA_MEM, nid, 1)]; 1304 | +} 1305 | + 1306 | +static inline unsigned long group_faults(struct task_struct *p, int nid) 1307 | +{ 1308 | + struct numa_group *ng = deref_task_numa_group(p); 1309 | + 1310 | + if (!ng) 1311 | + return 0; 1312 | + 1313 | + return ng->faults[task_faults_idx(NUMA_MEM, nid, 0)] + 1314 | + ng->faults[task_faults_idx(NUMA_MEM, nid, 1)]; 1315 | +} 1316 | + 1317 | +static inline unsigned long group_faults_cpu(struct numa_group *group, int nid) 1318 | +{ 1319 | + return group->faults_cpu[task_faults_idx(NUMA_MEM, nid, 0)] + 1320 | + group->faults_cpu[task_faults_idx(NUMA_MEM, nid, 1)]; 1321 | +} 1322 | + 1323 | +static inline unsigned long group_faults_priv(struct numa_group *ng) 1324 | +{ 1325 | + unsigned long faults = 0; 1326 | + int node; 1327 | + 1328 | + for_each_online_node(node) { 1329 | + faults += ng->faults[task_faults_idx(NUMA_MEM, node, 1)]; 1330 | + } 1331 | + 1332 | + return faults; 1333 | +} 1334 | + 1335 | +static inline unsigned long group_faults_shared(struct numa_group *ng) 1336 | +{ 1337 | + unsigned long faults = 0; 1338 | + int node; 1339 | + 1340 | + for_each_online_node(node) { 1341 | + faults += ng->faults[task_faults_idx(NUMA_MEM, node, 0)]; 1342 | + } 1343 | + 1344 | + return faults; 1345 | +} 1346 | + 1347 | +/* 1348 | + * A node triggering more than 1/3 as many NUMA faults as the maximum is 1349 | + * considered part of a numa group's pseudo-interleaving set. Migrations 1350 | + * between these nodes are slowed down, to allow things to settle down. 1351 | + */ 1352 | +#define ACTIVE_NODE_FRACTION 3 1353 | + 1354 | +static bool numa_is_active_node(int nid, struct numa_group *ng) 1355 | +{ 1356 | + return group_faults_cpu(ng, nid) * ACTIVE_NODE_FRACTION > ng->max_faults_cpu; 1357 | +} 1358 | + 1359 | +/* Handle placement on systems where not all nodes are directly connected. */ 1360 | +static unsigned long score_nearby_nodes(struct task_struct *p, int nid, 1361 | + int maxdist, bool task) 1362 | +{ 1363 | + unsigned long score = 0; 1364 | + int node; 1365 | + 1366 | + /* 1367 | + * All nodes are directly connected, and the same distance 1368 | + * from each other. No need for fancy placement algorithms. 
1369 | + */ 1370 | + if (sched_numa_topology_type == NUMA_DIRECT) 1371 | + return 0; 1372 | + 1373 | + /* 1374 | + * This code is called for each node, introducing N^2 complexity, 1375 | + * which should be ok given the number of nodes rarely exceeds 8. 1376 | + */ 1377 | + for_each_online_node(node) { 1378 | + unsigned long faults; 1379 | + int dist = node_distance(nid, node); 1380 | + 1381 | + /* 1382 | + * The furthest away nodes in the system are not interesting 1383 | + * for placement; nid was already counted. 1384 | + */ 1385 | + if (dist == sched_max_numa_distance || node == nid) 1386 | + continue; 1387 | + 1388 | + /* 1389 | + * On systems with a backplane NUMA topology, compare groups 1390 | + * of nodes, and move tasks towards the group with the most 1391 | + * memory accesses. When comparing two nodes at distance 1392 | + * "hoplimit", only nodes closer by than "hoplimit" are part 1393 | + * of each group. Skip other nodes. 1394 | + */ 1395 | + if (sched_numa_topology_type == NUMA_BACKPLANE && 1396 | + dist >= maxdist) 1397 | + continue; 1398 | + 1399 | + /* Add up the faults from nearby nodes. */ 1400 | + if (task) 1401 | + faults = task_faults(p, node); 1402 | + else 1403 | + faults = group_faults(p, node); 1404 | + 1405 | + /* 1406 | + * On systems with a glueless mesh NUMA topology, there are 1407 | + * no fixed "groups of nodes". Instead, nodes that are not 1408 | + * directly connected bounce traffic through intermediate 1409 | + * nodes; a numa_group can occupy any set of nodes. 1410 | + * The further away a node is, the less the faults count. 1411 | + * This seems to result in good task placement. 1412 | + */ 1413 | + if (sched_numa_topology_type == NUMA_GLUELESS_MESH) { 1414 | + faults *= (sched_max_numa_distance - dist); 1415 | + faults /= (sched_max_numa_distance - LOCAL_DISTANCE); 1416 | + } 1417 | + 1418 | + score += faults; 1419 | + } 1420 | + 1421 | + return score; 1422 | +} 1423 | + 1424 | +/* 1425 | + * These return the fraction of accesses done by a particular task, or 1426 | + * task group, on a particular numa node. The group weight is given a 1427 | + * larger multiplier, in order to group tasks together that are almost 1428 | + * evenly spread out between numa nodes. 
1429 | + */ 1430 | +static inline unsigned long task_weight(struct task_struct *p, int nid, 1431 | + int dist) 1432 | +{ 1433 | + unsigned long faults, total_faults; 1434 | + 1435 | + if (!p->numa_faults) 1436 | + return 0; 1437 | + 1438 | + total_faults = p->total_numa_faults; 1439 | + 1440 | + if (!total_faults) 1441 | + return 0; 1442 | + 1443 | + faults = task_faults(p, nid); 1444 | + faults += score_nearby_nodes(p, nid, dist, true); 1445 | + 1446 | + return 1000 * faults / total_faults; 1447 | +} 1448 | + 1449 | +static inline unsigned long group_weight(struct task_struct *p, int nid, 1450 | + int dist) 1451 | +{ 1452 | + struct numa_group *ng = deref_task_numa_group(p); 1453 | + unsigned long faults, total_faults; 1454 | + 1455 | + if (!ng) 1456 | + return 0; 1457 | + 1458 | + total_faults = ng->total_faults; 1459 | + 1460 | + if (!total_faults) 1461 | + return 0; 1462 | + 1463 | + faults = group_faults(p, nid); 1464 | + faults += score_nearby_nodes(p, nid, dist, false); 1465 | + 1466 | + return 1000 * faults / total_faults; 1467 | +} 1468 | + 1469 | +bool should_numa_migrate_memory(struct task_struct *p, struct page * page, 1470 | + int src_nid, int dst_cpu) 1471 | +{ 1472 | + struct numa_group *ng = deref_curr_numa_group(p); 1473 | + int dst_nid = cpu_to_node(dst_cpu); 1474 | + int last_cpupid, this_cpupid; 1475 | + 1476 | + this_cpupid = cpu_pid_to_cpupid(dst_cpu, current->pid); 1477 | + last_cpupid = page_cpupid_xchg_last(page, this_cpupid); 1478 | + 1479 | + /* 1480 | + * Allow first faults or private faults to migrate immediately early in 1481 | + * the lifetime of a task. The magic number 4 is based on waiting for 1482 | + * two full passes of the "multi-stage node selection" test that is 1483 | + * executed below. 1484 | + */ 1485 | + if ((p->numa_preferred_nid == NUMA_NO_NODE || p->numa_scan_seq <= 4) && 1486 | + (cpupid_pid_unset(last_cpupid) || cpupid_match_pid(p, last_cpupid))) 1487 | + return true; 1488 | + 1489 | + /* 1490 | + * Multi-stage node selection is used in conjunction with a periodic 1491 | + * migration fault to build a temporal task<->page relation. By using 1492 | + * a two-stage filter we remove short/unlikely relations. 1493 | + * 1494 | + * Using P(p) ~ n_p / n_t as per frequentist probability, we can equate 1495 | + * a task's usage of a particular page (n_p) per total usage of this 1496 | + * page (n_t) (in a given time-span) to a probability. 1497 | + * 1498 | + * Our periodic faults will sample this probability and getting the 1499 | + * same result twice in a row, given these samples are fully 1500 | + * independent, is then given by P(n)^2, provided our sample period 1501 | + * is sufficiently short compared to the usage pattern. 1502 | + * 1503 | + * This quadric squishes small probabilities, making it less likely we 1504 | + * act on an unlikely task<->page relation. 1505 | + */ 1506 | + if (!cpupid_pid_unset(last_cpupid) && 1507 | + cpupid_to_nid(last_cpupid) != dst_nid) 1508 | + return false; 1509 | + 1510 | + /* Always allow migrate on private faults */ 1511 | + if (cpupid_match_pid(p, last_cpupid)) 1512 | + return true; 1513 | + 1514 | + /* A shared fault, but p->numa_group has not been set up yet. */ 1515 | + if (!ng) 1516 | + return true; 1517 | + 1518 | + /* 1519 | + * Destination node is much more heavily used than the source 1520 | + * node? Allow migration. 
1521 | + */ 1522 | + if (group_faults_cpu(ng, dst_nid) > group_faults_cpu(ng, src_nid) * 1523 | + ACTIVE_NODE_FRACTION) 1524 | + return true; 1525 | + 1526 | + /* 1527 | + * Distribute memory according to CPU & memory use on each node, 1528 | + * with 3/4 hysteresis to avoid unnecessary memory migrations: 1529 | + * 1530 | + * faults_cpu(dst) 3 faults_cpu(src) 1531 | + * --------------- * - > --------------- 1532 | + * faults_mem(dst) 4 faults_mem(src) 1533 | + */ 1534 | + return group_faults_cpu(ng, dst_nid) * group_faults(p, src_nid) * 3 > 1535 | + group_faults_cpu(ng, src_nid) * group_faults(p, dst_nid) * 4; 1536 | +} 1537 | + 1538 | +/* 1539 | + * 'numa_type' describes the node at the moment of load balancing. 1540 | + */ 1541 | +enum numa_type { 1542 | + /* The node has spare capacity that can be used to run more tasks. */ 1543 | + node_has_spare = 0, 1544 | + /* 1545 | + * The node is fully used and the tasks don't compete for more CPU 1546 | + * cycles. Nevertheless, some tasks might wait before running. 1547 | + */ 1548 | + node_fully_busy, 1549 | + /* 1550 | + * The node is overloaded and can't provide expected CPU cycles to all 1551 | + * tasks. 1552 | + */ 1553 | + node_overloaded 1554 | +}; 1555 | + 1556 | +/* Cached statistics for all CPUs within a node */ 1557 | +struct numa_stats { 1558 | + unsigned long load; 1559 | + unsigned long runnable; 1560 | + unsigned long util; 1561 | + /* Total compute capacity of CPUs on a node */ 1562 | + unsigned long compute_capacity; 1563 | + unsigned int nr_running; 1564 | + unsigned int weight; 1565 | + enum numa_type node_type; 1566 | + int idle_cpu; 1567 | +}; 1568 | + 1569 | +static inline bool is_core_idle(int cpu) 1570 | +{ 1571 | +#ifdef CONFIG_SCHED_SMT 1572 | + int sibling; 1573 | + 1574 | + for_each_cpu(sibling, cpu_smt_mask(cpu)) { 1575 | + if (cpu == sibling) 1576 | + continue; 1577 | + 1578 | + if (!idle_cpu(sibling)) 1579 | + return false; 1580 | + } 1581 | +#endif 1582 | + 1583 | + return true; 1584 | +} 1585 | + 1586 | +struct task_numa_env { 1587 | + struct task_struct *p; 1588 | + 1589 | + int src_cpu, src_nid; 1590 | + int dst_cpu, dst_nid; 1591 | + 1592 | + struct numa_stats src_stats, dst_stats; 1593 | + 1594 | + int imbalance_pct; 1595 | + int dist; 1596 | + 1597 | + struct task_struct *best_task; 1598 | + long best_imp; 1599 | + int best_cpu; 1600 | +}; 1601 | + 1602 | +static inline unsigned long cfs_rq_load_avg(struct cfs_rq *cfs_rq) 1603 | +{ 1604 | + return cfs_rq->avg.load_avg; 1605 | +} 1606 | + 1607 | +static unsigned long cpu_load(struct rq *rq) 1608 | +{ 1609 | + return cfs_rq_load_avg(&rq->cfs); 1610 | +} 1611 | + 1612 | +static inline unsigned long cfs_rq_runnable_avg(struct cfs_rq *cfs_rq) 1613 | +{ 1614 | + return cfs_rq->avg.runnable_avg; 1615 | +} 1616 | + 1617 | +static unsigned long cpu_runnable(struct rq *rq) 1618 | +{ 1619 | + return cfs_rq_runnable_avg(&rq->cfs); 1620 | +} 1621 | + 1622 | +static inline unsigned long cpu_util(int cpu) 1623 | +{ 1624 | + struct cfs_rq *cfs_rq; 1625 | + unsigned int util; 1626 | + 1627 | + cfs_rq = &cpu_rq(cpu)->cfs; 1628 | + util = READ_ONCE(cfs_rq->avg.util_avg); 1629 | + 1630 | + if (sched_feat(UTIL_EST)) 1631 | + util = max(util, READ_ONCE(cfs_rq->avg.util_est.enqueued)); 1632 | + 1633 | + return min_t(unsigned long, util, capacity_orig_of(cpu)); 1634 | +} 1635 | + 1636 | +static unsigned long capacity_of(int cpu) 1637 | +{ 1638 | + return cpu_rq(cpu)->cpu_capacity; 1639 | +} 1640 | + 1641 | +static inline enum 1642 | +numa_type numa_classify(unsigned int 
imbalance_pct, 1643 | + struct numa_stats *ns) 1644 | +{ 1645 | + if ((ns->nr_running > ns->weight) && 1646 | + (((ns->compute_capacity * 100) < (ns->util * imbalance_pct)) || 1647 | + ((ns->compute_capacity * imbalance_pct) < (ns->runnable * 100)))) 1648 | + return node_overloaded; 1649 | + 1650 | + if ((ns->nr_running < ns->weight) || 1651 | + (((ns->compute_capacity * 100) > (ns->util * imbalance_pct)) && 1652 | + ((ns->compute_capacity * imbalance_pct) > (ns->runnable * 100)))) 1653 | + return node_has_spare; 1654 | + 1655 | + return node_fully_busy; 1656 | +} 1657 | + 1658 | +#ifdef CONFIG_SCHED_SMT 1659 | +/* Forward declarations of select_idle_sibling helpers */ 1660 | +static inline bool test_idle_cores(int cpu, bool def); 1661 | +static inline int numa_idle_core(int idle_core, int cpu) 1662 | +{ 1663 | + if (!static_branch_likely(&sched_smt_present) || 1664 | + idle_core >= 0 || !test_idle_cores(cpu, false)) 1665 | + return idle_core; 1666 | + 1667 | + /* 1668 | + * Prefer cores instead of packing HT siblings 1669 | + * and triggering future load balancing. 1670 | + */ 1671 | + if (is_core_idle(cpu)) 1672 | + idle_core = cpu; 1673 | + 1674 | + return idle_core; 1675 | +} 1676 | +#else 1677 | +static inline int numa_idle_core(int idle_core, int cpu) 1678 | +{ 1679 | + return idle_core; 1680 | +} 1681 | +#endif 1682 | + 1683 | +/* 1684 | + * Gather all necessary information to make NUMA balancing placement 1685 | + * decisions that are compatible with standard load balancer. This 1686 | + * borrows code and logic from update_sg_lb_stats but sharing a 1687 | + * common implementation is impractical. 1688 | + */ 1689 | +static void update_numa_stats(struct task_numa_env *env, 1690 | + struct numa_stats *ns, int nid, 1691 | + bool find_idle) 1692 | +{ 1693 | + int cpu, idle_core = -1; 1694 | + 1695 | + memset(ns, 0, sizeof(*ns)); 1696 | + ns->idle_cpu = -1; 1697 | + 1698 | + rcu_read_lock(); 1699 | + for_each_cpu(cpu, cpumask_of_node(nid)) { 1700 | + struct rq *rq = cpu_rq(cpu); 1701 | + 1702 | + ns->load += cpu_load(rq); 1703 | + ns->runnable += cpu_runnable(rq); 1704 | + ns->util += cpu_util(cpu); 1705 | + ns->nr_running += rq->cfs.h_nr_running; 1706 | + ns->compute_capacity += capacity_of(cpu); 1707 | + 1708 | + if (find_idle && !rq->nr_running && idle_cpu(cpu)) { 1709 | + if (READ_ONCE(rq->numa_migrate_on) || 1710 | + !cpumask_test_cpu(cpu, env->p->cpus_ptr)) 1711 | + continue; 1712 | + 1713 | + if (ns->idle_cpu == -1) 1714 | + ns->idle_cpu = cpu; 1715 | + 1716 | + idle_core = numa_idle_core(idle_core, cpu); 1717 | + } 1718 | + } 1719 | + rcu_read_unlock(); 1720 | + 1721 | + ns->weight = cpumask_weight(cpumask_of_node(nid)); 1722 | + 1723 | + ns->node_type = numa_classify(env->imbalance_pct, ns); 1724 | + 1725 | + if (idle_core >= 0) 1726 | + ns->idle_cpu = idle_core; 1727 | +} 1728 | + 1729 | +static void task_numa_assign(struct task_numa_env *env, 1730 | + struct task_struct *p, long imp) 1731 | +{ 1732 | + struct rq *rq = cpu_rq(env->dst_cpu); 1733 | + 1734 | + /* Check if run-queue part of active NUMA balance. */ 1735 | + if (env->best_cpu != env->dst_cpu && xchg(&rq->numa_migrate_on, 1)) { 1736 | + int cpu; 1737 | + int start = env->dst_cpu; 1738 | + 1739 | + /* Find alternative idle CPU. 
*/ 1740 | + for_each_cpu_wrap(cpu, cpumask_of_node(env->dst_nid), start) { 1741 | + if (cpu == env->best_cpu || !idle_cpu(cpu) || 1742 | + !cpumask_test_cpu(cpu, env->p->cpus_ptr)) { 1743 | + continue; 1744 | + } 1745 | + 1746 | + env->dst_cpu = cpu; 1747 | + rq = cpu_rq(env->dst_cpu); 1748 | + if (!xchg(&rq->numa_migrate_on, 1)) 1749 | + goto assign; 1750 | + } 1751 | + 1752 | + /* Failed to find an alternative idle CPU */ 1753 | + return; 1754 | + } 1755 | + 1756 | +assign: 1757 | + /* 1758 | + * Clear previous best_cpu/rq numa-migrate flag, since task now 1759 | + * found a better CPU to move/swap. 1760 | + */ 1761 | + if (env->best_cpu != -1 && env->best_cpu != env->dst_cpu) { 1762 | + rq = cpu_rq(env->best_cpu); 1763 | + WRITE_ONCE(rq->numa_migrate_on, 0); 1764 | + } 1765 | + 1766 | + if (env->best_task) 1767 | + put_task_struct(env->best_task); 1768 | + if (p) 1769 | + get_task_struct(p); 1770 | + 1771 | + env->best_task = p; 1772 | + env->best_imp = imp; 1773 | + env->best_cpu = env->dst_cpu; 1774 | +} 1775 | + 1776 | +static bool load_too_imbalanced(long src_load, long dst_load, 1777 | + struct task_numa_env *env) 1778 | +{ 1779 | + long imb, old_imb; 1780 | + long orig_src_load, orig_dst_load; 1781 | + long src_capacity, dst_capacity; 1782 | + 1783 | + /* 1784 | + * The load is corrected for the CPU capacity available on each node. 1785 | + * 1786 | + * src_load dst_load 1787 | + * ------------ vs --------- 1788 | + * src_capacity dst_capacity 1789 | + */ 1790 | + src_capacity = env->src_stats.compute_capacity; 1791 | + dst_capacity = env->dst_stats.compute_capacity; 1792 | + 1793 | + imb = abs(dst_load * src_capacity - src_load * dst_capacity); 1794 | + 1795 | + orig_src_load = env->src_stats.load; 1796 | + orig_dst_load = env->dst_stats.load; 1797 | + 1798 | + old_imb = abs(orig_dst_load * src_capacity - orig_src_load * dst_capacity); 1799 | + 1800 | + /* Would this change make things worse? */ 1801 | + return (imb > old_imb); 1802 | +} 1803 | + 1804 | +static unsigned int task_scan_start(struct task_struct *p) 1805 | +{ 1806 | + unsigned long smin = task_scan_min(p); 1807 | + unsigned long period = smin; 1808 | + struct numa_group *ng; 1809 | + 1810 | + /* Scale the maximum scan period with the amount of shared memory. */ 1811 | + rcu_read_lock(); 1812 | + ng = rcu_dereference(p->numa_group); 1813 | + if (ng) { 1814 | + unsigned long shared = group_faults_shared(ng); 1815 | + unsigned long private = group_faults_priv(ng); 1816 | + 1817 | + period *= refcount_read(&ng->refcount); 1818 | + period *= shared + 1; 1819 | + period /= private + shared + 1; 1820 | + } 1821 | + rcu_read_unlock(); 1822 | + 1823 | + return max(smin, period); 1824 | +} 1825 | + 1826 | +static void update_scan_period(struct task_struct *p, int new_cpu) 1827 | +{ 1828 | + int src_nid = cpu_to_node(task_cpu(p)); 1829 | + int dst_nid = cpu_to_node(new_cpu); 1830 | + 1831 | + if (!static_branch_likely(&sched_numa_balancing)) 1832 | + return; 1833 | + 1834 | + if (!p->mm || !p->numa_faults || (p->flags & PF_EXITING)) 1835 | + return; 1836 | + 1837 | + if (src_nid == dst_nid) 1838 | + return; 1839 | + 1840 | + /* 1841 | + * Allow resets if faults have been trapped before one scan 1842 | + * has completed. This is most likely due to a new task that 1843 | + * is pulled cross-node due to wakeups or load balancing. 
1844 | + */ 1845 | + if (p->numa_scan_seq) { 1846 | + /* 1847 | + * Avoid scan adjustments if moving to the preferred 1848 | + * node or if the task was not previously running on 1849 | + * the preferred node. 1850 | + */ 1851 | + if (dst_nid == p->numa_preferred_nid || 1852 | + (p->numa_preferred_nid != NUMA_NO_NODE && 1853 | + src_nid != p->numa_preferred_nid)) 1854 | + return; 1855 | + } 1856 | + 1857 | + p->numa_scan_period = task_scan_start(p); 1858 | +} 1859 | + 1860 | +/* 1861 | + * Allow a NUMA imbalance if busy CPUs is less than 25% of the domain. 1862 | + * This is an approximation as the number of running tasks may not be 1863 | + * related to the number of busy CPUs due to sched_setaffinity. 1864 | + */ 1865 | +static inline bool allow_numa_imbalance(int dst_running, int dst_weight) 1866 | +{ 1867 | + return (dst_running < (dst_weight >> 2)); 1868 | +} 1869 | + 1870 | +#define NUMA_IMBALANCE_MIN 2 1871 | + 1872 | +static inline long adjust_numa_imbalance(int imbalance, 1873 | + int dst_running, int dst_weight) 1874 | +{ 1875 | + if (!allow_numa_imbalance(dst_running, dst_weight)) 1876 | + return imbalance; 1877 | + 1878 | + /* 1879 | + * Allow a small imbalance based on a simple pair of communicating 1880 | + * tasks that remain local when the destination is lightly loaded. 1881 | + */ 1882 | + if (imbalance <= NUMA_IMBALANCE_MIN) 1883 | + return 0; 1884 | + 1885 | + return imbalance; 1886 | +} 1887 | + 1888 | +static unsigned long task_h_load(struct task_struct *p) 1889 | +{ 1890 | + return p->se.avg.load_avg; 1891 | +} 1892 | + 1893 | +/* 1894 | + * Maximum NUMA importance can be 1998 (2*999); 1895 | + * SMALLIMP @ 30 would be close to 1998/64. 1896 | + * Used to deter task migration. 1897 | + */ 1898 | +#define SMALLIMP 30 1899 | + 1900 | +/* 1901 | + * This checks if the overall compute and NUMA accesses of the system would 1902 | + * be improved if the source tasks was migrated to the target dst_cpu taking 1903 | + * into account that it might be best if task running on the dst_cpu should 1904 | + * be exchanged with the source task 1905 | + */ 1906 | +static bool task_numa_compare(struct task_numa_env *env, 1907 | + long taskimp, long groupimp, bool maymove) 1908 | +{ 1909 | + struct numa_group *cur_ng, *p_ng = deref_curr_numa_group(env->p); 1910 | + struct rq *dst_rq = cpu_rq(env->dst_cpu); 1911 | + long imp = p_ng ? groupimp : taskimp; 1912 | + struct task_struct *cur; 1913 | + long src_load, dst_load; 1914 | + int dist = env->dist; 1915 | + long moveimp = imp; 1916 | + long load; 1917 | + bool stopsearch = false; 1918 | + 1919 | + if (READ_ONCE(dst_rq->numa_migrate_on)) 1920 | + return false; 1921 | + 1922 | + rcu_read_lock(); 1923 | + cur = rcu_dereference(dst_rq->curr); 1924 | + if (cur && ((cur->flags & PF_EXITING) || is_idle_task(cur))) 1925 | + cur = NULL; 1926 | + 1927 | + /* 1928 | + * Because we have preemption enabled we can get migrated around and 1929 | + * end try selecting ourselves (current == env->p) as a swap candidate. 1930 | + */ 1931 | + if (cur == env->p) { 1932 | + stopsearch = true; 1933 | + goto unlock; 1934 | + } 1935 | + 1936 | + if (!cur) { 1937 | + if (maymove && moveimp >= env->best_imp) 1938 | + goto assign; 1939 | + else 1940 | + goto unlock; 1941 | + } 1942 | + 1943 | + /* Skip this swap candidate if cannot move to the source cpu. 
*/ 1944 | + if (!cpumask_test_cpu(env->src_cpu, cur->cpus_ptr)) 1945 | + goto unlock; 1946 | + 1947 | + /* 1948 | + * Skip this swap candidate if it is not moving to its preferred 1949 | + * node and the best task is. 1950 | + */ 1951 | + if (env->best_task && 1952 | + env->best_task->numa_preferred_nid == env->src_nid && 1953 | + cur->numa_preferred_nid != env->src_nid) { 1954 | + goto unlock; 1955 | + } 1956 | + 1957 | + /* 1958 | + * "imp" is the fault differential for the source task between the 1959 | + * source and destination node. Calculate the total differential for 1960 | + * the source task and potential destination task. The more negative 1961 | + * the value is, the more remote accesses that would be expected to 1962 | + * be incurred if the tasks were swapped. 1963 | + * 1964 | + * If dst and source tasks are in the same NUMA group, or not 1965 | + * in any group then look only at task weights. 1966 | + */ 1967 | + cur_ng = rcu_dereference(cur->numa_group); 1968 | + if (cur_ng == p_ng) { 1969 | + imp = taskimp + task_weight(cur, env->src_nid, dist) - 1970 | + task_weight(cur, env->dst_nid, dist); 1971 | + /* 1972 | + * Add some hysteresis to prevent swapping the 1973 | + * tasks within a group over tiny differences. 1974 | + */ 1975 | + if (cur_ng) 1976 | + imp -= imp / 16; 1977 | + } else { 1978 | + /* 1979 | + * Compare the group weights. If a task is all by itself 1980 | + * (not part of a group), use the task weight instead. 1981 | + */ 1982 | + if (cur_ng && p_ng) 1983 | + imp += group_weight(cur, env->src_nid, dist) - 1984 | + group_weight(cur, env->dst_nid, dist); 1985 | + else 1986 | + imp += task_weight(cur, env->src_nid, dist) - 1987 | + task_weight(cur, env->dst_nid, dist); 1988 | + } 1989 | + 1990 | + /* Discourage picking a task already on its preferred node */ 1991 | + if (cur->numa_preferred_nid == env->dst_nid) 1992 | + imp -= imp / 16; 1993 | + 1994 | + /* 1995 | + * Encourage picking a task that moves to its preferred node. 1996 | + * This potentially makes imp larger than it's maximum of 1997 | + * 1998 (see SMALLIMP and task_weight for why) but in this 1998 | + * case, it does not matter. 1999 | + */ 2000 | + if (cur->numa_preferred_nid == env->src_nid) 2001 | + imp += imp / 8; 2002 | + 2003 | + if (maymove && moveimp > imp && moveimp > env->best_imp) { 2004 | + imp = moveimp; 2005 | + cur = NULL; 2006 | + goto assign; 2007 | + } 2008 | + 2009 | + /* 2010 | + * Prefer swapping with a task moving to its preferred node over a 2011 | + * task that is not. 2012 | + */ 2013 | + if (env->best_task && cur->numa_preferred_nid == env->src_nid && 2014 | + env->best_task->numa_preferred_nid != env->src_nid) { 2015 | + goto assign; 2016 | + } 2017 | + 2018 | + /* 2019 | + * If the NUMA importance is less than SMALLIMP, 2020 | + * task migration might only result in ping pong 2021 | + * of tasks and also hurt performance due to cache 2022 | + * misses. 2023 | + */ 2024 | + if (imp < SMALLIMP || imp <= env->best_imp + SMALLIMP / 2) 2025 | + goto unlock; 2026 | + 2027 | + /* 2028 | + * In the overloaded case, try and keep the load balanced. 2029 | + */ 2030 | + load = task_h_load(env->p) - task_h_load(cur); 2031 | + if (!load) 2032 | + goto assign; 2033 | + 2034 | + dst_load = env->dst_stats.load + load; 2035 | + src_load = env->src_stats.load - load; 2036 | + 2037 | + if (load_too_imbalanced(src_load, dst_load, env)) 2038 | + goto unlock; 2039 | + 2040 | +assign: 2041 | + /* Evaluate an idle CPU for a task numa move. 
*/ 2042 | + if (!cur) { 2043 | + int cpu = env->dst_stats.idle_cpu; 2044 | + 2045 | + /* Nothing cached so current CPU went idle since the search. */ 2046 | + if (cpu < 0) 2047 | + cpu = env->dst_cpu; 2048 | + 2049 | + /* 2050 | + * If the CPU is no longer truly idle and the previous best CPU 2051 | + * is, keep using it. 2052 | + */ 2053 | + if (!idle_cpu(cpu) && env->best_cpu >= 0 && 2054 | + idle_cpu(env->best_cpu)) { 2055 | + cpu = env->best_cpu; 2056 | + } 2057 | + 2058 | + env->dst_cpu = cpu; 2059 | + } 2060 | + 2061 | + task_numa_assign(env, cur, imp); 2062 | + 2063 | + /* 2064 | + * If a move to idle is allowed because there is capacity or load 2065 | + * balance improves then stop the search. While a better swap 2066 | + * candidate may exist, a search is not free. 2067 | + */ 2068 | + if (maymove && !cur && env->best_cpu >= 0 && idle_cpu(env->best_cpu)) 2069 | + stopsearch = true; 2070 | + 2071 | + /* 2072 | + * If a swap candidate must be identified and the current best task 2073 | + * moves its preferred node then stop the search. 2074 | + */ 2075 | + if (!maymove && env->best_task && 2076 | + env->best_task->numa_preferred_nid == env->src_nid) { 2077 | + stopsearch = true; 2078 | + } 2079 | +unlock: 2080 | + rcu_read_unlock(); 2081 | + 2082 | + return stopsearch; 2083 | +} 2084 | + 2085 | +static void task_numa_find_cpu(struct task_numa_env *env, 2086 | + long taskimp, long groupimp) 2087 | +{ 2088 | + bool maymove = false; 2089 | + int cpu; 2090 | + 2091 | + /* 2092 | + * If dst node has spare capacity, then check if there is an 2093 | + * imbalance that would be overruled by the load balancer. 2094 | + */ 2095 | + if (env->dst_stats.node_type == node_has_spare) { 2096 | + unsigned int imbalance; 2097 | + int src_running, dst_running; 2098 | + 2099 | + /* 2100 | + * Would movement cause an imbalance? Note that if src has 2101 | + * more running tasks that the imbalance is ignored as the 2102 | + * move improves the imbalance from the perspective of the 2103 | + * CPU load balancer. 2104 | + * */ 2105 | + src_running = env->src_stats.nr_running - 1; 2106 | + dst_running = env->dst_stats.nr_running + 1; 2107 | + imbalance = max(0, dst_running - src_running); 2108 | + imbalance = adjust_numa_imbalance(imbalance, dst_running, 2109 | + env->dst_stats.weight); 2110 | + 2111 | + /* Use idle CPU if there is no imbalance */ 2112 | + if (!imbalance) { 2113 | + maymove = true; 2114 | + if (env->dst_stats.idle_cpu >= 0) { 2115 | + env->dst_cpu = env->dst_stats.idle_cpu; 2116 | + task_numa_assign(env, NULL, 0); 2117 | + return; 2118 | + } 2119 | + } 2120 | + } else { 2121 | + long src_load, dst_load, load; 2122 | + /* 2123 | + * If the improvement from just moving env->p direction is better 2124 | + * than swapping tasks around, check if a move is possible. 
2125 | + */ 2126 | + load = task_h_load(env->p); 2127 | + dst_load = env->dst_stats.load + load; 2128 | + src_load = env->src_stats.load - load; 2129 | + maymove = !load_too_imbalanced(src_load, dst_load, env); 2130 | + } 2131 | + 2132 | + for_each_cpu(cpu, cpumask_of_node(env->dst_nid)) { 2133 | + /* Skip this CPU if the source task cannot migrate */ 2134 | + if (!cpumask_test_cpu(cpu, env->p->cpus_ptr)) 2135 | + continue; 2136 | + 2137 | + env->dst_cpu = cpu; 2138 | + if (task_numa_compare(env, taskimp, groupimp, maymove)) 2139 | + break; 2140 | + } 2141 | +} 2142 | + 2143 | +static int task_numa_migrate(struct task_struct *p) 2144 | +{ 2145 | + struct task_numa_env env = { 2146 | + .p = p, 2147 | + 2148 | + .src_cpu = task_cpu(p), 2149 | + .src_nid = task_node(p), 2150 | + 2151 | + .imbalance_pct = 112, 2152 | + 2153 | + .best_task = NULL, 2154 | + .best_imp = 0, 2155 | + .best_cpu = -1, 2156 | + }; 2157 | + unsigned long taskweight, groupweight; 2158 | + struct sched_domain *sd; 2159 | + long taskimp, groupimp; 2160 | + struct numa_group *ng; 2161 | + struct rq *best_rq; 2162 | + int nid, ret, dist; 2163 | + 2164 | + /* 2165 | + * Pick the lowest SD_NUMA domain, as that would have the smallest 2166 | + * imbalance and would be the first to start moving tasks about. 2167 | + * 2168 | + * And we want to avoid any moving of tasks about, as that would create 2169 | + * random movement of tasks -- counter the numa conditions we're trying 2170 | + * to satisfy here. 2171 | + */ 2172 | + rcu_read_lock(); 2173 | + sd = rcu_dereference(per_cpu(sd_numa, env.src_cpu)); 2174 | + if (sd) 2175 | + env.imbalance_pct = 100 + (sd->imbalance_pct - 100) / 2; 2176 | + rcu_read_unlock(); 2177 | + 2178 | + /* 2179 | + * Cpusets can break the scheduler domain tree into smaller 2180 | + * balance domains, some of which do not cross NUMA boundaries. 2181 | + * Tasks that are "trapped" in such domains cannot be migrated 2182 | + * elsewhere, so there is no point in (re)trying. 2183 | + */ 2184 | + if (unlikely(!sd)) { 2185 | + sched_setnuma(p, task_node(p)); 2186 | + return -EINVAL; 2187 | + } 2188 | + 2189 | + env.dst_nid = p->numa_preferred_nid; 2190 | + dist = env.dist = node_distance(env.src_nid, env.dst_nid); 2191 | + taskweight = task_weight(p, env.src_nid, dist); 2192 | + groupweight = group_weight(p, env.src_nid, dist); 2193 | + update_numa_stats(&env, &env.src_stats, env.src_nid, false); 2194 | + taskimp = task_weight(p, env.dst_nid, dist) - taskweight; 2195 | + groupimp = group_weight(p, env.dst_nid, dist) - groupweight; 2196 | + update_numa_stats(&env, &env.dst_stats, env.dst_nid, true); 2197 | + 2198 | + /* Try to find a spot on the preferred nid. */ 2199 | + task_numa_find_cpu(&env, taskimp, groupimp); 2200 | + 2201 | + /* 2202 | + * Look at other nodes in these cases: 2203 | + * - there is no space available on the preferred_nid 2204 | + * - the task is part of a numa_group that is interleaved across 2205 | + * multiple NUMA nodes; in order to better consolidate the group, 2206 | + * we need to check other locations. 
2207 | + */ 2208 | + ng = deref_curr_numa_group(p); 2209 | + if (env.best_cpu == -1 || (ng && ng->active_nodes > 1)) { 2210 | + for_each_online_node(nid) { 2211 | + if (nid == env.src_nid || nid == p->numa_preferred_nid) 2212 | + continue; 2213 | + 2214 | + dist = node_distance(env.src_nid, env.dst_nid); 2215 | + if (sched_numa_topology_type == NUMA_BACKPLANE && 2216 | + dist != env.dist) { 2217 | + taskweight = task_weight(p, env.src_nid, dist); 2218 | + groupweight = group_weight(p, env.src_nid, dist); 2219 | + } 2220 | + 2221 | + /* Only consider nodes where both task and groups benefit */ 2222 | + taskimp = task_weight(p, nid, dist) - taskweight; 2223 | + groupimp = group_weight(p, nid, dist) - groupweight; 2224 | + if (taskimp < 0 && groupimp < 0) 2225 | + continue; 2226 | + 2227 | + env.dist = dist; 2228 | + env.dst_nid = nid; 2229 | + update_numa_stats(&env, &env.dst_stats, env.dst_nid, true); 2230 | + task_numa_find_cpu(&env, taskimp, groupimp); 2231 | + } 2232 | + } 2233 | + 2234 | + /* 2235 | + * If the task is part of a workload that spans multiple NUMA nodes, 2236 | + * and is migrating into one of the workload's active nodes, remember 2237 | + * this node as the task's preferred numa node, so the workload can 2238 | + * settle down. 2239 | + * A task that migrated to a second choice node will be better off 2240 | + * trying for a better one later. Do not set the preferred node here. 2241 | + */ 2242 | + if (ng) { 2243 | + if (env.best_cpu == -1) 2244 | + nid = env.src_nid; 2245 | + else 2246 | + nid = cpu_to_node(env.best_cpu); 2247 | + 2248 | + if (nid != p->numa_preferred_nid) 2249 | + sched_setnuma(p, nid); 2250 | + } 2251 | + 2252 | + /* No better CPU than the current one was found. */ 2253 | + if (env.best_cpu == -1) { 2254 | + trace_sched_stick_numa(p, env.src_cpu, NULL, -1); 2255 | + return -EAGAIN; 2256 | + } 2257 | + 2258 | + best_rq = cpu_rq(env.best_cpu); 2259 | + if (env.best_task == NULL) { 2260 | + ret = migrate_task_to(p, env.best_cpu); 2261 | + WRITE_ONCE(best_rq->numa_migrate_on, 0); 2262 | + if (ret != 0) 2263 | + trace_sched_stick_numa(p, env.src_cpu, NULL, env.best_cpu); 2264 | + return ret; 2265 | + } 2266 | + 2267 | + ret = migrate_swap(p, env.best_task, env.best_cpu, env.src_cpu); 2268 | + WRITE_ONCE(best_rq->numa_migrate_on, 0); 2269 | + 2270 | + if (ret != 0) 2271 | + trace_sched_stick_numa(p, env.src_cpu, env.best_task, env.best_cpu); 2272 | + put_task_struct(env.best_task); 2273 | + return ret; 2274 | +} 2275 | + 2276 | +/* Attempt to migrate a task to a CPU on the preferred node. */ 2277 | +static void numa_migrate_preferred(struct task_struct *p) 2278 | +{ 2279 | + unsigned long interval = HZ; 2280 | + 2281 | + /* This task has no NUMA fault statistics yet */ 2282 | + if (unlikely(p->numa_preferred_nid == NUMA_NO_NODE || !p->numa_faults)) 2283 | + return; 2284 | + 2285 | + /* Periodically retry migrating the task to the preferred node */ 2286 | + interval = min(interval, msecs_to_jiffies(p->numa_scan_period) / 16); 2287 | + p->numa_migrate_retry = jiffies + interval; 2288 | + 2289 | + /* Success if task is already running on preferred CPU */ 2290 | + if (task_node(p) == p->numa_preferred_nid) 2291 | + return; 2292 | + 2293 | + /* Otherwise, try migrate to a CPU on the preferred node */ 2294 | + task_numa_migrate(p); 2295 | +} 2296 | + 2297 | +/* 2298 | + * Find out how many nodes on the workload is actively running on. Do this by 2299 | + * tracking the nodes from which NUMA hinting faults are triggered. 
This can 2300 | + * be different from the set of nodes where the workload's memory is currently 2301 | + * located. 2302 | + */ 2303 | +static void numa_group_count_active_nodes(struct numa_group *numa_group) 2304 | +{ 2305 | + unsigned long faults, max_faults = 0; 2306 | + int nid, active_nodes = 0; 2307 | + 2308 | + for_each_online_node(nid) { 2309 | + faults = group_faults_cpu(numa_group, nid); 2310 | + if (faults > max_faults) 2311 | + max_faults = faults; 2312 | + } 2313 | + 2314 | + for_each_online_node(nid) { 2315 | + faults = group_faults_cpu(numa_group, nid); 2316 | + if (faults * ACTIVE_NODE_FRACTION > max_faults) 2317 | + active_nodes++; 2318 | + } 2319 | + 2320 | + numa_group->max_faults_cpu = max_faults; 2321 | + numa_group->active_nodes = active_nodes; 2322 | +} 2323 | + 2324 | +#define NUMA_PERIOD_SLOTS 10 2325 | +#define NUMA_PERIOD_THRESHOLD 7 2326 | + 2327 | +/* 2328 | + * Increase the scan period (slow down scanning) if the majority of 2329 | + * our memory is already on our local node, or if the majority of 2330 | + * the page accesses are shared with other processes. 2331 | + * Otherwise, decrease the scan period. 2332 | + */ 2333 | +static void update_task_scan_period(struct task_struct *p, 2334 | + unsigned long shared, unsigned long private) 2335 | +{ 2336 | + unsigned int period_slot; 2337 | + int lr_ratio, ps_ratio; 2338 | + int diff; 2339 | + 2340 | + unsigned long remote = p->numa_faults_locality[0]; 2341 | + unsigned long local = p->numa_faults_locality[1]; 2342 | + 2343 | + /* 2344 | + * If there were no record hinting faults then either the task is 2345 | + * completely idle or all activity is areas that are not of interest 2346 | + * to automatic numa balancing. Related to that, if there were failed 2347 | + * migration then it implies we are migrating too quickly or the local 2348 | + * node is overloaded. In either case, scan slower 2349 | + */ 2350 | + if (local + shared == 0 || p->numa_faults_locality[2]) { 2351 | + p->numa_scan_period = min(p->numa_scan_period_max, 2352 | + p->numa_scan_period << 1); 2353 | + 2354 | + p->mm->numa_next_scan = jiffies + 2355 | + msecs_to_jiffies(p->numa_scan_period); 2356 | + 2357 | + return; 2358 | + } 2359 | + 2360 | + /* 2361 | + * Prepare to scale scan period relative to the current period. 2362 | + * == NUMA_PERIOD_THRESHOLD scan period stays the same 2363 | + * < NUMA_PERIOD_THRESHOLD scan period decreases (scan faster) 2364 | + * >= NUMA_PERIOD_THRESHOLD scan period increases (scan slower) 2365 | + */ 2366 | + period_slot = DIV_ROUND_UP(p->numa_scan_period, NUMA_PERIOD_SLOTS); 2367 | + lr_ratio = (local * NUMA_PERIOD_SLOTS) / (local + remote); 2368 | + ps_ratio = (private * NUMA_PERIOD_SLOTS) / (private + shared); 2369 | + 2370 | + if (ps_ratio >= NUMA_PERIOD_THRESHOLD) { 2371 | + /* 2372 | + * Most memory accesses are local. There is no need to 2373 | + * do fast NUMA scanning, since memory is already local. 2374 | + */ 2375 | + int slot = ps_ratio - NUMA_PERIOD_THRESHOLD; 2376 | + if (!slot) 2377 | + slot = 1; 2378 | + diff = slot * period_slot; 2379 | + } else if (lr_ratio >= NUMA_PERIOD_THRESHOLD) { 2380 | + /* 2381 | + * Most memory accesses are shared with other tasks. 2382 | + * There is no point in continuing fast NUMA scanning, 2383 | + * since other tasks may just move the memory elsewhere. 
2384 | + */ 2385 | + int slot = lr_ratio - NUMA_PERIOD_THRESHOLD; 2386 | + if (!slot) 2387 | + slot = 1; 2388 | + diff = slot * period_slot; 2389 | + } else { 2390 | + /* 2391 | + * Private memory faults exceed (SLOTS-THRESHOLD)/SLOTS, 2392 | + * yet they are not on the local NUMA node. Speed up 2393 | + * NUMA scanning to get the memory moved over. 2394 | + */ 2395 | + int ratio = max(lr_ratio, ps_ratio); 2396 | + diff = -(NUMA_PERIOD_THRESHOLD - ratio) * period_slot; 2397 | + } 2398 | + 2399 | + p->numa_scan_period = clamp(p->numa_scan_period + diff, 2400 | + task_scan_min(p), task_scan_max(p)); 2401 | + memset(p->numa_faults_locality, 0, sizeof(p->numa_faults_locality)); 2402 | +} 2403 | + 2404 | +/* 2405 | + * Get the fraction of time the task has been running since the last 2406 | + * NUMA placement cycle. The scheduler keeps similar statistics, but 2407 | + * decays those on a 32ms period, which is orders of magnitude off 2408 | + * from the dozens-of-seconds NUMA balancing period. Use the scheduler 2409 | + * stats only if the task is so new there are no NUMA statistics yet. 2410 | + */ 2411 | +static u64 numa_get_avg_runtime(struct task_struct *p, u64 *period) 2412 | +{ 2413 | + u64 runtime, delta, now; 2414 | + /* Use the start of this time slice to avoid calculations. */ 2415 | + now = p->se.exec_start; 2416 | + runtime = p->se.sum_exec_runtime; 2417 | + 2418 | + if (p->last_task_numa_placement) { 2419 | + delta = runtime - p->last_sum_exec_runtime; 2420 | + *period = now - p->last_task_numa_placement; 2421 | + 2422 | + /* Avoid time going backwards, prevent potential divide error: */ 2423 | + if (unlikely((s64)*period < 0)) 2424 | + *period = 0; 2425 | + } else { 2426 | + delta = p->se.avg.load_sum; 2427 | + *period = LOAD_AVG_MAX; 2428 | + } 2429 | + 2430 | + p->last_sum_exec_runtime = runtime; 2431 | + p->last_task_numa_placement = now; 2432 | + 2433 | + return delta; 2434 | +} 2435 | + 2436 | +/* 2437 | + * Determine the preferred nid for a task in a numa_group. This needs to 2438 | + * be done in a way that produces consistent results with group_weight, 2439 | + * otherwise workloads might not converge. 2440 | + */ 2441 | +static int preferred_group_nid(struct task_struct *p, int nid) 2442 | +{ 2443 | + nodemask_t nodes; 2444 | + int dist; 2445 | + 2446 | + /* Direct connections between all NUMA nodes. */ 2447 | + if (sched_numa_topology_type == NUMA_DIRECT) 2448 | + return nid; 2449 | + 2450 | + /* 2451 | + * On a system with glueless mesh NUMA topology, group_weight 2452 | + * scores nodes according to the number of NUMA hinting faults on 2453 | + * both the node itself, and on nearby nodes. 2454 | + */ 2455 | + if (sched_numa_topology_type == NUMA_GLUELESS_MESH) { 2456 | + unsigned long score, max_score = 0; 2457 | + int node, max_node = nid; 2458 | + 2459 | + dist = sched_max_numa_distance; 2460 | + 2461 | + for_each_online_node(node) { 2462 | + score = group_weight(p, node, dist); 2463 | + if (score > max_score) { 2464 | + max_score = score; 2465 | + max_node = node; 2466 | + } 2467 | + } 2468 | + return max_node; 2469 | + } 2470 | + 2471 | + /* 2472 | + * Finding the preferred nid in a system with NUMA backplane 2473 | + * interconnect topology is more involved. The goal is to locate 2474 | + * tasks from numa_groups near each other in the system, and 2475 | + * untangle workloads from different sides of the system. This requires 2476 | + * searching down the hierarchy of node groups, recursively searching 2477 | + * inside the highest scoring group of nodes. 
The nodemask tricks 2478 | + * keep the complexity of the search down. 2479 | + */ 2480 | + nodes = node_online_map; 2481 | + for (dist = sched_max_numa_distance; dist > LOCAL_DISTANCE; dist--) { 2482 | + unsigned long max_faults = 0; 2483 | + nodemask_t max_group = NODE_MASK_NONE; 2484 | + int a, b; 2485 | + 2486 | + /* Are there nodes at this distance from each other? */ 2487 | + if (!find_numa_distance(dist)) 2488 | + continue; 2489 | + 2490 | + for_each_node_mask(a, nodes) { 2491 | + unsigned long faults = 0; 2492 | + nodemask_t this_group; 2493 | + nodes_clear(this_group); 2494 | + 2495 | + /* Sum group's NUMA faults; includes a==b case. */ 2496 | + for_each_node_mask(b, nodes) { 2497 | + if (node_distance(a, b) < dist) { 2498 | + faults += group_faults(p, b); 2499 | + node_set(b, this_group); 2500 | + node_clear(b, nodes); 2501 | + } 2502 | + } 2503 | + 2504 | + /* Remember the top group. */ 2505 | + if (faults > max_faults) { 2506 | + max_faults = faults; 2507 | + max_group = this_group; 2508 | + /* 2509 | + * subtle: at the smallest distance there is 2510 | + * just one node left in each "group", the 2511 | + * winner is the preferred nid. 2512 | + */ 2513 | + nid = a; 2514 | + } 2515 | + } 2516 | + /* Next round, evaluate the nodes within max_group. */ 2517 | + if (!max_faults) 2518 | + break; 2519 | + nodes = max_group; 2520 | + } 2521 | + return nid; 2522 | +} 2523 | + 2524 | +static void task_numa_placement(struct task_struct *p) 2525 | +{ 2526 | + int seq, nid, max_nid = NUMA_NO_NODE; 2527 | + unsigned long max_faults = 0; 2528 | + unsigned long fault_types[2] = { 0, 0 }; 2529 | + unsigned long total_faults; 2530 | + u64 runtime, period; 2531 | + spinlock_t *group_lock = NULL; 2532 | + struct numa_group *ng; 2533 | + 2534 | + /* 2535 | + * The p->mm->numa_scan_seq field gets updated without 2536 | + * exclusive access. 
Use READ_ONCE() here to ensure 2537 | + * that the field is read in a single access: 2538 | + */ 2539 | + seq = READ_ONCE(p->mm->numa_scan_seq); 2540 | + if (p->numa_scan_seq == seq) 2541 | + return; 2542 | + p->numa_scan_seq = seq; 2543 | + p->numa_scan_period_max = task_scan_max(p); 2544 | + 2545 | + total_faults = p->numa_faults_locality[0] + 2546 | + p->numa_faults_locality[1]; 2547 | + runtime = numa_get_avg_runtime(p, &period); 2548 | + 2549 | + /* If the task is part of a group prevent parallel updates to group stats */ 2550 | + ng = deref_curr_numa_group(p); 2551 | + if (ng) { 2552 | + group_lock = &ng->lock; 2553 | + spin_lock_irq(group_lock); 2554 | + } 2555 | + 2556 | + /* Find the node with the highest number of faults */ 2557 | + for_each_online_node(nid) { 2558 | + /* Keep track of the offsets in numa_faults array */ 2559 | + int mem_idx, membuf_idx, cpu_idx, cpubuf_idx; 2560 | + unsigned long faults = 0, group_faults = 0; 2561 | + int priv; 2562 | + 2563 | + for (priv = 0; priv < NR_NUMA_HINT_FAULT_TYPES; priv++) { 2564 | + long diff, f_diff, f_weight; 2565 | + 2566 | + mem_idx = task_faults_idx(NUMA_MEM, nid, priv); 2567 | + membuf_idx = task_faults_idx(NUMA_MEMBUF, nid, priv); 2568 | + cpu_idx = task_faults_idx(NUMA_CPU, nid, priv); 2569 | + cpubuf_idx = task_faults_idx(NUMA_CPUBUF, nid, priv); 2570 | + 2571 | + /* Decay existing window, copy faults since last scan */ 2572 | + diff = p->numa_faults[membuf_idx] - p->numa_faults[mem_idx] / 2; 2573 | + fault_types[priv] += p->numa_faults[membuf_idx]; 2574 | + p->numa_faults[membuf_idx] = 0; 2575 | + 2576 | + /* 2577 | + * Normalize the faults_from, so all tasks in a group 2578 | + * count according to CPU use, instead of by the raw 2579 | + * number of faults. Tasks with little runtime have 2580 | + * little over-all impact on throughput, and thus their 2581 | + * faults are less important. 2582 | + */ 2583 | + f_weight = div64_u64(runtime << 16, period + 1); 2584 | + f_weight = (f_weight * p->numa_faults[cpubuf_idx]) / 2585 | + (total_faults + 1); 2586 | + f_diff = f_weight - p->numa_faults[cpu_idx] / 2; 2587 | + p->numa_faults[cpubuf_idx] = 0; 2588 | + 2589 | + p->numa_faults[mem_idx] += diff; 2590 | + p->numa_faults[cpu_idx] += f_diff; 2591 | + faults += p->numa_faults[mem_idx]; 2592 | + p->total_numa_faults += diff; 2593 | + if (ng) { 2594 | + /* 2595 | + * safe because we can only change our own group 2596 | + * 2597 | + * mem_idx represents the offset for a given 2598 | + * nid and priv in a specific region because it 2599 | + * is at the beginning of the numa_faults array. 
2600 | + */ 2601 | + ng->faults[mem_idx] += diff; 2602 | + ng->faults_cpu[mem_idx] += f_diff; 2603 | + ng->total_faults += diff; 2604 | + group_faults += ng->faults[mem_idx]; 2605 | + } 2606 | + } 2607 | + 2608 | + if (!ng) { 2609 | + if (faults > max_faults) { 2610 | + max_faults = faults; 2611 | + max_nid = nid; 2612 | + } 2613 | + } else if (group_faults > max_faults) { 2614 | + max_faults = group_faults; 2615 | + max_nid = nid; 2616 | + } 2617 | + } 2618 | + 2619 | + if (ng) { 2620 | + numa_group_count_active_nodes(ng); 2621 | + spin_unlock_irq(group_lock); 2622 | + max_nid = preferred_group_nid(p, max_nid); 2623 | + } 2624 | + 2625 | + if (max_faults) { 2626 | + /* Set the new preferred node */ 2627 | + if (max_nid != p->numa_preferred_nid) 2628 | + sched_setnuma(p, max_nid); 2629 | + } 2630 | + 2631 | + update_task_scan_period(p, fault_types[0], fault_types[1]); 2632 | +} 2633 | + 2634 | +static inline int get_numa_group(struct numa_group *grp) 2635 | +{ 2636 | + return refcount_inc_not_zero(&grp->refcount); 2637 | +} 2638 | + 2639 | +static inline void put_numa_group(struct numa_group *grp) 2640 | +{ 2641 | + if (refcount_dec_and_test(&grp->refcount)) 2642 | + kfree_rcu(grp, rcu); 2643 | +} 2644 | + 2645 | +static void task_numa_group(struct task_struct *p, int cpupid, int flags, 2646 | + int *priv) 2647 | +{ 2648 | + struct numa_group *grp, *my_grp; 2649 | + struct task_struct *tsk; 2650 | + bool join = false; 2651 | + int cpu = cpupid_to_cpu(cpupid); 2652 | + int i; 2653 | + 2654 | + if (unlikely(!deref_curr_numa_group(p))) { 2655 | + unsigned int size = sizeof(struct numa_group) + 2656 | + 4*nr_node_ids*sizeof(unsigned long); 2657 | + 2658 | + grp = kzalloc(size, GFP_KERNEL | __GFP_NOWARN); 2659 | + if (!grp) 2660 | + return; 2661 | + 2662 | + refcount_set(&grp->refcount, 1); 2663 | + grp->active_nodes = 1; 2664 | + grp->max_faults_cpu = 0; 2665 | + spin_lock_init(&grp->lock); 2666 | + grp->gid = p->pid; 2667 | + /* Second half of the array tracks nids where faults happen */ 2668 | + grp->faults_cpu = grp->faults + NR_NUMA_HINT_FAULT_TYPES * 2669 | + nr_node_ids; 2670 | + 2671 | + for (i = 0; i < NR_NUMA_HINT_FAULT_STATS * nr_node_ids; i++) 2672 | + grp->faults[i] = p->numa_faults[i]; 2673 | + 2674 | + grp->total_faults = p->total_numa_faults; 2675 | + 2676 | + grp->nr_tasks++; 2677 | + rcu_assign_pointer(p->numa_group, grp); 2678 | + } 2679 | + 2680 | + rcu_read_lock(); 2681 | + tsk = READ_ONCE(cpu_rq(cpu)->curr); 2682 | + 2683 | + if (!cpupid_match_pid(tsk, cpupid)) 2684 | + goto no_join; 2685 | + 2686 | + grp = rcu_dereference(tsk->numa_group); 2687 | + if (!grp) 2688 | + goto no_join; 2689 | + 2690 | + my_grp = deref_curr_numa_group(p); 2691 | + if (grp == my_grp) 2692 | + goto no_join; 2693 | + 2694 | + /* 2695 | + * Only join the other group if its bigger; if we're the bigger group, 2696 | + * the other task will join us. 2697 | + */ 2698 | + if (my_grp->nr_tasks > grp->nr_tasks) 2699 | + goto no_join; 2700 | + 2701 | + /* 2702 | + * Tie-break on the grp address. 2703 | + */ 2704 | + if (my_grp->nr_tasks == grp->nr_tasks && my_grp > grp) 2705 | + goto no_join; 2706 | + 2707 | + /* Always join threads in the same process. 
*/ 2708 | + if (tsk->mm == current->mm) 2709 | + join = true; 2710 | + 2711 | + /* Simple filter to avoid false positives due to PID collisions */ 2712 | + if (flags & TNF_SHARED) 2713 | + join = true; 2714 | + 2715 | + /* Update priv based on whether false sharing was detected */ 2716 | + *priv = !join; 2717 | + 2718 | + if (join && !get_numa_group(grp)) 2719 | + goto no_join; 2720 | + 2721 | + rcu_read_unlock(); 2722 | + 2723 | + if (!join) 2724 | + return; 2725 | + 2726 | + BUG_ON(irqs_disabled()); 2727 | + double_lock_irq(&my_grp->lock, &grp->lock); 2728 | + 2729 | + for (i = 0; i < NR_NUMA_HINT_FAULT_STATS * nr_node_ids; i++) { 2730 | + my_grp->faults[i] -= p->numa_faults[i]; 2731 | + grp->faults[i] += p->numa_faults[i]; 2732 | + } 2733 | + my_grp->total_faults -= p->total_numa_faults; 2734 | + grp->total_faults += p->total_numa_faults; 2735 | + 2736 | + my_grp->nr_tasks--; 2737 | + grp->nr_tasks++; 2738 | + 2739 | + spin_unlock(&my_grp->lock); 2740 | + spin_unlock_irq(&grp->lock); 2741 | + 2742 | + rcu_assign_pointer(p->numa_group, grp); 2743 | + 2744 | + put_numa_group(my_grp); 2745 | + return; 2746 | + 2747 | +no_join: 2748 | + rcu_read_unlock(); 2749 | + return; 2750 | +} 2751 | + 2752 | +/* 2753 | + * Get rid of NUMA statistics associated with a task (either current or dead). 2754 | + * If @final is set, the task is dead and has reached refcount zero, so we can 2755 | + * safely free all relevant data structures. Otherwise, there might be 2756 | + * concurrent reads from places like load balancing and procfs, and we should 2757 | + * reset the data back to default state without freeing ->numa_faults. 2758 | + */ 2759 | +void task_numa_free(struct task_struct *p, bool final) 2760 | +{ 2761 | + /* safe: p either is current or is being freed by current */ 2762 | + struct numa_group *grp = rcu_dereference_raw(p->numa_group); 2763 | + unsigned long *numa_faults = p->numa_faults; 2764 | + unsigned long flags; 2765 | + int i; 2766 | + 2767 | + if (!numa_faults) 2768 | + return; 2769 | + 2770 | + if (grp) { 2771 | + spin_lock_irqsave(&grp->lock, flags); 2772 | + for (i = 0; i < NR_NUMA_HINT_FAULT_STATS * nr_node_ids; i++) 2773 | + grp->faults[i] -= p->numa_faults[i]; 2774 | + grp->total_faults -= p->total_numa_faults; 2775 | + 2776 | + grp->nr_tasks--; 2777 | + spin_unlock_irqrestore(&grp->lock, flags); 2778 | + RCU_INIT_POINTER(p->numa_group, NULL); 2779 | + put_numa_group(grp); 2780 | + } 2781 | + 2782 | + if (final) { 2783 | + p->numa_faults = NULL; 2784 | + kfree(numa_faults); 2785 | + } else { 2786 | + p->total_numa_faults = 0; 2787 | + for (i = 0; i < NR_NUMA_HINT_FAULT_STATS * nr_node_ids; i++) 2788 | + numa_faults[i] = 0; 2789 | + } 2790 | +} 2791 | + 2792 | +/* 2793 | + * Got a PROT_NONE fault for a page on @node. 
2794 | + */ 2795 | +void task_numa_fault(int last_cpupid, int mem_node, int pages, int flags) 2796 | +{ 2797 | + struct task_struct *p = current; 2798 | + bool migrated = flags & TNF_MIGRATED; 2799 | + int cpu_node = task_node(current); 2800 | + int local = !!(flags & TNF_FAULT_LOCAL); 2801 | + struct numa_group *ng; 2802 | + int priv; 2803 | + 2804 | + if (!static_branch_likely(&sched_numa_balancing)) 2805 | + return; 2806 | + 2807 | + /* for example, ksmd faulting in a user's mm */ 2808 | + if (!p->mm) 2809 | + return; 2810 | + 2811 | + /* Allocate buffer to track faults on a per-node basis */ 2812 | + if (unlikely(!p->numa_faults)) { 2813 | + int size = sizeof(*p->numa_faults) * 2814 | + NR_NUMA_HINT_FAULT_BUCKETS * nr_node_ids; 2815 | + 2816 | + p->numa_faults = kzalloc(size, GFP_KERNEL|__GFP_NOWARN); 2817 | + if (!p->numa_faults) 2818 | + return; 2819 | + 2820 | + p->total_numa_faults = 0; 2821 | + memset(p->numa_faults_locality, 0, sizeof(p->numa_faults_locality)); 2822 | + } 2823 | + 2824 | + /* 2825 | + * First accesses are treated as private, otherwise consider accesses 2826 | + * to be private if the accessing pid has not changed 2827 | + */ 2828 | + if (unlikely(last_cpupid == (-1 & LAST_CPUPID_MASK))) { 2829 | + priv = 1; 2830 | + } else { 2831 | + priv = cpupid_match_pid(p, last_cpupid); 2832 | + if (!priv && !(flags & TNF_NO_GROUP)) 2833 | + task_numa_group(p, last_cpupid, flags, &priv); 2834 | + } 2835 | + 2836 | + /* 2837 | + * If a workload spans multiple NUMA nodes, a shared fault that 2838 | + * occurs wholly within the set of nodes that the workload is 2839 | + * actively using should be counted as local. This allows the 2840 | + * scan rate to slow down when a workload has settled down. 2841 | + */ 2842 | + ng = deref_curr_numa_group(p); 2843 | + if (!priv && !local && ng && ng->active_nodes > 1 && 2844 | + numa_is_active_node(cpu_node, ng) && 2845 | + numa_is_active_node(mem_node, ng)) 2846 | + local = 1; 2847 | + 2848 | + /* 2849 | + * Retry to migrate task to preferred node periodically, in case it 2850 | + * previously failed, or the scheduler moved us. 2851 | + */ 2852 | + if (time_after(jiffies, p->numa_migrate_retry)) { 2853 | + task_numa_placement(p); 2854 | + numa_migrate_preferred(p); 2855 | + } 2856 | + 2857 | + if (migrated) 2858 | + p->numa_pages_migrated += pages; 2859 | + if (flags & TNF_MIGRATE_FAIL) 2860 | + p->numa_faults_locality[2] += pages; 2861 | + 2862 | + p->numa_faults[task_faults_idx(NUMA_MEMBUF, mem_node, priv)] += pages; 2863 | + p->numa_faults[task_faults_idx(NUMA_CPUBUF, cpu_node, priv)] += pages; 2864 | + p->numa_faults_locality[local] += pages; 2865 | +} 2866 | + 2867 | +static void reset_ptenuma_scan(struct task_struct *p) 2868 | +{ 2869 | + /* 2870 | + * We only did a read acquisition of the mmap sem, so 2871 | + * p->mm->numa_scan_seq is written to without exclusive access 2872 | + * and the update is not guaranteed to be atomic. That's not 2873 | + * much of an issue though, since this is just used for 2874 | + * statistical sampling. Use READ_ONCE/WRITE_ONCE, which are not 2875 | + * expensive, to avoid any form of compiler optimizations: 2876 | + */ 2877 | + WRITE_ONCE(p->mm->numa_scan_seq, READ_ONCE(p->mm->numa_scan_seq) + 1); 2878 | + p->mm->numa_scan_offset = 0; 2879 | +} 2880 | + 2881 | +/* 2882 | + * The expensive part of numa migration is done from task_work context. 2883 | + * Triggered from task_tick_numa(). 
2884 | + */ 2885 | +static void task_numa_work(struct callback_head *work) 2886 | +{ 2887 | + unsigned long migrate, next_scan, now = jiffies; 2888 | + struct task_struct *p = current; 2889 | + struct mm_struct *mm = p->mm; 2890 | + u64 runtime = p->se.sum_exec_runtime; 2891 | + struct vm_area_struct *vma; 2892 | + unsigned long start, end; 2893 | + unsigned long nr_pte_updates = 0; 2894 | + long pages, virtpages; 2895 | + 2896 | + SCHED_WARN_ON(p != container_of(work, struct task_struct, numa_work)); 2897 | + 2898 | + work->next = work; 2899 | + /* 2900 | + * Who cares about NUMA placement when they're dying. 2901 | + * 2902 | + * NOTE: make sure not to dereference p->mm before this check, 2903 | + * exit_task_work() happens _after_ exit_mm() so we could be called 2904 | + * without p->mm even though we still had it when we enqueued this 2905 | + * work. 2906 | + */ 2907 | + if (p->flags & PF_EXITING) 2908 | + return; 2909 | + 2910 | + if (!mm->numa_next_scan) { 2911 | + mm->numa_next_scan = now + 2912 | + msecs_to_jiffies(sysctl_numa_balancing_scan_delay); 2913 | + } 2914 | + 2915 | + /* 2916 | + * Enforce maximal scan/migration frequency.. 2917 | + */ 2918 | + migrate = mm->numa_next_scan; 2919 | + if (time_before(now, migrate)) 2920 | + return; 2921 | + 2922 | + if (p->numa_scan_period == 0) { 2923 | + p->numa_scan_period_max = task_scan_max(p); 2924 | + p->numa_scan_period = task_scan_start(p); 2925 | + } 2926 | + 2927 | + next_scan = now + msecs_to_jiffies(p->numa_scan_period); 2928 | + if (cmpxchg(&mm->numa_next_scan, migrate, next_scan) != migrate) 2929 | + return; 2930 | + 2931 | + /* 2932 | + * Delay this task enough that another task of this mm will likely win 2933 | + * the next time around. 2934 | + */ 2935 | + p->node_stamp += 2 * TICK_NSEC; 2936 | + 2937 | + start = mm->numa_scan_offset; 2938 | + pages = sysctl_numa_balancing_scan_size; 2939 | + pages <<= 20 - PAGE_SHIFT; /* MB in pages */ 2940 | + virtpages = pages * 8; /* Scan up to this much virtual space */ 2941 | + if (!pages) 2942 | + return; 2943 | + 2944 | + 2945 | + if (!mmap_read_trylock(mm)) 2946 | + return; 2947 | + vma = find_vma(mm, start); 2948 | + if (!vma) { 2949 | + reset_ptenuma_scan(p); 2950 | + start = 0; 2951 | + vma = mm->mmap; 2952 | + } 2953 | + for (; vma; vma = vma->vm_next) { 2954 | + if (!vma_migratable(vma) || !vma_policy_mof(vma) || 2955 | + is_vm_hugetlb_page(vma) || (vma->vm_flags & VM_MIXEDMAP)) { 2956 | + continue; 2957 | + } 2958 | + 2959 | + /* 2960 | + * Shared library pages mapped by multiple processes are not 2961 | + * migrated as it is expected they are cache replicated. Avoid 2962 | + * hinting faults in read-only file-backed mappings or the vdso 2963 | + * as migrating the pages will be of marginal benefit. 2964 | + */ 2965 | + if (!vma->vm_mm || 2966 | + (vma->vm_file && (vma->vm_flags & (VM_READ|VM_WRITE)) == (VM_READ))) 2967 | + continue; 2968 | + 2969 | + /* 2970 | + * Skip inaccessible VMAs to avoid any confusion between 2971 | + * PROT_NONE and NUMA hinting ptes 2972 | + */ 2973 | + if (!vma_is_accessible(vma)) 2974 | + continue; 2975 | + 2976 | + do { 2977 | + start = max(start, vma->vm_start); 2978 | + end = ALIGN(start + (pages << PAGE_SHIFT), HPAGE_SIZE); 2979 | + end = min(end, vma->vm_end); 2980 | + nr_pte_updates = change_prot_numa(vma, start, end); 2981 | + 2982 | + /* 2983 | + * Try to scan sysctl_numa_balancing_size worth of 2984 | + * hpages that have at least one present PTE that 2985 | + * is not already pte-numa. 
If the VMA contains 2986 | + * areas that are unused or already full of prot_numa 2987 | + * PTEs, scan up to virtpages, to skip through those 2988 | + * areas faster. 2989 | + */ 2990 | + if (nr_pte_updates) 2991 | + pages -= (end - start) >> PAGE_SHIFT; 2992 | + virtpages -= (end - start) >> PAGE_SHIFT; 2993 | + 2994 | + start = end; 2995 | + if (pages <= 0 || virtpages <= 0) 2996 | + goto out; 2997 | + 2998 | + cond_resched(); 2999 | + } while (end != vma->vm_end); 3000 | + } 3001 | + 3002 | +out: 3003 | + /* 3004 | + * It is possible to reach the end of the VMA list but the last few 3005 | + * VMAs are not guaranteed to the vma_migratable. If they are not, we 3006 | + * would find the !migratable VMA on the next scan but not reset the 3007 | + * scanner to the start so check it now. 3008 | + */ 3009 | + if (vma) 3010 | + mm->numa_scan_offset = start; 3011 | + else 3012 | + reset_ptenuma_scan(p); 3013 | + mmap_read_unlock(mm); 3014 | + 3015 | + /* 3016 | + * Make sure tasks use at least 32x as much time to run other code 3017 | + * than they used here, to limit NUMA PTE scanning overhead to 3% max. 3018 | + * Usually update_task_scan_period slows down scanning enough; on an 3019 | + * overloaded system we need to limit overhead on a per task basis. 3020 | + */ 3021 | + if (unlikely(p->se.sum_exec_runtime != runtime)) { 3022 | + u64 diff = p->se.sum_exec_runtime - runtime; 3023 | + p->node_stamp += 32 * diff; 3024 | + } 3025 | +} 3026 | + 3027 | +void init_numa_balancing(unsigned long clone_flags, struct task_struct *p) 3028 | +{ 3029 | + int mm_users = 0; 3030 | + struct mm_struct *mm = p->mm; 3031 | + 3032 | + if (mm) { 3033 | + mm_users = atomic_read(&mm->mm_users); 3034 | + if (mm_users == 1) { 3035 | + mm->numa_next_scan = jiffies + msecs_to_jiffies(sysctl_numa_balancing_scan_delay); 3036 | + mm->numa_scan_seq = 0; 3037 | + } 3038 | + } 3039 | + p->node_stamp = 0; 3040 | + p->numa_scan_seq = mm ? mm->numa_scan_seq : 0; 3041 | + p->numa_scan_period = sysctl_numa_balancing_scan_delay; 3042 | + /* Protect against double add, see task_tick_numa and task_numa_work */ 3043 | + p->numa_work.next = &p->numa_work; 3044 | + p->numa_faults = NULL; 3045 | + RCU_INIT_POINTER(p->numa_group, NULL); 3046 | + p->last_task_numa_placement = 0; 3047 | + p->last_sum_exec_runtime = 0; 3048 | + 3049 | + init_task_work(&p->numa_work, task_numa_work); 3050 | + 3051 | + /* New address space, reset the preferred nid */ 3052 | + if (!(clone_flags & CLONE_VM)) { 3053 | + p->numa_preferred_nid = NUMA_NO_NODE; 3054 | + return; 3055 | + } 3056 | + 3057 | + /* 3058 | + * New thread, keep existing numa_preferred_nid which should be copied 3059 | + * already by arch_dup_task_struct but stagger when scans start. 3060 | + */ 3061 | + if (mm) { 3062 | + unsigned int delay; 3063 | + 3064 | + delay = min_t(unsigned int, task_scan_max(current), 3065 | + current->numa_scan_period * mm_users * NSEC_PER_MSEC); 3066 | + delay += 2 * TICK_NSEC; 3067 | + p->node_stamp = delay; 3068 | + } 3069 | +} 3070 | + 3071 | +static void task_tick_numa(struct rq *rq, struct task_struct *curr) 3072 | +{ 3073 | + struct callback_head *work = &curr->numa_work; 3074 | + u64 period, now; 3075 | + 3076 | + /* 3077 | + * We don't care about NUMA placement if we don't have memory. 
3078 | + */ 3079 | + if ((curr->flags & (PF_EXITING | PF_KTHREAD)) || work->next != work) 3080 | + return; 3081 | + 3082 | + /* 3083 | + * Using runtime rather than walltime has the dual advantage that 3084 | + * we (mostly) drive the selection from busy threads and that the 3085 | + * task needs to have done some actual work before we bother with 3086 | + * NUMA placement. 3087 | + */ 3088 | + now = curr->se.sum_exec_runtime; 3089 | + period = (u64)curr->numa_scan_period * NSEC_PER_MSEC; 3090 | + 3091 | + if (now > curr->node_stamp + period) { 3092 | + if (!curr->node_stamp) 3093 | + curr->numa_scan_period = task_scan_start(curr); 3094 | + curr->node_stamp += period; 3095 | + 3096 | + if (!time_before(jiffies, curr->mm->numa_next_scan)) 3097 | + task_work_add(curr, work, TWA_RESUME); 3098 | + } 3099 | +} 3100 | +#else 3101 | +static void account_numa_enqueue(struct rq *rq, struct task_struct *p) {} 3102 | +static inline void account_numa_dequeue(struct rq *rq, struct task_struct *p) {} 3103 | +static inline void update_scan_period(struct task_struct *p, int new_cpu) {} 3104 | +static void task_tick_numa(struct rq *rq, struct task_struct *curr) {} 3105 | +#endif /** CONFIG_NUMA_BALANCING */ 3106 | + 3107 | diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h 3108 | index d53d19770866..9ac800e858fc 100644 3109 | --- a/kernel/sched/sched.h 3110 | +++ b/kernel/sched/sched.h 3111 | @@ -545,9 +545,13 @@ struct cfs_rq { 3112 | * It is set to NULL otherwise (i.e when none are currently running). 3113 | */ 3114 | struct sched_entity *curr; 3115 | +#ifdef CONFIG_BS_SCHED 3116 | + struct bs_node *head; 3117 | +#else 3118 | struct sched_entity *next; 3119 | struct sched_entity *last; 3120 | struct sched_entity *skip; 3121 | +#endif /** CONFIG_BS_SCHED */ 3122 | 3123 | #ifdef CONFIG_SCHED_DEBUG 3124 | unsigned int nr_spread_over; 3125 | diff --git a/kernel/time/Kconfig b/kernel/time/Kconfig 3126 | index 04bfd62f5e5c..fac3f528b432 100644 3127 | --- a/kernel/time/Kconfig 3128 | +++ b/kernel/time/Kconfig 3129 | @@ -88,7 +88,7 @@ config NO_HZ_COMMON 3130 | 3131 | choice 3132 | prompt "Timer tick handling" 3133 | - default NO_HZ_IDLE if NO_HZ 3134 | + default HZ_PERIODIC if BS_SCHED 3135 | 3136 | config HZ_PERIODIC 3137 | bool "Periodic timer ticks (constant rate, no dynticks)" 3138 | @@ -98,6 +98,7 @@ config HZ_PERIODIC 3139 | 3140 | config NO_HZ_IDLE 3141 | bool "Idle dynticks system (tickless idle)" 3142 | + depends on !BS_SCHED 3143 | select NO_HZ_COMMON 3144 | help 3145 | This option enables a tickless idle system: timer interrupts 3146 | @@ -112,6 +113,7 @@ config NO_HZ_FULL 3147 | # We need at least one periodic CPU for timekeeping 3148 | depends on SMP 3149 | depends on HAVE_CONTEXT_TRACKING 3150 | + depends on !BS_SCHED 3151 | # VIRT_CPU_ACCOUNTING_GEN dependency 3152 | depends on HAVE_VIRT_CPU_ACCOUNTING_GEN 3153 | select NO_HZ_COMMON 3154 | @@ -168,6 +170,8 @@ config CONTEXT_TRACKING_FORCE 3155 | 3156 | config NO_HZ 3157 | bool "Old Idle dynticks config" 3158 | + default n 3159 | + depends on !BS_SCHED 3160 | help 3161 | This is the old config entry that enables dynticks idle. 
3162 | We keep it around for a little while to enforce backward 3163 | diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug 3164 | index ffd22e499997..c10894554e7b 100644 3165 | --- a/lib/Kconfig.debug 3166 | +++ b/lib/Kconfig.debug 3167 | @@ -583,6 +583,7 @@ endmenu 3168 | 3169 | config DEBUG_KERNEL 3170 | bool "Kernel debugging" 3171 | + depends on !BS_SCHED 3172 | help 3173 | Say Y here if you are developing drivers or trying to debug and 3174 | identify kernel problems. 3175 | -------------------------------------------------------------------------------- /baby-rr-5.14.patch: -------------------------------------------------------------------------------- 1 | diff --git a/include/linux/sched.h b/include/linux/sched.h 2 | index f6935787e7e8..8d2f939d40d4 100644 3 | --- a/include/linux/sched.h 4 | +++ b/include/linux/sched.h 5 | @@ -462,6 +462,13 @@ struct sched_statistics { 6 | #endif 7 | }; 8 | 9 | +#ifdef CONFIG_BS_SCHED 10 | +struct bs_node { 11 | + struct bs_node* next; 12 | + struct bs_node* prev; 13 | +}; 14 | +#endif 15 | + 16 | struct sched_entity { 17 | /* For load-balancing: */ 18 | struct load_weight load; 19 | @@ -469,6 +476,10 @@ struct sched_entity { 20 | struct list_head group_node; 21 | unsigned int on_rq; 22 | 23 | +#ifdef CONFIG_BS_SCHED 24 | + struct bs_node bs_node; 25 | +#endif 26 | + 27 | u64 exec_start; 28 | u64 sum_exec_runtime; 29 | u64 vruntime; 30 | diff --git a/init/Kconfig b/init/Kconfig 31 | index 55f9f7738ebb..a8b1cee206dd 100644 32 | --- a/init/Kconfig 33 | +++ b/init/Kconfig 34 | @@ -105,6 +105,14 @@ config THREAD_INFO_IN_TASK 35 | One subtle change that will be needed is to use try_get_task_stack() 36 | and put_task_stack() in save_thread_stack_tsk() and get_wchan(). 37 | 38 | +config BS_SCHED 39 | + bool "Baby Scheduler" 40 | + default y 41 | + help 42 | + It is a Round Robin (RR) version of Baby Scheduler. It is even simpler 43 | + than normal Baby Scheduler where no usage of vruntime and no tasks 44 | + priority considerations. It is just a basic RR. 
45 | + 46 | menu "General setup" 47 | 48 | config BROKEN 49 | diff --git a/kernel/sched/Makefile b/kernel/sched/Makefile 50 | index 978fcfca5871..464b134de739 100644 51 | --- a/kernel/sched/Makefile 52 | +++ b/kernel/sched/Makefile 53 | @@ -23,7 +23,7 @@ CFLAGS_core.o := $(PROFILING) -fno-omit-frame-pointer 54 | endif 55 | 56 | obj-y += core.o loadavg.o clock.o cputime.o 57 | -obj-y += idle.o fair.o rt.o deadline.o 58 | +obj-y += idle.o bs.o rt.o deadline.o 59 | obj-y += wait.o wait_bit.o swait.o completion.o 60 | 61 | obj-$(CONFIG_SMP) += cpupri.o cpudeadline.o topology.o stop_task.o pelt.o 62 | diff --git a/kernel/sched/bs.c b/kernel/sched/bs.c 63 | new file mode 100644 64 | index 000000000000..402fcc80518f 65 | --- /dev/null 66 | +++ b/kernel/sched/bs.c 67 | @@ -0,0 +1,696 @@ 68 | +// SPDX-License-Identifier: GPL-2.0 69 | +/* 70 | + * Baby Scheduler (BS) Class (SCHED_NORMAL/SCHED_BATCH) 71 | + * 72 | + * Copyright (C) 2021, Hamad Al Marri 73 | + */ 74 | +#include "sched.h" 75 | +#include "pelt.h" 76 | +#include "fair_numa.h" 77 | +#include "bs.h" 78 | + 79 | +static void update_curr(struct cfs_rq *cfs_rq) 80 | +{ 81 | + struct sched_entity *curr = cfs_rq->curr; 82 | + u64 now = rq_clock_task(rq_of(cfs_rq)); 83 | + u64 delta_exec; 84 | + 85 | + if (unlikely(!curr)) 86 | + return; 87 | + 88 | + delta_exec = now - curr->exec_start; 89 | + if (unlikely((s64)delta_exec <= 0)) 90 | + return; 91 | + 92 | + curr->exec_start = now; 93 | + curr->sum_exec_runtime += delta_exec; 94 | +} 95 | + 96 | +static void update_curr_fair(struct rq *rq) 97 | +{ 98 | + update_curr(cfs_rq_of(&rq->curr->se)); 99 | +} 100 | + 101 | +static void __enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se) 102 | +{ 103 | + struct bs_node *bsn = &se->bs_node; 104 | + struct bs_node *prev; 105 | + 106 | + bsn->next = bsn->prev = NULL; 107 | + 108 | + // if empty 109 | + if (!cfs_rq->head) { 110 | + cfs_rq->head = bsn; 111 | + cfs_rq->cursor = bsn; 112 | + } 113 | + // if cursor == head 114 | + else if (cfs_rq->cursor == cfs_rq->head) { 115 | + bsn->next = cfs_rq->head; 116 | + cfs_rq->head->prev = bsn; 117 | + cfs_rq->head = bsn; 118 | + } 119 | + // if cursor != head 120 | + else { 121 | + prev = cfs_rq->cursor->prev; 122 | + 123 | + bsn->next = cfs_rq->cursor; 124 | + cfs_rq->cursor->prev = bsn; 125 | + 126 | + prev->next = bsn; 127 | + bsn->prev = prev; 128 | + } 129 | +} 130 | + 131 | +static inline void rotate_cursor(struct cfs_rq *cfs_rq) 132 | +{ 133 | + cfs_rq->cursor = cfs_rq->cursor->next; 134 | + 135 | + if (!cfs_rq->cursor) 136 | + cfs_rq->cursor = cfs_rq->head; 137 | +} 138 | + 139 | +static void __dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se) 140 | +{ 141 | + struct bs_node *bsn = &se->bs_node; 142 | + struct bs_node *prev, *next; 143 | + 144 | + // if only one se in rq 145 | + if (cfs_rq->head->next == NULL) { 146 | + cfs_rq->head = NULL; 147 | + cfs_rq->cursor = NULL; 148 | + } else if (bsn == cfs_rq->head) { 149 | + // if it is the head 150 | + cfs_rq->head = cfs_rq->head->next; 151 | + cfs_rq->head->prev = NULL; 152 | + 153 | + if (bsn == cfs_rq->cursor) 154 | + rotate_cursor(cfs_rq); 155 | + } else { 156 | + // if in the middle 157 | + if (bsn == cfs_rq->cursor) 158 | + rotate_cursor(cfs_rq); 159 | + 160 | + prev = bsn->prev; 161 | + next = bsn->next; 162 | + 163 | + prev->next = next; 164 | + if (next) 165 | + next->prev = prev; 166 | + } 167 | +} 168 | + 169 | +static void 170 | +enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags) 171 | +{ 172 | + bool 
curr = cfs_rq->curr == se; 173 | + 174 | + update_curr(cfs_rq); 175 | + account_entity_enqueue(cfs_rq, se); 176 | + 177 | + if (!curr) 178 | + __enqueue_entity(cfs_rq, se); 179 | + 180 | + se->on_rq = 1; 181 | +} 182 | + 183 | +static void 184 | +dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags) 185 | +{ 186 | + update_curr(cfs_rq); 187 | + 188 | + if (se != cfs_rq->curr) 189 | + __dequeue_entity(cfs_rq, se); 190 | + 191 | + se->on_rq = 0; 192 | + account_entity_dequeue(cfs_rq, se); 193 | +} 194 | + 195 | +static void 196 | +enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags) 197 | +{ 198 | + struct sched_entity *se = &p->se; 199 | + struct cfs_rq *cfs_rq = cfs_rq_of(se); 200 | + int idle_h_nr_running = task_has_idle_policy(p); 201 | + 202 | + if (!se->on_rq) { 203 | + enqueue_entity(cfs_rq, se, flags); 204 | + cfs_rq->h_nr_running++; 205 | + cfs_rq->idle_h_nr_running += idle_h_nr_running; 206 | + } 207 | + 208 | + add_nr_running(rq, 1); 209 | +} 210 | + 211 | +static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags) 212 | +{ 213 | + struct sched_entity *se = &p->se; 214 | + struct cfs_rq *cfs_rq = cfs_rq_of(se); 215 | + int idle_h_nr_running = task_has_idle_policy(p); 216 | + 217 | + dequeue_entity(cfs_rq, se, flags); 218 | + 219 | + cfs_rq->h_nr_running--; 220 | + cfs_rq->idle_h_nr_running -= idle_h_nr_running; 221 | + 222 | + sub_nr_running(rq, 1); 223 | +} 224 | + 225 | +static void yield_task_fair(struct rq *rq) 226 | +{ 227 | + struct task_struct *curr = rq->curr; 228 | + struct cfs_rq *cfs_rq = task_cfs_rq(curr); 229 | + 230 | + /* 231 | + * Are we the only task in the tree? 232 | + */ 233 | + if (unlikely(rq->nr_running == 1)) 234 | + return; 235 | + 236 | + if (curr->policy != SCHED_BATCH) { 237 | + update_rq_clock(rq); 238 | + /* 239 | + * Update run-time statistics of the 'current'. 240 | + */ 241 | + update_curr(cfs_rq); 242 | + /* 243 | + * Tell update_rq_clock() that we've just updated, 244 | + * so we don't do microscopic update in schedule() 245 | + * and double the fastpath cost. 246 | + */ 247 | + rq_clock_skip_update(rq); 248 | + } 249 | +} 250 | + 251 | +static bool yield_to_task_fair(struct rq *rq, struct task_struct *p) 252 | +{ 253 | + yield_task_fair(rq); 254 | + return true; 255 | +} 256 | + 257 | +static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_flags) 258 | +{ 259 | + struct task_struct *curr = rq->curr; 260 | + struct sched_entity *se = &curr->se, *pse = &p->se; 261 | + 262 | + if (unlikely(se == pse)) 263 | + return; 264 | + 265 | + if (test_tsk_need_resched(curr)) 266 | + return; 267 | + 268 | + /* Idle tasks are by definition preempted by non-idle tasks. 
*/ 269 | + if (unlikely(task_has_idle_policy(curr)) && 270 | + likely(!task_has_idle_policy(p))) 271 | + goto preempt; 272 | + 273 | + /* 274 | + * Batch and idle tasks do not preempt non-idle tasks (their preemption 275 | + * is driven by the tick): 276 | + */ 277 | + if (unlikely(p->policy != SCHED_NORMAL) || !sched_feat(WAKEUP_PREEMPTION)) 278 | + return; 279 | + 280 | + update_curr(cfs_rq_of(se)); 281 | + 282 | +preempt: 283 | + resched_curr(rq); 284 | +} 285 | + 286 | +static void 287 | +set_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *se) 288 | +{ 289 | + if (se->on_rq) 290 | + __dequeue_entity(cfs_rq, se); 291 | + 292 | + se->exec_start = rq_clock_task(rq_of(cfs_rq)); 293 | + cfs_rq->curr = se; 294 | + se->prev_sum_exec_runtime = se->sum_exec_runtime; 295 | +} 296 | + 297 | +static struct sched_entity * 298 | +pick_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr) 299 | +{ 300 | + struct bs_node *bsn = cfs_rq->cursor; 301 | + 302 | + if (!bsn) 303 | + bsn = cfs_rq->cursor = cfs_rq->head; 304 | + 305 | + // update cursor 306 | + rotate_cursor(cfs_rq); 307 | + 308 | + return se_of(bsn); 309 | +} 310 | + 311 | +struct task_struct * 312 | +pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf) 313 | +{ 314 | + struct cfs_rq *cfs_rq = &rq->cfs; 315 | + struct sched_entity *se; 316 | + struct task_struct *p; 317 | + int new_tasks; 318 | + 319 | +again: 320 | + if (!sched_fair_runnable(rq)) 321 | + goto idle; 322 | + 323 | + if (prev) 324 | + put_prev_task(rq, prev); 325 | + 326 | + se = pick_next_entity(cfs_rq, NULL); 327 | + set_next_entity(cfs_rq, se); 328 | + 329 | + p = task_of(se); 330 | + 331 | +done: __maybe_unused; 332 | +#ifdef CONFIG_SMP 333 | + /* 334 | + * Move the next running task to the front of 335 | + * the list, so our cfs_tasks list becomes MRU 336 | + * one. 337 | + */ 338 | + list_move(&p->se.group_node, &rq->cfs_tasks); 339 | +#endif 340 | + 341 | + return p; 342 | + 343 | +idle: 344 | + if (!rf) 345 | + return NULL; 346 | + 347 | + new_tasks = newidle_balance(rq, rf); 348 | + 349 | + /* 350 | + * Because newidle_balance() releases (and re-acquires) rq->lock, it is 351 | + * possible for any higher priority task to appear. In that case we 352 | + * must re-start the pick_next_entity() loop. 
353 | + */ 354 | + if (new_tasks < 0) 355 | + return RETRY_TASK; 356 | + 357 | + if (new_tasks > 0) 358 | + goto again; 359 | + 360 | + /* 361 | + * rq is about to be idle, check if we need to update the 362 | + * lost_idle_time of clock_pelt 363 | + */ 364 | + update_idle_rq_clock_pelt(rq); 365 | + 366 | + return NULL; 367 | +} 368 | + 369 | +static struct task_struct *__pick_next_task_fair(struct rq *rq) 370 | +{ 371 | + return pick_next_task_fair(rq, NULL, NULL); 372 | +} 373 | + 374 | +#ifdef CONFIG_SMP 375 | +static struct task_struct *pick_task_fair(struct rq *rq) 376 | +{ 377 | + struct sched_entity *se; 378 | + struct cfs_rq *cfs_rq = &rq->cfs; 379 | + struct sched_entity *curr = cfs_rq->curr; 380 | + 381 | + if (!cfs_rq->nr_running) 382 | + return NULL; 383 | + 384 | + /* When we pick for a remote RQ, we'll not have done put_prev_entity() */ 385 | + if (curr) { 386 | + if (curr->on_rq) 387 | + update_curr(cfs_rq); 388 | + else 389 | + curr = NULL; 390 | + } 391 | + 392 | + se = pick_next_entity(cfs_rq, curr); 393 | + 394 | + return task_of(se); 395 | +} 396 | +#endif 397 | + 398 | +static void put_prev_entity(struct cfs_rq *cfs_rq, struct sched_entity *prev) 399 | +{ 400 | + /* 401 | + * If still on the runqueue then deactivate_task() 402 | + * was not called and update_curr() has to be done: 403 | + */ 404 | + if (prev->on_rq) 405 | + update_curr(cfs_rq); 406 | + 407 | + if (prev->on_rq) 408 | + __enqueue_entity(cfs_rq, prev); 409 | + 410 | + cfs_rq->curr = NULL; 411 | +} 412 | + 413 | +static void put_prev_task_fair(struct rq *rq, struct task_struct *prev) 414 | +{ 415 | + struct sched_entity *se = &prev->se; 416 | + 417 | + put_prev_entity(cfs_rq_of(se), se); 418 | +} 419 | + 420 | +static void set_next_task_fair(struct rq *rq, struct task_struct *p, bool first) 421 | +{ 422 | + struct sched_entity *se = &p->se; 423 | + struct cfs_rq *cfs_rq = cfs_rq_of(se); 424 | + 425 | +#ifdef CONFIG_SMP 426 | + if (task_on_rq_queued(p)) { 427 | + /* 428 | + * Move the next running task to the front of the list, so our 429 | + * cfs_tasks list becomes MRU one. 
430 | + */ 431 | + list_move(&se->group_node, &rq->cfs_tasks); 432 | + } 433 | +#endif 434 | + 435 | + set_next_entity(cfs_rq, se); 436 | +} 437 | + 438 | +static void 439 | +check_preempt_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr) 440 | +{ 441 | + resched_curr(rq_of(cfs_rq)); 442 | +} 443 | + 444 | +static void 445 | +entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr, int queued) 446 | +{ 447 | + update_curr(cfs_rq); 448 | + 449 | + if (cfs_rq->nr_running > 1) 450 | + check_preempt_tick(cfs_rq, curr); 451 | +} 452 | + 453 | +#ifdef CONFIG_SMP 454 | +static int 455 | +balance_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf) 456 | +{ 457 | + if (rq->nr_running) 458 | + return 1; 459 | + 460 | + return newidle_balance(rq, rf) != 0; 461 | +} 462 | + 463 | +static int 464 | +select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags) 465 | +{ 466 | + struct rq *rq = cpu_rq(prev_cpu); 467 | + unsigned int min_this = rq->nr_running; 468 | + unsigned int min = rq->nr_running; 469 | + int cpu, new_cpu = prev_cpu; 470 | + 471 | + for_each_online_cpu(cpu) { 472 | + if (cpu_rq(cpu)->nr_running < min) { 473 | + new_cpu = cpu; 474 | + min = cpu_rq(cpu)->nr_running; 475 | + } 476 | + } 477 | + 478 | + if (min == min_this) 479 | + return prev_cpu; 480 | + 481 | + return new_cpu; 482 | +} 483 | + 484 | +static int 485 | +can_migrate_task(struct task_struct *p, int dst_cpu, struct rq *src_rq) 486 | +{ 487 | + if (task_running(src_rq, p)) 488 | + return 0; 489 | + 490 | + /* Disregard pcpu kthreads; they are where they need to be. */ 491 | + if (kthread_is_per_cpu(p)) 492 | + return 0; 493 | + 494 | + if (!cpumask_test_cpu(dst_cpu, p->cpus_ptr)) 495 | + return 0; 496 | + 497 | + return 1; 498 | +} 499 | + 500 | +static void pull_from(struct rq *this_rq, 501 | + struct rq *src_rq, 502 | + struct rq_flags *src_rf, 503 | + struct task_struct *p) 504 | +{ 505 | + struct rq_flags rf; 506 | + 507 | + // detach task 508 | + deactivate_task(src_rq, p, DEQUEUE_NOCLOCK); 509 | + set_task_cpu(p, cpu_of(this_rq)); 510 | + 511 | + // unlock src rq 512 | + rq_unlock(src_rq, src_rf); 513 | + 514 | + // lock this rq 515 | + rq_lock(this_rq, &rf); 516 | + update_rq_clock(this_rq); 517 | + 518 | + activate_task(this_rq, p, ENQUEUE_NOCLOCK); 519 | + check_preempt_curr(this_rq, p, 0); 520 | + 521 | + // unlock this rq 522 | + rq_unlock(this_rq, &rf); 523 | + 524 | + local_irq_restore(src_rf->flags); 525 | +} 526 | + 527 | +static int move_task(struct rq *this_rq, struct rq *src_rq, 528 | + struct rq_flags *src_rf) 529 | +{ 530 | + struct cfs_rq *src_cfs_rq = &src_rq->cfs; 531 | + struct task_struct *p; 532 | + struct bs_node *bsn = src_cfs_rq->head; 533 | + int moved = 0; 534 | + 535 | + while (bsn) { 536 | + p = task_of(se_of(bsn)); 537 | + if (can_migrate_task(p, cpu_of(this_rq), src_rq)) { 538 | + pull_from(this_rq, src_rq, src_rf, p); 539 | + moved = 1; 540 | + break; 541 | + } 542 | + 543 | + bsn = bsn->next; 544 | + } 545 | + 546 | + if (!moved) { 547 | + rq_unlock(src_rq, src_rf); 548 | + local_irq_restore(src_rf->flags); 549 | + } 550 | + 551 | + return moved; 552 | +} 553 | + 554 | +static int newidle_balance(struct rq *this_rq, struct rq_flags *rf) 555 | +{ 556 | + int this_cpu = this_rq->cpu; 557 | + struct rq *src_rq; 558 | + int src_cpu = -1, cpu; 559 | + int pulled_task = 0; 560 | + unsigned int max = 0; 561 | + struct rq_flags src_rf; 562 | + 563 | + /* 564 | + * We must set idle_stamp _before_ calling idle_balance(), such that we 565 | + * measure the 
duration of idle_balance() as idle time. 566 | + */ 567 | + this_rq->idle_stamp = rq_clock(this_rq); 568 | + 569 | + /* 570 | + * Do not pull tasks towards !active CPUs... 571 | + */ 572 | + if (!cpu_active(this_cpu)) 573 | + return 0; 574 | + 575 | + rq_unpin_lock(this_rq, rf); 576 | + raw_spin_unlock(&this_rq->__lock); 577 | + 578 | + for_each_online_cpu(cpu) { 579 | + /* 580 | + * Stop searching for tasks to pull if there are 581 | + * now runnable tasks on this rq. 582 | + */ 583 | + if (this_rq->nr_running > 0) 584 | + goto out; 585 | + 586 | + if (cpu == this_cpu) 587 | + continue; 588 | + 589 | + src_rq = cpu_rq(cpu); 590 | + 591 | + if (src_rq->nr_running < 2) 592 | + continue; 593 | + 594 | + if (src_rq->nr_running > max) { 595 | + max = src_rq->nr_running; 596 | + src_cpu = cpu; 597 | + } 598 | + } 599 | + 600 | + if (src_cpu != -1) { 601 | + src_rq = cpu_rq(src_cpu); 602 | + 603 | + rq_lock_irqsave(src_rq, &src_rf); 604 | + update_rq_clock(src_rq); 605 | + 606 | + if (src_rq->nr_running < 2) { 607 | + rq_unlock(src_rq, &src_rf); 608 | + local_irq_restore(src_rf.flags); 609 | + } else { 610 | + pulled_task = move_task(this_rq, src_rq, &src_rf); 611 | + } 612 | + } 613 | + 614 | +out: 615 | + raw_spin_lock(&this_rq->__lock); 616 | + 617 | + /* 618 | + * While browsing the domains, we released the rq lock, a task could 619 | + * have been enqueued in the meantime. Since we're not going idle, 620 | + * pretend we pulled a task. 621 | + */ 622 | + if (this_rq->cfs.h_nr_running && !pulled_task) 623 | + pulled_task = 1; 624 | + 625 | + /* Is there a task of a high priority class? */ 626 | + if (this_rq->nr_running != this_rq->cfs.h_nr_running) 627 | + pulled_task = -1; 628 | + 629 | + if (pulled_task) 630 | + this_rq->idle_stamp = 0; 631 | + 632 | + rq_repin_lock(this_rq, rf); 633 | + 634 | + return pulled_task; 635 | +} 636 | + 637 | +static inline int on_null_domain(struct rq *rq) 638 | +{ 639 | + return unlikely(!rcu_dereference_sched(rq->sd)); 640 | +} 641 | + 642 | +void trigger_load_balance(struct rq *this_rq) 643 | +{ 644 | + int this_cpu = cpu_of(this_rq); 645 | + int cpu; 646 | + unsigned int max, min; 647 | + struct rq *max_rq, *min_rq, *c_rq; 648 | + struct rq_flags src_rf; 649 | + 650 | + if (this_cpu != 0) 651 | + return; 652 | + 653 | + max = min = this_rq->nr_running; 654 | + max_rq = min_rq = this_rq; 655 | + 656 | + for_each_online_cpu(cpu) { 657 | + c_rq = cpu_rq(cpu); 658 | + 659 | + /* 660 | + * Don't need to rebalance while attached to NULL domain or 661 | + * runqueue CPU is not active 662 | + */ 663 | + if (unlikely(on_null_domain(c_rq) || !cpu_active(cpu))) 664 | + continue; 665 | + 666 | + if (c_rq->nr_running < min) { 667 | + min = c_rq->nr_running; 668 | + min_rq = c_rq; 669 | + } 670 | + 671 | + if (c_rq->nr_running > max) { 672 | + max = c_rq->nr_running; 673 | + max_rq = c_rq; 674 | + } 675 | + } 676 | + 677 | + if (min_rq == max_rq || max - min < 2) 678 | + return; 679 | + 680 | + rq_lock_irqsave(max_rq, &src_rf); 681 | + update_rq_clock(max_rq); 682 | + 683 | + if (max_rq->nr_running < 2) { 684 | + rq_unlock(max_rq, &src_rf); 685 | + local_irq_restore(src_rf.flags); 686 | + return; 687 | + } 688 | + 689 | + move_task(min_rq, max_rq, &src_rf); 690 | +} 691 | + 692 | +void update_group_capacity(struct sched_domain *sd, int cpu) {} 693 | +#endif /* CONFIG_SMP */ 694 | + 695 | +static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued) 696 | +{ 697 | + struct sched_entity *se = &curr->se; 698 | + struct cfs_rq *cfs_rq = cfs_rq_of(se); 
699 | + 700 | + entity_tick(cfs_rq, se, queued); 701 | + 702 | + if (static_branch_unlikely(&sched_numa_balancing)) 703 | + task_tick_numa(rq, curr); 704 | +} 705 | + 706 | +static void task_fork_fair(struct task_struct *p) 707 | +{ 708 | + struct cfs_rq *cfs_rq; 709 | + struct sched_entity *curr; 710 | + struct rq *rq = this_rq(); 711 | + struct rq_flags rf; 712 | + 713 | + rq_lock(rq, &rf); 714 | + update_rq_clock(rq); 715 | + 716 | + cfs_rq = task_cfs_rq(current); 717 | + curr = cfs_rq->curr; 718 | + if (curr) 719 | + update_curr(cfs_rq); 720 | + 721 | + rq_unlock(rq, &rf); 722 | +} 723 | + 724 | +/* 725 | + * All the scheduling class methods: 726 | + */ 727 | +DEFINE_SCHED_CLASS(fair) = { 728 | + 729 | + .enqueue_task = enqueue_task_fair, 730 | + .dequeue_task = dequeue_task_fair, 731 | + .yield_task = yield_task_fair, 732 | + .yield_to_task = yield_to_task_fair, 733 | + 734 | + .check_preempt_curr = check_preempt_wakeup, 735 | + 736 | + .pick_next_task = __pick_next_task_fair, 737 | + .put_prev_task = put_prev_task_fair, 738 | + .set_next_task = set_next_task_fair, 739 | + 740 | +#ifdef CONFIG_SMP 741 | + .balance = balance_fair, 742 | + .pick_task = pick_task_fair, 743 | + .select_task_rq = select_task_rq_fair, 744 | + .migrate_task_rq = migrate_task_rq_fair, 745 | + 746 | + .rq_online = rq_online_fair, 747 | + .rq_offline = rq_offline_fair, 748 | + 749 | + .task_dead = task_dead_fair, 750 | + .set_cpus_allowed = set_cpus_allowed_common, 751 | +#endif 752 | + 753 | + .task_tick = task_tick_fair, 754 | + .task_fork = task_fork_fair, 755 | + 756 | + .prio_changed = prio_changed_fair, 757 | + .switched_from = switched_from_fair, 758 | + .switched_to = switched_to_fair, 759 | + 760 | + .get_rr_interval = get_rr_interval_fair, 761 | + 762 | + .update_curr = update_curr_fair, 763 | +}; 764 | diff --git a/kernel/sched/bs.h b/kernel/sched/bs.h 765 | new file mode 100644 766 | index 000000000000..f5495a29cb57 767 | --- /dev/null 768 | +++ b/kernel/sched/bs.h 769 | @@ -0,0 +1,146 @@ 770 | +/* 771 | + * After fork, child runs first. If set to 0 (default) then 772 | + * parent will (try to) run first. 
773 | + */ 774 | +unsigned int sysctl_sched_child_runs_first __read_mostly; 775 | + 776 | +const_debug unsigned int sysctl_sched_migration_cost = 500000UL; 777 | + 778 | +void __init sched_init_granularity(void) {} 779 | + 780 | +#ifdef CONFIG_SMP 781 | +/* Give new sched_entity start runnable values to heavy its load in infant time */ 782 | +void init_entity_runnable_average(struct sched_entity *se) {} 783 | +void post_init_entity_util_avg(struct task_struct *p) {} 784 | +void update_max_interval(void) {} 785 | +static int newidle_balance(struct rq *this_rq, struct rq_flags *rf); 786 | +#endif /** CONFIG_SMP */ 787 | + 788 | +void init_cfs_rq(struct cfs_rq *cfs_rq) 789 | +{ 790 | + cfs_rq->tasks_timeline = RB_ROOT_CACHED; 791 | +#ifdef CONFIG_SMP 792 | + raw_spin_lock_init(&cfs_rq->removed.lock); 793 | +#endif 794 | +} 795 | + 796 | +__init void init_sched_fair_class(void) {} 797 | + 798 | +void reweight_task(struct task_struct *p, int prio) {} 799 | + 800 | +static inline struct sched_entity *se_of(struct bs_node *bsn) 801 | +{ 802 | + return container_of(bsn, struct sched_entity, bs_node); 803 | +} 804 | + 805 | +#ifdef CONFIG_SCHED_SMT 806 | +DEFINE_STATIC_KEY_FALSE(sched_smt_present); 807 | +EXPORT_SYMBOL_GPL(sched_smt_present); 808 | + 809 | +static inline void set_idle_cores(int cpu, int val) 810 | +{ 811 | + struct sched_domain_shared *sds; 812 | + 813 | + sds = rcu_dereference(per_cpu(sd_llc_shared, cpu)); 814 | + if (sds) 815 | + WRITE_ONCE(sds->has_idle_cores, val); 816 | +} 817 | + 818 | +static inline bool test_idle_cores(int cpu, bool def) 819 | +{ 820 | + struct sched_domain_shared *sds; 821 | + 822 | + sds = rcu_dereference(per_cpu(sd_llc_shared, cpu)); 823 | + if (sds) 824 | + return READ_ONCE(sds->has_idle_cores); 825 | + 826 | + return def; 827 | +} 828 | + 829 | +void __update_idle_core(struct rq *rq) 830 | +{ 831 | + int core = cpu_of(rq); 832 | + int cpu; 833 | + 834 | + rcu_read_lock(); 835 | + if (test_idle_cores(core, true)) 836 | + goto unlock; 837 | + 838 | + for_each_cpu(cpu, cpu_smt_mask(core)) { 839 | + if (cpu == core) 840 | + continue; 841 | + 842 | + if (!available_idle_cpu(cpu)) 843 | + goto unlock; 844 | + } 845 | + 846 | + set_idle_cores(core, 1); 847 | +unlock: 848 | + rcu_read_unlock(); 849 | +} 850 | +#endif 851 | + 852 | +static void 853 | +account_entity_enqueue(struct cfs_rq *cfs_rq, struct sched_entity *se) 854 | +{ 855 | +#ifdef CONFIG_SMP 856 | + if (entity_is_task(se)) { 857 | + struct rq *rq = rq_of(cfs_rq); 858 | + 859 | + account_numa_enqueue(rq, task_of(se)); 860 | + list_add(&se->group_node, &rq->cfs_tasks); 861 | + } 862 | +#endif 863 | + cfs_rq->nr_running++; 864 | +} 865 | + 866 | +static void 867 | +account_entity_dequeue(struct cfs_rq *cfs_rq, struct sched_entity *se) 868 | +{ 869 | +#ifdef CONFIG_SMP 870 | + if (entity_is_task(se)) { 871 | + account_numa_dequeue(rq_of(cfs_rq), task_of(se)); 872 | + list_del_init(&se->group_node); 873 | + } 874 | +#endif 875 | + cfs_rq->nr_running--; 876 | +} 877 | + 878 | +static void migrate_task_rq_fair(struct task_struct *p, int new_cpu) 879 | +{ 880 | + update_scan_period(p, new_cpu); 881 | +} 882 | + 883 | +static void rq_online_fair(struct rq *rq) {} 884 | +static void rq_offline_fair(struct rq *rq) {} 885 | +static void task_dead_fair(struct task_struct *p) 886 | +{ 887 | + struct cfs_rq *cfs_rq = cfs_rq_of(&p->se); 888 | + unsigned long flags; 889 | + 890 | + raw_spin_lock_irqsave(&cfs_rq->removed.lock, flags); 891 | + ++cfs_rq->removed.nr; 892 | + 
raw_spin_unlock_irqrestore(&cfs_rq->removed.lock, flags); 893 | +} 894 | + 895 | +static void 896 | +prio_changed_fair(struct rq *rq, struct task_struct *p, int oldprio) {} 897 | + 898 | +static void switched_from_fair(struct rq *rq, struct task_struct *p) {} 899 | + 900 | +static void switched_to_fair(struct rq *rq, struct task_struct *p) 901 | +{ 902 | + if (task_on_rq_queued(p)) { 903 | + /* 904 | + * We were most likely switched from sched_rt, so 905 | + * kick off the schedule if running, otherwise just see 906 | + * if we can still preempt the current task. 907 | + */ 908 | + resched_curr(rq); 909 | + } 910 | +} 911 | + 912 | +static unsigned int get_rr_interval_fair(struct rq *rq, struct task_struct *task) 913 | +{ 914 | + return 0; 915 | +} 916 | diff --git a/kernel/sched/core.c b/kernel/sched/core.c 917 | index 399c37c95392..9a3adde1b831 100644 918 | --- a/kernel/sched/core.c 919 | +++ b/kernel/sched/core.c 920 | @@ -8995,6 +8995,10 @@ void __init sched_init(void) 921 | 922 | wait_bit_init(); 923 | 924 | +#ifdef CONFIG_BS_SCHED 925 | + printk(KERN_INFO "Baby CPU scheduler (rr) v5.14 by Hamad Al Marri."); 926 | +#endif 927 | + 928 | #ifdef CONFIG_FAIR_GROUP_SCHED 929 | ptr += 2 * nr_cpu_ids * sizeof(void **); 930 | #endif 931 | diff --git a/kernel/sched/fair_numa.h b/kernel/sched/fair_numa.h 932 | new file mode 100644 933 | index 000000000000..a564478ec3f7 934 | --- /dev/null 935 | +++ b/kernel/sched/fair_numa.h 936 | @@ -0,0 +1,1968 @@ 937 | + 938 | +#ifdef CONFIG_NUMA_BALANCING 939 | + 940 | +unsigned int sysctl_numa_balancing_scan_period_min = 1000; 941 | +unsigned int sysctl_numa_balancing_scan_period_max = 60000; 942 | + 943 | +/* Portion of address space to scan in MB */ 944 | +unsigned int sysctl_numa_balancing_scan_size = 256; 945 | + 946 | +/* Scan @scan_size MB every @scan_period after an initial @scan_delay in ms */ 947 | +unsigned int sysctl_numa_balancing_scan_delay = 1000; 948 | + 949 | +struct numa_group { 950 | + refcount_t refcount; 951 | + 952 | + spinlock_t lock; /* nr_tasks, tasks */ 953 | + int nr_tasks; 954 | + pid_t gid; 955 | + int active_nodes; 956 | + 957 | + struct rcu_head rcu; 958 | + unsigned long total_faults; 959 | + unsigned long max_faults_cpu; 960 | + /* 961 | + * Faults_cpu is used to decide whether memory should move 962 | + * towards the CPU. As a consequence, these stats are weighted 963 | + * more by CPU use than by memory faults. 964 | + */ 965 | + unsigned long *faults_cpu; 966 | + unsigned long faults[]; 967 | +}; 968 | + 969 | +/* 970 | + * For functions that can be called in multiple contexts that permit reading 971 | + * ->numa_group (see struct task_struct for locking rules). 
972 | + */ 973 | +static struct numa_group *deref_task_numa_group(struct task_struct *p) 974 | +{ 975 | + return rcu_dereference_check(p->numa_group, p == current || 976 | + (lockdep_is_held(__rq_lockp(task_rq(p))) && !READ_ONCE(p->on_cpu))); 977 | +} 978 | + 979 | +static struct numa_group *deref_curr_numa_group(struct task_struct *p) 980 | +{ 981 | + return rcu_dereference_protected(p->numa_group, p == current); 982 | +} 983 | + 984 | +static inline unsigned long group_faults_priv(struct numa_group *ng); 985 | +static inline unsigned long group_faults_shared(struct numa_group *ng); 986 | + 987 | +static unsigned int task_nr_scan_windows(struct task_struct *p) 988 | +{ 989 | + unsigned long rss = 0; 990 | + unsigned long nr_scan_pages; 991 | + 992 | + /* 993 | + * Calculations based on RSS as non-present and empty pages are skipped 994 | + * by the PTE scanner and NUMA hinting faults should be trapped based 995 | + * on resident pages 996 | + */ 997 | + nr_scan_pages = sysctl_numa_balancing_scan_size << (20 - PAGE_SHIFT); 998 | + rss = get_mm_rss(p->mm); 999 | + if (!rss) 1000 | + rss = nr_scan_pages; 1001 | + 1002 | + rss = round_up(rss, nr_scan_pages); 1003 | + return rss / nr_scan_pages; 1004 | +} 1005 | + 1006 | +/* For sanity's sake, never scan more PTEs than MAX_SCAN_WINDOW MB/sec. */ 1007 | +#define MAX_SCAN_WINDOW 2560 1008 | + 1009 | +static unsigned int task_scan_min(struct task_struct *p) 1010 | +{ 1011 | + unsigned int scan_size = READ_ONCE(sysctl_numa_balancing_scan_size); 1012 | + unsigned int scan, floor; 1013 | + unsigned int windows = 1; 1014 | + 1015 | + if (scan_size < MAX_SCAN_WINDOW) 1016 | + windows = MAX_SCAN_WINDOW / scan_size; 1017 | + floor = 1000 / windows; 1018 | + 1019 | + scan = sysctl_numa_balancing_scan_period_min / task_nr_scan_windows(p); 1020 | + return max_t(unsigned int, floor, scan); 1021 | +} 1022 | + 1023 | +static unsigned int task_scan_max(struct task_struct *p) 1024 | +{ 1025 | + unsigned long smin = task_scan_min(p); 1026 | + unsigned long smax; 1027 | + struct numa_group *ng; 1028 | + 1029 | + /* Watch for min being lower than max due to floor calculations */ 1030 | + smax = sysctl_numa_balancing_scan_period_max / task_nr_scan_windows(p); 1031 | + 1032 | + /* Scale the maximum scan period with the amount of shared memory. */ 1033 | + ng = deref_curr_numa_group(p); 1034 | + if (ng) { 1035 | + unsigned long shared = group_faults_shared(ng); 1036 | + unsigned long private = group_faults_priv(ng); 1037 | + unsigned long period = smax; 1038 | + 1039 | + period *= refcount_read(&ng->refcount); 1040 | + period *= shared + 1; 1041 | + period /= private + shared + 1; 1042 | + 1043 | + smax = max(smax, period); 1044 | + } 1045 | + 1046 | + return max(smin, smax); 1047 | +} 1048 | + 1049 | +static void account_numa_enqueue(struct rq *rq, struct task_struct *p) 1050 | +{ 1051 | + rq->nr_numa_running += (p->numa_preferred_nid != NUMA_NO_NODE); 1052 | + rq->nr_preferred_running += (p->numa_preferred_nid == task_node(p)); 1053 | +} 1054 | + 1055 | +static void account_numa_dequeue(struct rq *rq, struct task_struct *p) 1056 | +{ 1057 | + rq->nr_numa_running -= (p->numa_preferred_nid != NUMA_NO_NODE); 1058 | + rq->nr_preferred_running -= (p->numa_preferred_nid == task_node(p)); 1059 | +} 1060 | + 1061 | +/* Shared or private faults. */ 1062 | +#define NR_NUMA_HINT_FAULT_TYPES 2 1063 | + 1064 | +/* Memory and CPU locality */ 1065 | +#define NR_NUMA_HINT_FAULT_STATS (NR_NUMA_HINT_FAULT_TYPES * 2) 1066 | + 1067 | +/* Averaged statistics, and temporary buffers. 
*/ 1068 | +#define NR_NUMA_HINT_FAULT_BUCKETS (NR_NUMA_HINT_FAULT_STATS * 2) 1069 | + 1070 | +pid_t task_numa_group_id(struct task_struct *p) 1071 | +{ 1072 | + struct numa_group *ng; 1073 | + pid_t gid = 0; 1074 | + 1075 | + rcu_read_lock(); 1076 | + ng = rcu_dereference(p->numa_group); 1077 | + if (ng) 1078 | + gid = ng->gid; 1079 | + rcu_read_unlock(); 1080 | + 1081 | + return gid; 1082 | +} 1083 | + 1084 | +/* 1085 | + * The averaged statistics, shared & private, memory & CPU, 1086 | + * occupy the first half of the array. The second half of the 1087 | + * array is for current counters, which are averaged into the 1088 | + * first set by task_numa_placement. 1089 | + */ 1090 | +static inline int task_faults_idx(enum numa_faults_stats s, int nid, int priv) 1091 | +{ 1092 | + return NR_NUMA_HINT_FAULT_TYPES * (s * nr_node_ids + nid) + priv; 1093 | +} 1094 | + 1095 | +static inline unsigned long task_faults(struct task_struct *p, int nid) 1096 | +{ 1097 | + if (!p->numa_faults) 1098 | + return 0; 1099 | + 1100 | + return p->numa_faults[task_faults_idx(NUMA_MEM, nid, 0)] + 1101 | + p->numa_faults[task_faults_idx(NUMA_MEM, nid, 1)]; 1102 | +} 1103 | + 1104 | +static inline unsigned long group_faults(struct task_struct *p, int nid) 1105 | +{ 1106 | + struct numa_group *ng = deref_task_numa_group(p); 1107 | + 1108 | + if (!ng) 1109 | + return 0; 1110 | + 1111 | + return ng->faults[task_faults_idx(NUMA_MEM, nid, 0)] + 1112 | + ng->faults[task_faults_idx(NUMA_MEM, nid, 1)]; 1113 | +} 1114 | + 1115 | +static inline unsigned long group_faults_cpu(struct numa_group *group, int nid) 1116 | +{ 1117 | + return group->faults_cpu[task_faults_idx(NUMA_MEM, nid, 0)] + 1118 | + group->faults_cpu[task_faults_idx(NUMA_MEM, nid, 1)]; 1119 | +} 1120 | + 1121 | +static inline unsigned long group_faults_priv(struct numa_group *ng) 1122 | +{ 1123 | + unsigned long faults = 0; 1124 | + int node; 1125 | + 1126 | + for_each_online_node(node) { 1127 | + faults += ng->faults[task_faults_idx(NUMA_MEM, node, 1)]; 1128 | + } 1129 | + 1130 | + return faults; 1131 | +} 1132 | + 1133 | +static inline unsigned long group_faults_shared(struct numa_group *ng) 1134 | +{ 1135 | + unsigned long faults = 0; 1136 | + int node; 1137 | + 1138 | + for_each_online_node(node) { 1139 | + faults += ng->faults[task_faults_idx(NUMA_MEM, node, 0)]; 1140 | + } 1141 | + 1142 | + return faults; 1143 | +} 1144 | + 1145 | +/* 1146 | + * A node triggering more than 1/3 as many NUMA faults as the maximum is 1147 | + * considered part of a numa group's pseudo-interleaving set. Migrations 1148 | + * between these nodes are slowed down, to allow things to settle down. 1149 | + */ 1150 | +#define ACTIVE_NODE_FRACTION 3 1151 | + 1152 | +static bool numa_is_active_node(int nid, struct numa_group *ng) 1153 | +{ 1154 | + return group_faults_cpu(ng, nid) * ACTIVE_NODE_FRACTION > ng->max_faults_cpu; 1155 | +} 1156 | + 1157 | +/* Handle placement on systems where not all nodes are directly connected. */ 1158 | +static unsigned long score_nearby_nodes(struct task_struct *p, int nid, 1159 | + int maxdist, bool task) 1160 | +{ 1161 | + unsigned long score = 0; 1162 | + int node; 1163 | + 1164 | + /* 1165 | + * All nodes are directly connected, and the same distance 1166 | + * from each other. No need for fancy placement algorithms. 
1167 | + */ 1168 | + if (sched_numa_topology_type == NUMA_DIRECT) 1169 | + return 0; 1170 | + 1171 | + /* 1172 | + * This code is called for each node, introducing N^2 complexity, 1173 | + * which should be ok given the number of nodes rarely exceeds 8. 1174 | + */ 1175 | + for_each_online_node(node) { 1176 | + unsigned long faults; 1177 | + int dist = node_distance(nid, node); 1178 | + 1179 | + /* 1180 | + * The furthest away nodes in the system are not interesting 1181 | + * for placement; nid was already counted. 1182 | + */ 1183 | + if (dist == sched_max_numa_distance || node == nid) 1184 | + continue; 1185 | + 1186 | + /* 1187 | + * On systems with a backplane NUMA topology, compare groups 1188 | + * of nodes, and move tasks towards the group with the most 1189 | + * memory accesses. When comparing two nodes at distance 1190 | + * "hoplimit", only nodes closer by than "hoplimit" are part 1191 | + * of each group. Skip other nodes. 1192 | + */ 1193 | + if (sched_numa_topology_type == NUMA_BACKPLANE && 1194 | + dist >= maxdist) 1195 | + continue; 1196 | + 1197 | + /* Add up the faults from nearby nodes. */ 1198 | + if (task) 1199 | + faults = task_faults(p, node); 1200 | + else 1201 | + faults = group_faults(p, node); 1202 | + 1203 | + /* 1204 | + * On systems with a glueless mesh NUMA topology, there are 1205 | + * no fixed "groups of nodes". Instead, nodes that are not 1206 | + * directly connected bounce traffic through intermediate 1207 | + * nodes; a numa_group can occupy any set of nodes. 1208 | + * The further away a node is, the less the faults count. 1209 | + * This seems to result in good task placement. 1210 | + */ 1211 | + if (sched_numa_topology_type == NUMA_GLUELESS_MESH) { 1212 | + faults *= (sched_max_numa_distance - dist); 1213 | + faults /= (sched_max_numa_distance - LOCAL_DISTANCE); 1214 | + } 1215 | + 1216 | + score += faults; 1217 | + } 1218 | + 1219 | + return score; 1220 | +} 1221 | + 1222 | +/* 1223 | + * These return the fraction of accesses done by a particular task, or 1224 | + * task group, on a particular numa node. The group weight is given a 1225 | + * larger multiplier, in order to group tasks together that are almost 1226 | + * evenly spread out between numa nodes. 
1227 | + */ 1228 | +static inline unsigned long task_weight(struct task_struct *p, int nid, 1229 | + int dist) 1230 | +{ 1231 | + unsigned long faults, total_faults; 1232 | + 1233 | + if (!p->numa_faults) 1234 | + return 0; 1235 | + 1236 | + total_faults = p->total_numa_faults; 1237 | + 1238 | + if (!total_faults) 1239 | + return 0; 1240 | + 1241 | + faults = task_faults(p, nid); 1242 | + faults += score_nearby_nodes(p, nid, dist, true); 1243 | + 1244 | + return 1000 * faults / total_faults; 1245 | +} 1246 | + 1247 | +static inline unsigned long group_weight(struct task_struct *p, int nid, 1248 | + int dist) 1249 | +{ 1250 | + struct numa_group *ng = deref_task_numa_group(p); 1251 | + unsigned long faults, total_faults; 1252 | + 1253 | + if (!ng) 1254 | + return 0; 1255 | + 1256 | + total_faults = ng->total_faults; 1257 | + 1258 | + if (!total_faults) 1259 | + return 0; 1260 | + 1261 | + faults = group_faults(p, nid); 1262 | + faults += score_nearby_nodes(p, nid, dist, false); 1263 | + 1264 | + return 1000 * faults / total_faults; 1265 | +} 1266 | + 1267 | +bool should_numa_migrate_memory(struct task_struct *p, struct page * page, 1268 | + int src_nid, int dst_cpu) 1269 | +{ 1270 | + struct numa_group *ng = deref_curr_numa_group(p); 1271 | + int dst_nid = cpu_to_node(dst_cpu); 1272 | + int last_cpupid, this_cpupid; 1273 | + 1274 | + this_cpupid = cpu_pid_to_cpupid(dst_cpu, current->pid); 1275 | + last_cpupid = page_cpupid_xchg_last(page, this_cpupid); 1276 | + 1277 | + /* 1278 | + * Allow first faults or private faults to migrate immediately early in 1279 | + * the lifetime of a task. The magic number 4 is based on waiting for 1280 | + * two full passes of the "multi-stage node selection" test that is 1281 | + * executed below. 1282 | + */ 1283 | + if ((p->numa_preferred_nid == NUMA_NO_NODE || p->numa_scan_seq <= 4) && 1284 | + (cpupid_pid_unset(last_cpupid) || cpupid_match_pid(p, last_cpupid))) 1285 | + return true; 1286 | + 1287 | + /* 1288 | + * Multi-stage node selection is used in conjunction with a periodic 1289 | + * migration fault to build a temporal task<->page relation. By using 1290 | + * a two-stage filter we remove short/unlikely relations. 1291 | + * 1292 | + * Using P(p) ~ n_p / n_t as per frequentist probability, we can equate 1293 | + * a task's usage of a particular page (n_p) per total usage of this 1294 | + * page (n_t) (in a given time-span) to a probability. 1295 | + * 1296 | + * Our periodic faults will sample this probability and getting the 1297 | + * same result twice in a row, given these samples are fully 1298 | + * independent, is then given by P(n)^2, provided our sample period 1299 | + * is sufficiently short compared to the usage pattern. 1300 | + * 1301 | + * This quadric squishes small probabilities, making it less likely we 1302 | + * act on an unlikely task<->page relation. 1303 | + */ 1304 | + if (!cpupid_pid_unset(last_cpupid) && 1305 | + cpupid_to_nid(last_cpupid) != dst_nid) 1306 | + return false; 1307 | + 1308 | + /* Always allow migrate on private faults */ 1309 | + if (cpupid_match_pid(p, last_cpupid)) 1310 | + return true; 1311 | + 1312 | + /* A shared fault, but p->numa_group has not been set up yet. */ 1313 | + if (!ng) 1314 | + return true; 1315 | + 1316 | + /* 1317 | + * Destination node is much more heavily used than the source 1318 | + * node? Allow migration. 
1319 | + */ 1320 | + if (group_faults_cpu(ng, dst_nid) > group_faults_cpu(ng, src_nid) * 1321 | + ACTIVE_NODE_FRACTION) 1322 | + return true; 1323 | + 1324 | + /* 1325 | + * Distribute memory according to CPU & memory use on each node, 1326 | + * with 3/4 hysteresis to avoid unnecessary memory migrations: 1327 | + * 1328 | + * faults_cpu(dst) 3 faults_cpu(src) 1329 | + * --------------- * - > --------------- 1330 | + * faults_mem(dst) 4 faults_mem(src) 1331 | + */ 1332 | + return group_faults_cpu(ng, dst_nid) * group_faults(p, src_nid) * 3 > 1333 | + group_faults_cpu(ng, src_nid) * group_faults(p, dst_nid) * 4; 1334 | +} 1335 | + 1336 | +/* 1337 | + * 'numa_type' describes the node at the moment of load balancing. 1338 | + */ 1339 | +enum numa_type { 1340 | + /* The node has spare capacity that can be used to run more tasks. */ 1341 | + node_has_spare = 0, 1342 | + /* 1343 | + * The node is fully used and the tasks don't compete for more CPU 1344 | + * cycles. Nevertheless, some tasks might wait before running. 1345 | + */ 1346 | + node_fully_busy, 1347 | + /* 1348 | + * The node is overloaded and can't provide expected CPU cycles to all 1349 | + * tasks. 1350 | + */ 1351 | + node_overloaded 1352 | +}; 1353 | + 1354 | +/* Cached statistics for all CPUs within a node */ 1355 | +struct numa_stats { 1356 | + unsigned long load; 1357 | + unsigned long runnable; 1358 | + unsigned long util; 1359 | + /* Total compute capacity of CPUs on a node */ 1360 | + unsigned long compute_capacity; 1361 | + unsigned int nr_running; 1362 | + unsigned int weight; 1363 | + enum numa_type node_type; 1364 | + int idle_cpu; 1365 | +}; 1366 | + 1367 | +static inline bool is_core_idle(int cpu) 1368 | +{ 1369 | +#ifdef CONFIG_SCHED_SMT 1370 | + int sibling; 1371 | + 1372 | + for_each_cpu(sibling, cpu_smt_mask(cpu)) { 1373 | + if (cpu == sibling) 1374 | + continue; 1375 | + 1376 | + if (!idle_cpu(sibling)) 1377 | + return false; 1378 | + } 1379 | +#endif 1380 | + 1381 | + return true; 1382 | +} 1383 | + 1384 | +struct task_numa_env { 1385 | + struct task_struct *p; 1386 | + 1387 | + int src_cpu, src_nid; 1388 | + int dst_cpu, dst_nid; 1389 | + 1390 | + struct numa_stats src_stats, dst_stats; 1391 | + 1392 | + int imbalance_pct; 1393 | + int dist; 1394 | + 1395 | + struct task_struct *best_task; 1396 | + long best_imp; 1397 | + int best_cpu; 1398 | +}; 1399 | + 1400 | +static inline unsigned long cfs_rq_load_avg(struct cfs_rq *cfs_rq) 1401 | +{ 1402 | + return cfs_rq->avg.load_avg; 1403 | +} 1404 | + 1405 | +static unsigned long cpu_load(struct rq *rq) 1406 | +{ 1407 | + return cfs_rq_load_avg(&rq->cfs); 1408 | +} 1409 | + 1410 | +static inline unsigned long cfs_rq_runnable_avg(struct cfs_rq *cfs_rq) 1411 | +{ 1412 | + return cfs_rq->avg.runnable_avg; 1413 | +} 1414 | + 1415 | +static unsigned long cpu_runnable(struct rq *rq) 1416 | +{ 1417 | + return cfs_rq_runnable_avg(&rq->cfs); 1418 | +} 1419 | + 1420 | +static inline unsigned long cpu_util(int cpu) 1421 | +{ 1422 | + struct cfs_rq *cfs_rq; 1423 | + unsigned int util; 1424 | + 1425 | + cfs_rq = &cpu_rq(cpu)->cfs; 1426 | + util = READ_ONCE(cfs_rq->avg.util_avg); 1427 | + 1428 | + if (sched_feat(UTIL_EST)) 1429 | + util = max(util, READ_ONCE(cfs_rq->avg.util_est.enqueued)); 1430 | + 1431 | + return min_t(unsigned long, util, capacity_orig_of(cpu)); 1432 | +} 1433 | + 1434 | +static unsigned long capacity_of(int cpu) 1435 | +{ 1436 | + return cpu_rq(cpu)->cpu_capacity; 1437 | +} 1438 | + 1439 | +static inline enum 1440 | +numa_type numa_classify(unsigned int 
imbalance_pct, 1441 | + struct numa_stats *ns) 1442 | +{ 1443 | + if ((ns->nr_running > ns->weight) && 1444 | + (((ns->compute_capacity * 100) < (ns->util * imbalance_pct)) || 1445 | + ((ns->compute_capacity * imbalance_pct) < (ns->runnable * 100)))) 1446 | + return node_overloaded; 1447 | + 1448 | + if ((ns->nr_running < ns->weight) || 1449 | + (((ns->compute_capacity * 100) > (ns->util * imbalance_pct)) && 1450 | + ((ns->compute_capacity * imbalance_pct) > (ns->runnable * 100)))) 1451 | + return node_has_spare; 1452 | + 1453 | + return node_fully_busy; 1454 | +} 1455 | + 1456 | +#ifdef CONFIG_SCHED_SMT 1457 | +/* Forward declarations of select_idle_sibling helpers */ 1458 | +static inline bool test_idle_cores(int cpu, bool def); 1459 | +static inline int numa_idle_core(int idle_core, int cpu) 1460 | +{ 1461 | + if (!static_branch_likely(&sched_smt_present) || 1462 | + idle_core >= 0 || !test_idle_cores(cpu, false)) 1463 | + return idle_core; 1464 | + 1465 | + /* 1466 | + * Prefer cores instead of packing HT siblings 1467 | + * and triggering future load balancing. 1468 | + */ 1469 | + if (is_core_idle(cpu)) 1470 | + idle_core = cpu; 1471 | + 1472 | + return idle_core; 1473 | +} 1474 | +#else 1475 | +static inline int numa_idle_core(int idle_core, int cpu) 1476 | +{ 1477 | + return idle_core; 1478 | +} 1479 | +#endif 1480 | + 1481 | +/* 1482 | + * Gather all necessary information to make NUMA balancing placement 1483 | + * decisions that are compatible with standard load balancer. This 1484 | + * borrows code and logic from update_sg_lb_stats but sharing a 1485 | + * common implementation is impractical. 1486 | + */ 1487 | +static void update_numa_stats(struct task_numa_env *env, 1488 | + struct numa_stats *ns, int nid, 1489 | + bool find_idle) 1490 | +{ 1491 | + int cpu, idle_core = -1; 1492 | + 1493 | + memset(ns, 0, sizeof(*ns)); 1494 | + ns->idle_cpu = -1; 1495 | + 1496 | + rcu_read_lock(); 1497 | + for_each_cpu(cpu, cpumask_of_node(nid)) { 1498 | + struct rq *rq = cpu_rq(cpu); 1499 | + 1500 | + ns->load += cpu_load(rq); 1501 | + ns->runnable += cpu_runnable(rq); 1502 | + ns->util += cpu_util(cpu); 1503 | + ns->nr_running += rq->cfs.h_nr_running; 1504 | + ns->compute_capacity += capacity_of(cpu); 1505 | + 1506 | + if (find_idle && !rq->nr_running && idle_cpu(cpu)) { 1507 | + if (READ_ONCE(rq->numa_migrate_on) || 1508 | + !cpumask_test_cpu(cpu, env->p->cpus_ptr)) 1509 | + continue; 1510 | + 1511 | + if (ns->idle_cpu == -1) 1512 | + ns->idle_cpu = cpu; 1513 | + 1514 | + idle_core = numa_idle_core(idle_core, cpu); 1515 | + } 1516 | + } 1517 | + rcu_read_unlock(); 1518 | + 1519 | + ns->weight = cpumask_weight(cpumask_of_node(nid)); 1520 | + 1521 | + ns->node_type = numa_classify(env->imbalance_pct, ns); 1522 | + 1523 | + if (idle_core >= 0) 1524 | + ns->idle_cpu = idle_core; 1525 | +} 1526 | + 1527 | +static void task_numa_assign(struct task_numa_env *env, 1528 | + struct task_struct *p, long imp) 1529 | +{ 1530 | + struct rq *rq = cpu_rq(env->dst_cpu); 1531 | + 1532 | + /* Check if run-queue part of active NUMA balance. */ 1533 | + if (env->best_cpu != env->dst_cpu && xchg(&rq->numa_migrate_on, 1)) { 1534 | + int cpu; 1535 | + int start = env->dst_cpu; 1536 | + 1537 | + /* Find alternative idle CPU. 
*/ 1538 | + for_each_cpu_wrap(cpu, cpumask_of_node(env->dst_nid), start) { 1539 | + if (cpu == env->best_cpu || !idle_cpu(cpu) || 1540 | + !cpumask_test_cpu(cpu, env->p->cpus_ptr)) { 1541 | + continue; 1542 | + } 1543 | + 1544 | + env->dst_cpu = cpu; 1545 | + rq = cpu_rq(env->dst_cpu); 1546 | + if (!xchg(&rq->numa_migrate_on, 1)) 1547 | + goto assign; 1548 | + } 1549 | + 1550 | + /* Failed to find an alternative idle CPU */ 1551 | + return; 1552 | + } 1553 | + 1554 | +assign: 1555 | + /* 1556 | + * Clear previous best_cpu/rq numa-migrate flag, since task now 1557 | + * found a better CPU to move/swap. 1558 | + */ 1559 | + if (env->best_cpu != -1 && env->best_cpu != env->dst_cpu) { 1560 | + rq = cpu_rq(env->best_cpu); 1561 | + WRITE_ONCE(rq->numa_migrate_on, 0); 1562 | + } 1563 | + 1564 | + if (env->best_task) 1565 | + put_task_struct(env->best_task); 1566 | + if (p) 1567 | + get_task_struct(p); 1568 | + 1569 | + env->best_task = p; 1570 | + env->best_imp = imp; 1571 | + env->best_cpu = env->dst_cpu; 1572 | +} 1573 | + 1574 | +static bool load_too_imbalanced(long src_load, long dst_load, 1575 | + struct task_numa_env *env) 1576 | +{ 1577 | + long imb, old_imb; 1578 | + long orig_src_load, orig_dst_load; 1579 | + long src_capacity, dst_capacity; 1580 | + 1581 | + /* 1582 | + * The load is corrected for the CPU capacity available on each node. 1583 | + * 1584 | + * src_load dst_load 1585 | + * ------------ vs --------- 1586 | + * src_capacity dst_capacity 1587 | + */ 1588 | + src_capacity = env->src_stats.compute_capacity; 1589 | + dst_capacity = env->dst_stats.compute_capacity; 1590 | + 1591 | + imb = abs(dst_load * src_capacity - src_load * dst_capacity); 1592 | + 1593 | + orig_src_load = env->src_stats.load; 1594 | + orig_dst_load = env->dst_stats.load; 1595 | + 1596 | + old_imb = abs(orig_dst_load * src_capacity - orig_src_load * dst_capacity); 1597 | + 1598 | + /* Would this change make things worse? */ 1599 | + return (imb > old_imb); 1600 | +} 1601 | + 1602 | +static unsigned int task_scan_start(struct task_struct *p) 1603 | +{ 1604 | + unsigned long smin = task_scan_min(p); 1605 | + unsigned long period = smin; 1606 | + struct numa_group *ng; 1607 | + 1608 | + /* Scale the maximum scan period with the amount of shared memory. */ 1609 | + rcu_read_lock(); 1610 | + ng = rcu_dereference(p->numa_group); 1611 | + if (ng) { 1612 | + unsigned long shared = group_faults_shared(ng); 1613 | + unsigned long private = group_faults_priv(ng); 1614 | + 1615 | + period *= refcount_read(&ng->refcount); 1616 | + period *= shared + 1; 1617 | + period /= private + shared + 1; 1618 | + } 1619 | + rcu_read_unlock(); 1620 | + 1621 | + return max(smin, period); 1622 | +} 1623 | + 1624 | +static void update_scan_period(struct task_struct *p, int new_cpu) 1625 | +{ 1626 | + int src_nid = cpu_to_node(task_cpu(p)); 1627 | + int dst_nid = cpu_to_node(new_cpu); 1628 | + 1629 | + if (!static_branch_likely(&sched_numa_balancing)) 1630 | + return; 1631 | + 1632 | + if (!p->mm || !p->numa_faults || (p->flags & PF_EXITING)) 1633 | + return; 1634 | + 1635 | + if (src_nid == dst_nid) 1636 | + return; 1637 | + 1638 | + /* 1639 | + * Allow resets if faults have been trapped before one scan 1640 | + * has completed. This is most likely due to a new task that 1641 | + * is pulled cross-node due to wakeups or load balancing. 
1642 | + */ 1643 | + if (p->numa_scan_seq) { 1644 | + /* 1645 | + * Avoid scan adjustments if moving to the preferred 1646 | + * node or if the task was not previously running on 1647 | + * the preferred node. 1648 | + */ 1649 | + if (dst_nid == p->numa_preferred_nid || 1650 | + (p->numa_preferred_nid != NUMA_NO_NODE && 1651 | + src_nid != p->numa_preferred_nid)) 1652 | + return; 1653 | + } 1654 | + 1655 | + p->numa_scan_period = task_scan_start(p); 1656 | +} 1657 | + 1658 | +/* 1659 | + * Allow a NUMA imbalance if busy CPUs is less than 25% of the domain. 1660 | + * This is an approximation as the number of running tasks may not be 1661 | + * related to the number of busy CPUs due to sched_setaffinity. 1662 | + */ 1663 | +static inline bool allow_numa_imbalance(int dst_running, int dst_weight) 1664 | +{ 1665 | + return (dst_running < (dst_weight >> 2)); 1666 | +} 1667 | + 1668 | +#define NUMA_IMBALANCE_MIN 2 1669 | + 1670 | +static inline long adjust_numa_imbalance(int imbalance, 1671 | + int dst_running, int dst_weight) 1672 | +{ 1673 | + if (!allow_numa_imbalance(dst_running, dst_weight)) 1674 | + return imbalance; 1675 | + 1676 | + /* 1677 | + * Allow a small imbalance based on a simple pair of communicating 1678 | + * tasks that remain local when the destination is lightly loaded. 1679 | + */ 1680 | + if (imbalance <= NUMA_IMBALANCE_MIN) 1681 | + return 0; 1682 | + 1683 | + return imbalance; 1684 | +} 1685 | + 1686 | +static unsigned long task_h_load(struct task_struct *p) 1687 | +{ 1688 | + return p->se.avg.load_avg; 1689 | +} 1690 | + 1691 | +/* 1692 | + * Maximum NUMA importance can be 1998 (2*999); 1693 | + * SMALLIMP @ 30 would be close to 1998/64. 1694 | + * Used to deter task migration. 1695 | + */ 1696 | +#define SMALLIMP 30 1697 | + 1698 | +/* 1699 | + * This checks if the overall compute and NUMA accesses of the system would 1700 | + * be improved if the source tasks was migrated to the target dst_cpu taking 1701 | + * into account that it might be best if task running on the dst_cpu should 1702 | + * be exchanged with the source task 1703 | + */ 1704 | +static bool task_numa_compare(struct task_numa_env *env, 1705 | + long taskimp, long groupimp, bool maymove) 1706 | +{ 1707 | + struct numa_group *cur_ng, *p_ng = deref_curr_numa_group(env->p); 1708 | + struct rq *dst_rq = cpu_rq(env->dst_cpu); 1709 | + long imp = p_ng ? groupimp : taskimp; 1710 | + struct task_struct *cur; 1711 | + long src_load, dst_load; 1712 | + int dist = env->dist; 1713 | + long moveimp = imp; 1714 | + long load; 1715 | + bool stopsearch = false; 1716 | + 1717 | + if (READ_ONCE(dst_rq->numa_migrate_on)) 1718 | + return false; 1719 | + 1720 | + rcu_read_lock(); 1721 | + cur = rcu_dereference(dst_rq->curr); 1722 | + if (cur && ((cur->flags & PF_EXITING) || is_idle_task(cur))) 1723 | + cur = NULL; 1724 | + 1725 | + /* 1726 | + * Because we have preemption enabled we can get migrated around and 1727 | + * end try selecting ourselves (current == env->p) as a swap candidate. 1728 | + */ 1729 | + if (cur == env->p) { 1730 | + stopsearch = true; 1731 | + goto unlock; 1732 | + } 1733 | + 1734 | + if (!cur) { 1735 | + if (maymove && moveimp >= env->best_imp) 1736 | + goto assign; 1737 | + else 1738 | + goto unlock; 1739 | + } 1740 | + 1741 | + /* Skip this swap candidate if cannot move to the source cpu. 
*/ 1742 | + if (!cpumask_test_cpu(env->src_cpu, cur->cpus_ptr)) 1743 | + goto unlock; 1744 | + 1745 | + /* 1746 | + * Skip this swap candidate if it is not moving to its preferred 1747 | + * node and the best task is. 1748 | + */ 1749 | + if (env->best_task && 1750 | + env->best_task->numa_preferred_nid == env->src_nid && 1751 | + cur->numa_preferred_nid != env->src_nid) { 1752 | + goto unlock; 1753 | + } 1754 | + 1755 | + /* 1756 | + * "imp" is the fault differential for the source task between the 1757 | + * source and destination node. Calculate the total differential for 1758 | + * the source task and potential destination task. The more negative 1759 | + * the value is, the more remote accesses that would be expected to 1760 | + * be incurred if the tasks were swapped. 1761 | + * 1762 | + * If dst and source tasks are in the same NUMA group, or not 1763 | + * in any group then look only at task weights. 1764 | + */ 1765 | + cur_ng = rcu_dereference(cur->numa_group); 1766 | + if (cur_ng == p_ng) { 1767 | + imp = taskimp + task_weight(cur, env->src_nid, dist) - 1768 | + task_weight(cur, env->dst_nid, dist); 1769 | + /* 1770 | + * Add some hysteresis to prevent swapping the 1771 | + * tasks within a group over tiny differences. 1772 | + */ 1773 | + if (cur_ng) 1774 | + imp -= imp / 16; 1775 | + } else { 1776 | + /* 1777 | + * Compare the group weights. If a task is all by itself 1778 | + * (not part of a group), use the task weight instead. 1779 | + */ 1780 | + if (cur_ng && p_ng) 1781 | + imp += group_weight(cur, env->src_nid, dist) - 1782 | + group_weight(cur, env->dst_nid, dist); 1783 | + else 1784 | + imp += task_weight(cur, env->src_nid, dist) - 1785 | + task_weight(cur, env->dst_nid, dist); 1786 | + } 1787 | + 1788 | + /* Discourage picking a task already on its preferred node */ 1789 | + if (cur->numa_preferred_nid == env->dst_nid) 1790 | + imp -= imp / 16; 1791 | + 1792 | + /* 1793 | + * Encourage picking a task that moves to its preferred node. 1794 | + * This potentially makes imp larger than it's maximum of 1795 | + * 1998 (see SMALLIMP and task_weight for why) but in this 1796 | + * case, it does not matter. 1797 | + */ 1798 | + if (cur->numa_preferred_nid == env->src_nid) 1799 | + imp += imp / 8; 1800 | + 1801 | + if (maymove && moveimp > imp && moveimp > env->best_imp) { 1802 | + imp = moveimp; 1803 | + cur = NULL; 1804 | + goto assign; 1805 | + } 1806 | + 1807 | + /* 1808 | + * Prefer swapping with a task moving to its preferred node over a 1809 | + * task that is not. 1810 | + */ 1811 | + if (env->best_task && cur->numa_preferred_nid == env->src_nid && 1812 | + env->best_task->numa_preferred_nid != env->src_nid) { 1813 | + goto assign; 1814 | + } 1815 | + 1816 | + /* 1817 | + * If the NUMA importance is less than SMALLIMP, 1818 | + * task migration might only result in ping pong 1819 | + * of tasks and also hurt performance due to cache 1820 | + * misses. 1821 | + */ 1822 | + if (imp < SMALLIMP || imp <= env->best_imp + SMALLIMP / 2) 1823 | + goto unlock; 1824 | + 1825 | + /* 1826 | + * In the overloaded case, try and keep the load balanced. 1827 | + */ 1828 | + load = task_h_load(env->p) - task_h_load(cur); 1829 | + if (!load) 1830 | + goto assign; 1831 | + 1832 | + dst_load = env->dst_stats.load + load; 1833 | + src_load = env->src_stats.load - load; 1834 | + 1835 | + if (load_too_imbalanced(src_load, dst_load, env)) 1836 | + goto unlock; 1837 | + 1838 | +assign: 1839 | + /* Evaluate an idle CPU for a task numa move. 
*/ 1840 | + if (!cur) { 1841 | + int cpu = env->dst_stats.idle_cpu; 1842 | + 1843 | + /* Nothing cached so current CPU went idle since the search. */ 1844 | + if (cpu < 0) 1845 | + cpu = env->dst_cpu; 1846 | + 1847 | + /* 1848 | + * If the CPU is no longer truly idle and the previous best CPU 1849 | + * is, keep using it. 1850 | + */ 1851 | + if (!idle_cpu(cpu) && env->best_cpu >= 0 && 1852 | + idle_cpu(env->best_cpu)) { 1853 | + cpu = env->best_cpu; 1854 | + } 1855 | + 1856 | + env->dst_cpu = cpu; 1857 | + } 1858 | + 1859 | + task_numa_assign(env, cur, imp); 1860 | + 1861 | + /* 1862 | + * If a move to idle is allowed because there is capacity or load 1863 | + * balance improves then stop the search. While a better swap 1864 | + * candidate may exist, a search is not free. 1865 | + */ 1866 | + if (maymove && !cur && env->best_cpu >= 0 && idle_cpu(env->best_cpu)) 1867 | + stopsearch = true; 1868 | + 1869 | + /* 1870 | + * If a swap candidate must be identified and the current best task 1871 | + * moves its preferred node then stop the search. 1872 | + */ 1873 | + if (!maymove && env->best_task && 1874 | + env->best_task->numa_preferred_nid == env->src_nid) { 1875 | + stopsearch = true; 1876 | + } 1877 | +unlock: 1878 | + rcu_read_unlock(); 1879 | + 1880 | + return stopsearch; 1881 | +} 1882 | + 1883 | +static void task_numa_find_cpu(struct task_numa_env *env, 1884 | + long taskimp, long groupimp) 1885 | +{ 1886 | + bool maymove = false; 1887 | + int cpu; 1888 | + 1889 | + /* 1890 | + * If dst node has spare capacity, then check if there is an 1891 | + * imbalance that would be overruled by the load balancer. 1892 | + */ 1893 | + if (env->dst_stats.node_type == node_has_spare) { 1894 | + unsigned int imbalance; 1895 | + int src_running, dst_running; 1896 | + 1897 | + /* 1898 | + * Would movement cause an imbalance? Note that if src has 1899 | + * more running tasks that the imbalance is ignored as the 1900 | + * move improves the imbalance from the perspective of the 1901 | + * CPU load balancer. 1902 | + * */ 1903 | + src_running = env->src_stats.nr_running - 1; 1904 | + dst_running = env->dst_stats.nr_running + 1; 1905 | + imbalance = max(0, dst_running - src_running); 1906 | + imbalance = adjust_numa_imbalance(imbalance, dst_running, 1907 | + env->dst_stats.weight); 1908 | + 1909 | + /* Use idle CPU if there is no imbalance */ 1910 | + if (!imbalance) { 1911 | + maymove = true; 1912 | + if (env->dst_stats.idle_cpu >= 0) { 1913 | + env->dst_cpu = env->dst_stats.idle_cpu; 1914 | + task_numa_assign(env, NULL, 0); 1915 | + return; 1916 | + } 1917 | + } 1918 | + } else { 1919 | + long src_load, dst_load, load; 1920 | + /* 1921 | + * If the improvement from just moving env->p direction is better 1922 | + * than swapping tasks around, check if a move is possible. 
1923 | + */ 1924 | + load = task_h_load(env->p); 1925 | + dst_load = env->dst_stats.load + load; 1926 | + src_load = env->src_stats.load - load; 1927 | + maymove = !load_too_imbalanced(src_load, dst_load, env); 1928 | + } 1929 | + 1930 | + for_each_cpu(cpu, cpumask_of_node(env->dst_nid)) { 1931 | + /* Skip this CPU if the source task cannot migrate */ 1932 | + if (!cpumask_test_cpu(cpu, env->p->cpus_ptr)) 1933 | + continue; 1934 | + 1935 | + env->dst_cpu = cpu; 1936 | + if (task_numa_compare(env, taskimp, groupimp, maymove)) 1937 | + break; 1938 | + } 1939 | +} 1940 | + 1941 | +static int task_numa_migrate(struct task_struct *p) 1942 | +{ 1943 | + struct task_numa_env env = { 1944 | + .p = p, 1945 | + 1946 | + .src_cpu = task_cpu(p), 1947 | + .src_nid = task_node(p), 1948 | + 1949 | + .imbalance_pct = 112, 1950 | + 1951 | + .best_task = NULL, 1952 | + .best_imp = 0, 1953 | + .best_cpu = -1, 1954 | + }; 1955 | + unsigned long taskweight, groupweight; 1956 | + struct sched_domain *sd; 1957 | + long taskimp, groupimp; 1958 | + struct numa_group *ng; 1959 | + struct rq *best_rq; 1960 | + int nid, ret, dist; 1961 | + 1962 | + /* 1963 | + * Pick the lowest SD_NUMA domain, as that would have the smallest 1964 | + * imbalance and would be the first to start moving tasks about. 1965 | + * 1966 | + * And we want to avoid any moving of tasks about, as that would create 1967 | + * random movement of tasks -- counter the numa conditions we're trying 1968 | + * to satisfy here. 1969 | + */ 1970 | + rcu_read_lock(); 1971 | + sd = rcu_dereference(per_cpu(sd_numa, env.src_cpu)); 1972 | + if (sd) 1973 | + env.imbalance_pct = 100 + (sd->imbalance_pct - 100) / 2; 1974 | + rcu_read_unlock(); 1975 | + 1976 | + /* 1977 | + * Cpusets can break the scheduler domain tree into smaller 1978 | + * balance domains, some of which do not cross NUMA boundaries. 1979 | + * Tasks that are "trapped" in such domains cannot be migrated 1980 | + * elsewhere, so there is no point in (re)trying. 1981 | + */ 1982 | + if (unlikely(!sd)) { 1983 | + sched_setnuma(p, task_node(p)); 1984 | + return -EINVAL; 1985 | + } 1986 | + 1987 | + env.dst_nid = p->numa_preferred_nid; 1988 | + dist = env.dist = node_distance(env.src_nid, env.dst_nid); 1989 | + taskweight = task_weight(p, env.src_nid, dist); 1990 | + groupweight = group_weight(p, env.src_nid, dist); 1991 | + update_numa_stats(&env, &env.src_stats, env.src_nid, false); 1992 | + taskimp = task_weight(p, env.dst_nid, dist) - taskweight; 1993 | + groupimp = group_weight(p, env.dst_nid, dist) - groupweight; 1994 | + update_numa_stats(&env, &env.dst_stats, env.dst_nid, true); 1995 | + 1996 | + /* Try to find a spot on the preferred nid. */ 1997 | + task_numa_find_cpu(&env, taskimp, groupimp); 1998 | + 1999 | + /* 2000 | + * Look at other nodes in these cases: 2001 | + * - there is no space available on the preferred_nid 2002 | + * - the task is part of a numa_group that is interleaved across 2003 | + * multiple NUMA nodes; in order to better consolidate the group, 2004 | + * we need to check other locations. 
2005 | + */ 2006 | + ng = deref_curr_numa_group(p); 2007 | + if (env.best_cpu == -1 || (ng && ng->active_nodes > 1)) { 2008 | + for_each_online_node(nid) { 2009 | + if (nid == env.src_nid || nid == p->numa_preferred_nid) 2010 | + continue; 2011 | + 2012 | + dist = node_distance(env.src_nid, env.dst_nid); 2013 | + if (sched_numa_topology_type == NUMA_BACKPLANE && 2014 | + dist != env.dist) { 2015 | + taskweight = task_weight(p, env.src_nid, dist); 2016 | + groupweight = group_weight(p, env.src_nid, dist); 2017 | + } 2018 | + 2019 | + /* Only consider nodes where both task and groups benefit */ 2020 | + taskimp = task_weight(p, nid, dist) - taskweight; 2021 | + groupimp = group_weight(p, nid, dist) - groupweight; 2022 | + if (taskimp < 0 && groupimp < 0) 2023 | + continue; 2024 | + 2025 | + env.dist = dist; 2026 | + env.dst_nid = nid; 2027 | + update_numa_stats(&env, &env.dst_stats, env.dst_nid, true); 2028 | + task_numa_find_cpu(&env, taskimp, groupimp); 2029 | + } 2030 | + } 2031 | + 2032 | + /* 2033 | + * If the task is part of a workload that spans multiple NUMA nodes, 2034 | + * and is migrating into one of the workload's active nodes, remember 2035 | + * this node as the task's preferred numa node, so the workload can 2036 | + * settle down. 2037 | + * A task that migrated to a second choice node will be better off 2038 | + * trying for a better one later. Do not set the preferred node here. 2039 | + */ 2040 | + if (ng) { 2041 | + if (env.best_cpu == -1) 2042 | + nid = env.src_nid; 2043 | + else 2044 | + nid = cpu_to_node(env.best_cpu); 2045 | + 2046 | + if (nid != p->numa_preferred_nid) 2047 | + sched_setnuma(p, nid); 2048 | + } 2049 | + 2050 | + /* No better CPU than the current one was found. */ 2051 | + if (env.best_cpu == -1) { 2052 | + trace_sched_stick_numa(p, env.src_cpu, NULL, -1); 2053 | + return -EAGAIN; 2054 | + } 2055 | + 2056 | + best_rq = cpu_rq(env.best_cpu); 2057 | + if (env.best_task == NULL) { 2058 | + ret = migrate_task_to(p, env.best_cpu); 2059 | + WRITE_ONCE(best_rq->numa_migrate_on, 0); 2060 | + if (ret != 0) 2061 | + trace_sched_stick_numa(p, env.src_cpu, NULL, env.best_cpu); 2062 | + return ret; 2063 | + } 2064 | + 2065 | + ret = migrate_swap(p, env.best_task, env.best_cpu, env.src_cpu); 2066 | + WRITE_ONCE(best_rq->numa_migrate_on, 0); 2067 | + 2068 | + if (ret != 0) 2069 | + trace_sched_stick_numa(p, env.src_cpu, env.best_task, env.best_cpu); 2070 | + put_task_struct(env.best_task); 2071 | + return ret; 2072 | +} 2073 | + 2074 | +/* Attempt to migrate a task to a CPU on the preferred node. */ 2075 | +static void numa_migrate_preferred(struct task_struct *p) 2076 | +{ 2077 | + unsigned long interval = HZ; 2078 | + 2079 | + /* This task has no NUMA fault statistics yet */ 2080 | + if (unlikely(p->numa_preferred_nid == NUMA_NO_NODE || !p->numa_faults)) 2081 | + return; 2082 | + 2083 | + /* Periodically retry migrating the task to the preferred node */ 2084 | + interval = min(interval, msecs_to_jiffies(p->numa_scan_period) / 16); 2085 | + p->numa_migrate_retry = jiffies + interval; 2086 | + 2087 | + /* Success if task is already running on preferred CPU */ 2088 | + if (task_node(p) == p->numa_preferred_nid) 2089 | + return; 2090 | + 2091 | + /* Otherwise, try migrate to a CPU on the preferred node */ 2092 | + task_numa_migrate(p); 2093 | +} 2094 | + 2095 | +/* 2096 | + * Find out how many nodes on the workload is actively running on. Do this by 2097 | + * tracking the nodes from which NUMA hinting faults are triggered. 
This can 2098 | + * be different from the set of nodes where the workload's memory is currently 2099 | + * located. 2100 | + */ 2101 | +static void numa_group_count_active_nodes(struct numa_group *numa_group) 2102 | +{ 2103 | + unsigned long faults, max_faults = 0; 2104 | + int nid, active_nodes = 0; 2105 | + 2106 | + for_each_online_node(nid) { 2107 | + faults = group_faults_cpu(numa_group, nid); 2108 | + if (faults > max_faults) 2109 | + max_faults = faults; 2110 | + } 2111 | + 2112 | + for_each_online_node(nid) { 2113 | + faults = group_faults_cpu(numa_group, nid); 2114 | + if (faults * ACTIVE_NODE_FRACTION > max_faults) 2115 | + active_nodes++; 2116 | + } 2117 | + 2118 | + numa_group->max_faults_cpu = max_faults; 2119 | + numa_group->active_nodes = active_nodes; 2120 | +} 2121 | + 2122 | +#define NUMA_PERIOD_SLOTS 10 2123 | +#define NUMA_PERIOD_THRESHOLD 7 2124 | + 2125 | +/* 2126 | + * Increase the scan period (slow down scanning) if the majority of 2127 | + * our memory is already on our local node, or if the majority of 2128 | + * the page accesses are shared with other processes. 2129 | + * Otherwise, decrease the scan period. 2130 | + */ 2131 | +static void update_task_scan_period(struct task_struct *p, 2132 | + unsigned long shared, unsigned long private) 2133 | +{ 2134 | + unsigned int period_slot; 2135 | + int lr_ratio, ps_ratio; 2136 | + int diff; 2137 | + 2138 | + unsigned long remote = p->numa_faults_locality[0]; 2139 | + unsigned long local = p->numa_faults_locality[1]; 2140 | + 2141 | + /* 2142 | + * If there were no record hinting faults then either the task is 2143 | + * completely idle or all activity is areas that are not of interest 2144 | + * to automatic numa balancing. Related to that, if there were failed 2145 | + * migration then it implies we are migrating too quickly or the local 2146 | + * node is overloaded. In either case, scan slower 2147 | + */ 2148 | + if (local + shared == 0 || p->numa_faults_locality[2]) { 2149 | + p->numa_scan_period = min(p->numa_scan_period_max, 2150 | + p->numa_scan_period << 1); 2151 | + 2152 | + p->mm->numa_next_scan = jiffies + 2153 | + msecs_to_jiffies(p->numa_scan_period); 2154 | + 2155 | + return; 2156 | + } 2157 | + 2158 | + /* 2159 | + * Prepare to scale scan period relative to the current period. 2160 | + * == NUMA_PERIOD_THRESHOLD scan period stays the same 2161 | + * < NUMA_PERIOD_THRESHOLD scan period decreases (scan faster) 2162 | + * >= NUMA_PERIOD_THRESHOLD scan period increases (scan slower) 2163 | + */ 2164 | + period_slot = DIV_ROUND_UP(p->numa_scan_period, NUMA_PERIOD_SLOTS); 2165 | + lr_ratio = (local * NUMA_PERIOD_SLOTS) / (local + remote); 2166 | + ps_ratio = (private * NUMA_PERIOD_SLOTS) / (private + shared); 2167 | + 2168 | + if (ps_ratio >= NUMA_PERIOD_THRESHOLD) { 2169 | + /* 2170 | + * Most memory accesses are local. There is no need to 2171 | + * do fast NUMA scanning, since memory is already local. 2172 | + */ 2173 | + int slot = ps_ratio - NUMA_PERIOD_THRESHOLD; 2174 | + if (!slot) 2175 | + slot = 1; 2176 | + diff = slot * period_slot; 2177 | + } else if (lr_ratio >= NUMA_PERIOD_THRESHOLD) { 2178 | + /* 2179 | + * Most memory accesses are shared with other tasks. 2180 | + * There is no point in continuing fast NUMA scanning, 2181 | + * since other tasks may just move the memory elsewhere. 
2182 | + */ 2183 | + int slot = lr_ratio - NUMA_PERIOD_THRESHOLD; 2184 | + if (!slot) 2185 | + slot = 1; 2186 | + diff = slot * period_slot; 2187 | + } else { 2188 | + /* 2189 | + * Private memory faults exceed (SLOTS-THRESHOLD)/SLOTS, 2190 | + * yet they are not on the local NUMA node. Speed up 2191 | + * NUMA scanning to get the memory moved over. 2192 | + */ 2193 | + int ratio = max(lr_ratio, ps_ratio); 2194 | + diff = -(NUMA_PERIOD_THRESHOLD - ratio) * period_slot; 2195 | + } 2196 | + 2197 | + p->numa_scan_period = clamp(p->numa_scan_period + diff, 2198 | + task_scan_min(p), task_scan_max(p)); 2199 | + memset(p->numa_faults_locality, 0, sizeof(p->numa_faults_locality)); 2200 | +} 2201 | + 2202 | +/* 2203 | + * Get the fraction of time the task has been running since the last 2204 | + * NUMA placement cycle. The scheduler keeps similar statistics, but 2205 | + * decays those on a 32ms period, which is orders of magnitude off 2206 | + * from the dozens-of-seconds NUMA balancing period. Use the scheduler 2207 | + * stats only if the task is so new there are no NUMA statistics yet. 2208 | + */ 2209 | +static u64 numa_get_avg_runtime(struct task_struct *p, u64 *period) 2210 | +{ 2211 | + u64 runtime, delta, now; 2212 | + /* Use the start of this time slice to avoid calculations. */ 2213 | + now = p->se.exec_start; 2214 | + runtime = p->se.sum_exec_runtime; 2215 | + 2216 | + if (p->last_task_numa_placement) { 2217 | + delta = runtime - p->last_sum_exec_runtime; 2218 | + *period = now - p->last_task_numa_placement; 2219 | + 2220 | + /* Avoid time going backwards, prevent potential divide error: */ 2221 | + if (unlikely((s64)*period < 0)) 2222 | + *period = 0; 2223 | + } else { 2224 | + delta = p->se.avg.load_sum; 2225 | + *period = LOAD_AVG_MAX; 2226 | + } 2227 | + 2228 | + p->last_sum_exec_runtime = runtime; 2229 | + p->last_task_numa_placement = now; 2230 | + 2231 | + return delta; 2232 | +} 2233 | + 2234 | +/* 2235 | + * Determine the preferred nid for a task in a numa_group. This needs to 2236 | + * be done in a way that produces consistent results with group_weight, 2237 | + * otherwise workloads might not converge. 2238 | + */ 2239 | +static int preferred_group_nid(struct task_struct *p, int nid) 2240 | +{ 2241 | + nodemask_t nodes; 2242 | + int dist; 2243 | + 2244 | + /* Direct connections between all NUMA nodes. */ 2245 | + if (sched_numa_topology_type == NUMA_DIRECT) 2246 | + return nid; 2247 | + 2248 | + /* 2249 | + * On a system with glueless mesh NUMA topology, group_weight 2250 | + * scores nodes according to the number of NUMA hinting faults on 2251 | + * both the node itself, and on nearby nodes. 2252 | + */ 2253 | + if (sched_numa_topology_type == NUMA_GLUELESS_MESH) { 2254 | + unsigned long score, max_score = 0; 2255 | + int node, max_node = nid; 2256 | + 2257 | + dist = sched_max_numa_distance; 2258 | + 2259 | + for_each_online_node(node) { 2260 | + score = group_weight(p, node, dist); 2261 | + if (score > max_score) { 2262 | + max_score = score; 2263 | + max_node = node; 2264 | + } 2265 | + } 2266 | + return max_node; 2267 | + } 2268 | + 2269 | + /* 2270 | + * Finding the preferred nid in a system with NUMA backplane 2271 | + * interconnect topology is more involved. The goal is to locate 2272 | + * tasks from numa_groups near each other in the system, and 2273 | + * untangle workloads from different sides of the system. This requires 2274 | + * searching down the hierarchy of node groups, recursively searching 2275 | + * inside the highest scoring group of nodes. 
The nodemask tricks 2276 | + * keep the complexity of the search down. 2277 | + */ 2278 | + nodes = node_online_map; 2279 | + for (dist = sched_max_numa_distance; dist > LOCAL_DISTANCE; dist--) { 2280 | + unsigned long max_faults = 0; 2281 | + nodemask_t max_group = NODE_MASK_NONE; 2282 | + int a, b; 2283 | + 2284 | + /* Are there nodes at this distance from each other? */ 2285 | + if (!find_numa_distance(dist)) 2286 | + continue; 2287 | + 2288 | + for_each_node_mask(a, nodes) { 2289 | + unsigned long faults = 0; 2290 | + nodemask_t this_group; 2291 | + nodes_clear(this_group); 2292 | + 2293 | + /* Sum group's NUMA faults; includes a==b case. */ 2294 | + for_each_node_mask(b, nodes) { 2295 | + if (node_distance(a, b) < dist) { 2296 | + faults += group_faults(p, b); 2297 | + node_set(b, this_group); 2298 | + node_clear(b, nodes); 2299 | + } 2300 | + } 2301 | + 2302 | + /* Remember the top group. */ 2303 | + if (faults > max_faults) { 2304 | + max_faults = faults; 2305 | + max_group = this_group; 2306 | + /* 2307 | + * subtle: at the smallest distance there is 2308 | + * just one node left in each "group", the 2309 | + * winner is the preferred nid. 2310 | + */ 2311 | + nid = a; 2312 | + } 2313 | + } 2314 | + /* Next round, evaluate the nodes within max_group. */ 2315 | + if (!max_faults) 2316 | + break; 2317 | + nodes = max_group; 2318 | + } 2319 | + return nid; 2320 | +} 2321 | + 2322 | +static void task_numa_placement(struct task_struct *p) 2323 | +{ 2324 | + int seq, nid, max_nid = NUMA_NO_NODE; 2325 | + unsigned long max_faults = 0; 2326 | + unsigned long fault_types[2] = { 0, 0 }; 2327 | + unsigned long total_faults; 2328 | + u64 runtime, period; 2329 | + spinlock_t *group_lock = NULL; 2330 | + struct numa_group *ng; 2331 | + 2332 | + /* 2333 | + * The p->mm->numa_scan_seq field gets updated without 2334 | + * exclusive access. 
Use READ_ONCE() here to ensure 2335 | + * that the field is read in a single access: 2336 | + */ 2337 | + seq = READ_ONCE(p->mm->numa_scan_seq); 2338 | + if (p->numa_scan_seq == seq) 2339 | + return; 2340 | + p->numa_scan_seq = seq; 2341 | + p->numa_scan_period_max = task_scan_max(p); 2342 | + 2343 | + total_faults = p->numa_faults_locality[0] + 2344 | + p->numa_faults_locality[1]; 2345 | + runtime = numa_get_avg_runtime(p, &period); 2346 | + 2347 | + /* If the task is part of a group prevent parallel updates to group stats */ 2348 | + ng = deref_curr_numa_group(p); 2349 | + if (ng) { 2350 | + group_lock = &ng->lock; 2351 | + spin_lock_irq(group_lock); 2352 | + } 2353 | + 2354 | + /* Find the node with the highest number of faults */ 2355 | + for_each_online_node(nid) { 2356 | + /* Keep track of the offsets in numa_faults array */ 2357 | + int mem_idx, membuf_idx, cpu_idx, cpubuf_idx; 2358 | + unsigned long faults = 0, group_faults = 0; 2359 | + int priv; 2360 | + 2361 | + for (priv = 0; priv < NR_NUMA_HINT_FAULT_TYPES; priv++) { 2362 | + long diff, f_diff, f_weight; 2363 | + 2364 | + mem_idx = task_faults_idx(NUMA_MEM, nid, priv); 2365 | + membuf_idx = task_faults_idx(NUMA_MEMBUF, nid, priv); 2366 | + cpu_idx = task_faults_idx(NUMA_CPU, nid, priv); 2367 | + cpubuf_idx = task_faults_idx(NUMA_CPUBUF, nid, priv); 2368 | + 2369 | + /* Decay existing window, copy faults since last scan */ 2370 | + diff = p->numa_faults[membuf_idx] - p->numa_faults[mem_idx] / 2; 2371 | + fault_types[priv] += p->numa_faults[membuf_idx]; 2372 | + p->numa_faults[membuf_idx] = 0; 2373 | + 2374 | + /* 2375 | + * Normalize the faults_from, so all tasks in a group 2376 | + * count according to CPU use, instead of by the raw 2377 | + * number of faults. Tasks with little runtime have 2378 | + * little over-all impact on throughput, and thus their 2379 | + * faults are less important. 2380 | + */ 2381 | + f_weight = div64_u64(runtime << 16, period + 1); 2382 | + f_weight = (f_weight * p->numa_faults[cpubuf_idx]) / 2383 | + (total_faults + 1); 2384 | + f_diff = f_weight - p->numa_faults[cpu_idx] / 2; 2385 | + p->numa_faults[cpubuf_idx] = 0; 2386 | + 2387 | + p->numa_faults[mem_idx] += diff; 2388 | + p->numa_faults[cpu_idx] += f_diff; 2389 | + faults += p->numa_faults[mem_idx]; 2390 | + p->total_numa_faults += diff; 2391 | + if (ng) { 2392 | + /* 2393 | + * safe because we can only change our own group 2394 | + * 2395 | + * mem_idx represents the offset for a given 2396 | + * nid and priv in a specific region because it 2397 | + * is at the beginning of the numa_faults array. 
2398 | + */ 2399 | + ng->faults[mem_idx] += diff; 2400 | + ng->faults_cpu[mem_idx] += f_diff; 2401 | + ng->total_faults += diff; 2402 | + group_faults += ng->faults[mem_idx]; 2403 | + } 2404 | + } 2405 | + 2406 | + if (!ng) { 2407 | + if (faults > max_faults) { 2408 | + max_faults = faults; 2409 | + max_nid = nid; 2410 | + } 2411 | + } else if (group_faults > max_faults) { 2412 | + max_faults = group_faults; 2413 | + max_nid = nid; 2414 | + } 2415 | + } 2416 | + 2417 | + if (ng) { 2418 | + numa_group_count_active_nodes(ng); 2419 | + spin_unlock_irq(group_lock); 2420 | + max_nid = preferred_group_nid(p, max_nid); 2421 | + } 2422 | + 2423 | + if (max_faults) { 2424 | + /* Set the new preferred node */ 2425 | + if (max_nid != p->numa_preferred_nid) 2426 | + sched_setnuma(p, max_nid); 2427 | + } 2428 | + 2429 | + update_task_scan_period(p, fault_types[0], fault_types[1]); 2430 | +} 2431 | + 2432 | +static inline int get_numa_group(struct numa_group *grp) 2433 | +{ 2434 | + return refcount_inc_not_zero(&grp->refcount); 2435 | +} 2436 | + 2437 | +static inline void put_numa_group(struct numa_group *grp) 2438 | +{ 2439 | + if (refcount_dec_and_test(&grp->refcount)) 2440 | + kfree_rcu(grp, rcu); 2441 | +} 2442 | + 2443 | +static void task_numa_group(struct task_struct *p, int cpupid, int flags, 2444 | + int *priv) 2445 | +{ 2446 | + struct numa_group *grp, *my_grp; 2447 | + struct task_struct *tsk; 2448 | + bool join = false; 2449 | + int cpu = cpupid_to_cpu(cpupid); 2450 | + int i; 2451 | + 2452 | + if (unlikely(!deref_curr_numa_group(p))) { 2453 | + unsigned int size = sizeof(struct numa_group) + 2454 | + 4*nr_node_ids*sizeof(unsigned long); 2455 | + 2456 | + grp = kzalloc(size, GFP_KERNEL | __GFP_NOWARN); 2457 | + if (!grp) 2458 | + return; 2459 | + 2460 | + refcount_set(&grp->refcount, 1); 2461 | + grp->active_nodes = 1; 2462 | + grp->max_faults_cpu = 0; 2463 | + spin_lock_init(&grp->lock); 2464 | + grp->gid = p->pid; 2465 | + /* Second half of the array tracks nids where faults happen */ 2466 | + grp->faults_cpu = grp->faults + NR_NUMA_HINT_FAULT_TYPES * 2467 | + nr_node_ids; 2468 | + 2469 | + for (i = 0; i < NR_NUMA_HINT_FAULT_STATS * nr_node_ids; i++) 2470 | + grp->faults[i] = p->numa_faults[i]; 2471 | + 2472 | + grp->total_faults = p->total_numa_faults; 2473 | + 2474 | + grp->nr_tasks++; 2475 | + rcu_assign_pointer(p->numa_group, grp); 2476 | + } 2477 | + 2478 | + rcu_read_lock(); 2479 | + tsk = READ_ONCE(cpu_rq(cpu)->curr); 2480 | + 2481 | + if (!cpupid_match_pid(tsk, cpupid)) 2482 | + goto no_join; 2483 | + 2484 | + grp = rcu_dereference(tsk->numa_group); 2485 | + if (!grp) 2486 | + goto no_join; 2487 | + 2488 | + my_grp = deref_curr_numa_group(p); 2489 | + if (grp == my_grp) 2490 | + goto no_join; 2491 | + 2492 | + /* 2493 | + * Only join the other group if its bigger; if we're the bigger group, 2494 | + * the other task will join us. 2495 | + */ 2496 | + if (my_grp->nr_tasks > grp->nr_tasks) 2497 | + goto no_join; 2498 | + 2499 | + /* 2500 | + * Tie-break on the grp address. 2501 | + */ 2502 | + if (my_grp->nr_tasks == grp->nr_tasks && my_grp > grp) 2503 | + goto no_join; 2504 | + 2505 | + /* Always join threads in the same process. 
*/ 2506 | + if (tsk->mm == current->mm) 2507 | + join = true; 2508 | + 2509 | + /* Simple filter to avoid false positives due to PID collisions */ 2510 | + if (flags & TNF_SHARED) 2511 | + join = true; 2512 | + 2513 | + /* Update priv based on whether false sharing was detected */ 2514 | + *priv = !join; 2515 | + 2516 | + if (join && !get_numa_group(grp)) 2517 | + goto no_join; 2518 | + 2519 | + rcu_read_unlock(); 2520 | + 2521 | + if (!join) 2522 | + return; 2523 | + 2524 | + BUG_ON(irqs_disabled()); 2525 | + double_lock_irq(&my_grp->lock, &grp->lock); 2526 | + 2527 | + for (i = 0; i < NR_NUMA_HINT_FAULT_STATS * nr_node_ids; i++) { 2528 | + my_grp->faults[i] -= p->numa_faults[i]; 2529 | + grp->faults[i] += p->numa_faults[i]; 2530 | + } 2531 | + my_grp->total_faults -= p->total_numa_faults; 2532 | + grp->total_faults += p->total_numa_faults; 2533 | + 2534 | + my_grp->nr_tasks--; 2535 | + grp->nr_tasks++; 2536 | + 2537 | + spin_unlock(&my_grp->lock); 2538 | + spin_unlock_irq(&grp->lock); 2539 | + 2540 | + rcu_assign_pointer(p->numa_group, grp); 2541 | + 2542 | + put_numa_group(my_grp); 2543 | + return; 2544 | + 2545 | +no_join: 2546 | + rcu_read_unlock(); 2547 | + return; 2548 | +} 2549 | + 2550 | +/* 2551 | + * Get rid of NUMA statistics associated with a task (either current or dead). 2552 | + * If @final is set, the task is dead and has reached refcount zero, so we can 2553 | + * safely free all relevant data structures. Otherwise, there might be 2554 | + * concurrent reads from places like load balancing and procfs, and we should 2555 | + * reset the data back to default state without freeing ->numa_faults. 2556 | + */ 2557 | +void task_numa_free(struct task_struct *p, bool final) 2558 | +{ 2559 | + /* safe: p either is current or is being freed by current */ 2560 | + struct numa_group *grp = rcu_dereference_raw(p->numa_group); 2561 | + unsigned long *numa_faults = p->numa_faults; 2562 | + unsigned long flags; 2563 | + int i; 2564 | + 2565 | + if (!numa_faults) 2566 | + return; 2567 | + 2568 | + if (grp) { 2569 | + spin_lock_irqsave(&grp->lock, flags); 2570 | + for (i = 0; i < NR_NUMA_HINT_FAULT_STATS * nr_node_ids; i++) 2571 | + grp->faults[i] -= p->numa_faults[i]; 2572 | + grp->total_faults -= p->total_numa_faults; 2573 | + 2574 | + grp->nr_tasks--; 2575 | + spin_unlock_irqrestore(&grp->lock, flags); 2576 | + RCU_INIT_POINTER(p->numa_group, NULL); 2577 | + put_numa_group(grp); 2578 | + } 2579 | + 2580 | + if (final) { 2581 | + p->numa_faults = NULL; 2582 | + kfree(numa_faults); 2583 | + } else { 2584 | + p->total_numa_faults = 0; 2585 | + for (i = 0; i < NR_NUMA_HINT_FAULT_STATS * nr_node_ids; i++) 2586 | + numa_faults[i] = 0; 2587 | + } 2588 | +} 2589 | + 2590 | +/* 2591 | + * Got a PROT_NONE fault for a page on @node. 
2592 | + */ 2593 | +void task_numa_fault(int last_cpupid, int mem_node, int pages, int flags) 2594 | +{ 2595 | + struct task_struct *p = current; 2596 | + bool migrated = flags & TNF_MIGRATED; 2597 | + int cpu_node = task_node(current); 2598 | + int local = !!(flags & TNF_FAULT_LOCAL); 2599 | + struct numa_group *ng; 2600 | + int priv; 2601 | + 2602 | + if (!static_branch_likely(&sched_numa_balancing)) 2603 | + return; 2604 | + 2605 | + /* for example, ksmd faulting in a user's mm */ 2606 | + if (!p->mm) 2607 | + return; 2608 | + 2609 | + /* Allocate buffer to track faults on a per-node basis */ 2610 | + if (unlikely(!p->numa_faults)) { 2611 | + int size = sizeof(*p->numa_faults) * 2612 | + NR_NUMA_HINT_FAULT_BUCKETS * nr_node_ids; 2613 | + 2614 | + p->numa_faults = kzalloc(size, GFP_KERNEL|__GFP_NOWARN); 2615 | + if (!p->numa_faults) 2616 | + return; 2617 | + 2618 | + p->total_numa_faults = 0; 2619 | + memset(p->numa_faults_locality, 0, sizeof(p->numa_faults_locality)); 2620 | + } 2621 | + 2622 | + /* 2623 | + * First accesses are treated as private, otherwise consider accesses 2624 | + * to be private if the accessing pid has not changed 2625 | + */ 2626 | + if (unlikely(last_cpupid == (-1 & LAST_CPUPID_MASK))) { 2627 | + priv = 1; 2628 | + } else { 2629 | + priv = cpupid_match_pid(p, last_cpupid); 2630 | + if (!priv && !(flags & TNF_NO_GROUP)) 2631 | + task_numa_group(p, last_cpupid, flags, &priv); 2632 | + } 2633 | + 2634 | + /* 2635 | + * If a workload spans multiple NUMA nodes, a shared fault that 2636 | + * occurs wholly within the set of nodes that the workload is 2637 | + * actively using should be counted as local. This allows the 2638 | + * scan rate to slow down when a workload has settled down. 2639 | + */ 2640 | + ng = deref_curr_numa_group(p); 2641 | + if (!priv && !local && ng && ng->active_nodes > 1 && 2642 | + numa_is_active_node(cpu_node, ng) && 2643 | + numa_is_active_node(mem_node, ng)) 2644 | + local = 1; 2645 | + 2646 | + /* 2647 | + * Retry to migrate task to preferred node periodically, in case it 2648 | + * previously failed, or the scheduler moved us. 2649 | + */ 2650 | + if (time_after(jiffies, p->numa_migrate_retry)) { 2651 | + task_numa_placement(p); 2652 | + numa_migrate_preferred(p); 2653 | + } 2654 | + 2655 | + if (migrated) 2656 | + p->numa_pages_migrated += pages; 2657 | + if (flags & TNF_MIGRATE_FAIL) 2658 | + p->numa_faults_locality[2] += pages; 2659 | + 2660 | + p->numa_faults[task_faults_idx(NUMA_MEMBUF, mem_node, priv)] += pages; 2661 | + p->numa_faults[task_faults_idx(NUMA_CPUBUF, cpu_node, priv)] += pages; 2662 | + p->numa_faults_locality[local] += pages; 2663 | +} 2664 | + 2665 | +static void reset_ptenuma_scan(struct task_struct *p) 2666 | +{ 2667 | + /* 2668 | + * We only did a read acquisition of the mmap sem, so 2669 | + * p->mm->numa_scan_seq is written to without exclusive access 2670 | + * and the update is not guaranteed to be atomic. That's not 2671 | + * much of an issue though, since this is just used for 2672 | + * statistical sampling. Use READ_ONCE/WRITE_ONCE, which are not 2673 | + * expensive, to avoid any form of compiler optimizations: 2674 | + */ 2675 | + WRITE_ONCE(p->mm->numa_scan_seq, READ_ONCE(p->mm->numa_scan_seq) + 1); 2676 | + p->mm->numa_scan_offset = 0; 2677 | +} 2678 | + 2679 | +/* 2680 | + * The expensive part of numa migration is done from task_work context. 2681 | + * Triggered from task_tick_numa(). 
2682 | + */ 2683 | +static void task_numa_work(struct callback_head *work) 2684 | +{ 2685 | + unsigned long migrate, next_scan, now = jiffies; 2686 | + struct task_struct *p = current; 2687 | + struct mm_struct *mm = p->mm; 2688 | + u64 runtime = p->se.sum_exec_runtime; 2689 | + struct vm_area_struct *vma; 2690 | + unsigned long start, end; 2691 | + unsigned long nr_pte_updates = 0; 2692 | + long pages, virtpages; 2693 | + 2694 | + SCHED_WARN_ON(p != container_of(work, struct task_struct, numa_work)); 2695 | + 2696 | + work->next = work; 2697 | + /* 2698 | + * Who cares about NUMA placement when they're dying. 2699 | + * 2700 | + * NOTE: make sure not to dereference p->mm before this check, 2701 | + * exit_task_work() happens _after_ exit_mm() so we could be called 2702 | + * without p->mm even though we still had it when we enqueued this 2703 | + * work. 2704 | + */ 2705 | + if (p->flags & PF_EXITING) 2706 | + return; 2707 | + 2708 | + if (!mm->numa_next_scan) { 2709 | + mm->numa_next_scan = now + 2710 | + msecs_to_jiffies(sysctl_numa_balancing_scan_delay); 2711 | + } 2712 | + 2713 | + /* 2714 | + * Enforce maximal scan/migration frequency.. 2715 | + */ 2716 | + migrate = mm->numa_next_scan; 2717 | + if (time_before(now, migrate)) 2718 | + return; 2719 | + 2720 | + if (p->numa_scan_period == 0) { 2721 | + p->numa_scan_period_max = task_scan_max(p); 2722 | + p->numa_scan_period = task_scan_start(p); 2723 | + } 2724 | + 2725 | + next_scan = now + msecs_to_jiffies(p->numa_scan_period); 2726 | + if (cmpxchg(&mm->numa_next_scan, migrate, next_scan) != migrate) 2727 | + return; 2728 | + 2729 | + /* 2730 | + * Delay this task enough that another task of this mm will likely win 2731 | + * the next time around. 2732 | + */ 2733 | + p->node_stamp += 2 * TICK_NSEC; 2734 | + 2735 | + start = mm->numa_scan_offset; 2736 | + pages = sysctl_numa_balancing_scan_size; 2737 | + pages <<= 20 - PAGE_SHIFT; /* MB in pages */ 2738 | + virtpages = pages * 8; /* Scan up to this much virtual space */ 2739 | + if (!pages) 2740 | + return; 2741 | + 2742 | + 2743 | + if (!mmap_read_trylock(mm)) 2744 | + return; 2745 | + vma = find_vma(mm, start); 2746 | + if (!vma) { 2747 | + reset_ptenuma_scan(p); 2748 | + start = 0; 2749 | + vma = mm->mmap; 2750 | + } 2751 | + for (; vma; vma = vma->vm_next) { 2752 | + if (!vma_migratable(vma) || !vma_policy_mof(vma) || 2753 | + is_vm_hugetlb_page(vma) || (vma->vm_flags & VM_MIXEDMAP)) { 2754 | + continue; 2755 | + } 2756 | + 2757 | + /* 2758 | + * Shared library pages mapped by multiple processes are not 2759 | + * migrated as it is expected they are cache replicated. Avoid 2760 | + * hinting faults in read-only file-backed mappings or the vdso 2761 | + * as migrating the pages will be of marginal benefit. 2762 | + */ 2763 | + if (!vma->vm_mm || 2764 | + (vma->vm_file && (vma->vm_flags & (VM_READ|VM_WRITE)) == (VM_READ))) 2765 | + continue; 2766 | + 2767 | + /* 2768 | + * Skip inaccessible VMAs to avoid any confusion between 2769 | + * PROT_NONE and NUMA hinting ptes 2770 | + */ 2771 | + if (!vma_is_accessible(vma)) 2772 | + continue; 2773 | + 2774 | + do { 2775 | + start = max(start, vma->vm_start); 2776 | + end = ALIGN(start + (pages << PAGE_SHIFT), HPAGE_SIZE); 2777 | + end = min(end, vma->vm_end); 2778 | + nr_pte_updates = change_prot_numa(vma, start, end); 2779 | + 2780 | + /* 2781 | + * Try to scan sysctl_numa_balancing_size worth of 2782 | + * hpages that have at least one present PTE that 2783 | + * is not already pte-numa. 
If the VMA contains 2784 | + * areas that are unused or already full of prot_numa 2785 | + * PTEs, scan up to virtpages, to skip through those 2786 | + * areas faster. 2787 | + */ 2788 | + if (nr_pte_updates) 2789 | + pages -= (end - start) >> PAGE_SHIFT; 2790 | + virtpages -= (end - start) >> PAGE_SHIFT; 2791 | + 2792 | + start = end; 2793 | + if (pages <= 0 || virtpages <= 0) 2794 | + goto out; 2795 | + 2796 | + cond_resched(); 2797 | + } while (end != vma->vm_end); 2798 | + } 2799 | + 2800 | +out: 2801 | + /* 2802 | + * It is possible to reach the end of the VMA list but the last few 2803 | + * VMAs are not guaranteed to the vma_migratable. If they are not, we 2804 | + * would find the !migratable VMA on the next scan but not reset the 2805 | + * scanner to the start so check it now. 2806 | + */ 2807 | + if (vma) 2808 | + mm->numa_scan_offset = start; 2809 | + else 2810 | + reset_ptenuma_scan(p); 2811 | + mmap_read_unlock(mm); 2812 | + 2813 | + /* 2814 | + * Make sure tasks use at least 32x as much time to run other code 2815 | + * than they used here, to limit NUMA PTE scanning overhead to 3% max. 2816 | + * Usually update_task_scan_period slows down scanning enough; on an 2817 | + * overloaded system we need to limit overhead on a per task basis. 2818 | + */ 2819 | + if (unlikely(p->se.sum_exec_runtime != runtime)) { 2820 | + u64 diff = p->se.sum_exec_runtime - runtime; 2821 | + p->node_stamp += 32 * diff; 2822 | + } 2823 | +} 2824 | + 2825 | +void init_numa_balancing(unsigned long clone_flags, struct task_struct *p) 2826 | +{ 2827 | + int mm_users = 0; 2828 | + struct mm_struct *mm = p->mm; 2829 | + 2830 | + if (mm) { 2831 | + mm_users = atomic_read(&mm->mm_users); 2832 | + if (mm_users == 1) { 2833 | + mm->numa_next_scan = jiffies + msecs_to_jiffies(sysctl_numa_balancing_scan_delay); 2834 | + mm->numa_scan_seq = 0; 2835 | + } 2836 | + } 2837 | + p->node_stamp = 0; 2838 | + p->numa_scan_seq = mm ? mm->numa_scan_seq : 0; 2839 | + p->numa_scan_period = sysctl_numa_balancing_scan_delay; 2840 | + /* Protect against double add, see task_tick_numa and task_numa_work */ 2841 | + p->numa_work.next = &p->numa_work; 2842 | + p->numa_faults = NULL; 2843 | + RCU_INIT_POINTER(p->numa_group, NULL); 2844 | + p->last_task_numa_placement = 0; 2845 | + p->last_sum_exec_runtime = 0; 2846 | + 2847 | + init_task_work(&p->numa_work, task_numa_work); 2848 | + 2849 | + /* New address space, reset the preferred nid */ 2850 | + if (!(clone_flags & CLONE_VM)) { 2851 | + p->numa_preferred_nid = NUMA_NO_NODE; 2852 | + return; 2853 | + } 2854 | + 2855 | + /* 2856 | + * New thread, keep existing numa_preferred_nid which should be copied 2857 | + * already by arch_dup_task_struct but stagger when scans start. 2858 | + */ 2859 | + if (mm) { 2860 | + unsigned int delay; 2861 | + 2862 | + delay = min_t(unsigned int, task_scan_max(current), 2863 | + current->numa_scan_period * mm_users * NSEC_PER_MSEC); 2864 | + delay += 2 * TICK_NSEC; 2865 | + p->node_stamp = delay; 2866 | + } 2867 | +} 2868 | + 2869 | +static void task_tick_numa(struct rq *rq, struct task_struct *curr) 2870 | +{ 2871 | + struct callback_head *work = &curr->numa_work; 2872 | + u64 period, now; 2873 | + 2874 | + /* 2875 | + * We don't care about NUMA placement if we don't have memory. 
2876 | + */ 2877 | + if ((curr->flags & (PF_EXITING | PF_KTHREAD)) || work->next != work) 2878 | + return; 2879 | + 2880 | + /* 2881 | + * Using runtime rather than walltime has the dual advantage that 2882 | + * we (mostly) drive the selection from busy threads and that the 2883 | + * task needs to have done some actual work before we bother with 2884 | + * NUMA placement. 2885 | + */ 2886 | + now = curr->se.sum_exec_runtime; 2887 | + period = (u64)curr->numa_scan_period * NSEC_PER_MSEC; 2888 | + 2889 | + if (now > curr->node_stamp + period) { 2890 | + if (!curr->node_stamp) 2891 | + curr->numa_scan_period = task_scan_start(curr); 2892 | + curr->node_stamp += period; 2893 | + 2894 | + if (!time_before(jiffies, curr->mm->numa_next_scan)) 2895 | + task_work_add(curr, work, TWA_RESUME); 2896 | + } 2897 | +} 2898 | +#else 2899 | +static void account_numa_enqueue(struct rq *rq, struct task_struct *p) {} 2900 | +static inline void account_numa_dequeue(struct rq *rq, struct task_struct *p) {} 2901 | +static inline void update_scan_period(struct task_struct *p, int new_cpu) {} 2902 | +static void task_tick_numa(struct rq *rq, struct task_struct *curr) {} 2903 | +#endif /** CONFIG_NUMA_BALANCING */ 2904 | + 2905 | diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h 2906 | index d53d19770866..693683b31f5f 100644 2907 | --- a/kernel/sched/sched.h 2908 | +++ b/kernel/sched/sched.h 2909 | @@ -545,9 +545,14 @@ struct cfs_rq { 2910 | * It is set to NULL otherwise (i.e when none are currently running). 2911 | */ 2912 | struct sched_entity *curr; 2913 | +#ifdef CONFIG_BS_SCHED 2914 | + struct bs_node *head; 2915 | + struct bs_node *cursor; 2916 | +#else 2917 | struct sched_entity *next; 2918 | struct sched_entity *last; 2919 | struct sched_entity *skip; 2920 | +#endif /** CONFIG_BS_SCHED */ 2921 | 2922 | #ifdef CONFIG_SCHED_DEBUG 2923 | unsigned int nr_spread_over; 2924 | --------------------------------------------------------------------------------