├── .config
├── README.md
├── baby-cr-5.14.patch
├── baby-dl-5.14.patch
├── baby-hrrn-5.14.patch
├── baby-mlfq-5.14.patch
├── baby-rr-5.14.patch
└── baby-vrt-5.14.patch

/README.md:
--------------------------------------------------------------------------------
1 | # Baby-CPU-Scheduler
2 | 
3 | This is a very basic and lightweight, yet very performant, CPU scheduler.
4 | You can use it for learning purposes as a baseline CPU scheduler on
5 | Linux. Note that many features are disabled to keep this scheduler as
6 | simple as possible.
7 | 
8 | Baby Scheduler is very lightweight and powerful for
9 | normal usage. I am using the deadline flavor (`baby-dl`) as my main scheduler, and sometimes I
10 | switch back to CacULE for testing. Throughput in Baby Scheduler is
11 | higher due to the task load balancer that I made for it. Load
12 | balancing is done by only one CPU, CPU0, which scans
13 | all other CPUs and moves one task on every tick. The balancing depends
14 | only on the number of tasks (i.e. no load, weight, or other factors).
15 | 
16 | 
17 | Baby Scheduler is only 1036 LOC, of which 254 LOC are dependent
18 | functions copied from CFS without any changes, just to let the
19 | scheduler compile and run. You can find all of Baby's code in
20 | `bs.c`, `bs.h`, and `fair_numa.h`.
21 | 
22 | ## Flavors
23 | Currently there are three flavors of Baby Scheduler:
24 | * Deadline Scheduling (dl) - main
25 | * Virtual Run Time Scheduling (vrt)
26 | * Round Robin Scheduling (rr)
27 | 
28 | All three flavors share the same task load-balancing method.
29 | They differ only in the strategy for picking the next task to run, plus some other minor differences.
30 | 
31 | ### Deadline Scheduling
32 | Baby's main flavor is deadline scheduling. The scheduler picks the task with the earliest deadline.
33 | A new task gets `deadline = now() + 1.180ms`. The task with the earliest deadline is picked and runs
34 | on the CPU until it sleeps or another task has an earlier deadline. While a task is running, its
35 | deadline is updated only once it has passed, compared with the current time (now). Basically,
36 | we do not want to update its deadline with `now() + 1.180ms` on every tick; otherwise we end up in
37 | what I call the horse-and-carrot situation. I also use the deadline as a slice: we do not want to
38 | keep preempting tasks whose deadlines are very close to each other. The best solution is to give the
39 | running task a minimum/maximum slice so that its next updated deadline is not too close to the
40 | competing task's. This saves some context switching. In any case, our deadline/slice is not too high; it is only 1.180ms.
41 | 
42 | ### Virtual Run Time Scheduling
43 | The scheduler picks the next task with the
44 | least vruntime; however, all of CFS's load/weight handling for task priority is
45 | replaced with a simpler mechanism. Task priorities are injected into
46 | vruntime: a NICE0-priority task has a vruntime = real_exec_time,
47 | while a NICE-20 task has a vruntime < real_exec_time, such that the NICE-20 task
48 | will run 20 more milliseconds than a NICE0 one in a race time of 40ms.
49 | See the equation in kernel/sched/bs.c:convert_to_vruntime.
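
For concreteness, below is a standalone C sketch of that conversion, modeled on `convert_to_vruntime()` as it appears in `kernel/sched/bs.c` of the hrrn patch further down in this repo (the vrt flavor refers to the same function). It assumes the `HZ_803` default that the patches introduce; the `main()` demo is illustrative only and is not part of the kernel code.

```c
#include <stdio.h>
#include <stdint.h>

/* Assumed tick rate: the patches make HZ_803 the default. */
#define HZ        803
#define HZ_PERIOD (1000000000ULL / HZ)     /* ns per tick (~1.245 ms)      */
#define RACE_TIME 40000000ULL              /* 40 ms "race" window, in ns   */
#define FACTOR    (RACE_TIME / HZ_PERIOD)  /* ~32 ticks per race window    */

/*
 * Userspace model of convert_to_vruntime(): a nice-0 task is charged its
 * real execution time, a negative-nice task is charged less, and a
 * positive-nice task is charged more.
 */
static uint64_t convert_to_vruntime(uint64_t delta_ns, int nice)
{
	int64_t prio_diff;

	if (nice == 0)
		return delta_ns;

	prio_diff = (int64_t)nice * 1000000;   /* nice * 1 ms, in ns          */
	prio_diff /= (int64_t)FACTOR;          /* spread over the race window */

	if ((int64_t)(delta_ns + prio_diff) < 0)
		return 1;

	return delta_ns + prio_diff;
}

int main(void)
{
	uint64_t vr0 = 0, vr20 = 0;

	/* Account one race window (~32 ticks) for a nice 0 and a nice -20 task. */
	for (uint64_t i = 0; i < FACTOR; i++) {
		vr0  += convert_to_vruntime(HZ_PERIOD, 0);
		vr20 += convert_to_vruntime(HZ_PERIOD, -20);
	}

	printf("nice   0 charged: %llu ns\n", (unsigned long long)vr0);
	printf("nice -20 charged: %llu ns\n", (unsigned long long)vr20);
	return 0;
}
```

Over the 40ms window the NICE-20 task accumulates roughly 20ms less vruntime than the NICE0 task, which is where the "20 more milliseconds ... in a race time of 40ms" above comes from.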
50 | 
51 | ### Round Robin Scheduling
52 | 
53 | ## Patch and Compile
54 | ### Patch
55 | First, you need to patch the kernel source with one of the flavors of Baby Scheduler. See an example of patching the Linux kernel [here](https://github.com/hamadmarri/cacule-cpu-scheduler#how-to-apply-the-patch). You can use the same method to patch with Baby instead of CacULE.
56 | 
57 | 1. Download the Linux kernel (https://www.kernel.org/) that is the same version as the patch (i.e. if the patch file name is baby-5.14.patch, then download https://cdn.kernel.org/pub/linux/kernel/v5.x/linux-5.14.9.tar.xz)
58 | 2. Extract the kernel source
59 | 3. Download the Baby patch file and place it inside the just-extracted kernel source folder
60 | 4. cd linux-(version)
61 | 5. patch -p1 < baby-5.14.patch
62 | 
63 | 
64 | ### Configure
65 | Baby uses periodic HZ only, and you need to disable other scheduler features such as `CGROUP_SCHED` and stat/debugging features.
66 | 
67 | Run `make menuconfig`
68 | and choose `CONFIG_HZ_PERIODIC`.
69 | 
70 | You should see this when you run `cat .config | grep -i hz`:
71 | 
72 | ```
73 | CONFIG_HZ_PERIODIC=y
74 | # CONFIG_NO_HZ_IDLE is not set
75 | # CONFIG_NO_HZ_FULL is not set
76 | # CONFIG_NO_HZ is not set
77 | ```
78 | 
79 | Then disable the following:
80 | * CONFIG_EXPERT
81 | * CONFIG_DEBUG_KERNEL
82 | * CONFIG_SCHED_DEBUG
83 | * CONFIG_SCHEDSTATS
84 | * NO_HZ
85 | * SCHED_AUTOGROUP
86 | * CGROUP_SCHED
87 | * UCLAMP_TASK
88 | * SCHED_CORE
89 | 
90 | 
91 | Make sure that `CONFIG_BS_SCHED` is selected (it appears at the top when running `make menuconfig`).
92 | Confirm by running `cat .config | grep -i bs_sched`:
93 | ```
94 | CONFIG_BS_SCHED=y
95 | ```
96 | 
97 | Now compile: `make` \
98 | Then install the modules: `sudo make modules_install` \
99 | Then install the kernel: `sudo make install` \
100 | Reboot and choose the new kernel.
101 | 
102 | To confirm that Baby is currently running:
103 | ```
104 | $ dmesg | grep "Baby CPU"
105 | Baby CPU scheduler (dl) v5.14 by Hamad Al Marri.
106 | ```
107 | 
--------------------------------------------------------------------------------
/baby-hrrn-5.14.patch:
--------------------------------------------------------------------------------
1 | diff --git a/include/linux/sched.h b/include/linux/sched.h
2 | index f6935787e7e8..2d668efc68f5 100644
3 | --- a/include/linux/sched.h
4 | +++ b/include/linux/sched.h
5 | @@ -462,6 +462,15 @@ struct sched_statistics {
6 | #endif
7 | };
8 | 
9 | +#ifdef CONFIG_BS_SCHED
10 | +struct bs_node {
11 | + struct bs_node* next;
12 | + struct bs_node* prev;
13 | + u64 hrrn_start_time;
14 | + u64 vruntime;
15 | +};
16 | +#endif
17 | +
18 | struct sched_entity {
19 | /* For load-balancing: */
20 | struct load_weight load;
21 | @@ -469,6 +478,10 @@ struct sched_entity {
22 | struct list_head group_node;
23 | unsigned int on_rq;
24 | 
25 | +#ifdef CONFIG_BS_SCHED
26 | + struct bs_node bs_node;
27 | +#endif
28 | +
29 | u64 exec_start;
30 | u64 sum_exec_runtime;
31 | u64 vruntime;
32 | diff --git a/init/Kconfig b/init/Kconfig
33 | index 55f9f7738ebb..6948dd798f43 100644
34 | --- a/init/Kconfig
35 | +++ b/init/Kconfig
36 | @@ -105,6 +105,34 @@ config THREAD_INFO_IN_TASK
37 | One subtle change that will be needed is to use try_get_task_stack()
38 | and put_task_stack() in save_thread_stack_tsk() and get_wchan().
39 | 
40 | +config BS_SCHED
41 | + bool "Baby Scheduler"
42 | + default y
43 | + select TICK_CPU_ACCOUNTING
44 | + select PREEMPT
45 | + help
46 | + This is a very basic and lightweight yet very performant CPU scheduler.
47 | + You can use it for learning purposes as a base ground CPU scheduler on
48 | + Linux. Notice that many features are disabled to make this scheduler as
49 | + simple as possible. Baby Scheduler is very lightweight and powerful for
50 | + normal usage. 
I am using it as my main scheduler and sometimes I 51 | + switch back to CacULE for testing. The throughput in Baby Scheduler is 52 | + higher due to the task loadbalancer that I made for Baby Scheduler. The 53 | + loadbalancing is done via only one CPU which is CPU0 in which CPU0 scan 54 | + all other CPUs and move one task in every tick. The balancing is only 55 | + depending on the number of tasks (i.e. no load, weight or other factors). 56 | + Baby scheduler is only 1036 LOC where 254 LOC of it are just dependent 57 | + functions that are copied from CFS without any changes to let the 58 | + scheduler compile and run. You can find all Baby code is reduced in 59 | + bs.c, bs.h, and numa_fair.h. Baby scheduler picks next task that has 60 | + least vruntime, however, all CFS load/weight for task priority are 61 | + replaced with a simpler mechanism. Tasks priorities are injected in 62 | + vruntime where NICE0 priority task has a vruntime = real_exec_time, 63 | + but NICE-20 task has a vruntime < real_exec_time in which NICE-20 task 64 | + will run 20 more milliseconds than NICE0 one in a race time of 40ms. 65 | + See the equation in kernel/sched/bs.c:convert_to_vruntime. 66 | + 67 | + 68 | menu "General setup" 69 | 70 | config BROKEN 71 | @@ -789,6 +817,7 @@ menu "Scheduler features" 72 | config UCLAMP_TASK 73 | bool "Enable utilization clamping for RT/FAIR tasks" 74 | depends on CPU_FREQ_GOV_SCHEDUTIL 75 | + depends on !BS_SCHED 76 | help 77 | This feature enables the scheduler to track the clamped utilization 78 | of each CPU based on RUNNABLE tasks scheduled on that CPU. 79 | @@ -954,6 +983,7 @@ config CGROUP_WRITEBACK 80 | 81 | menuconfig CGROUP_SCHED 82 | bool "CPU controller" 83 | + depends on !BS_SCHED 84 | default n 85 | help 86 | This feature lets CPU scheduler recognize task groups and control CPU 87 | @@ -1231,6 +1261,8 @@ config CHECKPOINT_RESTORE 88 | 89 | config SCHED_AUTOGROUP 90 | bool "Automatic process group scheduling" 91 | + default n 92 | + depends on !BS_SCHED 93 | select CGROUPS 94 | select CGROUP_SCHED 95 | select FAIR_GROUP_SCHED 96 | @@ -1420,6 +1452,7 @@ config BPF 97 | menuconfig EXPERT 98 | bool "Configure standard kernel features (expert users)" 99 | # Unhide debug options, to make the on-by-default options visible 100 | + depends on !BS_SCHED 101 | select DEBUG_KERNEL 102 | help 103 | This option allows certain base kernel options and settings 104 | diff --git a/kernel/Kconfig.hz b/kernel/Kconfig.hz 105 | index 38ef6d06888e..0897412321fa 100644 106 | --- a/kernel/Kconfig.hz 107 | +++ b/kernel/Kconfig.hz 108 | @@ -5,7 +5,7 @@ 109 | 110 | choice 111 | prompt "Timer frequency" 112 | - default HZ_250 113 | + default HZ_803 114 | help 115 | Allows the configuration of the timer frequency. It is customary 116 | to have the timer interrupt run at 1000 Hz but 100 Hz may be more 117 | @@ -40,6 +40,9 @@ choice 118 | on SMP and NUMA systems and exactly dividing by both PAL and 119 | NTSC frame rates for video and multimedia work. 
120 | 121 | + config HZ_803 122 | + bool "803 HZ" 123 | + 124 | config HZ_1000 125 | bool "1000 HZ" 126 | help 127 | @@ -53,6 +56,7 @@ config HZ 128 | default 100 if HZ_100 129 | default 250 if HZ_250 130 | default 300 if HZ_300 131 | + default 803 if HZ_803 132 | default 1000 if HZ_1000 133 | 134 | config SCHED_HRTICK 135 | diff --git a/kernel/Kconfig.preempt b/kernel/Kconfig.preempt 136 | index 5876e30c5740..f5a195470121 100644 137 | --- a/kernel/Kconfig.preempt 138 | +++ b/kernel/Kconfig.preempt 139 | @@ -2,7 +2,7 @@ 140 | 141 | choice 142 | prompt "Preemption Model" 143 | - default PREEMPT_NONE 144 | + default PREEMPT 145 | 146 | config PREEMPT_NONE 147 | bool "No Forced Preemption (Server)" 148 | @@ -103,6 +103,7 @@ config PREEMPT_DYNAMIC 149 | config SCHED_CORE 150 | bool "Core Scheduling for SMT" 151 | depends on SCHED_SMT 152 | + depends on !BS_SCHED 153 | help 154 | This option permits Core Scheduling, a means of coordinated task 155 | selection across SMT siblings. When enabled -- see 156 | diff --git a/kernel/sched/Makefile b/kernel/sched/Makefile 157 | index 978fcfca5871..464b134de739 100644 158 | --- a/kernel/sched/Makefile 159 | +++ b/kernel/sched/Makefile 160 | @@ -23,7 +23,7 @@ CFLAGS_core.o := $(PROFILING) -fno-omit-frame-pointer 161 | endif 162 | 163 | obj-y += core.o loadavg.o clock.o cputime.o 164 | -obj-y += idle.o fair.o rt.o deadline.o 165 | +obj-y += idle.o bs.o rt.o deadline.o 166 | obj-y += wait.o wait_bit.o swait.o completion.o 167 | 168 | obj-$(CONFIG_SMP) += cpupri.o cpudeadline.o topology.o stop_task.o pelt.o 169 | diff --git a/kernel/sched/bs.c b/kernel/sched/bs.c 170 | new file mode 100644 171 | index 000000000000..0dcd34ede851 172 | --- /dev/null 173 | +++ b/kernel/sched/bs.c 174 | @@ -0,0 +1,777 @@ 175 | +// SPDX-License-Identifier: GPL-2.0 176 | +/* 177 | + * Basic Scheduler (BS) Class (SCHED_NORMAL/SCHED_BATCH) 178 | + * 179 | + * Copyright (C) 2021, Hamad Al Marri 180 | + */ 181 | +#include "sched.h" 182 | +#include "pelt.h" 183 | +#include "fair_numa.h" 184 | +#include "bs.h" 185 | + 186 | +u64 sched_granularity = 590000ULL; 187 | + 188 | +#define HZ_PERIOD (1000000000 / HZ) 189 | +#define RACE_TIME 40000000 190 | +#define FACTOR (RACE_TIME / HZ_PERIOD) 191 | + 192 | +#define YIELD_MARK(bsn) ((bsn)->vruntime |= 0x8000000000000000ULL) 193 | +#define YIELD_UNMARK(bsn) ((bsn)->vruntime &= 0x7FFFFFFFFFFFFFFFULL) 194 | + 195 | +#define HRRN_MAX_LIFE_NS 5000000000ULL 196 | + 197 | +static u64 convert_to_vruntime(u64 delta, struct sched_entity *se) 198 | +{ 199 | + struct task_struct *p = task_of(se); 200 | + s64 prio_diff; 201 | + 202 | + if (PRIO_TO_NICE(p->prio) == 0) 203 | + return delta; 204 | + 205 | + prio_diff = PRIO_TO_NICE(p->prio) * 1000000; 206 | + prio_diff /= FACTOR; 207 | + 208 | + if ((s64)(delta + prio_diff) < 0) 209 | + return 1; 210 | + 211 | + return delta + prio_diff; 212 | +} 213 | + 214 | +static inline void normalize_lifetime(u64 now, struct bs_node *bsn) 215 | +{ 216 | + u64 life_time, old_hrrn_x; 217 | + s64 diff; 218 | + 219 | + life_time = now - bsn->hrrn_start_time; 220 | + diff = life_time - HRRN_MAX_LIFE_NS; 221 | + 222 | + if (diff > 0) { 223 | + // unmark YIELD. 
No need to check or remark since 224 | + // this normalize action doesn't happen very often 225 | + YIELD_UNMARK(bsn); 226 | + 227 | + // multiply life_time by 1024 for more precision 228 | + old_hrrn_x = (life_time << 7) / ((bsn->vruntime >> 3) | 1); 229 | + 230 | + // reset life to half max_life (i.e ~2.5s) 231 | + bsn->hrrn_start_time = now - (HRRN_MAX_LIFE_NS >> 1); 232 | + 233 | + // avoid division by zero 234 | + if (old_hrrn_x == 0) old_hrrn_x = 1; 235 | + 236 | + // reset vruntime based on old hrrn ratio 237 | + bsn->vruntime = (HRRN_MAX_LIFE_NS << 9) / old_hrrn_x; 238 | + } 239 | +} 240 | + 241 | +static void update_curr(struct cfs_rq *cfs_rq) 242 | +{ 243 | + struct sched_entity *curr = cfs_rq->curr; 244 | + u64 now = sched_clock(); 245 | + u64 delta_exec; 246 | + 247 | + if (unlikely(!curr)) 248 | + return; 249 | + 250 | + delta_exec = now - curr->exec_start; 251 | + if (unlikely((s64)delta_exec <= 0)) 252 | + return; 253 | + 254 | + curr->exec_start = now; 255 | + curr->sum_exec_runtime += delta_exec; 256 | + 257 | + curr->bs_node.vruntime += convert_to_vruntime(delta_exec, curr); 258 | + normalize_lifetime(now, &curr->bs_node); 259 | +} 260 | + 261 | +static void update_curr_fair(struct rq *rq) 262 | +{ 263 | + update_curr(cfs_rq_of(&rq->curr->se)); 264 | +} 265 | + 266 | +static inline u64 calc_hrrn(u64 now, struct bs_node *bsn) 267 | +{ 268 | + u64 l = now - bsn->hrrn_start_time; 269 | + u64 r = bsn->vruntime | 1; 270 | + 271 | + return l / r; 272 | +} 273 | + 274 | +/** 275 | + * Does a have smaller vruntime than b? 276 | + */ 277 | +static inline bool 278 | +entity_before(struct bs_node *a, struct bs_node *b) 279 | +{ 280 | + u64 a_hrrn = calc_hrrn(sched_clock(), a); 281 | + u64 b_hrrn = calc_hrrn(sched_clock(), b); 282 | + 283 | + return (s64)(a_hrrn - b_hrrn) > 0; 284 | +} 285 | + 286 | +static void __enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se) 287 | +{ 288 | + struct bs_node *bsn = &se->bs_node; 289 | + 290 | + bsn->next = bsn->prev = NULL; 291 | + 292 | + // if empty 293 | + if (!cfs_rq->head) { 294 | + cfs_rq->head = bsn; 295 | + } 296 | + else { 297 | + bsn->next = cfs_rq->head; 298 | + cfs_rq->head->prev = bsn; 299 | + cfs_rq->head = bsn; 300 | + } 301 | +} 302 | + 303 | +static void __dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se) 304 | +{ 305 | + struct bs_node *bsn = &se->bs_node; 306 | + struct bs_node *prev, *next; 307 | + 308 | + // if only one se in rq 309 | + if (cfs_rq->head->next == NULL) { 310 | + cfs_rq->head = NULL; 311 | + } 312 | + // if it is the head 313 | + else if (bsn == cfs_rq->head) { 314 | + cfs_rq->head = cfs_rq->head->next; 315 | + cfs_rq->head->prev = NULL; 316 | + } 317 | + // if in the middle 318 | + else { 319 | + prev = bsn->prev; 320 | + next = bsn->next; 321 | + 322 | + prev->next = next; 323 | + if (next) 324 | + next->prev = prev; 325 | + } 326 | +} 327 | + 328 | +static inline void 329 | +enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags) 330 | +{ 331 | + bool curr = cfs_rq->curr == se; 332 | + 333 | + update_curr(cfs_rq); 334 | + 335 | + account_entity_enqueue(cfs_rq, se); 336 | + 337 | + if (!curr) 338 | + __enqueue_entity(cfs_rq, se); 339 | + 340 | + se->on_rq = 1; 341 | +} 342 | + 343 | +static inline void 344 | +dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags) 345 | +{ 346 | + update_curr(cfs_rq); 347 | + 348 | + if (se != cfs_rq->curr) 349 | + __dequeue_entity(cfs_rq, se); 350 | + 351 | + se->on_rq = 0; 352 | + account_entity_dequeue(cfs_rq, 
se); 353 | +} 354 | + 355 | +static void 356 | +enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags) 357 | +{ 358 | + struct sched_entity *se = &p->se; 359 | + struct cfs_rq *cfs_rq = cfs_rq_of(se); 360 | + int idle_h_nr_running = task_has_idle_policy(p); 361 | + 362 | + if (!se->on_rq) { 363 | + enqueue_entity(cfs_rq, se, flags); 364 | + cfs_rq->h_nr_running++; 365 | + cfs_rq->idle_h_nr_running += idle_h_nr_running; 366 | + } 367 | + 368 | + add_nr_running(rq, 1); 369 | +} 370 | + 371 | +static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags) 372 | +{ 373 | + struct sched_entity *se = &p->se; 374 | + struct cfs_rq *cfs_rq = cfs_rq_of(se); 375 | + int idle_h_nr_running = task_has_idle_policy(p); 376 | + 377 | + dequeue_entity(cfs_rq, se, flags); 378 | + 379 | + cfs_rq->h_nr_running--; 380 | + cfs_rq->idle_h_nr_running -= idle_h_nr_running; 381 | + 382 | + sub_nr_running(rq, 1); 383 | +} 384 | + 385 | +static void yield_task_fair(struct rq *rq) 386 | +{ 387 | + struct task_struct *curr = rq->curr; 388 | + struct cfs_rq *cfs_rq = task_cfs_rq(curr); 389 | + 390 | + YIELD_MARK(&curr->se.bs_node); 391 | + 392 | + /* 393 | + * Are we the only task in the tree? 394 | + */ 395 | + if (unlikely(rq->nr_running == 1)) 396 | + return; 397 | + 398 | + if (curr->policy != SCHED_BATCH) { 399 | + update_rq_clock(rq); 400 | + /* 401 | + * Update run-time statistics of the 'current'. 402 | + */ 403 | + update_curr(cfs_rq); 404 | + /* 405 | + * Tell update_rq_clock() that we've just updated, 406 | + * so we don't do microscopic update in schedule() 407 | + * and double the fastpath cost. 408 | + */ 409 | + rq_clock_skip_update(rq); 410 | + } 411 | +} 412 | + 413 | +static bool yield_to_task_fair(struct rq *rq, struct task_struct *p) 414 | +{ 415 | + yield_task_fair(rq); 416 | + return true; 417 | +} 418 | + 419 | +static void 420 | +set_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *se) 421 | +{ 422 | + if (se->on_rq) 423 | + __dequeue_entity(cfs_rq, se); 424 | + 425 | + se->exec_start = sched_clock(); 426 | + cfs_rq->curr = se; 427 | + se->prev_sum_exec_runtime = se->sum_exec_runtime; 428 | +} 429 | + 430 | +static struct sched_entity * 431 | +pick_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr) 432 | +{ 433 | + struct bs_node *bsn = cfs_rq->head; 434 | + struct bs_node *next; 435 | + 436 | + if (!bsn) 437 | + return curr; 438 | + 439 | + next = bsn->next; 440 | + while (next) { 441 | + if (entity_before(next, bsn)) 442 | + bsn = next; 443 | + 444 | + next = next->next; 445 | + } 446 | + 447 | + if (curr && entity_before(&curr->bs_node, bsn)) 448 | + return curr; 449 | + 450 | + return se_of(bsn); 451 | +} 452 | + 453 | +struct task_struct * 454 | +pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf) 455 | +{ 456 | + struct cfs_rq *cfs_rq = &rq->cfs; 457 | + struct sched_entity *se; 458 | + struct task_struct *p; 459 | + int new_tasks; 460 | + 461 | +again: 462 | + if (!sched_fair_runnable(rq)) 463 | + goto idle; 464 | + 465 | + if (prev) 466 | + put_prev_task(rq, prev); 467 | + 468 | + se = pick_next_entity(cfs_rq, NULL); 469 | + set_next_entity(cfs_rq, se); 470 | + 471 | + p = task_of(se); 472 | + 473 | + if (prev) 474 | + YIELD_UNMARK(&prev->se.bs_node); 475 | + 476 | +done: __maybe_unused; 477 | +#ifdef CONFIG_SMP 478 | + /* 479 | + * Move the next running task to the front of 480 | + * the list, so our cfs_tasks list becomes MRU 481 | + * one. 
482 | + */ 483 | + list_move(&p->se.group_node, &rq->cfs_tasks); 484 | +#endif 485 | + 486 | + return p; 487 | + 488 | +idle: 489 | + if (!rf) 490 | + return NULL; 491 | + 492 | + new_tasks = newidle_balance(rq, rf); 493 | + 494 | + /* 495 | + * Because newidle_balance() releases (and re-acquires) rq->lock, it is 496 | + * possible for any higher priority task to appear. In that case we 497 | + * must re-start the pick_next_entity() loop. 498 | + */ 499 | + if (new_tasks < 0) 500 | + return RETRY_TASK; 501 | + 502 | + if (new_tasks > 0) 503 | + goto again; 504 | + 505 | + /* 506 | + * rq is about to be idle, check if we need to update the 507 | + * lost_idle_time of clock_pelt 508 | + */ 509 | + update_idle_rq_clock_pelt(rq); 510 | + 511 | + return NULL; 512 | +} 513 | + 514 | +static struct task_struct *__pick_next_task_fair(struct rq *rq) 515 | +{ 516 | + return pick_next_task_fair(rq, NULL, NULL); 517 | +} 518 | + 519 | +#ifdef CONFIG_SMP 520 | +static struct task_struct *pick_task_fair(struct rq *rq) 521 | +{ 522 | + struct sched_entity *se; 523 | + struct cfs_rq *cfs_rq = &rq->cfs; 524 | + struct sched_entity *curr = cfs_rq->curr; 525 | + 526 | + if (!cfs_rq->nr_running) 527 | + return NULL; 528 | + 529 | + /* When we pick for a remote RQ, we'll not have done put_prev_entity() */ 530 | + if (curr) { 531 | + if (curr->on_rq) 532 | + update_curr(cfs_rq); 533 | + else 534 | + curr = NULL; 535 | + } 536 | + 537 | + se = pick_next_entity(cfs_rq, curr); 538 | + 539 | + return task_of(se); 540 | +} 541 | +#endif 542 | + 543 | +static void put_prev_entity(struct cfs_rq *cfs_rq, struct sched_entity *prev) 544 | +{ 545 | + /* 546 | + * If still on the runqueue then deactivate_task() 547 | + * was not called and update_curr() has to be done: 548 | + */ 549 | + if (prev->on_rq) 550 | + update_curr(cfs_rq); 551 | + 552 | + if (prev->on_rq) 553 | + __enqueue_entity(cfs_rq, prev); 554 | + 555 | + cfs_rq->curr = NULL; 556 | +} 557 | + 558 | +static void put_prev_task_fair(struct rq *rq, struct task_struct *prev) 559 | +{ 560 | + struct sched_entity *se = &prev->se; 561 | + 562 | + put_prev_entity(cfs_rq_of(se), se); 563 | +} 564 | + 565 | +static void set_next_task_fair(struct rq *rq, struct task_struct *p, bool first) 566 | +{ 567 | + struct sched_entity *se = &p->se; 568 | + struct cfs_rq *cfs_rq = cfs_rq_of(se); 569 | + 570 | +#ifdef CONFIG_SMP 571 | + if (task_on_rq_queued(p)) { 572 | + /* 573 | + * Move the next running task to the front of the list, so our 574 | + * cfs_tasks list becomes MRU one. 575 | + */ 576 | + list_move(&se->group_node, &rq->cfs_tasks); 577 | + } 578 | +#endif 579 | + 580 | + set_next_entity(cfs_rq, se); 581 | +} 582 | + 583 | +static void 584 | +check_preempt_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr) 585 | +{ 586 | + if (pick_next_entity(cfs_rq, curr) != curr) 587 | + resched_curr(rq_of(cfs_rq)); 588 | +} 589 | + 590 | +static void 591 | +entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr, int queued) 592 | +{ 593 | + update_curr(cfs_rq); 594 | + 595 | + if (cfs_rq->nr_running > 1) 596 | + check_preempt_tick(cfs_rq, curr); 597 | +} 598 | + 599 | +static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_flags) 600 | +{ 601 | + struct task_struct *curr = rq->curr; 602 | + struct sched_entity *se = &curr->se, *wse = &p->se; 603 | + 604 | + if (unlikely(se == wse)) 605 | + return; 606 | + 607 | + if (test_tsk_need_resched(curr)) 608 | + return; 609 | + 610 | + /* Idle tasks are by definition preempted by non-idle tasks. 
*/ 611 | + if (unlikely(task_has_idle_policy(curr)) && 612 | + likely(!task_has_idle_policy(p))) 613 | + goto preempt; 614 | + 615 | + /* 616 | + * Batch and idle tasks do not preempt non-idle tasks (their preemption 617 | + * is driven by the tick): 618 | + */ 619 | + if (unlikely(p->policy != SCHED_NORMAL) || !sched_feat(WAKEUP_PREEMPTION)) 620 | + return; 621 | + 622 | + /* 623 | + * Lower priority tasks do not preempt higher ones 624 | + */ 625 | + if (p->prio > curr->prio) 626 | + return; 627 | + 628 | + update_curr(cfs_rq_of(se)); 629 | + 630 | + if (entity_before(&wse->bs_node, &se->bs_node)) 631 | + goto preempt; 632 | + 633 | + return; 634 | + 635 | +preempt: 636 | + resched_curr(rq); 637 | +} 638 | + 639 | +#ifdef CONFIG_SMP 640 | +static int 641 | +balance_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf) 642 | +{ 643 | + if (rq->nr_running) 644 | + return 1; 645 | + 646 | + return newidle_balance(rq, rf) != 0; 647 | +} 648 | + 649 | +static int 650 | +select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags) 651 | +{ 652 | + struct rq *rq = cpu_rq(prev_cpu); 653 | + unsigned int min_this = rq->nr_running; 654 | + unsigned int min = rq->nr_running; 655 | + int cpu, new_cpu = prev_cpu; 656 | + 657 | + for_each_online_cpu(cpu) { 658 | + if (cpu_rq(cpu)->nr_running < min) { 659 | + new_cpu = cpu; 660 | + min = cpu_rq(cpu)->nr_running; 661 | + } 662 | + } 663 | + 664 | + if (min == min_this) 665 | + return prev_cpu; 666 | + 667 | + return new_cpu; 668 | +} 669 | + 670 | +static int 671 | +can_migrate_task(struct task_struct *p, int dst_cpu, struct rq *src_rq) 672 | +{ 673 | + if (task_running(src_rq, p)) 674 | + return 0; 675 | + 676 | + /* Disregard pcpu kthreads; they are where they need to be. */ 677 | + if (kthread_is_per_cpu(p)) 678 | + return 0; 679 | + 680 | + if (!cpumask_test_cpu(dst_cpu, p->cpus_ptr)) 681 | + return 0; 682 | + 683 | + return 1; 684 | +} 685 | + 686 | +static void pull_from(struct rq *this_rq, 687 | + struct rq *src_rq, 688 | + struct rq_flags *src_rf, 689 | + struct task_struct *p) 690 | +{ 691 | + struct rq_flags rf; 692 | + 693 | + // detach task 694 | + deactivate_task(src_rq, p, DEQUEUE_NOCLOCK); 695 | + set_task_cpu(p, cpu_of(this_rq)); 696 | + 697 | + // unlock src rq 698 | + rq_unlock(src_rq, src_rf); 699 | + 700 | + // lock this rq 701 | + rq_lock(this_rq, &rf); 702 | + update_rq_clock(this_rq); 703 | + 704 | + activate_task(this_rq, p, ENQUEUE_NOCLOCK); 705 | + check_preempt_curr(this_rq, p, 0); 706 | + 707 | + // unlock this rq 708 | + rq_unlock(this_rq, &rf); 709 | + 710 | + local_irq_restore(src_rf->flags); 711 | +} 712 | + 713 | +static int move_task(struct rq *this_rq, struct rq *src_rq, 714 | + struct rq_flags *src_rf) 715 | +{ 716 | + struct cfs_rq *src_cfs_rq = &src_rq->cfs; 717 | + struct task_struct *p; 718 | + struct bs_node *bsn = src_cfs_rq->head; 719 | + int moved = 0; 720 | + 721 | + while (bsn) { 722 | + p = task_of(se_of(bsn)); 723 | + if (can_migrate_task(p, cpu_of(this_rq), src_rq)) { 724 | + pull_from(this_rq, src_rq, src_rf, p); 725 | + moved = 1; 726 | + break; 727 | + } 728 | + 729 | + bsn = bsn->next; 730 | + } 731 | + 732 | + if (!moved) { 733 | + rq_unlock(src_rq, src_rf); 734 | + local_irq_restore(src_rf->flags); 735 | + } 736 | + 737 | + return moved; 738 | +} 739 | + 740 | +static int newidle_balance(struct rq *this_rq, struct rq_flags *rf) 741 | +{ 742 | + int this_cpu = this_rq->cpu; 743 | + struct rq *src_rq; 744 | + int src_cpu = -1, cpu; 745 | + int pulled_task = 0; 746 | + unsigned 
int max = 0; 747 | + struct rq_flags src_rf; 748 | + 749 | + /* 750 | + * We must set idle_stamp _before_ calling idle_balance(), such that we 751 | + * measure the duration of idle_balance() as idle time. 752 | + */ 753 | + this_rq->idle_stamp = rq_clock(this_rq); 754 | + 755 | + /* 756 | + * Do not pull tasks towards !active CPUs... 757 | + */ 758 | + if (!cpu_active(this_cpu)) 759 | + return 0; 760 | + 761 | + rq_unpin_lock(this_rq, rf); 762 | + raw_spin_unlock(&this_rq->__lock); 763 | + 764 | + for_each_online_cpu(cpu) { 765 | + /* 766 | + * Stop searching for tasks to pull if there are 767 | + * now runnable tasks on this rq. 768 | + */ 769 | + if (this_rq->nr_running > 0) 770 | + goto out; 771 | + 772 | + if (cpu == this_cpu) 773 | + continue; 774 | + 775 | + src_rq = cpu_rq(cpu); 776 | + 777 | + if (src_rq->nr_running < 2) 778 | + continue; 779 | + 780 | + if (src_rq->nr_running > max) { 781 | + max = src_rq->nr_running; 782 | + src_cpu = cpu; 783 | + } 784 | + } 785 | + 786 | + if (src_cpu != -1) { 787 | + src_rq = cpu_rq(src_cpu); 788 | + 789 | + rq_lock_irqsave(src_rq, &src_rf); 790 | + update_rq_clock(src_rq); 791 | + 792 | + if (src_rq->nr_running < 2) { 793 | + rq_unlock(src_rq, &src_rf); 794 | + local_irq_restore(src_rf.flags); 795 | + } else { 796 | + pulled_task = move_task(this_rq, src_rq, &src_rf); 797 | + } 798 | + } 799 | + 800 | +out: 801 | + raw_spin_lock(&this_rq->__lock); 802 | + 803 | + /* 804 | + * While browsing the domains, we released the rq lock, a task could 805 | + * have been enqueued in the meantime. Since we're not going idle, 806 | + * pretend we pulled a task. 807 | + */ 808 | + if (this_rq->cfs.h_nr_running && !pulled_task) 809 | + pulled_task = 1; 810 | + 811 | + /* Is there a task of a high priority class? */ 812 | + if (this_rq->nr_running != this_rq->cfs.h_nr_running) 813 | + pulled_task = -1; 814 | + 815 | + if (pulled_task) 816 | + this_rq->idle_stamp = 0; 817 | + 818 | + rq_repin_lock(this_rq, rf); 819 | + 820 | + return pulled_task; 821 | +} 822 | + 823 | +static inline int on_null_domain(struct rq *rq) 824 | +{ 825 | + return unlikely(!rcu_dereference_sched(rq->sd)); 826 | +} 827 | + 828 | +void trigger_load_balance(struct rq *this_rq) 829 | +{ 830 | + int this_cpu = cpu_of(this_rq); 831 | + int cpu; 832 | + unsigned int max, min; 833 | + struct rq *max_rq, *min_rq, *c_rq; 834 | + struct rq_flags src_rf; 835 | + 836 | + if (this_cpu != 0) 837 | + return; 838 | + 839 | + max = min = this_rq->nr_running; 840 | + max_rq = min_rq = this_rq; 841 | + 842 | + for_each_online_cpu(cpu) { 843 | + c_rq = cpu_rq(cpu); 844 | + 845 | + /* 846 | + * Don't need to rebalance while attached to NULL domain or 847 | + * runqueue CPU is not active 848 | + */ 849 | + if (unlikely(on_null_domain(c_rq) || !cpu_active(cpu))) 850 | + continue; 851 | + 852 | + if (c_rq->nr_running < min) { 853 | + min = c_rq->nr_running; 854 | + min_rq = c_rq; 855 | + } 856 | + 857 | + if (c_rq->nr_running > max) { 858 | + max = c_rq->nr_running; 859 | + max_rq = c_rq; 860 | + } 861 | + } 862 | + 863 | + if (min_rq == max_rq || max - min < 2) 864 | + return; 865 | + 866 | + rq_lock_irqsave(max_rq, &src_rf); 867 | + update_rq_clock(max_rq); 868 | + 869 | + if (max_rq->nr_running < 2) { 870 | + rq_unlock(max_rq, &src_rf); 871 | + local_irq_restore(src_rf.flags); 872 | + return; 873 | + } 874 | + 875 | + move_task(min_rq, max_rq, &src_rf); 876 | +} 877 | + 878 | +void update_group_capacity(struct sched_domain *sd, int cpu) {} 879 | +#endif /* CONFIG_SMP */ 880 | + 881 | +static void 
task_tick_fair(struct rq *rq, struct task_struct *curr, int queued) 882 | +{ 883 | + struct sched_entity *se = &curr->se; 884 | + struct cfs_rq *cfs_rq = cfs_rq_of(se); 885 | + 886 | + entity_tick(cfs_rq, se, queued); 887 | + 888 | + if (static_branch_unlikely(&sched_numa_balancing)) 889 | + task_tick_numa(rq, curr); 890 | +} 891 | + 892 | +static void task_fork_fair(struct task_struct *p) 893 | +{ 894 | + struct cfs_rq *cfs_rq; 895 | + struct sched_entity *curr; 896 | + struct rq *rq = this_rq(); 897 | + struct rq_flags rf; 898 | + 899 | + p->se.bs_node.vruntime = 0; 900 | + 901 | + rq_lock(rq, &rf); 902 | + update_rq_clock(rq); 903 | + 904 | + cfs_rq = task_cfs_rq(current); 905 | + curr = cfs_rq->curr; 906 | + if (curr) 907 | + update_curr(cfs_rq); 908 | + 909 | + rq_unlock(rq, &rf); 910 | +} 911 | + 912 | +/* 913 | + * All the scheduling class methods: 914 | + */ 915 | +DEFINE_SCHED_CLASS(fair) = { 916 | + 917 | + .enqueue_task = enqueue_task_fair, 918 | + .dequeue_task = dequeue_task_fair, 919 | + .yield_task = yield_task_fair, 920 | + .yield_to_task = yield_to_task_fair, 921 | + 922 | + .check_preempt_curr = check_preempt_wakeup, 923 | + 924 | + .pick_next_task = __pick_next_task_fair, 925 | + .put_prev_task = put_prev_task_fair, 926 | + .set_next_task = set_next_task_fair, 927 | + 928 | +#ifdef CONFIG_SMP 929 | + .balance = balance_fair, 930 | + .pick_task = pick_task_fair, 931 | + .select_task_rq = select_task_rq_fair, 932 | + .migrate_task_rq = migrate_task_rq_fair, 933 | + 934 | + .rq_online = rq_online_fair, 935 | + .rq_offline = rq_offline_fair, 936 | + 937 | + .task_dead = task_dead_fair, 938 | + .set_cpus_allowed = set_cpus_allowed_common, 939 | +#endif 940 | + 941 | + .task_tick = task_tick_fair, 942 | + .task_fork = task_fork_fair, 943 | + 944 | + .prio_changed = prio_changed_fair, 945 | + .switched_from = switched_from_fair, 946 | + .switched_to = switched_to_fair, 947 | + 948 | + .get_rr_interval = get_rr_interval_fair, 949 | + 950 | + .update_curr = update_curr_fair, 951 | +}; 952 | diff --git a/kernel/sched/bs.h b/kernel/sched/bs.h 953 | new file mode 100644 954 | index 000000000000..e0466c0e6ec3 955 | --- /dev/null 956 | +++ b/kernel/sched/bs.h 957 | @@ -0,0 +1,149 @@ 958 | + 959 | +/* 960 | + * After fork, child runs first. If set to 0 (default) then 961 | + * parent will (try to) run first. 
962 | + */ 963 | +unsigned int sysctl_sched_child_runs_first __read_mostly; 964 | + 965 | +const_debug unsigned int sysctl_sched_migration_cost = 500000UL; 966 | + 967 | +void __init sched_init_granularity(void) {} 968 | + 969 | +#ifdef CONFIG_SMP 970 | +/* Give new sched_entity start runnable values to heavy its load in infant time */ 971 | +void init_entity_runnable_average(struct sched_entity *se) {} 972 | +void post_init_entity_util_avg(struct task_struct *p) {} 973 | +void update_max_interval(void) {} 974 | +static int newidle_balance(struct rq *this_rq, struct rq_flags *rf); 975 | + 976 | +static void migrate_task_rq_fair(struct task_struct *p, int new_cpu) 977 | +{ 978 | + update_scan_period(p, new_cpu); 979 | +} 980 | + 981 | +static void rq_online_fair(struct rq *rq) {} 982 | +static void rq_offline_fair(struct rq *rq) {} 983 | +static void task_dead_fair(struct task_struct *p) 984 | +{ 985 | + struct cfs_rq *cfs_rq = cfs_rq_of(&p->se); 986 | + unsigned long flags; 987 | + 988 | + raw_spin_lock_irqsave(&cfs_rq->removed.lock, flags); 989 | + ++cfs_rq->removed.nr; 990 | + raw_spin_unlock_irqrestore(&cfs_rq->removed.lock, flags); 991 | +} 992 | + 993 | +#endif /** CONFIG_SMP */ 994 | + 995 | +void init_cfs_rq(struct cfs_rq *cfs_rq) 996 | +{ 997 | + cfs_rq->tasks_timeline = RB_ROOT_CACHED; 998 | +#ifdef CONFIG_SMP 999 | + raw_spin_lock_init(&cfs_rq->removed.lock); 1000 | +#endif 1001 | +} 1002 | + 1003 | +__init void init_sched_fair_class(void) {} 1004 | + 1005 | +void reweight_task(struct task_struct *p, int prio) {} 1006 | + 1007 | +static inline struct sched_entity *se_of(struct bs_node *bsn) 1008 | +{ 1009 | + return container_of(bsn, struct sched_entity, bs_node); 1010 | +} 1011 | + 1012 | +#ifdef CONFIG_SCHED_SMT 1013 | +DEFINE_STATIC_KEY_FALSE(sched_smt_present); 1014 | +EXPORT_SYMBOL_GPL(sched_smt_present); 1015 | + 1016 | +static inline void set_idle_cores(int cpu, int val) 1017 | +{ 1018 | + struct sched_domain_shared *sds; 1019 | + 1020 | + sds = rcu_dereference(per_cpu(sd_llc_shared, cpu)); 1021 | + if (sds) 1022 | + WRITE_ONCE(sds->has_idle_cores, val); 1023 | +} 1024 | + 1025 | +static inline bool test_idle_cores(int cpu, bool def) 1026 | +{ 1027 | + struct sched_domain_shared *sds; 1028 | + 1029 | + sds = rcu_dereference(per_cpu(sd_llc_shared, cpu)); 1030 | + if (sds) 1031 | + return READ_ONCE(sds->has_idle_cores); 1032 | + 1033 | + return def; 1034 | +} 1035 | + 1036 | +void __update_idle_core(struct rq *rq) 1037 | +{ 1038 | + int core = cpu_of(rq); 1039 | + int cpu; 1040 | + 1041 | + rcu_read_lock(); 1042 | + if (test_idle_cores(core, true)) 1043 | + goto unlock; 1044 | + 1045 | + for_each_cpu(cpu, cpu_smt_mask(core)) { 1046 | + if (cpu == core) 1047 | + continue; 1048 | + 1049 | + if (!available_idle_cpu(cpu)) 1050 | + goto unlock; 1051 | + } 1052 | + 1053 | + set_idle_cores(core, 1); 1054 | +unlock: 1055 | + rcu_read_unlock(); 1056 | +} 1057 | +#endif 1058 | + 1059 | +static void 1060 | +account_entity_enqueue(struct cfs_rq *cfs_rq, struct sched_entity *se) 1061 | +{ 1062 | +#ifdef CONFIG_SMP 1063 | + if (entity_is_task(se)) { 1064 | + struct rq *rq = rq_of(cfs_rq); 1065 | + 1066 | + account_numa_enqueue(rq, task_of(se)); 1067 | + list_add(&se->group_node, &rq->cfs_tasks); 1068 | + } 1069 | +#endif 1070 | + cfs_rq->nr_running++; 1071 | +} 1072 | + 1073 | +static void 1074 | +account_entity_dequeue(struct cfs_rq *cfs_rq, struct sched_entity *se) 1075 | +{ 1076 | +#ifdef CONFIG_SMP 1077 | + if (entity_is_task(se)) { 1078 | + account_numa_dequeue(rq_of(cfs_rq), 
task_of(se)); 1079 | + list_del_init(&se->group_node); 1080 | + } 1081 | +#endif 1082 | + cfs_rq->nr_running--; 1083 | +} 1084 | + 1085 | + 1086 | +static void 1087 | +prio_changed_fair(struct rq *rq, struct task_struct *p, int oldprio) {} 1088 | + 1089 | +static void switched_from_fair(struct rq *rq, struct task_struct *p) {} 1090 | + 1091 | +static void switched_to_fair(struct rq *rq, struct task_struct *p) 1092 | +{ 1093 | + if (task_on_rq_queued(p)) { 1094 | + /* 1095 | + * We were most likely switched from sched_rt, so 1096 | + * kick off the schedule if running, otherwise just see 1097 | + * if we can still preempt the current task. 1098 | + */ 1099 | + resched_curr(rq); 1100 | + } 1101 | +} 1102 | + 1103 | +static unsigned int get_rr_interval_fair(struct rq *rq, struct task_struct *task) 1104 | +{ 1105 | + return 0; 1106 | +} 1107 | diff --git a/kernel/sched/core.c b/kernel/sched/core.c 1108 | index 399c37c95392..4566e232d9c3 100644 1109 | --- a/kernel/sched/core.c 1110 | +++ b/kernel/sched/core.c 1111 | @@ -4228,6 +4228,10 @@ void wake_up_new_task(struct task_struct *p) 1112 | update_rq_clock(rq); 1113 | post_init_entity_util_avg(p); 1114 | 1115 | +#ifdef CONFIG_BS_SCHED 1116 | + p->se.bs_node.hrrn_start_time = sched_clock(); 1117 | +#endif 1118 | + 1119 | activate_task(rq, p, ENQUEUE_NOCLOCK); 1120 | trace_sched_wakeup_new(p); 1121 | check_preempt_curr(rq, p, WF_FORK); 1122 | @@ -8995,6 +8999,10 @@ void __init sched_init(void) 1123 | 1124 | wait_bit_init(); 1125 | 1126 | +#ifdef CONFIG_BS_SCHED 1127 | + printk(KERN_INFO "Baby CPU scheduler (hrrn) v5.14 by Hamad Al Marri."); 1128 | +#endif 1129 | + 1130 | #ifdef CONFIG_FAIR_GROUP_SCHED 1131 | ptr += 2 * nr_cpu_ids * sizeof(void **); 1132 | #endif 1133 | diff --git a/kernel/sched/fair_numa.h b/kernel/sched/fair_numa.h 1134 | new file mode 100644 1135 | index 000000000000..a564478ec3f7 1136 | --- /dev/null 1137 | +++ b/kernel/sched/fair_numa.h 1138 | @@ -0,0 +1,1968 @@ 1139 | + 1140 | +#ifdef CONFIG_NUMA_BALANCING 1141 | + 1142 | +unsigned int sysctl_numa_balancing_scan_period_min = 1000; 1143 | +unsigned int sysctl_numa_balancing_scan_period_max = 60000; 1144 | + 1145 | +/* Portion of address space to scan in MB */ 1146 | +unsigned int sysctl_numa_balancing_scan_size = 256; 1147 | + 1148 | +/* Scan @scan_size MB every @scan_period after an initial @scan_delay in ms */ 1149 | +unsigned int sysctl_numa_balancing_scan_delay = 1000; 1150 | + 1151 | +struct numa_group { 1152 | + refcount_t refcount; 1153 | + 1154 | + spinlock_t lock; /* nr_tasks, tasks */ 1155 | + int nr_tasks; 1156 | + pid_t gid; 1157 | + int active_nodes; 1158 | + 1159 | + struct rcu_head rcu; 1160 | + unsigned long total_faults; 1161 | + unsigned long max_faults_cpu; 1162 | + /* 1163 | + * Faults_cpu is used to decide whether memory should move 1164 | + * towards the CPU. As a consequence, these stats are weighted 1165 | + * more by CPU use than by memory faults. 1166 | + */ 1167 | + unsigned long *faults_cpu; 1168 | + unsigned long faults[]; 1169 | +}; 1170 | + 1171 | +/* 1172 | + * For functions that can be called in multiple contexts that permit reading 1173 | + * ->numa_group (see struct task_struct for locking rules). 
1174 | + */ 1175 | +static struct numa_group *deref_task_numa_group(struct task_struct *p) 1176 | +{ 1177 | + return rcu_dereference_check(p->numa_group, p == current || 1178 | + (lockdep_is_held(__rq_lockp(task_rq(p))) && !READ_ONCE(p->on_cpu))); 1179 | +} 1180 | + 1181 | +static struct numa_group *deref_curr_numa_group(struct task_struct *p) 1182 | +{ 1183 | + return rcu_dereference_protected(p->numa_group, p == current); 1184 | +} 1185 | + 1186 | +static inline unsigned long group_faults_priv(struct numa_group *ng); 1187 | +static inline unsigned long group_faults_shared(struct numa_group *ng); 1188 | + 1189 | +static unsigned int task_nr_scan_windows(struct task_struct *p) 1190 | +{ 1191 | + unsigned long rss = 0; 1192 | + unsigned long nr_scan_pages; 1193 | + 1194 | + /* 1195 | + * Calculations based on RSS as non-present and empty pages are skipped 1196 | + * by the PTE scanner and NUMA hinting faults should be trapped based 1197 | + * on resident pages 1198 | + */ 1199 | + nr_scan_pages = sysctl_numa_balancing_scan_size << (20 - PAGE_SHIFT); 1200 | + rss = get_mm_rss(p->mm); 1201 | + if (!rss) 1202 | + rss = nr_scan_pages; 1203 | + 1204 | + rss = round_up(rss, nr_scan_pages); 1205 | + return rss / nr_scan_pages; 1206 | +} 1207 | + 1208 | +/* For sanity's sake, never scan more PTEs than MAX_SCAN_WINDOW MB/sec. */ 1209 | +#define MAX_SCAN_WINDOW 2560 1210 | + 1211 | +static unsigned int task_scan_min(struct task_struct *p) 1212 | +{ 1213 | + unsigned int scan_size = READ_ONCE(sysctl_numa_balancing_scan_size); 1214 | + unsigned int scan, floor; 1215 | + unsigned int windows = 1; 1216 | + 1217 | + if (scan_size < MAX_SCAN_WINDOW) 1218 | + windows = MAX_SCAN_WINDOW / scan_size; 1219 | + floor = 1000 / windows; 1220 | + 1221 | + scan = sysctl_numa_balancing_scan_period_min / task_nr_scan_windows(p); 1222 | + return max_t(unsigned int, floor, scan); 1223 | +} 1224 | + 1225 | +static unsigned int task_scan_max(struct task_struct *p) 1226 | +{ 1227 | + unsigned long smin = task_scan_min(p); 1228 | + unsigned long smax; 1229 | + struct numa_group *ng; 1230 | + 1231 | + /* Watch for min being lower than max due to floor calculations */ 1232 | + smax = sysctl_numa_balancing_scan_period_max / task_nr_scan_windows(p); 1233 | + 1234 | + /* Scale the maximum scan period with the amount of shared memory. */ 1235 | + ng = deref_curr_numa_group(p); 1236 | + if (ng) { 1237 | + unsigned long shared = group_faults_shared(ng); 1238 | + unsigned long private = group_faults_priv(ng); 1239 | + unsigned long period = smax; 1240 | + 1241 | + period *= refcount_read(&ng->refcount); 1242 | + period *= shared + 1; 1243 | + period /= private + shared + 1; 1244 | + 1245 | + smax = max(smax, period); 1246 | + } 1247 | + 1248 | + return max(smin, smax); 1249 | +} 1250 | + 1251 | +static void account_numa_enqueue(struct rq *rq, struct task_struct *p) 1252 | +{ 1253 | + rq->nr_numa_running += (p->numa_preferred_nid != NUMA_NO_NODE); 1254 | + rq->nr_preferred_running += (p->numa_preferred_nid == task_node(p)); 1255 | +} 1256 | + 1257 | +static void account_numa_dequeue(struct rq *rq, struct task_struct *p) 1258 | +{ 1259 | + rq->nr_numa_running -= (p->numa_preferred_nid != NUMA_NO_NODE); 1260 | + rq->nr_preferred_running -= (p->numa_preferred_nid == task_node(p)); 1261 | +} 1262 | + 1263 | +/* Shared or private faults. 
*/ 1264 | +#define NR_NUMA_HINT_FAULT_TYPES 2 1265 | + 1266 | +/* Memory and CPU locality */ 1267 | +#define NR_NUMA_HINT_FAULT_STATS (NR_NUMA_HINT_FAULT_TYPES * 2) 1268 | + 1269 | +/* Averaged statistics, and temporary buffers. */ 1270 | +#define NR_NUMA_HINT_FAULT_BUCKETS (NR_NUMA_HINT_FAULT_STATS * 2) 1271 | + 1272 | +pid_t task_numa_group_id(struct task_struct *p) 1273 | +{ 1274 | + struct numa_group *ng; 1275 | + pid_t gid = 0; 1276 | + 1277 | + rcu_read_lock(); 1278 | + ng = rcu_dereference(p->numa_group); 1279 | + if (ng) 1280 | + gid = ng->gid; 1281 | + rcu_read_unlock(); 1282 | + 1283 | + return gid; 1284 | +} 1285 | + 1286 | +/* 1287 | + * The averaged statistics, shared & private, memory & CPU, 1288 | + * occupy the first half of the array. The second half of the 1289 | + * array is for current counters, which are averaged into the 1290 | + * first set by task_numa_placement. 1291 | + */ 1292 | +static inline int task_faults_idx(enum numa_faults_stats s, int nid, int priv) 1293 | +{ 1294 | + return NR_NUMA_HINT_FAULT_TYPES * (s * nr_node_ids + nid) + priv; 1295 | +} 1296 | + 1297 | +static inline unsigned long task_faults(struct task_struct *p, int nid) 1298 | +{ 1299 | + if (!p->numa_faults) 1300 | + return 0; 1301 | + 1302 | + return p->numa_faults[task_faults_idx(NUMA_MEM, nid, 0)] + 1303 | + p->numa_faults[task_faults_idx(NUMA_MEM, nid, 1)]; 1304 | +} 1305 | + 1306 | +static inline unsigned long group_faults(struct task_struct *p, int nid) 1307 | +{ 1308 | + struct numa_group *ng = deref_task_numa_group(p); 1309 | + 1310 | + if (!ng) 1311 | + return 0; 1312 | + 1313 | + return ng->faults[task_faults_idx(NUMA_MEM, nid, 0)] + 1314 | + ng->faults[task_faults_idx(NUMA_MEM, nid, 1)]; 1315 | +} 1316 | + 1317 | +static inline unsigned long group_faults_cpu(struct numa_group *group, int nid) 1318 | +{ 1319 | + return group->faults_cpu[task_faults_idx(NUMA_MEM, nid, 0)] + 1320 | + group->faults_cpu[task_faults_idx(NUMA_MEM, nid, 1)]; 1321 | +} 1322 | + 1323 | +static inline unsigned long group_faults_priv(struct numa_group *ng) 1324 | +{ 1325 | + unsigned long faults = 0; 1326 | + int node; 1327 | + 1328 | + for_each_online_node(node) { 1329 | + faults += ng->faults[task_faults_idx(NUMA_MEM, node, 1)]; 1330 | + } 1331 | + 1332 | + return faults; 1333 | +} 1334 | + 1335 | +static inline unsigned long group_faults_shared(struct numa_group *ng) 1336 | +{ 1337 | + unsigned long faults = 0; 1338 | + int node; 1339 | + 1340 | + for_each_online_node(node) { 1341 | + faults += ng->faults[task_faults_idx(NUMA_MEM, node, 0)]; 1342 | + } 1343 | + 1344 | + return faults; 1345 | +} 1346 | + 1347 | +/* 1348 | + * A node triggering more than 1/3 as many NUMA faults as the maximum is 1349 | + * considered part of a numa group's pseudo-interleaving set. Migrations 1350 | + * between these nodes are slowed down, to allow things to settle down. 1351 | + */ 1352 | +#define ACTIVE_NODE_FRACTION 3 1353 | + 1354 | +static bool numa_is_active_node(int nid, struct numa_group *ng) 1355 | +{ 1356 | + return group_faults_cpu(ng, nid) * ACTIVE_NODE_FRACTION > ng->max_faults_cpu; 1357 | +} 1358 | + 1359 | +/* Handle placement on systems where not all nodes are directly connected. */ 1360 | +static unsigned long score_nearby_nodes(struct task_struct *p, int nid, 1361 | + int maxdist, bool task) 1362 | +{ 1363 | + unsigned long score = 0; 1364 | + int node; 1365 | + 1366 | + /* 1367 | + * All nodes are directly connected, and the same distance 1368 | + * from each other. No need for fancy placement algorithms. 
1369 | + */ 1370 | + if (sched_numa_topology_type == NUMA_DIRECT) 1371 | + return 0; 1372 | + 1373 | + /* 1374 | + * This code is called for each node, introducing N^2 complexity, 1375 | + * which should be ok given the number of nodes rarely exceeds 8. 1376 | + */ 1377 | + for_each_online_node(node) { 1378 | + unsigned long faults; 1379 | + int dist = node_distance(nid, node); 1380 | + 1381 | + /* 1382 | + * The furthest away nodes in the system are not interesting 1383 | + * for placement; nid was already counted. 1384 | + */ 1385 | + if (dist == sched_max_numa_distance || node == nid) 1386 | + continue; 1387 | + 1388 | + /* 1389 | + * On systems with a backplane NUMA topology, compare groups 1390 | + * of nodes, and move tasks towards the group with the most 1391 | + * memory accesses. When comparing two nodes at distance 1392 | + * "hoplimit", only nodes closer by than "hoplimit" are part 1393 | + * of each group. Skip other nodes. 1394 | + */ 1395 | + if (sched_numa_topology_type == NUMA_BACKPLANE && 1396 | + dist >= maxdist) 1397 | + continue; 1398 | + 1399 | + /* Add up the faults from nearby nodes. */ 1400 | + if (task) 1401 | + faults = task_faults(p, node); 1402 | + else 1403 | + faults = group_faults(p, node); 1404 | + 1405 | + /* 1406 | + * On systems with a glueless mesh NUMA topology, there are 1407 | + * no fixed "groups of nodes". Instead, nodes that are not 1408 | + * directly connected bounce traffic through intermediate 1409 | + * nodes; a numa_group can occupy any set of nodes. 1410 | + * The further away a node is, the less the faults count. 1411 | + * This seems to result in good task placement. 1412 | + */ 1413 | + if (sched_numa_topology_type == NUMA_GLUELESS_MESH) { 1414 | + faults *= (sched_max_numa_distance - dist); 1415 | + faults /= (sched_max_numa_distance - LOCAL_DISTANCE); 1416 | + } 1417 | + 1418 | + score += faults; 1419 | + } 1420 | + 1421 | + return score; 1422 | +} 1423 | + 1424 | +/* 1425 | + * These return the fraction of accesses done by a particular task, or 1426 | + * task group, on a particular numa node. The group weight is given a 1427 | + * larger multiplier, in order to group tasks together that are almost 1428 | + * evenly spread out between numa nodes. 
1429 | + */ 1430 | +static inline unsigned long task_weight(struct task_struct *p, int nid, 1431 | + int dist) 1432 | +{ 1433 | + unsigned long faults, total_faults; 1434 | + 1435 | + if (!p->numa_faults) 1436 | + return 0; 1437 | + 1438 | + total_faults = p->total_numa_faults; 1439 | + 1440 | + if (!total_faults) 1441 | + return 0; 1442 | + 1443 | + faults = task_faults(p, nid); 1444 | + faults += score_nearby_nodes(p, nid, dist, true); 1445 | + 1446 | + return 1000 * faults / total_faults; 1447 | +} 1448 | + 1449 | +static inline unsigned long group_weight(struct task_struct *p, int nid, 1450 | + int dist) 1451 | +{ 1452 | + struct numa_group *ng = deref_task_numa_group(p); 1453 | + unsigned long faults, total_faults; 1454 | + 1455 | + if (!ng) 1456 | + return 0; 1457 | + 1458 | + total_faults = ng->total_faults; 1459 | + 1460 | + if (!total_faults) 1461 | + return 0; 1462 | + 1463 | + faults = group_faults(p, nid); 1464 | + faults += score_nearby_nodes(p, nid, dist, false); 1465 | + 1466 | + return 1000 * faults / total_faults; 1467 | +} 1468 | + 1469 | +bool should_numa_migrate_memory(struct task_struct *p, struct page * page, 1470 | + int src_nid, int dst_cpu) 1471 | +{ 1472 | + struct numa_group *ng = deref_curr_numa_group(p); 1473 | + int dst_nid = cpu_to_node(dst_cpu); 1474 | + int last_cpupid, this_cpupid; 1475 | + 1476 | + this_cpupid = cpu_pid_to_cpupid(dst_cpu, current->pid); 1477 | + last_cpupid = page_cpupid_xchg_last(page, this_cpupid); 1478 | + 1479 | + /* 1480 | + * Allow first faults or private faults to migrate immediately early in 1481 | + * the lifetime of a task. The magic number 4 is based on waiting for 1482 | + * two full passes of the "multi-stage node selection" test that is 1483 | + * executed below. 1484 | + */ 1485 | + if ((p->numa_preferred_nid == NUMA_NO_NODE || p->numa_scan_seq <= 4) && 1486 | + (cpupid_pid_unset(last_cpupid) || cpupid_match_pid(p, last_cpupid))) 1487 | + return true; 1488 | + 1489 | + /* 1490 | + * Multi-stage node selection is used in conjunction with a periodic 1491 | + * migration fault to build a temporal task<->page relation. By using 1492 | + * a two-stage filter we remove short/unlikely relations. 1493 | + * 1494 | + * Using P(p) ~ n_p / n_t as per frequentist probability, we can equate 1495 | + * a task's usage of a particular page (n_p) per total usage of this 1496 | + * page (n_t) (in a given time-span) to a probability. 1497 | + * 1498 | + * Our periodic faults will sample this probability and getting the 1499 | + * same result twice in a row, given these samples are fully 1500 | + * independent, is then given by P(n)^2, provided our sample period 1501 | + * is sufficiently short compared to the usage pattern. 1502 | + * 1503 | + * This quadric squishes small probabilities, making it less likely we 1504 | + * act on an unlikely task<->page relation. 1505 | + */ 1506 | + if (!cpupid_pid_unset(last_cpupid) && 1507 | + cpupid_to_nid(last_cpupid) != dst_nid) 1508 | + return false; 1509 | + 1510 | + /* Always allow migrate on private faults */ 1511 | + if (cpupid_match_pid(p, last_cpupid)) 1512 | + return true; 1513 | + 1514 | + /* A shared fault, but p->numa_group has not been set up yet. */ 1515 | + if (!ng) 1516 | + return true; 1517 | + 1518 | + /* 1519 | + * Destination node is much more heavily used than the source 1520 | + * node? Allow migration. 
1521 | + */ 1522 | + if (group_faults_cpu(ng, dst_nid) > group_faults_cpu(ng, src_nid) * 1523 | + ACTIVE_NODE_FRACTION) 1524 | + return true; 1525 | + 1526 | + /* 1527 | + * Distribute memory according to CPU & memory use on each node, 1528 | + * with 3/4 hysteresis to avoid unnecessary memory migrations: 1529 | + * 1530 | + * faults_cpu(dst) 3 faults_cpu(src) 1531 | + * --------------- * - > --------------- 1532 | + * faults_mem(dst) 4 faults_mem(src) 1533 | + */ 1534 | + return group_faults_cpu(ng, dst_nid) * group_faults(p, src_nid) * 3 > 1535 | + group_faults_cpu(ng, src_nid) * group_faults(p, dst_nid) * 4; 1536 | +} 1537 | + 1538 | +/* 1539 | + * 'numa_type' describes the node at the moment of load balancing. 1540 | + */ 1541 | +enum numa_type { 1542 | + /* The node has spare capacity that can be used to run more tasks. */ 1543 | + node_has_spare = 0, 1544 | + /* 1545 | + * The node is fully used and the tasks don't compete for more CPU 1546 | + * cycles. Nevertheless, some tasks might wait before running. 1547 | + */ 1548 | + node_fully_busy, 1549 | + /* 1550 | + * The node is overloaded and can't provide expected CPU cycles to all 1551 | + * tasks. 1552 | + */ 1553 | + node_overloaded 1554 | +}; 1555 | + 1556 | +/* Cached statistics for all CPUs within a node */ 1557 | +struct numa_stats { 1558 | + unsigned long load; 1559 | + unsigned long runnable; 1560 | + unsigned long util; 1561 | + /* Total compute capacity of CPUs on a node */ 1562 | + unsigned long compute_capacity; 1563 | + unsigned int nr_running; 1564 | + unsigned int weight; 1565 | + enum numa_type node_type; 1566 | + int idle_cpu; 1567 | +}; 1568 | + 1569 | +static inline bool is_core_idle(int cpu) 1570 | +{ 1571 | +#ifdef CONFIG_SCHED_SMT 1572 | + int sibling; 1573 | + 1574 | + for_each_cpu(sibling, cpu_smt_mask(cpu)) { 1575 | + if (cpu == sibling) 1576 | + continue; 1577 | + 1578 | + if (!idle_cpu(sibling)) 1579 | + return false; 1580 | + } 1581 | +#endif 1582 | + 1583 | + return true; 1584 | +} 1585 | + 1586 | +struct task_numa_env { 1587 | + struct task_struct *p; 1588 | + 1589 | + int src_cpu, src_nid; 1590 | + int dst_cpu, dst_nid; 1591 | + 1592 | + struct numa_stats src_stats, dst_stats; 1593 | + 1594 | + int imbalance_pct; 1595 | + int dist; 1596 | + 1597 | + struct task_struct *best_task; 1598 | + long best_imp; 1599 | + int best_cpu; 1600 | +}; 1601 | + 1602 | +static inline unsigned long cfs_rq_load_avg(struct cfs_rq *cfs_rq) 1603 | +{ 1604 | + return cfs_rq->avg.load_avg; 1605 | +} 1606 | + 1607 | +static unsigned long cpu_load(struct rq *rq) 1608 | +{ 1609 | + return cfs_rq_load_avg(&rq->cfs); 1610 | +} 1611 | + 1612 | +static inline unsigned long cfs_rq_runnable_avg(struct cfs_rq *cfs_rq) 1613 | +{ 1614 | + return cfs_rq->avg.runnable_avg; 1615 | +} 1616 | + 1617 | +static unsigned long cpu_runnable(struct rq *rq) 1618 | +{ 1619 | + return cfs_rq_runnable_avg(&rq->cfs); 1620 | +} 1621 | + 1622 | +static inline unsigned long cpu_util(int cpu) 1623 | +{ 1624 | + struct cfs_rq *cfs_rq; 1625 | + unsigned int util; 1626 | + 1627 | + cfs_rq = &cpu_rq(cpu)->cfs; 1628 | + util = READ_ONCE(cfs_rq->avg.util_avg); 1629 | + 1630 | + if (sched_feat(UTIL_EST)) 1631 | + util = max(util, READ_ONCE(cfs_rq->avg.util_est.enqueued)); 1632 | + 1633 | + return min_t(unsigned long, util, capacity_orig_of(cpu)); 1634 | +} 1635 | + 1636 | +static unsigned long capacity_of(int cpu) 1637 | +{ 1638 | + return cpu_rq(cpu)->cpu_capacity; 1639 | +} 1640 | + 1641 | +static inline enum 1642 | +numa_type numa_classify(unsigned int 
imbalance_pct, 1643 | + struct numa_stats *ns) 1644 | +{ 1645 | + if ((ns->nr_running > ns->weight) && 1646 | + (((ns->compute_capacity * 100) < (ns->util * imbalance_pct)) || 1647 | + ((ns->compute_capacity * imbalance_pct) < (ns->runnable * 100)))) 1648 | + return node_overloaded; 1649 | + 1650 | + if ((ns->nr_running < ns->weight) || 1651 | + (((ns->compute_capacity * 100) > (ns->util * imbalance_pct)) && 1652 | + ((ns->compute_capacity * imbalance_pct) > (ns->runnable * 100)))) 1653 | + return node_has_spare; 1654 | + 1655 | + return node_fully_busy; 1656 | +} 1657 | + 1658 | +#ifdef CONFIG_SCHED_SMT 1659 | +/* Forward declarations of select_idle_sibling helpers */ 1660 | +static inline bool test_idle_cores(int cpu, bool def); 1661 | +static inline int numa_idle_core(int idle_core, int cpu) 1662 | +{ 1663 | + if (!static_branch_likely(&sched_smt_present) || 1664 | + idle_core >= 0 || !test_idle_cores(cpu, false)) 1665 | + return idle_core; 1666 | + 1667 | + /* 1668 | + * Prefer cores instead of packing HT siblings 1669 | + * and triggering future load balancing. 1670 | + */ 1671 | + if (is_core_idle(cpu)) 1672 | + idle_core = cpu; 1673 | + 1674 | + return idle_core; 1675 | +} 1676 | +#else 1677 | +static inline int numa_idle_core(int idle_core, int cpu) 1678 | +{ 1679 | + return idle_core; 1680 | +} 1681 | +#endif 1682 | + 1683 | +/* 1684 | + * Gather all necessary information to make NUMA balancing placement 1685 | + * decisions that are compatible with standard load balancer. This 1686 | + * borrows code and logic from update_sg_lb_stats but sharing a 1687 | + * common implementation is impractical. 1688 | + */ 1689 | +static void update_numa_stats(struct task_numa_env *env, 1690 | + struct numa_stats *ns, int nid, 1691 | + bool find_idle) 1692 | +{ 1693 | + int cpu, idle_core = -1; 1694 | + 1695 | + memset(ns, 0, sizeof(*ns)); 1696 | + ns->idle_cpu = -1; 1697 | + 1698 | + rcu_read_lock(); 1699 | + for_each_cpu(cpu, cpumask_of_node(nid)) { 1700 | + struct rq *rq = cpu_rq(cpu); 1701 | + 1702 | + ns->load += cpu_load(rq); 1703 | + ns->runnable += cpu_runnable(rq); 1704 | + ns->util += cpu_util(cpu); 1705 | + ns->nr_running += rq->cfs.h_nr_running; 1706 | + ns->compute_capacity += capacity_of(cpu); 1707 | + 1708 | + if (find_idle && !rq->nr_running && idle_cpu(cpu)) { 1709 | + if (READ_ONCE(rq->numa_migrate_on) || 1710 | + !cpumask_test_cpu(cpu, env->p->cpus_ptr)) 1711 | + continue; 1712 | + 1713 | + if (ns->idle_cpu == -1) 1714 | + ns->idle_cpu = cpu; 1715 | + 1716 | + idle_core = numa_idle_core(idle_core, cpu); 1717 | + } 1718 | + } 1719 | + rcu_read_unlock(); 1720 | + 1721 | + ns->weight = cpumask_weight(cpumask_of_node(nid)); 1722 | + 1723 | + ns->node_type = numa_classify(env->imbalance_pct, ns); 1724 | + 1725 | + if (idle_core >= 0) 1726 | + ns->idle_cpu = idle_core; 1727 | +} 1728 | + 1729 | +static void task_numa_assign(struct task_numa_env *env, 1730 | + struct task_struct *p, long imp) 1731 | +{ 1732 | + struct rq *rq = cpu_rq(env->dst_cpu); 1733 | + 1734 | + /* Check if run-queue part of active NUMA balance. */ 1735 | + if (env->best_cpu != env->dst_cpu && xchg(&rq->numa_migrate_on, 1)) { 1736 | + int cpu; 1737 | + int start = env->dst_cpu; 1738 | + 1739 | + /* Find alternative idle CPU. 
*/ 1740 | + for_each_cpu_wrap(cpu, cpumask_of_node(env->dst_nid), start) { 1741 | + if (cpu == env->best_cpu || !idle_cpu(cpu) || 1742 | + !cpumask_test_cpu(cpu, env->p->cpus_ptr)) { 1743 | + continue; 1744 | + } 1745 | + 1746 | + env->dst_cpu = cpu; 1747 | + rq = cpu_rq(env->dst_cpu); 1748 | + if (!xchg(&rq->numa_migrate_on, 1)) 1749 | + goto assign; 1750 | + } 1751 | + 1752 | + /* Failed to find an alternative idle CPU */ 1753 | + return; 1754 | + } 1755 | + 1756 | +assign: 1757 | + /* 1758 | + * Clear previous best_cpu/rq numa-migrate flag, since task now 1759 | + * found a better CPU to move/swap. 1760 | + */ 1761 | + if (env->best_cpu != -1 && env->best_cpu != env->dst_cpu) { 1762 | + rq = cpu_rq(env->best_cpu); 1763 | + WRITE_ONCE(rq->numa_migrate_on, 0); 1764 | + } 1765 | + 1766 | + if (env->best_task) 1767 | + put_task_struct(env->best_task); 1768 | + if (p) 1769 | + get_task_struct(p); 1770 | + 1771 | + env->best_task = p; 1772 | + env->best_imp = imp; 1773 | + env->best_cpu = env->dst_cpu; 1774 | +} 1775 | + 1776 | +static bool load_too_imbalanced(long src_load, long dst_load, 1777 | + struct task_numa_env *env) 1778 | +{ 1779 | + long imb, old_imb; 1780 | + long orig_src_load, orig_dst_load; 1781 | + long src_capacity, dst_capacity; 1782 | + 1783 | + /* 1784 | + * The load is corrected for the CPU capacity available on each node. 1785 | + * 1786 | + * src_load dst_load 1787 | + * ------------ vs --------- 1788 | + * src_capacity dst_capacity 1789 | + */ 1790 | + src_capacity = env->src_stats.compute_capacity; 1791 | + dst_capacity = env->dst_stats.compute_capacity; 1792 | + 1793 | + imb = abs(dst_load * src_capacity - src_load * dst_capacity); 1794 | + 1795 | + orig_src_load = env->src_stats.load; 1796 | + orig_dst_load = env->dst_stats.load; 1797 | + 1798 | + old_imb = abs(orig_dst_load * src_capacity - orig_src_load * dst_capacity); 1799 | + 1800 | + /* Would this change make things worse? */ 1801 | + return (imb > old_imb); 1802 | +} 1803 | + 1804 | +static unsigned int task_scan_start(struct task_struct *p) 1805 | +{ 1806 | + unsigned long smin = task_scan_min(p); 1807 | + unsigned long period = smin; 1808 | + struct numa_group *ng; 1809 | + 1810 | + /* Scale the maximum scan period with the amount of shared memory. */ 1811 | + rcu_read_lock(); 1812 | + ng = rcu_dereference(p->numa_group); 1813 | + if (ng) { 1814 | + unsigned long shared = group_faults_shared(ng); 1815 | + unsigned long private = group_faults_priv(ng); 1816 | + 1817 | + period *= refcount_read(&ng->refcount); 1818 | + period *= shared + 1; 1819 | + period /= private + shared + 1; 1820 | + } 1821 | + rcu_read_unlock(); 1822 | + 1823 | + return max(smin, period); 1824 | +} 1825 | + 1826 | +static void update_scan_period(struct task_struct *p, int new_cpu) 1827 | +{ 1828 | + int src_nid = cpu_to_node(task_cpu(p)); 1829 | + int dst_nid = cpu_to_node(new_cpu); 1830 | + 1831 | + if (!static_branch_likely(&sched_numa_balancing)) 1832 | + return; 1833 | + 1834 | + if (!p->mm || !p->numa_faults || (p->flags & PF_EXITING)) 1835 | + return; 1836 | + 1837 | + if (src_nid == dst_nid) 1838 | + return; 1839 | + 1840 | + /* 1841 | + * Allow resets if faults have been trapped before one scan 1842 | + * has completed. This is most likely due to a new task that 1843 | + * is pulled cross-node due to wakeups or load balancing. 
1844 | + */ 1845 | + if (p->numa_scan_seq) { 1846 | + /* 1847 | + * Avoid scan adjustments if moving to the preferred 1848 | + * node or if the task was not previously running on 1849 | + * the preferred node. 1850 | + */ 1851 | + if (dst_nid == p->numa_preferred_nid || 1852 | + (p->numa_preferred_nid != NUMA_NO_NODE && 1853 | + src_nid != p->numa_preferred_nid)) 1854 | + return; 1855 | + } 1856 | + 1857 | + p->numa_scan_period = task_scan_start(p); 1858 | +} 1859 | + 1860 | +/* 1861 | + * Allow a NUMA imbalance if busy CPUs is less than 25% of the domain. 1862 | + * This is an approximation as the number of running tasks may not be 1863 | + * related to the number of busy CPUs due to sched_setaffinity. 1864 | + */ 1865 | +static inline bool allow_numa_imbalance(int dst_running, int dst_weight) 1866 | +{ 1867 | + return (dst_running < (dst_weight >> 2)); 1868 | +} 1869 | + 1870 | +#define NUMA_IMBALANCE_MIN 2 1871 | + 1872 | +static inline long adjust_numa_imbalance(int imbalance, 1873 | + int dst_running, int dst_weight) 1874 | +{ 1875 | + if (!allow_numa_imbalance(dst_running, dst_weight)) 1876 | + return imbalance; 1877 | + 1878 | + /* 1879 | + * Allow a small imbalance based on a simple pair of communicating 1880 | + * tasks that remain local when the destination is lightly loaded. 1881 | + */ 1882 | + if (imbalance <= NUMA_IMBALANCE_MIN) 1883 | + return 0; 1884 | + 1885 | + return imbalance; 1886 | +} 1887 | + 1888 | +static unsigned long task_h_load(struct task_struct *p) 1889 | +{ 1890 | + return p->se.avg.load_avg; 1891 | +} 1892 | + 1893 | +/* 1894 | + * Maximum NUMA importance can be 1998 (2*999); 1895 | + * SMALLIMP @ 30 would be close to 1998/64. 1896 | + * Used to deter task migration. 1897 | + */ 1898 | +#define SMALLIMP 30 1899 | + 1900 | +/* 1901 | + * This checks if the overall compute and NUMA accesses of the system would 1902 | + * be improved if the source tasks was migrated to the target dst_cpu taking 1903 | + * into account that it might be best if task running on the dst_cpu should 1904 | + * be exchanged with the source task 1905 | + */ 1906 | +static bool task_numa_compare(struct task_numa_env *env, 1907 | + long taskimp, long groupimp, bool maymove) 1908 | +{ 1909 | + struct numa_group *cur_ng, *p_ng = deref_curr_numa_group(env->p); 1910 | + struct rq *dst_rq = cpu_rq(env->dst_cpu); 1911 | + long imp = p_ng ? groupimp : taskimp; 1912 | + struct task_struct *cur; 1913 | + long src_load, dst_load; 1914 | + int dist = env->dist; 1915 | + long moveimp = imp; 1916 | + long load; 1917 | + bool stopsearch = false; 1918 | + 1919 | + if (READ_ONCE(dst_rq->numa_migrate_on)) 1920 | + return false; 1921 | + 1922 | + rcu_read_lock(); 1923 | + cur = rcu_dereference(dst_rq->curr); 1924 | + if (cur && ((cur->flags & PF_EXITING) || is_idle_task(cur))) 1925 | + cur = NULL; 1926 | + 1927 | + /* 1928 | + * Because we have preemption enabled we can get migrated around and 1929 | + * end try selecting ourselves (current == env->p) as a swap candidate. 1930 | + */ 1931 | + if (cur == env->p) { 1932 | + stopsearch = true; 1933 | + goto unlock; 1934 | + } 1935 | + 1936 | + if (!cur) { 1937 | + if (maymove && moveimp >= env->best_imp) 1938 | + goto assign; 1939 | + else 1940 | + goto unlock; 1941 | + } 1942 | + 1943 | + /* Skip this swap candidate if cannot move to the source cpu. 
*/ 1944 | + if (!cpumask_test_cpu(env->src_cpu, cur->cpus_ptr)) 1945 | + goto unlock; 1946 | + 1947 | + /* 1948 | + * Skip this swap candidate if it is not moving to its preferred 1949 | + * node and the best task is. 1950 | + */ 1951 | + if (env->best_task && 1952 | + env->best_task->numa_preferred_nid == env->src_nid && 1953 | + cur->numa_preferred_nid != env->src_nid) { 1954 | + goto unlock; 1955 | + } 1956 | + 1957 | + /* 1958 | + * "imp" is the fault differential for the source task between the 1959 | + * source and destination node. Calculate the total differential for 1960 | + * the source task and potential destination task. The more negative 1961 | + * the value is, the more remote accesses that would be expected to 1962 | + * be incurred if the tasks were swapped. 1963 | + * 1964 | + * If dst and source tasks are in the same NUMA group, or not 1965 | + * in any group then look only at task weights. 1966 | + */ 1967 | + cur_ng = rcu_dereference(cur->numa_group); 1968 | + if (cur_ng == p_ng) { 1969 | + imp = taskimp + task_weight(cur, env->src_nid, dist) - 1970 | + task_weight(cur, env->dst_nid, dist); 1971 | + /* 1972 | + * Add some hysteresis to prevent swapping the 1973 | + * tasks within a group over tiny differences. 1974 | + */ 1975 | + if (cur_ng) 1976 | + imp -= imp / 16; 1977 | + } else { 1978 | + /* 1979 | + * Compare the group weights. If a task is all by itself 1980 | + * (not part of a group), use the task weight instead. 1981 | + */ 1982 | + if (cur_ng && p_ng) 1983 | + imp += group_weight(cur, env->src_nid, dist) - 1984 | + group_weight(cur, env->dst_nid, dist); 1985 | + else 1986 | + imp += task_weight(cur, env->src_nid, dist) - 1987 | + task_weight(cur, env->dst_nid, dist); 1988 | + } 1989 | + 1990 | + /* Discourage picking a task already on its preferred node */ 1991 | + if (cur->numa_preferred_nid == env->dst_nid) 1992 | + imp -= imp / 16; 1993 | + 1994 | + /* 1995 | + * Encourage picking a task that moves to its preferred node. 1996 | + * This potentially makes imp larger than it's maximum of 1997 | + * 1998 (see SMALLIMP and task_weight for why) but in this 1998 | + * case, it does not matter. 1999 | + */ 2000 | + if (cur->numa_preferred_nid == env->src_nid) 2001 | + imp += imp / 8; 2002 | + 2003 | + if (maymove && moveimp > imp && moveimp > env->best_imp) { 2004 | + imp = moveimp; 2005 | + cur = NULL; 2006 | + goto assign; 2007 | + } 2008 | + 2009 | + /* 2010 | + * Prefer swapping with a task moving to its preferred node over a 2011 | + * task that is not. 2012 | + */ 2013 | + if (env->best_task && cur->numa_preferred_nid == env->src_nid && 2014 | + env->best_task->numa_preferred_nid != env->src_nid) { 2015 | + goto assign; 2016 | + } 2017 | + 2018 | + /* 2019 | + * If the NUMA importance is less than SMALLIMP, 2020 | + * task migration might only result in ping pong 2021 | + * of tasks and also hurt performance due to cache 2022 | + * misses. 2023 | + */ 2024 | + if (imp < SMALLIMP || imp <= env->best_imp + SMALLIMP / 2) 2025 | + goto unlock; 2026 | + 2027 | + /* 2028 | + * In the overloaded case, try and keep the load balanced. 2029 | + */ 2030 | + load = task_h_load(env->p) - task_h_load(cur); 2031 | + if (!load) 2032 | + goto assign; 2033 | + 2034 | + dst_load = env->dst_stats.load + load; 2035 | + src_load = env->src_stats.load - load; 2036 | + 2037 | + if (load_too_imbalanced(src_load, dst_load, env)) 2038 | + goto unlock; 2039 | + 2040 | +assign: 2041 | + /* Evaluate an idle CPU for a task numa move. 
*/ 2042 | + if (!cur) { 2043 | + int cpu = env->dst_stats.idle_cpu; 2044 | + 2045 | + /* Nothing cached so current CPU went idle since the search. */ 2046 | + if (cpu < 0) 2047 | + cpu = env->dst_cpu; 2048 | + 2049 | + /* 2050 | + * If the CPU is no longer truly idle and the previous best CPU 2051 | + * is, keep using it. 2052 | + */ 2053 | + if (!idle_cpu(cpu) && env->best_cpu >= 0 && 2054 | + idle_cpu(env->best_cpu)) { 2055 | + cpu = env->best_cpu; 2056 | + } 2057 | + 2058 | + env->dst_cpu = cpu; 2059 | + } 2060 | + 2061 | + task_numa_assign(env, cur, imp); 2062 | + 2063 | + /* 2064 | + * If a move to idle is allowed because there is capacity or load 2065 | + * balance improves then stop the search. While a better swap 2066 | + * candidate may exist, a search is not free. 2067 | + */ 2068 | + if (maymove && !cur && env->best_cpu >= 0 && idle_cpu(env->best_cpu)) 2069 | + stopsearch = true; 2070 | + 2071 | + /* 2072 | + * If a swap candidate must be identified and the current best task 2073 | + * moves its preferred node then stop the search. 2074 | + */ 2075 | + if (!maymove && env->best_task && 2076 | + env->best_task->numa_preferred_nid == env->src_nid) { 2077 | + stopsearch = true; 2078 | + } 2079 | +unlock: 2080 | + rcu_read_unlock(); 2081 | + 2082 | + return stopsearch; 2083 | +} 2084 | + 2085 | +static void task_numa_find_cpu(struct task_numa_env *env, 2086 | + long taskimp, long groupimp) 2087 | +{ 2088 | + bool maymove = false; 2089 | + int cpu; 2090 | + 2091 | + /* 2092 | + * If dst node has spare capacity, then check if there is an 2093 | + * imbalance that would be overruled by the load balancer. 2094 | + */ 2095 | + if (env->dst_stats.node_type == node_has_spare) { 2096 | + unsigned int imbalance; 2097 | + int src_running, dst_running; 2098 | + 2099 | + /* 2100 | + * Would movement cause an imbalance? Note that if src has 2101 | + * more running tasks that the imbalance is ignored as the 2102 | + * move improves the imbalance from the perspective of the 2103 | + * CPU load balancer. 2104 | + * */ 2105 | + src_running = env->src_stats.nr_running - 1; 2106 | + dst_running = env->dst_stats.nr_running + 1; 2107 | + imbalance = max(0, dst_running - src_running); 2108 | + imbalance = adjust_numa_imbalance(imbalance, dst_running, 2109 | + env->dst_stats.weight); 2110 | + 2111 | + /* Use idle CPU if there is no imbalance */ 2112 | + if (!imbalance) { 2113 | + maymove = true; 2114 | + if (env->dst_stats.idle_cpu >= 0) { 2115 | + env->dst_cpu = env->dst_stats.idle_cpu; 2116 | + task_numa_assign(env, NULL, 0); 2117 | + return; 2118 | + } 2119 | + } 2120 | + } else { 2121 | + long src_load, dst_load, load; 2122 | + /* 2123 | + * If the improvement from just moving env->p direction is better 2124 | + * than swapping tasks around, check if a move is possible. 
2125 | + */ 2126 | + load = task_h_load(env->p); 2127 | + dst_load = env->dst_stats.load + load; 2128 | + src_load = env->src_stats.load - load; 2129 | + maymove = !load_too_imbalanced(src_load, dst_load, env); 2130 | + } 2131 | + 2132 | + for_each_cpu(cpu, cpumask_of_node(env->dst_nid)) { 2133 | + /* Skip this CPU if the source task cannot migrate */ 2134 | + if (!cpumask_test_cpu(cpu, env->p->cpus_ptr)) 2135 | + continue; 2136 | + 2137 | + env->dst_cpu = cpu; 2138 | + if (task_numa_compare(env, taskimp, groupimp, maymove)) 2139 | + break; 2140 | + } 2141 | +} 2142 | + 2143 | +static int task_numa_migrate(struct task_struct *p) 2144 | +{ 2145 | + struct task_numa_env env = { 2146 | + .p = p, 2147 | + 2148 | + .src_cpu = task_cpu(p), 2149 | + .src_nid = task_node(p), 2150 | + 2151 | + .imbalance_pct = 112, 2152 | + 2153 | + .best_task = NULL, 2154 | + .best_imp = 0, 2155 | + .best_cpu = -1, 2156 | + }; 2157 | + unsigned long taskweight, groupweight; 2158 | + struct sched_domain *sd; 2159 | + long taskimp, groupimp; 2160 | + struct numa_group *ng; 2161 | + struct rq *best_rq; 2162 | + int nid, ret, dist; 2163 | + 2164 | + /* 2165 | + * Pick the lowest SD_NUMA domain, as that would have the smallest 2166 | + * imbalance and would be the first to start moving tasks about. 2167 | + * 2168 | + * And we want to avoid any moving of tasks about, as that would create 2169 | + * random movement of tasks -- counter the numa conditions we're trying 2170 | + * to satisfy here. 2171 | + */ 2172 | + rcu_read_lock(); 2173 | + sd = rcu_dereference(per_cpu(sd_numa, env.src_cpu)); 2174 | + if (sd) 2175 | + env.imbalance_pct = 100 + (sd->imbalance_pct - 100) / 2; 2176 | + rcu_read_unlock(); 2177 | + 2178 | + /* 2179 | + * Cpusets can break the scheduler domain tree into smaller 2180 | + * balance domains, some of which do not cross NUMA boundaries. 2181 | + * Tasks that are "trapped" in such domains cannot be migrated 2182 | + * elsewhere, so there is no point in (re)trying. 2183 | + */ 2184 | + if (unlikely(!sd)) { 2185 | + sched_setnuma(p, task_node(p)); 2186 | + return -EINVAL; 2187 | + } 2188 | + 2189 | + env.dst_nid = p->numa_preferred_nid; 2190 | + dist = env.dist = node_distance(env.src_nid, env.dst_nid); 2191 | + taskweight = task_weight(p, env.src_nid, dist); 2192 | + groupweight = group_weight(p, env.src_nid, dist); 2193 | + update_numa_stats(&env, &env.src_stats, env.src_nid, false); 2194 | + taskimp = task_weight(p, env.dst_nid, dist) - taskweight; 2195 | + groupimp = group_weight(p, env.dst_nid, dist) - groupweight; 2196 | + update_numa_stats(&env, &env.dst_stats, env.dst_nid, true); 2197 | + 2198 | + /* Try to find a spot on the preferred nid. */ 2199 | + task_numa_find_cpu(&env, taskimp, groupimp); 2200 | + 2201 | + /* 2202 | + * Look at other nodes in these cases: 2203 | + * - there is no space available on the preferred_nid 2204 | + * - the task is part of a numa_group that is interleaved across 2205 | + * multiple NUMA nodes; in order to better consolidate the group, 2206 | + * we need to check other locations. 
2207 | + */ 2208 | + ng = deref_curr_numa_group(p); 2209 | + if (env.best_cpu == -1 || (ng && ng->active_nodes > 1)) { 2210 | + for_each_online_node(nid) { 2211 | + if (nid == env.src_nid || nid == p->numa_preferred_nid) 2212 | + continue; 2213 | + 2214 | + dist = node_distance(env.src_nid, env.dst_nid); 2215 | + if (sched_numa_topology_type == NUMA_BACKPLANE && 2216 | + dist != env.dist) { 2217 | + taskweight = task_weight(p, env.src_nid, dist); 2218 | + groupweight = group_weight(p, env.src_nid, dist); 2219 | + } 2220 | + 2221 | + /* Only consider nodes where both task and groups benefit */ 2222 | + taskimp = task_weight(p, nid, dist) - taskweight; 2223 | + groupimp = group_weight(p, nid, dist) - groupweight; 2224 | + if (taskimp < 0 && groupimp < 0) 2225 | + continue; 2226 | + 2227 | + env.dist = dist; 2228 | + env.dst_nid = nid; 2229 | + update_numa_stats(&env, &env.dst_stats, env.dst_nid, true); 2230 | + task_numa_find_cpu(&env, taskimp, groupimp); 2231 | + } 2232 | + } 2233 | + 2234 | + /* 2235 | + * If the task is part of a workload that spans multiple NUMA nodes, 2236 | + * and is migrating into one of the workload's active nodes, remember 2237 | + * this node as the task's preferred numa node, so the workload can 2238 | + * settle down. 2239 | + * A task that migrated to a second choice node will be better off 2240 | + * trying for a better one later. Do not set the preferred node here. 2241 | + */ 2242 | + if (ng) { 2243 | + if (env.best_cpu == -1) 2244 | + nid = env.src_nid; 2245 | + else 2246 | + nid = cpu_to_node(env.best_cpu); 2247 | + 2248 | + if (nid != p->numa_preferred_nid) 2249 | + sched_setnuma(p, nid); 2250 | + } 2251 | + 2252 | + /* No better CPU than the current one was found. */ 2253 | + if (env.best_cpu == -1) { 2254 | + trace_sched_stick_numa(p, env.src_cpu, NULL, -1); 2255 | + return -EAGAIN; 2256 | + } 2257 | + 2258 | + best_rq = cpu_rq(env.best_cpu); 2259 | + if (env.best_task == NULL) { 2260 | + ret = migrate_task_to(p, env.best_cpu); 2261 | + WRITE_ONCE(best_rq->numa_migrate_on, 0); 2262 | + if (ret != 0) 2263 | + trace_sched_stick_numa(p, env.src_cpu, NULL, env.best_cpu); 2264 | + return ret; 2265 | + } 2266 | + 2267 | + ret = migrate_swap(p, env.best_task, env.best_cpu, env.src_cpu); 2268 | + WRITE_ONCE(best_rq->numa_migrate_on, 0); 2269 | + 2270 | + if (ret != 0) 2271 | + trace_sched_stick_numa(p, env.src_cpu, env.best_task, env.best_cpu); 2272 | + put_task_struct(env.best_task); 2273 | + return ret; 2274 | +} 2275 | + 2276 | +/* Attempt to migrate a task to a CPU on the preferred node. */ 2277 | +static void numa_migrate_preferred(struct task_struct *p) 2278 | +{ 2279 | + unsigned long interval = HZ; 2280 | + 2281 | + /* This task has no NUMA fault statistics yet */ 2282 | + if (unlikely(p->numa_preferred_nid == NUMA_NO_NODE || !p->numa_faults)) 2283 | + return; 2284 | + 2285 | + /* Periodically retry migrating the task to the preferred node */ 2286 | + interval = min(interval, msecs_to_jiffies(p->numa_scan_period) / 16); 2287 | + p->numa_migrate_retry = jiffies + interval; 2288 | + 2289 | + /* Success if task is already running on preferred CPU */ 2290 | + if (task_node(p) == p->numa_preferred_nid) 2291 | + return; 2292 | + 2293 | + /* Otherwise, try migrate to a CPU on the preferred node */ 2294 | + task_numa_migrate(p); 2295 | +} 2296 | + 2297 | +/* 2298 | + * Find out how many nodes on the workload is actively running on. Do this by 2299 | + * tracking the nodes from which NUMA hinting faults are triggered. 
This can 2300 | + * be different from the set of nodes where the workload's memory is currently 2301 | + * located. 2302 | + */ 2303 | +static void numa_group_count_active_nodes(struct numa_group *numa_group) 2304 | +{ 2305 | + unsigned long faults, max_faults = 0; 2306 | + int nid, active_nodes = 0; 2307 | + 2308 | + for_each_online_node(nid) { 2309 | + faults = group_faults_cpu(numa_group, nid); 2310 | + if (faults > max_faults) 2311 | + max_faults = faults; 2312 | + } 2313 | + 2314 | + for_each_online_node(nid) { 2315 | + faults = group_faults_cpu(numa_group, nid); 2316 | + if (faults * ACTIVE_NODE_FRACTION > max_faults) 2317 | + active_nodes++; 2318 | + } 2319 | + 2320 | + numa_group->max_faults_cpu = max_faults; 2321 | + numa_group->active_nodes = active_nodes; 2322 | +} 2323 | + 2324 | +#define NUMA_PERIOD_SLOTS 10 2325 | +#define NUMA_PERIOD_THRESHOLD 7 2326 | + 2327 | +/* 2328 | + * Increase the scan period (slow down scanning) if the majority of 2329 | + * our memory is already on our local node, or if the majority of 2330 | + * the page accesses are shared with other processes. 2331 | + * Otherwise, decrease the scan period. 2332 | + */ 2333 | +static void update_task_scan_period(struct task_struct *p, 2334 | + unsigned long shared, unsigned long private) 2335 | +{ 2336 | + unsigned int period_slot; 2337 | + int lr_ratio, ps_ratio; 2338 | + int diff; 2339 | + 2340 | + unsigned long remote = p->numa_faults_locality[0]; 2341 | + unsigned long local = p->numa_faults_locality[1]; 2342 | + 2343 | + /* 2344 | + * If there were no record hinting faults then either the task is 2345 | + * completely idle or all activity is areas that are not of interest 2346 | + * to automatic numa balancing. Related to that, if there were failed 2347 | + * migration then it implies we are migrating too quickly or the local 2348 | + * node is overloaded. In either case, scan slower 2349 | + */ 2350 | + if (local + shared == 0 || p->numa_faults_locality[2]) { 2351 | + p->numa_scan_period = min(p->numa_scan_period_max, 2352 | + p->numa_scan_period << 1); 2353 | + 2354 | + p->mm->numa_next_scan = jiffies + 2355 | + msecs_to_jiffies(p->numa_scan_period); 2356 | + 2357 | + return; 2358 | + } 2359 | + 2360 | + /* 2361 | + * Prepare to scale scan period relative to the current period. 2362 | + * == NUMA_PERIOD_THRESHOLD scan period stays the same 2363 | + * < NUMA_PERIOD_THRESHOLD scan period decreases (scan faster) 2364 | + * >= NUMA_PERIOD_THRESHOLD scan period increases (scan slower) 2365 | + */ 2366 | + period_slot = DIV_ROUND_UP(p->numa_scan_period, NUMA_PERIOD_SLOTS); 2367 | + lr_ratio = (local * NUMA_PERIOD_SLOTS) / (local + remote); 2368 | + ps_ratio = (private * NUMA_PERIOD_SLOTS) / (private + shared); 2369 | + 2370 | + if (ps_ratio >= NUMA_PERIOD_THRESHOLD) { 2371 | + /* 2372 | + * Most memory accesses are local. There is no need to 2373 | + * do fast NUMA scanning, since memory is already local. 2374 | + */ 2375 | + int slot = ps_ratio - NUMA_PERIOD_THRESHOLD; 2376 | + if (!slot) 2377 | + slot = 1; 2378 | + diff = slot * period_slot; 2379 | + } else if (lr_ratio >= NUMA_PERIOD_THRESHOLD) { 2380 | + /* 2381 | + * Most memory accesses are shared with other tasks. 2382 | + * There is no point in continuing fast NUMA scanning, 2383 | + * since other tasks may just move the memory elsewhere. 
2384 | + */ 2385 | + int slot = lr_ratio - NUMA_PERIOD_THRESHOLD; 2386 | + if (!slot) 2387 | + slot = 1; 2388 | + diff = slot * period_slot; 2389 | + } else { 2390 | + /* 2391 | + * Private memory faults exceed (SLOTS-THRESHOLD)/SLOTS, 2392 | + * yet they are not on the local NUMA node. Speed up 2393 | + * NUMA scanning to get the memory moved over. 2394 | + */ 2395 | + int ratio = max(lr_ratio, ps_ratio); 2396 | + diff = -(NUMA_PERIOD_THRESHOLD - ratio) * period_slot; 2397 | + } 2398 | + 2399 | + p->numa_scan_period = clamp(p->numa_scan_period + diff, 2400 | + task_scan_min(p), task_scan_max(p)); 2401 | + memset(p->numa_faults_locality, 0, sizeof(p->numa_faults_locality)); 2402 | +} 2403 | + 2404 | +/* 2405 | + * Get the fraction of time the task has been running since the last 2406 | + * NUMA placement cycle. The scheduler keeps similar statistics, but 2407 | + * decays those on a 32ms period, which is orders of magnitude off 2408 | + * from the dozens-of-seconds NUMA balancing period. Use the scheduler 2409 | + * stats only if the task is so new there are no NUMA statistics yet. 2410 | + */ 2411 | +static u64 numa_get_avg_runtime(struct task_struct *p, u64 *period) 2412 | +{ 2413 | + u64 runtime, delta, now; 2414 | + /* Use the start of this time slice to avoid calculations. */ 2415 | + now = p->se.exec_start; 2416 | + runtime = p->se.sum_exec_runtime; 2417 | + 2418 | + if (p->last_task_numa_placement) { 2419 | + delta = runtime - p->last_sum_exec_runtime; 2420 | + *period = now - p->last_task_numa_placement; 2421 | + 2422 | + /* Avoid time going backwards, prevent potential divide error: */ 2423 | + if (unlikely((s64)*period < 0)) 2424 | + *period = 0; 2425 | + } else { 2426 | + delta = p->se.avg.load_sum; 2427 | + *period = LOAD_AVG_MAX; 2428 | + } 2429 | + 2430 | + p->last_sum_exec_runtime = runtime; 2431 | + p->last_task_numa_placement = now; 2432 | + 2433 | + return delta; 2434 | +} 2435 | + 2436 | +/* 2437 | + * Determine the preferred nid for a task in a numa_group. This needs to 2438 | + * be done in a way that produces consistent results with group_weight, 2439 | + * otherwise workloads might not converge. 2440 | + */ 2441 | +static int preferred_group_nid(struct task_struct *p, int nid) 2442 | +{ 2443 | + nodemask_t nodes; 2444 | + int dist; 2445 | + 2446 | + /* Direct connections between all NUMA nodes. */ 2447 | + if (sched_numa_topology_type == NUMA_DIRECT) 2448 | + return nid; 2449 | + 2450 | + /* 2451 | + * On a system with glueless mesh NUMA topology, group_weight 2452 | + * scores nodes according to the number of NUMA hinting faults on 2453 | + * both the node itself, and on nearby nodes. 2454 | + */ 2455 | + if (sched_numa_topology_type == NUMA_GLUELESS_MESH) { 2456 | + unsigned long score, max_score = 0; 2457 | + int node, max_node = nid; 2458 | + 2459 | + dist = sched_max_numa_distance; 2460 | + 2461 | + for_each_online_node(node) { 2462 | + score = group_weight(p, node, dist); 2463 | + if (score > max_score) { 2464 | + max_score = score; 2465 | + max_node = node; 2466 | + } 2467 | + } 2468 | + return max_node; 2469 | + } 2470 | + 2471 | + /* 2472 | + * Finding the preferred nid in a system with NUMA backplane 2473 | + * interconnect topology is more involved. The goal is to locate 2474 | + * tasks from numa_groups near each other in the system, and 2475 | + * untangle workloads from different sides of the system. This requires 2476 | + * searching down the hierarchy of node groups, recursively searching 2477 | + * inside the highest scoring group of nodes. 
The nodemask tricks 2478 | + * keep the complexity of the search down. 2479 | + */ 2480 | + nodes = node_online_map; 2481 | + for (dist = sched_max_numa_distance; dist > LOCAL_DISTANCE; dist--) { 2482 | + unsigned long max_faults = 0; 2483 | + nodemask_t max_group = NODE_MASK_NONE; 2484 | + int a, b; 2485 | + 2486 | + /* Are there nodes at this distance from each other? */ 2487 | + if (!find_numa_distance(dist)) 2488 | + continue; 2489 | + 2490 | + for_each_node_mask(a, nodes) { 2491 | + unsigned long faults = 0; 2492 | + nodemask_t this_group; 2493 | + nodes_clear(this_group); 2494 | + 2495 | + /* Sum group's NUMA faults; includes a==b case. */ 2496 | + for_each_node_mask(b, nodes) { 2497 | + if (node_distance(a, b) < dist) { 2498 | + faults += group_faults(p, b); 2499 | + node_set(b, this_group); 2500 | + node_clear(b, nodes); 2501 | + } 2502 | + } 2503 | + 2504 | + /* Remember the top group. */ 2505 | + if (faults > max_faults) { 2506 | + max_faults = faults; 2507 | + max_group = this_group; 2508 | + /* 2509 | + * subtle: at the smallest distance there is 2510 | + * just one node left in each "group", the 2511 | + * winner is the preferred nid. 2512 | + */ 2513 | + nid = a; 2514 | + } 2515 | + } 2516 | + /* Next round, evaluate the nodes within max_group. */ 2517 | + if (!max_faults) 2518 | + break; 2519 | + nodes = max_group; 2520 | + } 2521 | + return nid; 2522 | +} 2523 | + 2524 | +static void task_numa_placement(struct task_struct *p) 2525 | +{ 2526 | + int seq, nid, max_nid = NUMA_NO_NODE; 2527 | + unsigned long max_faults = 0; 2528 | + unsigned long fault_types[2] = { 0, 0 }; 2529 | + unsigned long total_faults; 2530 | + u64 runtime, period; 2531 | + spinlock_t *group_lock = NULL; 2532 | + struct numa_group *ng; 2533 | + 2534 | + /* 2535 | + * The p->mm->numa_scan_seq field gets updated without 2536 | + * exclusive access. 
Use READ_ONCE() here to ensure 2537 | + * that the field is read in a single access: 2538 | + */ 2539 | + seq = READ_ONCE(p->mm->numa_scan_seq); 2540 | + if (p->numa_scan_seq == seq) 2541 | + return; 2542 | + p->numa_scan_seq = seq; 2543 | + p->numa_scan_period_max = task_scan_max(p); 2544 | + 2545 | + total_faults = p->numa_faults_locality[0] + 2546 | + p->numa_faults_locality[1]; 2547 | + runtime = numa_get_avg_runtime(p, &period); 2548 | + 2549 | + /* If the task is part of a group prevent parallel updates to group stats */ 2550 | + ng = deref_curr_numa_group(p); 2551 | + if (ng) { 2552 | + group_lock = &ng->lock; 2553 | + spin_lock_irq(group_lock); 2554 | + } 2555 | + 2556 | + /* Find the node with the highest number of faults */ 2557 | + for_each_online_node(nid) { 2558 | + /* Keep track of the offsets in numa_faults array */ 2559 | + int mem_idx, membuf_idx, cpu_idx, cpubuf_idx; 2560 | + unsigned long faults = 0, group_faults = 0; 2561 | + int priv; 2562 | + 2563 | + for (priv = 0; priv < NR_NUMA_HINT_FAULT_TYPES; priv++) { 2564 | + long diff, f_diff, f_weight; 2565 | + 2566 | + mem_idx = task_faults_idx(NUMA_MEM, nid, priv); 2567 | + membuf_idx = task_faults_idx(NUMA_MEMBUF, nid, priv); 2568 | + cpu_idx = task_faults_idx(NUMA_CPU, nid, priv); 2569 | + cpubuf_idx = task_faults_idx(NUMA_CPUBUF, nid, priv); 2570 | + 2571 | + /* Decay existing window, copy faults since last scan */ 2572 | + diff = p->numa_faults[membuf_idx] - p->numa_faults[mem_idx] / 2; 2573 | + fault_types[priv] += p->numa_faults[membuf_idx]; 2574 | + p->numa_faults[membuf_idx] = 0; 2575 | + 2576 | + /* 2577 | + * Normalize the faults_from, so all tasks in a group 2578 | + * count according to CPU use, instead of by the raw 2579 | + * number of faults. Tasks with little runtime have 2580 | + * little over-all impact on throughput, and thus their 2581 | + * faults are less important. 2582 | + */ 2583 | + f_weight = div64_u64(runtime << 16, period + 1); 2584 | + f_weight = (f_weight * p->numa_faults[cpubuf_idx]) / 2585 | + (total_faults + 1); 2586 | + f_diff = f_weight - p->numa_faults[cpu_idx] / 2; 2587 | + p->numa_faults[cpubuf_idx] = 0; 2588 | + 2589 | + p->numa_faults[mem_idx] += diff; 2590 | + p->numa_faults[cpu_idx] += f_diff; 2591 | + faults += p->numa_faults[mem_idx]; 2592 | + p->total_numa_faults += diff; 2593 | + if (ng) { 2594 | + /* 2595 | + * safe because we can only change our own group 2596 | + * 2597 | + * mem_idx represents the offset for a given 2598 | + * nid and priv in a specific region because it 2599 | + * is at the beginning of the numa_faults array. 
2600 | + */ 2601 | + ng->faults[mem_idx] += diff; 2602 | + ng->faults_cpu[mem_idx] += f_diff; 2603 | + ng->total_faults += diff; 2604 | + group_faults += ng->faults[mem_idx]; 2605 | + } 2606 | + } 2607 | + 2608 | + if (!ng) { 2609 | + if (faults > max_faults) { 2610 | + max_faults = faults; 2611 | + max_nid = nid; 2612 | + } 2613 | + } else if (group_faults > max_faults) { 2614 | + max_faults = group_faults; 2615 | + max_nid = nid; 2616 | + } 2617 | + } 2618 | + 2619 | + if (ng) { 2620 | + numa_group_count_active_nodes(ng); 2621 | + spin_unlock_irq(group_lock); 2622 | + max_nid = preferred_group_nid(p, max_nid); 2623 | + } 2624 | + 2625 | + if (max_faults) { 2626 | + /* Set the new preferred node */ 2627 | + if (max_nid != p->numa_preferred_nid) 2628 | + sched_setnuma(p, max_nid); 2629 | + } 2630 | + 2631 | + update_task_scan_period(p, fault_types[0], fault_types[1]); 2632 | +} 2633 | + 2634 | +static inline int get_numa_group(struct numa_group *grp) 2635 | +{ 2636 | + return refcount_inc_not_zero(&grp->refcount); 2637 | +} 2638 | + 2639 | +static inline void put_numa_group(struct numa_group *grp) 2640 | +{ 2641 | + if (refcount_dec_and_test(&grp->refcount)) 2642 | + kfree_rcu(grp, rcu); 2643 | +} 2644 | + 2645 | +static void task_numa_group(struct task_struct *p, int cpupid, int flags, 2646 | + int *priv) 2647 | +{ 2648 | + struct numa_group *grp, *my_grp; 2649 | + struct task_struct *tsk; 2650 | + bool join = false; 2651 | + int cpu = cpupid_to_cpu(cpupid); 2652 | + int i; 2653 | + 2654 | + if (unlikely(!deref_curr_numa_group(p))) { 2655 | + unsigned int size = sizeof(struct numa_group) + 2656 | + 4*nr_node_ids*sizeof(unsigned long); 2657 | + 2658 | + grp = kzalloc(size, GFP_KERNEL | __GFP_NOWARN); 2659 | + if (!grp) 2660 | + return; 2661 | + 2662 | + refcount_set(&grp->refcount, 1); 2663 | + grp->active_nodes = 1; 2664 | + grp->max_faults_cpu = 0; 2665 | + spin_lock_init(&grp->lock); 2666 | + grp->gid = p->pid; 2667 | + /* Second half of the array tracks nids where faults happen */ 2668 | + grp->faults_cpu = grp->faults + NR_NUMA_HINT_FAULT_TYPES * 2669 | + nr_node_ids; 2670 | + 2671 | + for (i = 0; i < NR_NUMA_HINT_FAULT_STATS * nr_node_ids; i++) 2672 | + grp->faults[i] = p->numa_faults[i]; 2673 | + 2674 | + grp->total_faults = p->total_numa_faults; 2675 | + 2676 | + grp->nr_tasks++; 2677 | + rcu_assign_pointer(p->numa_group, grp); 2678 | + } 2679 | + 2680 | + rcu_read_lock(); 2681 | + tsk = READ_ONCE(cpu_rq(cpu)->curr); 2682 | + 2683 | + if (!cpupid_match_pid(tsk, cpupid)) 2684 | + goto no_join; 2685 | + 2686 | + grp = rcu_dereference(tsk->numa_group); 2687 | + if (!grp) 2688 | + goto no_join; 2689 | + 2690 | + my_grp = deref_curr_numa_group(p); 2691 | + if (grp == my_grp) 2692 | + goto no_join; 2693 | + 2694 | + /* 2695 | + * Only join the other group if its bigger; if we're the bigger group, 2696 | + * the other task will join us. 2697 | + */ 2698 | + if (my_grp->nr_tasks > grp->nr_tasks) 2699 | + goto no_join; 2700 | + 2701 | + /* 2702 | + * Tie-break on the grp address. 2703 | + */ 2704 | + if (my_grp->nr_tasks == grp->nr_tasks && my_grp > grp) 2705 | + goto no_join; 2706 | + 2707 | + /* Always join threads in the same process. 
*/ 2708 | + if (tsk->mm == current->mm) 2709 | + join = true; 2710 | + 2711 | + /* Simple filter to avoid false positives due to PID collisions */ 2712 | + if (flags & TNF_SHARED) 2713 | + join = true; 2714 | + 2715 | + /* Update priv based on whether false sharing was detected */ 2716 | + *priv = !join; 2717 | + 2718 | + if (join && !get_numa_group(grp)) 2719 | + goto no_join; 2720 | + 2721 | + rcu_read_unlock(); 2722 | + 2723 | + if (!join) 2724 | + return; 2725 | + 2726 | + BUG_ON(irqs_disabled()); 2727 | + double_lock_irq(&my_grp->lock, &grp->lock); 2728 | + 2729 | + for (i = 0; i < NR_NUMA_HINT_FAULT_STATS * nr_node_ids; i++) { 2730 | + my_grp->faults[i] -= p->numa_faults[i]; 2731 | + grp->faults[i] += p->numa_faults[i]; 2732 | + } 2733 | + my_grp->total_faults -= p->total_numa_faults; 2734 | + grp->total_faults += p->total_numa_faults; 2735 | + 2736 | + my_grp->nr_tasks--; 2737 | + grp->nr_tasks++; 2738 | + 2739 | + spin_unlock(&my_grp->lock); 2740 | + spin_unlock_irq(&grp->lock); 2741 | + 2742 | + rcu_assign_pointer(p->numa_group, grp); 2743 | + 2744 | + put_numa_group(my_grp); 2745 | + return; 2746 | + 2747 | +no_join: 2748 | + rcu_read_unlock(); 2749 | + return; 2750 | +} 2751 | + 2752 | +/* 2753 | + * Get rid of NUMA statistics associated with a task (either current or dead). 2754 | + * If @final is set, the task is dead and has reached refcount zero, so we can 2755 | + * safely free all relevant data structures. Otherwise, there might be 2756 | + * concurrent reads from places like load balancing and procfs, and we should 2757 | + * reset the data back to default state without freeing ->numa_faults. 2758 | + */ 2759 | +void task_numa_free(struct task_struct *p, bool final) 2760 | +{ 2761 | + /* safe: p either is current or is being freed by current */ 2762 | + struct numa_group *grp = rcu_dereference_raw(p->numa_group); 2763 | + unsigned long *numa_faults = p->numa_faults; 2764 | + unsigned long flags; 2765 | + int i; 2766 | + 2767 | + if (!numa_faults) 2768 | + return; 2769 | + 2770 | + if (grp) { 2771 | + spin_lock_irqsave(&grp->lock, flags); 2772 | + for (i = 0; i < NR_NUMA_HINT_FAULT_STATS * nr_node_ids; i++) 2773 | + grp->faults[i] -= p->numa_faults[i]; 2774 | + grp->total_faults -= p->total_numa_faults; 2775 | + 2776 | + grp->nr_tasks--; 2777 | + spin_unlock_irqrestore(&grp->lock, flags); 2778 | + RCU_INIT_POINTER(p->numa_group, NULL); 2779 | + put_numa_group(grp); 2780 | + } 2781 | + 2782 | + if (final) { 2783 | + p->numa_faults = NULL; 2784 | + kfree(numa_faults); 2785 | + } else { 2786 | + p->total_numa_faults = 0; 2787 | + for (i = 0; i < NR_NUMA_HINT_FAULT_STATS * nr_node_ids; i++) 2788 | + numa_faults[i] = 0; 2789 | + } 2790 | +} 2791 | + 2792 | +/* 2793 | + * Got a PROT_NONE fault for a page on @node. 
2794 | + */ 2795 | +void task_numa_fault(int last_cpupid, int mem_node, int pages, int flags) 2796 | +{ 2797 | + struct task_struct *p = current; 2798 | + bool migrated = flags & TNF_MIGRATED; 2799 | + int cpu_node = task_node(current); 2800 | + int local = !!(flags & TNF_FAULT_LOCAL); 2801 | + struct numa_group *ng; 2802 | + int priv; 2803 | + 2804 | + if (!static_branch_likely(&sched_numa_balancing)) 2805 | + return; 2806 | + 2807 | + /* for example, ksmd faulting in a user's mm */ 2808 | + if (!p->mm) 2809 | + return; 2810 | + 2811 | + /* Allocate buffer to track faults on a per-node basis */ 2812 | + if (unlikely(!p->numa_faults)) { 2813 | + int size = sizeof(*p->numa_faults) * 2814 | + NR_NUMA_HINT_FAULT_BUCKETS * nr_node_ids; 2815 | + 2816 | + p->numa_faults = kzalloc(size, GFP_KERNEL|__GFP_NOWARN); 2817 | + if (!p->numa_faults) 2818 | + return; 2819 | + 2820 | + p->total_numa_faults = 0; 2821 | + memset(p->numa_faults_locality, 0, sizeof(p->numa_faults_locality)); 2822 | + } 2823 | + 2824 | + /* 2825 | + * First accesses are treated as private, otherwise consider accesses 2826 | + * to be private if the accessing pid has not changed 2827 | + */ 2828 | + if (unlikely(last_cpupid == (-1 & LAST_CPUPID_MASK))) { 2829 | + priv = 1; 2830 | + } else { 2831 | + priv = cpupid_match_pid(p, last_cpupid); 2832 | + if (!priv && !(flags & TNF_NO_GROUP)) 2833 | + task_numa_group(p, last_cpupid, flags, &priv); 2834 | + } 2835 | + 2836 | + /* 2837 | + * If a workload spans multiple NUMA nodes, a shared fault that 2838 | + * occurs wholly within the set of nodes that the workload is 2839 | + * actively using should be counted as local. This allows the 2840 | + * scan rate to slow down when a workload has settled down. 2841 | + */ 2842 | + ng = deref_curr_numa_group(p); 2843 | + if (!priv && !local && ng && ng->active_nodes > 1 && 2844 | + numa_is_active_node(cpu_node, ng) && 2845 | + numa_is_active_node(mem_node, ng)) 2846 | + local = 1; 2847 | + 2848 | + /* 2849 | + * Retry to migrate task to preferred node periodically, in case it 2850 | + * previously failed, or the scheduler moved us. 2851 | + */ 2852 | + if (time_after(jiffies, p->numa_migrate_retry)) { 2853 | + task_numa_placement(p); 2854 | + numa_migrate_preferred(p); 2855 | + } 2856 | + 2857 | + if (migrated) 2858 | + p->numa_pages_migrated += pages; 2859 | + if (flags & TNF_MIGRATE_FAIL) 2860 | + p->numa_faults_locality[2] += pages; 2861 | + 2862 | + p->numa_faults[task_faults_idx(NUMA_MEMBUF, mem_node, priv)] += pages; 2863 | + p->numa_faults[task_faults_idx(NUMA_CPUBUF, cpu_node, priv)] += pages; 2864 | + p->numa_faults_locality[local] += pages; 2865 | +} 2866 | + 2867 | +static void reset_ptenuma_scan(struct task_struct *p) 2868 | +{ 2869 | + /* 2870 | + * We only did a read acquisition of the mmap sem, so 2871 | + * p->mm->numa_scan_seq is written to without exclusive access 2872 | + * and the update is not guaranteed to be atomic. That's not 2873 | + * much of an issue though, since this is just used for 2874 | + * statistical sampling. Use READ_ONCE/WRITE_ONCE, which are not 2875 | + * expensive, to avoid any form of compiler optimizations: 2876 | + */ 2877 | + WRITE_ONCE(p->mm->numa_scan_seq, READ_ONCE(p->mm->numa_scan_seq) + 1); 2878 | + p->mm->numa_scan_offset = 0; 2879 | +} 2880 | + 2881 | +/* 2882 | + * The expensive part of numa migration is done from task_work context. 2883 | + * Triggered from task_tick_numa(). 
2884 | + */ 2885 | +static void task_numa_work(struct callback_head *work) 2886 | +{ 2887 | + unsigned long migrate, next_scan, now = jiffies; 2888 | + struct task_struct *p = current; 2889 | + struct mm_struct *mm = p->mm; 2890 | + u64 runtime = p->se.sum_exec_runtime; 2891 | + struct vm_area_struct *vma; 2892 | + unsigned long start, end; 2893 | + unsigned long nr_pte_updates = 0; 2894 | + long pages, virtpages; 2895 | + 2896 | + SCHED_WARN_ON(p != container_of(work, struct task_struct, numa_work)); 2897 | + 2898 | + work->next = work; 2899 | + /* 2900 | + * Who cares about NUMA placement when they're dying. 2901 | + * 2902 | + * NOTE: make sure not to dereference p->mm before this check, 2903 | + * exit_task_work() happens _after_ exit_mm() so we could be called 2904 | + * without p->mm even though we still had it when we enqueued this 2905 | + * work. 2906 | + */ 2907 | + if (p->flags & PF_EXITING) 2908 | + return; 2909 | + 2910 | + if (!mm->numa_next_scan) { 2911 | + mm->numa_next_scan = now + 2912 | + msecs_to_jiffies(sysctl_numa_balancing_scan_delay); 2913 | + } 2914 | + 2915 | + /* 2916 | + * Enforce maximal scan/migration frequency.. 2917 | + */ 2918 | + migrate = mm->numa_next_scan; 2919 | + if (time_before(now, migrate)) 2920 | + return; 2921 | + 2922 | + if (p->numa_scan_period == 0) { 2923 | + p->numa_scan_period_max = task_scan_max(p); 2924 | + p->numa_scan_period = task_scan_start(p); 2925 | + } 2926 | + 2927 | + next_scan = now + msecs_to_jiffies(p->numa_scan_period); 2928 | + if (cmpxchg(&mm->numa_next_scan, migrate, next_scan) != migrate) 2929 | + return; 2930 | + 2931 | + /* 2932 | + * Delay this task enough that another task of this mm will likely win 2933 | + * the next time around. 2934 | + */ 2935 | + p->node_stamp += 2 * TICK_NSEC; 2936 | + 2937 | + start = mm->numa_scan_offset; 2938 | + pages = sysctl_numa_balancing_scan_size; 2939 | + pages <<= 20 - PAGE_SHIFT; /* MB in pages */ 2940 | + virtpages = pages * 8; /* Scan up to this much virtual space */ 2941 | + if (!pages) 2942 | + return; 2943 | + 2944 | + 2945 | + if (!mmap_read_trylock(mm)) 2946 | + return; 2947 | + vma = find_vma(mm, start); 2948 | + if (!vma) { 2949 | + reset_ptenuma_scan(p); 2950 | + start = 0; 2951 | + vma = mm->mmap; 2952 | + } 2953 | + for (; vma; vma = vma->vm_next) { 2954 | + if (!vma_migratable(vma) || !vma_policy_mof(vma) || 2955 | + is_vm_hugetlb_page(vma) || (vma->vm_flags & VM_MIXEDMAP)) { 2956 | + continue; 2957 | + } 2958 | + 2959 | + /* 2960 | + * Shared library pages mapped by multiple processes are not 2961 | + * migrated as it is expected they are cache replicated. Avoid 2962 | + * hinting faults in read-only file-backed mappings or the vdso 2963 | + * as migrating the pages will be of marginal benefit. 2964 | + */ 2965 | + if (!vma->vm_mm || 2966 | + (vma->vm_file && (vma->vm_flags & (VM_READ|VM_WRITE)) == (VM_READ))) 2967 | + continue; 2968 | + 2969 | + /* 2970 | + * Skip inaccessible VMAs to avoid any confusion between 2971 | + * PROT_NONE and NUMA hinting ptes 2972 | + */ 2973 | + if (!vma_is_accessible(vma)) 2974 | + continue; 2975 | + 2976 | + do { 2977 | + start = max(start, vma->vm_start); 2978 | + end = ALIGN(start + (pages << PAGE_SHIFT), HPAGE_SIZE); 2979 | + end = min(end, vma->vm_end); 2980 | + nr_pte_updates = change_prot_numa(vma, start, end); 2981 | + 2982 | + /* 2983 | + * Try to scan sysctl_numa_balancing_size worth of 2984 | + * hpages that have at least one present PTE that 2985 | + * is not already pte-numa. 
If the VMA contains 2986 | + * areas that are unused or already full of prot_numa 2987 | + * PTEs, scan up to virtpages, to skip through those 2988 | + * areas faster. 2989 | + */ 2990 | + if (nr_pte_updates) 2991 | + pages -= (end - start) >> PAGE_SHIFT; 2992 | + virtpages -= (end - start) >> PAGE_SHIFT; 2993 | + 2994 | + start = end; 2995 | + if (pages <= 0 || virtpages <= 0) 2996 | + goto out; 2997 | + 2998 | + cond_resched(); 2999 | + } while (end != vma->vm_end); 3000 | + } 3001 | + 3002 | +out: 3003 | + /* 3004 | + * It is possible to reach the end of the VMA list but the last few 3005 | + * VMAs are not guaranteed to the vma_migratable. If they are not, we 3006 | + * would find the !migratable VMA on the next scan but not reset the 3007 | + * scanner to the start so check it now. 3008 | + */ 3009 | + if (vma) 3010 | + mm->numa_scan_offset = start; 3011 | + else 3012 | + reset_ptenuma_scan(p); 3013 | + mmap_read_unlock(mm); 3014 | + 3015 | + /* 3016 | + * Make sure tasks use at least 32x as much time to run other code 3017 | + * than they used here, to limit NUMA PTE scanning overhead to 3% max. 3018 | + * Usually update_task_scan_period slows down scanning enough; on an 3019 | + * overloaded system we need to limit overhead on a per task basis. 3020 | + */ 3021 | + if (unlikely(p->se.sum_exec_runtime != runtime)) { 3022 | + u64 diff = p->se.sum_exec_runtime - runtime; 3023 | + p->node_stamp += 32 * diff; 3024 | + } 3025 | +} 3026 | + 3027 | +void init_numa_balancing(unsigned long clone_flags, struct task_struct *p) 3028 | +{ 3029 | + int mm_users = 0; 3030 | + struct mm_struct *mm = p->mm; 3031 | + 3032 | + if (mm) { 3033 | + mm_users = atomic_read(&mm->mm_users); 3034 | + if (mm_users == 1) { 3035 | + mm->numa_next_scan = jiffies + msecs_to_jiffies(sysctl_numa_balancing_scan_delay); 3036 | + mm->numa_scan_seq = 0; 3037 | + } 3038 | + } 3039 | + p->node_stamp = 0; 3040 | + p->numa_scan_seq = mm ? mm->numa_scan_seq : 0; 3041 | + p->numa_scan_period = sysctl_numa_balancing_scan_delay; 3042 | + /* Protect against double add, see task_tick_numa and task_numa_work */ 3043 | + p->numa_work.next = &p->numa_work; 3044 | + p->numa_faults = NULL; 3045 | + RCU_INIT_POINTER(p->numa_group, NULL); 3046 | + p->last_task_numa_placement = 0; 3047 | + p->last_sum_exec_runtime = 0; 3048 | + 3049 | + init_task_work(&p->numa_work, task_numa_work); 3050 | + 3051 | + /* New address space, reset the preferred nid */ 3052 | + if (!(clone_flags & CLONE_VM)) { 3053 | + p->numa_preferred_nid = NUMA_NO_NODE; 3054 | + return; 3055 | + } 3056 | + 3057 | + /* 3058 | + * New thread, keep existing numa_preferred_nid which should be copied 3059 | + * already by arch_dup_task_struct but stagger when scans start. 3060 | + */ 3061 | + if (mm) { 3062 | + unsigned int delay; 3063 | + 3064 | + delay = min_t(unsigned int, task_scan_max(current), 3065 | + current->numa_scan_period * mm_users * NSEC_PER_MSEC); 3066 | + delay += 2 * TICK_NSEC; 3067 | + p->node_stamp = delay; 3068 | + } 3069 | +} 3070 | + 3071 | +static void task_tick_numa(struct rq *rq, struct task_struct *curr) 3072 | +{ 3073 | + struct callback_head *work = &curr->numa_work; 3074 | + u64 period, now; 3075 | + 3076 | + /* 3077 | + * We don't care about NUMA placement if we don't have memory. 
3078 | + */ 3079 | + if ((curr->flags & (PF_EXITING | PF_KTHREAD)) || work->next != work) 3080 | + return; 3081 | + 3082 | + /* 3083 | + * Using runtime rather than walltime has the dual advantage that 3084 | + * we (mostly) drive the selection from busy threads and that the 3085 | + * task needs to have done some actual work before we bother with 3086 | + * NUMA placement. 3087 | + */ 3088 | + now = curr->se.sum_exec_runtime; 3089 | + period = (u64)curr->numa_scan_period * NSEC_PER_MSEC; 3090 | + 3091 | + if (now > curr->node_stamp + period) { 3092 | + if (!curr->node_stamp) 3093 | + curr->numa_scan_period = task_scan_start(curr); 3094 | + curr->node_stamp += period; 3095 | + 3096 | + if (!time_before(jiffies, curr->mm->numa_next_scan)) 3097 | + task_work_add(curr, work, TWA_RESUME); 3098 | + } 3099 | +} 3100 | +#else 3101 | +static void account_numa_enqueue(struct rq *rq, struct task_struct *p) {} 3102 | +static inline void account_numa_dequeue(struct rq *rq, struct task_struct *p) {} 3103 | +static inline void update_scan_period(struct task_struct *p, int new_cpu) {} 3104 | +static void task_tick_numa(struct rq *rq, struct task_struct *curr) {} 3105 | +#endif /** CONFIG_NUMA_BALANCING */ 3106 | + 3107 | diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h 3108 | index d53d19770866..9ac800e858fc 100644 3109 | --- a/kernel/sched/sched.h 3110 | +++ b/kernel/sched/sched.h 3111 | @@ -545,9 +545,13 @@ struct cfs_rq { 3112 | * It is set to NULL otherwise (i.e when none are currently running). 3113 | */ 3114 | struct sched_entity *curr; 3115 | +#ifdef CONFIG_BS_SCHED 3116 | + struct bs_node *head; 3117 | +#else 3118 | struct sched_entity *next; 3119 | struct sched_entity *last; 3120 | struct sched_entity *skip; 3121 | +#endif /** CONFIG_BS_SCHED */ 3122 | 3123 | #ifdef CONFIG_SCHED_DEBUG 3124 | unsigned int nr_spread_over; 3125 | diff --git a/kernel/time/Kconfig b/kernel/time/Kconfig 3126 | index 04bfd62f5e5c..fac3f528b432 100644 3127 | --- a/kernel/time/Kconfig 3128 | +++ b/kernel/time/Kconfig 3129 | @@ -88,7 +88,7 @@ config NO_HZ_COMMON 3130 | 3131 | choice 3132 | prompt "Timer tick handling" 3133 | - default NO_HZ_IDLE if NO_HZ 3134 | + default HZ_PERIODIC if BS_SCHED 3135 | 3136 | config HZ_PERIODIC 3137 | bool "Periodic timer ticks (constant rate, no dynticks)" 3138 | @@ -98,6 +98,7 @@ config HZ_PERIODIC 3139 | 3140 | config NO_HZ_IDLE 3141 | bool "Idle dynticks system (tickless idle)" 3142 | + depends on !BS_SCHED 3143 | select NO_HZ_COMMON 3144 | help 3145 | This option enables a tickless idle system: timer interrupts 3146 | @@ -112,6 +113,7 @@ config NO_HZ_FULL 3147 | # We need at least one periodic CPU for timekeeping 3148 | depends on SMP 3149 | depends on HAVE_CONTEXT_TRACKING 3150 | + depends on !BS_SCHED 3151 | # VIRT_CPU_ACCOUNTING_GEN dependency 3152 | depends on HAVE_VIRT_CPU_ACCOUNTING_GEN 3153 | select NO_HZ_COMMON 3154 | @@ -168,6 +170,8 @@ config CONTEXT_TRACKING_FORCE 3155 | 3156 | config NO_HZ 3157 | bool "Old Idle dynticks config" 3158 | + default n 3159 | + depends on !BS_SCHED 3160 | help 3161 | This is the old config entry that enables dynticks idle. 
3162 | We keep it around for a little while to enforce backward 3163 | diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug 3164 | index ffd22e499997..c10894554e7b 100644 3165 | --- a/lib/Kconfig.debug 3166 | +++ b/lib/Kconfig.debug 3167 | @@ -583,6 +583,7 @@ endmenu 3168 | 3169 | config DEBUG_KERNEL 3170 | bool "Kernel debugging" 3171 | + depends on !BS_SCHED 3172 | help 3173 | Say Y here if you are developing drivers or trying to debug and 3174 | identify kernel problems. 3175 | -------------------------------------------------------------------------------- /baby-rr-5.14.patch: -------------------------------------------------------------------------------- 1 | diff --git a/include/linux/sched.h b/include/linux/sched.h 2 | index f6935787e7e8..8d2f939d40d4 100644 3 | --- a/include/linux/sched.h 4 | +++ b/include/linux/sched.h 5 | @@ -462,6 +462,13 @@ struct sched_statistics { 6 | #endif 7 | }; 8 | 9 | +#ifdef CONFIG_BS_SCHED 10 | +struct bs_node { 11 | + struct bs_node* next; 12 | + struct bs_node* prev; 13 | +}; 14 | +#endif 15 | + 16 | struct sched_entity { 17 | /* For load-balancing: */ 18 | struct load_weight load; 19 | @@ -469,6 +476,10 @@ struct sched_entity { 20 | struct list_head group_node; 21 | unsigned int on_rq; 22 | 23 | +#ifdef CONFIG_BS_SCHED 24 | + struct bs_node bs_node; 25 | +#endif 26 | + 27 | u64 exec_start; 28 | u64 sum_exec_runtime; 29 | u64 vruntime; 30 | diff --git a/init/Kconfig b/init/Kconfig 31 | index 55f9f7738ebb..a8b1cee206dd 100644 32 | --- a/init/Kconfig 33 | +++ b/init/Kconfig 34 | @@ -105,6 +105,14 @@ config THREAD_INFO_IN_TASK 35 | One subtle change that will be needed is to use try_get_task_stack() 36 | and put_task_stack() in save_thread_stack_tsk() and get_wchan(). 37 | 38 | +config BS_SCHED 39 | + bool "Baby Scheduler" 40 | + default y 41 | + help 42 | + It is a Round Robin (RR) version of Baby Scheduler. It is even simpler 43 | + than normal Baby Scheduler where no usage of vruntime and no tasks 44 | + priority considerations. It is just a basic RR. 
45 | + 46 | menu "General setup" 47 | 48 | config BROKEN 49 | diff --git a/kernel/sched/Makefile b/kernel/sched/Makefile 50 | index 978fcfca5871..464b134de739 100644 51 | --- a/kernel/sched/Makefile 52 | +++ b/kernel/sched/Makefile 53 | @@ -23,7 +23,7 @@ CFLAGS_core.o := $(PROFILING) -fno-omit-frame-pointer 54 | endif 55 | 56 | obj-y += core.o loadavg.o clock.o cputime.o 57 | -obj-y += idle.o fair.o rt.o deadline.o 58 | +obj-y += idle.o bs.o rt.o deadline.o 59 | obj-y += wait.o wait_bit.o swait.o completion.o 60 | 61 | obj-$(CONFIG_SMP) += cpupri.o cpudeadline.o topology.o stop_task.o pelt.o 62 | diff --git a/kernel/sched/bs.c b/kernel/sched/bs.c 63 | new file mode 100644 64 | index 000000000000..402fcc80518f 65 | --- /dev/null 66 | +++ b/kernel/sched/bs.c 67 | @@ -0,0 +1,696 @@ 68 | +// SPDX-License-Identifier: GPL-2.0 69 | +/* 70 | + * Baby Scheduler (BS) Class (SCHED_NORMAL/SCHED_BATCH) 71 | + * 72 | + * Copyright (C) 2021, Hamad Al Marri 73 | + */ 74 | +#include "sched.h" 75 | +#include "pelt.h" 76 | +#include "fair_numa.h" 77 | +#include "bs.h" 78 | + 79 | +static void update_curr(struct cfs_rq *cfs_rq) 80 | +{ 81 | + struct sched_entity *curr = cfs_rq->curr; 82 | + u64 now = rq_clock_task(rq_of(cfs_rq)); 83 | + u64 delta_exec; 84 | + 85 | + if (unlikely(!curr)) 86 | + return; 87 | + 88 | + delta_exec = now - curr->exec_start; 89 | + if (unlikely((s64)delta_exec <= 0)) 90 | + return; 91 | + 92 | + curr->exec_start = now; 93 | + curr->sum_exec_runtime += delta_exec; 94 | +} 95 | + 96 | +static void update_curr_fair(struct rq *rq) 97 | +{ 98 | + update_curr(cfs_rq_of(&rq->curr->se)); 99 | +} 100 | + 101 | +static void __enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se) 102 | +{ 103 | + struct bs_node *bsn = &se->bs_node; 104 | + struct bs_node *prev; 105 | + 106 | + bsn->next = bsn->prev = NULL; 107 | + 108 | + // if empty 109 | + if (!cfs_rq->head) { 110 | + cfs_rq->head = bsn; 111 | + cfs_rq->cursor = bsn; 112 | + } 113 | + // if cursor == head 114 | + else if (cfs_rq->cursor == cfs_rq->head) { 115 | + bsn->next = cfs_rq->head; 116 | + cfs_rq->head->prev = bsn; 117 | + cfs_rq->head = bsn; 118 | + } 119 | + // if cursor != head 120 | + else { 121 | + prev = cfs_rq->cursor->prev; 122 | + 123 | + bsn->next = cfs_rq->cursor; 124 | + cfs_rq->cursor->prev = bsn; 125 | + 126 | + prev->next = bsn; 127 | + bsn->prev = prev; 128 | + } 129 | +} 130 | + 131 | +static inline void rotate_cursor(struct cfs_rq *cfs_rq) 132 | +{ 133 | + cfs_rq->cursor = cfs_rq->cursor->next; 134 | + 135 | + if (!cfs_rq->cursor) 136 | + cfs_rq->cursor = cfs_rq->head; 137 | +} 138 | + 139 | +static void __dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se) 140 | +{ 141 | + struct bs_node *bsn = &se->bs_node; 142 | + struct bs_node *prev, *next; 143 | + 144 | + // if only one se in rq 145 | + if (cfs_rq->head->next == NULL) { 146 | + cfs_rq->head = NULL; 147 | + cfs_rq->cursor = NULL; 148 | + } else if (bsn == cfs_rq->head) { 149 | + // if it is the head 150 | + cfs_rq->head = cfs_rq->head->next; 151 | + cfs_rq->head->prev = NULL; 152 | + 153 | + if (bsn == cfs_rq->cursor) 154 | + rotate_cursor(cfs_rq); 155 | + } else { 156 | + // if in the middle 157 | + if (bsn == cfs_rq->cursor) 158 | + rotate_cursor(cfs_rq); 159 | + 160 | + prev = bsn->prev; 161 | + next = bsn->next; 162 | + 163 | + prev->next = next; 164 | + if (next) 165 | + next->prev = prev; 166 | + } 167 | +} 168 | + 169 | +static void 170 | +enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags) 171 | +{ 172 | + bool 
curr = cfs_rq->curr == se; 173 | + 174 | + update_curr(cfs_rq); 175 | + account_entity_enqueue(cfs_rq, se); 176 | + 177 | + if (!curr) 178 | + __enqueue_entity(cfs_rq, se); 179 | + 180 | + se->on_rq = 1; 181 | +} 182 | + 183 | +static void 184 | +dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags) 185 | +{ 186 | + update_curr(cfs_rq); 187 | + 188 | + if (se != cfs_rq->curr) 189 | + __dequeue_entity(cfs_rq, se); 190 | + 191 | + se->on_rq = 0; 192 | + account_entity_dequeue(cfs_rq, se); 193 | +} 194 | + 195 | +static void 196 | +enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags) 197 | +{ 198 | + struct sched_entity *se = &p->se; 199 | + struct cfs_rq *cfs_rq = cfs_rq_of(se); 200 | + int idle_h_nr_running = task_has_idle_policy(p); 201 | + 202 | + if (!se->on_rq) { 203 | + enqueue_entity(cfs_rq, se, flags); 204 | + cfs_rq->h_nr_running++; 205 | + cfs_rq->idle_h_nr_running += idle_h_nr_running; 206 | + } 207 | + 208 | + add_nr_running(rq, 1); 209 | +} 210 | + 211 | +static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags) 212 | +{ 213 | + struct sched_entity *se = &p->se; 214 | + struct cfs_rq *cfs_rq = cfs_rq_of(se); 215 | + int idle_h_nr_running = task_has_idle_policy(p); 216 | + 217 | + dequeue_entity(cfs_rq, se, flags); 218 | + 219 | + cfs_rq->h_nr_running--; 220 | + cfs_rq->idle_h_nr_running -= idle_h_nr_running; 221 | + 222 | + sub_nr_running(rq, 1); 223 | +} 224 | + 225 | +static void yield_task_fair(struct rq *rq) 226 | +{ 227 | + struct task_struct *curr = rq->curr; 228 | + struct cfs_rq *cfs_rq = task_cfs_rq(curr); 229 | + 230 | + /* 231 | + * Are we the only task in the tree? 232 | + */ 233 | + if (unlikely(rq->nr_running == 1)) 234 | + return; 235 | + 236 | + if (curr->policy != SCHED_BATCH) { 237 | + update_rq_clock(rq); 238 | + /* 239 | + * Update run-time statistics of the 'current'. 240 | + */ 241 | + update_curr(cfs_rq); 242 | + /* 243 | + * Tell update_rq_clock() that we've just updated, 244 | + * so we don't do microscopic update in schedule() 245 | + * and double the fastpath cost. 246 | + */ 247 | + rq_clock_skip_update(rq); 248 | + } 249 | +} 250 | + 251 | +static bool yield_to_task_fair(struct rq *rq, struct task_struct *p) 252 | +{ 253 | + yield_task_fair(rq); 254 | + return true; 255 | +} 256 | + 257 | +static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_flags) 258 | +{ 259 | + struct task_struct *curr = rq->curr; 260 | + struct sched_entity *se = &curr->se, *pse = &p->se; 261 | + 262 | + if (unlikely(se == pse)) 263 | + return; 264 | + 265 | + if (test_tsk_need_resched(curr)) 266 | + return; 267 | + 268 | + /* Idle tasks are by definition preempted by non-idle tasks. 
*/ 269 | + if (unlikely(task_has_idle_policy(curr)) && 270 | + likely(!task_has_idle_policy(p))) 271 | + goto preempt; 272 | + 273 | + /* 274 | + * Batch and idle tasks do not preempt non-idle tasks (their preemption 275 | + * is driven by the tick): 276 | + */ 277 | + if (unlikely(p->policy != SCHED_NORMAL) || !sched_feat(WAKEUP_PREEMPTION)) 278 | + return; 279 | + 280 | + update_curr(cfs_rq_of(se)); 281 | + 282 | +preempt: 283 | + resched_curr(rq); 284 | +} 285 | + 286 | +static void 287 | +set_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *se) 288 | +{ 289 | + if (se->on_rq) 290 | + __dequeue_entity(cfs_rq, se); 291 | + 292 | + se->exec_start = rq_clock_task(rq_of(cfs_rq)); 293 | + cfs_rq->curr = se; 294 | + se->prev_sum_exec_runtime = se->sum_exec_runtime; 295 | +} 296 | + 297 | +static struct sched_entity * 298 | +pick_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr) 299 | +{ 300 | + struct bs_node *bsn = cfs_rq->cursor; 301 | + 302 | + if (!bsn) 303 | + bsn = cfs_rq->cursor = cfs_rq->head; 304 | + 305 | + // update cursor 306 | + rotate_cursor(cfs_rq); 307 | + 308 | + return se_of(bsn); 309 | +} 310 | + 311 | +struct task_struct * 312 | +pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf) 313 | +{ 314 | + struct cfs_rq *cfs_rq = &rq->cfs; 315 | + struct sched_entity *se; 316 | + struct task_struct *p; 317 | + int new_tasks; 318 | + 319 | +again: 320 | + if (!sched_fair_runnable(rq)) 321 | + goto idle; 322 | + 323 | + if (prev) 324 | + put_prev_task(rq, prev); 325 | + 326 | + se = pick_next_entity(cfs_rq, NULL); 327 | + set_next_entity(cfs_rq, se); 328 | + 329 | + p = task_of(se); 330 | + 331 | +done: __maybe_unused; 332 | +#ifdef CONFIG_SMP 333 | + /* 334 | + * Move the next running task to the front of 335 | + * the list, so our cfs_tasks list becomes MRU 336 | + * one. 337 | + */ 338 | + list_move(&p->se.group_node, &rq->cfs_tasks); 339 | +#endif 340 | + 341 | + return p; 342 | + 343 | +idle: 344 | + if (!rf) 345 | + return NULL; 346 | + 347 | + new_tasks = newidle_balance(rq, rf); 348 | + 349 | + /* 350 | + * Because newidle_balance() releases (and re-acquires) rq->lock, it is 351 | + * possible for any higher priority task to appear. In that case we 352 | + * must re-start the pick_next_entity() loop. 
353 | + */ 354 | + if (new_tasks < 0) 355 | + return RETRY_TASK; 356 | + 357 | + if (new_tasks > 0) 358 | + goto again; 359 | + 360 | + /* 361 | + * rq is about to be idle, check if we need to update the 362 | + * lost_idle_time of clock_pelt 363 | + */ 364 | + update_idle_rq_clock_pelt(rq); 365 | + 366 | + return NULL; 367 | +} 368 | + 369 | +static struct task_struct *__pick_next_task_fair(struct rq *rq) 370 | +{ 371 | + return pick_next_task_fair(rq, NULL, NULL); 372 | +} 373 | + 374 | +#ifdef CONFIG_SMP 375 | +static struct task_struct *pick_task_fair(struct rq *rq) 376 | +{ 377 | + struct sched_entity *se; 378 | + struct cfs_rq *cfs_rq = &rq->cfs; 379 | + struct sched_entity *curr = cfs_rq->curr; 380 | + 381 | + if (!cfs_rq->nr_running) 382 | + return NULL; 383 | + 384 | + /* When we pick for a remote RQ, we'll not have done put_prev_entity() */ 385 | + if (curr) { 386 | + if (curr->on_rq) 387 | + update_curr(cfs_rq); 388 | + else 389 | + curr = NULL; 390 | + } 391 | + 392 | + se = pick_next_entity(cfs_rq, curr); 393 | + 394 | + return task_of(se); 395 | +} 396 | +#endif 397 | + 398 | +static void put_prev_entity(struct cfs_rq *cfs_rq, struct sched_entity *prev) 399 | +{ 400 | + /* 401 | + * If still on the runqueue then deactivate_task() 402 | + * was not called and update_curr() has to be done: 403 | + */ 404 | + if (prev->on_rq) 405 | + update_curr(cfs_rq); 406 | + 407 | + if (prev->on_rq) 408 | + __enqueue_entity(cfs_rq, prev); 409 | + 410 | + cfs_rq->curr = NULL; 411 | +} 412 | + 413 | +static void put_prev_task_fair(struct rq *rq, struct task_struct *prev) 414 | +{ 415 | + struct sched_entity *se = &prev->se; 416 | + 417 | + put_prev_entity(cfs_rq_of(se), se); 418 | +} 419 | + 420 | +static void set_next_task_fair(struct rq *rq, struct task_struct *p, bool first) 421 | +{ 422 | + struct sched_entity *se = &p->se; 423 | + struct cfs_rq *cfs_rq = cfs_rq_of(se); 424 | + 425 | +#ifdef CONFIG_SMP 426 | + if (task_on_rq_queued(p)) { 427 | + /* 428 | + * Move the next running task to the front of the list, so our 429 | + * cfs_tasks list becomes MRU one. 
430 | + */ 431 | + list_move(&se->group_node, &rq->cfs_tasks); 432 | + } 433 | +#endif 434 | + 435 | + set_next_entity(cfs_rq, se); 436 | +} 437 | + 438 | +static void 439 | +check_preempt_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr) 440 | +{ 441 | + resched_curr(rq_of(cfs_rq)); 442 | +} 443 | + 444 | +static void 445 | +entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr, int queued) 446 | +{ 447 | + update_curr(cfs_rq); 448 | + 449 | + if (cfs_rq->nr_running > 1) 450 | + check_preempt_tick(cfs_rq, curr); 451 | +} 452 | + 453 | +#ifdef CONFIG_SMP 454 | +static int 455 | +balance_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf) 456 | +{ 457 | + if (rq->nr_running) 458 | + return 1; 459 | + 460 | + return newidle_balance(rq, rf) != 0; 461 | +} 462 | + 463 | +static int 464 | +select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags) 465 | +{ 466 | + struct rq *rq = cpu_rq(prev_cpu); 467 | + unsigned int min_this = rq->nr_running; 468 | + unsigned int min = rq->nr_running; 469 | + int cpu, new_cpu = prev_cpu; 470 | + 471 | + for_each_online_cpu(cpu) { 472 | + if (cpu_rq(cpu)->nr_running < min) { 473 | + new_cpu = cpu; 474 | + min = cpu_rq(cpu)->nr_running; 475 | + } 476 | + } 477 | + 478 | + if (min == min_this) 479 | + return prev_cpu; 480 | + 481 | + return new_cpu; 482 | +} 483 | + 484 | +static int 485 | +can_migrate_task(struct task_struct *p, int dst_cpu, struct rq *src_rq) 486 | +{ 487 | + if (task_running(src_rq, p)) 488 | + return 0; 489 | + 490 | + /* Disregard pcpu kthreads; they are where they need to be. */ 491 | + if (kthread_is_per_cpu(p)) 492 | + return 0; 493 | + 494 | + if (!cpumask_test_cpu(dst_cpu, p->cpus_ptr)) 495 | + return 0; 496 | + 497 | + return 1; 498 | +} 499 | + 500 | +static void pull_from(struct rq *this_rq, 501 | + struct rq *src_rq, 502 | + struct rq_flags *src_rf, 503 | + struct task_struct *p) 504 | +{ 505 | + struct rq_flags rf; 506 | + 507 | + // detach task 508 | + deactivate_task(src_rq, p, DEQUEUE_NOCLOCK); 509 | + set_task_cpu(p, cpu_of(this_rq)); 510 | + 511 | + // unlock src rq 512 | + rq_unlock(src_rq, src_rf); 513 | + 514 | + // lock this rq 515 | + rq_lock(this_rq, &rf); 516 | + update_rq_clock(this_rq); 517 | + 518 | + activate_task(this_rq, p, ENQUEUE_NOCLOCK); 519 | + check_preempt_curr(this_rq, p, 0); 520 | + 521 | + // unlock this rq 522 | + rq_unlock(this_rq, &rf); 523 | + 524 | + local_irq_restore(src_rf->flags); 525 | +} 526 | + 527 | +static int move_task(struct rq *this_rq, struct rq *src_rq, 528 | + struct rq_flags *src_rf) 529 | +{ 530 | + struct cfs_rq *src_cfs_rq = &src_rq->cfs; 531 | + struct task_struct *p; 532 | + struct bs_node *bsn = src_cfs_rq->head; 533 | + int moved = 0; 534 | + 535 | + while (bsn) { 536 | + p = task_of(se_of(bsn)); 537 | + if (can_migrate_task(p, cpu_of(this_rq), src_rq)) { 538 | + pull_from(this_rq, src_rq, src_rf, p); 539 | + moved = 1; 540 | + break; 541 | + } 542 | + 543 | + bsn = bsn->next; 544 | + } 545 | + 546 | + if (!moved) { 547 | + rq_unlock(src_rq, src_rf); 548 | + local_irq_restore(src_rf->flags); 549 | + } 550 | + 551 | + return moved; 552 | +} 553 | + 554 | +static int newidle_balance(struct rq *this_rq, struct rq_flags *rf) 555 | +{ 556 | + int this_cpu = this_rq->cpu; 557 | + struct rq *src_rq; 558 | + int src_cpu = -1, cpu; 559 | + int pulled_task = 0; 560 | + unsigned int max = 0; 561 | + struct rq_flags src_rf; 562 | + 563 | + /* 564 | + * We must set idle_stamp _before_ calling idle_balance(), such that we 565 | + * measure the 
duration of idle_balance() as idle time. 566 | + */ 567 | + this_rq->idle_stamp = rq_clock(this_rq); 568 | + 569 | + /* 570 | + * Do not pull tasks towards !active CPUs... 571 | + */ 572 | + if (!cpu_active(this_cpu)) 573 | + return 0; 574 | + 575 | + rq_unpin_lock(this_rq, rf); 576 | + raw_spin_unlock(&this_rq->__lock); 577 | + 578 | + for_each_online_cpu(cpu) { 579 | + /* 580 | + * Stop searching for tasks to pull if there are 581 | + * now runnable tasks on this rq. 582 | + */ 583 | + if (this_rq->nr_running > 0) 584 | + goto out; 585 | + 586 | + if (cpu == this_cpu) 587 | + continue; 588 | + 589 | + src_rq = cpu_rq(cpu); 590 | + 591 | + if (src_rq->nr_running < 2) 592 | + continue; 593 | + 594 | + if (src_rq->nr_running > max) { 595 | + max = src_rq->nr_running; 596 | + src_cpu = cpu; 597 | + } 598 | + } 599 | + 600 | + if (src_cpu != -1) { 601 | + src_rq = cpu_rq(src_cpu); 602 | + 603 | + rq_lock_irqsave(src_rq, &src_rf); 604 | + update_rq_clock(src_rq); 605 | + 606 | + if (src_rq->nr_running < 2) { 607 | + rq_unlock(src_rq, &src_rf); 608 | + local_irq_restore(src_rf.flags); 609 | + } else { 610 | + pulled_task = move_task(this_rq, src_rq, &src_rf); 611 | + } 612 | + } 613 | + 614 | +out: 615 | + raw_spin_lock(&this_rq->__lock); 616 | + 617 | + /* 618 | + * While browsing the domains, we released the rq lock, a task could 619 | + * have been enqueued in the meantime. Since we're not going idle, 620 | + * pretend we pulled a task. 621 | + */ 622 | + if (this_rq->cfs.h_nr_running && !pulled_task) 623 | + pulled_task = 1; 624 | + 625 | + /* Is there a task of a high priority class? */ 626 | + if (this_rq->nr_running != this_rq->cfs.h_nr_running) 627 | + pulled_task = -1; 628 | + 629 | + if (pulled_task) 630 | + this_rq->idle_stamp = 0; 631 | + 632 | + rq_repin_lock(this_rq, rf); 633 | + 634 | + return pulled_task; 635 | +} 636 | + 637 | +static inline int on_null_domain(struct rq *rq) 638 | +{ 639 | + return unlikely(!rcu_dereference_sched(rq->sd)); 640 | +} 641 | + 642 | +void trigger_load_balance(struct rq *this_rq) 643 | +{ 644 | + int this_cpu = cpu_of(this_rq); 645 | + int cpu; 646 | + unsigned int max, min; 647 | + struct rq *max_rq, *min_rq, *c_rq; 648 | + struct rq_flags src_rf; 649 | + 650 | + if (this_cpu != 0) 651 | + return; 652 | + 653 | + max = min = this_rq->nr_running; 654 | + max_rq = min_rq = this_rq; 655 | + 656 | + for_each_online_cpu(cpu) { 657 | + c_rq = cpu_rq(cpu); 658 | + 659 | + /* 660 | + * Don't need to rebalance while attached to NULL domain or 661 | + * runqueue CPU is not active 662 | + */ 663 | + if (unlikely(on_null_domain(c_rq) || !cpu_active(cpu))) 664 | + continue; 665 | + 666 | + if (c_rq->nr_running < min) { 667 | + min = c_rq->nr_running; 668 | + min_rq = c_rq; 669 | + } 670 | + 671 | + if (c_rq->nr_running > max) { 672 | + max = c_rq->nr_running; 673 | + max_rq = c_rq; 674 | + } 675 | + } 676 | + 677 | + if (min_rq == max_rq || max - min < 2) 678 | + return; 679 | + 680 | + rq_lock_irqsave(max_rq, &src_rf); 681 | + update_rq_clock(max_rq); 682 | + 683 | + if (max_rq->nr_running < 2) { 684 | + rq_unlock(max_rq, &src_rf); 685 | + local_irq_restore(src_rf.flags); 686 | + return; 687 | + } 688 | + 689 | + move_task(min_rq, max_rq, &src_rf); 690 | +} 691 | + 692 | +void update_group_capacity(struct sched_domain *sd, int cpu) {} 693 | +#endif /* CONFIG_SMP */ 694 | + 695 | +static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued) 696 | +{ 697 | + struct sched_entity *se = &curr->se; 698 | + struct cfs_rq *cfs_rq = cfs_rq_of(se); 
699 | + 700 | + entity_tick(cfs_rq, se, queued); 701 | + 702 | + if (static_branch_unlikely(&sched_numa_balancing)) 703 | + task_tick_numa(rq, curr); 704 | +} 705 | + 706 | +static void task_fork_fair(struct task_struct *p) 707 | +{ 708 | + struct cfs_rq *cfs_rq; 709 | + struct sched_entity *curr; 710 | + struct rq *rq = this_rq(); 711 | + struct rq_flags rf; 712 | + 713 | + rq_lock(rq, &rf); 714 | + update_rq_clock(rq); 715 | + 716 | + cfs_rq = task_cfs_rq(current); 717 | + curr = cfs_rq->curr; 718 | + if (curr) 719 | + update_curr(cfs_rq); 720 | + 721 | + rq_unlock(rq, &rf); 722 | +} 723 | + 724 | +/* 725 | + * All the scheduling class methods: 726 | + */ 727 | +DEFINE_SCHED_CLASS(fair) = { 728 | + 729 | + .enqueue_task = enqueue_task_fair, 730 | + .dequeue_task = dequeue_task_fair, 731 | + .yield_task = yield_task_fair, 732 | + .yield_to_task = yield_to_task_fair, 733 | + 734 | + .check_preempt_curr = check_preempt_wakeup, 735 | + 736 | + .pick_next_task = __pick_next_task_fair, 737 | + .put_prev_task = put_prev_task_fair, 738 | + .set_next_task = set_next_task_fair, 739 | + 740 | +#ifdef CONFIG_SMP 741 | + .balance = balance_fair, 742 | + .pick_task = pick_task_fair, 743 | + .select_task_rq = select_task_rq_fair, 744 | + .migrate_task_rq = migrate_task_rq_fair, 745 | + 746 | + .rq_online = rq_online_fair, 747 | + .rq_offline = rq_offline_fair, 748 | + 749 | + .task_dead = task_dead_fair, 750 | + .set_cpus_allowed = set_cpus_allowed_common, 751 | +#endif 752 | + 753 | + .task_tick = task_tick_fair, 754 | + .task_fork = task_fork_fair, 755 | + 756 | + .prio_changed = prio_changed_fair, 757 | + .switched_from = switched_from_fair, 758 | + .switched_to = switched_to_fair, 759 | + 760 | + .get_rr_interval = get_rr_interval_fair, 761 | + 762 | + .update_curr = update_curr_fair, 763 | +}; 764 | diff --git a/kernel/sched/bs.h b/kernel/sched/bs.h 765 | new file mode 100644 766 | index 000000000000..f5495a29cb57 767 | --- /dev/null 768 | +++ b/kernel/sched/bs.h 769 | @@ -0,0 +1,146 @@ 770 | +/* 771 | + * After fork, child runs first. If set to 0 (default) then 772 | + * parent will (try to) run first. 
773 | + */ 774 | +unsigned int sysctl_sched_child_runs_first __read_mostly; 775 | + 776 | +const_debug unsigned int sysctl_sched_migration_cost = 500000UL; 777 | + 778 | +void __init sched_init_granularity(void) {} 779 | + 780 | +#ifdef CONFIG_SMP 781 | +/* Give new sched_entity start runnable values to heavy its load in infant time */ 782 | +void init_entity_runnable_average(struct sched_entity *se) {} 783 | +void post_init_entity_util_avg(struct task_struct *p) {} 784 | +void update_max_interval(void) {} 785 | +static int newidle_balance(struct rq *this_rq, struct rq_flags *rf); 786 | +#endif /** CONFIG_SMP */ 787 | + 788 | +void init_cfs_rq(struct cfs_rq *cfs_rq) 789 | +{ 790 | + cfs_rq->tasks_timeline = RB_ROOT_CACHED; 791 | +#ifdef CONFIG_SMP 792 | + raw_spin_lock_init(&cfs_rq->removed.lock); 793 | +#endif 794 | +} 795 | + 796 | +__init void init_sched_fair_class(void) {} 797 | + 798 | +void reweight_task(struct task_struct *p, int prio) {} 799 | + 800 | +static inline struct sched_entity *se_of(struct bs_node *bsn) 801 | +{ 802 | + return container_of(bsn, struct sched_entity, bs_node); 803 | +} 804 | + 805 | +#ifdef CONFIG_SCHED_SMT 806 | +DEFINE_STATIC_KEY_FALSE(sched_smt_present); 807 | +EXPORT_SYMBOL_GPL(sched_smt_present); 808 | + 809 | +static inline void set_idle_cores(int cpu, int val) 810 | +{ 811 | + struct sched_domain_shared *sds; 812 | + 813 | + sds = rcu_dereference(per_cpu(sd_llc_shared, cpu)); 814 | + if (sds) 815 | + WRITE_ONCE(sds->has_idle_cores, val); 816 | +} 817 | + 818 | +static inline bool test_idle_cores(int cpu, bool def) 819 | +{ 820 | + struct sched_domain_shared *sds; 821 | + 822 | + sds = rcu_dereference(per_cpu(sd_llc_shared, cpu)); 823 | + if (sds) 824 | + return READ_ONCE(sds->has_idle_cores); 825 | + 826 | + return def; 827 | +} 828 | + 829 | +void __update_idle_core(struct rq *rq) 830 | +{ 831 | + int core = cpu_of(rq); 832 | + int cpu; 833 | + 834 | + rcu_read_lock(); 835 | + if (test_idle_cores(core, true)) 836 | + goto unlock; 837 | + 838 | + for_each_cpu(cpu, cpu_smt_mask(core)) { 839 | + if (cpu == core) 840 | + continue; 841 | + 842 | + if (!available_idle_cpu(cpu)) 843 | + goto unlock; 844 | + } 845 | + 846 | + set_idle_cores(core, 1); 847 | +unlock: 848 | + rcu_read_unlock(); 849 | +} 850 | +#endif 851 | + 852 | +static void 853 | +account_entity_enqueue(struct cfs_rq *cfs_rq, struct sched_entity *se) 854 | +{ 855 | +#ifdef CONFIG_SMP 856 | + if (entity_is_task(se)) { 857 | + struct rq *rq = rq_of(cfs_rq); 858 | + 859 | + account_numa_enqueue(rq, task_of(se)); 860 | + list_add(&se->group_node, &rq->cfs_tasks); 861 | + } 862 | +#endif 863 | + cfs_rq->nr_running++; 864 | +} 865 | + 866 | +static void 867 | +account_entity_dequeue(struct cfs_rq *cfs_rq, struct sched_entity *se) 868 | +{ 869 | +#ifdef CONFIG_SMP 870 | + if (entity_is_task(se)) { 871 | + account_numa_dequeue(rq_of(cfs_rq), task_of(se)); 872 | + list_del_init(&se->group_node); 873 | + } 874 | +#endif 875 | + cfs_rq->nr_running--; 876 | +} 877 | + 878 | +static void migrate_task_rq_fair(struct task_struct *p, int new_cpu) 879 | +{ 880 | + update_scan_period(p, new_cpu); 881 | +} 882 | + 883 | +static void rq_online_fair(struct rq *rq) {} 884 | +static void rq_offline_fair(struct rq *rq) {} 885 | +static void task_dead_fair(struct task_struct *p) 886 | +{ 887 | + struct cfs_rq *cfs_rq = cfs_rq_of(&p->se); 888 | + unsigned long flags; 889 | + 890 | + raw_spin_lock_irqsave(&cfs_rq->removed.lock, flags); 891 | + ++cfs_rq->removed.nr; 892 | + 
raw_spin_unlock_irqrestore(&cfs_rq->removed.lock, flags); 893 | +} 894 | + 895 | +static void 896 | +prio_changed_fair(struct rq *rq, struct task_struct *p, int oldprio) {} 897 | + 898 | +static void switched_from_fair(struct rq *rq, struct task_struct *p) {} 899 | + 900 | +static void switched_to_fair(struct rq *rq, struct task_struct *p) 901 | +{ 902 | + if (task_on_rq_queued(p)) { 903 | + /* 904 | + * We were most likely switched from sched_rt, so 905 | + * kick off the schedule if running, otherwise just see 906 | + * if we can still preempt the current task. 907 | + */ 908 | + resched_curr(rq); 909 | + } 910 | +} 911 | + 912 | +static unsigned int get_rr_interval_fair(struct rq *rq, struct task_struct *task) 913 | +{ 914 | + return 0; 915 | +} 916 | diff --git a/kernel/sched/core.c b/kernel/sched/core.c 917 | index 399c37c95392..9a3adde1b831 100644 918 | --- a/kernel/sched/core.c 919 | +++ b/kernel/sched/core.c 920 | @@ -8995,6 +8995,10 @@ void __init sched_init(void) 921 | 922 | wait_bit_init(); 923 | 924 | +#ifdef CONFIG_BS_SCHED 925 | + printk(KERN_INFO "Baby CPU scheduler (rr) v5.14 by Hamad Al Marri."); 926 | +#endif 927 | + 928 | #ifdef CONFIG_FAIR_GROUP_SCHED 929 | ptr += 2 * nr_cpu_ids * sizeof(void **); 930 | #endif 931 | diff --git a/kernel/sched/fair_numa.h b/kernel/sched/fair_numa.h 932 | new file mode 100644 933 | index 000000000000..a564478ec3f7 934 | --- /dev/null 935 | +++ b/kernel/sched/fair_numa.h 936 | @@ -0,0 +1,1968 @@ 937 | + 938 | +#ifdef CONFIG_NUMA_BALANCING 939 | + 940 | +unsigned int sysctl_numa_balancing_scan_period_min = 1000; 941 | +unsigned int sysctl_numa_balancing_scan_period_max = 60000; 942 | + 943 | +/* Portion of address space to scan in MB */ 944 | +unsigned int sysctl_numa_balancing_scan_size = 256; 945 | + 946 | +/* Scan @scan_size MB every @scan_period after an initial @scan_delay in ms */ 947 | +unsigned int sysctl_numa_balancing_scan_delay = 1000; 948 | + 949 | +struct numa_group { 950 | + refcount_t refcount; 951 | + 952 | + spinlock_t lock; /* nr_tasks, tasks */ 953 | + int nr_tasks; 954 | + pid_t gid; 955 | + int active_nodes; 956 | + 957 | + struct rcu_head rcu; 958 | + unsigned long total_faults; 959 | + unsigned long max_faults_cpu; 960 | + /* 961 | + * Faults_cpu is used to decide whether memory should move 962 | + * towards the CPU. As a consequence, these stats are weighted 963 | + * more by CPU use than by memory faults. 964 | + */ 965 | + unsigned long *faults_cpu; 966 | + unsigned long faults[]; 967 | +}; 968 | + 969 | +/* 970 | + * For functions that can be called in multiple contexts that permit reading 971 | + * ->numa_group (see struct task_struct for locking rules). 
972 | + */ 973 | +static struct numa_group *deref_task_numa_group(struct task_struct *p) 974 | +{ 975 | + return rcu_dereference_check(p->numa_group, p == current || 976 | + (lockdep_is_held(__rq_lockp(task_rq(p))) && !READ_ONCE(p->on_cpu))); 977 | +} 978 | + 979 | +static struct numa_group *deref_curr_numa_group(struct task_struct *p) 980 | +{ 981 | + return rcu_dereference_protected(p->numa_group, p == current); 982 | +} 983 | + 984 | +static inline unsigned long group_faults_priv(struct numa_group *ng); 985 | +static inline unsigned long group_faults_shared(struct numa_group *ng); 986 | + 987 | +static unsigned int task_nr_scan_windows(struct task_struct *p) 988 | +{ 989 | + unsigned long rss = 0; 990 | + unsigned long nr_scan_pages; 991 | + 992 | + /* 993 | + * Calculations based on RSS as non-present and empty pages are skipped 994 | + * by the PTE scanner and NUMA hinting faults should be trapped based 995 | + * on resident pages 996 | + */ 997 | + nr_scan_pages = sysctl_numa_balancing_scan_size << (20 - PAGE_SHIFT); 998 | + rss = get_mm_rss(p->mm); 999 | + if (!rss) 1000 | + rss = nr_scan_pages; 1001 | + 1002 | + rss = round_up(rss, nr_scan_pages); 1003 | + return rss / nr_scan_pages; 1004 | +} 1005 | + 1006 | +/* For sanity's sake, never scan more PTEs than MAX_SCAN_WINDOW MB/sec. */ 1007 | +#define MAX_SCAN_WINDOW 2560 1008 | + 1009 | +static unsigned int task_scan_min(struct task_struct *p) 1010 | +{ 1011 | + unsigned int scan_size = READ_ONCE(sysctl_numa_balancing_scan_size); 1012 | + unsigned int scan, floor; 1013 | + unsigned int windows = 1; 1014 | + 1015 | + if (scan_size < MAX_SCAN_WINDOW) 1016 | + windows = MAX_SCAN_WINDOW / scan_size; 1017 | + floor = 1000 / windows; 1018 | + 1019 | + scan = sysctl_numa_balancing_scan_period_min / task_nr_scan_windows(p); 1020 | + return max_t(unsigned int, floor, scan); 1021 | +} 1022 | + 1023 | +static unsigned int task_scan_max(struct task_struct *p) 1024 | +{ 1025 | + unsigned long smin = task_scan_min(p); 1026 | + unsigned long smax; 1027 | + struct numa_group *ng; 1028 | + 1029 | + /* Watch for min being lower than max due to floor calculations */ 1030 | + smax = sysctl_numa_balancing_scan_period_max / task_nr_scan_windows(p); 1031 | + 1032 | + /* Scale the maximum scan period with the amount of shared memory. */ 1033 | + ng = deref_curr_numa_group(p); 1034 | + if (ng) { 1035 | + unsigned long shared = group_faults_shared(ng); 1036 | + unsigned long private = group_faults_priv(ng); 1037 | + unsigned long period = smax; 1038 | + 1039 | + period *= refcount_read(&ng->refcount); 1040 | + period *= shared + 1; 1041 | + period /= private + shared + 1; 1042 | + 1043 | + smax = max(smax, period); 1044 | + } 1045 | + 1046 | + return max(smin, smax); 1047 | +} 1048 | + 1049 | +static void account_numa_enqueue(struct rq *rq, struct task_struct *p) 1050 | +{ 1051 | + rq->nr_numa_running += (p->numa_preferred_nid != NUMA_NO_NODE); 1052 | + rq->nr_preferred_running += (p->numa_preferred_nid == task_node(p)); 1053 | +} 1054 | + 1055 | +static void account_numa_dequeue(struct rq *rq, struct task_struct *p) 1056 | +{ 1057 | + rq->nr_numa_running -= (p->numa_preferred_nid != NUMA_NO_NODE); 1058 | + rq->nr_preferred_running -= (p->numa_preferred_nid == task_node(p)); 1059 | +} 1060 | + 1061 | +/* Shared or private faults. */ 1062 | +#define NR_NUMA_HINT_FAULT_TYPES 2 1063 | + 1064 | +/* Memory and CPU locality */ 1065 | +#define NR_NUMA_HINT_FAULT_STATS (NR_NUMA_HINT_FAULT_TYPES * 2) 1066 | + 1067 | +/* Averaged statistics, and temporary buffers. 
*/ 1068 | +#define NR_NUMA_HINT_FAULT_BUCKETS (NR_NUMA_HINT_FAULT_STATS * 2) 1069 | + 1070 | +pid_t task_numa_group_id(struct task_struct *p) 1071 | +{ 1072 | + struct numa_group *ng; 1073 | + pid_t gid = 0; 1074 | + 1075 | + rcu_read_lock(); 1076 | + ng = rcu_dereference(p->numa_group); 1077 | + if (ng) 1078 | + gid = ng->gid; 1079 | + rcu_read_unlock(); 1080 | + 1081 | + return gid; 1082 | +} 1083 | + 1084 | +/* 1085 | + * The averaged statistics, shared & private, memory & CPU, 1086 | + * occupy the first half of the array. The second half of the 1087 | + * array is for current counters, which are averaged into the 1088 | + * first set by task_numa_placement. 1089 | + */ 1090 | +static inline int task_faults_idx(enum numa_faults_stats s, int nid, int priv) 1091 | +{ 1092 | + return NR_NUMA_HINT_FAULT_TYPES * (s * nr_node_ids + nid) + priv; 1093 | +} 1094 | + 1095 | +static inline unsigned long task_faults(struct task_struct *p, int nid) 1096 | +{ 1097 | + if (!p->numa_faults) 1098 | + return 0; 1099 | + 1100 | + return p->numa_faults[task_faults_idx(NUMA_MEM, nid, 0)] + 1101 | + p->numa_faults[task_faults_idx(NUMA_MEM, nid, 1)]; 1102 | +} 1103 | + 1104 | +static inline unsigned long group_faults(struct task_struct *p, int nid) 1105 | +{ 1106 | + struct numa_group *ng = deref_task_numa_group(p); 1107 | + 1108 | + if (!ng) 1109 | + return 0; 1110 | + 1111 | + return ng->faults[task_faults_idx(NUMA_MEM, nid, 0)] + 1112 | + ng->faults[task_faults_idx(NUMA_MEM, nid, 1)]; 1113 | +} 1114 | + 1115 | +static inline unsigned long group_faults_cpu(struct numa_group *group, int nid) 1116 | +{ 1117 | + return group->faults_cpu[task_faults_idx(NUMA_MEM, nid, 0)] + 1118 | + group->faults_cpu[task_faults_idx(NUMA_MEM, nid, 1)]; 1119 | +} 1120 | + 1121 | +static inline unsigned long group_faults_priv(struct numa_group *ng) 1122 | +{ 1123 | + unsigned long faults = 0; 1124 | + int node; 1125 | + 1126 | + for_each_online_node(node) { 1127 | + faults += ng->faults[task_faults_idx(NUMA_MEM, node, 1)]; 1128 | + } 1129 | + 1130 | + return faults; 1131 | +} 1132 | + 1133 | +static inline unsigned long group_faults_shared(struct numa_group *ng) 1134 | +{ 1135 | + unsigned long faults = 0; 1136 | + int node; 1137 | + 1138 | + for_each_online_node(node) { 1139 | + faults += ng->faults[task_faults_idx(NUMA_MEM, node, 0)]; 1140 | + } 1141 | + 1142 | + return faults; 1143 | +} 1144 | + 1145 | +/* 1146 | + * A node triggering more than 1/3 as many NUMA faults as the maximum is 1147 | + * considered part of a numa group's pseudo-interleaving set. Migrations 1148 | + * between these nodes are slowed down, to allow things to settle down. 1149 | + */ 1150 | +#define ACTIVE_NODE_FRACTION 3 1151 | + 1152 | +static bool numa_is_active_node(int nid, struct numa_group *ng) 1153 | +{ 1154 | + return group_faults_cpu(ng, nid) * ACTIVE_NODE_FRACTION > ng->max_faults_cpu; 1155 | +} 1156 | + 1157 | +/* Handle placement on systems where not all nodes are directly connected. */ 1158 | +static unsigned long score_nearby_nodes(struct task_struct *p, int nid, 1159 | + int maxdist, bool task) 1160 | +{ 1161 | + unsigned long score = 0; 1162 | + int node; 1163 | + 1164 | + /* 1165 | + * All nodes are directly connected, and the same distance 1166 | + * from each other. No need for fancy placement algorithms. 
1167 | + */ 1168 | + if (sched_numa_topology_type == NUMA_DIRECT) 1169 | + return 0; 1170 | + 1171 | + /* 1172 | + * This code is called for each node, introducing N^2 complexity, 1173 | + * which should be ok given the number of nodes rarely exceeds 8. 1174 | + */ 1175 | + for_each_online_node(node) { 1176 | + unsigned long faults; 1177 | + int dist = node_distance(nid, node); 1178 | + 1179 | + /* 1180 | + * The furthest away nodes in the system are not interesting 1181 | + * for placement; nid was already counted. 1182 | + */ 1183 | + if (dist == sched_max_numa_distance || node == nid) 1184 | + continue; 1185 | + 1186 | + /* 1187 | + * On systems with a backplane NUMA topology, compare groups 1188 | + * of nodes, and move tasks towards the group with the most 1189 | + * memory accesses. When comparing two nodes at distance 1190 | + * "hoplimit", only nodes closer by than "hoplimit" are part 1191 | + * of each group. Skip other nodes. 1192 | + */ 1193 | + if (sched_numa_topology_type == NUMA_BACKPLANE && 1194 | + dist >= maxdist) 1195 | + continue; 1196 | + 1197 | + /* Add up the faults from nearby nodes. */ 1198 | + if (task) 1199 | + faults = task_faults(p, node); 1200 | + else 1201 | + faults = group_faults(p, node); 1202 | + 1203 | + /* 1204 | + * On systems with a glueless mesh NUMA topology, there are 1205 | + * no fixed "groups of nodes". Instead, nodes that are not 1206 | + * directly connected bounce traffic through intermediate 1207 | + * nodes; a numa_group can occupy any set of nodes. 1208 | + * The further away a node is, the less the faults count. 1209 | + * This seems to result in good task placement. 1210 | + */ 1211 | + if (sched_numa_topology_type == NUMA_GLUELESS_MESH) { 1212 | + faults *= (sched_max_numa_distance - dist); 1213 | + faults /= (sched_max_numa_distance - LOCAL_DISTANCE); 1214 | + } 1215 | + 1216 | + score += faults; 1217 | + } 1218 | + 1219 | + return score; 1220 | +} 1221 | + 1222 | +/* 1223 | + * These return the fraction of accesses done by a particular task, or 1224 | + * task group, on a particular numa node. The group weight is given a 1225 | + * larger multiplier, in order to group tasks together that are almost 1226 | + * evenly spread out between numa nodes. 
1227 | + */ 1228 | +static inline unsigned long task_weight(struct task_struct *p, int nid, 1229 | + int dist) 1230 | +{ 1231 | + unsigned long faults, total_faults; 1232 | + 1233 | + if (!p->numa_faults) 1234 | + return 0; 1235 | + 1236 | + total_faults = p->total_numa_faults; 1237 | + 1238 | + if (!total_faults) 1239 | + return 0; 1240 | + 1241 | + faults = task_faults(p, nid); 1242 | + faults += score_nearby_nodes(p, nid, dist, true); 1243 | + 1244 | + return 1000 * faults / total_faults; 1245 | +} 1246 | + 1247 | +static inline unsigned long group_weight(struct task_struct *p, int nid, 1248 | + int dist) 1249 | +{ 1250 | + struct numa_group *ng = deref_task_numa_group(p); 1251 | + unsigned long faults, total_faults; 1252 | + 1253 | + if (!ng) 1254 | + return 0; 1255 | + 1256 | + total_faults = ng->total_faults; 1257 | + 1258 | + if (!total_faults) 1259 | + return 0; 1260 | + 1261 | + faults = group_faults(p, nid); 1262 | + faults += score_nearby_nodes(p, nid, dist, false); 1263 | + 1264 | + return 1000 * faults / total_faults; 1265 | +} 1266 | + 1267 | +bool should_numa_migrate_memory(struct task_struct *p, struct page * page, 1268 | + int src_nid, int dst_cpu) 1269 | +{ 1270 | + struct numa_group *ng = deref_curr_numa_group(p); 1271 | + int dst_nid = cpu_to_node(dst_cpu); 1272 | + int last_cpupid, this_cpupid; 1273 | + 1274 | + this_cpupid = cpu_pid_to_cpupid(dst_cpu, current->pid); 1275 | + last_cpupid = page_cpupid_xchg_last(page, this_cpupid); 1276 | + 1277 | + /* 1278 | + * Allow first faults or private faults to migrate immediately early in 1279 | + * the lifetime of a task. The magic number 4 is based on waiting for 1280 | + * two full passes of the "multi-stage node selection" test that is 1281 | + * executed below. 1282 | + */ 1283 | + if ((p->numa_preferred_nid == NUMA_NO_NODE || p->numa_scan_seq <= 4) && 1284 | + (cpupid_pid_unset(last_cpupid) || cpupid_match_pid(p, last_cpupid))) 1285 | + return true; 1286 | + 1287 | + /* 1288 | + * Multi-stage node selection is used in conjunction with a periodic 1289 | + * migration fault to build a temporal task<->page relation. By using 1290 | + * a two-stage filter we remove short/unlikely relations. 1291 | + * 1292 | + * Using P(p) ~ n_p / n_t as per frequentist probability, we can equate 1293 | + * a task's usage of a particular page (n_p) per total usage of this 1294 | + * page (n_t) (in a given time-span) to a probability. 1295 | + * 1296 | + * Our periodic faults will sample this probability and getting the 1297 | + * same result twice in a row, given these samples are fully 1298 | + * independent, is then given by P(n)^2, provided our sample period 1299 | + * is sufficiently short compared to the usage pattern. 1300 | + * 1301 | + * This quadric squishes small probabilities, making it less likely we 1302 | + * act on an unlikely task<->page relation. 1303 | + */ 1304 | + if (!cpupid_pid_unset(last_cpupid) && 1305 | + cpupid_to_nid(last_cpupid) != dst_nid) 1306 | + return false; 1307 | + 1308 | + /* Always allow migrate on private faults */ 1309 | + if (cpupid_match_pid(p, last_cpupid)) 1310 | + return true; 1311 | + 1312 | + /* A shared fault, but p->numa_group has not been set up yet. */ 1313 | + if (!ng) 1314 | + return true; 1315 | + 1316 | + /* 1317 | + * Destination node is much more heavily used than the source 1318 | + * node? Allow migration. 
1319 | + */ 1320 | + if (group_faults_cpu(ng, dst_nid) > group_faults_cpu(ng, src_nid) * 1321 | + ACTIVE_NODE_FRACTION) 1322 | + return true; 1323 | + 1324 | + /* 1325 | + * Distribute memory according to CPU & memory use on each node, 1326 | + * with 3/4 hysteresis to avoid unnecessary memory migrations: 1327 | + * 1328 | + * faults_cpu(dst) 3 faults_cpu(src) 1329 | + * --------------- * - > --------------- 1330 | + * faults_mem(dst) 4 faults_mem(src) 1331 | + */ 1332 | + return group_faults_cpu(ng, dst_nid) * group_faults(p, src_nid) * 3 > 1333 | + group_faults_cpu(ng, src_nid) * group_faults(p, dst_nid) * 4; 1334 | +} 1335 | + 1336 | +/* 1337 | + * 'numa_type' describes the node at the moment of load balancing. 1338 | + */ 1339 | +enum numa_type { 1340 | + /* The node has spare capacity that can be used to run more tasks. */ 1341 | + node_has_spare = 0, 1342 | + /* 1343 | + * The node is fully used and the tasks don't compete for more CPU 1344 | + * cycles. Nevertheless, some tasks might wait before running. 1345 | + */ 1346 | + node_fully_busy, 1347 | + /* 1348 | + * The node is overloaded and can't provide expected CPU cycles to all 1349 | + * tasks. 1350 | + */ 1351 | + node_overloaded 1352 | +}; 1353 | + 1354 | +/* Cached statistics for all CPUs within a node */ 1355 | +struct numa_stats { 1356 | + unsigned long load; 1357 | + unsigned long runnable; 1358 | + unsigned long util; 1359 | + /* Total compute capacity of CPUs on a node */ 1360 | + unsigned long compute_capacity; 1361 | + unsigned int nr_running; 1362 | + unsigned int weight; 1363 | + enum numa_type node_type; 1364 | + int idle_cpu; 1365 | +}; 1366 | + 1367 | +static inline bool is_core_idle(int cpu) 1368 | +{ 1369 | +#ifdef CONFIG_SCHED_SMT 1370 | + int sibling; 1371 | + 1372 | + for_each_cpu(sibling, cpu_smt_mask(cpu)) { 1373 | + if (cpu == sibling) 1374 | + continue; 1375 | + 1376 | + if (!idle_cpu(sibling)) 1377 | + return false; 1378 | + } 1379 | +#endif 1380 | + 1381 | + return true; 1382 | +} 1383 | + 1384 | +struct task_numa_env { 1385 | + struct task_struct *p; 1386 | + 1387 | + int src_cpu, src_nid; 1388 | + int dst_cpu, dst_nid; 1389 | + 1390 | + struct numa_stats src_stats, dst_stats; 1391 | + 1392 | + int imbalance_pct; 1393 | + int dist; 1394 | + 1395 | + struct task_struct *best_task; 1396 | + long best_imp; 1397 | + int best_cpu; 1398 | +}; 1399 | + 1400 | +static inline unsigned long cfs_rq_load_avg(struct cfs_rq *cfs_rq) 1401 | +{ 1402 | + return cfs_rq->avg.load_avg; 1403 | +} 1404 | + 1405 | +static unsigned long cpu_load(struct rq *rq) 1406 | +{ 1407 | + return cfs_rq_load_avg(&rq->cfs); 1408 | +} 1409 | + 1410 | +static inline unsigned long cfs_rq_runnable_avg(struct cfs_rq *cfs_rq) 1411 | +{ 1412 | + return cfs_rq->avg.runnable_avg; 1413 | +} 1414 | + 1415 | +static unsigned long cpu_runnable(struct rq *rq) 1416 | +{ 1417 | + return cfs_rq_runnable_avg(&rq->cfs); 1418 | +} 1419 | + 1420 | +static inline unsigned long cpu_util(int cpu) 1421 | +{ 1422 | + struct cfs_rq *cfs_rq; 1423 | + unsigned int util; 1424 | + 1425 | + cfs_rq = &cpu_rq(cpu)->cfs; 1426 | + util = READ_ONCE(cfs_rq->avg.util_avg); 1427 | + 1428 | + if (sched_feat(UTIL_EST)) 1429 | + util = max(util, READ_ONCE(cfs_rq->avg.util_est.enqueued)); 1430 | + 1431 | + return min_t(unsigned long, util, capacity_orig_of(cpu)); 1432 | +} 1433 | + 1434 | +static unsigned long capacity_of(int cpu) 1435 | +{ 1436 | + return cpu_rq(cpu)->cpu_capacity; 1437 | +} 1438 | + 1439 | +static inline enum 1440 | +numa_type numa_classify(unsigned int 
imbalance_pct, 1441 | + struct numa_stats *ns) 1442 | +{ 1443 | + if ((ns->nr_running > ns->weight) && 1444 | + (((ns->compute_capacity * 100) < (ns->util * imbalance_pct)) || 1445 | + ((ns->compute_capacity * imbalance_pct) < (ns->runnable * 100)))) 1446 | + return node_overloaded; 1447 | + 1448 | + if ((ns->nr_running < ns->weight) || 1449 | + (((ns->compute_capacity * 100) > (ns->util * imbalance_pct)) && 1450 | + ((ns->compute_capacity * imbalance_pct) > (ns->runnable * 100)))) 1451 | + return node_has_spare; 1452 | + 1453 | + return node_fully_busy; 1454 | +} 1455 | + 1456 | +#ifdef CONFIG_SCHED_SMT 1457 | +/* Forward declarations of select_idle_sibling helpers */ 1458 | +static inline bool test_idle_cores(int cpu, bool def); 1459 | +static inline int numa_idle_core(int idle_core, int cpu) 1460 | +{ 1461 | + if (!static_branch_likely(&sched_smt_present) || 1462 | + idle_core >= 0 || !test_idle_cores(cpu, false)) 1463 | + return idle_core; 1464 | + 1465 | + /* 1466 | + * Prefer cores instead of packing HT siblings 1467 | + * and triggering future load balancing. 1468 | + */ 1469 | + if (is_core_idle(cpu)) 1470 | + idle_core = cpu; 1471 | + 1472 | + return idle_core; 1473 | +} 1474 | +#else 1475 | +static inline int numa_idle_core(int idle_core, int cpu) 1476 | +{ 1477 | + return idle_core; 1478 | +} 1479 | +#endif 1480 | + 1481 | +/* 1482 | + * Gather all necessary information to make NUMA balancing placement 1483 | + * decisions that are compatible with standard load balancer. This 1484 | + * borrows code and logic from update_sg_lb_stats but sharing a 1485 | + * common implementation is impractical. 1486 | + */ 1487 | +static void update_numa_stats(struct task_numa_env *env, 1488 | + struct numa_stats *ns, int nid, 1489 | + bool find_idle) 1490 | +{ 1491 | + int cpu, idle_core = -1; 1492 | + 1493 | + memset(ns, 0, sizeof(*ns)); 1494 | + ns->idle_cpu = -1; 1495 | + 1496 | + rcu_read_lock(); 1497 | + for_each_cpu(cpu, cpumask_of_node(nid)) { 1498 | + struct rq *rq = cpu_rq(cpu); 1499 | + 1500 | + ns->load += cpu_load(rq); 1501 | + ns->runnable += cpu_runnable(rq); 1502 | + ns->util += cpu_util(cpu); 1503 | + ns->nr_running += rq->cfs.h_nr_running; 1504 | + ns->compute_capacity += capacity_of(cpu); 1505 | + 1506 | + if (find_idle && !rq->nr_running && idle_cpu(cpu)) { 1507 | + if (READ_ONCE(rq->numa_migrate_on) || 1508 | + !cpumask_test_cpu(cpu, env->p->cpus_ptr)) 1509 | + continue; 1510 | + 1511 | + if (ns->idle_cpu == -1) 1512 | + ns->idle_cpu = cpu; 1513 | + 1514 | + idle_core = numa_idle_core(idle_core, cpu); 1515 | + } 1516 | + } 1517 | + rcu_read_unlock(); 1518 | + 1519 | + ns->weight = cpumask_weight(cpumask_of_node(nid)); 1520 | + 1521 | + ns->node_type = numa_classify(env->imbalance_pct, ns); 1522 | + 1523 | + if (idle_core >= 0) 1524 | + ns->idle_cpu = idle_core; 1525 | +} 1526 | + 1527 | +static void task_numa_assign(struct task_numa_env *env, 1528 | + struct task_struct *p, long imp) 1529 | +{ 1530 | + struct rq *rq = cpu_rq(env->dst_cpu); 1531 | + 1532 | + /* Check if run-queue part of active NUMA balance. */ 1533 | + if (env->best_cpu != env->dst_cpu && xchg(&rq->numa_migrate_on, 1)) { 1534 | + int cpu; 1535 | + int start = env->dst_cpu; 1536 | + 1537 | + /* Find alternative idle CPU. 
*/ 1538 | + for_each_cpu_wrap(cpu, cpumask_of_node(env->dst_nid), start) { 1539 | + if (cpu == env->best_cpu || !idle_cpu(cpu) || 1540 | + !cpumask_test_cpu(cpu, env->p->cpus_ptr)) { 1541 | + continue; 1542 | + } 1543 | + 1544 | + env->dst_cpu = cpu; 1545 | + rq = cpu_rq(env->dst_cpu); 1546 | + if (!xchg(&rq->numa_migrate_on, 1)) 1547 | + goto assign; 1548 | + } 1549 | + 1550 | + /* Failed to find an alternative idle CPU */ 1551 | + return; 1552 | + } 1553 | + 1554 | +assign: 1555 | + /* 1556 | + * Clear previous best_cpu/rq numa-migrate flag, since task now 1557 | + * found a better CPU to move/swap. 1558 | + */ 1559 | + if (env->best_cpu != -1 && env->best_cpu != env->dst_cpu) { 1560 | + rq = cpu_rq(env->best_cpu); 1561 | + WRITE_ONCE(rq->numa_migrate_on, 0); 1562 | + } 1563 | + 1564 | + if (env->best_task) 1565 | + put_task_struct(env->best_task); 1566 | + if (p) 1567 | + get_task_struct(p); 1568 | + 1569 | + env->best_task = p; 1570 | + env->best_imp = imp; 1571 | + env->best_cpu = env->dst_cpu; 1572 | +} 1573 | + 1574 | +static bool load_too_imbalanced(long src_load, long dst_load, 1575 | + struct task_numa_env *env) 1576 | +{ 1577 | + long imb, old_imb; 1578 | + long orig_src_load, orig_dst_load; 1579 | + long src_capacity, dst_capacity; 1580 | + 1581 | + /* 1582 | + * The load is corrected for the CPU capacity available on each node. 1583 | + * 1584 | + * src_load dst_load 1585 | + * ------------ vs --------- 1586 | + * src_capacity dst_capacity 1587 | + */ 1588 | + src_capacity = env->src_stats.compute_capacity; 1589 | + dst_capacity = env->dst_stats.compute_capacity; 1590 | + 1591 | + imb = abs(dst_load * src_capacity - src_load * dst_capacity); 1592 | + 1593 | + orig_src_load = env->src_stats.load; 1594 | + orig_dst_load = env->dst_stats.load; 1595 | + 1596 | + old_imb = abs(orig_dst_load * src_capacity - orig_src_load * dst_capacity); 1597 | + 1598 | + /* Would this change make things worse? */ 1599 | + return (imb > old_imb); 1600 | +} 1601 | + 1602 | +static unsigned int task_scan_start(struct task_struct *p) 1603 | +{ 1604 | + unsigned long smin = task_scan_min(p); 1605 | + unsigned long period = smin; 1606 | + struct numa_group *ng; 1607 | + 1608 | + /* Scale the maximum scan period with the amount of shared memory. */ 1609 | + rcu_read_lock(); 1610 | + ng = rcu_dereference(p->numa_group); 1611 | + if (ng) { 1612 | + unsigned long shared = group_faults_shared(ng); 1613 | + unsigned long private = group_faults_priv(ng); 1614 | + 1615 | + period *= refcount_read(&ng->refcount); 1616 | + period *= shared + 1; 1617 | + period /= private + shared + 1; 1618 | + } 1619 | + rcu_read_unlock(); 1620 | + 1621 | + return max(smin, period); 1622 | +} 1623 | + 1624 | +static void update_scan_period(struct task_struct *p, int new_cpu) 1625 | +{ 1626 | + int src_nid = cpu_to_node(task_cpu(p)); 1627 | + int dst_nid = cpu_to_node(new_cpu); 1628 | + 1629 | + if (!static_branch_likely(&sched_numa_balancing)) 1630 | + return; 1631 | + 1632 | + if (!p->mm || !p->numa_faults || (p->flags & PF_EXITING)) 1633 | + return; 1634 | + 1635 | + if (src_nid == dst_nid) 1636 | + return; 1637 | + 1638 | + /* 1639 | + * Allow resets if faults have been trapped before one scan 1640 | + * has completed. This is most likely due to a new task that 1641 | + * is pulled cross-node due to wakeups or load balancing. 
1642 | + */ 1643 | + if (p->numa_scan_seq) { 1644 | + /* 1645 | + * Avoid scan adjustments if moving to the preferred 1646 | + * node or if the task was not previously running on 1647 | + * the preferred node. 1648 | + */ 1649 | + if (dst_nid == p->numa_preferred_nid || 1650 | + (p->numa_preferred_nid != NUMA_NO_NODE && 1651 | + src_nid != p->numa_preferred_nid)) 1652 | + return; 1653 | + } 1654 | + 1655 | + p->numa_scan_period = task_scan_start(p); 1656 | +} 1657 | + 1658 | +/* 1659 | + * Allow a NUMA imbalance if busy CPUs is less than 25% of the domain. 1660 | + * This is an approximation as the number of running tasks may not be 1661 | + * related to the number of busy CPUs due to sched_setaffinity. 1662 | + */ 1663 | +static inline bool allow_numa_imbalance(int dst_running, int dst_weight) 1664 | +{ 1665 | + return (dst_running < (dst_weight >> 2)); 1666 | +} 1667 | + 1668 | +#define NUMA_IMBALANCE_MIN 2 1669 | + 1670 | +static inline long adjust_numa_imbalance(int imbalance, 1671 | + int dst_running, int dst_weight) 1672 | +{ 1673 | + if (!allow_numa_imbalance(dst_running, dst_weight)) 1674 | + return imbalance; 1675 | + 1676 | + /* 1677 | + * Allow a small imbalance based on a simple pair of communicating 1678 | + * tasks that remain local when the destination is lightly loaded. 1679 | + */ 1680 | + if (imbalance <= NUMA_IMBALANCE_MIN) 1681 | + return 0; 1682 | + 1683 | + return imbalance; 1684 | +} 1685 | + 1686 | +static unsigned long task_h_load(struct task_struct *p) 1687 | +{ 1688 | + return p->se.avg.load_avg; 1689 | +} 1690 | + 1691 | +/* 1692 | + * Maximum NUMA importance can be 1998 (2*999); 1693 | + * SMALLIMP @ 30 would be close to 1998/64. 1694 | + * Used to deter task migration. 1695 | + */ 1696 | +#define SMALLIMP 30 1697 | + 1698 | +/* 1699 | + * This checks if the overall compute and NUMA accesses of the system would 1700 | + * be improved if the source tasks was migrated to the target dst_cpu taking 1701 | + * into account that it might be best if task running on the dst_cpu should 1702 | + * be exchanged with the source task 1703 | + */ 1704 | +static bool task_numa_compare(struct task_numa_env *env, 1705 | + long taskimp, long groupimp, bool maymove) 1706 | +{ 1707 | + struct numa_group *cur_ng, *p_ng = deref_curr_numa_group(env->p); 1708 | + struct rq *dst_rq = cpu_rq(env->dst_cpu); 1709 | + long imp = p_ng ? groupimp : taskimp; 1710 | + struct task_struct *cur; 1711 | + long src_load, dst_load; 1712 | + int dist = env->dist; 1713 | + long moveimp = imp; 1714 | + long load; 1715 | + bool stopsearch = false; 1716 | + 1717 | + if (READ_ONCE(dst_rq->numa_migrate_on)) 1718 | + return false; 1719 | + 1720 | + rcu_read_lock(); 1721 | + cur = rcu_dereference(dst_rq->curr); 1722 | + if (cur && ((cur->flags & PF_EXITING) || is_idle_task(cur))) 1723 | + cur = NULL; 1724 | + 1725 | + /* 1726 | + * Because we have preemption enabled we can get migrated around and 1727 | + * end try selecting ourselves (current == env->p) as a swap candidate. 1728 | + */ 1729 | + if (cur == env->p) { 1730 | + stopsearch = true; 1731 | + goto unlock; 1732 | + } 1733 | + 1734 | + if (!cur) { 1735 | + if (maymove && moveimp >= env->best_imp) 1736 | + goto assign; 1737 | + else 1738 | + goto unlock; 1739 | + } 1740 | + 1741 | + /* Skip this swap candidate if cannot move to the source cpu. 
*/ 1742 | + if (!cpumask_test_cpu(env->src_cpu, cur->cpus_ptr)) 1743 | + goto unlock; 1744 | + 1745 | + /* 1746 | + * Skip this swap candidate if it is not moving to its preferred 1747 | + * node and the best task is. 1748 | + */ 1749 | + if (env->best_task && 1750 | + env->best_task->numa_preferred_nid == env->src_nid && 1751 | + cur->numa_preferred_nid != env->src_nid) { 1752 | + goto unlock; 1753 | + } 1754 | + 1755 | + /* 1756 | + * "imp" is the fault differential for the source task between the 1757 | + * source and destination node. Calculate the total differential for 1758 | + * the source task and potential destination task. The more negative 1759 | + * the value is, the more remote accesses that would be expected to 1760 | + * be incurred if the tasks were swapped. 1761 | + * 1762 | + * If dst and source tasks are in the same NUMA group, or not 1763 | + * in any group then look only at task weights. 1764 | + */ 1765 | + cur_ng = rcu_dereference(cur->numa_group); 1766 | + if (cur_ng == p_ng) { 1767 | + imp = taskimp + task_weight(cur, env->src_nid, dist) - 1768 | + task_weight(cur, env->dst_nid, dist); 1769 | + /* 1770 | + * Add some hysteresis to prevent swapping the 1771 | + * tasks within a group over tiny differences. 1772 | + */ 1773 | + if (cur_ng) 1774 | + imp -= imp / 16; 1775 | + } else { 1776 | + /* 1777 | + * Compare the group weights. If a task is all by itself 1778 | + * (not part of a group), use the task weight instead. 1779 | + */ 1780 | + if (cur_ng && p_ng) 1781 | + imp += group_weight(cur, env->src_nid, dist) - 1782 | + group_weight(cur, env->dst_nid, dist); 1783 | + else 1784 | + imp += task_weight(cur, env->src_nid, dist) - 1785 | + task_weight(cur, env->dst_nid, dist); 1786 | + } 1787 | + 1788 | + /* Discourage picking a task already on its preferred node */ 1789 | + if (cur->numa_preferred_nid == env->dst_nid) 1790 | + imp -= imp / 16; 1791 | + 1792 | + /* 1793 | + * Encourage picking a task that moves to its preferred node. 1794 | + * This potentially makes imp larger than it's maximum of 1795 | + * 1998 (see SMALLIMP and task_weight for why) but in this 1796 | + * case, it does not matter. 1797 | + */ 1798 | + if (cur->numa_preferred_nid == env->src_nid) 1799 | + imp += imp / 8; 1800 | + 1801 | + if (maymove && moveimp > imp && moveimp > env->best_imp) { 1802 | + imp = moveimp; 1803 | + cur = NULL; 1804 | + goto assign; 1805 | + } 1806 | + 1807 | + /* 1808 | + * Prefer swapping with a task moving to its preferred node over a 1809 | + * task that is not. 1810 | + */ 1811 | + if (env->best_task && cur->numa_preferred_nid == env->src_nid && 1812 | + env->best_task->numa_preferred_nid != env->src_nid) { 1813 | + goto assign; 1814 | + } 1815 | + 1816 | + /* 1817 | + * If the NUMA importance is less than SMALLIMP, 1818 | + * task migration might only result in ping pong 1819 | + * of tasks and also hurt performance due to cache 1820 | + * misses. 1821 | + */ 1822 | + if (imp < SMALLIMP || imp <= env->best_imp + SMALLIMP / 2) 1823 | + goto unlock; 1824 | + 1825 | + /* 1826 | + * In the overloaded case, try and keep the load balanced. 1827 | + */ 1828 | + load = task_h_load(env->p) - task_h_load(cur); 1829 | + if (!load) 1830 | + goto assign; 1831 | + 1832 | + dst_load = env->dst_stats.load + load; 1833 | + src_load = env->src_stats.load - load; 1834 | + 1835 | + if (load_too_imbalanced(src_load, dst_load, env)) 1836 | + goto unlock; 1837 | + 1838 | +assign: 1839 | + /* Evaluate an idle CPU for a task numa move. 
*/ 1840 | + if (!cur) { 1841 | + int cpu = env->dst_stats.idle_cpu; 1842 | + 1843 | + /* Nothing cached so current CPU went idle since the search. */ 1844 | + if (cpu < 0) 1845 | + cpu = env->dst_cpu; 1846 | + 1847 | + /* 1848 | + * If the CPU is no longer truly idle and the previous best CPU 1849 | + * is, keep using it. 1850 | + */ 1851 | + if (!idle_cpu(cpu) && env->best_cpu >= 0 && 1852 | + idle_cpu(env->best_cpu)) { 1853 | + cpu = env->best_cpu; 1854 | + } 1855 | + 1856 | + env->dst_cpu = cpu; 1857 | + } 1858 | + 1859 | + task_numa_assign(env, cur, imp); 1860 | + 1861 | + /* 1862 | + * If a move to idle is allowed because there is capacity or load 1863 | + * balance improves then stop the search. While a better swap 1864 | + * candidate may exist, a search is not free. 1865 | + */ 1866 | + if (maymove && !cur && env->best_cpu >= 0 && idle_cpu(env->best_cpu)) 1867 | + stopsearch = true; 1868 | + 1869 | + /* 1870 | + * If a swap candidate must be identified and the current best task 1871 | + * moves its preferred node then stop the search. 1872 | + */ 1873 | + if (!maymove && env->best_task && 1874 | + env->best_task->numa_preferred_nid == env->src_nid) { 1875 | + stopsearch = true; 1876 | + } 1877 | +unlock: 1878 | + rcu_read_unlock(); 1879 | + 1880 | + return stopsearch; 1881 | +} 1882 | + 1883 | +static void task_numa_find_cpu(struct task_numa_env *env, 1884 | + long taskimp, long groupimp) 1885 | +{ 1886 | + bool maymove = false; 1887 | + int cpu; 1888 | + 1889 | + /* 1890 | + * If dst node has spare capacity, then check if there is an 1891 | + * imbalance that would be overruled by the load balancer. 1892 | + */ 1893 | + if (env->dst_stats.node_type == node_has_spare) { 1894 | + unsigned int imbalance; 1895 | + int src_running, dst_running; 1896 | + 1897 | + /* 1898 | + * Would movement cause an imbalance? Note that if src has 1899 | + * more running tasks that the imbalance is ignored as the 1900 | + * move improves the imbalance from the perspective of the 1901 | + * CPU load balancer. 1902 | + * */ 1903 | + src_running = env->src_stats.nr_running - 1; 1904 | + dst_running = env->dst_stats.nr_running + 1; 1905 | + imbalance = max(0, dst_running - src_running); 1906 | + imbalance = adjust_numa_imbalance(imbalance, dst_running, 1907 | + env->dst_stats.weight); 1908 | + 1909 | + /* Use idle CPU if there is no imbalance */ 1910 | + if (!imbalance) { 1911 | + maymove = true; 1912 | + if (env->dst_stats.idle_cpu >= 0) { 1913 | + env->dst_cpu = env->dst_stats.idle_cpu; 1914 | + task_numa_assign(env, NULL, 0); 1915 | + return; 1916 | + } 1917 | + } 1918 | + } else { 1919 | + long src_load, dst_load, load; 1920 | + /* 1921 | + * If the improvement from just moving env->p direction is better 1922 | + * than swapping tasks around, check if a move is possible. 
1923 | + */ 1924 | + load = task_h_load(env->p); 1925 | + dst_load = env->dst_stats.load + load; 1926 | + src_load = env->src_stats.load - load; 1927 | + maymove = !load_too_imbalanced(src_load, dst_load, env); 1928 | + } 1929 | + 1930 | + for_each_cpu(cpu, cpumask_of_node(env->dst_nid)) { 1931 | + /* Skip this CPU if the source task cannot migrate */ 1932 | + if (!cpumask_test_cpu(cpu, env->p->cpus_ptr)) 1933 | + continue; 1934 | + 1935 | + env->dst_cpu = cpu; 1936 | + if (task_numa_compare(env, taskimp, groupimp, maymove)) 1937 | + break; 1938 | + } 1939 | +} 1940 | + 1941 | +static int task_numa_migrate(struct task_struct *p) 1942 | +{ 1943 | + struct task_numa_env env = { 1944 | + .p = p, 1945 | + 1946 | + .src_cpu = task_cpu(p), 1947 | + .src_nid = task_node(p), 1948 | + 1949 | + .imbalance_pct = 112, 1950 | + 1951 | + .best_task = NULL, 1952 | + .best_imp = 0, 1953 | + .best_cpu = -1, 1954 | + }; 1955 | + unsigned long taskweight, groupweight; 1956 | + struct sched_domain *sd; 1957 | + long taskimp, groupimp; 1958 | + struct numa_group *ng; 1959 | + struct rq *best_rq; 1960 | + int nid, ret, dist; 1961 | + 1962 | + /* 1963 | + * Pick the lowest SD_NUMA domain, as that would have the smallest 1964 | + * imbalance and would be the first to start moving tasks about. 1965 | + * 1966 | + * And we want to avoid any moving of tasks about, as that would create 1967 | + * random movement of tasks -- counter the numa conditions we're trying 1968 | + * to satisfy here. 1969 | + */ 1970 | + rcu_read_lock(); 1971 | + sd = rcu_dereference(per_cpu(sd_numa, env.src_cpu)); 1972 | + if (sd) 1973 | + env.imbalance_pct = 100 + (sd->imbalance_pct - 100) / 2; 1974 | + rcu_read_unlock(); 1975 | + 1976 | + /* 1977 | + * Cpusets can break the scheduler domain tree into smaller 1978 | + * balance domains, some of which do not cross NUMA boundaries. 1979 | + * Tasks that are "trapped" in such domains cannot be migrated 1980 | + * elsewhere, so there is no point in (re)trying. 1981 | + */ 1982 | + if (unlikely(!sd)) { 1983 | + sched_setnuma(p, task_node(p)); 1984 | + return -EINVAL; 1985 | + } 1986 | + 1987 | + env.dst_nid = p->numa_preferred_nid; 1988 | + dist = env.dist = node_distance(env.src_nid, env.dst_nid); 1989 | + taskweight = task_weight(p, env.src_nid, dist); 1990 | + groupweight = group_weight(p, env.src_nid, dist); 1991 | + update_numa_stats(&env, &env.src_stats, env.src_nid, false); 1992 | + taskimp = task_weight(p, env.dst_nid, dist) - taskweight; 1993 | + groupimp = group_weight(p, env.dst_nid, dist) - groupweight; 1994 | + update_numa_stats(&env, &env.dst_stats, env.dst_nid, true); 1995 | + 1996 | + /* Try to find a spot on the preferred nid. */ 1997 | + task_numa_find_cpu(&env, taskimp, groupimp); 1998 | + 1999 | + /* 2000 | + * Look at other nodes in these cases: 2001 | + * - there is no space available on the preferred_nid 2002 | + * - the task is part of a numa_group that is interleaved across 2003 | + * multiple NUMA nodes; in order to better consolidate the group, 2004 | + * we need to check other locations. 
2005 | + */ 2006 | + ng = deref_curr_numa_group(p); 2007 | + if (env.best_cpu == -1 || (ng && ng->active_nodes > 1)) { 2008 | + for_each_online_node(nid) { 2009 | + if (nid == env.src_nid || nid == p->numa_preferred_nid) 2010 | + continue; 2011 | + 2012 | + dist = node_distance(env.src_nid, env.dst_nid); 2013 | + if (sched_numa_topology_type == NUMA_BACKPLANE && 2014 | + dist != env.dist) { 2015 | + taskweight = task_weight(p, env.src_nid, dist); 2016 | + groupweight = group_weight(p, env.src_nid, dist); 2017 | + } 2018 | + 2019 | + /* Only consider nodes where both task and groups benefit */ 2020 | + taskimp = task_weight(p, nid, dist) - taskweight; 2021 | + groupimp = group_weight(p, nid, dist) - groupweight; 2022 | + if (taskimp < 0 && groupimp < 0) 2023 | + continue; 2024 | + 2025 | + env.dist = dist; 2026 | + env.dst_nid = nid; 2027 | + update_numa_stats(&env, &env.dst_stats, env.dst_nid, true); 2028 | + task_numa_find_cpu(&env, taskimp, groupimp); 2029 | + } 2030 | + } 2031 | + 2032 | + /* 2033 | + * If the task is part of a workload that spans multiple NUMA nodes, 2034 | + * and is migrating into one of the workload's active nodes, remember 2035 | + * this node as the task's preferred numa node, so the workload can 2036 | + * settle down. 2037 | + * A task that migrated to a second choice node will be better off 2038 | + * trying for a better one later. Do not set the preferred node here. 2039 | + */ 2040 | + if (ng) { 2041 | + if (env.best_cpu == -1) 2042 | + nid = env.src_nid; 2043 | + else 2044 | + nid = cpu_to_node(env.best_cpu); 2045 | + 2046 | + if (nid != p->numa_preferred_nid) 2047 | + sched_setnuma(p, nid); 2048 | + } 2049 | + 2050 | + /* No better CPU than the current one was found. */ 2051 | + if (env.best_cpu == -1) { 2052 | + trace_sched_stick_numa(p, env.src_cpu, NULL, -1); 2053 | + return -EAGAIN; 2054 | + } 2055 | + 2056 | + best_rq = cpu_rq(env.best_cpu); 2057 | + if (env.best_task == NULL) { 2058 | + ret = migrate_task_to(p, env.best_cpu); 2059 | + WRITE_ONCE(best_rq->numa_migrate_on, 0); 2060 | + if (ret != 0) 2061 | + trace_sched_stick_numa(p, env.src_cpu, NULL, env.best_cpu); 2062 | + return ret; 2063 | + } 2064 | + 2065 | + ret = migrate_swap(p, env.best_task, env.best_cpu, env.src_cpu); 2066 | + WRITE_ONCE(best_rq->numa_migrate_on, 0); 2067 | + 2068 | + if (ret != 0) 2069 | + trace_sched_stick_numa(p, env.src_cpu, env.best_task, env.best_cpu); 2070 | + put_task_struct(env.best_task); 2071 | + return ret; 2072 | +} 2073 | + 2074 | +/* Attempt to migrate a task to a CPU on the preferred node. */ 2075 | +static void numa_migrate_preferred(struct task_struct *p) 2076 | +{ 2077 | + unsigned long interval = HZ; 2078 | + 2079 | + /* This task has no NUMA fault statistics yet */ 2080 | + if (unlikely(p->numa_preferred_nid == NUMA_NO_NODE || !p->numa_faults)) 2081 | + return; 2082 | + 2083 | + /* Periodically retry migrating the task to the preferred node */ 2084 | + interval = min(interval, msecs_to_jiffies(p->numa_scan_period) / 16); 2085 | + p->numa_migrate_retry = jiffies + interval; 2086 | + 2087 | + /* Success if task is already running on preferred CPU */ 2088 | + if (task_node(p) == p->numa_preferred_nid) 2089 | + return; 2090 | + 2091 | + /* Otherwise, try migrate to a CPU on the preferred node */ 2092 | + task_numa_migrate(p); 2093 | +} 2094 | + 2095 | +/* 2096 | + * Find out how many nodes on the workload is actively running on. Do this by 2097 | + * tracking the nodes from which NUMA hinting faults are triggered. 
This can 2098 | + * be different from the set of nodes where the workload's memory is currently 2099 | + * located. 2100 | + */ 2101 | +static void numa_group_count_active_nodes(struct numa_group *numa_group) 2102 | +{ 2103 | + unsigned long faults, max_faults = 0; 2104 | + int nid, active_nodes = 0; 2105 | + 2106 | + for_each_online_node(nid) { 2107 | + faults = group_faults_cpu(numa_group, nid); 2108 | + if (faults > max_faults) 2109 | + max_faults = faults; 2110 | + } 2111 | + 2112 | + for_each_online_node(nid) { 2113 | + faults = group_faults_cpu(numa_group, nid); 2114 | + if (faults * ACTIVE_NODE_FRACTION > max_faults) 2115 | + active_nodes++; 2116 | + } 2117 | + 2118 | + numa_group->max_faults_cpu = max_faults; 2119 | + numa_group->active_nodes = active_nodes; 2120 | +} 2121 | + 2122 | +#define NUMA_PERIOD_SLOTS 10 2123 | +#define NUMA_PERIOD_THRESHOLD 7 2124 | + 2125 | +/* 2126 | + * Increase the scan period (slow down scanning) if the majority of 2127 | + * our memory is already on our local node, or if the majority of 2128 | + * the page accesses are shared with other processes. 2129 | + * Otherwise, decrease the scan period. 2130 | + */ 2131 | +static void update_task_scan_period(struct task_struct *p, 2132 | + unsigned long shared, unsigned long private) 2133 | +{ 2134 | + unsigned int period_slot; 2135 | + int lr_ratio, ps_ratio; 2136 | + int diff; 2137 | + 2138 | + unsigned long remote = p->numa_faults_locality[0]; 2139 | + unsigned long local = p->numa_faults_locality[1]; 2140 | + 2141 | + /* 2142 | + * If there were no record hinting faults then either the task is 2143 | + * completely idle or all activity is areas that are not of interest 2144 | + * to automatic numa balancing. Related to that, if there were failed 2145 | + * migration then it implies we are migrating too quickly or the local 2146 | + * node is overloaded. In either case, scan slower 2147 | + */ 2148 | + if (local + shared == 0 || p->numa_faults_locality[2]) { 2149 | + p->numa_scan_period = min(p->numa_scan_period_max, 2150 | + p->numa_scan_period << 1); 2151 | + 2152 | + p->mm->numa_next_scan = jiffies + 2153 | + msecs_to_jiffies(p->numa_scan_period); 2154 | + 2155 | + return; 2156 | + } 2157 | + 2158 | + /* 2159 | + * Prepare to scale scan period relative to the current period. 2160 | + * == NUMA_PERIOD_THRESHOLD scan period stays the same 2161 | + * < NUMA_PERIOD_THRESHOLD scan period decreases (scan faster) 2162 | + * >= NUMA_PERIOD_THRESHOLD scan period increases (scan slower) 2163 | + */ 2164 | + period_slot = DIV_ROUND_UP(p->numa_scan_period, NUMA_PERIOD_SLOTS); 2165 | + lr_ratio = (local * NUMA_PERIOD_SLOTS) / (local + remote); 2166 | + ps_ratio = (private * NUMA_PERIOD_SLOTS) / (private + shared); 2167 | + 2168 | + if (ps_ratio >= NUMA_PERIOD_THRESHOLD) { 2169 | + /* 2170 | + * Most memory accesses are local. There is no need to 2171 | + * do fast NUMA scanning, since memory is already local. 2172 | + */ 2173 | + int slot = ps_ratio - NUMA_PERIOD_THRESHOLD; 2174 | + if (!slot) 2175 | + slot = 1; 2176 | + diff = slot * period_slot; 2177 | + } else if (lr_ratio >= NUMA_PERIOD_THRESHOLD) { 2178 | + /* 2179 | + * Most memory accesses are shared with other tasks. 2180 | + * There is no point in continuing fast NUMA scanning, 2181 | + * since other tasks may just move the memory elsewhere. 
2182 | + */ 2183 | + int slot = lr_ratio - NUMA_PERIOD_THRESHOLD; 2184 | + if (!slot) 2185 | + slot = 1; 2186 | + diff = slot * period_slot; 2187 | + } else { 2188 | + /* 2189 | + * Private memory faults exceed (SLOTS-THRESHOLD)/SLOTS, 2190 | + * yet they are not on the local NUMA node. Speed up 2191 | + * NUMA scanning to get the memory moved over. 2192 | + */ 2193 | + int ratio = max(lr_ratio, ps_ratio); 2194 | + diff = -(NUMA_PERIOD_THRESHOLD - ratio) * period_slot; 2195 | + } 2196 | + 2197 | + p->numa_scan_period = clamp(p->numa_scan_period + diff, 2198 | + task_scan_min(p), task_scan_max(p)); 2199 | + memset(p->numa_faults_locality, 0, sizeof(p->numa_faults_locality)); 2200 | +} 2201 | + 2202 | +/* 2203 | + * Get the fraction of time the task has been running since the last 2204 | + * NUMA placement cycle. The scheduler keeps similar statistics, but 2205 | + * decays those on a 32ms period, which is orders of magnitude off 2206 | + * from the dozens-of-seconds NUMA balancing period. Use the scheduler 2207 | + * stats only if the task is so new there are no NUMA statistics yet. 2208 | + */ 2209 | +static u64 numa_get_avg_runtime(struct task_struct *p, u64 *period) 2210 | +{ 2211 | + u64 runtime, delta, now; 2212 | + /* Use the start of this time slice to avoid calculations. */ 2213 | + now = p->se.exec_start; 2214 | + runtime = p->se.sum_exec_runtime; 2215 | + 2216 | + if (p->last_task_numa_placement) { 2217 | + delta = runtime - p->last_sum_exec_runtime; 2218 | + *period = now - p->last_task_numa_placement; 2219 | + 2220 | + /* Avoid time going backwards, prevent potential divide error: */ 2221 | + if (unlikely((s64)*period < 0)) 2222 | + *period = 0; 2223 | + } else { 2224 | + delta = p->se.avg.load_sum; 2225 | + *period = LOAD_AVG_MAX; 2226 | + } 2227 | + 2228 | + p->last_sum_exec_runtime = runtime; 2229 | + p->last_task_numa_placement = now; 2230 | + 2231 | + return delta; 2232 | +} 2233 | + 2234 | +/* 2235 | + * Determine the preferred nid for a task in a numa_group. This needs to 2236 | + * be done in a way that produces consistent results with group_weight, 2237 | + * otherwise workloads might not converge. 2238 | + */ 2239 | +static int preferred_group_nid(struct task_struct *p, int nid) 2240 | +{ 2241 | + nodemask_t nodes; 2242 | + int dist; 2243 | + 2244 | + /* Direct connections between all NUMA nodes. */ 2245 | + if (sched_numa_topology_type == NUMA_DIRECT) 2246 | + return nid; 2247 | + 2248 | + /* 2249 | + * On a system with glueless mesh NUMA topology, group_weight 2250 | + * scores nodes according to the number of NUMA hinting faults on 2251 | + * both the node itself, and on nearby nodes. 2252 | + */ 2253 | + if (sched_numa_topology_type == NUMA_GLUELESS_MESH) { 2254 | + unsigned long score, max_score = 0; 2255 | + int node, max_node = nid; 2256 | + 2257 | + dist = sched_max_numa_distance; 2258 | + 2259 | + for_each_online_node(node) { 2260 | + score = group_weight(p, node, dist); 2261 | + if (score > max_score) { 2262 | + max_score = score; 2263 | + max_node = node; 2264 | + } 2265 | + } 2266 | + return max_node; 2267 | + } 2268 | + 2269 | + /* 2270 | + * Finding the preferred nid in a system with NUMA backplane 2271 | + * interconnect topology is more involved. The goal is to locate 2272 | + * tasks from numa_groups near each other in the system, and 2273 | + * untangle workloads from different sides of the system. This requires 2274 | + * searching down the hierarchy of node groups, recursively searching 2275 | + * inside the highest scoring group of nodes. 
The nodemask tricks 2276 | + * keep the complexity of the search down. 2277 | + */ 2278 | + nodes = node_online_map; 2279 | + for (dist = sched_max_numa_distance; dist > LOCAL_DISTANCE; dist--) { 2280 | + unsigned long max_faults = 0; 2281 | + nodemask_t max_group = NODE_MASK_NONE; 2282 | + int a, b; 2283 | + 2284 | + /* Are there nodes at this distance from each other? */ 2285 | + if (!find_numa_distance(dist)) 2286 | + continue; 2287 | + 2288 | + for_each_node_mask(a, nodes) { 2289 | + unsigned long faults = 0; 2290 | + nodemask_t this_group; 2291 | + nodes_clear(this_group); 2292 | + 2293 | + /* Sum group's NUMA faults; includes a==b case. */ 2294 | + for_each_node_mask(b, nodes) { 2295 | + if (node_distance(a, b) < dist) { 2296 | + faults += group_faults(p, b); 2297 | + node_set(b, this_group); 2298 | + node_clear(b, nodes); 2299 | + } 2300 | + } 2301 | + 2302 | + /* Remember the top group. */ 2303 | + if (faults > max_faults) { 2304 | + max_faults = faults; 2305 | + max_group = this_group; 2306 | + /* 2307 | + * subtle: at the smallest distance there is 2308 | + * just one node left in each "group", the 2309 | + * winner is the preferred nid. 2310 | + */ 2311 | + nid = a; 2312 | + } 2313 | + } 2314 | + /* Next round, evaluate the nodes within max_group. */ 2315 | + if (!max_faults) 2316 | + break; 2317 | + nodes = max_group; 2318 | + } 2319 | + return nid; 2320 | +} 2321 | + 2322 | +static void task_numa_placement(struct task_struct *p) 2323 | +{ 2324 | + int seq, nid, max_nid = NUMA_NO_NODE; 2325 | + unsigned long max_faults = 0; 2326 | + unsigned long fault_types[2] = { 0, 0 }; 2327 | + unsigned long total_faults; 2328 | + u64 runtime, period; 2329 | + spinlock_t *group_lock = NULL; 2330 | + struct numa_group *ng; 2331 | + 2332 | + /* 2333 | + * The p->mm->numa_scan_seq field gets updated without 2334 | + * exclusive access. 
Use READ_ONCE() here to ensure 2335 | + * that the field is read in a single access: 2336 | + */ 2337 | + seq = READ_ONCE(p->mm->numa_scan_seq); 2338 | + if (p->numa_scan_seq == seq) 2339 | + return; 2340 | + p->numa_scan_seq = seq; 2341 | + p->numa_scan_period_max = task_scan_max(p); 2342 | + 2343 | + total_faults = p->numa_faults_locality[0] + 2344 | + p->numa_faults_locality[1]; 2345 | + runtime = numa_get_avg_runtime(p, &period); 2346 | + 2347 | + /* If the task is part of a group prevent parallel updates to group stats */ 2348 | + ng = deref_curr_numa_group(p); 2349 | + if (ng) { 2350 | + group_lock = &ng->lock; 2351 | + spin_lock_irq(group_lock); 2352 | + } 2353 | + 2354 | + /* Find the node with the highest number of faults */ 2355 | + for_each_online_node(nid) { 2356 | + /* Keep track of the offsets in numa_faults array */ 2357 | + int mem_idx, membuf_idx, cpu_idx, cpubuf_idx; 2358 | + unsigned long faults = 0, group_faults = 0; 2359 | + int priv; 2360 | + 2361 | + for (priv = 0; priv < NR_NUMA_HINT_FAULT_TYPES; priv++) { 2362 | + long diff, f_diff, f_weight; 2363 | + 2364 | + mem_idx = task_faults_idx(NUMA_MEM, nid, priv); 2365 | + membuf_idx = task_faults_idx(NUMA_MEMBUF, nid, priv); 2366 | + cpu_idx = task_faults_idx(NUMA_CPU, nid, priv); 2367 | + cpubuf_idx = task_faults_idx(NUMA_CPUBUF, nid, priv); 2368 | + 2369 | + /* Decay existing window, copy faults since last scan */ 2370 | + diff = p->numa_faults[membuf_idx] - p->numa_faults[mem_idx] / 2; 2371 | + fault_types[priv] += p->numa_faults[membuf_idx]; 2372 | + p->numa_faults[membuf_idx] = 0; 2373 | + 2374 | + /* 2375 | + * Normalize the faults_from, so all tasks in a group 2376 | + * count according to CPU use, instead of by the raw 2377 | + * number of faults. Tasks with little runtime have 2378 | + * little over-all impact on throughput, and thus their 2379 | + * faults are less important. 2380 | + */ 2381 | + f_weight = div64_u64(runtime << 16, period + 1); 2382 | + f_weight = (f_weight * p->numa_faults[cpubuf_idx]) / 2383 | + (total_faults + 1); 2384 | + f_diff = f_weight - p->numa_faults[cpu_idx] / 2; 2385 | + p->numa_faults[cpubuf_idx] = 0; 2386 | + 2387 | + p->numa_faults[mem_idx] += diff; 2388 | + p->numa_faults[cpu_idx] += f_diff; 2389 | + faults += p->numa_faults[mem_idx]; 2390 | + p->total_numa_faults += diff; 2391 | + if (ng) { 2392 | + /* 2393 | + * safe because we can only change our own group 2394 | + * 2395 | + * mem_idx represents the offset for a given 2396 | + * nid and priv in a specific region because it 2397 | + * is at the beginning of the numa_faults array. 
2398 | + */ 2399 | + ng->faults[mem_idx] += diff; 2400 | + ng->faults_cpu[mem_idx] += f_diff; 2401 | + ng->total_faults += diff; 2402 | + group_faults += ng->faults[mem_idx]; 2403 | + } 2404 | + } 2405 | + 2406 | + if (!ng) { 2407 | + if (faults > max_faults) { 2408 | + max_faults = faults; 2409 | + max_nid = nid; 2410 | + } 2411 | + } else if (group_faults > max_faults) { 2412 | + max_faults = group_faults; 2413 | + max_nid = nid; 2414 | + } 2415 | + } 2416 | + 2417 | + if (ng) { 2418 | + numa_group_count_active_nodes(ng); 2419 | + spin_unlock_irq(group_lock); 2420 | + max_nid = preferred_group_nid(p, max_nid); 2421 | + } 2422 | + 2423 | + if (max_faults) { 2424 | + /* Set the new preferred node */ 2425 | + if (max_nid != p->numa_preferred_nid) 2426 | + sched_setnuma(p, max_nid); 2427 | + } 2428 | + 2429 | + update_task_scan_period(p, fault_types[0], fault_types[1]); 2430 | +} 2431 | + 2432 | +static inline int get_numa_group(struct numa_group *grp) 2433 | +{ 2434 | + return refcount_inc_not_zero(&grp->refcount); 2435 | +} 2436 | + 2437 | +static inline void put_numa_group(struct numa_group *grp) 2438 | +{ 2439 | + if (refcount_dec_and_test(&grp->refcount)) 2440 | + kfree_rcu(grp, rcu); 2441 | +} 2442 | + 2443 | +static void task_numa_group(struct task_struct *p, int cpupid, int flags, 2444 | + int *priv) 2445 | +{ 2446 | + struct numa_group *grp, *my_grp; 2447 | + struct task_struct *tsk; 2448 | + bool join = false; 2449 | + int cpu = cpupid_to_cpu(cpupid); 2450 | + int i; 2451 | + 2452 | + if (unlikely(!deref_curr_numa_group(p))) { 2453 | + unsigned int size = sizeof(struct numa_group) + 2454 | + 4*nr_node_ids*sizeof(unsigned long); 2455 | + 2456 | + grp = kzalloc(size, GFP_KERNEL | __GFP_NOWARN); 2457 | + if (!grp) 2458 | + return; 2459 | + 2460 | + refcount_set(&grp->refcount, 1); 2461 | + grp->active_nodes = 1; 2462 | + grp->max_faults_cpu = 0; 2463 | + spin_lock_init(&grp->lock); 2464 | + grp->gid = p->pid; 2465 | + /* Second half of the array tracks nids where faults happen */ 2466 | + grp->faults_cpu = grp->faults + NR_NUMA_HINT_FAULT_TYPES * 2467 | + nr_node_ids; 2468 | + 2469 | + for (i = 0; i < NR_NUMA_HINT_FAULT_STATS * nr_node_ids; i++) 2470 | + grp->faults[i] = p->numa_faults[i]; 2471 | + 2472 | + grp->total_faults = p->total_numa_faults; 2473 | + 2474 | + grp->nr_tasks++; 2475 | + rcu_assign_pointer(p->numa_group, grp); 2476 | + } 2477 | + 2478 | + rcu_read_lock(); 2479 | + tsk = READ_ONCE(cpu_rq(cpu)->curr); 2480 | + 2481 | + if (!cpupid_match_pid(tsk, cpupid)) 2482 | + goto no_join; 2483 | + 2484 | + grp = rcu_dereference(tsk->numa_group); 2485 | + if (!grp) 2486 | + goto no_join; 2487 | + 2488 | + my_grp = deref_curr_numa_group(p); 2489 | + if (grp == my_grp) 2490 | + goto no_join; 2491 | + 2492 | + /* 2493 | + * Only join the other group if its bigger; if we're the bigger group, 2494 | + * the other task will join us. 2495 | + */ 2496 | + if (my_grp->nr_tasks > grp->nr_tasks) 2497 | + goto no_join; 2498 | + 2499 | + /* 2500 | + * Tie-break on the grp address. 2501 | + */ 2502 | + if (my_grp->nr_tasks == grp->nr_tasks && my_grp > grp) 2503 | + goto no_join; 2504 | + 2505 | + /* Always join threads in the same process. 
*/ 2506 | + if (tsk->mm == current->mm) 2507 | + join = true; 2508 | + 2509 | + /* Simple filter to avoid false positives due to PID collisions */ 2510 | + if (flags & TNF_SHARED) 2511 | + join = true; 2512 | + 2513 | + /* Update priv based on whether false sharing was detected */ 2514 | + *priv = !join; 2515 | + 2516 | + if (join && !get_numa_group(grp)) 2517 | + goto no_join; 2518 | + 2519 | + rcu_read_unlock(); 2520 | + 2521 | + if (!join) 2522 | + return; 2523 | + 2524 | + BUG_ON(irqs_disabled()); 2525 | + double_lock_irq(&my_grp->lock, &grp->lock); 2526 | + 2527 | + for (i = 0; i < NR_NUMA_HINT_FAULT_STATS * nr_node_ids; i++) { 2528 | + my_grp->faults[i] -= p->numa_faults[i]; 2529 | + grp->faults[i] += p->numa_faults[i]; 2530 | + } 2531 | + my_grp->total_faults -= p->total_numa_faults; 2532 | + grp->total_faults += p->total_numa_faults; 2533 | + 2534 | + my_grp->nr_tasks--; 2535 | + grp->nr_tasks++; 2536 | + 2537 | + spin_unlock(&my_grp->lock); 2538 | + spin_unlock_irq(&grp->lock); 2539 | + 2540 | + rcu_assign_pointer(p->numa_group, grp); 2541 | + 2542 | + put_numa_group(my_grp); 2543 | + return; 2544 | + 2545 | +no_join: 2546 | + rcu_read_unlock(); 2547 | + return; 2548 | +} 2549 | + 2550 | +/* 2551 | + * Get rid of NUMA statistics associated with a task (either current or dead). 2552 | + * If @final is set, the task is dead and has reached refcount zero, so we can 2553 | + * safely free all relevant data structures. Otherwise, there might be 2554 | + * concurrent reads from places like load balancing and procfs, and we should 2555 | + * reset the data back to default state without freeing ->numa_faults. 2556 | + */ 2557 | +void task_numa_free(struct task_struct *p, bool final) 2558 | +{ 2559 | + /* safe: p either is current or is being freed by current */ 2560 | + struct numa_group *grp = rcu_dereference_raw(p->numa_group); 2561 | + unsigned long *numa_faults = p->numa_faults; 2562 | + unsigned long flags; 2563 | + int i; 2564 | + 2565 | + if (!numa_faults) 2566 | + return; 2567 | + 2568 | + if (grp) { 2569 | + spin_lock_irqsave(&grp->lock, flags); 2570 | + for (i = 0; i < NR_NUMA_HINT_FAULT_STATS * nr_node_ids; i++) 2571 | + grp->faults[i] -= p->numa_faults[i]; 2572 | + grp->total_faults -= p->total_numa_faults; 2573 | + 2574 | + grp->nr_tasks--; 2575 | + spin_unlock_irqrestore(&grp->lock, flags); 2576 | + RCU_INIT_POINTER(p->numa_group, NULL); 2577 | + put_numa_group(grp); 2578 | + } 2579 | + 2580 | + if (final) { 2581 | + p->numa_faults = NULL; 2582 | + kfree(numa_faults); 2583 | + } else { 2584 | + p->total_numa_faults = 0; 2585 | + for (i = 0; i < NR_NUMA_HINT_FAULT_STATS * nr_node_ids; i++) 2586 | + numa_faults[i] = 0; 2587 | + } 2588 | +} 2589 | + 2590 | +/* 2591 | + * Got a PROT_NONE fault for a page on @node. 
2592 | + */ 2593 | +void task_numa_fault(int last_cpupid, int mem_node, int pages, int flags) 2594 | +{ 2595 | + struct task_struct *p = current; 2596 | + bool migrated = flags & TNF_MIGRATED; 2597 | + int cpu_node = task_node(current); 2598 | + int local = !!(flags & TNF_FAULT_LOCAL); 2599 | + struct numa_group *ng; 2600 | + int priv; 2601 | + 2602 | + if (!static_branch_likely(&sched_numa_balancing)) 2603 | + return; 2604 | + 2605 | + /* for example, ksmd faulting in a user's mm */ 2606 | + if (!p->mm) 2607 | + return; 2608 | + 2609 | + /* Allocate buffer to track faults on a per-node basis */ 2610 | + if (unlikely(!p->numa_faults)) { 2611 | + int size = sizeof(*p->numa_faults) * 2612 | + NR_NUMA_HINT_FAULT_BUCKETS * nr_node_ids; 2613 | + 2614 | + p->numa_faults = kzalloc(size, GFP_KERNEL|__GFP_NOWARN); 2615 | + if (!p->numa_faults) 2616 | + return; 2617 | + 2618 | + p->total_numa_faults = 0; 2619 | + memset(p->numa_faults_locality, 0, sizeof(p->numa_faults_locality)); 2620 | + } 2621 | + 2622 | + /* 2623 | + * First accesses are treated as private, otherwise consider accesses 2624 | + * to be private if the accessing pid has not changed 2625 | + */ 2626 | + if (unlikely(last_cpupid == (-1 & LAST_CPUPID_MASK))) { 2627 | + priv = 1; 2628 | + } else { 2629 | + priv = cpupid_match_pid(p, last_cpupid); 2630 | + if (!priv && !(flags & TNF_NO_GROUP)) 2631 | + task_numa_group(p, last_cpupid, flags, &priv); 2632 | + } 2633 | + 2634 | + /* 2635 | + * If a workload spans multiple NUMA nodes, a shared fault that 2636 | + * occurs wholly within the set of nodes that the workload is 2637 | + * actively using should be counted as local. This allows the 2638 | + * scan rate to slow down when a workload has settled down. 2639 | + */ 2640 | + ng = deref_curr_numa_group(p); 2641 | + if (!priv && !local && ng && ng->active_nodes > 1 && 2642 | + numa_is_active_node(cpu_node, ng) && 2643 | + numa_is_active_node(mem_node, ng)) 2644 | + local = 1; 2645 | + 2646 | + /* 2647 | + * Retry to migrate task to preferred node periodically, in case it 2648 | + * previously failed, or the scheduler moved us. 2649 | + */ 2650 | + if (time_after(jiffies, p->numa_migrate_retry)) { 2651 | + task_numa_placement(p); 2652 | + numa_migrate_preferred(p); 2653 | + } 2654 | + 2655 | + if (migrated) 2656 | + p->numa_pages_migrated += pages; 2657 | + if (flags & TNF_MIGRATE_FAIL) 2658 | + p->numa_faults_locality[2] += pages; 2659 | + 2660 | + p->numa_faults[task_faults_idx(NUMA_MEMBUF, mem_node, priv)] += pages; 2661 | + p->numa_faults[task_faults_idx(NUMA_CPUBUF, cpu_node, priv)] += pages; 2662 | + p->numa_faults_locality[local] += pages; 2663 | +} 2664 | + 2665 | +static void reset_ptenuma_scan(struct task_struct *p) 2666 | +{ 2667 | + /* 2668 | + * We only did a read acquisition of the mmap sem, so 2669 | + * p->mm->numa_scan_seq is written to without exclusive access 2670 | + * and the update is not guaranteed to be atomic. That's not 2671 | + * much of an issue though, since this is just used for 2672 | + * statistical sampling. Use READ_ONCE/WRITE_ONCE, which are not 2673 | + * expensive, to avoid any form of compiler optimizations: 2674 | + */ 2675 | + WRITE_ONCE(p->mm->numa_scan_seq, READ_ONCE(p->mm->numa_scan_seq) + 1); 2676 | + p->mm->numa_scan_offset = 0; 2677 | +} 2678 | + 2679 | +/* 2680 | + * The expensive part of numa migration is done from task_work context. 2681 | + * Triggered from task_tick_numa(). 
2682 | + */ 2683 | +static void task_numa_work(struct callback_head *work) 2684 | +{ 2685 | + unsigned long migrate, next_scan, now = jiffies; 2686 | + struct task_struct *p = current; 2687 | + struct mm_struct *mm = p->mm; 2688 | + u64 runtime = p->se.sum_exec_runtime; 2689 | + struct vm_area_struct *vma; 2690 | + unsigned long start, end; 2691 | + unsigned long nr_pte_updates = 0; 2692 | + long pages, virtpages; 2693 | + 2694 | + SCHED_WARN_ON(p != container_of(work, struct task_struct, numa_work)); 2695 | + 2696 | + work->next = work; 2697 | + /* 2698 | + * Who cares about NUMA placement when they're dying. 2699 | + * 2700 | + * NOTE: make sure not to dereference p->mm before this check, 2701 | + * exit_task_work() happens _after_ exit_mm() so we could be called 2702 | + * without p->mm even though we still had it when we enqueued this 2703 | + * work. 2704 | + */ 2705 | + if (p->flags & PF_EXITING) 2706 | + return; 2707 | + 2708 | + if (!mm->numa_next_scan) { 2709 | + mm->numa_next_scan = now + 2710 | + msecs_to_jiffies(sysctl_numa_balancing_scan_delay); 2711 | + } 2712 | + 2713 | + /* 2714 | + * Enforce maximal scan/migration frequency.. 2715 | + */ 2716 | + migrate = mm->numa_next_scan; 2717 | + if (time_before(now, migrate)) 2718 | + return; 2719 | + 2720 | + if (p->numa_scan_period == 0) { 2721 | + p->numa_scan_period_max = task_scan_max(p); 2722 | + p->numa_scan_period = task_scan_start(p); 2723 | + } 2724 | + 2725 | + next_scan = now + msecs_to_jiffies(p->numa_scan_period); 2726 | + if (cmpxchg(&mm->numa_next_scan, migrate, next_scan) != migrate) 2727 | + return; 2728 | + 2729 | + /* 2730 | + * Delay this task enough that another task of this mm will likely win 2731 | + * the next time around. 2732 | + */ 2733 | + p->node_stamp += 2 * TICK_NSEC; 2734 | + 2735 | + start = mm->numa_scan_offset; 2736 | + pages = sysctl_numa_balancing_scan_size; 2737 | + pages <<= 20 - PAGE_SHIFT; /* MB in pages */ 2738 | + virtpages = pages * 8; /* Scan up to this much virtual space */ 2739 | + if (!pages) 2740 | + return; 2741 | + 2742 | + 2743 | + if (!mmap_read_trylock(mm)) 2744 | + return; 2745 | + vma = find_vma(mm, start); 2746 | + if (!vma) { 2747 | + reset_ptenuma_scan(p); 2748 | + start = 0; 2749 | + vma = mm->mmap; 2750 | + } 2751 | + for (; vma; vma = vma->vm_next) { 2752 | + if (!vma_migratable(vma) || !vma_policy_mof(vma) || 2753 | + is_vm_hugetlb_page(vma) || (vma->vm_flags & VM_MIXEDMAP)) { 2754 | + continue; 2755 | + } 2756 | + 2757 | + /* 2758 | + * Shared library pages mapped by multiple processes are not 2759 | + * migrated as it is expected they are cache replicated. Avoid 2760 | + * hinting faults in read-only file-backed mappings or the vdso 2761 | + * as migrating the pages will be of marginal benefit. 2762 | + */ 2763 | + if (!vma->vm_mm || 2764 | + (vma->vm_file && (vma->vm_flags & (VM_READ|VM_WRITE)) == (VM_READ))) 2765 | + continue; 2766 | + 2767 | + /* 2768 | + * Skip inaccessible VMAs to avoid any confusion between 2769 | + * PROT_NONE and NUMA hinting ptes 2770 | + */ 2771 | + if (!vma_is_accessible(vma)) 2772 | + continue; 2773 | + 2774 | + do { 2775 | + start = max(start, vma->vm_start); 2776 | + end = ALIGN(start + (pages << PAGE_SHIFT), HPAGE_SIZE); 2777 | + end = min(end, vma->vm_end); 2778 | + nr_pte_updates = change_prot_numa(vma, start, end); 2779 | + 2780 | + /* 2781 | + * Try to scan sysctl_numa_balancing_size worth of 2782 | + * hpages that have at least one present PTE that 2783 | + * is not already pte-numa. 
If the VMA contains 2784 | + * areas that are unused or already full of prot_numa 2785 | + * PTEs, scan up to virtpages, to skip through those 2786 | + * areas faster. 2787 | + */ 2788 | + if (nr_pte_updates) 2789 | + pages -= (end - start) >> PAGE_SHIFT; 2790 | + virtpages -= (end - start) >> PAGE_SHIFT; 2791 | + 2792 | + start = end; 2793 | + if (pages <= 0 || virtpages <= 0) 2794 | + goto out; 2795 | + 2796 | + cond_resched(); 2797 | + } while (end != vma->vm_end); 2798 | + } 2799 | + 2800 | +out: 2801 | + /* 2802 | + * It is possible to reach the end of the VMA list but the last few 2803 | + * VMAs are not guaranteed to the vma_migratable. If they are not, we 2804 | + * would find the !migratable VMA on the next scan but not reset the 2805 | + * scanner to the start so check it now. 2806 | + */ 2807 | + if (vma) 2808 | + mm->numa_scan_offset = start; 2809 | + else 2810 | + reset_ptenuma_scan(p); 2811 | + mmap_read_unlock(mm); 2812 | + 2813 | + /* 2814 | + * Make sure tasks use at least 32x as much time to run other code 2815 | + * than they used here, to limit NUMA PTE scanning overhead to 3% max. 2816 | + * Usually update_task_scan_period slows down scanning enough; on an 2817 | + * overloaded system we need to limit overhead on a per task basis. 2818 | + */ 2819 | + if (unlikely(p->se.sum_exec_runtime != runtime)) { 2820 | + u64 diff = p->se.sum_exec_runtime - runtime; 2821 | + p->node_stamp += 32 * diff; 2822 | + } 2823 | +} 2824 | + 2825 | +void init_numa_balancing(unsigned long clone_flags, struct task_struct *p) 2826 | +{ 2827 | + int mm_users = 0; 2828 | + struct mm_struct *mm = p->mm; 2829 | + 2830 | + if (mm) { 2831 | + mm_users = atomic_read(&mm->mm_users); 2832 | + if (mm_users == 1) { 2833 | + mm->numa_next_scan = jiffies + msecs_to_jiffies(sysctl_numa_balancing_scan_delay); 2834 | + mm->numa_scan_seq = 0; 2835 | + } 2836 | + } 2837 | + p->node_stamp = 0; 2838 | + p->numa_scan_seq = mm ? mm->numa_scan_seq : 0; 2839 | + p->numa_scan_period = sysctl_numa_balancing_scan_delay; 2840 | + /* Protect against double add, see task_tick_numa and task_numa_work */ 2841 | + p->numa_work.next = &p->numa_work; 2842 | + p->numa_faults = NULL; 2843 | + RCU_INIT_POINTER(p->numa_group, NULL); 2844 | + p->last_task_numa_placement = 0; 2845 | + p->last_sum_exec_runtime = 0; 2846 | + 2847 | + init_task_work(&p->numa_work, task_numa_work); 2848 | + 2849 | + /* New address space, reset the preferred nid */ 2850 | + if (!(clone_flags & CLONE_VM)) { 2851 | + p->numa_preferred_nid = NUMA_NO_NODE; 2852 | + return; 2853 | + } 2854 | + 2855 | + /* 2856 | + * New thread, keep existing numa_preferred_nid which should be copied 2857 | + * already by arch_dup_task_struct but stagger when scans start. 2858 | + */ 2859 | + if (mm) { 2860 | + unsigned int delay; 2861 | + 2862 | + delay = min_t(unsigned int, task_scan_max(current), 2863 | + current->numa_scan_period * mm_users * NSEC_PER_MSEC); 2864 | + delay += 2 * TICK_NSEC; 2865 | + p->node_stamp = delay; 2866 | + } 2867 | +} 2868 | + 2869 | +static void task_tick_numa(struct rq *rq, struct task_struct *curr) 2870 | +{ 2871 | + struct callback_head *work = &curr->numa_work; 2872 | + u64 period, now; 2873 | + 2874 | + /* 2875 | + * We don't care about NUMA placement if we don't have memory. 
2876 | + */ 2877 | + if ((curr->flags & (PF_EXITING | PF_KTHREAD)) || work->next != work) 2878 | + return; 2879 | + 2880 | + /* 2881 | + * Using runtime rather than walltime has the dual advantage that 2882 | + * we (mostly) drive the selection from busy threads and that the 2883 | + * task needs to have done some actual work before we bother with 2884 | + * NUMA placement. 2885 | + */ 2886 | + now = curr->se.sum_exec_runtime; 2887 | + period = (u64)curr->numa_scan_period * NSEC_PER_MSEC; 2888 | + 2889 | + if (now > curr->node_stamp + period) { 2890 | + if (!curr->node_stamp) 2891 | + curr->numa_scan_period = task_scan_start(curr); 2892 | + curr->node_stamp += period; 2893 | + 2894 | + if (!time_before(jiffies, curr->mm->numa_next_scan)) 2895 | + task_work_add(curr, work, TWA_RESUME); 2896 | + } 2897 | +} 2898 | +#else 2899 | +static void account_numa_enqueue(struct rq *rq, struct task_struct *p) {} 2900 | +static inline void account_numa_dequeue(struct rq *rq, struct task_struct *p) {} 2901 | +static inline void update_scan_period(struct task_struct *p, int new_cpu) {} 2902 | +static void task_tick_numa(struct rq *rq, struct task_struct *curr) {} 2903 | +#endif /** CONFIG_NUMA_BALANCING */ 2904 | + 2905 | diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h 2906 | index d53d19770866..693683b31f5f 100644 2907 | --- a/kernel/sched/sched.h 2908 | +++ b/kernel/sched/sched.h 2909 | @@ -545,9 +545,14 @@ struct cfs_rq { 2910 | * It is set to NULL otherwise (i.e when none are currently running). 2911 | */ 2912 | struct sched_entity *curr; 2913 | +#ifdef CONFIG_BS_SCHED 2914 | + struct bs_node *head; 2915 | + struct bs_node *cursor; 2916 | +#else 2917 | struct sched_entity *next; 2918 | struct sched_entity *last; 2919 | struct sched_entity *skip; 2920 | +#endif /** CONFIG_BS_SCHED */ 2921 | 2922 | #ifdef CONFIG_SCHED_DEBUG 2923 | unsigned int nr_spread_over; 2924 | --------------------------------------------------------------------------------