[linux-block.git] / Documentation / scheduler / schedutil.rst

=========
Schedutil
=========

.. note::

   All this assumes a linear relation between frequency and work capacity,
   we know this is flawed, but it is the best workable approximation.


PELT (Per Entity Load Tracking)
===============================

With PELT we track some metrics across the various scheduler entities, from
individual tasks to task-group slices to CPU runqueues. As the basis for this
we use an Exponentially Weighted Moving Average (EWMA), each period (1024us)
is decayed such that y^32 = 0.5. That is, the most recent 32ms contribute
half, while the rest of history contribute the other half.

Specifically:

  ewma_sum(u) := u_0 + u_1*y + u_2*y^2 + ...

  ewma(u) = ewma_sum(u) / ewma_sum(1)

Since this is essentially a progression of an infinite geometric series, the
results are composable, that is ewma(A) + ewma(B) = ewma(A+B). This property
is key, since it gives the ability to recompose the averages when tasks move
around.

Note that blocked tasks still contribute to the aggregates (task-group slices
and CPU runqueues), which reflects their expected contribution when they
resume running.

Using this we track 2 key metrics: 'running' and 'runnable'. 'Running'
reflects the time an entity spends on the CPU, while 'runnable' reflects the
time an entity spends on the runqueue. When there is only a single task these
two metrics are the same, but once there is contention for the CPU 'running'
will decrease to reflect the fraction of time each task spends on the CPU
while 'runnable' will increase to reflect the amount of contention.

For more detail see: kernel/sched/pelt.c


Frequency / CPU Invariance
==========================

Because consuming the CPU for 50% at 1GHz is not the same as consuming the CPU
for 50% at 2GHz, nor is running 50% on a LITTLE CPU the same as running 50% on
a big CPU, we allow architectures to scale the time delta with two ratios, one
Dynamic Voltage and Frequency Scaling (DVFS) ratio and one microarch ratio.

For simple DVFS architectures (where software is in full control) we trivially
compute the ratio as::

	    f_cur
  r_dvfs := -----
            f_max

For more dynamic systems where the hardware is in control of DVFS we use
hardware counters (Intel APERF/MPERF, ARMv8.4-AMU) to provide us this ratio.
For Intel specifically, we use::

	   APERF
  f_cur := ----- * P0
	   MPERF

	     4C-turbo;	if available and turbo enabled
  f_max := { 1C-turbo;	if turbo enabled
	     P0;	otherwise

                    f_cur
  r_dvfs := min( 1, ----- )
                    f_max

We pick 4C turbo over 1C turbo to make it slightly more sustainable.

r_cpu is determined as the ratio of highest performance level of the current
CPU vs the highest performance level of any other CPU in the system.

  r_tot = r_dvfs * r_cpu

The result is that the above 'running' and 'runnable' metrics become invariant
of DVFS and CPU type. IOW. we can transfer and compare them between CPUs.

For more detail see:

 - kernel/sched/pelt.h:update_rq_clock_pelt()
 - arch/x86/kernel/smpboot.c:"APERF/MPERF frequency ratio computation."
 - Documentation/scheduler/sched-capacity.rst:"1. CPU Capacity + 2. Task utilization"


UTIL_EST / UTIL_EST_FASTUP
==========================

Because periodic tasks have their averages decayed while they sleep, even
though when running their expected utilization will be the same, they suffer a
(DVFS) ramp-up after they are running again.

To alleviate this (a default enabled option) UTIL_EST drives an Infinite
Impulse Response (IIR) EWMA with the 'running' value on dequeue -- when it is
highest. A further default enabled option UTIL_EST_FASTUP modifies the IIR
filter to instantly increase and only decay on decrease.

A further runqueue wide sum (of runnable tasks) is maintained of:

  util_est := \Sum_t max( t_running, t_util_est_ewma )

For more detail see: kernel/sched/fair.c:util_est_dequeue()


UCLAMP
======

It is possible to set effective u_min and u_max clamps on each CFS or RT task;
the runqueue keeps an max aggregate of these clamps for all running tasks.

For more detail see: include/uapi/linux/sched/types.h


Schedutil / DVFS
================

Every time the scheduler load tracking is updated (task wakeup, task
migration, time progression) we call out to schedutil to update the hardware
DVFS state.

The basis is the CPU runqueue's 'running' metric, which per the above it is
the frequency invariant utilization estimate of the CPU. From this we compute
a desired frequency like::

             max( running, util_est );	if UTIL_EST
  u_cfs := { running;			otherwise

               clamp( u_cfs + u_rt , u_min, u_max );	if UCLAMP_TASK
  u_clamp := { u_cfs + u_rt;				otherwise

  u := u_clamp + u_irq + u_dl;		[approx. see source for more detail]

  f_des := min( f_max, 1.25 u * f_max )

XXX IO-wait: when the update is due to a task wakeup from IO-completion we
boost 'u' above.

This frequency is then used to select a P-state/OPP or directly munged into a
CPPC style request to the hardware.

XXX: deadline tasks (Sporadic Task Model) allows us to calculate a hard f_min
required to satisfy the workload.

Because these callbacks are directly from the scheduler, the DVFS hardware
interaction should be 'fast' and non-blocking. Schedutil supports
rate-limiting DVFS requests for when hardware interaction is slow and
expensive, this reduces effectiveness.

For more information see: kernel/sched/cpufreq_schedutil.c


NOTES
=====

 - On low-load scenarios, where DVFS is most relevant, the 'running' numbers
   will closely reflect utilization.

 - In saturated scenarios task movement will cause some transient dips,
   suppose we have a CPU saturated with 4 tasks, then when we migrate a task
   to an idle CPU, the old CPU will have a 'running' value of 0.75 while the
   new CPU will gain 0.25. This is inevitable and time progression will
   correct this. XXX do we still guarantee f_max due to no idle-time?

 - Much of the above is about avoiding DVFS dips, and independent DVFS domains
   having to re-learn / ramp-up when load shifts.
Commit	Line	Data
b57b8496 TY	1	=========
	2	Schedutil
	3	=========
0301925d	4
b57b8496	5	.. note::
0301925d	6
b57b8496 TY	7	All this assumes a linear relation between frequency and work capacity,
b57b8496 TY	8	we know this is flawed, but it is the best workable approximation.
0301925d PZ	9
	10
	11	PELT (Per Entity Load Tracking)
b57b8496	12	===============================
0301925d PZ	13
	14	With PELT we track some metrics across the various scheduler entities, from
	15	individual tasks to task-group slices to CPU runqueues. As the basis for this
	16	we use an Exponentially Weighted Moving Average (EWMA), each period (1024us)
	17	is decayed such that y^32 = 0.5. That is, the most recent 32ms contribute
	18	half, while the rest of history contribute the other half.
	19
	20	Specifically:
	21
	22	ewma_sum(u) := u_0 + u_1y + u_2y^2 + ...
	23
	24	ewma(u) = ewma_sum(u) / ewma_sum(1)
	25
	26	Since this is essentially a progression of an infinite geometric series, the
	27	results are composable, that is ewma(A) + ewma(B) = ewma(A+B). This property
	28	is key, since it gives the ability to recompose the averages when tasks move
	29	around.
	30
	31	Note that blocked tasks still contribute to the aggregates (task-group slices
	32	and CPU runqueues), which reflects their expected contribution when they
	33	resume running.
	34
	35	Using this we track 2 key metrics: 'running' and 'runnable'. 'Running'
	36	reflects the time an entity spends on the CPU, while 'runnable' reflects the
	37	time an entity spends on the runqueue. When there is only a single task these
	38	two metrics are the same, but once there is contention for the CPU 'running'
	39	will decrease to reflect the fraction of time each task spends on the CPU
	40	while 'runnable' will increase to reflect the amount of contention.
	41
	42	For more detail see: kernel/sched/pelt.c
	43
	44
b57b8496 TY	45	Frequency / CPU Invariance
b57b8496 TY	46	==========================
0301925d PZ	47
	48	Because consuming the CPU for 50% at 1GHz is not the same as consuming the CPU
	49	for 50% at 2GHz, nor is running 50% on a LITTLE CPU the same as running 50% on
	50	a big CPU, we allow architectures to scale the time delta with two ratios, one
	51	Dynamic Voltage and Frequency Scaling (DVFS) ratio and one microarch ratio.
	52
	53	For simple DVFS architectures (where software is in full control) we trivially
b57b8496	54	compute the ratio as::
0301925d PZ	55
	56	f_cur
	57	r_dvfs := -----
	58	f_max
	59
	60	For more dynamic systems where the hardware is in control of DVFS we use
	61	hardware counters (Intel APERF/MPERF, ARMv8.4-AMU) to provide us this ratio.
b57b8496	62	For Intel specifically, we use::
0301925d PZ	63
	64	APERF
	65	f_cur := ----- * P0
	66	MPERF
	67
	68	4C-turbo; if available and turbo enabled
	69	f_max := { 1C-turbo; if turbo enabled
	70	P0; otherwise
	71
	72	f_cur
	73	r_dvfs := min( 1, ----- )
	74	f_max
	75
	76	We pick 4C turbo over 1C turbo to make it slightly more sustainable.
	77
	78	r_cpu is determined as the ratio of highest performance level of the current
	79	CPU vs the highest performance level of any other CPU in the system.
	80
	81	r_tot = r_dvfs * r_cpu
	82
	83	The result is that the above 'running' and 'runnable' metrics become invariant
	84	of DVFS and CPU type. IOW. we can transfer and compare them between CPUs.
	85
	86	For more detail see:
	87
	88	- kernel/sched/pelt.h:update_rq_clock_pelt()
	89	- arch/x86/kernel/smpboot.c:"APERF/MPERF frequency ratio computation."
	90	- Documentation/scheduler/sched-capacity.rst:"1. CPU Capacity + 2. Task utilization"
	91
	92
	93	UTIL_EST / UTIL_EST_FASTUP
b57b8496	94	==========================
0301925d PZ	95
	96	Because periodic tasks have their averages decayed while they sleep, even
	97	though when running their expected utilization will be the same, they suffer a
	98	(DVFS) ramp-up after they are running again.
	99
	100	To alleviate this (a default enabled option) UTIL_EST drives an Infinite
	101	Impulse Response (IIR) EWMA with the 'running' value on dequeue -- when it is
	102	highest. A further default enabled option UTIL_EST_FASTUP modifies the IIR
	103	filter to instantly increase and only decay on decrease.
	104
	105	A further runqueue wide sum (of runnable tasks) is maintained of:
	106
	107	util_est := \Sum_t max( t_running, t_util_est_ewma )
	108
	109	For more detail see: kernel/sched/fair.c:util_est_dequeue()
	110
	111
	112	UCLAMP
b57b8496	113	======
0301925d PZ	114
	115	It is possible to set effective u_min and u_max clamps on each CFS or RT task;
	116	the runqueue keeps an max aggregate of these clamps for all running tasks.
	117
	118	For more detail see: include/uapi/linux/sched/types.h
	119
	120
	121	Schedutil / DVFS
b57b8496	122	================
0301925d PZ	123
	124	Every time the scheduler load tracking is updated (task wakeup, task
	125	migration, time progression) we call out to schedutil to update the hardware
	126	DVFS state.
	127
	128	The basis is the CPU runqueue's 'running' metric, which per the above it is
	129	the frequency invariant utilization estimate of the CPU. From this we compute
b57b8496	130	a desired frequency like::
0301925d PZ	131
	132	max( running, util_est ); if UTIL_EST
	133	u_cfs := { running; otherwise
	134
	135	clamp( u_cfs + u_rt , u_min, u_max ); if UCLAMP_TASK
	136	u_clamp := { u_cfs + u_rt; otherwise
	137
	138	u := u_clamp + u_irq + u_dl; [approx. see source for more detail]
	139
	140	f_des := min( f_max, 1.25 u * f_max )
	141
b57b8496	142	XXX IO-wait: when the update is due to a task wakeup from IO-completion we
0301925d PZ	143	boost 'u' above.
	144
	145	This frequency is then used to select a P-state/OPP or directly munged into a
	146	CPPC style request to the hardware.
	147
	148	XXX: deadline tasks (Sporadic Task Model) allows us to calculate a hard f_min
	149	required to satisfy the workload.
	150
	151	Because these callbacks are directly from the scheduler, the DVFS hardware
	152	interaction should be 'fast' and non-blocking. Schedutil supports
	153	rate-limiting DVFS requests for when hardware interaction is slow and
	154	expensive, this reduces effectiveness.
	155
	156	For more information see: kernel/sched/cpufreq_schedutil.c
	157
	158
	159	NOTES
b57b8496	160	=====
0301925d PZ	161
	162	- On low-load scenarios, where DVFS is most relevant, the 'running' numbers
	163	will closely reflect utilization.
	164
	165	- In saturated scenarios task movement will cause some transient dips,
	166	suppose we have a CPU saturated with 4 tasks, then when we migrate a task
	167	to an idle CPU, the old CPU will have a 'running' value of 0.75 while the
	168	new CPU will gain 0.25. This is inevitable and time progression will
	169	correct this. XXX do we still guarantee f_max due to no idle-time?
	170
	171	- Much of the above is about avoiding DVFS dips, and independent DVFS domains
	172	having to re-learn / ramp-up when load shifts.
	173