=================
Scheduler Domains
=================

Each CPU has a "base" scheduling domain (struct sched_domain). The domain
hierarchy is built from these base domains via the ->parent pointer. ->parent
MUST be NULL terminated, and domain structures should be per-CPU as they are
locklessly updated.

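For illustration, walking a CPU's hierarchy from the base domain to the top
looks roughly like this (a sketch; the kernel wraps this pattern in its
for_each_domain() iterator, and base_domain_of() here is a hypothetical
stand-in for the per-CPU lookup)::

  struct sched_domain *sd;

  /* Follow ->parent upwards from the base domain; the chain is
   * NULL terminated, so the loop stops past the top-level domain. */
  for (sd = base_domain_of(cpu); sd; sd = sd->parent) {
          /* inspect sd->span, sd->groups, ... */
  }
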
Each scheduling domain spans a number of CPUs (stored in the ->span field).
A domain's span MUST be a superset of its child's span (this restriction could
be relaxed if the need arises), and a base domain for CPU i MUST span at least
i. The top domain for each CPU will generally span all CPUs in the system,
although strictly it doesn't have to; if it doesn't, some CPUs may never be
given tasks to run unless the CPUs allowed mask is explicitly set. A sched
domain's span means "balance process load among these CPUs".

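Expressed with the kernel's cpumask helpers, the two MUST constraints above
amount to checks along these lines (a sketch; sched_domain_span() is the
accessor for a domain's ->span)::

  /* A base domain for CPU i must span at least i ... */
  BUG_ON(!cpumask_test_cpu(cpu, sched_domain_span(sd)));

  /* ... and a domain's span must be a subset of its parent's. */
  if (sd->parent)
          BUG_ON(!cpumask_subset(sched_domain_span(sd),
                                 sched_domain_span(sd->parent)));
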
Each scheduling domain must have one or more CPU groups (struct sched_group)
which are organised as a circular one way linked list from the ->groups
pointer. The union of cpumasks of these groups MUST be the same as the
domain's span. The group pointed to by the ->groups pointer MUST contain the
CPU to which the domain belongs. Groups may be shared among CPUs as they
contain read-only data after they have been set up. The intersection of
cpumasks from any two of these groups may be non-empty. If this is the case
the SD_OVERLAP flag is set on the corresponding scheduling domain and its
groups may not be shared between CPUs.

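Since the list is circular, code that needs to visit every group in a domain
exactly once follows the do/while pattern used throughout the scheduler::

  struct sched_group *group = sd->groups;

  /* The first group contains the CPU owning the domain; stop once
   * the ->next pointers wrap around to the start. */
  do {
          /* examine or balance against this group ... */
          group = group->next;
  } while (group != sd->groups);
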
Balancing within a sched domain occurs between groups. That is, each group
is treated as one entity. The load of a group is defined as the sum of the
load of each of its member CPUs, and only when the load of a group becomes
out of balance are tasks moved between groups.

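In other words, a group's load is obtained by summing over its members,
roughly as follows (a sketch; sched_group_span() is the accessor for a
group's cpumask, and cpu_load_of() is a hypothetical stand-in for the
kernel's per-CPU load tracking)::

  static unsigned long group_load(struct sched_group *group)
  {
          unsigned long load = 0;
          int cpu;

          /* Sum the load of every CPU that is a member of this group. */
          for_each_cpu(cpu, sched_group_span(group))
                  load += cpu_load_of(cpu);

          return load;
  }
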
In kernel/sched/core.c, trigger_load_balance() is run periodically on each CPU
through scheduler_tick(). It raises a softirq once the next regularly scheduled
rebalancing event for the current runqueue is due. The actual load balancing
workhorse, run_rebalance_domains()->rebalance_domains(), is then run in softirq
context (SCHED_SOFTIRQ).

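The trigger itself is small; conceptually it does little more than the
following (simplified: the real function also handles NOHZ kicks and CPUs
attached to a NULL domain)::

  void trigger_load_balance(struct rq *rq)
  {
          /* rq->next_balance holds, in jiffies, when the next
           * rebalance of this runqueue is due. */
          if (time_after_eq(jiffies, rq->next_balance))
                  raise_softirq(SCHED_SOFTIRQ);
  }
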
The latter function takes two arguments: the current CPU and whether it was
idle at the time the scheduler_tick() happened. It iterates over all sched
domains our CPU is on, starting from its base domain and going up the ->parent
chain. While doing that, it checks to see if the current domain has exhausted
its rebalance interval. If so, it runs load_balance() on that domain. It then
checks the parent sched_domain (if it exists), and the parent of the parent,
and so forth.

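The walk up the chain, with its per-domain interval check, looks roughly
like this (simplified: ->balance_interval is kept in ms and ->last_balance
in jiffies, and the real code also rechecks idleness and adjusts the
interval)::

  struct sched_domain *sd;

  for_each_domain(cpu, sd) {
          unsigned long interval = msecs_to_jiffies(sd->balance_interval);

          if (time_after_eq(jiffies, sd->last_balance + interval)) {
                  /* run load_balance() on this domain ... */
                  sd->last_balance = jiffies;
          }
  }
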
Initially, load_balance() finds the busiest group in the current sched domain.
If it succeeds, it looks for the busiest runqueue of all the CPUs' runqueues in
that group. If it manages to find such a runqueue, it locks both our initial
CPU's runqueue and the newly found busiest one and starts moving tasks from it
to our runqueue. The number of tasks moved corresponds to the imbalance
previously computed while iterating over this sched domain's groups.

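Condensed, that control flow reads as follows (helper signatures simplified;
the real find_busiest_group() and find_busiest_queue() take several more
parameters)::

  group = find_busiest_group(sd, this_cpu, &imbalance);
  if (!group)
          goto out_balanced;

  busiest = find_busiest_queue(group);
  if (!busiest)
          goto out_balanced;

  /* Take both runqueue locks in a deadlock-safe order, then pull
   * up to `imbalance` worth of load over to our own runqueue. */
  double_rq_lock(this_rq, busiest);
  move_tasks(this_rq, this_cpu, busiest, imbalance);
  double_rq_unlock(this_rq, busiest);
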
Implementing sched domains
==========================

The "base" domain will "span" the first level of the hierarchy. In the case
of SMT, you'll span all siblings of the physical CPU, with each group being
a single virtual CPU.

In SMP, the parent of the base domain will span all physical CPUs in the
node, with each group being a single physical CPU. Then with NUMA, the parent
of the SMP domain will span the entire machine, with each group having the
cpumask of a node. Or you could do multi-level NUMA; Opteron, for example,
might have just one domain covering its one NUMA level.

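As a sketch, modelled on the per-arch setup code of that era, an SMT base
level with a physical-level parent for CPU i might be built like this
(cpu_domains and phys_domains are illustrative per-CPU variables; group
setup is elided)::

  struct sched_domain *sd;

  /* SMT level: span the hardware siblings of CPU i. */
  sd = &per_cpu(cpu_domains, i);
  *sd = SD_SIBLING_INIT;
  sd->span = cpu_sibling_map[i];
  sd->parent = &per_cpu(phys_domains, i);

  /* Physical level: span all CPUs in i's node. */
  sd = &per_cpu(phys_domains, i);
  *sd = SD_CPU_INIT;
  sd->span = node_to_cpumask(cpu_to_node(i));
  sd->parent = NULL;
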
The implementor should read comments in include/linux/sched.h:
struct sched_domain fields, SD_FLAG_*, SD_*_INIT to get an idea of
the specifics and what to tune.

Architectures may override the default SD_*_INIT flags while using the generic
domain builder in kernel/sched/core.c if they wish to retain the traditional
SMT->SMP->NUMA topology (or some subset of that). This can be done by
#define'ing ARCH_HAS_SCHED_TUNE.

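For example, an architecture might keep the generic builder but supply its
own SMT-level initializer from its topology header (field values below are
purely illustrative)::

  #define ARCH_HAS_SCHED_TUNE

  #define SD_SIBLING_INIT (struct sched_domain) {        \
          .min_interval           = 1,                    \
          .max_interval           = 2,                    \
          .flags                  = SD_BALANCE_NEWIDLE    \
                                  | SD_SHARE_CPUPOWER,    \
  }
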
Alternatively, the architecture may completely override the generic domain
builder by #define'ing ARCH_HAS_SCHED_DOMAIN, and exporting its own
arch_init_sched_domains function. This function will attach domains to all
CPUs using cpu_attach_domain.

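A skeleton of such an override (error handling and group setup elided;
base_domains is an illustrative per-CPU variable)::

  #define ARCH_HAS_SCHED_DOMAIN

  void arch_init_sched_domains(void)
  {
          int i;

          for_each_online_cpu(i) {
                  struct sched_domain *sd = &per_cpu(base_domains, i);

                  /* fill in sd->span, sd->groups and sd->parent,
                   * then attach the hierarchy to the CPU. */
                  cpu_attach_domain(sd, i);
          }
  }
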
The sched-domains debugging infrastructure can be enabled via
CONFIG_SCHED_DEBUG. Doing so enables an error-checking parse of the sched
domains which should catch most possible errors (described above). It also
prints out the domain structure in a visual format.