Commit | Line | Data |
---|---|---|
603699bb MCC |
1 | =========== |
2 | Static Keys | |
3 | =========== | |
1cfa60dc | 4 | |
603699bb | 5 | .. warning:: |
412758cb | 6 | |
603699bb | 7 | DEPRECATED API: |
412758cb | 8 | |
603699bb MCC |
9 | The use of 'struct static_key' directly, is now DEPRECATED. In addition |
10 | static_key_{true,false}() is also DEPRECATED. IE DO NOT use the following:: | |
412758cb | 11 | |
603699bb MCC |
12 | struct static_key false = STATIC_KEY_INIT_FALSE; |
13 | struct static_key true = STATIC_KEY_INIT_TRUE; | |
14 | static_key_true() | |
15 | static_key_false() | |
412758cb | 16 | |
603699bb | 17 | The updated API replacements are:: |
1cfa60dc | 18 | |
603699bb MCC |
19 | DEFINE_STATIC_KEY_TRUE(key); |
20 | DEFINE_STATIC_KEY_FALSE(key); | |
21 | DEFINE_STATIC_KEY_ARRAY_TRUE(keys, count); | |
22 | DEFINE_STATIC_KEY_ARRAY_FALSE(keys, count); | |
23 | static_branch_likely() | |
24 | static_branch_unlikely() | |
25 | ||
26 | Abstract | |
27 | ======== | |
1cfa60dc JB |
28 | |
29 | Static keys allows the inclusion of seldom used features in | |
30 | performance-sensitive fast-path kernel code, via a GCC feature and a code | |
603699bb | 31 | patching technique. A quick example:: |
1cfa60dc | 32 | |
412758cb | 33 | DEFINE_STATIC_KEY_FALSE(key); |
1cfa60dc JB |
34 | |
35 | ... | |
36 | ||
412758cb | 37 | if (static_branch_unlikely(&key)) |
1cfa60dc JB |
38 | do unlikely code |
39 | else | |
40 | do likely code | |
41 | ||
42 | ... | |
412758cb | 43 | static_branch_enable(&key); |
1cfa60dc | 44 | ... |
412758cb | 45 | static_branch_disable(&key); |
1cfa60dc JB |
46 | ... |
47 | ||
412758cb | 48 | The static_branch_unlikely() branch will be generated into the code with as little |
1cfa60dc JB |
49 | impact to the likely code path as possible. |
50 | ||
51 | ||
603699bb MCC |
52 | Motivation |
53 | ========== | |
1cfa60dc JB |
54 | |
55 | ||
56 | Currently, tracepoints are implemented using a conditional branch. The | |
57 | conditional check requires checking a global variable for each tracepoint. | |
58 | Although the overhead of this check is small, it increases when the memory | |
59 | cache comes under pressure (memory cache lines for these global variables may | |
60 | be shared with other memory accesses). As we increase the number of tracepoints | |
61 | in the kernel this overhead may become more of an issue. In addition, | |
62 | tracepoints are often dormant (disabled) and provide no direct kernel | |
63 | functionality. Thus, it is highly desirable to reduce their impact as much as | |
64 | possible. Although tracepoints are the original motivation for this work, other | |
65 | kernel code paths should be able to make use of the static keys facility. | |
66 | ||
67 | ||
603699bb MCC |
68 | Solution |
69 | ======== | |
1cfa60dc JB |
70 | |
71 | ||
72 | gcc (v4.5) adds a new 'asm goto' statement that allows branching to a label: | |
73 | ||
74 | http://gcc.gnu.org/ml/gcc-patches/2009-07/msg01556.html | |
75 | ||
76 | Using the 'asm goto', we can create branches that are either taken or not taken | |
77 | by default, without the need to check memory. Then, at run-time, we can patch | |
78 | the branch site to change the branch direction. | |
79 | ||
603699bb | 80 | For example, if we have a simple branch that is disabled by default:: |
1cfa60dc | 81 | |
412758cb | 82 | if (static_branch_unlikely(&key)) |
1cfa60dc JB |
83 | printk("I am the true branch\n"); |
84 | ||
85 | Thus, by default the 'printk' will not be emitted. And the code generated will | |
86 | consist of a single atomic 'no-op' instruction (5 bytes on x86), in the | |
87 | straight-line code path. When the branch is 'flipped', we will patch the | |
88 | 'no-op' in the straight-line codepath with a 'jump' instruction to the | |
89 | out-of-line true branch. Thus, changing branch direction is expensive but | |
90 | branch selection is basically 'free'. That is the basic tradeoff of this | |
91 | optimization. | |
92 | ||
93 | This lowlevel patching mechanism is called 'jump label patching', and it gives | |
94 | the basis for the static keys facility. | |
95 | ||
603699bb MCC |
96 | Static key label API, usage and examples |
97 | ======================================== | |
1cfa60dc JB |
98 | |
99 | ||
603699bb | 100 | In order to make use of this optimization you must first define a key:: |
1cfa60dc | 101 | |
412758cb | 102 | DEFINE_STATIC_KEY_TRUE(key); |
1cfa60dc | 103 | |
603699bb | 104 | or:: |
1cfa60dc | 105 | |
412758cb JB |
106 | DEFINE_STATIC_KEY_FALSE(key); |
107 | ||
1cfa60dc | 108 | |
412758cb | 109 | The key must be global, that is, it can't be allocated on the stack or dynamically |
1cfa60dc JB |
110 | allocated at run-time. |
111 | ||
603699bb | 112 | The key is then used in code as:: |
1cfa60dc | 113 | |
412758cb | 114 | if (static_branch_unlikely(&key)) |
1cfa60dc JB |
115 | do unlikely code |
116 | else | |
117 | do likely code | |
118 | ||
603699bb | 119 | Or:: |
1cfa60dc | 120 | |
412758cb | 121 | if (static_branch_likely(&key)) |
1cfa60dc JB |
122 | do likely code |
123 | else | |
124 | do unlikely code | |
125 | ||
412758cb JB |
126 | Keys defined via DEFINE_STATIC_KEY_TRUE(), or DEFINE_STATIC_KEY_FALSE, may |
127 | be used in either static_branch_likely() or static_branch_unlikely() | |
9bb0e9cb | 128 | statements. |
1cfa60dc | 129 | |
603699bb | 130 | Branch(es) can be set true via:: |
1cfa60dc | 131 | |
603699bb | 132 | static_branch_enable(&key); |
1cfa60dc | 133 | |
603699bb | 134 | or false via:: |
412758cb | 135 | |
603699bb | 136 | static_branch_disable(&key); |
1cfa60dc | 137 | |
603699bb | 138 | The branch(es) can then be switched via reference counts:: |
1cfa60dc | 139 | |
412758cb JB |
140 | static_branch_inc(&key); |
141 | ... | |
142 | static_branch_dec(&key); | |
1cfa60dc | 143 | |
412758cb JB |
144 | Thus, 'static_branch_inc()' means 'make the branch true', and |
145 | 'static_branch_dec()' means 'make the branch false' with appropriate | |
146 | reference counting. For example, if the key is initialized true, a | |
147 | static_branch_dec(), will switch the branch to false. And a subsequent | |
148 | static_branch_inc(), will change the branch back to true. Likewise, if the | |
149 | key is initialized false, a 'static_branch_inc()', will change the branch to | |
150 | true. And then a 'static_branch_dec()', will again make the branch false. | |
1cfa60dc | 151 | |
603699bb | 152 | Where an array of keys is required, it can be defined as:: |
ef0da55a CM |
153 | |
154 | DEFINE_STATIC_KEY_ARRAY_TRUE(keys, count); | |
155 | ||
603699bb | 156 | or:: |
ef0da55a CM |
157 | |
158 | DEFINE_STATIC_KEY_ARRAY_FALSE(keys, count); | |
1cfa60dc JB |
159 | |
160 | 4) Architecture level code patching interface, 'jump labels' | |
161 | ||
162 | ||
163 | There are a few functions and macros that architectures must implement in order | |
164 | to take advantage of this optimization. If there is no architecture support, we | |
3821fd35 JB |
165 | simply fall back to a traditional, load, test, and jump sequence. Also, the |
166 | struct jump_entry table must be at least 4-byte aligned because the | |
167 | static_key->entry field makes use of the two least significant bits. | |
1cfa60dc | 168 | |
603699bb MCC |
169 | * ``select HAVE_ARCH_JUMP_LABEL``, |
170 | see: arch/x86/Kconfig | |
1cfa60dc | 171 | |
603699bb MCC |
172 | * ``#define JUMP_LABEL_NOP_SIZE``, |
173 | see: arch/x86/include/asm/jump_label.h | |
1cfa60dc | 174 | |
603699bb MCC |
175 | * ``__always_inline bool arch_static_branch(struct static_key *key, bool branch)``, |
176 | see: arch/x86/include/asm/jump_label.h | |
412758cb | 177 | |
603699bb MCC |
178 | * ``__always_inline bool arch_static_branch_jump(struct static_key *key, bool branch)``, |
179 | see: arch/x86/include/asm/jump_label.h | |
1cfa60dc | 180 | |
603699bb MCC |
181 | * ``void arch_jump_label_transform(struct jump_entry *entry, enum jump_label_type type)``, |
182 | see: arch/x86/kernel/jump_label.c | |
1cfa60dc | 183 | |
603699bb MCC |
184 | * ``__init_or_module void arch_jump_label_transform_static(struct jump_entry *entry, enum jump_label_type type)``, |
185 | see: arch/x86/kernel/jump_label.c | |
1cfa60dc | 186 | |
603699bb MCC |
187 | * ``struct jump_entry``, |
188 | see: arch/x86/include/asm/jump_label.h | |
1cfa60dc JB |
189 | |
190 | ||
191 | 5) Static keys / jump label analysis, results (x86_64): | |
192 | ||
193 | ||
194 | As an example, let's add the following branch to 'getppid()', such that the | |
603699bb | 195 | system call now looks like:: |
1cfa60dc | 196 | |
603699bb MCC |
197 | SYSCALL_DEFINE0(getppid) |
198 | { | |
1cfa60dc JB |
199 | int pid; |
200 | ||
603699bb MCC |
201 | + if (static_branch_unlikely(&key)) |
202 | + printk("I am the true branch\n"); | |
1cfa60dc JB |
203 | |
204 | rcu_read_lock(); | |
205 | pid = task_tgid_vnr(rcu_dereference(current->real_parent)); | |
206 | rcu_read_unlock(); | |
207 | ||
208 | return pid; | |
603699bb MCC |
209 | } |
210 | ||
211 | The resulting instructions with jump labels generated by GCC is:: | |
212 | ||
213 | ffffffff81044290 <sys_getppid>: | |
214 | ffffffff81044290: 55 push %rbp | |
215 | ffffffff81044291: 48 89 e5 mov %rsp,%rbp | |
216 | ffffffff81044294: e9 00 00 00 00 jmpq ffffffff81044299 <sys_getppid+0x9> | |
217 | ffffffff81044299: 65 48 8b 04 25 c0 b6 mov %gs:0xb6c0,%rax | |
218 | ffffffff810442a0: 00 00 | |
219 | ffffffff810442a2: 48 8b 80 80 02 00 00 mov 0x280(%rax),%rax | |
220 | ffffffff810442a9: 48 8b 80 b0 02 00 00 mov 0x2b0(%rax),%rax | |
221 | ffffffff810442b0: 48 8b b8 e8 02 00 00 mov 0x2e8(%rax),%rdi | |
222 | ffffffff810442b7: e8 f4 d9 00 00 callq ffffffff81051cb0 <pid_vnr> | |
223 | ffffffff810442bc: 5d pop %rbp | |
224 | ffffffff810442bd: 48 98 cltq | |
225 | ffffffff810442bf: c3 retq | |
226 | ffffffff810442c0: 48 c7 c7 e3 54 98 81 mov $0xffffffff819854e3,%rdi | |
227 | ffffffff810442c7: 31 c0 xor %eax,%eax | |
228 | ffffffff810442c9: e8 71 13 6d 00 callq ffffffff8171563f <printk> | |
229 | ffffffff810442ce: eb c9 jmp ffffffff81044299 <sys_getppid+0x9> | |
230 | ||
231 | Without the jump label optimization it looks like:: | |
232 | ||
233 | ffffffff810441f0 <sys_getppid>: | |
234 | ffffffff810441f0: 8b 05 8a 52 d8 00 mov 0xd8528a(%rip),%eax # ffffffff81dc9480 <key> | |
235 | ffffffff810441f6: 55 push %rbp | |
236 | ffffffff810441f7: 48 89 e5 mov %rsp,%rbp | |
237 | ffffffff810441fa: 85 c0 test %eax,%eax | |
238 | ffffffff810441fc: 75 27 jne ffffffff81044225 <sys_getppid+0x35> | |
239 | ffffffff810441fe: 65 48 8b 04 25 c0 b6 mov %gs:0xb6c0,%rax | |
240 | ffffffff81044205: 00 00 | |
241 | ffffffff81044207: 48 8b 80 80 02 00 00 mov 0x280(%rax),%rax | |
242 | ffffffff8104420e: 48 8b 80 b0 02 00 00 mov 0x2b0(%rax),%rax | |
243 | ffffffff81044215: 48 8b b8 e8 02 00 00 mov 0x2e8(%rax),%rdi | |
244 | ffffffff8104421c: e8 2f da 00 00 callq ffffffff81051c50 <pid_vnr> | |
245 | ffffffff81044221: 5d pop %rbp | |
246 | ffffffff81044222: 48 98 cltq | |
247 | ffffffff81044224: c3 retq | |
248 | ffffffff81044225: 48 c7 c7 13 53 98 81 mov $0xffffffff81985313,%rdi | |
249 | ffffffff8104422c: 31 c0 xor %eax,%eax | |
250 | ffffffff8104422e: e8 60 0f 6d 00 callq ffffffff81715193 <printk> | |
251 | ffffffff81044233: eb c9 jmp ffffffff810441fe <sys_getppid+0xe> | |
252 | ffffffff81044235: 66 66 2e 0f 1f 84 00 data32 nopw %cs:0x0(%rax,%rax,1) | |
253 | ffffffff8104423c: 00 00 00 00 | |
1cfa60dc JB |
254 | |
255 | Thus, the disable jump label case adds a 'mov', 'test' and 'jne' instruction | |
256 | vs. the jump label case just has a 'no-op' or 'jmp 0'. (The jmp 0, is patched | |
257 | to a 5 byte atomic no-op instruction at boot-time.) Thus, the disabled jump | |
603699bb | 258 | label case adds:: |
1cfa60dc | 259 | |
603699bb | 260 | 6 (mov) + 2 (test) + 2 (jne) = 10 - 5 (5 byte jump 0) = 5 addition bytes. |
1cfa60dc JB |
261 | |
262 | If we then include the padding bytes, the jump label code saves, 16 total bytes | |
c94bed8e | 263 | of instruction memory for this small function. In this case the non-jump label |
c79a8d85 | 264 | function is 80 bytes long. Thus, we have saved 20% of the instruction |
1cfa60dc JB |
265 | footprint. We can in fact improve this even further, since the 5-byte no-op |
266 | really can be a 2-byte no-op since we can reach the branch with a 2-byte jmp. | |
267 | However, we have not yet implemented optimal no-op sizes (they are currently | |
268 | hard-coded). | |
269 | ||
270 | Since there are a number of static key API uses in the scheduler paths, | |
271 | 'pipe-test' (also known as 'perf bench sched pipe') can be used to show the | |
272 | performance improvement. Testing done on 3.3.0-rc2: | |
273 | ||
603699bb | 274 | jump label disabled:: |
1cfa60dc JB |
275 | |
276 | Performance counter stats for 'bash -c /tmp/pipe-test' (50 runs): | |
277 | ||
278 | 855.700314 task-clock # 0.534 CPUs utilized ( +- 0.11% ) | |
279 | 200,003 context-switches # 0.234 M/sec ( +- 0.00% ) | |
280 | 0 CPU-migrations # 0.000 M/sec ( +- 39.58% ) | |
281 | 487 page-faults # 0.001 M/sec ( +- 0.02% ) | |
282 | 1,474,374,262 cycles # 1.723 GHz ( +- 0.17% ) | |
283 | <not supported> stalled-cycles-frontend | |
284 | <not supported> stalled-cycles-backend | |
285 | 1,178,049,567 instructions # 0.80 insns per cycle ( +- 0.06% ) | |
286 | 208,368,926 branches # 243.507 M/sec ( +- 0.06% ) | |
287 | 5,569,188 branch-misses # 2.67% of all branches ( +- 0.54% ) | |
288 | ||
289 | 1.601607384 seconds time elapsed ( +- 0.07% ) | |
290 | ||
603699bb | 291 | jump label enabled:: |
1cfa60dc JB |
292 | |
293 | Performance counter stats for 'bash -c /tmp/pipe-test' (50 runs): | |
294 | ||
295 | 841.043185 task-clock # 0.533 CPUs utilized ( +- 0.12% ) | |
296 | 200,004 context-switches # 0.238 M/sec ( +- 0.00% ) | |
297 | 0 CPU-migrations # 0.000 M/sec ( +- 40.87% ) | |
298 | 487 page-faults # 0.001 M/sec ( +- 0.05% ) | |
299 | 1,432,559,428 cycles # 1.703 GHz ( +- 0.18% ) | |
300 | <not supported> stalled-cycles-frontend | |
301 | <not supported> stalled-cycles-backend | |
302 | 1,175,363,994 instructions # 0.82 insns per cycle ( +- 0.04% ) | |
303 | 206,859,359 branches # 245.956 M/sec ( +- 0.04% ) | |
304 | 4,884,119 branch-misses # 2.36% of all branches ( +- 0.85% ) | |
305 | ||
306 | 1.579384366 seconds time elapsed | |
307 | ||
308 | The percentage of saved branches is .7%, and we've saved 12% on | |
309 | 'branch-misses'. This is where we would expect to get the most savings, since | |
310 | this optimization is about reducing the number of branches. In addition, we've | |
311 | saved .2% on instructions, and 2.8% on cycles and 1.4% on elapsed time. |