Commit | Line | Data |
---|---|---|
06955392 CD |
1 | .. SPDX-License-Identifier: GPL-2.0 |
2 | ||
3 | =============================== | |
4 | Kernel level exception handling | |
5 | =============================== | |
6 | ||
7 | Commentary by Joerg Pommnitz <joerg@raleigh.ibm.com> | |
1da177e4 | 8 | |
3697cd9a AW |
9 | When a process runs in kernel mode, it often has to access user |
10 | mode memory whose address has been passed by an untrusted program. | |
1da177e4 LT |
11 | To protect itself the kernel has to verify this address. |
12 | ||
3697cd9a AW |
13 | In older versions of Linux this was done with the |
14 | int verify_area(int type, const void * addr, unsigned long size) | |
720a8459 | 15 | function (which has since been replaced by access_ok()). |
1da177e4 | 16 | |
3697cd9a | 17 | This function verified that the memory area starting at address |
670e9f34 | 18 | 'addr' and of size 'size' was accessible for the operation specified |
3697cd9a AW |
19 | in type (read or write). To do this, verify_area had to look up the |
20 | virtual memory area (vma) that contained the address addr. In the | |
21 | normal case (correctly working program), this test was successful. | |
1da177e4 LT |
22 | It only failed for a few buggy programs. In some kernel profiling |
23 | tests, this normally unneeded verification used up a considerable | |
24 | amount of time. | |
25 | ||
3697cd9a | 26 | To overcome this situation, Linus decided to let the virtual memory |
1da177e4 LT |
27 | hardware present in every Linux-capable CPU handle this test. |
28 | ||
29 | How does this work? | |
30 | ||
3697cd9a AW |
31 | Whenever the kernel tries to access an address that is currently not |
32 | accessible, the CPU generates a page fault exception and calls the | |
06955392 | 33 | page fault handler:: |
1da177e4 | 34 | |
06955392 | 35 | void do_page_fault(struct pt_regs *regs, unsigned long error_code) |
1da177e4 | 36 | |
3697cd9a | 37 | in arch/x86/mm/fault.c. The parameters on the stack are set up by |
9db9b767 | 38 | the low level assembly glue in arch/x86/entry/entry_32.S. The parameter |
3697cd9a | 39 | regs is a pointer to the saved registers on the stack, error_code |
1da177e4 LT |
40 | contains a reason code for the exception. |
41 | ||
3697cd9a AW |
42 | do_page_fault first obtains the inaccessible address from the CPU |
43 | control register CR2. If the address is within the virtual address | |
44 | space of the process, the fault probably occurred because the page | |
45 | was not swapped in, write protected or something similar. However, | |
46 | we are interested in the other case: the address is not valid, there | |
47 | is no vma that contains this address. In this case, the kernel jumps | |
48 | to the bad_area label. | |
49 | ||
50 | There it uses the address of the instruction that caused the exception | |
51 | (i.e. regs->eip) to find an address where the execution can continue | |
52 | (fixup). If this search is successful, the fault handler modifies the | |
53 | return address (again regs->eip) and returns. The execution will | |
1da177e4 LT |
54 | continue at the address in fixup. |
55 | ||
56 | Where does fixup point to? | |
57 | ||
3697cd9a AW |
58 | Since we jump to the contents of fixup, fixup obviously points |
59 | to executable code. This code is hidden inside the user access macros. | |
60 | I have picked the get_user macro defined in arch/x86/include/asm/uaccess.h | |
61 | as an example. The definition is somewhat hard to follow, so let's peek at | |
1da177e4 | 62 | the code generated by the preprocessor and the compiler. I selected |
3697cd9a | 63 | the get_user call in drivers/char/sysrq.c for a detailed examination. |
1da177e4 | 64 | |
06955392 CD |
65 | The original code in sysrq.c line 587:: |
66 | ||
1da177e4 LT |
67 | get_user(c, buf); |
68 | ||
06955392 CD |
69 | The preprocessor output (edited to become somewhat readable):: |
70 | ||
71 | ( | |
72 | { | |
73 | long __gu_err = - 14 , __gu_val = 0; | |
74 | const __typeof__(*( ( buf ) )) *__gu_addr = ((buf)); | |
75 | if (((((0 + current_set[0])->tss.segment) == 0x18 ) || | |
76 | (((sizeof(*(buf))) <= 0xC0000000UL) && | |
77 | ((unsigned long)(__gu_addr ) <= 0xC0000000UL - (sizeof(*(buf))))))) | |
78 | do { | |
79 | __gu_err = 0; | |
80 | switch ((sizeof(*(buf)))) { | |
81 | case 1: | |
82 | __asm__ __volatile__( | |
83 | "1: mov" "b" " %2,%" "b" "1\n" | |
84 | "2:\n" | |
85 | ".section .fixup,\"ax\"\n" | |
86 | "3: movl %3,%0\n" | |
87 | " xor" "b" " %" "b" "1,%" "b" "1\n" | |
88 | " jmp 2b\n" | |
89 | ".section __ex_table,\"a\"\n" | |
90 | " .align 4\n" | |
91 | " .long 1b,3b\n" | |
92 | ".text" : "=r"(__gu_err), "=q" (__gu_val): "m"((*(struct __large_struct *) | |
93 | ( __gu_addr )) ), "i"(- 14 ), "0"( __gu_err )) ; | |
94 | break; | |
95 | case 2: | |
96 | __asm__ __volatile__( | |
97 | "1: mov" "w" " %2,%" "w" "1\n" | |
98 | "2:\n" | |
99 | ".section .fixup,\"ax\"\n" | |
100 | "3: movl %3,%0\n" | |
101 | " xor" "w" " %" "w" "1,%" "w" "1\n" | |
102 | " jmp 2b\n" | |
103 | ".section __ex_table,\"a\"\n" | |
104 | " .align 4\n" | |
105 | " .long 1b,3b\n" | |
106 | ".text" : "=r"(__gu_err), "=r" (__gu_val) : "m"((*(struct __large_struct *) | |
107 | ( __gu_addr )) ), "i"(- 14 ), "0"( __gu_err )); | |
108 | break; | |
109 | case 4: | |
110 | __asm__ __volatile__( | |
111 | "1: mov" "l" " %2,%" "" "1\n" | |
112 | "2:\n" | |
113 | ".section .fixup,\"ax\"\n" | |
114 | "3: movl %3,%0\n" | |
115 | " xor" "l" " %" "" "1,%" "" "1\n" | |
116 | " jmp 2b\n" | |
117 | ".section __ex_table,\"a\"\n" | |
118 | " .align 4\n" " .long 1b,3b\n" | |
119 | ".text" : "=r"(__gu_err), "=r" (__gu_val) : "m"((*(struct __large_struct *) | |
120 | ( __gu_addr )) ), "i"(- 14 ), "0"(__gu_err)); | |
121 | break; | |
122 | default: | |
123 | (__gu_val) = __get_user_bad(); | |
124 | } | |
125 | } while (0) ; | |
126 | ((c)) = (__typeof__(*((buf))))__gu_val; | |
127 | __gu_err; | |
128 | } | |
129 | ); | |
1da177e4 LT |
130 | |
131 | WOW! Black GCC/assembly magic. This is impossible to follow, so let's | |
06955392 | 132 | see what code gcc generates:: |
1da177e4 LT |
133 | |
134 | > xorl %edx,%edx | |
135 | > movl current_set,%eax | |
3697cd9a AW |
136 | > cmpl $24,788(%eax) |
137 | > je .L1424 | |
1da177e4 | 138 | > cmpl $-1073741825,64(%esp) |
3697cd9a | 139 | > ja .L1423 |
1da177e4 | 140 | > .L1424: |
3697cd9a | 141 | > movl %edx,%eax |
1da177e4 LT |
142 | > movl 64(%esp),%ebx |
143 | > #APP | |
144 | > 1: movb (%ebx),%dl /* this is the actual user access */ | |
145 | > 2: | |
146 | > .section .fixup,"ax" | |
147 | > 3: movl $-14,%eax | |
148 | > xorb %dl,%dl | |
149 | > jmp 2b | |
150 | > .section __ex_table,"a" | |
151 | > .align 4 | |
152 | > .long 1b,3b | |
153 | > .text | |
154 | > #NO_APP | |
155 | > .L1423: | |
156 | > movzbl %dl,%esi | |
157 | ||
3697cd9a AW |
158 | The optimizer does a good job and gives us something we can actually |
159 | understand. Can we? The actual user access is quite obvious. Thanks | |
160 | to the unified address space we can just access the address in user | |
1da177e4 LT |
161 | memory. But what does the .section stuff do????? |
162 | ||
06955392 | 163 | To understand this we have to look at the final kernel:: |
1da177e4 LT |
164 | |
165 | > objdump --section-headers vmlinux | |
3697cd9a | 166 | > |
1da177e4 | 167 | > vmlinux: file format elf32-i386 |
3697cd9a | 168 | > |
1da177e4 LT |
169 | > Sections: |
170 | > Idx Name Size VMA LMA File off Algn | |
171 | > 0 .text 00098f40 c0100000 c0100000 00001000 2**4 | |
172 | > CONTENTS, ALLOC, LOAD, READONLY, CODE | |
173 | > 1 .fixup 000016bc c0198f40 c0198f40 00099f40 2**0 | |
174 | > CONTENTS, ALLOC, LOAD, READONLY, CODE | |
175 | > 2 .rodata 0000f127 c019a5fc c019a5fc 0009b5fc 2**2 | |
176 | > CONTENTS, ALLOC, LOAD, READONLY, DATA | |
177 | > 3 __ex_table 000015c0 c01a9724 c01a9724 000aa724 2**2 | |
178 | > CONTENTS, ALLOC, LOAD, READONLY, DATA | |
179 | > 4 .data 0000ea58 c01abcf0 c01abcf0 000abcf0 2**4 | |
180 | > CONTENTS, ALLOC, LOAD, DATA | |
181 | > 5 .bss 00018e21 c01ba748 c01ba748 000ba748 2**2 | |
182 | > ALLOC | |
183 | > 6 .comment 00000ec4 00000000 00000000 000ba748 2**0 | |
184 | > CONTENTS, READONLY | |
185 | > 7 .note 00001068 00000ec4 00000ec4 000bb60c 2**0 | |
186 | > CONTENTS, READONLY | |
187 | ||
188 | There are obviously two non-standard ELF sections in the generated object | |
189 | file. But first we want to find out what happened to our code in the | |
06955392 | 190 | final kernel executable:: |
1da177e4 LT |
191 | |
192 | > objdump --disassemble --section=.text vmlinux | |
193 | > | |
194 | > c017e785 <do_con_write+c1> xorl %edx,%edx | |
195 | > c017e787 <do_con_write+c3> movl 0xc01c7bec,%eax | |
196 | > c017e78c <do_con_write+c8> cmpl $0x18,0x314(%eax) | |
197 | > c017e793 <do_con_write+cf> je c017e79f <do_con_write+db> | |
198 | > c017e795 <do_con_write+d1> cmpl $0xbfffffff,0x40(%esp,1) | |
199 | > c017e79d <do_con_write+d9> ja c017e7a7 <do_con_write+e3> | |
200 | > c017e79f <do_con_write+db> movl %edx,%eax | |
201 | > c017e7a1 <do_con_write+dd> movl 0x40(%esp,1),%ebx | |
202 | > c017e7a5 <do_con_write+e1> movb (%ebx),%dl | |
203 | > c017e7a7 <do_con_write+e3> movzbl %dl,%esi | |
204 | ||
205 | The whole user memory access is reduced to 10 x86 machine instructions. | |
206 | The instructions bracketed in the .section directives are no longer | |
3697cd9a | 207 | in the normal execution path. They are located in a different section |
06955392 | 208 | of the executable file:: |
1da177e4 LT |
209 | |
210 | > objdump --disassemble --section=.fixup vmlinux | |
3697cd9a | 211 | > |
1da177e4 LT |
212 | > c0199ff5 <.fixup+10b5> movl $0xfffffff2,%eax |
213 | > c0199ffa <.fixup+10ba> xorb %dl,%dl | |
214 | > c0199ffc <.fixup+10bc> jmp c017e7a7 <do_con_write+e3> | |
215 | ||
06955392 CD |
216 | And finally:: |
217 | ||
1da177e4 | 218 | > objdump --full-contents --section=__ex_table vmlinux |
3697cd9a | 219 | > |
1da177e4 LT |
220 | > c01aa7c4 93c017c0 e09f19c0 97c017c0 99c017c0 ................ |
221 | > c01aa7d4 f6c217c0 e99f19c0 a5e717c0 f59f19c0 ................ | |
222 | > c01aa7e4 080a18c0 01a019c0 0a0a18c0 04a019c0 ................ | |
223 | ||
06955392 | 224 | or in human readable byte order:: |
1da177e4 LT |
225 | |
226 | > c01aa7c4 c017c093 c0199fe0 c017c097 c017c099 ................ | |
227 | > c01aa7d4 c017c2f6 c0199fe9 c017e7a5 c0199ff5 ................ | |
228 | ^^^^^^^^^^^^^^^^^ | |
229 | this is the interesting part! | |
230 | > c01aa7e4 c0180a08 c019a001 c0180a0a c019a004 ................ | |
231 | ||
06955392 | 232 | What happened? The assembly directives:: |
1da177e4 | 233 | |
06955392 CD |
234 | .section .fixup,"ax" |
235 | .section __ex_table,"a" | |
1da177e4 LT |
236 | |
237 | told the assembler to move the following code to the specified | |
06955392 CD |
238 | sections in the ELF object file. So the instructions:: |
239 | ||
240 | 3: movl $-14,%eax | |
241 | xorb %dl,%dl | |
242 | jmp 2b | |
243 | ||
244 | ended up in the .fixup section of the object file and the addresses:: | |
245 | ||
1da177e4 | 246 | .long 1b,3b |
06955392 | 247 | |
1da177e4 | 248 | ended up in the __ex_table section of the object file. 1b and 3b |
3697cd9a AW |
249 | are local labels. The local label 1b (1b stands for next label 1 |
250 | backward) is the address of the instruction that might fault, i.e. | |
1da177e4 LT |
251 | in our case the address of the label 1 is c017e7a5: |
252 | the original assembly code: > 1: movb (%ebx),%dl | |
253 | and linked in vmlinux : > c017e7a5 <do_con_write+e1> movb (%ebx),%dl | |
254 | ||
255 | The local label 3 (backwards again) is the address of the code to handle | |
256 | the fault, in our case the actual value is c0199ff5: | |
257 | the original assembly code: > 3: movl $-14,%eax | |
258 | and linked in vmlinux : > c0199ff5 <.fixup+10b5> movl $0xfffffff2,%eax | |
259 | ||
abcb1e02 ND |
260 | If the fixup was able to handle the exception, control flow may be returned |
261 | to the instruction after the one that triggered the fault, i.e. local label 2b. | |
262 | ||
06955392 CD |
263 | The assembly code:: |
264 | ||
1da177e4 LT |
265 | > .section __ex_table,"a" |
266 | > .align 4 | |
267 | > .long 1b,3b | |
268 | ||
06955392 CD |
269 | becomes the value pair:: |
270 | ||
1da177e4 LT |
271 | > c01aa7d4 c017c2f6 c0199fe9 c017e7a5 c0199ff5 ................ |
272 | ^this is ^this is | |
3697cd9a | 273 | 1b 3b |
06955392 | 274 | |
1da177e4 LT |
275 | c017e7a5,c0199ff5 in the exception table of the kernel. |
276 | ||
277 | So, what actually happens if a fault from kernel mode with no suitable | |
278 | vma occurs? | |
279 | ||
06955392 CD |
280 | #. access to invalid address:: |
281 | ||
282 | > c017e7a5 <do_con_write+e1> movb (%ebx),%dl | |
283 | #. MMU generates exception | |
284 | #. CPU calls do_page_fault | |
285 | #. do_page_fault calls search_exception_table (regs->eip == c017e7a5); | |
286 | #. search_exception_table looks up the address c017e7a5 in the | |
287 | exception table (i.e. the contents of the ELF section __ex_table) | |
288 | and returns the address of the associated fault handling code c0199ff5. | |
289 | #. do_page_fault modifies its own return address to point to the fault | |
290 | handling code and returns. | |
291 | #. execution continues in the fault handling code. | |
292 | #. a) EAX becomes -EFAULT (== -14) | |
293 | b) DL becomes zero (the value we "read" from user space) | |
294 | c) execution continues at local label 2 (address of the | |
295 | instruction immediately after the faulting user access). | |
1da177e4 LT |
296 | |
297 | The steps 8a to 8c in a certain way emulate the faulting instruction. | |
298 | ||
299 | That's it, mostly. If you look at our example, you might ask why | |
300 | we set EAX to -EFAULT in the exception handler code. Well, the | |
301 | get_user macro actually returns a value: 0, if the user access was | |
302 | successful, -EFAULT on failure. Our original code did not test this | |
303 | return value; however, the inline assembly code in get_user tries to | |
304 | return -EFAULT. GCC selected EAX to return this value. | |
305 | ||
306 | NOTE: | |
307 | Due to the way that the exception table is built and needs to be ordered, | |
308 | only use exceptions for code in the .text section. Any other section | |
309 | will cause the exception table to not be sorted correctly, and the | |
310 | exceptions will fail. | |
548acf19 TL |
311 | |
312 | Things changed when 64-bit support was added to x86 Linux. Rather than | |
313 | double the size of the exception table by expanding the two entries | |
314 | from 32 bits to 64 bits, a clever trick was used to store addresses | |
315 | as relative offsets from the table itself. The assembly code changed | |
06955392 CD |
316 | from:: |
317 | ||
318 | .long 1b,3b | |
319 | to: | |
320 | .long (from) - . | |
321 | .long (to) - . | |
548acf19 TL |
322 | |
323 | and the C-code that uses these values converts back to absolute addresses | |
06955392 | 324 | like this:: |
548acf19 TL |
325 | |
326 | unsigned long ex_insn_addr(const struct exception_table_entry *x) | |
327 | { | |
328 | return (unsigned long)&x->insn + x->insn; | |
329 | } | |
330 | ||
331 | In v4.6 the exception table entry was expanded with a new field "handler". | |
332 | This is also 32 bits wide and contains a third relative function | |
333 | pointer which points to one of: | |
334 | ||
06955392 CD |
335 | 1) ``int ex_handler_default(const struct exception_table_entry *fixup)`` |
336 | This is the legacy case that just jumps to the fixup code. | |
337 | ||
338 | 2) ``int ex_handler_fault(const struct exception_table_entry *fixup)`` | |
339 | This case provides the fault number of the trap that occurred at | |
340 | entry->insn. It is used to distinguish page faults from machine | |
341 | checks. | |
342 | ||
343 | 3) ``int ex_handler_ext(const struct exception_table_entry *fixup)`` | |
344 | This case is used for uaccess_err ... we need to set a flag | |
345 | in the task structure. Before the handler functions existed this | |
346 | case was handled by adding a large offset to the fixup to tag | |
347 | it as special. | |
348 | ||
548acf19 | 349 | More functions can easily be added. |
abcb1e02 ND |
350 | |
351 | CONFIG_BUILDTIME_TABLE_SORT allows the __ex_table section to be sorted post | |
352 | link of the kernel image, via a host utility scripts/sorttable. It will set the | |
353 | symbol main_extable_sort_needed to 0, avoiding sorting the __ex_table section | |
354 | at boot time. With the exception table sorted, at runtime when an exception | |
355 | occurs we can quickly look up the __ex_table entry via binary search. | |
356 | ||
357 | This is not just a boot time optimization: some architectures require this | |
358 | table to be sorted in order to handle exceptions relatively early in the boot | |
359 | process. For example, i386 makes use of this form of exception handling before | |
360 | paging support is even enabled! |
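
The relative-offset encoding and the binary search over a sorted table can be modelled in a few lines of user-space C. The sketch below is only an illustration under stated assumptions, not the kernel's implementation: `struct rel_entry`, `struct image`, `rel_entry_set`, `rel_insn_addr`, `rel_fixup_addr`, `rel_search` and `build_demo_table` are all hypothetical names, and the "instructions" are just bytes in an array. The fake code and the table live in one object so that every offset is guaranteed to fit in 32 bits.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of a relative table entry (not the kernel's struct). */
struct rel_entry {
    int32_t insn;   /* faulting instruction, relative to &insn */
    int32_t fixup;  /* fixup code, relative to &fixup */
};

/* Assumption: fake "code" and table share one object, so offsets are small. */
struct image {
    unsigned char text[64];
    unsigned char fixup[64];
    struct rel_entry table[3];
};

static struct image img;

/* Store both targets as offsets from the field that holds them. */
static void rel_entry_set(struct rel_entry *e, const void *insn, const void *fixup)
{
    e->insn  = (int32_t)((intptr_t)insn  - (intptr_t)&e->insn);
    e->fixup = (int32_t)((intptr_t)fixup - (intptr_t)&e->fixup);
}

/* Convert back to absolute addresses, mirroring ex_insn_addr() above. */
static uintptr_t rel_insn_addr(const struct rel_entry *e)
{
    return (uintptr_t)((intptr_t)&e->insn + e->insn);
}

static uintptr_t rel_fixup_addr(const struct rel_entry *e)
{
    return (uintptr_t)((intptr_t)&e->fixup + e->fixup);
}

/* Binary search; the table must be sorted by absolute instruction address. */
static const struct rel_entry *rel_search(const struct rel_entry *tbl, size_t n,
                                          uintptr_t addr)
{
    size_t lo = 0, hi = n;

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        uintptr_t a = rel_insn_addr(&tbl[mid]);

        if (a == addr)
            return &tbl[mid];
        if (a < addr)
            lo = mid + 1;
        else
            hi = mid;
    }
    return NULL; /* no fixup registered for this address */
}

/* Fill the table in ascending instruction order, i.e. already sorted. */
static void build_demo_table(void)
{
    rel_entry_set(&img.table[0], &img.text[3],  &img.fixup[0]);
    rel_entry_set(&img.table[1], &img.text[17], &img.fixup[9]);
    rel_entry_set(&img.table[2], &img.text[40], &img.fixup[21]);
}
```

After calling `build_demo_table()`, looking up the address of `img.text[17]` with `rel_search(img.table, 3, ...)` yields the second entry, and `rel_fixup_addr()` on that entry gives back the address of `img.fixup[9]`. An address with no entry returns NULL, which corresponds to the case where the fault cannot be fixed up.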