.. _kernel_hacking_hack:

============================================
Unreliable Guide To Hacking The Linux Kernel
============================================

:Author: Rusty Russell

Introduction
============

Welcome, gentle reader, to Rusty's Remarkably Unreliable Guide to Linux
Kernel Hacking. This document describes the common routines and general
requirements for kernel code: its goal is to serve as a primer for Linux
kernel development for experienced C programmers. I avoid implementation
details: that's what the code is for, and I ignore whole tracts of
useful routines.

Before you read this, please understand that I never wanted to write
this document, being grossly under-qualified, but I always wanted to
read it, and this was the only way. I hope it will grow into a
compendium of best practice, common starting points and random
information.

The Players
===========

At any time each of the CPUs in a system can be:

-  not associated with any process, serving a hardware interrupt;

-  not associated with any process, serving a softirq or tasklet;

-  running in kernel space, associated with a process (user context);

-  running a process in user space.

There is an ordering between these. The bottom two can preempt each
other, but above that is a strict hierarchy: each can only be preempted
by the ones above it. For example, while a softirq is running on a CPU,
no other softirq will preempt it, but a hardware interrupt can. However,
any other CPUs in the system execute independently.

We'll see a number of ways that the user context can block interrupts,
to become truly non-preemptable.

User Context
------------

User context is when you are coming in from a system call or other trap:
like userspace, you can be preempted by more important tasks and by
interrupts. You can sleep, by calling :c:func:`schedule()`.

.. note::

   You are always in user context on module load and unload, and on
   operations on the block device layer.

In user context, the ``current`` pointer (indicating the task we are
currently executing) is valid, and :c:func:`in_interrupt()`
(``include/linux/preempt.h``) is false.

.. warning::

   Beware that if you have preemption or softirqs disabled (see below),
   :c:func:`in_interrupt()` will return a false positive.

Hardware Interrupts (Hard IRQs)
-------------------------------

Timer ticks, network cards and keyboards are examples of real hardware
which produce interrupts at any time. The kernel runs interrupt
handlers, which service the hardware. The kernel guarantees that this
handler is never re-entered: if the same interrupt arrives, it is queued
(or dropped). Because it disables interrupts, this handler has to be
fast: frequently it simply acknowledges the interrupt, marks a 'software
interrupt' for execution and exits.

You can tell you are in a hardware interrupt, because
:c:func:`in_hardirq()` returns true.

.. warning::

   Beware that this will return a false positive if interrupts are
   disabled (see below).

Software Interrupt Context: Softirqs and Tasklets
-------------------------------------------------

Whenever a system call is about to return to userspace, or a hardware
interrupt handler exits, any 'software interrupts' which are marked
pending (usually by hardware interrupts) are run (``kernel/softirq.c``).

Much of the real interrupt handling work is done here. Early in the
transition to SMP, there were only 'bottom halves' (BHs), which didn't
take advantage of multiple CPUs. Shortly after we switched from wind-up
computers made of match-sticks and snot, we abandoned this limitation
and switched to 'softirqs'.

``include/linux/interrupt.h`` lists the different softirqs. A very
important softirq is the timer softirq (``include/linux/timer.h``): you
can register to have it call functions for you in a given length of
time.

Softirqs are often a pain to deal with, since the same softirq will run
simultaneously on more than one CPU. For this reason, tasklets
(``include/linux/interrupt.h``) are more often used: they are
dynamically-registrable (meaning you can have as many as you want), and
they also guarantee that any tasklet will only run on one CPU at any
time, although different tasklets can run simultaneously.

.. warning::

   The name 'tasklet' is misleading: they have nothing to do with
   'tasks'.

You can tell you are in a softirq (or tasklet) using the
:c:func:`in_softirq()` macro (``include/linux/preempt.h``).

.. warning::

   Beware that this will return a false positive if a
   :ref:`bottom half lock <local_bh_disable>` is held.

Some Basic Rules
================

No memory protection
    If you corrupt memory, whether in user context or interrupt context,
    the whole machine will crash. Are you sure you can't do what you
    want in userspace?

No floating point or MMX
    The FPU context is not saved; even in user context the FPU state
    probably won't correspond with the current process: you would mess
    with some user process' FPU state. If you really want to do this,
    you would have to explicitly save/restore the full FPU state (and
    avoid context switches). It is generally a bad idea; use fixed point
    arithmetic first.

A rigid stack limit
    Depending on configuration options the kernel stack is about 3K to
    6K for most 32-bit architectures: it's about 14K on most 64-bit
    archs, and often shared with interrupts so you can't use it all.
    Avoid deep recursion and huge local arrays on the stack (allocate
    them dynamically instead).

The Linux kernel is portable
    Let's keep it that way. Your code should be 64-bit clean, and
    endian-independent. You should also minimize CPU specific stuff,
    e.g. inline assembly should be cleanly encapsulated and minimized to
    ease porting. Generally it should be restricted to the
    architecture-dependent part of the kernel tree.

ioctls: Not writing a new system call
=====================================

A system call generally looks like this::

    asmlinkage long sys_mycall(int arg)
    {
            return 0;
    }


First, in most cases you don't want to create a new system call. You
create a character device and implement an appropriate ioctl for it.
This is much more flexible than system calls, doesn't have to be entered
in every architecture's ``include/asm/unistd.h`` and
``arch/kernel/entry.S`` file, and is much more likely to be accepted by
Linus.

If all your routine does is read or write some parameter, consider
implementing a :c:func:`sysfs()` interface instead.

Inside the ioctl you're in user context to a process. When an error
occurs you return a negated errno (see
``include/uapi/asm-generic/errno-base.h``,
``include/uapi/asm-generic/errno.h`` and ``include/linux/errno.h``),
otherwise you return 0.

After you have slept you should check if a signal occurred: the
Unix/Linux way of handling signals is to temporarily exit the system
call with the ``-ERESTARTSYS`` error. The system call entry code will
switch back to user context, process the signal handler and then your
system call will be restarted (unless the user disabled that). So you
should be prepared to process the restart, e.g. if you're in the middle
of manipulating some data structure.

::

    if (signal_pending(current))
            return -ERESTARTSYS;


If you're doing longer computations: first think userspace. If you
**really** want to do it in kernel you should regularly check if you need
to give up the CPU (remember there is cooperative multitasking per CPU).
Idiom::

    cond_resched(); /* Will sleep */


A short note on interface design: the UNIX system call motto is "Provide
mechanism not policy".

Recipes for Deadlock
====================

You cannot call any routines which may sleep, unless:

-  You are in user context.

-  You do not own any spinlocks.

-  You have interrupts enabled (actually, Andi Kleen says that the
   scheduling code will enable them for you, but that's probably not
   what you wanted).

Note that some functions may sleep implicitly: common ones are the user
space access functions (\*_user) and memory allocation functions
without ``GFP_ATOMIC``.

You should always compile your kernel with ``CONFIG_DEBUG_ATOMIC_SLEEP``
enabled, and it will warn you if you break these rules. If you **do**
break the rules, you will eventually lock up your box.

Really.

Common Routines
===============

:c:func:`printk()`
------------------

Defined in ``include/linux/printk.h``

:c:func:`printk()` feeds kernel messages to the console, dmesg, and
the syslog daemon. It is useful for debugging and reporting errors, and
can be used inside interrupt context, but use with caution: a machine
which has its console flooded with printk messages is unusable. It uses
a format string mostly compatible with ANSI C printf, and C string
concatenation to give it a first "priority" argument::

    printk(KERN_INFO "i = %u\n", i);


See ``include/linux/kern_levels.h`` for other ``KERN_`` values; these
are interpreted by syslog as the level. Special case: for printing an IP
address use::

    __be32 ipaddress;
    printk(KERN_INFO "my ip: %pI4\n", &ipaddress);


:c:func:`printk()` internally uses a 1K buffer and does not catch
overruns. Make sure that will be enough.

.. note::

   You will know when you are a real kernel hacker when you start
   typoing printf as printk in your user programs :)

.. note::

   Another sidenote: the original Unix Version 6 sources had a comment
   on top of its printf function: "Printf should not be used for
   chit-chat". You should follow that advice.

:c:func:`copy_to_user()` / :c:func:`copy_from_user()` / :c:func:`get_user()` / :c:func:`put_user()`
---------------------------------------------------------------------------------------------------

Defined in ``include/linux/uaccess.h`` / ``asm/uaccess.h``

**[SLEEPS]**

:c:func:`put_user()` and :c:func:`get_user()` are used to get
and put single values (such as an int, char, or long) from and to
userspace. A pointer into userspace should never be simply dereferenced:
data should be copied using these routines. Both return ``-EFAULT`` or
0.

:c:func:`copy_to_user()` and :c:func:`copy_from_user()` are
more general: they copy an arbitrary amount of data to and from
userspace.

.. warning::

   Unlike :c:func:`put_user()` and :c:func:`get_user()`, they
   return the amount of uncopied data (i.e. 0 still means success).

[Yes, this objectionable interface makes me cringe. The flamewar comes
up every year or so. --RR.]

These functions may sleep implicitly. They should never be called
outside user context (it makes no sense), with interrupts disabled, or
with a spinlock held.

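The return-value asymmetry described above is easy to get wrong, so here
is a minimal userspace sketch of the usual idiom: treat any non-zero
return from a copy routine as failure and turn it into ``-EFAULT``.
``model_copy_from_user()``, ``struct request`` and ``fetch_request()``
are hypothetical stand-ins invented for this illustration, and the
``valid`` parameter merely simulates how much of the source is readable.
None of this is the kernel's implementation.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical userspace model: copies at most 'valid' bytes and, like
 * copy_from_user(), returns the number of bytes that were NOT copied.
 */
static unsigned long model_copy_from_user(void *to, const void *from,
                                          unsigned long n,
                                          unsigned long valid)
{
        unsigned long done = n < valid ? n : valid;

        assert(to != NULL && from != NULL);
        memcpy(to, from, done);
        return n - done;
}

/* Hypothetical request layout, used only for this example. */
struct request {
        int cmd;
        int arg;
};

/* Returns 0 on success or a negated errno, following kernel convention. */
static int fetch_request(struct request *req, const void *ubuf,
                         unsigned long valid)
{
        /* Any uncopied byte means the whole request is unusable. */
        if (model_copy_from_user(req, ubuf, sizeof(*req), valid))
                return -EFAULT;
        return 0;
}
```

The point of the wrapper is that callers only ever see 0 or a negated
errno, never the raw "bytes left over" count.
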
:c:func:`kmalloc()`/:c:func:`kfree()`
-------------------------------------

Defined in ``include/linux/slab.h``

**[MAY SLEEP: SEE BELOW]**

These routines are used to dynamically request pointer-aligned chunks of
memory, like malloc and free do in userspace, but
:c:func:`kmalloc()` takes an extra flag word. Important values:

``GFP_KERNEL``
    May sleep and swap to free memory. Only allowed in user context, but
    is the most reliable way to allocate memory.

``GFP_ATOMIC``
    Don't sleep. Less reliable than ``GFP_KERNEL``, but may be called
    from interrupt context. You should **really** have a good
    out-of-memory error-handling strategy.

``GFP_DMA``
    Allocate ISA DMA lower than 16MB. If you don't know what that is you
    don't need it. Very unreliable.

If you see a "sleeping function called from invalid context" warning
message, then maybe you called a sleeping allocation function from
interrupt context without ``GFP_ATOMIC``. You should really fix that.
Run, don't walk.

If you are allocating at least ``PAGE_SIZE`` (``asm/page.h`` or
``asm/page_types.h``) bytes, consider using :c:func:`__get_free_pages()`
(``include/linux/gfp.h``). It takes an order argument (0 for page sized,
1 for double page, 2 for four pages etc.) and the same memory priority
flag word as above.

If you are allocating more than a page worth of bytes you can use
:c:func:`vmalloc()`. It'll allocate virtual memory in the kernel
map. This block is not contiguous in physical memory, but the MMU makes
it look like it is for you (so it'll only look contiguous to the CPUs,
not to external device drivers). If you really need large physically
contiguous memory for some weird device, you have a problem: it is
poorly supported in Linux because after some time memory fragmentation
in a running kernel makes it hard. The best way is to allocate the block
early in the boot process via the :c:func:`alloc_bootmem()`
routine.

Before inventing your own cache of often-used objects consider using a
slab cache in ``include/linux/slab.h``.

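To make the order argument of :c:func:`__get_free_pages()` concrete:
order ``n`` means ``2^n`` contiguous pages. The userspace sketch below
computes the smallest order that covers a given byte count.
``my_get_order()`` is a hypothetical helper written for this
illustration, and the page size is passed in explicitly rather than
taken from ``PAGE_SIZE``; it makes no claim to match the kernel's own
helper.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: smallest 'order' such that
 * (page_size << order) >= size. Assumes the result fits in size_t.
 */
static unsigned int my_get_order(size_t size, size_t page_size)
{
        unsigned int order = 0;
        size_t span = page_size;

        assert(page_size > 0);
        while (span < size) {
                span <<= 1;     /* each order doubles the block */
                order++;
        }
        return order;
}
```

With a 4096-byte page, 4096 bytes need order 0, 4097 bytes already need
order 1, and 8193 bytes need order 2, which is why sizes just over a
power of two are wasteful with this interface.
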
:c:macro:`current`
------------------

Defined in ``include/asm/current.h``

This global variable (really a macro) contains a pointer to the current
task structure, so is only valid in user context. For example, when a
process makes a system call, this will point to the task structure of
the calling process. It is **not NULL** in interrupt context.

:c:func:`mdelay()`/:c:func:`udelay()`
-------------------------------------

Defined in ``include/asm/delay.h`` / ``include/linux/delay.h``

The :c:func:`udelay()` and :c:func:`ndelay()` functions can be
used for small pauses. Do not use large values with them as you risk
overflow - the helper function :c:func:`mdelay()` is useful here, or
consider :c:func:`msleep()`.

:c:func:`cpu_to_be32()`/:c:func:`be32_to_cpu()`/:c:func:`cpu_to_le32()`/:c:func:`le32_to_cpu()`
-----------------------------------------------------------------------------------------------

Defined in ``include/asm/byteorder.h``

The :c:func:`cpu_to_be32()` family (where the "32" can be replaced
by 64 or 16, and the "be" can be replaced by "le") are the general way
to do endian conversions in the kernel: they return the converted value.
All variations supply the reverse as well:
:c:func:`be32_to_cpu()`, etc.

There are two major variations of these functions: the pointer
variation, such as :c:func:`cpu_to_be32p()`, which takes a pointer
to the given type, and returns the converted value. The other variation
is the "in-situ" family, such as :c:func:`cpu_to_be32s()`, which
converts the value referred to by the pointer, and returns void.

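The three calling styles of the byte-order helpers (by value, by
pointer, in-situ) can be modelled in plain userspace C. This is only a
sketch: the ``my_`` names are hypothetical, and the conversion is done
by explicit byte shuffling so that it behaves the same on little-endian
and big-endian hosts. The kernel's real helpers are not implemented this
way.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* By value: returns a word whose bytes in memory are big-endian. */
static uint32_t my_cpu_to_be32(uint32_t host)
{
        unsigned char bytes[4] = {
                (unsigned char)(host >> 24),
                (unsigned char)(host >> 16),
                (unsigned char)(host >> 8),
                (unsigned char)host,
        };
        uint32_t be;

        memcpy(&be, bytes, sizeof(be));
        return be;
}

/* The reverse: reads big-endian bytes back into a host-order value. */
static uint32_t my_be32_to_cpu(uint32_t be)
{
        unsigned char bytes[4];

        memcpy(bytes, &be, sizeof(bytes));
        return ((uint32_t)bytes[0] << 24) | ((uint32_t)bytes[1] << 16) |
               ((uint32_t)bytes[2] << 8) | (uint32_t)bytes[3];
}

/* Pointer variation: takes a pointer, returns the converted value. */
static uint32_t my_cpu_to_be32p(const uint32_t *p)
{
        assert(p != NULL);
        return my_cpu_to_be32(*p);
}

/* In-situ variation: converts the pointed-to value, returns void. */
static void my_cpu_to_be32s(uint32_t *p)
{
        assert(p != NULL);
        *p = my_cpu_to_be32(*p);
}
```

Whatever the host byte order, the first byte in memory of
``my_cpu_to_be32(0x12345678)`` is ``0x12``, and converting back yields
the original value.
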
:c:func:`local_irq_save()`/:c:func:`local_irq_restore()`
--------------------------------------------------------

Defined in ``include/linux/irqflags.h``

These routines disable hard interrupts on the local CPU, and restore
them. They are reentrant, saving the previous state in their one
``unsigned long flags`` argument. If you know that interrupts are
enabled, you can simply use :c:func:`local_irq_disable()` and
:c:func:`local_irq_enable()`.

.. _local_bh_disable:

:c:func:`local_bh_disable()`/:c:func:`local_bh_enable()`
--------------------------------------------------------

Defined in ``include/linux/bottom_half.h``


These routines disable soft interrupts on the local CPU, and restore
them. They are reentrant; if soft interrupts were disabled before, they
will still be disabled after this pair of functions has been called.
They prevent softirqs and tasklets from running on the current CPU.

:c:func:`smp_processor_id()`
----------------------------

Defined in ``include/linux/smp.h``

:c:func:`get_cpu()` disables preemption (so you won't suddenly get
moved to another CPU) and returns the current processor number, between
0 and ``NR_CPUS``. Note that the CPU numbers are not necessarily
continuous. You return it again with :c:func:`put_cpu()` when you
are done.

If you know you cannot be preempted by another task (i.e. you are in
interrupt context, or have preemption disabled) you can use
:c:func:`smp_processor_id()`.

``__init``/``__exit``/``__initdata``
------------------------------------

Defined in ``include/linux/init.h``

After boot, the kernel frees up a special section; functions marked with
``__init`` and data structures marked with ``__initdata`` are dropped
after boot is complete: similarly modules discard this memory after
initialization. ``__exit`` is used to declare a function which is only
required on exit: the function will be dropped if this file is not
compiled as a module. See the header file for use. Note that it makes no
sense for a function marked with ``__init`` to be exported to modules
with :c:func:`EXPORT_SYMBOL()` or :c:func:`EXPORT_SYMBOL_GPL()`: this
will break.

:c:func:`__initcall()`/:c:func:`module_init()`
----------------------------------------------

Defined in ``include/linux/init.h`` / ``include/linux/module.h``

Many parts of the kernel are well served as a module
(dynamically-loadable parts of the kernel). Using the
:c:func:`module_init()` and :c:func:`module_exit()` macros it
is easy to write code without #ifdefs which can operate both as a module
or built into the kernel.

The :c:func:`module_init()` macro defines which function is to be
called at module insertion time (if the file is compiled as a module),
or at boot time: if the file is not compiled as a module the
:c:func:`module_init()` macro becomes equivalent to
:c:func:`__initcall()`, which through linker magic ensures that
the function is called on boot.

The function can return a negative error number to cause module loading
to fail (unfortunately, this has no effect if the module is compiled
into the kernel). This function is called in user context with
interrupts enabled, so it can sleep.

:c:func:`module_exit()`
-----------------------


Defined in ``include/linux/module.h``

This macro defines the function to be called at module removal time (or
never, in the case of the file compiled into the kernel). It will only
be called if the module usage count has reached zero. This function can
also sleep, but cannot fail: everything must be cleaned up by the time
it returns.

Note that this macro is optional: if it is not present, your module will
not be removable (except for 'rmmod -f').

:c:func:`try_module_get()`/:c:func:`module_put()`
-------------------------------------------------

Defined in ``include/linux/module.h``

These manipulate the module usage count, to protect against removal (a
module also can't be removed if another module uses one of its exported
symbols: see below). Before calling into module code, you should call
:c:func:`try_module_get()` on that module: if it fails, then the
module is being removed and you should act as if it wasn't there.
Otherwise, you can safely enter the module, and call
:c:func:`module_put()` when you're finished.

Most registerable structures have an owner field, such as in the
:c:type:`struct file_operations <file_operations>` structure.
Set this field to the macro ``THIS_MODULE``.

Wait Queues ``include/linux/wait.h``
====================================

**[SLEEPS]**

A wait queue is used to wait for someone to wake you up when a certain
condition is true. They must be used carefully to ensure there is no
race condition. You declare a :c:type:`wait_queue_head_t`, and then processes
which want to wait for that condition declare a :c:type:`wait_queue_entry_t`
referring to themselves, and place that in the queue.

Declaring
---------

You declare a ``wait_queue_head_t`` using the
:c:func:`DECLARE_WAIT_QUEUE_HEAD()` macro, or using the
:c:func:`init_waitqueue_head()` routine in your initialization
code.

Queuing
-------

Placing yourself in the waitqueue is fairly complex, because you must
put yourself in the queue before checking the condition. There is a
macro to do this: :c:func:`wait_event_interruptible()`
(``include/linux/wait.h``). The first argument is the wait queue head,
and the second is an expression which is evaluated; the macro returns 0
when this expression is true, or ``-ERESTARTSYS`` if a signal is
received. The :c:func:`wait_event()` version ignores signals.

Waking Up Queued Tasks
----------------------

Call :c:func:`wake_up()` (``include/linux/wait.h``), which will wake
up every process in the queue. The exception is if one has
``TASK_EXCLUSIVE`` set, in which case the remainder of the queue will
not be woken. There are other variants of this basic function available
in the same header.

Atomic Operations
=================

Certain operations are guaranteed atomic on all platforms. The first
class of operations work on :c:type:`atomic_t` (``include/asm/atomic.h``);
this contains a signed integer (at least 32 bits long), and you must use
these functions to manipulate or read :c:type:`atomic_t` variables.
:c:func:`atomic_read()` and :c:func:`atomic_set()` get and set
the counter, :c:func:`atomic_add()`, :c:func:`atomic_sub()`,
:c:func:`atomic_inc()`, :c:func:`atomic_dec()`, and
:c:func:`atomic_dec_and_test()` (returns true if it was
decremented to zero).

Yes. It returns true (i.e. != 0) if the atomic variable is zero.

Note that these functions are slower than normal arithmetic, and so
should not be used unnecessarily.

The second class of atomic operations is atomic bit operations on an
``unsigned long``, defined in ``include/linux/bitops.h``. These
operations generally take a pointer to the bit pattern, and a bit
number: 0 is the least significant bit. :c:func:`set_bit()`,
:c:func:`clear_bit()` and :c:func:`change_bit()` set, clear,
and flip the given bit. :c:func:`test_and_set_bit()`,
:c:func:`test_and_clear_bit()` and
:c:func:`test_and_change_bit()` do the same thing, except return
true if the bit was previously set; these are particularly useful for
atomically setting flags.

It is possible to call these operations with bit indices greater than
``BITS_PER_LONG``. The resulting behavior is strange on big-endian
platforms though, so it is a good idea not to do this.

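The return values of the "test" operations are easy to misread, so here
is a userspace model built on C11 ``<stdatomic.h>``. It is a sketch
under assumptions, not kernel code: the ``my_`` functions and
``MY_BITS_PER_LONG`` are hypothetical names, the bit helpers work on a
single word only, and they refuse bit numbers outside that word.

```c
#include <assert.h>
#include <limits.h>
#include <stdatomic.h>
#include <stdbool.h>

#define MY_BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

/* True only for the caller whose decrement brought the counter to 0. */
static bool my_atomic_dec_and_test(atomic_int *v)
{
        /* atomic_fetch_sub() returns the value held before subtracting. */
        return atomic_fetch_sub(v, 1) == 1;
}

/* Sets bit 'nr'; returns true if it was already set beforehand. */
static bool my_test_and_set_bit(unsigned int nr, atomic_ulong *word)
{
        unsigned long mask;

        assert(nr < MY_BITS_PER_LONG);
        mask = 1UL << nr;
        return (atomic_fetch_or(word, mask) & mask) != 0;
}

/* Clears bit 'nr'; returns true if it was set beforehand. */
static bool my_test_and_clear_bit(unsigned int nr, atomic_ulong *word)
{
        unsigned long mask;

        assert(nr < MY_BITS_PER_LONG);
        mask = 1UL << nr;
        return (atomic_fetch_and(word, ~mask) & mask) != 0;
}
```

A typical use of the test-and-set form is a "first caller wins" flag:
whoever sees ``false`` set the bit and owns the work, everyone else sees
``true`` and backs off.
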
Symbols
=======

Within the kernel proper, the normal linking rules apply (i.e. unless a
symbol is declared to be file scope with the ``static`` keyword, it can
be used anywhere in the kernel). However, for modules, a special
exported symbol table is kept which limits the entry points to the
kernel proper. Modules can also export symbols.

:c:func:`EXPORT_SYMBOL()`
-------------------------

Defined in ``include/linux/export.h``

This is the classic method of exporting a symbol: dynamically loaded
modules will be able to use the symbol as normal.

:c:func:`EXPORT_SYMBOL_GPL()`
-----------------------------

Defined in ``include/linux/export.h``

Similar to :c:func:`EXPORT_SYMBOL()` except that the symbols
exported by :c:func:`EXPORT_SYMBOL_GPL()` can only be seen by
modules with a :c:func:`MODULE_LICENSE()` that specifies a GPL
compatible license. It implies that the function is considered an
internal implementation issue, and not really an interface. Some
maintainers and developers may however require
:c:func:`EXPORT_SYMBOL_GPL()` when adding any new APIs or functionality.

:c:func:`EXPORT_SYMBOL_NS()`
----------------------------

Defined in ``include/linux/export.h``

This is the variant of :c:func:`EXPORT_SYMBOL()` that allows specifying
a symbol namespace. Symbol Namespaces are documented in
Documentation/core-api/symbol-namespaces.rst.

:c:func:`EXPORT_SYMBOL_NS_GPL()`
--------------------------------

Defined in ``include/linux/export.h``

This is the variant of :c:func:`EXPORT_SYMBOL_GPL()` that allows
specifying a symbol namespace. Symbol Namespaces are documented in
Documentation/core-api/symbol-namespaces.rst.

Routines and Conventions
========================

Double-linked lists ``include/linux/list.h``
--------------------------------------------

There used to be three sets of linked-list routines in the kernel
headers, but this one is the winner. If you don't have some particular
pressing need for a single list, it's a good choice.

In particular, :c:func:`list_for_each_entry()` is useful.

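For readers who have not met intrusive lists before, the idea behind
:c:func:`list_for_each_entry()` can be sketched in userspace C: the list
node lives inside your structure, and a ``container_of``-style macro
walks back from the node to the structure. Everything prefixed ``my_``
below, along with ``struct item`` and ``sum_items()``, is a hypothetical
model written for this example. Unlike the kernel macro, which infers
the type with ``typeof``, this version takes the type as an explicit
argument to stay within standard C.

```c
#include <assert.h>
#include <stddef.h>

/* Circular doubly-linked list node, embedded in the user's struct. */
struct my_list_head {
        struct my_list_head *next, *prev;
};

/* Walk back from a member pointer to the structure containing it. */
#define my_container_of(ptr, type, member) \
        ((type *)((char *)(ptr) - offsetof(type, member)))

/* Iterate over every 'type' whose 'member' node is linked into 'head'. */
#define my_list_for_each_entry(pos, head, type, member)           \
        for (pos = my_container_of((head)->next, type, member);   \
             &pos->member != (head);                              \
             pos = my_container_of(pos->member.next, type, member))

static void my_list_init(struct my_list_head *head)
{
        head->next = head;
        head->prev = head;
}

static void my_list_add_tail(struct my_list_head *new,
                             struct my_list_head *head)
{
        assert(new != NULL && head != NULL);
        new->prev = head->prev;
        new->next = head;
        head->prev->next = new;
        head->prev = new;
}

/* Hypothetical payload type carrying an embedded list node. */
struct item {
        int value;
        struct my_list_head node;
};

static int sum_items(struct my_list_head *head)
{
        struct item *it;
        int sum = 0;

        my_list_for_each_entry(it, head, struct item, node)
                sum += it->value;
        return sum;
}
```

The loop body sees ``struct item`` pointers directly, which is what
makes the entry-iterating form more pleasant than iterating over raw
nodes and converting by hand.
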
Return Conventions
------------------

For code called in user context, it's very common to defy C convention,
and return 0 for success, and a negative error number (e.g. ``-EFAULT``)
for failure. This can be unintuitive at first, but it's fairly
widespread in the kernel.

Functions that return a pointer can use :c:func:`ERR_PTR()`
(``include/linux/err.h``) to encode a negative error number into the
pointer, and callers use :c:func:`IS_ERR()` and :c:func:`PTR_ERR()` to
get it back out again: this avoids a separate pointer parameter for the
error number. Icky, but in a good way.

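The error-in-a-pointer trick can be modelled in userspace as follows.
This is a sketch under assumptions: the ``my_`` functions,
``MY_MAX_ERRNO``, ``struct widget`` and ``find_widget()`` are
hypothetical names made up for the example, and the code relies on the
top few thousand addresses never being valid object pointers, which
holds on common platforms but is not promised by the C standard.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed ceiling on errno magnitudes for this model. */
#define MY_MAX_ERRNO 4095

/* Encode a negative errno in the (otherwise unused) top addresses. */
static void *my_err_ptr(long err)
{
        assert(err < 0 && err >= -MY_MAX_ERRNO);
        return (void *)(intptr_t)err;
}

/* A pointer is an encoded error if it lies in the last MY_MAX_ERRNO bytes. */
static bool my_is_err(const void *ptr)
{
        return (uintptr_t)ptr >= (uintptr_t)(-MY_MAX_ERRNO);
}

/* Recover the negative errno from an encoded pointer. */
static long my_ptr_err(const void *ptr)
{
        return (long)(intptr_t)ptr;
}

/* Hypothetical lookup that reports failure through its return value. */
struct widget {
        int id;
};

static struct widget the_widget = { 1 };

static struct widget *find_widget(int id)
{
        if (id != the_widget.id)
                return my_err_ptr(-ENOENT);
        return &the_widget;
}
```

A caller checks ``my_is_err()`` before touching the result and, on
failure, typically just returns ``my_ptr_err()`` upwards, so one return
value carries both the object and the reason it is missing.
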
Breaking Compilation
--------------------

Linus and the other developers sometimes change function or structure
names in development kernels; this is not done just to keep everyone on
their toes: it reflects a fundamental change (e.g. can no longer be
called with interrupts on, or does extra checks, or doesn't do checks
which were caught before). Usually this is accompanied by a fairly
complete note to the appropriate kernel development mailing list; search
the archives. Simply doing a global replace on the file usually makes
things **worse**.

651 | Initializing structure members | |
652 | ------------------------------ | |
653 | ||
654 | The preferred method of initializing structures is to use designated | |
dca1e58e | 655 | initialisers, as defined by ISO C99, eg:: |
c4fcd7ca MCC |
656 | |
657 | static struct block_device_operations opt_fops = { | |
658 | .open = opt_open, | |
659 | .release = opt_release, | |
660 | .ioctl = opt_ioctl, | |
661 | .check_media_change = opt_media_change, | |
662 | }; | |
663 | ||
664 | ||
665 | This makes it easy to grep for, and makes it clear which structure | |
666 | fields are set. You should do this because it looks cool. | |
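
There is a practical benefit beyond looking cool, which a small
standalone sketch can show. Everything here (``struct toy_ops`` and the
``toy_*`` functions) is hypothetical and only loosely shaped like the
example above. Members can be named in any order, and any member you
leave out is zero-initialized, so the caller can check for ``NULL``::

    #include <stddef.h>

    /* Hypothetical operations table. */
    struct toy_ops {
            int (*open)(int minor);
            int (*release)(int minor);
            int (*ioctl)(int minor, unsigned int cmd);
            int (*flush)(int minor);
    };

    static int toy_open(int minor)
    {
            return minor >= 0 ? 0 : -1;
    }

    static int toy_release(int minor)
    {
            (void)minor;
            return 0;
    }

    /*
     * Order does not matter, and the members not named here (.ioctl
     * and .flush) are implicitly set to NULL.
     */
    static const struct toy_ops toy_fops = {
            .release = toy_release,
            .open    = toy_open,
    };

    /* Callers test for a missing operation before using it. */
    static int toy_do_ioctl(const struct toy_ops *ops, int minor,
                            unsigned int cmd)
    {
            if (!ops->ioctl)
                    return -1;      /* operation not provided */
            return ops->ioctl(minor, cmd);
    }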

GNU Extensions
--------------

GNU Extensions are explicitly allowed in the Linux kernel. Note that
some of the more complex ones are not very well supported, due to lack
of general use, but the following are considered standard (see the GCC
info page section "C Extensions" for more details - Yes, really the info
page, the man page is only a short summary of the stuff in info).

- Inline functions

- Statement expressions (i.e. the ``({`` and ``})`` constructs).

- Declaring attributes of a function / variable / type
  (``__attribute__``)

- ``typeof``

- Zero length arrays

- Macro varargs

- Arithmetic on void pointers

- Non-Constant initializers

- Assembler Instructions (not outside ``arch/`` and ``include/asm/``)

- Function names as strings (``__func__``).

- ``__builtin_constant_p()``
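
A few of the extensions listed above, sketched in a form that builds in
userspace with GCC or Clang. The names ``max_once()``, ``skip_bytes()``,
``is_const()`` and ``whoami()`` are made up for this example and are not
kernel APIs::

    #include <stddef.h>

    /*
     * Statement expression plus typeof: each argument is evaluated
     * exactly once, so side effects such as i++ are safe.
     */
    #define max_once(a, b) ({ \
            __typeof__(a) _a = (a); \
            __typeof__(b) _b = (b); \
            _a > _b ? _a : _b; \
    })

    /* Arithmetic on void pointers: GNU C treats sizeof(void) as 1. */
    static void *skip_bytes(void *p, size_t n)
    {
            return p + n;
    }

    /* Non-zero when the compiler can see that x is a constant. */
    #define is_const(x) __builtin_constant_p(x)

    /* __func__ names the enclosing function. */
    static const char *whoami(void)
    {
            return __func__;
    }

The ``ndelay()`` macro quoted under Kernel Cantrips uses
``__builtin_constant_p()`` in just this way, to choose a cheaper path
when the argument is known at compile time.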

Be wary when using ``long long`` in the kernel: the code gcc generates for
it is horrible and, worse, division and multiplication do not work on
i386 because the GCC runtime functions for them are missing from the
kernel environment.

C++
---

Using C++ in the kernel is usually a bad idea, because the kernel does
not provide the necessary runtime environment and the include files are
not tested for it. It is still possible, but not recommended. If you
really want to do this, forget about exceptions at least.

#if
---

It is generally considered cleaner to use macros in header files (or at
the top of .c files) to abstract away functions rather than using ``#if``
pre-processor statements throughout the source code.
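
One way to read that advice, as a sketch rather than a rule: confine the
conditional to a single spot that defines either the real helper or an
empty stub, and call the helper unconditionally everywhere else.
``CONFIG_TOY_STATS``, ``struct toy_dev`` and the ``toy_*`` functions are
hypothetical, and I have used ``static inline`` stubs instead of macros
so the arguments are still type-checked when the feature is off::

    /* Hypothetical option; in a real tree this would come from Kconfig. */
    #define CONFIG_TOY_STATS 1

    struct toy_dev {
            unsigned long packets;
    };

    /* Header part: the only place the conditional appears. */
    #ifdef CONFIG_TOY_STATS
    static inline void toy_count_packet(struct toy_dev *dev)
    {
            dev->packets++;
    }
    #else
    static inline void toy_count_packet(struct toy_dev *dev)
    {
            (void)dev;      /* compiles away when the feature is off */
    }
    #endif

    /* The .c file stays free of pre-processor conditionals. */
    static void toy_receive(struct toy_dev *dev)
    {
            toy_count_packet(dev);
            /* ... rest of the receive path ... */
    }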

Putting Your Stuff in the Kernel
================================

In order to get your stuff into shape for official inclusion, or even to
make a neat patch, there's administrative work to be done:

- Figure out who owns the code you've been modifying. Look
  at the top of the source files, inside the ``MAINTAINERS`` file, and
  last of all in the ``CREDITS`` file. You should coordinate with these
  people to make sure you're not duplicating effort, or trying something
  that's already been rejected.

  Make sure you put your name and email address at the top of any files
  you create or modify significantly. This is the first place people
  will look when they find a bug, or when **they** want to make a change.

- Usually you want a configuration option for your kernel hack. Edit
  ``Kconfig`` in the appropriate directory. The Config language is
  simple to use by cut and paste, and there's complete documentation in
  ``Documentation/kbuild/kconfig-language.rst``.

  In your description of the option, make sure you address both the
  expert user and the user who knows nothing about your feature.
  Mention incompatibilities and issues here. **Definitely** end your
  description with “if in doubt, say N” (or, occasionally, “Y”); this
  is for people who have no idea what you are talking about.

- Edit the ``Makefile``: the CONFIG variables are exported here so you
  can usually just add an ``obj-$(CONFIG_xxx) += xxx.o`` line. The syntax
  is documented in ``Documentation/kbuild/makefiles.rst``.

- Put yourself in ``CREDITS`` if you consider what you've done
  noteworthy, usually beyond a single file (your name should be at the
  top of the source files anyway). ``MAINTAINERS`` means you want to be
  consulted when changes are made to a subsystem, and hear about bugs;
  it implies a more-than-passing commitment to some part of the code.

- Finally, don't forget to read
  ``Documentation/process/submitting-patches.rst``.
759 | |
760 | Kernel Cantrips | |
761 | =============== | |
762 | ||
763 | Some favorites from browsing the source. Feel free to add to this list. | |
764 | ||
dca1e58e | 765 | ``arch/x86/include/asm/delay.h``:: |
c4fcd7ca MCC |
766 | |
767 | #define ndelay(n) (__builtin_constant_p(n) ? \ | |
768 | ((n) > 20000 ? __bad_ndelay() : __const_udelay((n) * 5ul)) : \ | |
769 | __ndelay(n)) | |
770 | ||
771 | ||
dca1e58e | 772 | ``include/linux/fs.h``:: |
c4fcd7ca MCC |
773 | |
774 | /* | |
775 | * Kernel pointers have redundant information, so we can use a | |
776 | * scheme where we can return either an error code or a dentry | |
777 | * pointer with the same return value. | |
778 | * | |
779 | * This should be a per-architecture thing, to allow different | |
780 | * error and pointer decisions. | |
781 | */ | |
782 | #define ERR_PTR(err) ((void *)((long)(err))) | |
783 | #define PTR_ERR(ptr) ((long)(ptr)) | |
784 | #define IS_ERR(ptr) ((unsigned long)(ptr) > (unsigned long)(-1000)) | |
785 | ||
dca1e58e | 786 | ``arch/x86/include/asm/uaccess_32.h:``:: |
c4fcd7ca MCC |
787 | |
788 | #define copy_to_user(to,from,n) \ | |
789 | (__builtin_constant_p(n) ? \ | |
790 | __constant_copy_to_user((to),(from),(n)) : \ | |
791 | __generic_copy_to_user((to),(from),(n))) | |
792 | ||


``arch/sparc/kernel/head.S``::

    /*
     * Sun people can't spell worth damn. "compatability" indeed.
     * At least we *know* we can't spell, and use a spell-checker.
     */

    /* Uh, actually Linus it is I who cannot spell. Too much murky
     * Sparc assembly will do this to ya.
     */
    C_LABEL(cputypvar):
            .asciz "compatibility"

    /* Tested on SS-5, SS-10. Probably someone at Sun applied a spell-checker. */
            .align 4
    C_LABEL(cputypvar_sun4m):
            .asciz "compatible"


``arch/sparc/lib/checksum.S``::

    /* Sun, you just can't beat me, you just can't. Stop trying,
     * give up. I'm serious, I am going to kick the living shit
     * out of you, game over, lights out.
     */


Thanks
======

Thanks to Andi Kleen for the idea, answering my questions, fixing my
mistakes, filling content, etc. Philipp Rumpf for more spelling and
clarity fixes, and some excellent non-obvious points. Werner Almesberger
for giving me a great summary of :c:func:`disable_irq()`, and Jes
Sorensen and Andrea Arcangeli added caveats. Michael Elizabeth Chastain
for checking and adding to the Configure section. Telsa Gwynne for
teaching me DocBook.