Commit | Line | Data |
---|---|---|
1497624f FV |
1 | .. _kernel_hacking_hack: |
2 | ||
c4fcd7ca MCC |
3 | ============================================ |
4 | Unreliable Guide To Hacking The Linux Kernel | |
5 | ============================================ | |
6 | ||
7 | :Author: Rusty Russell | |
8 | ||
9 | Introduction | |
10 | ============ | |
11 | ||
12 | Welcome, gentle reader, to Rusty's Remarkably Unreliable Guide to Linux | |
13 | Kernel Hacking. This document describes the common routines and general | |
14 | requirements for kernel code: its goal is to serve as a primer for Linux | |
15 | kernel development for experienced C programmers. I avoid implementation | |
16 | details: that's what the code is for, and I ignore whole tracts of | |
17 | useful routines. | |
18 | ||
19 | Before you read this, please understand that I never wanted to write | |
20 | this document, being grossly under-qualified, but I always wanted to | |
21 | read it, and this was the only way. I hope it will grow into a | |
22 | compendium of best practice, common starting points and random | |
23 | information. | |
24 | ||
25 | The Players | |
26 | =========== | |
27 | ||
28 | At any time each of the CPUs in a system can be: | |
29 | ||
30 | - not associated with any process, serving a hardware interrupt; | |
31 | ||
32 | - not associated with any process, serving a softirq or tasklet; | |
33 | ||
34 | - running in kernel space, associated with a process (user context); | |
35 | ||
36 | - running a process in user space. | |
37 | ||
38 | There is an ordering between these. The bottom two can preempt each | |
39 | other, but above that is a strict hierarchy: each can only be preempted | |
40 | by the ones above it. For example, while a softirq is running on a CPU, | |
41 | no other softirq will preempt it, but a hardware interrupt can. However, | |
42 | any other CPUs in the system execute independently. | |
43 | ||
44 | We'll see a number of ways that the user context can block interrupts, | |
45 | to become truly non-preemptable. | |
46 | ||
47 | User Context | |
48 | ------------ | |
49 | ||
50 | User context is when you are coming in from a system call or other trap: | |
51 | like userspace, you can be preempted by more important tasks and by | |
52 | interrupts. You can sleep, by calling :c:func:`schedule()`. | |
53 | ||
54 | .. note:: | |
55 | ||
56 | You are always in user context on module load and unload, and on | |
57 | operations on the block device layer. | |
58 | ||
59 | In user context, the ``current`` pointer (indicating the task we are | |
60 | currently executing) is valid, and :c:func:`in_interrupt()` | |
dca1e58e | 61 | (``include/linux/preempt.h``) is false. |
c4fcd7ca MCC |
62 | |
63 | .. warning:: | |
64 | ||
65 | Beware that if you have preemption or softirqs disabled (see below), | |
66 | :c:func:`in_interrupt()` will return a false positive. | |
67 | ||
68 | Hardware Interrupts (Hard IRQs) | |
69 | ------------------------------- | |
70 | ||
71 | Timer ticks, network cards and keyboard are examples of real hardware | |
72 | which produce interrupts at any time. The kernel runs interrupt | |
73 | handlers, which service the hardware. The kernel guarantees that this | |
74 | handler is never re-entered: if the same interrupt arrives, it is queued | |
75 | (or dropped). Because it disables interrupts, this handler has to be | |
76 | fast: frequently it simply acknowledges the interrupt, marks a 'software | |
77 | interrupt' for execution and exits. | |
78 | ||
fe450eeb CD |
79 | You can tell you are in a hardware interrupt, because in_hardirq() returns |
80 | true. | |
c4fcd7ca MCC |
81 | |
82 | .. warning:: | |
83 | ||
84 | Beware that this will return a false positive if interrupts are | |
85 | disabled (see below). | |
86 | ||
87 | Software Interrupt Context: Softirqs and Tasklets | |
88 | ------------------------------------------------- | |
89 | ||
90 | Whenever a system call is about to return to userspace, or a hardware | |
91 | interrupt handler exits, any 'software interrupts' which are marked | |
92 | pending (usually by hardware interrupts) are run (``kernel/softirq.c``). | |
93 | ||
94 | Much of the real interrupt handling work is done here. Early in the | |
95 | transition to SMP, there were only 'bottom halves' (BHs), which didn't | |
96 | take advantage of multiple CPUs. Shortly after we switched from wind-up | |
97 | computers made of match-sticks and snot, we abandoned this limitation | |
98 | and switched to 'softirqs'. | |
99 | ||
100 | ``include/linux/interrupt.h`` lists the different softirqs. A very | |
101 | important softirq is the timer softirq (``include/linux/timer.h``): you | |
102 | can register to have it call functions for you in a given length of | |
103 | time. | |
104 | ||
105 | Softirqs are often a pain to deal with, since the same softirq will run | |
106 | simultaneously on more than one CPU. For this reason, tasklets | |
107 | (``include/linux/interrupt.h``) are more often used: they are | |
108 | dynamically-registrable (meaning you can have as many as you want), and | |
109 | they also guarantee that any tasklet will only run on one CPU at any | |
110 | time, although different tasklets can run simultaneously. | |
111 | ||
112 | .. warning:: | |
113 | ||
114 | The name 'tasklet' is misleading: they have nothing to do with | |
115 | 'tasks', and probably more to do with some bad vodka Alexey | |
116 | Kuznetsov had at the time. | |
117 | ||
118 | You can tell you are in a softirq (or tasklet) using the | |
dca1e58e | 119 | :c:func:`in_softirq()` macro (``include/linux/preempt.h``). |
c4fcd7ca MCC |
120 | |
121 | .. warning:: | |
122 | ||
dca1e58e MCC |
123 | Beware that this will return a false positive if a |
124 | :ref:`bottom half lock <local_bh_disable>` is held. | |
c4fcd7ca MCC |
125 | |
126 | Some Basic Rules | |
127 | ================ | |
128 | ||
129 | No memory protection | |
130 | If you corrupt memory, whether in user context or interrupt context, | |
131 | the whole machine will crash. Are you sure you can't do what you | |
132 | want in userspace? | |
133 | ||
134 | No floating point or MMX | |
135 | The FPU context is not saved; even in user context the FPU state | |
136 | probably won't correspond with the current process: you would mess | |
137 | with some user process' FPU state. If you really want to do this, | |
138 | you would have to explicitly save/restore the full FPU state (and | |
139 | avoid context switches). It is generally a bad idea; use fixed point | |
140 | arithmetic first. | |
141 | ||
142 | A rigid stack limit | |
143 | Depending on configuration options the kernel stack is about 3K to | |
144 | 6K for most 32-bit architectures: it's about 14K on most 64-bit | |
145 | archs, and often shared with interrupts so you can't use it all. | |
146 | Avoid deep recursion and huge local arrays on the stack (allocate | |
147 | them dynamically instead). | |
148 | ||
149 | The Linux kernel is portable | |
150 | Let's keep it that way. Your code should be 64-bit clean, and | |
151 | endian-independent. You should also minimize CPU specific stuff, | |
152 | e.g. inline assembly should be cleanly encapsulated and minimized to | |
153 | ease porting. Generally it should be restricted to the | |
154 | architecture-dependent part of the kernel tree. | |
155 | ||
156 | ioctls: Not writing a new system call | |
157 | ===================================== | |
158 | ||
dca1e58e | 159 | A system call generally looks like this:: |
c4fcd7ca MCC |
160 | |
161 | asmlinkage long sys_mycall(int arg) | |
162 | { | |
163 | return 0; | |
164 | } | |
165 | ||
166 | ||
167 | First, in most cases you don't want to create a new system call. You | |
168 | create a character device and implement an appropriate ioctl for it. | |
169 | This is much more flexible than system calls, doesn't have to be entered | |
170 | in every architecture's ``include/asm/unistd.h`` and | |
171 | ``arch/kernel/entry.S`` file, and is much more likely to be accepted by | |
172 | Linus. | |
173 | ||
174 | If all your routine does is read or write some parameter, consider | |
175 | implementing a sysfs interface instead. | |
176 | ||
177 | Inside the ioctl you're in user context to a process. When an error | |
dca1e58e MCC |
178 | occurs you return a negated errno (see |
179 | ``include/uapi/asm-generic/errno-base.h``, | |
180 | ``include/uapi/asm-generic/errno.h`` and ``include/linux/errno.h``), | |
c4fcd7ca MCC |
181 | otherwise you return 0. |
182 | ||
183 | After you have slept you should check if a signal occurred: the Unix/Linux | |
184 | way of handling signals is to temporarily exit the system call with the | |
185 | ``-ERESTARTSYS`` error. The system call entry code will switch back to | |
186 | user context, process the signal handler and then your system call will | |
187 | be restarted (unless the user disabled that). So you should be prepared | |
188 | to process the restart, e.g. if you're in the middle of manipulating | |
189 | some data structure. | |
190 | ||
191 | :: | |
192 | ||
193 | if (signal_pending(current)) | |
194 | return -ERESTARTSYS; | |
195 | ||
196 | ||
197 | If you're doing longer computations: first think userspace. If you | |
198 | **really** want to do it in kernel you should regularly check if you need | |
199 | to give up the CPU (remember there is cooperative multitasking per CPU). | |
dca1e58e | 200 | Idiom:: |
c4fcd7ca MCC |
201 | |
202 | cond_resched(); /* Will sleep */ | |
203 | ||
204 | ||
205 | A short note on interface design: the UNIX system call motto is "Provide | |
206 | mechanism not policy". | |
207 | ||
208 | Recipes for Deadlock | |
209 | ==================== | |
210 | ||
211 | You cannot call any routines which may sleep, unless: | |
212 | ||
213 | - You are in user context. | |
214 | ||
215 | - You do not own any spinlocks. | |
216 | ||
217 | - You have interrupts enabled (actually, Andi Kleen says that the | |
218 | scheduling code will enable them for you, but that's probably not | |
219 | what you wanted). | |
220 | ||
221 | Note that some functions may sleep implicitly: common ones are the user | |
222 | space access functions (\*_user) and memory allocation functions | |
223 | without ``GFP_ATOMIC``. | |
224 | ||
225 | You should always compile your kernel with ``CONFIG_DEBUG_ATOMIC_SLEEP`` on, | |
226 | and it will warn you if you break these rules. If you **do** break the | |
227 | rules, you will eventually lock up your box. | |
228 | ||
229 | Really. | |
230 | ||
231 | Common Routines | |
232 | =============== | |
233 | ||
dca1e58e MCC |
234 | :c:func:`printk()` |
235 | ------------------ | |
236 | ||
237 | Defined in ``include/linux/printk.h`` | |
c4fcd7ca MCC |
238 | |
239 | :c:func:`printk()` feeds kernel messages to the console, dmesg, and | |
240 | the syslog daemon. It is useful for debugging and reporting errors, and | |
241 | can be used inside interrupt context, but use with caution: a machine | |
242 | which has its console flooded with printk messages is unusable. It uses | |
243 | a format string mostly compatible with ANSI C printf, and C string | |
dca1e58e | 244 | concatenation to give it a first "priority" argument:: |
c4fcd7ca MCC |
245 | |
246 | printk(KERN_INFO "i = %u\n", i); | |
247 | ||
248 | ||
dca1e58e | 249 | See ``include/linux/kern_levels.h`` for other ``KERN_`` values; these are |
c4fcd7ca | 250 | interpreted by syslog as the level. Special case: for printing an IP |
dca1e58e | 251 | address use:: |
c4fcd7ca MCC |
252 | |
253 | __be32 ipaddress; | |
254 | printk(KERN_INFO "my ip: %pI4\n", &ipaddress); | |
255 | ||
256 | ||
257 | :c:func:`printk()` internally uses a 1K buffer and does not catch | |
258 | overruns. Make sure that will be enough. | |
259 | ||
260 | .. note:: | |
261 | ||
262 | You will know when you are a real kernel hacker when you start | |
263 | typoing printf as printk in your user programs :) | |
264 | ||
265 | .. note:: | |
266 | ||
267 | Another sidenote: the original Unix Version 6 sources had a comment | |
268 | on top of its printf function: "Printf should not be used for | |
269 | chit-chat". You should follow that advice. | |
270 | ||
dca1e58e MCC |
271 | :c:func:`copy_to_user()` / :c:func:`copy_from_user()` / :c:func:`get_user()` / :c:func:`put_user()` |
272 | --------------------------------------------------------------------------------------------------- | |
273 | ||
274 | Defined in ``include/linux/uaccess.h`` / ``asm/uaccess.h`` | |
c4fcd7ca MCC |
275 | |
276 | **[SLEEPS]** | |
277 | ||
278 | :c:func:`get_user()` and :c:func:`put_user()` are used to get | |
279 | and put single values (such as an int, char, or long) from and to | |
280 | userspace. A pointer into userspace should never be simply dereferenced: | |
281 | data should be copied using these routines. Both return ``-EFAULT`` or | |
282 | 0. | |
283 | ||
284 | :c:func:`copy_to_user()` and :c:func:`copy_from_user()` are | |
285 | more general: they copy an arbitrary amount of data to and from | |
286 | userspace. | |
287 | ||
288 | .. warning:: | |
289 | ||
290 | Unlike :c:func:`put_user()` and :c:func:`get_user()`, they | |
291 | return the amount of uncopied data (ie. 0 still means success). | |
292 | ||
293 | [Yes, this moronic interface makes me cringe. The flamewar comes up | |
294 | every year or so. --RR.] | |
295 | ||
296 | The functions may sleep implicitly. They should never be called outside | |
297 | user context (it makes no sense), with interrupts disabled, or a | |
298 | spinlock held. | |
299 | ||
dca1e58e MCC |
300 | :c:func:`kmalloc()`/:c:func:`kfree()` |
301 | ------------------------------------- | |
302 | ||
303 | Defined in ``include/linux/slab.h`` | |
c4fcd7ca MCC |
304 | |
305 | **[MAY SLEEP: SEE BELOW]** | |
306 | ||
307 | These routines are used to dynamically request pointer-aligned chunks of | |
308 | memory, like malloc and free do in userspace, but | |
309 | :c:func:`kmalloc()` takes an extra flag word. Important values: | |
310 | ||
311 | ``GFP_KERNEL`` | |
312 | May sleep and swap to free memory. Only allowed in user context, but | |
313 | is the most reliable way to allocate memory. | |
314 | ||
315 | ``GFP_ATOMIC`` | |
316 | Don't sleep. Less reliable than ``GFP_KERNEL``, but may be called | |
317 | from interrupt context. You should **really** have a good | |
318 | out-of-memory error-handling strategy. | |
319 | ||
320 | ``GFP_DMA`` | |
321 | Allocate ISA DMA lower than 16MB. If you don't know what that is you | |
322 | don't need it. Very unreliable. | |
323 | ||
324 | If you see a sleeping function called from invalid context warning | |
325 | message, then maybe you called a sleeping allocation function from | |
326 | interrupt context without ``GFP_ATOMIC``. You should really fix that. | |
327 | Run, don't walk. | |
328 | ||
dca1e58e MCC |
329 | If you are allocating at least ``PAGE_SIZE`` (``asm/page.h`` or |
330 | ``asm/page_types.h``) bytes, consider using :c:func:`__get_free_pages()` | |
331 | (``include/linux/gfp.h``). It takes an order argument (0 for page sized, | |
c4fcd7ca MCC |
332 | 1 for double page, 2 for four pages etc.) and the same memory priority |
333 | flag word as above. | |
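The order argument is a base-2 logarithm of the page count, so a request is always rounded up to a power-of-two number of pages. As a rough userspace illustration (not kernel code), the hypothetical helper ``sketch_order_for()`` below computes the smallest order that covers a byte count. The 4096-byte page size is an assumption made for the sketch; the real ``PAGE_SIZE`` depends on the architecture and configuration.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed page size for this sketch only; not the kernel's PAGE_SIZE. */
#define SKETCH_PAGE_SIZE 4096UL

/*
 * Hypothetical helper: return the smallest order such that
 * (1 << order) pages hold at least `bytes` bytes.
 */
static unsigned int sketch_order_for(size_t bytes)
{
    /* Round the byte count up to whole pages. */
    size_t pages = (bytes + SKETCH_PAGE_SIZE - 1) / SKETCH_PAGE_SIZE;
    unsigned int order = 0;

    /* Double the block size until it covers the request. */
    while (((size_t)1 << order) < pages)
        order++;
    return order;
}
```

Note the cost of the rounding: under these assumptions a three-page request needs order 2, so a quarter of the block goes unused.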
334 | ||
335 | If you are allocating more than a page worth of bytes you can use | |
336 | :c:func:`vmalloc()`. It'll allocate virtual memory in the kernel | |
337 | map. This block is not contiguous in physical memory, but the MMU makes | |
338 | it look like it is for you (so it'll only look contiguous to the CPUs, | |
339 | not to external device drivers). If you really need large physically | |
340 | contiguous memory for some weird device, you have a problem: it is | |
341 | poorly supported in Linux because after some time memory fragmentation | |
342 | in a running kernel makes it hard. The best way is to allocate the block | |
343 | early in the boot process via the :c:func:`alloc_bootmem()` | |
344 | routine. | |
345 | ||
346 | Before inventing your own cache of often-used objects consider using a | |
347 | slab cache in ``include/linux/slab.h`` | |
348 | ||
3a4928cf JP |
349 | :c:macro:`current` |
350 | ------------------ | |
dca1e58e MCC |
351 | |
352 | Defined in ``include/asm/current.h`` | |
c4fcd7ca MCC |
353 | |
354 | This global variable (really a macro) contains a pointer to the current | |
355 | task structure, so is only valid in user context. For example, when a | |
356 | process makes a system call, this will point to the task structure of | |
357 | the calling process. It is **not NULL** in interrupt context. | |
358 | ||
dca1e58e MCC |
359 | :c:func:`mdelay()`/:c:func:`udelay()` |
360 | ------------------------------------- | |
361 | ||
362 | Defined in ``include/asm/delay.h`` / ``include/linux/delay.h`` | |
c4fcd7ca MCC |
363 | |
364 | The :c:func:`udelay()` and :c:func:`ndelay()` functions can be | |
365 | used for small pauses. Do not use large values with them as you risk | |
366 | overflow - the helper function :c:func:`mdelay()` is useful here, or | |
367 | consider :c:func:`msleep()`. | |
368 | ||
dca1e58e MCC |
369 | :c:func:`cpu_to_be32()`/:c:func:`be32_to_cpu()`/:c:func:`cpu_to_le32()`/:c:func:`le32_to_cpu()` |
370 | ----------------------------------------------------------------------------------------------- | |
371 | ||
372 | Defined in ``include/asm/byteorder.h`` | |
c4fcd7ca MCC |
373 | |
374 | The :c:func:`cpu_to_be32()` family (where the "32" can be replaced | |
375 | by 64 or 16, and the "be" can be replaced by "le") are the general way | |
376 | to do endian conversions in the kernel: they return the converted value. | |
377 | All variations supply the reverse as well: | |
378 | :c:func:`be32_to_cpu()`, etc. | |
379 | ||
380 | There are two major variations of these functions: the pointer | |
381 | variation, such as :c:func:`cpu_to_be32p()`, which takes a pointer | |
382 | to the given type, and returns the converted value. The other variation | |
383 | is the "in-situ" family, such as :c:func:`cpu_to_be32s()`, which | |
384 | converts the value referred to by the pointer, and returns void. | |
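To make the three calling shapes concrete, here is a userspace mock-up. It is only a sketch: the ``sketch_`` names are hypothetical stand-ins, the run-time endianness probe is an assumption of this example, and the kernel's own helpers are implemented quite differently (they are resolved at compile time).

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Reverse the four bytes of a 32-bit value. */
static uint32_t sketch_swab32(uint32_t x)
{
    return (x >> 24) | ((x >> 8) & 0x0000ff00u) |
           ((x << 8) & 0x00ff0000u) | (x << 24);
}

/* Probe the host byte order by looking at the first byte of a known value. */
static int sketch_host_is_little_endian(void)
{
    const uint16_t probe = 1;
    unsigned char first;

    memcpy(&first, &probe, 1);
    return first == 1;
}

/* Value form: returns the converted value. */
static uint32_t sketch_cpu_to_be32(uint32_t x)
{
    return sketch_host_is_little_endian() ? sketch_swab32(x) : x;
}

/* The reverse direction is the same operation. */
static uint32_t sketch_be32_to_cpu(uint32_t x)
{
    return sketch_cpu_to_be32(x);
}

/* Pointer form: reads through the pointer, returns the converted value. */
static uint32_t sketch_cpu_to_be32p(const uint32_t *p)
{
    return sketch_cpu_to_be32(*p);
}

/* In-situ form: converts the pointed-to value in place, returns nothing. */
static void sketch_cpu_to_be32s(uint32_t *p)
{
    *p = sketch_cpu_to_be32(*p);
}

/* Helper for checking: first byte in memory of the big-endian encoding. */
static unsigned char sketch_first_wire_byte(uint32_t host_value)
{
    unsigned char bytes[4];

    sketch_cpu_to_be32s(&host_value);
    memcpy(bytes, &host_value, sizeof(bytes));
    return bytes[0];
}
```

Whatever the host byte order, the most significant byte ends up first in memory after the conversion, which is what a big-endian wire format expects.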
385 | ||
dca1e58e MCC |
386 | :c:func:`local_irq_save()`/:c:func:`local_irq_restore()` |
387 | -------------------------------------------------------- | |
388 | ||
389 | Defined in ``include/linux/irqflags.h`` | |
c4fcd7ca MCC |
390 | |
391 | These routines disable hard interrupts on the local CPU, and restore | |
392 | them. They are reentrant; saving the previous state in their one | |
393 | ``unsigned long flags`` argument. If you know that interrupts are | |
394 | enabled, you can simply use :c:func:`local_irq_disable()` and | |
395 | :c:func:`local_irq_enable()`. | |
396 | ||
dca1e58e MCC |
397 | .. _local_bh_disable: |
398 | ||
399 | :c:func:`local_bh_disable()`/:c:func:`local_bh_enable()` | |
400 | -------------------------------------------------------- | |
401 | ||
402 | Defined in ``include/linux/bottom_half.h`` | |
403 | ||
c4fcd7ca MCC |
404 | |
405 | These routines disable soft interrupts on the local CPU, and restore | |
406 | them. They are reentrant; if soft interrupts were disabled before, they | |
407 | will still be disabled after this pair of functions has been called. | |
408 | They prevent softirqs and tasklets from running on the current CPU. | |
409 | ||
dca1e58e MCC |
410 | :c:func:`smp_processor_id()` |
411 | ---------------------------- | |
412 | ||
413 | Defined in ``include/linux/smp.h`` | |
c4fcd7ca MCC |
414 | |
415 | :c:func:`get_cpu()` disables preemption (so you won't suddenly get | |
416 | moved to another CPU) and returns the current processor number, between | |
417 | 0 and ``NR_CPUS``. Note that the CPU numbers are not necessarily | |
418 | continuous. You return it again with :c:func:`put_cpu()` when you | |
419 | are done. | |
420 | ||
421 | If you know you cannot be preempted by another task (ie. you are in | |
422 | interrupt context, or have preemption disabled) you can use | |
423 | smp_processor_id(). | |
424 | ||
dca1e58e MCC |
425 | ``__init``/``__exit``/``__initdata`` |
426 | ------------------------------------ | |
427 | ||
428 | Defined in ``include/linux/init.h`` | |
c4fcd7ca MCC |
429 | |
430 | After boot, the kernel frees up a special section; functions marked with | |
431 | ``__init`` and data structures marked with ``__initdata`` are dropped | |
432 | after boot is complete: similarly modules discard this memory after | |
433 | initialization. ``__exit`` is used to declare a function which is only | |
434 | required on exit: the function will be dropped if this file is not | |
435 | compiled as a module. See the header file for use. Note that it makes no | |
436 | sense for a function marked with ``__init`` to be exported to modules | |
dca1e58e MCC |
437 | with :c:func:`EXPORT_SYMBOL()` or :c:func:`EXPORT_SYMBOL_GPL()`- this |
438 | will break. | |
439 | ||
440 | :c:func:`__initcall()`/:c:func:`module_init()` | |
441 | ---------------------------------------------- | |
c4fcd7ca | 442 | |
dca1e58e | 443 | Defined in ``include/linux/init.h`` / ``include/linux/module.h`` |
c4fcd7ca MCC |
444 | |
445 | Many parts of the kernel are well served as a module | |
446 | (dynamically-loadable parts of the kernel). Using the | |
447 | :c:func:`module_init()` and :c:func:`module_exit()` macros it | |
448 | is easy to write code without #ifdefs which can operate either as a module | |
449 | or built into the kernel. | |
450 | ||
451 | The :c:func:`module_init()` macro defines which function is to be | |
452 | called at module insertion time (if the file is compiled as a module), | |
453 | or at boot time: if the file is not compiled as a module the | |
454 | :c:func:`module_init()` macro becomes equivalent to | |
455 | :c:func:`__initcall()`, which through linker magic ensures that | |
456 | the function is called on boot. | |
457 | ||
458 | The function can return a negative error number to cause module loading | |
459 | to fail (unfortunately, this has no effect if the module is compiled | |
460 | into the kernel). This function is called in user context with | |
461 | interrupts enabled, so it can sleep. | |
462 | ||
dca1e58e MCC |
463 | :c:func:`module_exit()` |
464 | ----------------------- | |
465 | ||
466 | ||
467 | Defined in ``include/linux/module.h`` | |
c4fcd7ca MCC |
468 | |
469 | This macro defines the function to be called at module removal time (or | |
470 | never, in the case of the file compiled into the kernel). It will only | |
471 | be called if the module usage count has reached zero. This function can | |
472 | also sleep, but cannot fail: everything must be cleaned up by the time | |
473 | it returns. | |
474 | ||
475 | Note that this macro is optional: if it is not present, your module will | |
476 | not be removable (except for 'rmmod -f'). | |
477 | ||
dca1e58e MCC |
478 | :c:func:`try_module_get()`/:c:func:`module_put()` |
479 | ------------------------------------------------- | |
480 | ||
481 | Defined in ``include/linux/module.h`` | |
c4fcd7ca MCC |
482 | |
483 | These manipulate the module usage count, to protect against removal (a | |
484 | module also can't be removed if another module uses one of its exported | |
485 | symbols: see below). Before calling into module code, you should call | |
486 | :c:func:`try_module_get()` on that module: if it fails, then the | |
487 | module is being removed and you should act as if it wasn't there. | |
488 | Otherwise, you can safely enter the module, and call | |
489 | :c:func:`module_put()` when you're finished. | |
490 | ||
491 | Most registerable structures have an owner field, such as in the | |
492 | :c:type:`struct file_operations <file_operations>` structure. | |
493 | Set this field to the macro ``THIS_MODULE``. | |
494 | ||
495 | Wait Queues ``include/linux/wait.h`` | |
496 | ==================================== | |
497 | ||
498 | **[SLEEPS]** | |
499 | ||
500 | A wait queue is used to wait for someone to wake you up when a certain | |
501 | condition is true. They must be used carefully to ensure there is no | |
dca1e58e | 502 | race condition. You declare a :c:type:`wait_queue_head_t`, and then processes |
650fc870 | 503 | which want to wait for that condition declare a :c:type:`wait_queue_entry_t` |
c4fcd7ca MCC |
504 | referring to themselves, and place that in the queue. |
505 | ||
506 | Declaring | |
507 | --------- | |
508 | ||
509 | You declare a ``wait_queue_head_t`` using the | |
510 | :c:func:`DECLARE_WAIT_QUEUE_HEAD()` macro, or using the | |
511 | :c:func:`init_waitqueue_head()` routine in your initialization | |
512 | code. | |
513 | ||
514 | Queuing | |
515 | ------- | |
516 | ||
517 | Placing yourself in the waitqueue is fairly complex, because you must | |
518 | put yourself in the queue before checking the condition. There is a | |
519 | macro to do this: :c:func:`wait_event_interruptible()` | |
dca1e58e | 520 | (``include/linux/wait.h``). The first argument is the wait queue head, and |
c4fcd7ca | 521 | the second is an expression which is evaluated; the macro returns 0 when |
dca1e58e | 522 | this expression is true, or ``-ERESTARTSYS`` if a signal is received. The |
c4fcd7ca MCC |
523 | :c:func:`wait_event()` version ignores signals. |
524 | ||
525 | Waking Up Queued Tasks | |
526 | ---------------------- | |
527 | ||
c1de03a4 | 528 | Call :c:func:`wake_up()` (``include/linux/wait.h``), which will wake |
c4fcd7ca MCC |
529 | up every process in the queue. The exception is if one has |
530 | ``TASK_EXCLUSIVE`` set, in which case the remainder of the queue will | |
531 | not be woken. There are other variants of this basic function available | |
532 | in the same header. | |
533 | ||
534 | Atomic Operations | |
535 | ================= | |
536 | ||
537 | Certain operations are guaranteed atomic on all platforms. The first | |
dca1e58e MCC |
538 | class of operations work on :c:type:`atomic_t` (``include/asm/atomic.h``); |
539 | this contains a signed integer (at least 32 bits long), and you must use | |
540 | these functions to manipulate or read :c:type:`atomic_t` variables. | |
c4fcd7ca MCC |
541 | :c:func:`atomic_read()` and :c:func:`atomic_set()` get and set |
542 | the counter, :c:func:`atomic_add()`, :c:func:`atomic_sub()`, | |
543 | :c:func:`atomic_inc()`, :c:func:`atomic_dec()`, and | |
544 | :c:func:`atomic_dec_and_test()` (returns true if it was | |
545 | decremented to zero). | |
546 | ||
547 | Yes. It returns true (i.e. != 0) if the atomic variable is zero. | |
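The usual place this matters is reference counting: whoever drops the last reference sees "true" and frees the object. The following userspace sketch imitates that pattern with C11 ``<stdatomic.h>``. The ``sketch_`` names are hypothetical, and this is an analogy to the kernel's ``atomic_t`` API, not its implementation.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical reference-counted object. */
struct sketch_object {
    atomic_int refcount;
};

/*
 * Decrement and report whether the result is zero.
 * atomic_fetch_sub() returns the value *before* the subtraction,
 * so an old value of 1 means the counter just reached zero.
 */
static bool sketch_dec_and_test(atomic_int *v)
{
    return atomic_fetch_sub(v, 1) == 1;
}

/* Drop references one at a time; count the drops until the last one. */
static int sketch_puts_until_zero(int initial)
{
    struct sketch_object obj;
    int puts = 0;

    atomic_init(&obj.refcount, initial);
    do {
        puts++;
    } while (!sketch_dec_and_test(&obj.refcount));
    return puts;
}
```

Only the final drop returns true, so exactly one caller gets to do the cleanup even if several CPUs drop references concurrently.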
548 | ||
549 | Note that these functions are slower than normal arithmetic, and so | |
550 | should not be used unnecessarily. | |
551 | ||
552 | The second class of atomic operations is atomic bit operations on an | |
553 | ``unsigned long``, defined in ``include/linux/bitops.h``. These | |
554 | operations generally take a pointer to the bit pattern, and a bit | |
555 | number: 0 is the least significant bit. :c:func:`set_bit()`, | |
556 | :c:func:`clear_bit()` and :c:func:`change_bit()` set, clear, | |
557 | and flip the given bit. :c:func:`test_and_set_bit()`, | |
558 | :c:func:`test_and_clear_bit()` and | |
559 | :c:func:`test_and_change_bit()` do the same thing, except return | |
560 | true if the bit was previously set; these are particularly useful for | |
561 | atomically setting flags. | |
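A typical use of the test-and-set form is a "first one in wins" flag. The sketch below mimics it in userspace with C11 atomics. The ``sketch_`` names are hypothetical, and the example assumes the bit number is smaller than the number of bits in an ``unsigned long``.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/*
 * Atomically set bit `nr` and return whether it was already set.
 * Bit 0 is the least significant bit; `nr` must be less than the
 * width of unsigned long in this sketch.
 */
static bool sketch_test_and_set_bit(unsigned int nr, atomic_ulong *addr)
{
    unsigned long mask = 1UL << nr;
    unsigned long old = atomic_fetch_or(addr, mask);

    return (old & mask) != 0;
}

/* Try to claim bit 0 `attempts` times; count how many attempts won. */
static int sketch_count_winners(int attempts)
{
    atomic_ulong flags;
    int winners = 0;
    int i;

    atomic_init(&flags, 0);
    for (i = 0; i < attempts; i++) {
        /* A false return means we were the one who set the bit. */
        if (!sketch_test_and_set_bit(0, &flags))
            winners++;
    }
    return winners;
}
```

However many attempts are made, only the first one observes the bit clear, which is why the returned old value is enough to decide ownership.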
562 | ||
563 | It is possible to call these operations with bit indices greater than | |
dca1e58e | 564 | ``BITS_PER_LONG``. The resulting behavior is strange on big-endian |
c4fcd7ca MCC |
565 | platforms though so it is a good idea not to do this. |
566 | ||
567 | Symbols | |
568 | ======= | |
569 | ||
570 | Within the kernel proper, the normal linking rules apply (ie. unless a | |
571 | symbol is declared to be file scope with the ``static`` keyword, it can | |
572 | be used anywhere in the kernel). However, for modules, a special | |
573 | exported symbol table is kept which limits the entry points to the | |
574 | kernel proper. Modules can also export symbols. | |
575 | ||
dca1e58e MCC |
576 | :c:func:`EXPORT_SYMBOL()` |
577 | ------------------------- | |
578 | ||
579 | Defined in ``include/linux/export.h`` | |
c4fcd7ca MCC |
580 | |
581 | This is the classic method of exporting a symbol: dynamically loaded | |
582 | modules will be able to use the symbol as normal. | |
583 | ||
dca1e58e MCC |
584 | :c:func:`EXPORT_SYMBOL_GPL()` |
585 | ----------------------------- | |
586 | ||
587 | Defined in ``include/linux/export.h`` | |
c4fcd7ca MCC |
588 | |
589 | Similar to :c:func:`EXPORT_SYMBOL()` except that the symbols | |
590 | exported by :c:func:`EXPORT_SYMBOL_GPL()` can only be seen by | |
591 | modules with a :c:func:`MODULE_LICENSE()` that specifies a GPL | |
592 | compatible license. It implies that the function is considered an | |
593 | internal implementation issue, and not really an interface. Some | |
594 | maintainers and developers may however require EXPORT_SYMBOL_GPL() | |
595 | when adding any new APIs or functionality. | |
596 | ||
c4f4af40 MM |
597 | :c:func:`EXPORT_SYMBOL_NS()` |
598 | ---------------------------- | |
599 | ||
600 | Defined in ``include/linux/export.h`` | |
601 | ||
602 | This is the variant of :c:func:`EXPORT_SYMBOL()` that allows specifying a symbol | |
603 | namespace. Symbol Namespaces are documented in | |
7f3f7bfb | 604 | Documentation/core-api/symbol-namespaces.rst |
c4f4af40 MM |
605 | |
606 | :c:func:`EXPORT_SYMBOL_NS_GPL()` | |
607 | -------------------------------- | |
608 | ||
609 | Defined in ``include/linux/export.h`` | |
610 | ||
611 | This is the variant of :c:func:`EXPORT_SYMBOL_GPL()` that allows specifying a symbol | |
612 | namespace. Symbol Namespaces are documented in | |
7f3f7bfb | 613 | Documentation/core-api/symbol-namespaces.rst |
c4f4af40 | 614 | |
c4fcd7ca MCC |
615 | Routines and Conventions |
616 | ======================== | |
617 | ||
618 | Double-linked lists ``include/linux/list.h`` | |
619 | -------------------------------------------- | |
620 | ||
621 | There used to be three sets of linked-list routines in the kernel | |
622 | headers, but this one is the winner. If you don't have some particular | |
623 | pressing need for a singly-linked list, it's a good choice. | |
624 | ||
625 | In particular, :c:func:`list_for_each_entry()` is useful. | |
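The idea behind these lists is that the link structure is embedded in your own structure, and the iteration macro recovers the enclosing structure from the link. Here is a simplified userspace imitation. All ``sketch_`` names are hypothetical, and unlike the kernel macro this version takes the entry type as an explicit argument so that it does not need ``typeof``.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal circular doubly-linked list node, embedded in user structures. */
struct sketch_list_head {
    struct sketch_list_head *next, *prev;
};

/* An empty list is a head that points at itself. */
static void sketch_list_init(struct sketch_list_head *head)
{
    head->next = head;
    head->prev = head;
}

/* Insert `node` just before `head`, i.e. at the tail of the list. */
static void sketch_list_add_tail(struct sketch_list_head *node,
                                 struct sketch_list_head *head)
{
    node->prev = head->prev;
    node->next = head;
    head->prev->next = node;
    head->prev = node;
}

/* Recover the enclosing structure from a pointer to its member. */
#define sketch_container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* Walk every entry of `type` linked through `member`. */
#define sketch_list_for_each_entry(pos, head, type, member)           \
    for (pos = sketch_container_of((head)->next, type, member);       \
         &pos->member != (head);                                      \
         pos = sketch_container_of(pos->member.next, type, member))

/* Hypothetical payload structure with an embedded link. */
struct sketch_item {
    int value;
    struct sketch_list_head link;
};

/* Build a three-element list and sum the payloads by iterating it. */
static int sketch_demo_sum(void)
{
    struct sketch_list_head head;
    struct sketch_item a = { .value = 1 };
    struct sketch_item b = { .value = 2 };
    struct sketch_item c = { .value = 3 };
    struct sketch_item *it;
    int total = 0;

    sketch_list_init(&head);
    sketch_list_add_tail(&a.link, &head);
    sketch_list_add_tail(&b.link, &head);
    sketch_list_add_tail(&c.link, &head);

    sketch_list_for_each_entry(it, &head, struct sketch_item, link)
        total += it->value;
    return total;
}
```

The loop body only ever sees ``struct sketch_item`` pointers; the list plumbing stays hidden inside the macro.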
626 | ||
627 | Return Conventions | |
628 | ------------------ | |
629 | ||
630 | For code called in user context, it's very common to defy C convention, | |
dca1e58e | 631 | and return 0 for success, and a negative error number (eg. ``-EFAULT``) for |
c4fcd7ca MCC |
632 | failure. This can be unintuitive at first, but it's fairly widespread in |
633 | the kernel. | |
634 | ||
dca1e58e | 635 | Using :c:func:`ERR_PTR()` (``include/linux/err.h``) to encode a |
c4fcd7ca MCC |
636 | negative error number into a pointer, and :c:func:`IS_ERR()` and |
637 | :c:func:`PTR_ERR()` to get it back out again, avoids a separate | |
638 | pointer parameter for the error number. Icky, but in a good way. | |
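The trick relies on the top few thousand addresses never being valid pointers, so a small negative number cast to a pointer can be told apart from a real one. The userspace sketch below shows the idea. The ``sketch_`` helpers are hypothetical re-creations, not the kernel's definitions, and the 4095 limit is an assumption of this example.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Assumed largest errno value that can be encoded in a pointer. */
#define SKETCH_MAX_ERRNO 4095

/* Encode a negative errno as a pointer value. */
static void *sketch_err_ptr(long error)
{
    return (void *)(intptr_t)error;
}

/* Decode the errno back out of an error pointer. */
static long sketch_ptr_err(const void *ptr)
{
    return (long)(intptr_t)ptr;
}

/* True if the pointer lies in the top SKETCH_MAX_ERRNO addresses. */
static int sketch_is_err(const void *ptr)
{
    return (uintptr_t)ptr >= (uintptr_t)-SKETCH_MAX_ERRNO;
}

static int sketch_table[4] = { 10, 20, 30, 40 };

/* Hypothetical lookup: a valid pointer on success, an encoded errno on failure. */
static int *sketch_lookup(int index)
{
    if (index < 0 || index >= 4)
        return sketch_err_ptr(-EINVAL);
    return &sketch_table[index];
}
```

A caller checks the result with ``sketch_is_err()`` before dereferencing it, and uses ``sketch_ptr_err()`` to recover the error code to pass up the call chain.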
639 | ||
640 | Breaking Compilation | |
641 | -------------------- | |
642 | ||
643 | Linus and the other developers sometimes change function or structure | |
644 | names in development kernels; this is not done just to keep everyone on | |
645 | their toes: it reflects a fundamental change (eg. can no longer be | |
646 | called with interrupts on, or does extra checks, or doesn't do checks | |
647 | which were caught before). Usually this is accompanied by a fairly | |
648 | complete note to the linux-kernel mailing list; search the archive. | |
649 | Simply doing a global replace on the file usually makes things **worse**. | |
650 | ||
651 | Initializing structure members | |
652 | ------------------------------ | |
653 | ||
654 | The preferred method of initializing structures is to use designated | |
dca1e58e | 655 | initialisers, as defined by ISO C99, eg:: |
c4fcd7ca MCC |
656 | |
657 | static struct block_device_operations opt_fops = { | |
658 | .open = opt_open, | |
659 | .release = opt_release, | |
660 | .ioctl = opt_ioctl, | |
661 | .check_media_change = opt_media_change, | |
662 | }; | |
663 | ||
664 | ||
665 | This makes it easy to grep for, and makes it clear which structure | |
666 | fields are set. You should do this because it looks cool. | |
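
Looking cool aside, there are practical benefits: members can be listed in any order, and any member you leave out is zero-initialized. A small userspace sketch, using a hypothetical ``struct gizmo_ops`` invented for this example:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical operations table, for illustration only. */
struct gizmo_ops {
    int (*open)(int id);
    int (*release)(int id);
    int (*ioctl)(int id, unsigned int cmd);
};

static int gizmo_open(int id) { return id > 0 ? 0 : -1; }
static int gizmo_release(int id) { (void)id; return 0; }

/*
 * Members are named, so their order here need not match the structure
 * definition, and .ioctl (not mentioned) is implicitly NULL.
 */
static const struct gizmo_ops gizmo_default_ops = {
    .release = gizmo_release,
    .open    = gizmo_open,
};

static void gizmo_ops_demo(void)
{
    assert(gizmo_default_ops.open == gizmo_open);
    assert(gizmo_default_ops.release == gizmo_release);
    assert(gizmo_default_ops.ioctl == NULL);
}
```

If someone later adds or reorders members in the structure, an initializer written this way keeps compiling and keeps meaning the same thing.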
667 | ||
668 | GNU Extensions | |
669 | -------------- | |
670 | ||
671 | GNU Extensions are explicitly allowed in the Linux kernel. Note that | |
672 | some of the more complex ones are not very well supported, due to lack | |
673 | of general use, but the following are considered standard (see the GCC | |
674 | info page section "C Extensions" for more details; yes, really the info | |
675 | page: the man page is only a short summary of what is in info). | |
676 | ||
677 | - Inline functions | |
678 | ||
679 | - Statement expressions (i.e. the ``({`` and ``})`` constructs). | |
680 | ||
681 | - Declaring attributes of a function / variable / type | |
682 | (__attribute__) | |
683 | ||
684 | - typeof | |
685 | ||
686 | - Zero length arrays | |
687 | ||
688 | - Macro varargs | |
689 | ||
690 | - Arithmetic on void pointers | |
691 | ||
692 | - Non-Constant initializers | |
693 | ||
694 | - Assembler Instructions (not outside arch/ and include/asm/) | |
695 | ||
696 | - Function names as strings (__func__). | |
697 | ||
698 | - __builtin_constant_p() | |
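
A few of these are easier to appreciate in use. The sketch below is plain userspace C built with GCC or Clang; ``sketch_min()``, ``skip_bytes()`` and ``whoami()`` are made-up names for this example, not kernel helpers.

```c
#include <assert.h>
#include <string.h>

/*
 * Statement expression plus typeof: a min-style macro that evaluates
 * each argument exactly once, whatever its type.
 */
#define sketch_min(a, b) ({            \
    __typeof__(a) _a = (a);            \
    __typeof__(b) _b = (b);            \
    _a < _b ? _a : _b; })

/* Arithmetic on void pointers: GCC treats sizeof(void) as 1. */
static void *skip_bytes(void *p, unsigned long n)
{
    return p + n;
}

/* __func__ expands to the name of the enclosing function. */
static const char *whoami(void)
{
    return __func__;
}

static void gnu_ext_demo(void)
{
    int i = 3;
    int m;
    char buf[8];

    m = sketch_min(i++, 10);
    assert(m == 3);
    assert(i == 4);            /* the argument was incremented only once */

    assert(skip_bytes(buf, 2) == (void *)&buf[2]);
    assert(strcmp(whoami(), "whoami") == 0);

    /* A literal is always a compile-time constant. */
    assert(__builtin_constant_p(42));
}
```

A naive ``((a) < (b) ? (a) : (b))`` macro would have incremented ``i`` twice, which is exactly the problem statement expressions solve.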
699 | ||
700 | Be wary when using long long in the kernel: the code gcc generates for | |
701 | it is horrible and, worse, division and multiplication do not work on | |
702 | i386 because the GCC runtime functions for them are missing from the | |
703 | kernel environment. | |
704 | ||
705 | C++ | |
706 | --- | |
707 | ||
708 | Using C++ in the kernel is usually a bad idea, because the kernel does | |
709 | not provide the necessary runtime environment and the include files are | |
710 | not tested for it. It is still possible, but not recommended. If you | |
711 | really want to do this, forget about exceptions at least. | |
712 | ||
423860a6 MW |
713 | #if |
714 | --- | |
c4fcd7ca MCC |
715 | |
716 | It is generally considered cleaner to use macros in header files (or at | |
717 | the top of .c files) to abstract away functions rather than using ``#if`` | |
718 | pre-processor statements throughout the source code. | |
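
One way this can look, as a userspace sketch: ``CONFIG_FROB_STATS``, ``frob_stats_inc()`` and ``frob_do_work()`` are hypothetical names, and the option is simply defined by hand here rather than coming from the kernel configuration.

```c
#include <assert.h>

/* Hypothetical option; remove this line to compile the counting away. */
#define CONFIG_FROB_STATS 1

static int frob_events;

/* "Header" part: the only place that tests the option. */
#ifdef CONFIG_FROB_STATS
#define frob_stats_inc()  (frob_events++)
#else
#define frob_stats_inc()  do { } while (0)
#endif

/* The caller stays free of #if clutter either way. */
static int frob_do_work(int x)
{
    frob_stats_inc();
    return x * 2;
}

static void frob_stats_demo(void)
{
    frob_events = 0;
    assert(frob_do_work(21) == 42);
    assert(frob_events == 1);
}
```

The conditional lives in one place, so the function bodies read the same whether or not the feature is configured in.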
719 | ||
720 | Putting Your Stuff in the Kernel | |
721 | ================================ | |
722 | ||
723 | In order to get your stuff into shape for official inclusion, or even to | |
724 | make a neat patch, there's administrative work to be done: | |
725 | ||
726 | - Figure out whose pond you've been pissing in. Look at the top of the | |
727 | source files, inside the ``MAINTAINERS`` file, and last of all in the | |
728 | ``CREDITS`` file. You should coordinate with this person to make sure | |
729 | you're not duplicating effort, or trying something that's already | |
730 | been rejected. | |
731 | ||
732 | Make sure you put your name and email address at the top of any files | |
733 | you create or mangle significantly. This is the first place people | |
734 | will look when they find a bug, or when **they** want to make a change. | |
735 | ||
736 | - Usually you want a configuration option for your kernel hack. Edit | |
737 | ``Kconfig`` in the appropriate directory. The Config language is | |
738 | simple to use by cut and paste, and there's complete documentation in | |
cd238eff | 739 | ``Documentation/kbuild/kconfig-language.rst``. |
c4fcd7ca MCC |
740 | |
741 | In your description of the option, make sure you address both the | |
742 | expert user and the user who knows nothing about your feature. | |
743 | Mention incompatibilities and issues here. **Definitely** end your | |
744 | description with “if in doubt, say N” (or, occasionally, “Y”); this | |
745 | is for people who have no idea what you are talking about. | |
746 | ||
747 | - Edit the ``Makefile``: the CONFIG variables are exported here so you | |
748 | can usually just add an ``obj-$(CONFIG_xxx) += xxx.o`` line. The syntax | |
cd238eff | 749 | is documented in ``Documentation/kbuild/makefiles.rst``. |
c4fcd7ca MCC |
750 | |
751 | - Put yourself in ``CREDITS`` if you've done something noteworthy, | |
752 | usually beyond a single file (your name should be at the top of the | |
753 | source files anyway). ``MAINTAINERS`` means you want to be consulted | |
754 | when changes are made to a subsystem, and hear about bugs; it implies | |
755 | a more-than-passing commitment to some part of the code. | |
756 | ||
757 | - Finally, don't forget to read | |
758 | ``Documentation/process/submitting-patches.rst`` and possibly | |
759 | ``Documentation/process/submitting-drivers.rst``. | |
760 | ||
761 | Kernel Cantrips | |
762 | =============== | |
763 | ||
764 | Some favorites from browsing the source. Feel free to add to this list. | |
765 | ||
dca1e58e | 766 | ``arch/x86/include/asm/delay.h``:: |
c4fcd7ca MCC |
767 | |
768 | #define ndelay(n) (__builtin_constant_p(n) ? \ | |
769 | ((n) > 20000 ? __bad_ndelay() : __const_udelay((n) * 5ul)) : \ | |
770 | __ndelay(n)) | |
771 | ||
772 | ||
dca1e58e | 773 | ``include/linux/fs.h``:: |
c4fcd7ca MCC |
774 | |
775 | /* | |
776 | * Kernel pointers have redundant information, so we can use a | |
777 | * scheme where we can return either an error code or a dentry | |
778 | * pointer with the same return value. | |
779 | * | |
780 | * This should be a per-architecture thing, to allow different | |
781 | * error and pointer decisions. | |
782 | */ | |
783 | #define ERR_PTR(err) ((void *)((long)(err))) | |
784 | #define PTR_ERR(ptr) ((long)(ptr)) | |
785 | #define IS_ERR(ptr) ((unsigned long)(ptr) > (unsigned long)(-1000)) | |
786 | ||
dca1e58e | 787 | ``arch/x86/include/asm/uaccess_32.h``:: |
c4fcd7ca MCC |
788 | |
789 | #define copy_to_user(to,from,n) \ | |
790 | (__builtin_constant_p(n) ? \ | |
791 | __constant_copy_to_user((to),(from),(n)) : \ | |
792 | __generic_copy_to_user((to),(from),(n))) | |
793 | ||
794 | ||
dca1e58e | 795 | ``arch/sparc/kernel/head.S``:: |
c4fcd7ca MCC |
796 | |
797 | /* | |
798 | * Sun people can't spell worth damn. "compatability" indeed. | |
799 | * At least we *know* we can't spell, and use a spell-checker. | |
800 | */ | |
801 | ||
802 | /* Uh, actually Linus it is I who cannot spell. Too much murky | |
803 | * Sparc assembly will do this to ya. | |
804 | */ | |
805 | C_LABEL(cputypvar): | |
806 | .asciz "compatibility" | |
807 | ||
808 | /* Tested on SS-5, SS-10. Probably someone at Sun applied a spell-checker. */ | |
809 | .align 4 | |
810 | C_LABEL(cputypvar_sun4m): | |
811 | .asciz "compatible" | |
812 | ||
813 | ||
dca1e58e | 814 | ``arch/sparc/lib/checksum.S``:: |
c4fcd7ca MCC |
815 | |
816 | /* Sun, you just can't beat me, you just can't. Stop trying, | |
817 | * give up. I'm serious, I am going to kick the living shit | |
818 | * out of you, game over, lights out. | |
819 | */ | |
820 | ||
821 | ||
822 | Thanks | |
823 | ====== | |
824 | ||
825 | Thanks to Andi Kleen for the idea, answering my questions, fixing my | |
826 | mistakes, filling content, etc. Philipp Rumpf for more spelling and | |
827 | clarity fixes, and some excellent non-obvious points. Werner Almesberger | |
828 | for giving me a great summary of :c:func:`disable_irq()`, and Jes | |
829 | Sorensen and Andrea Arcangeli added caveats. Michael Elizabeth Chastain | |
830 | for checking and adding to the Configure section. Telsa Gwynne for | |
831 | teaching me DocBook. |