============================================
Unreliable Guide To Hacking The Linux Kernel
============================================

:Author: Rusty Russell

Introduction
============

Welcome, gentle reader, to Rusty's Remarkably Unreliable Guide to Linux
Kernel Hacking. This document describes the common routines and general
requirements for kernel code: its goal is to serve as a primer for Linux
kernel development for experienced C programmers. I avoid implementation
details: that's what the code is for, and I ignore whole tracts of
useful routines.

Before you read this, please understand that I never wanted to write
this document, being grossly under-qualified, but I always wanted to
read it, and this was the only way. I hope it will grow into a
compendium of best practice, common starting points and random
information.

The Players
===========

At any time each of the CPUs in a system can be:

- not associated with any process, serving a hardware interrupt;

- not associated with any process, serving a softirq or tasklet;

- running in kernel space, associated with a process (user context);

- running a process in user space.

There is an ordering between these. The bottom two can preempt each
other, but above that is a strict hierarchy: each can only be preempted
by the ones above it. For example, while a softirq is running on a CPU,
no other softirq will preempt it, but a hardware interrupt can. However,
any other CPUs in the system execute independently.

We'll see a number of ways that the user context can block interrupts,
to become truly non-preemptable.

User Context
------------

User context is when you are coming in from a system call or other trap:
like userspace, you can be preempted by more important tasks and by
interrupts. You can sleep, by calling :c:func:`schedule()`.

.. note::

    You are always in user context on module load and unload, and on
    operations on the block device layer.

In user context, the ``current`` pointer (indicating the task we are
currently executing) is valid, and :c:func:`in_interrupt()`
(``include/linux/preempt.h``) is false.

.. warning::

    Beware that if you have preemption or softirqs disabled (see below),
    :c:func:`in_interrupt()` will return a false positive.

Hardware Interrupts (Hard IRQs)
-------------------------------

Timer ticks, network cards and keyboard are examples of real hardware
which produce interrupts at any time. The kernel runs interrupt
handlers, which service the hardware. The kernel guarantees that this
handler is never re-entered: if the same interrupt arrives, it is queued
(or dropped). Because it disables interrupts, this handler has to be
fast: frequently it simply acknowledges the interrupt, marks a 'software
interrupt' for execution and exits.

You can tell you are in a hardware interrupt, because
:c:func:`in_irq()` returns true.

.. warning::

    Beware that this will return a false positive if interrupts are
    disabled (see below).

Software Interrupt Context: Softirqs and Tasklets
-------------------------------------------------

Whenever a system call is about to return to userspace, or a hardware
interrupt handler exits, any 'software interrupts' which are marked
pending (usually by hardware interrupts) are run (``kernel/softirq.c``).

Much of the real interrupt handling work is done here. Early in the
transition to SMP, there were only 'bottom halves' (BHs), which didn't
take advantage of multiple CPUs. Shortly after we switched from wind-up
computers made of match-sticks and snot, we abandoned this limitation
and switched to 'softirqs'.

``include/linux/interrupt.h`` lists the different softirqs. A very
important softirq is the timer softirq (``include/linux/timer.h``): you
can register to have it call functions for you in a given length of
time.

Softirqs are often a pain to deal with, since the same softirq will run
simultaneously on more than one CPU. For this reason, tasklets
(``include/linux/interrupt.h``) are more often used: they are
dynamically-registrable (meaning you can have as many as you want), and
they also guarantee that any tasklet will only run on one CPU at any
time, although different tasklets can run simultaneously.

.. warning::

    The name 'tasklet' is misleading: they have nothing to do with
    'tasks', and probably more to do with some bad vodka Alexey
    Kuznetsov had at the time.

You can tell you are in a softirq (or tasklet) using the
:c:func:`in_softirq()` macro (``include/linux/preempt.h``).

.. warning::

    Beware that this will return a false positive if a
    :ref:`bottom half lock <local_bh_disable>` is held.

Some Basic Rules
================

No memory protection
    If you corrupt memory, whether in user context or interrupt context,
    the whole machine will crash. Are you sure you can't do what you
    want in userspace?

No floating point or MMX
    The FPU context is not saved; even in user context the FPU state
    probably won't correspond with the current process: you would mess
    with some user process' FPU state. If you really want to do this,
    you would have to explicitly save/restore the full FPU state (and
    avoid context switches). It is generally a bad idea; use fixed point
    arithmetic first.

A rigid stack limit
    Depending on configuration options the kernel stack is about 3K to
    6K for most 32-bit architectures: it's about 14K on most 64-bit
    archs, and often shared with interrupts so you can't use it all.
    Avoid deep recursion and huge local arrays on the stack (allocate
    them dynamically instead).

The Linux kernel is portable
    Let's keep it that way. Your code should be 64-bit clean, and
    endian-independent. You should also minimize CPU specific stuff,
    e.g. inline assembly should be cleanly encapsulated and minimized to
    ease porting. Generally it should be restricted to the
    architecture-dependent part of the kernel tree.

ioctls: Not writing a new system call
=====================================

A system call generally looks like this::

    asmlinkage long sys_mycall(int arg)
    {
            return 0;
    }

First, in most cases you don't want to create a new system call. You
create a character device and implement an appropriate ioctl for it.
This is much more flexible than system calls, doesn't have to be entered
in every architecture's ``include/asm/unistd.h`` and
``arch/kernel/entry.S`` file, and is much more likely to be accepted by
Linus.

If all your routine does is read or write some parameter, consider
implementing a :c:func:`sysfs()` interface instead.

Inside the ioctl you're in user context to a process. When an error
occurs you return a negated errno (see
``include/uapi/asm-generic/errno-base.h``,
``include/uapi/asm-generic/errno.h`` and ``include/linux/errno.h``),
otherwise you return 0.

After you have slept you should check if a signal occurred: the
Unix/Linux way of handling signals is to temporarily exit the system
call with the ``-ERESTARTSYS`` error. The system call entry code will
switch back to user context, process the signal handler and then your
system call will be restarted (unless the user disabled that). So you
should be prepared to process the restart, e.g. if you're in the middle
of manipulating some data structure.

::

    if (signal_pending(current))
            return -ERESTARTSYS;

If you're doing longer computations: first think userspace. If you
**really** want to do it in kernel you should regularly check if you need
to give up the CPU (remember there is cooperative multitasking per CPU).
Idiom::

    cond_resched(); /* Will sleep */

A short note on interface design: the UNIX system call motto is "Provide
mechanism not policy".

Recipes for Deadlock
====================

You cannot call any routines which may sleep, unless:

- You are in user context.

- You do not own any spinlocks.

- You have interrupts enabled (actually, Andi Kleen says that the
  scheduling code will enable them for you, but that's probably not
  what you wanted).

Note that some functions may sleep implicitly: common ones are the user
space access functions (\*_user) and memory allocation functions
without ``GFP_ATOMIC``.

You should always compile your kernel with ``CONFIG_DEBUG_ATOMIC_SLEEP``
on, and it will warn you if you break these rules. If you **do** break
the rules, you will eventually lock up your box.

Really.

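As a rough sketch of these rules, do anything that may sleep before you
take a spinlock, and keep the locked region free of sleeping calls. The
structure, list and lock names below are hypothetical, and the sketch
assumes the list is only ever touched from user context (otherwise a
plain :c:func:`spin_lock()` would not be enough)::

    #include <linux/errno.h>
    #include <linux/list.h>
    #include <linux/slab.h>
    #include <linux/spinlock.h>

    /* Hypothetical names, for illustration only. */
    struct demo_item {
            struct list_head node;
            int value;
    };

    static LIST_HEAD(demo_list);
    static DEFINE_SPINLOCK(demo_lock);

    /* Must be called from user context: the allocation may sleep. */
    static int demo_add(int value)
    {
            struct demo_item *item;

            /* GFP_KERNEL may sleep, so allocate before taking the lock. */
            item = kmalloc(sizeof(*item), GFP_KERNEL);
            if (!item)
                    return -ENOMEM;
            item->value = value;

            /* Nothing between lock and unlock may sleep. */
            spin_lock(&demo_lock);
            list_add(&item->node, &demo_list);
            spin_unlock(&demo_lock);

            return 0;
    }
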
Common Routines
===============

:c:func:`printk()`
------------------

Defined in ``include/linux/printk.h``

:c:func:`printk()` feeds kernel messages to the console, dmesg, and
the syslog daemon. It is useful for debugging and reporting errors, and
can be used inside interrupt context, but use with caution: a machine
which has its console flooded with printk messages is unusable. It uses
a format string mostly compatible with ANSI C printf, and C string
concatenation to give it a first "priority" argument::

    printk(KERN_INFO "i = %u\n", i);

See ``include/linux/kern_levels.h`` for other ``KERN_`` values; these are
interpreted by syslog as the level. Special case: for printing an IP
address use::

    __be32 ipaddress;
    printk(KERN_INFO "my ip: %pI4\n", &ipaddress);

:c:func:`printk()` internally uses a 1K buffer and does not catch
overruns. Make sure that will be enough.

.. note::

    You will know when you are a real kernel hacker when you start
    typoing printf as printk in your user programs :)

.. note::

    Another sidenote: the original Unix Version 6 sources had a comment
    on top of its printf function: "Printf should not be used for
    chit-chat". You should follow that advice.

:c:func:`copy_to_user()` / :c:func:`copy_from_user()` / :c:func:`get_user()` / :c:func:`put_user()`
---------------------------------------------------------------------------------------------------

Defined in ``include/linux/uaccess.h`` / ``asm/uaccess.h``

**[SLEEPS]**

:c:func:`get_user()` and :c:func:`put_user()` are used to get
and put single values (such as an int, char, or long) from and to
userspace. A pointer into userspace should never be simply dereferenced:
data should be copied using these routines. Both return ``-EFAULT`` or
0.

:c:func:`copy_to_user()` and :c:func:`copy_from_user()` are
more general: they copy an arbitrary amount of data to and from
userspace.

.. warning::

    Unlike :c:func:`put_user()` and :c:func:`get_user()`, they
    return the amount of uncopied data (ie. 0 still means success).

[Yes, this moronic interface makes me cringe. The flamewar comes up
every year or so. --RR.]

These functions may sleep implicitly. They should never be called
outside user context (it makes no sense), with interrupts disabled, or
with a spinlock held.

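To make the two return conventions concrete, here is a minimal sketch of
an ioctl handler. The command numbers, the parameter structure and the
``demo_`` names are all made up for illustration (a real driver would
build its command numbers with the ``_IOW()``/``_IOR()`` macros), so
treat it as a pattern rather than a finished interface::

    #include <linux/errno.h>
    #include <linux/fs.h>
    #include <linux/uaccess.h>

    /* Hypothetical parameter block and command numbers. */
    struct demo_params {
            int speed;
            int mode;
    };
    #define DEMO_SET_PARAMS 1
    #define DEMO_GET_SPEED  2

    static struct demo_params demo_current;

    static long demo_ioctl(struct file *file, unsigned int cmd,
                           unsigned long arg)
    {
            struct demo_params tmp;

            switch (cmd) {
            case DEMO_SET_PARAMS:
                    /* Non-zero means some bytes were NOT copied. */
                    if (copy_from_user(&tmp, (void __user *)arg, sizeof(tmp)))
                            return -EFAULT;
                    demo_current = tmp;
                    return 0;
            case DEMO_GET_SPEED:
                    /* put_user() already returns 0 or -EFAULT. */
                    return put_user(demo_current.speed, (int __user *)arg);
            default:
                    return -ENOTTY;
            }
    }

Note how the :c:func:`copy_from_user()` result is only tested for
non-zero and then translated into ``-EFAULT`` by hand, while the
:c:func:`put_user()` result can be returned directly.
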
:c:func:`kmalloc()`/:c:func:`kfree()`
-------------------------------------

Defined in ``include/linux/slab.h``

**[MAY SLEEP: SEE BELOW]**

These routines are used to dynamically request pointer-aligned chunks of
memory, like malloc and free do in userspace, but
:c:func:`kmalloc()` takes an extra flag word. Important values:

``GFP_KERNEL``
    May sleep and swap to free memory. Only allowed in user context, but
    is the most reliable way to allocate memory.

``GFP_ATOMIC``
    Don't sleep. Less reliable than ``GFP_KERNEL``, but may be called
    from interrupt context. You should **really** have a good
    out-of-memory error-handling strategy.

``GFP_DMA``
    Allocate ISA DMA lower than 16MB. If you don't know what that is you
    don't need it. Very unreliable.

If you see a "sleeping function called from invalid context" warning
message, then maybe you called a sleeping allocation function from
interrupt context without ``GFP_ATOMIC``. You should really fix that.
Run, don't walk.

If you are allocating at least ``PAGE_SIZE`` (``asm/page.h`` or
``asm/page_types.h``) bytes, consider using :c:func:`__get_free_pages()`
(``include/linux/gfp.h``). It takes an order argument (0 for page sized,
1 for double page, 2 for four pages etc.) and the same memory priority
flag word as above.

If you are allocating more than a page worth of bytes you can use
:c:func:`vmalloc()`. It'll allocate virtual memory in the kernel
map. This block is not contiguous in physical memory, but the MMU makes
it look like it is for you (so it'll only look contiguous to the CPUs,
not to external device drivers). If you really need large physically
contiguous memory for some weird device, you have a problem: it is
poorly supported in Linux because after some time memory fragmentation
in a running kernel makes it hard. The best way is to allocate the block
early in the boot process via the :c:func:`alloc_bootmem()`
routine.

Before inventing your own cache of often-used objects consider using a
slab cache in ``include/linux/slab.h``.

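A small sketch of the usual allocate, check, free pattern follows. The
helper name is hypothetical (the kernel already provides
:c:func:`kmemdup()` for this exact job), and it assumes the caller is in
user context so that ``GFP_KERNEL`` is allowed::

    #include <linux/slab.h>
    #include <linux/string.h>

    /* Hypothetical helper: duplicate a buffer from user context. */
    static void *demo_dup(const void *src, size_t len)
    {
            void *copy;

            copy = kmalloc(len, GFP_KERNEL);   /* may sleep */
            if (!copy)
                    return NULL;               /* always check for failure */
            memcpy(copy, src, len);
            return copy;                       /* caller releases with kfree() */
    }

From interrupt context the same code would have to pass ``GFP_ATOMIC``
instead, and the ``NULL`` branch becomes much more likely to be taken.
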
:c:func:`current()`
-------------------

Defined in ``include/asm/current.h``

This global variable (really a macro) contains a pointer to the current
task structure, so is only valid in user context. For example, when a
process makes a system call, this will point to the task structure of
the calling process. It is **not NULL** in interrupt context.

:c:func:`mdelay()`/:c:func:`udelay()`
-------------------------------------

Defined in ``include/asm/delay.h`` / ``include/linux/delay.h``

The :c:func:`udelay()` and :c:func:`ndelay()` functions can be
used for small pauses. Do not use large values with them as you risk
overflow - the helper function :c:func:`mdelay()` is useful here, or
consider :c:func:`msleep()`.

:c:func:`cpu_to_be32()`/:c:func:`be32_to_cpu()`/:c:func:`cpu_to_le32()`/:c:func:`le32_to_cpu()`
-----------------------------------------------------------------------------------------------

Defined in ``include/asm/byteorder.h``

The :c:func:`cpu_to_be32()` family (where the "32" can be replaced
by 64 or 16, and the "be" can be replaced by "le") are the general way
to do endian conversions in the kernel: they return the converted value.
All variations supply the reverse as well:
:c:func:`be32_to_cpu()`, etc.

There are two major variations of these functions: the pointer
variation, such as :c:func:`cpu_to_be32p()`, which take a pointer
to the given type, and return the converted value. The other variation
is the "in-situ" family, such as :c:func:`cpu_to_be32s()`, which
convert the value referred to by the pointer, and return void.

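For instance, a header that travels over the wire in big-endian form
could be filled in and read back like this. The structure and function
names are invented for the example; only the conversion helpers and the
``__be32``/``__be16`` annotation types come from the kernel::

    #include <linux/types.h>
    #include <asm/byteorder.h>

    /* Hypothetical on-the-wire header: all fields are big-endian. */
    struct demo_hdr {
            __be32 length;
            __be16 type;
    };

    static void demo_fill(struct demo_hdr *hdr, u32 length, u16 type)
    {
            hdr->length = cpu_to_be32(length);
            hdr->type = cpu_to_be16(type);
    }

    static u32 demo_length(const struct demo_hdr *hdr)
    {
            return be32_to_cpu(hdr->length);
    }

Keeping the stored fields typed as ``__be32`` rather than ``u32`` lets
static checkers notice when a conversion has been forgotten.
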
:c:func:`local_irq_save()`/:c:func:`local_irq_restore()`
--------------------------------------------------------

Defined in ``include/linux/irqflags.h``

These routines disable hard interrupts on the local CPU, and restore
them. They are reentrant, saving the previous state in their one
``unsigned long flags`` argument. If you know that interrupts are
enabled, you can simply use :c:func:`local_irq_disable()` and
:c:func:`local_irq_enable()`.

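The typical shape is a short critical section bracketed by the pair. In
this sketch the counter is hypothetical and is assumed to be touched
only on the local CPU, since these routines do nothing to keep other
CPUs out::

    #include <linux/irqflags.h>

    static unsigned int demo_counter;   /* hypothetical, local-CPU only */

    static void demo_bump(void)
    {
            unsigned long flags;

            local_irq_save(flags);      /* remember old state, disable IRQs */
            demo_counter++;
            local_irq_restore(flags);   /* put the state back as it was */
    }

Because the old state is restored rather than interrupts being blindly
enabled, ``demo_bump()`` is safe to call whether or not interrupts were
already off.
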
.. _local_bh_disable:

:c:func:`local_bh_disable()`/:c:func:`local_bh_enable()`
--------------------------------------------------------

Defined in ``include/linux/bottom_half.h``

These routines disable soft interrupts on the local CPU, and restore
them. They are reentrant; if soft interrupts were disabled before, they
will still be disabled after this pair of functions has been called.
They prevent softirqs and tasklets from running on the current CPU.

:c:func:`smp_processor_id()`
----------------------------

Defined in ``include/linux/smp.h``

:c:func:`get_cpu()` disables preemption (so you won't suddenly get
moved to another CPU) and returns the current processor number, between
0 and ``NR_CPUS``. Note that the CPU numbers are not necessarily
continuous. You return it again with :c:func:`put_cpu()` when you
are done.

If you know you cannot be preempted by another task (ie. you are in
interrupt context, or have preemption disabled) you can use
smp_processor_id().

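A minimal sketch of the :c:func:`get_cpu()`/:c:func:`put_cpu()` pairing
(the function name is made up for the example)::

    #include <linux/printk.h>
    #include <linux/smp.h>

    static void demo_report_cpu(void)
    {
            int cpu;

            cpu = get_cpu();        /* preemption is now disabled */
            pr_info("running on CPU %d\n", cpu);
            put_cpu();              /* preemption is enabled again */
    }

Anything that may sleep has to stay outside the region between the two
calls, for the same reasons given in Recipes for Deadlock.
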
``__init``/``__exit``/``__initdata``
------------------------------------

Defined in ``include/linux/init.h``

After boot, the kernel frees up a special section; functions marked with
``__init`` and data structures marked with ``__initdata`` are dropped
after boot is complete: similarly modules discard this memory after
initialization. ``__exit`` is used to declare a function which is only
required on exit: the function will be dropped if this file is not
compiled as a module. See the header file for use. Note that it makes no
sense for a function marked with ``__init`` to be exported to modules
with :c:func:`EXPORT_SYMBOL()` or :c:func:`EXPORT_SYMBOL_GPL()` - this
will break.

:c:func:`__initcall()`/:c:func:`module_init()`
----------------------------------------------

Defined in ``include/linux/init.h`` / ``include/linux/module.h``

Many parts of the kernel are well served as a module
(dynamically-loadable parts of the kernel). Using the
:c:func:`module_init()` and :c:func:`module_exit()` macros it
is easy to write code without #ifdefs which can operate both as a module
or built into the kernel.

The :c:func:`module_init()` macro defines which function is to be
called at module insertion time (if the file is compiled as a module),
or at boot time: if the file is not compiled as a module the
:c:func:`module_init()` macro becomes equivalent to
:c:func:`__initcall()`, which through linker magic ensures that
the function is called on boot.

The function can return a negative error number to cause module loading
to fail (unfortunately, this has no effect if the module is compiled
into the kernel). This function is called in user context with
interrupts enabled, so it can sleep.

:c:func:`module_exit()`
-----------------------

Defined in ``include/linux/module.h``

This macro defines the function to be called at module removal time (or
never, in the case of the file compiled into the kernel). It will only
be called if the module usage count has reached zero. This function can
also sleep, but cannot fail: everything must be cleaned up by the time
it returns.

Note that this macro is optional: if it is not present, your module will
not be removable (except for 'rmmod -f').

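Put together, the smallest useful module looks roughly like the sketch
below. The ``demo_`` names and messages are placeholders; the same file
builds either as a module or into the kernel without any #ifdefs::

    #include <linux/init.h>
    #include <linux/module.h>
    #include <linux/printk.h>

    static int __init demo_init(void)
    {
            pr_info("demo: loaded\n");
            return 0;   /* a negative errno here makes the load fail */
    }

    static void __exit demo_exit(void)
    {
            /* Cannot fail: undo everything demo_init() set up. */
            pr_info("demo: unloaded\n");
    }

    module_init(demo_init);
    module_exit(demo_exit);

    MODULE_LICENSE("GPL");
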
:c:func:`try_module_get()`/:c:func:`module_put()`
-------------------------------------------------

Defined in ``include/linux/module.h``

These manipulate the module usage count, to protect against removal (a
module also can't be removed if another module uses one of its exported
symbols: see below). Before calling into module code, you should call
:c:func:`try_module_get()` on that module: if it fails, then the
module is being removed and you should act as if it wasn't there.
Otherwise, you can safely enter the module, and call
:c:func:`module_put()` when you're finished.

Most registrable structures have an owner field, such as in the
:c:type:`struct file_operations <file_operations>` structure.
Set this field to the macro ``THIS_MODULE``.

Wait Queues ``include/linux/wait.h``
====================================

**[SLEEPS]**

A wait queue is used to wait for someone to wake you up when a certain
condition is true. They must be used carefully to ensure there is no
race condition. You declare a :c:type:`wait_queue_head_t`, and then processes
which want to wait for that condition declare a :c:type:`wait_queue_entry_t`
referring to themselves, and place that in the queue.

Declaring
---------

You declare a ``wait_queue_head_t`` using the
:c:func:`DECLARE_WAIT_QUEUE_HEAD()` macro, or using the
:c:func:`init_waitqueue_head()` routine in your initialization
code.

Queuing
-------

Placing yourself in the waitqueue is fairly complex, because you must
put yourself in the queue before checking the condition. There is a
macro to do this: :c:func:`wait_event_interruptible()`
(``include/linux/wait.h``). The first argument is the wait queue head, and
the second is an expression which is evaluated; the macro returns 0 when
this expression is true, or ``-ERESTARTSYS`` if a signal is received. The
:c:func:`wait_event()` version ignores signals.

Waking Up Queued Tasks
----------------------

Call :c:func:`wake_up()` (``include/linux/wait.h``), which will wake
up every process in the queue. The exception is if one has
``TASK_EXCLUSIVE`` set, in which case the remainder of the queue will
not be woken. There are other variants of this basic function available
in the same header.

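The three steps (declare, queue, wake) fit together roughly as follows.
This is only a sketch: the queue and flag names are hypothetical, and a
real driver would normally protect the condition with a lock or use an
atomic type rather than a bare ``int``::

    #include <linux/wait.h>

    static DECLARE_WAIT_QUEUE_HEAD(demo_wq);
    static int demo_ready;   /* hypothetical condition flag */

    /* Sleeper: user context only. */
    static int demo_wait(void)
    {
            /* 0 once demo_ready is non-zero, -ERESTARTSYS on a signal. */
            return wait_event_interruptible(demo_wq, demo_ready != 0);
    }

    /* Waker: may run in interrupt context. */
    static void demo_signal(void)
    {
            demo_ready = 1;      /* make the condition true first... */
            wake_up(&demo_wq);   /* ...then wake the sleepers */
    }

The order in ``demo_signal()`` matters: waking the queue before the
condition is true just sends the sleepers straight back to sleep.
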
Atomic Operations
=================

Certain operations are guaranteed atomic on all platforms. The first
class of operations works on :c:type:`atomic_t` (``include/asm/atomic.h``);
this contains a signed integer (at least 32 bits long), and you must use
these functions to manipulate or read :c:type:`atomic_t` variables.
:c:func:`atomic_read()` and :c:func:`atomic_set()` get and set
the counter, :c:func:`atomic_add()`, :c:func:`atomic_sub()`,
:c:func:`atomic_inc()`, :c:func:`atomic_dec()`, and
:c:func:`atomic_dec_and_test()` (returns true if it was
decremented to zero).

Yes. It returns true (i.e. != 0) if the atomic variable is zero.

Note that these functions are slower than normal arithmetic, and so
should not be used unnecessarily.

The second class of atomic operations is atomic bit operations on an
``unsigned long``, defined in ``include/linux/bitops.h``. These
operations generally take a pointer to the bit pattern, and a bit
number: 0 is the least significant bit. :c:func:`set_bit()`,
:c:func:`clear_bit()` and :c:func:`change_bit()` set, clear,
and flip the given bit. :c:func:`test_and_set_bit()`,
:c:func:`test_and_clear_bit()` and
:c:func:`test_and_change_bit()` do the same thing, except return
true if the bit was previously set; these are particularly useful for
atomically setting flags.

It is possible to call these operations with bit indices greater than
``BITS_PER_LONG``. The resulting behavior is strange on big-endian
platforms though so it is a good idea not to do this.

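Here is a sketch that uses one operation from each class. The object
type, the flag bit and the ``demo_`` functions are hypothetical, and the
hand-rolled reference count is only there to show
:c:func:`atomic_dec_and_test()`; the example assumes the object was
allocated with :c:func:`kmalloc()`::

    #include <linux/atomic.h>
    #include <linux/bitops.h>
    #include <linux/slab.h>
    #include <linux/types.h>

    #define DEMO_BUSY 0   /* hypothetical flag bit number */

    struct demo_obj {
            atomic_t refcnt;
            unsigned long flags;
    };

    static void demo_get(struct demo_obj *obj)
    {
            atomic_inc(&obj->refcnt);
    }

    static void demo_put(struct demo_obj *obj)
    {
            /* True only for the caller that drops the count to zero. */
            if (atomic_dec_and_test(&obj->refcnt))
                    kfree(obj);
    }

    /* Returns true if we claimed the object, false if already busy. */
    static bool demo_claim(struct demo_obj *obj)
    {
            return !test_and_set_bit(DEMO_BUSY, &obj->flags);
    }

Since :c:func:`test_and_set_bit()` reports the old value, exactly one of
several racing callers sees the bit as previously clear and wins the
claim.
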
Symbols
=======

Within the kernel proper, the normal linking rules apply (ie. unless a
symbol is declared to be file scope with the ``static`` keyword, it can
be used anywhere in the kernel). However, for modules, a special
exported symbol table is kept which limits the entry points to the
kernel proper. Modules can also export symbols.

:c:func:`EXPORT_SYMBOL()`
-------------------------

Defined in ``include/linux/export.h``

This is the classic method of exporting a symbol: dynamically loaded
modules will be able to use the symbol as normal.

:c:func:`EXPORT_SYMBOL_GPL()`
-----------------------------

Defined in ``include/linux/export.h``

Similar to :c:func:`EXPORT_SYMBOL()` except that the symbols
exported by :c:func:`EXPORT_SYMBOL_GPL()` can only be seen by
modules with a :c:func:`MODULE_LICENSE()` that specifies a GPL
compatible license. It implies that the function is considered an
internal implementation issue, and not really an interface. Some
maintainers and developers may however require EXPORT_SYMBOL_GPL()
when adding any new APIs or functionality.

Routines and Conventions
========================

Double-linked lists ``include/linux/list.h``
--------------------------------------------

There used to be three sets of linked-list routines in the kernel
headers, but this one is the winner. If you don't have some particular
pressing need for a single list, it's a good choice.

In particular, :c:func:`list_for_each_entry()` is useful.

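A short sketch of what that looks like in practice, with a hypothetical
item type and list head, and with locking left out for brevity::

    #include <linux/list.h>

    struct demo_item {
            struct list_head node;   /* links the item into a list */
            int value;
    };

    static LIST_HEAD(demo_list);

    static int demo_sum(void)
    {
            struct demo_item *item;
            int sum = 0;

            /* "node" names the list_head member inside struct demo_item. */
            list_for_each_entry(item, &demo_list, node)
                    sum += item->value;

            return sum;
    }

The list head is embedded in the item rather than pointing at it, which
is why the macro needs the member name to get from the link back to the
enclosing structure.
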
Return Conventions
------------------

For code called in user context, it's very common to defy C convention,
and return 0 for success, and a negative error number (eg. ``-EFAULT``) for
failure. This can be unintuitive at first, but it's fairly widespread in
the kernel.

Use :c:func:`ERR_PTR()` (``include/linux/err.h``) to encode a
negative error number into a pointer, and :c:func:`IS_ERR()` and
:c:func:`PTR_ERR()` to get it back out again: this avoids a separate
pointer parameter for the error number. Icky, but in a good way.

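A sketch of both sides of that convention, using a made-up
``struct demo_dev`` and assuming user context for the allocation::

    #include <linux/err.h>
    #include <linux/errno.h>
    #include <linux/slab.h>

    struct demo_dev {        /* hypothetical */
            int id;
    };

    static struct demo_dev *demo_create(int id)
    {
            struct demo_dev *dev;

            if (id < 0)
                    return ERR_PTR(-EINVAL);

            dev = kmalloc(sizeof(*dev), GFP_KERNEL);
            if (!dev)
                    return ERR_PTR(-ENOMEM);
            dev->id = id;
            return dev;
    }

    static int demo_use(void)
    {
            struct demo_dev *dev = demo_create(1);

            if (IS_ERR(dev))
                    return PTR_ERR(dev);   /* recover the negative errno */

            kfree(dev);
            return 0;
    }

A function written this way never returns ``NULL``, so callers test with
:c:func:`IS_ERR()` instead of a plain ``NULL`` check.
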
620 | Breaking Compilation | |
621 | -------------------- | |
622 | ||
623 | Linus and the other developers sometimes change function or structure | |
624 | names in development kernels; this is not done just to keep everyone on | |
625 | their toes: it reflects a fundamental change (eg. can no longer be | |
626 | called with interrupts on, or does extra checks, or doesn't do checks | |
627 | which were caught before). Usually this is accompanied by a fairly | |
628 | complete note to the linux-kernel mailing list; search the archive. | |
629 | Simply doing a global replace on the file usually makes things **worse**. | |
630 | ||
631 | Initializing structure members | |
632 | ------------------------------ | |
633 | ||
634 | The preferred method of initializing structures is to use designated | |
dca1e58e | 635 | initialisers, as defined by ISO C99, eg:: |
c4fcd7ca MCC |
636 | |
637 | static struct block_device_operations opt_fops = { | |
638 | .open = opt_open, | |
639 | .release = opt_release, | |
640 | .ioctl = opt_ioctl, | |
641 | .check_media_change = opt_media_change, | |
642 | }; | |
643 | ||
644 | ||
645 | This makes it easy to grep for, and makes it clear which structure | |
646 | fields are set. You should do this because it looks cool. | |
647 | ||
648 | GNU Extensions | |
649 | -------------- | |
650 | ||
651 | GNU Extensions are explicitly allowed in the Linux kernel. Note that | |
652 | some of the more complex ones are not very well supported, due to lack | |
653 | of general use, but the following are considered standard (see the GCC | |
654 | info page section "C Extensions" for more details - Yes, really the info | |
655 | page, the man page is only a short summary of the stuff in info). | |
656 | ||
657 | - Inline functions | |
658 | ||
659 | - Statement expressions (ie. the ({ and }) constructs). | |
660 | ||
661 | - Declaring attributes of a function / variable / type | |
662 | (__attribute__) | |
663 | ||
664 | - typeof | |
665 | ||
666 | - Zero length arrays | |
667 | ||
668 | - Macro varargs | |
669 | ||
670 | - Arithmetic on void pointers | |
671 | ||
672 | - Non-Constant initializers | |
673 | ||
674 | - Assembler Instructions (not outside arch/ and include/asm/) | |
675 | ||
676 | - Function names as strings (__func__). | |
677 | ||
678 | - __builtin_constant_p() | |
679 | ||
680 | Be wary when using long long in the kernel: the code gcc generates for | |
681 | it is horrible and, worse, division and multiplication do not work on | |
682 | i386 because the GCC runtime functions for them are missing from the | |
683 | kernel environment. | |
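
To make a few of the extensions listed above concrete, here is a small sketch that builds with GCC in user space. The ``demo_`` names are invented for this example; the kernel's own helpers are more careful than this.

```c
#include <assert.h>

/*
 * Statement expression plus typeof: each argument is evaluated exactly
 * once, so side effects such as i++ are safe. Spelled __typeof__ here so
 * the sketch also builds when GCC is run in a strict -std=c99 mode.
 */
#define demo_min(a, b) ({		\
	__typeof__(a) _a = (a);		\
	__typeof__(b) _b = (b);		\
	_a < _b ? _a : _b; })

/* __func__ gives the name of the enclosing function as a string. */
const char *demo_whoami(void)
{
	return __func__;
}

/* Arithmetic on void pointers: GCC treats sizeof(void) as 1. */
void *demo_advance(void *p, unsigned long bytes)
{
	return p + bytes;
}

/* Quick self-check showing that the macro argument is evaluated once. */
void demo_selfcheck(void)
{
	int i = 3;

	assert(demo_min(i++, 10) == 3);
	assert(i == 4);
}
```

A plain ``((a) < (b) ? (a) : (b))`` macro would evaluate ``i++`` twice in the self-check; the statement expression is what avoids that.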
684 | ||
685 | C++ | |
686 | --- | |
687 | ||
688 | Using C++ in the kernel is usually a bad idea, because the kernel does | |
689 | not provide the necessary runtime environment and the include files are | |
690 | not tested for it. It is still possible, but not recommended. If you | |
691 | really want to do this, forget about exceptions at least. | |
692 | ||
693 | #if | |
694 | --- | |
695 | |
696 | It is generally considered cleaner to use macros in header files (or at | |
697 | the top of .c files) to abstract away functions rather than using ``#if`` | |
698 | pre-processor statements throughout the source code. | |
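
A minimal sketch of the idea, assuming a hypothetical option called ``CONFIG_DEMO_STATS`` (the option and all ``demo_`` names are invented for this example): the conditional lives in one place, and callers never mention it.

```c
/* Normally in a header: the only place the option is tested. */
#ifdef CONFIG_DEMO_STATS
static unsigned long demo_events;
static inline void demo_count_event(void) { demo_events++; }
static inline unsigned long demo_event_total(void) { return demo_events; }
#else
/* Stubs: the calls compile away, so callers need no #ifdef of their own. */
static inline void demo_count_event(void) { }
static inline unsigned long demo_event_total(void) { return 0; }
#endif

/* The caller reads the same whether or not the option is set. */
unsigned long demo_handle_request(void)
{
	demo_count_event();
	return demo_event_total();
}
```

With the option unset, ``demo_handle_request()`` simply returns 0, and the compiler still type-checks the call sites in both configurations.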
699 | ||
700 | Putting Your Stuff in the Kernel | |
701 | ================================ | |
702 | ||
703 | In order to get your stuff into shape for official inclusion, or even to | |
704 | make a neat patch, there's administrative work to be done: | |
705 | ||
706 | - Figure out whose pond you've been pissing in. Look at the top of the | |
707 | source files, inside the ``MAINTAINERS`` file, and last of all in the | |
708 | ``CREDITS`` file. You should coordinate with this person to make sure | |
709 | you're not duplicating effort, or trying something that's already | |
710 | been rejected. | |
711 | ||
712 | Make sure you put your name and email address at the top of any files | |
713 | you create or mangle significantly. This is the first place people | |
714 | will look when they find a bug, or when **they** want to make a change. | |
715 | ||
716 | - Usually you want a configuration option for your kernel hack. Edit | |
717 | ``Kconfig`` in the appropriate directory. The Config language is | |
718 | simple to use by cut and paste, and there's complete documentation in | |
719 | ``Documentation/kbuild/kconfig-language.txt``. | |
720 | ||
721 | In your description of the option, make sure you address both the | |
722 | expert user and the user who knows nothing about your feature. | |
723 | Mention incompatibilities and issues here. **Definitely** end your | |
724 | description with “if in doubt, say N” (or, occasionally, “Y”); this | |
725 | is for people who have no idea what you are talking about. | |
726 | ||
727 | - Edit the ``Makefile``: the CONFIG variables are exported here so you | |
728 | can usually just add an ``obj-$(CONFIG_xxx) += xxx.o`` line. The syntax | |
729 | is documented in ``Documentation/kbuild/makefiles.txt``. | |
730 | ||
731 | - Put yourself in ``CREDITS`` if you've done something noteworthy, | |
732 | usually beyond a single file (your name should be at the top of the | |
733 | source files anyway). ``MAINTAINERS`` means you want to be consulted | |
734 | when changes are made to a subsystem, and hear about bugs; it implies | |
735 | a more-than-passing commitment to some part of the code. | |
736 | ||
737 | - Finally, don't forget to read | |
738 | ``Documentation/process/submitting-patches.rst`` and possibly | |
739 | ``Documentation/process/submitting-drivers.rst``. | |
740 | ||
741 | Kernel Cantrips | |
742 | =============== | |
743 | ||
744 | Some favorites from browsing the source. Feel free to add to this list. | |
745 | ||
746 | ``arch/x86/include/asm/delay.h``:: | |
747 | |
748 | #define ndelay(n) (__builtin_constant_p(n) ? \ | |
749 | ((n) > 20000 ? __bad_ndelay() : __const_udelay((n) * 5ul)) : \ | |
750 | __ndelay(n)) | |
751 | ||
752 | ||
753 | ``include/linux/fs.h``:: | |
754 | |
755 | /* | |
756 | * Kernel pointers have redundant information, so we can use a | |
757 | * scheme where we can return either an error code or a dentry | |
758 | * pointer with the same return value. | |
759 | * | |
760 | * This should be a per-architecture thing, to allow different | |
761 | * error and pointer decisions. | |
762 | */ | |
763 | #define ERR_PTR(err) ((void *)((long)(err))) | |
764 | #define PTR_ERR(ptr) ((long)(ptr)) | |
765 | #define IS_ERR(ptr) ((unsigned long)(ptr) > (unsigned long)(-1000)) | |
766 | ||
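For readers who have not met this trick before, here is a user-space sketch of how the three macros are typically used together. The macro bodies are copied from the excerpt above so the example builds on its own; ``demo_lookup()`` and friends are hypothetical, and the sketch assumes ``long`` is as wide as a pointer, as it is on Linux.

```c
#include <errno.h>
#include <stddef.h>

/* Copied from the excerpt above so the sketch builds in user space. */
#define ERR_PTR(err)	((void *)((long)(err)))
#define PTR_ERR(ptr)	((long)(ptr))
#define IS_ERR(ptr)	((unsigned long)(ptr) > (unsigned long)(-1000))

struct demo_item { int id; };
static struct demo_item demo_table[4] = { {0}, {1}, {2}, {3} };

/* Hypothetical lookup: returns a valid pointer or an encoded -errno. */
struct demo_item *demo_lookup(int id)
{
	if (id < 0 || id >= 4)
		return ERR_PTR(-ENOENT);
	return &demo_table[id];
}

/* Caller side: test with IS_ERR() first, then decode with PTR_ERR(). */
long demo_item_id(int id)
{
	struct demo_item *item = demo_lookup(id);

	if (IS_ERR(item))
		return PTR_ERR(item);
	return item->id;
}
```

The single return value carries either the object or the reason it could not be found, with no extra pointer parameter for the error number.
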
767 | ``arch/x86/include/asm/uaccess_32.h``:: | |
768 | |
769 | #define copy_to_user(to,from,n) \ | |
770 | (__builtin_constant_p(n) ? \ | |
771 | __constant_copy_to_user((to),(from),(n)) : \ | |
772 | __generic_copy_to_user((to),(from),(n))) | |
773 | ||
774 | ||
775 | ``arch/sparc/kernel/head.S``:: | |
776 | |
777 | /* | |
778 | * Sun people can't spell worth damn. "compatability" indeed. | |
779 | * At least we *know* we can't spell, and use a spell-checker. | |
780 | */ | |
781 | ||
782 | /* Uh, actually Linus it is I who cannot spell. Too much murky | |
783 | * Sparc assembly will do this to ya. | |
784 | */ | |
785 | C_LABEL(cputypvar): | |
786 | .asciz "compatibility" | |
787 | ||
788 | /* Tested on SS-5, SS-10. Probably someone at Sun applied a spell-checker. */ | |
789 | .align 4 | |
790 | C_LABEL(cputypvar_sun4m): | |
791 | .asciz "compatible" | |
792 | ||
793 | ||
794 | ``arch/sparc/lib/checksum.S``:: | |
795 | |
796 | /* Sun, you just can't beat me, you just can't. Stop trying, | |
797 | * give up. I'm serious, I am going to kick the living shit | |
798 | * out of you, game over, lights out. | |
799 | */ | |
800 | ||
801 | ||
802 | Thanks | |
803 | ====== | |
804 | ||
805 | Thanks to Andi Kleen for the idea, answering my questions, fixing my | |
806 | mistakes, filling content, etc. Philipp Rumpf for more spelling and | |
807 | clarity fixes, and some excellent non-obvious points. Werner Almesberger | |
808 | for giving me a great summary of :c:func:`disable_irq()`, and Jes | |
809 | Sorensen and Andrea Arcangeli added caveats. Michael Elizabeth Chastain | |
810 | for checking and adding to the Configure section. Telsa Gwynne for | |
811 | teaching me DocBook. |