Commit | Line | Data |
---|---|---|
8ce156de AB |
1 | ====================== |
2 | ioctl based interfaces | |
3 | ====================== | |
4 | ||
5 | ioctl() is the most common way for applications to interface | |
6 | with device drivers. It is flexible and easily extended by adding new | |
7 | commands and can be passed through character devices, block devices as | |
8 | well as sockets and other special file descriptors. | |
9 | ||
10 | However, it is also very easy to get ioctl command definitions wrong, | |
11 | and hard to fix them later without breaking existing applications, | |
12 | so this documentation tries to help developers get it right. | |
13 | ||
14 | Command number definitions | |
15 | ========================== | |
16 | ||
17 | The command number, or request number, is the second argument passed to | |
18 | the ioctl system call. While this can be any 32-bit number that uniquely | |
19 | identifies an action for a particular driver, there are a number of | |
20 | conventions around defining them. | |
21 | ||
22 | ``include/uapi/asm-generic/ioctl.h`` provides four macros for defining | |
23 | ioctl commands that follow modern conventions: ``_IO``, ``_IOR``, | |
24 | ``_IOW``, and ``_IOWR``. These should be used for all new commands, | |
25 | with the correct parameters: | |
26 | ||
27 | _IO/_IOR/_IOW/_IOWR | |
28 | The macro name specifies how the argument will be used. It may be a | |
29 | pointer to data to be passed into the kernel (_IOW), out of the kernel | |
30 | (_IOR), or both (_IOWR). _IO can indicate either commands with no | |
31 | argument or those passing an integer value instead of a pointer. | |
32 | It is recommended to only use _IO for commands without arguments, | |
33 | and use pointers for passing data. | |
34 | ||
35 | type | |
36 | An 8-bit number, often a character literal, specific to a subsystem | |
37 | or driver, and listed in :doc:`../userspace-api/ioctl/ioctl-number`. | |
38 | ||
39 | nr | |
40 | An 8-bit number identifying the specific command, unique for a given | |
41 | value of 'type' | |
42 | ||
43 | data_type | |
44 | The name of the data type pointed to by the argument. The command number | |
45 | encodes the ``sizeof(data_type)`` value in a 13-bit or 14-bit integer | |
46 | (depending on the architecture), which limits the size of the argument | |
47 | to 8191 bytes in portable code. Note: do not pass sizeof(data_type) into | |
48 | _IOR/_IOW/_IOWR, as that will lead to encoding sizeof(sizeof(data_type)), | |
49 | i.e. sizeof(size_t). _IO does not have a data_type parameter. | |
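
As an illustration of the four macros, the following user-space sketch defines a small command set and decodes it again with the ``_IOC_*`` helpers from ``<linux/ioctl.h>``. The ``foo`` names, the ``'x'`` type value and the structure are hypothetical and chosen only for this example; a real driver would pick a type from the ioctl-number list.

```c
#include <stddef.h>        /* size_t */
#include <linux/ioctl.h>   /* _IO, _IOR, _IOW, _IOWR and the _IOC_* decoders */
#include <linux/types.h>   /* __u32, __u64 */

/* Hypothetical argument structure: naturally aligned, no implicit padding. */
struct foo_config {
	__u32 flags;
	__u32 reserved;
	__u64 value;
};

/* 'x' is a placeholder type value used only for this example. */
#define FOO_IOC_MAGIC        'x'

/* No argument at all. */
#define FOO_IOC_RESET        _IO(FOO_IOC_MAGIC, 0)
/* Kernel writes the structure, user space reads it. */
#define FOO_IOC_GET_CONFIG   _IOR(FOO_IOC_MAGIC, 1, struct foo_config)
/* User space writes the structure, kernel reads it. */
#define FOO_IOC_SET_CONFIG   _IOW(FOO_IOC_MAGIC, 2, struct foo_config)
/* Data flows in both directions. */
#define FOO_IOC_XCHG_CONFIG  _IOWR(FOO_IOC_MAGIC, 3, struct foo_config)

/*
 * The mistake described above: passing sizeof(...) instead of the type
 * name makes the macro encode sizeof(size_t), not the structure size.
 */
#define FOO_IOC_BAD          _IOR(FOO_IOC_MAGIC, 4, sizeof(struct foo_config))
```

With these definitions, ``_IOC_SIZE(FOO_IOC_GET_CONFIG)`` equals ``sizeof(struct foo_config)``, while ``_IOC_SIZE(FOO_IOC_BAD)`` only equals ``sizeof(size_t)``.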
50 | ||
51 | ||
52 | Interface versions | |
53 | ================== | |
54 | ||
55 | Some subsystems use version numbers in data structures to overload | |
56 | commands with different interpretations of the argument. | |
57 | ||
58 | This is generally a bad idea, since changes to existing commands tend | |
59 | to break existing applications. | |
60 | ||
61 | A better approach is to add a new ioctl command with a new number. The | |
62 | old command still needs to be implemented in the kernel for compatibility, | |
63 | but this can be a wrapper around the new implementation. | |
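
The wrapper approach might look roughly like the sketch below. It is plain C without any user-space copying, and every name in it (``foo_args_v1``, ``foo_args_v2``, ``foo_set_v1()``, ``foo_set_v2()``) is hypothetical; it only shows the old command converting its argument and delegating to the new implementation.

```c
#include <stdint.h>

/* Hypothetical argument layout of the original command. */
struct foo_args_v1 {
	uint32_t speed;
	uint32_t reserved;
};

/* Hypothetical extended layout, reached through a new command number. */
struct foo_args_v2 {
	uint32_t speed;
	uint32_t mode;
	uint64_t timeout_ns;
};

/* New implementation: all of the real work happens here. */
static long foo_set_v2(const struct foo_args_v2 *args,
		       struct foo_args_v2 *state)
{
	*state = *args;
	return 0;
}

/*
 * Old command, kept for existing applications. It fills the fields that
 * did not exist in v1 with defaults and calls the new implementation.
 */
static long foo_set_v1(const struct foo_args_v1 *old,
		       struct foo_args_v2 *state)
{
	struct foo_args_v2 args = {
		.speed = old->speed,
		.mode = 0,
		.timeout_ns = 0,
	};

	return foo_set_v2(&args, state);
}
```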
64 | ||
65 | Return code | |
66 | =========== | |
67 | ||
68 | ioctl commands can return negative error codes as documented in errno(3); | |
69 | these get turned into errno values in user space. On success, the return | |
70 | code should be zero. It is also possible but not recommended to return | |
71 | a positive 'long' value. | |
72 | ||
73 | When the ioctl callback is called with an unknown command number, the | |
74 | handler returns either -ENOTTY or -ENOIOCTLCMD, which also results in | |
75 | -ENOTTY being returned from the system call. Some subsystems return | |
76 | -ENOSYS or -EINVAL here for historic reasons, but this is wrong. | |
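
A minimal sketch of this convention, written as an ordinary C function rather than a real file operation: the ``foo_ioctl()`` handler and the ``FOO_IOC_RESET`` command are hypothetical, and only the shape of the ``switch`` statement matters.

```c
#include <errno.h>         /* ENOTTY */
#include <linux/ioctl.h>   /* _IO */

/* Hypothetical command using a placeholder type value. */
#define FOO_IOC_RESET _IO('x', 0)

/* Simplified stand-in for a driver's ioctl callback. */
static long foo_ioctl(unsigned int cmd, unsigned long arg)
{
	(void)arg; /* unused by the only command in this sketch */

	switch (cmd) {
	case FOO_IOC_RESET:
		return 0;        /* success is reported as zero */
	default:
		return -ENOTTY;  /* unknown command, not -EINVAL or -ENOSYS */
	}
}
```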
77 | ||
78 | Prior to Linux 5.5, compat_ioctl handlers were required to return | |
79 | -ENOIOCTLCMD in order to use the fallback conversion into native | |
80 | commands. As all subsystems are now responsible for handling compat | |
81 | mode themselves, this is no longer needed, but it may be important to | |
82 | consider when backporting bug fixes to older kernels. | |
83 | ||
84 | Timestamps | |
85 | ========== | |
86 | ||
87 | Traditionally, timestamps and timeout values are passed as ``struct | |
88 | timespec`` or ``struct timeval``, but these are problematic because of | |
89 | incompatible definitions of these structures in user space after the | |
90 | move to 64-bit time_t. | |
91 | ||
92 | The ``struct __kernel_timespec`` type can be used instead, either embedded | |
93 | in other data structures when separate second/nanosecond values are | |
94 | desired, or passed to user space directly. This is still not ideal though, | |
95 | as the structure matches neither the kernel's timespec64 nor the user | |
96 | space timespec exactly. The get_timespec64() and put_timespec64() helper | |
97 | functions can be used to ensure that the layout remains compatible with | |
98 | user space and the padding is treated correctly. | |
99 | ||
100 | As it is cheap to convert seconds to nanoseconds, but the opposite | |
101 | requires an expensive 64-bit division, a plain __u64 nanosecond value | |
102 | can be simpler and more efficient. | |
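
The cost difference can be seen in a small sketch. ``struct foo_timespec`` is a hypothetical stand-in with two 64-bit fields, and both helpers assume non-negative values; they are not kernel functions.

```c
#include <stdint.h>

#define FOO_NSEC_PER_SEC 1000000000ULL

/* Hypothetical stand-in with separate second and nanosecond fields. */
struct foo_timespec {
	int64_t tv_sec;
	int64_t tv_nsec;
};

/* Seconds plus nanoseconds to a single value: one multiply and one add. */
static uint64_t foo_ts_to_ns(const struct foo_timespec *ts)
{
	return (uint64_t)ts->tv_sec * FOO_NSEC_PER_SEC + (uint64_t)ts->tv_nsec;
}

/* The opposite direction needs a 64-bit division and remainder. */
static struct foo_timespec foo_ns_to_ts(uint64_t ns)
{
	struct foo_timespec ts = {
		.tv_sec = (int64_t)(ns / FOO_NSEC_PER_SEC),
		.tv_nsec = (int64_t)(ns % FOO_NSEC_PER_SEC),
	};

	return ts;
}
```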
103 | ||
104 | Timeout values and timestamps should ideally use CLOCK_MONOTONIC time, | |
105 | as returned by ktime_get_ns() or ktime_get_ts64(). Unlike | |
106 | CLOCK_REALTIME, this makes the timestamps immune from jumping backwards | |
107 | or forwards due to leap second adjustments and clock_settime() calls. | |
108 | ||
109 | ktime_get_real_ns() can be used for CLOCK_REALTIME timestamps that | |
110 | need to be persistent across a reboot or between multiple machines. | |
111 | ||
112 | 32-bit compat mode | |
113 | ================== | |
114 | ||
115 | In order to support 32-bit user space running on a 64-bit machine, each | |
116 | subsystem or driver that implements an ioctl callback handler must also | |
117 | implement the corresponding compat_ioctl handler. | |
118 | ||
119 | As long as all the rules for data structures are followed, this is as | |
120 | easy as setting the .compat_ioctl pointer to a helper function such as | |
121 | compat_ptr_ioctl() or blkdev_compat_ptr_ioctl(). | |
122 | ||
123 | compat_ptr() | |
124 | ------------ | |
125 | ||
126 | On the s390 architecture, 31-bit user space has ambiguous representations | |
127 | for data pointers, with the upper bit being ignored. When running such | |
128 | a process in compat mode, the compat_ptr() helper must be used to | |
129 | clear the upper bit of a compat_uptr_t and turn it into a valid 64-bit | |
130 | pointer. On other architectures, this macro only performs a cast to a | |
131 | ``void __user *`` pointer. | |
132 | ||
133 | In a compat_ioctl() callback, the last argument is an unsigned long, | |
134 | which can be interpreted as either a pointer or a scalar depending on | |
135 | the command. If it is a scalar, then compat_ptr() must not be used, to | |
136 | ensure that the 64-bit kernel behaves the same way as a 32-bit kernel | |
137 | for arguments with the upper bit set. | |
138 | ||
139 | The compat_ptr_ioctl() helper can be used in place of a custom | |
140 | compat_ioctl file operation for drivers that only take arguments that | |
141 | are pointers to compatible data structures. | |
142 | ||
143 | Structure layout | |
144 | ---------------- | |
145 | ||
146 | Compatible data structures have the same layout on all architectures, | |
147 | avoiding all problematic members: | |
148 | ||
149 | * ``long`` and ``unsigned long`` are the size of a register, so | |
150 | they can be either 32-bit or 64-bit wide and cannot be used in portable | |
151 | data structures. Fixed-length replacements are ``__s32``, ``__u32``, | |
152 | ``__s64`` and ``__u64``. | |
153 | ||
154 | * Pointers have the same problem, in addition to requiring the | |
155 | use of compat_ptr(). The best workaround is to use ``__u64`` | |
156 | in place of pointers, which requires a cast to ``uintptr_t`` in user | |
157 | space, and the use of u64_to_user_ptr() in the kernel to convert | |
158 | it back into a user pointer. | |
159 | ||
160 | * On the x86-32 (i386) architecture, the alignment of 64-bit variables | |
161 | is only 32-bit, but they are naturally aligned on most other | |
162 | architectures including x86-64. This means a structure like:: | |
163 | ||
164 | struct foo { | |
165 | __u32 a; | |
166 | __u64 b; | |
167 | __u32 c; | |
168 | }; | |
169 | ||
170 | has four bytes of padding between a and b on x86-64, plus another four | |
171 | bytes of padding at the end, but no padding on i386, and it needs a | |
172 | compat_ioctl conversion handler to translate between the two formats. | |
173 | ||
174 | To avoid this problem, all structures should have their members | |
175 | naturally aligned, or explicit reserved fields added in place of the | |
176 | implicit padding. The ``pahole`` tool can be used for checking the | |
177 | alignment. | |
178 | ||
179 | * On ARM OABI user space, structures are padded to multiples of 32 bits, | |
180 | making some structs incompatible with modern EABI kernels if they | |
181 | do not end on a 32-bit boundary. | |
182 | ||
183 | * On the m68k architecture, struct members are not guaranteed to have an | |
184 | alignment greater than 16-bit, which is a problem when relying on | |
185 | implicit padding. | |
186 | ||
187 | * Bitfields and enums generally work as one would expect them to, | |
188 | but some properties of them are implementation-defined, so it is better | |
189 | to avoid them completely in ioctl interfaces. | |
190 | ||
191 | * ``char`` members can be either signed or unsigned, depending on | |
192 | the architecture, so the __u8 and __s8 types should be used for 8-bit | |
193 | integer values, though char arrays are clearer for fixed-length strings. | |
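
The rules above can be combined into a sketch of a layout-stable variant of ``struct foo``. The reserved fields, the ``buf_ptr`` member and the two conversion helpers are assumptions made for this example; ``foo_u64_to_ptr()`` is only a user-space illustration and is not the kernel's u64_to_user_ptr().

```c
#include <stddef.h>        /* offsetof */
#include <stdint.h>        /* uintptr_t */
#include <linux/types.h>   /* __u32, __u64 */

/* Hypothetical variant of struct foo with every byte accounted for. */
struct foo_fixed {
	__u32 a;
	__u32 reserved1;   /* replaces the implicit padding before b */
	__u64 b;
	__u32 c;
	__u32 reserved2;   /* replaces the implicit padding after c */
	__u64 buf_ptr;     /* user pointer carried as a 64-bit integer */
};

/* The layout is now the same with 4-byte or 8-byte alignment of __u64. */
_Static_assert(offsetof(struct foo_fixed, b) == 8, "b must be at offset 8");
_Static_assert(sizeof(struct foo_fixed) == 32, "unexpected implicit padding");

/* User space side: store a pointer in a __u64 via uintptr_t. */
static __u64 foo_ptr_to_u64(const void *ptr)
{
	return (__u64)(uintptr_t)ptr;
}

/* Reverse conversion, shown here only to check the round trip. */
static void *foo_u64_to_ptr(__u64 value)
{
	return (void *)(uintptr_t)value;
}
```

The static assertions act as a cheap compile-time check alongside ``pahole``: if a later change reintroduces implicit padding, the build fails.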
194 | ||
195 | Information leaks | |
196 | ================= | |
197 | ||
198 | Uninitialized data must not be copied back to user space, as this can | |
199 | cause an information leak, which can be used to defeat kernel address | |
200 | space layout randomization (KASLR), helping in an attack. | |
201 | ||
202 | For this reason (and for compat support) it is best to avoid any | |
203 | implicit padding in data structures. Where there is implicit padding | |
204 | in an existing structure, kernel drivers must be careful to fully | |
205 | initialize an instance of the structure before copying it to user | |
206 | space. This is usually done by calling memset() before assigning to | |
207 | individual members. | |
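
A sketch of the memset() pattern, using a hypothetical ``struct foo_info`` that has implicit padding after its first member on most ABIs. In a driver, the filled structure would then be handed to copy_to_user(); that step is left out here so the example stays ordinary C.

```c
#include <stdint.h>
#include <string.h>   /* memset */

/* Hypothetical structure: 'id' is usually followed by 3 padding bytes. */
struct foo_info {
	uint8_t  id;
	uint32_t value;
};

/*
 * Clear the whole object first, including the padding bytes, and only
 * then assign the individual members.
 */
static void foo_fill_info(struct foo_info *info, uint8_t id, uint32_t value)
{
	memset(info, 0, sizeof(*info));
	info->id = id;
	info->value = value;
}
```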
208 | ||
209 | Subsystem abstractions | |
210 | ====================== | |
211 | ||
212 | While some device drivers implement their own ioctl function, most | |
213 | subsystems implement the same command for multiple drivers. Ideally the | |
214 | subsystem has an .ioctl() handler that copies the arguments from and | |
215 | to user space, passing them into subsystem specific callback functions | |
216 | through normal kernel pointers. | |
217 | ||
218 | This helps in various ways: | |
219 | ||
220 | * Applications written for one driver are more likely to work for | |
221 | another one in the same subsystem if there are no subtle differences | |
222 | in the user space ABI. | |
223 | ||
224 | * The complexity of user space access and data structure layout is handled | |
225 | in one place, reducing the potential for implementation bugs. | |
226 | ||
227 | * It is more likely to be reviewed by experienced developers | |
228 | that can spot problems in the interface when the ioctl is shared | |
229 | between multiple drivers than when it is only used in a single driver. | |
230 | ||
231 | Alternatives to ioctl | |
232 | ===================== | |
233 | ||
234 | There are many cases in which ioctl is not the best solution for a | |
235 | problem. Alternatives include: | |
236 | ||
237 | * System calls are a better choice for a system-wide feature that | |
238 | is not tied to a physical device or constrained by the file system | |
239 | permissions of a character device node. | |
240 | ||
241 | * netlink is the preferred way of configuring any network related | |
242 | objects through sockets. | |
243 | ||
244 | * debugfs is used for ad-hoc interfaces for debugging functionality | |
245 | that does not need to be exposed as a stable interface to applications. | |
246 | ||
247 | * sysfs is a good way to expose the state of an in-kernel object | |
248 | that is not tied to a file descriptor. | |
249 | ||
250 | * configfs can be used for more complex configuration than sysfs. | |
251 | ||
252 | * A custom file system can provide extra flexibility with a simple | |
253 | user interface but adds a lot of complexity to the implementation. |