Commit | Line | Data |
---|---|---|
e82894f8 TZ |
1 | |
2 | relayfs - a high-speed data relay filesystem | |
3 | ============================================ | |
4 | ||
5 | relayfs is a filesystem designed to provide an efficient mechanism for | |
6 | tools and facilities to relay large and potentially sustained streams | |
7 | of data from kernel space to user space. | |
8 | ||
9 | The main abstraction of relayfs is the 'channel'. A channel consists | |
10 | of a set of per-cpu kernel buffers each represented by a file in the | |
11 | relayfs filesystem. Kernel clients write into a channel using | |
12 | efficient write functions which automatically log to the current cpu's | |
13 | channel buffer. User space applications mmap() the per-cpu files and | |
14 | retrieve the data as it becomes available. | |
15 | ||
16 | The format of the data logged into the channel buffers is completely | |
17 | up to the relayfs client; relayfs does however provide hooks which | |
afeda2c2 | 18 | allow clients to impose some structure on the buffer data. Nor does |
e82894f8 TZ |
19 | relayfs implement any form of data filtering - this also is left to |
20 | the client. The purpose is to keep relayfs as simple as possible. | |
21 | ||
22 | This document provides an overview of the relayfs API. The details of | |
23 | the function parameters are documented along with the functions in the | |
24 | filesystem code - please see that for details. | |
25 | ||
26 | Semantics | |
27 | ========= | |
28 | ||
29 | Each relayfs channel has one buffer per CPU; each buffer has one or | |
30 | more sub-buffers. Messages are written to the first sub-buffer until | |
31 | it is too full to contain a new message, in which case the message is | |
32 | written to the next (if available). Messages are never split across | |
33 | sub-buffers. At this point, userspace can be notified so it empties | |
34 | the first sub-buffer, while the kernel continues writing to the next. | |
35 | ||
36 | When userspace is notified that a sub-buffer is full, the kernel knows | |
37 | how many bytes of it are padding, i.e. unused. Userspace can use this | |
38 | knowledge to copy only valid data. | |
39 | ||
40 | After copying it, userspace can notify the kernel that a sub-buffer | |
41 | has been consumed. | |
42 | ||
43 | relayfs can operate in a mode where it will overwrite data not yet | |
44 | collected by userspace, rather than wait for userspace to consume it. | |
45 | ||
46 | relayfs itself does not provide for communication of such data between | |
6b34350f TZ |
47 | userspace and kernel, allowing the kernel side to remain simple and |
48 | not impose a single interface on userspace. It does provide a set of | |
49 | examples and a separate helper though, described below. | |
50 | ||
51 | klog and relay-apps example code | |
52 | ================================ | |
53 | ||
54 | relayfs itself is ready to use, but to make things easier, a couple | |
55 | of simple utility functions and a set of examples are provided. | |
56 | ||
57 | The relay-apps example tarball, available on the relayfs sourceforge | |
58 | site, contains a set of self-contained examples, each consisting of a | |
59 | pair of .c files containing boilerplate code for each of the user and | |
60 | kernel sides of a relayfs application; combined these two sets of | |
61 | boilerplate code provide glue to easily stream data to disk, without | |
62 | having to bother with mundane housekeeping chores. | |
63 | ||
64 | The 'klog debugging functions' patch (klog.patch in the relay-apps | |
65 | tarball) provides a couple of high-level logging functions to the | |
66 | kernel which allow writing formatted text or raw data to a channel, | |
67 | regardless of whether a channel to write into exists or not, or | |
68 | whether relayfs is compiled into the kernel or is configured as a | |
69 | module. These functions allow you to put unconditional 'trace' | |
70 | statements anywhere in the kernel or kernel modules; only when there | |
71 | is a 'klog handler' registered will data actually be logged (see the | |
72 | klog and kleak examples for details). | |
73 | ||
74 | It is of course possible to use relayfs from scratch i.e. without | |
75 | using any of the relay-apps example code or klog, but you'll have to | |
76 | implement communication between userspace and kernel, allowing both to | |
77 | convey the state of buffers (full, empty, amount of padding). | |
78 | ||
79 | klog and the relay-apps examples can be found in the relay-apps | |
80 | tarball on http://relayfs.sourceforge.net | |
e82894f8 | 81 | |
e82894f8 TZ |
82 | |
83 | The relayfs user space API | |
84 | ========================== | |
85 | ||
86 | relayfs implements basic file operations for user space access to | |
87 | relayfs channel buffer data. Here are the file operations that are | |
88 | available and some comments regarding their behavior: | |
89 | ||
90 | open() enables a user to open an _existing_ buffer. | |
91 | ||
92 | mmap() results in the channel buffer being mapped into the caller's | |
93 | memory space. Note that you can't do a partial mmap - you must | |
94 | map the entire file, which is NRBUF * SUBBUFSIZE. | |
95 | ||
96 | read() reads the contents of a channel buffer. The bytes read are | |
97 | 'consumed' by the reader i.e. they won't be available again | |
98 | to subsequent reads. If the channel is being used in | |
99 | no-overwrite mode (the default), it can be read at any time | |
100 | even if there's an active kernel writer. If the channel is | |
101 | being used in overwrite mode and there are active channel | |
102 | writers, results may be unpredictable - users should make | |
103 | sure that all logging to the channel has ended before using | |
104 | read() with overwrite mode. | |
105 | ||
106 | poll() POLLIN/POLLRDNORM/POLLERR supported. User applications are | |
107 | notified when sub-buffer boundaries are crossed. | |
108 | ||
109 | close() decrements the channel buffer's refcount. When the refcount | |
110 | reaches 0 i.e. when no process or kernel client has the buffer | |
111 | open, the channel buffer is freed. | |
112 | ||
113 | ||
114 | In order for a user application to make use of relayfs files, the | |
115 | relayfs filesystem must be mounted. For example, | |
116 | ||
117 | mount -t relayfs relayfs /mnt/relay | |
118 | ||
119 | NOTE: relayfs doesn't need to be mounted for kernel clients to create | |
120 | or use channels - it only needs to be mounted when user space | |
121 | applications need access to the buffer data. | |
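
As a rough user-space illustration of the file operations above, the following sketch opens one per-cpu file, maps the whole buffer and waits for data with poll(). The helper names map_relay_buf() and wait_for_data(), the mmap flags chosen, and any path passed in (for instance something like /mnt/relay/mychan0) are assumptions made for this example; none of them are part of relayfs.

```c
#include <fcntl.h>
#include <poll.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper: open an existing per-cpu relay file and map the
 * entire buffer, n_subbufs * subbuf_size bytes, since partial maps are
 * not allowed. Returns the mapping, or NULL on failure. On success
 * *fd_out receives the open descriptor.
 */
static void *map_relay_buf(const char *path, size_t n_subbufs,
                           size_t subbuf_size, int *fd_out)
{
    size_t len = n_subbufs * subbuf_size;
    void *addr;
    int fd;

    fd = open(path, O_RDONLY);
    if (fd < 0)
        return NULL;

    /* Read-only private mapping; the flags are an assumption. */
    addr = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, 0);
    if (addr == MAP_FAILED) {
        close(fd);
        return NULL;
    }

    *fd_out = fd;
    return addr;
}

/*
 * Hypothetical helper: block until the file is readable or timeout_ms
 * elapses. Returns 1 if data is ready, 0 on timeout, -1 on error.
 */
static int wait_for_data(int fd, int timeout_ms)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN };
    int ret = poll(&pfd, 1, timeout_ms);

    if (ret < 0)
        return -1;
    if (ret == 0)
        return 0;
    if (pfd.revents & POLLERR)
        return -1;
    return (pfd.revents & POLLIN) ? 1 : 0;
}
```

A consumer would typically call map_relay_buf() once per per-cpu file and then loop on wait_for_data(), processing sub-buffers each time it returns 1. How the consumer learns which sub-buffers are ready is left to the application, as described in the Semantics section.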
122 | ||
123 | ||
124 | The relayfs kernel API | |
125 | ====================== | |
126 | ||
127 | Here's a summary of the API relayfs provides to in-kernel clients: | |
128 | ||
129 | ||
130 | channel management functions: | |
131 | ||
132 | relay_open(base_filename, parent, subbuf_size, n_subbufs, | |
133 | callbacks) | |
134 | relay_close(chan) | |
135 | relay_flush(chan) | |
136 | relay_reset(chan) | |
137 | relayfs_create_dir(name, parent) | |
138 | relayfs_remove_dir(dentry) | |
925ac8a2 TZ |
139 | relayfs_create_file(name, parent, mode, fops, data) |
140 | relayfs_remove_file(dentry) | |
e82894f8 TZ |
141 | |
142 | channel management typically called on instigation of userspace: | |
143 | ||
144 | relay_subbufs_consumed(chan, cpu, subbufs_consumed) | |
145 | ||
146 | write functions: | |
147 | ||
148 | relay_write(chan, data, length) | |
149 | __relay_write(chan, data, length) | |
150 | relay_reserve(chan, length) | |
151 | ||
152 | callbacks: | |
153 | ||
154 | subbuf_start(buf, subbuf, prev_subbuf, prev_padding) | |
155 | buf_mapped(buf, filp) | |
156 | buf_unmapped(buf, filp) | |
df49af8f | 157 | create_buf_file(filename, parent, mode, buf, is_global) |
03d78d11 | 158 | remove_buf_file(dentry) |
e82894f8 TZ |
159 | |
160 | helper functions: | |
161 | ||
162 | relay_buf_full(buf) | |
163 | subbuf_start_reserve(buf, length) | |
164 | ||
165 | ||
166 | Creating a channel | |
167 | ------------------ | |
168 | ||
169 | relay_open() is used to create a channel, along with its per-cpu | |
170 | channel buffers. Each channel buffer will have an associated file | |
171 | created for it in the relayfs filesystem, which can be opened and | |
172 | mmapped from user space if desired. The files are named | |
173 | basename0...basenameN-1 where N is the number of online cpus, and by | |
174 | default will be created in the root of the filesystem. If you want a | |
175 | directory structure to contain your relayfs files, you can create it | |
176 | with relayfs_create_dir() and pass the parent directory to | |
177 | relay_open(). Clients are responsible for cleaning up any directory | |
178 | structure they create when the channel is closed - use | |
179 | relayfs_remove_dir() for that. | |
180 | ||
181 | The total size of each per-cpu buffer is calculated by multiplying the | |
182 | number of sub-buffers by the sub-buffer size passed into relay_open(). | |
183 | The idea behind sub-buffers is that they're basically an extension of | |
184 | double-buffering to N buffers, and they also allow applications to | |
185 | easily implement random-access-on-buffer-boundary schemes, which can | |
186 | be important for some high-volume applications. The number and size | |
187 | of sub-buffers is completely dependent on the application and even for | |
188 | the same application, different conditions will warrant different | |
189 | values for these parameters at different times. Typically, the right | |
190 | values to use are best decided after some experimentation; in general, | |
191 | though, it's safe to assume that having only 1 sub-buffer is a bad | |
192 | idea - you're guaranteed to either overwrite data or lose events | |
193 | depending on the channel mode being used. | |
194 | ||
195 | Channel 'modes' | |
196 | --------------- | |
197 | ||
198 | relayfs channels can be used in either of two modes - 'overwrite' or | |
199 | 'no-overwrite'. The mode is entirely determined by the implementation | |
200 | of the subbuf_start() callback, as described below. In 'overwrite' | |
201 | mode, also known as 'flight recorder' mode, writes continuously cycle | |
202 | around the buffer and will never fail, but will unconditionally | |
203 | overwrite old data regardless of whether it's actually been consumed. | |
204 | In no-overwrite mode, writes will fail i.e. data will be lost, if the | |
205 | number of unconsumed sub-buffers equals the total number of | |
206 | sub-buffers in the channel. It should be clear that if there is no | |
207 | consumer or if the consumer can't consume sub-buffers fast enough, | |
208 | data will be lost in either case; the only difference is whether data | |
209 | is lost from the beginning or the end of a buffer. | |
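
The difference between the two modes can be illustrated with a small user-space model. This is only a toy sketch: struct model_buf, its fields and the model_*() functions are invented for this example and are not the kernel's data structures. It assumes a buffer counts as full when the number of filled but unconsumed sub-buffers reaches the total number of sub-buffers, as stated above.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy model of one per-cpu buffer; not the kernel's struct rchan_buf. */
struct model_buf {
    size_t n_subbufs;   /* total sub-buffers in the buffer */
    size_t produced;    /* sub-buffers filled by the writer so far */
    size_t consumed;    /* sub-buffers released by the reader */
    bool stalled;       /* current sub-buffer already counted as filled */
};

/* Full when every sub-buffer is filled and still unconsumed. */
static bool model_buf_full(const struct model_buf *b)
{
    return b->produced - b->consumed >= b->n_subbufs;
}

/*
 * Called when a message does not fit in the current sub-buffer.
 * In no-overwrite mode the switch is refused while the buffer is full,
 * so the pending message is lost. In overwrite mode the switch always
 * happens and the oldest sub-buffer is reused, consumed or not.
 */
static bool model_switch(struct model_buf *b, bool overwrite)
{
    if (!b->stalled)
        b->produced++;      /* the current sub-buffer is now ready */

    if (!overwrite && model_buf_full(b)) {
        b->stalled = true;  /* stay put until the reader catches up */
        return false;
    }

    b->stalled = false;     /* writer moves on to the next sub-buffer */
    return true;
}

/* The reader reports how many sub-buffers it has finished with. */
static void model_consume(struct model_buf *b, size_t n)
{
    b->consumed += n;
}
```

With two sub-buffers and no consumer, the second call to model_switch() fails in no-overwrite mode and keeps failing until model_consume() is called, whereas in overwrite mode every call succeeds.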
210 | ||
211 | As explained above, a relayfs channel is made up of one or more | |
212 | per-cpu channel buffers, each implemented as a circular buffer | |
213 | subdivided into one or more sub-buffers. Messages are written into | |
214 | the current sub-buffer of the channel's current per-cpu buffer via the | |
215 | write functions described below. Whenever a message can't fit into | |
216 | the current sub-buffer, because there's no room left for it, the | |
217 | client is notified via the subbuf_start() callback that a switch to a | |
218 | new sub-buffer is about to occur. The client uses this callback to 1) | |
219 | initialize the next sub-buffer if appropriate 2) finalize the previous | |
220 | sub-buffer if appropriate and 3) return a boolean value indicating | |
221 | whether or not to actually go ahead with the sub-buffer switch. | |
222 | ||
223 | To implement 'no-overwrite' mode, the kernel client would provide | |
224 | an implementation of the subbuf_start() callback something like the | |
225 | following: | |
226 | ||
227 | static int subbuf_start(struct rchan_buf *buf, | |
228 | void *subbuf, | |
229 | void *prev_subbuf, | |
230 | unsigned int prev_padding) | |
231 | { | |
232 | if (prev_subbuf) | |
233 | *((unsigned *)prev_subbuf) = prev_padding; | |
234 | ||
235 | if (relay_buf_full(buf)) | |
236 | return 0; | |
237 | ||
238 | subbuf_start_reserve(buf, sizeof(unsigned int)); | |
239 | ||
240 | return 1; | |
241 | } | |
242 | ||
243 | If the current buffer is full i.e. all sub-buffers remain unconsumed, | |
244 | the callback returns 0 to indicate that the buffer switch should not | |
245 | occur yet i.e. until the consumer has had a chance to read the current | |
246 | set of ready sub-buffers. For the relay_buf_full() function to make | |
247 | sense, the consumer is responsible for notifying relayfs when | |
248 | sub-buffers have been consumed via relay_subbufs_consumed(). Any | |
249 | subsequent attempts to write into the buffer will again invoke the | |
250 | subbuf_start() callback with the same parameters; only when the | |
251 | consumer has consumed one or more of the ready sub-buffers will | |
252 | relay_buf_full() return 0, in which case the buffer switch can | |
253 | continue. | |
254 | ||
255 | The implementation of the subbuf_start() callback for 'overwrite' mode | |
256 | would be very similar: | |
257 | ||
258 | static int subbuf_start(struct rchan_buf *buf, | |
259 | void *subbuf, | |
260 | void *prev_subbuf, | |
261 | unsigned int prev_padding) | |
262 | { | |
263 | if (prev_subbuf) | |
264 | *((unsigned *)prev_subbuf) = prev_padding; | |
265 | ||
266 | subbuf_start_reserve(buf, sizeof(unsigned int)); | |
267 | ||
268 | return 1; | |
269 | } | |
270 | ||
271 | In this case, the relay_buf_full() check is meaningless and the | |
272 | callback always returns 1, causing the buffer switch to occur | |
273 | unconditionally. It's also meaningless for the client to use the | |
274 | relay_subbufs_consumed() function in this mode, as it's never | |
275 | consulted. | |
276 | ||
277 | The default subbuf_start() implementation, used if the client doesn't | |
278 | define any callbacks, or doesn't define the subbuf_start() callback, | |
279 | implements the simplest possible 'no-overwrite' mode i.e. it does | |
280 | nothing but return 0. | |
281 | ||
282 | Header information can be reserved at the beginning of each sub-buffer | |
283 | by calling the subbuf_start_reserve() helper function from within the | |
284 | subbuf_start() callback. This reserved area can be used to store | |
285 | whatever information the client wants. In the example above, room is | |
286 | reserved in each sub-buffer to store the padding count for that | |
287 | sub-buffer. This is filled in for the previous sub-buffer in the | |
288 | subbuf_start() implementation; the padding value for the previous | |
289 | sub-buffer is passed into the subbuf_start() callback along with a | |
290 | pointer to the previous sub-buffer, since the padding value isn't | |
291 | known until a sub-buffer is filled. The subbuf_start() callback is | |
292 | also called for the first sub-buffer when the channel is opened, to | |
293 | give the client a chance to reserve space in it. In this case the | |
294 | previous sub-buffer pointer passed into the callback will be NULL, so | |
295 | the client should check the value of the prev_subbuf pointer before | |
296 | writing into the previous sub-buffer. | |
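
On the consumer side, the padding header written by the example callbacks can be used to copy only valid data. The sketch below is a hypothetical user-space helper, not part of relayfs. It assumes the layout used in the examples above: each completed sub-buffer starts with an unsigned int holding its padding count, the padding is the number of unused bytes at the end of the sub-buffer, and the logged data follows the header.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical consumer-side helper. Given a completed sub-buffer whose
 * first bytes hold the padding count, point *data at the logged data
 * and return the number of valid data bytes. Returns 0 if the header
 * is inconsistent with subbuf_size.
 */
static size_t subbuf_valid_data(const void *subbuf, size_t subbuf_size,
                                const void **data)
{
    unsigned int padding;
    size_t header = sizeof(padding);

    if (subbuf_size < header)
        return 0;

    /* memcpy avoids assuming anything about alignment. */
    memcpy(&padding, subbuf, header);

    if (padding > subbuf_size - header)
        return 0;

    *data = (const char *)subbuf + header;
    return subbuf_size - header - padding;
}
```

Note that the header is filled in only when the writer switches away from a sub-buffer, so this helper is meaningful only for sub-buffers that have already been completed.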
297 | ||
298 | Writing to a channel | |
299 | -------------------- | |
300 | ||
301 | Kernel clients write data into the current cpu's channel buffer using | |
302 | relay_write() or __relay_write(). relay_write() is the main logging | |
303 | function - it uses local_irq_save() to protect the buffer and should be | |
304 | used if you might be logging from interrupt context. If you know | |
305 | you'll never be logging from interrupt context, you can use | |
306 | __relay_write(), which only disables preemption. These functions | |
307 | don't return a value, so you can't determine whether or not they | |
308 | failed - the assumption is that you wouldn't want to check a return | |
309 | value in the fast logging path anyway, and that they'll always succeed | |
310 | unless the buffer is full and no-overwrite mode is being used, in | |
311 | which case you can detect a failed write in the subbuf_start() | |
312 | callback by calling the relay_buf_full() helper function. | |
313 | ||
314 | relay_reserve() is used to reserve a slot in a channel buffer which | |
315 | can be written to later. This would typically be used in applications | |
316 | that need to write directly into a channel buffer without having to | |
317 | stage data in a temporary buffer beforehand. Because the actual write | |
318 | may not happen immediately after the slot is reserved, applications | |
319 | using relay_reserve() can keep a count of the number of bytes actually | |
320 | written, either in space reserved in the sub-buffers themselves or as | |
321 | a separate array. See the 'reserve' example in the relay-apps tarball | |
322 | at http://relayfs.sourceforge.net for an example of how this can be | |
323 | done. Because the write is under control of the client and is | |
324 | separated from the reserve, relay_reserve() doesn't protect the buffer | |
325 | at all - it's up to the client to provide the appropriate | |
326 | synchronization when using relay_reserve(). | |
327 | ||
328 | Closing a channel | |
329 | ----------------- | |
330 | ||
331 | The client calls relay_close() when it's finished using the channel. | |
332 | The channel and its associated buffers are destroyed when there are no | |
333 | longer any references to any of the channel buffers. relay_flush() | |
334 | forces a sub-buffer switch on all the channel buffers, and can be used | |
335 | to finalize and process the last sub-buffers before the channel is | |
336 | closed. | |
337 | ||
925ac8a2 TZ |
338 | Creating non-relay files |
339 | ------------------------ | |
340 | ||
341 | relay_open() automatically creates files in the relayfs filesystem to | |
342 | represent the per-cpu kernel buffers; it's often useful for | |
343 | applications to be able to create their own files alongside the relay | |
344 | files in the relayfs filesystem as well e.g. 'control' files much like | |
345 | those created in /proc or debugfs for similar purposes, used to | |
346 | communicate control information between the kernel and user sides of a | |
347 | relayfs application. For this purpose the relayfs_create_file() and | |
348 | relayfs_remove_file() API functions exist. For relayfs_create_file(), | |
349 | the caller passes in a set of user-defined file operations to be used | |
350 | for the file and an optional void * to a user-specified data item, | |
351 | which will be accessible via inode->u.generic_ip (see the relay-apps | |
352 | tarball for examples). The file_operations are a required parameter | |
353 | to relayfs_create_file() and thus the semantics of these files are | |
354 | completely defined by the caller. | |
355 | ||
356 | See the relay-apps tarball at http://relayfs.sourceforge.net for | |
357 | examples of how these non-relay files are meant to be used. | |
358 | ||
03d78d11 TZ |
359 | Creating relay files in other filesystems |
360 | ----------------------------------------- | |
361 | ||
362 | By default of course, relay_open() creates relay files in the relayfs | |
363 | filesystem. Because relay_file_operations is exported, however, it's | |
364 | also possible to create and use relay files in other pseudo-filesystems | |
365 | such as debugfs. | |
366 | ||
367 | For this purpose, two callback functions are provided, | |
368 | create_buf_file() and remove_buf_file(). create_buf_file() is called | |
369 | once for each per-cpu buffer from relay_open() to allow the client to | |
370 | create a file to be used to represent the corresponding buffer; if | |
371 | this callback is not defined, the default implementation will create | |
372 | and return a file in the relayfs filesystem to represent the buffer. | |
373 | The callback should return the dentry of the file created to represent | |
374 | the relay buffer. Note that the parent directory passed to | |
375 | relay_open() (and passed along to the callback), if specified, must | |
376 | exist in the same filesystem the new relay file is created in. If | |
377 | create_buf_file() is defined, remove_buf_file() must also be defined; | |
378 | it's responsible for deleting the file(s) created in create_buf_file() | |
379 | and is called during relay_close(). | |
380 | ||
df49af8f TZ |
381 | The create_buf_file() implementation can also be defined in such a way |
382 | as to allow the creation of a single 'global' buffer instead of the | |
383 | default per-cpu set. This can be useful for applications interested | |
384 | mainly in seeing the relative ordering of system-wide events without | |
385 | the need to bother with saving explicit timestamps for the purpose of | |
386 | merging/sorting per-cpu files in a postprocessing step. | |
387 | ||
388 | To have relay_open() create a global buffer, the create_buf_file() | |
389 | implementation should set the value of the is_global outparam to a | |
390 | non-zero value in addition to creating the file that will be used to | |
391 | represent the single buffer. In the case of a global buffer, | |
392 | create_buf_file() and remove_buf_file() will be called only once. The | |
393 | normal channel-writing functions e.g. relay_write() can still be used | |
394 | - writes from any cpu will transparently end up in the global buffer - | |
395 | but since it is a global buffer, callers should make sure they use the | |
396 | proper locking for such a buffer, either by wrapping writes in a | |
397 | spinlock, or by copying a write function from relayfs_fs.h and | |
398 | creating a local version that internally does the proper locking. | |
399 | ||
03d78d11 TZ |
400 | See the 'exported-relayfile' examples in the relay-apps tarball for |
401 | examples of creating and using relay files in debugfs. | |
402 | ||
e82894f8 TZ |
403 | Misc |
404 | ---- | |
405 | ||
406 | Some applications may want to keep a channel around and re-use it | |
407 | rather than open and close a new channel for each use. relay_reset() | |
408 | can be used for this purpose - it resets a channel to its initial | |
409 | state without reallocating channel buffer memory or destroying | |
410 | existing mappings. It should however only be called when it's safe to | |
411 | do so i.e. when the channel isn't currently being written to. | |
412 | ||
413 | Finally, there are a couple of utility callbacks that can be used for | |
414 | different purposes. buf_mapped() is called whenever a channel buffer | |
415 | is mmapped from user space and buf_unmapped() is called when it's | |
416 | unmapped. The client can use this notification to trigger actions | |
417 | within the kernel application, such as enabling/disabling logging to | |
418 | the channel. | |
419 | ||
420 | ||
421 | Resources | |
422 | ========= | |
423 | ||
424 | For news, example code, mailing list, etc. see the relayfs homepage: | |
425 | ||
426 | http://relayfs.sourceforge.net | |
427 | ||
428 | ||
429 | Credits | |
430 | ======= | |
431 | ||
432 | The ideas and specs for relayfs came about as a result of discussions | |
433 | on tracing involving the following: | |
434 | ||
435 | Michel Dagenais <michel.dagenais@polymtl.ca> | |
436 | Richard Moore <richardj_moore@uk.ibm.com> | |
437 | Bob Wisniewski <bob@watson.ibm.com> | |
438 | Karim Yaghmour <karim@opersys.com> | |
439 | Tom Zanussi <zanussi@us.ibm.com> | |
440 | ||
441 | Also thanks to Hubertus Franke for a lot of useful suggestions and bug | |
442 | reports. |