Commit | Line | Data |
---|---|---|
bbb5bbb0 RD |
1 | <?xml version="1.0" encoding="UTF-8"?> |
2 | <!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook XML V4.1.2//EN" | |
3 | "http://www.oasis-open.org/docbook/xml/4.1.2/docbookx.dtd" []> | |
4 | ||
5 | <book id="Linux-filesystems-API"> | |
6 | <bookinfo> | |
7 | <title>Linux Filesystems API</title> | |
8 | ||
9 | <legalnotice> | |
10 | <para> | |
11 | This documentation is free software; you can redistribute | |
12 | it and/or modify it under the terms of the GNU General Public | |
13 | License as published by the Free Software Foundation; either | |
14 | version 2 of the License, or (at your option) any later | |
15 | version. | |
16 | </para> | |
17 | ||
18 | <para> | |
19 | This program is distributed in the hope that it will be | |
20 | useful, but WITHOUT ANY WARRANTY; without even the implied | |
21 | warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. | |
22 | See the GNU General Public License for more details. | |
23 | </para> | |
24 | ||
25 | <para> | |
26 | You should have received a copy of the GNU General Public | |
27 | License along with this program; if not, write to the Free | |
28 | Software Foundation, Inc., 59 Temple Place, Suite 330, Boston, | |
29 | MA 02111-1307 USA | |
30 | </para> | |
31 | ||
32 | <para> | |
33 | For more details see the file COPYING in the source | |
34 | distribution of Linux. | |
35 | </para> | |
36 | </legalnotice> | |
37 | </bookinfo> | |
38 | ||
39 | <toc></toc> | |
40 | ||
41 | <chapter id="vfs"> | |
42 | <title>The Linux VFS</title> | |
5c3b4474 | 43 | <sect1 id="the_filesystem_types"><title>The Filesystem types</title> |
bbb5bbb0 RD |
44 | !Iinclude/linux/fs.h |
45 | </sect1> | |
5c3b4474 | 46 | <sect1 id="the_directory_cache"><title>The Directory Cache</title> |
bbb5bbb0 RD |
47 | !Efs/dcache.c |
48 | !Iinclude/linux/dcache.h | |
49 | </sect1> | |
5c3b4474 | 50 | <sect1 id="inode_handling"><title>Inode Handling</title> |
bbb5bbb0 RD |
51 | !Efs/inode.c |
52 | !Efs/bad_inode.c | |
53 | </sect1> | |
5c3b4474 | 54 | <sect1 id="registration_and_superblocks"><title>Registration and Superblocks</title> |
bbb5bbb0 RD |
55 | !Efs/super.c |
56 | </sect1> | |
5c3b4474 | 57 | <sect1 id="file_locks"><title>File Locks</title> |
bbb5bbb0 RD |
58 | !Efs/locks.c |
59 | !Ifs/locks.c | |
60 | </sect1> | |
5c3b4474 | 61 | <sect1 id="other_functions"><title>Other Functions</title> |
bbb5bbb0 RD |
62 | !Efs/mpage.c |
63 | !Efs/namei.c | |
64 | !Efs/buffer.c | |
65 | !Efs/bio.c | |
66 | !Efs/seq_file.c | |
67 | !Efs/filesystems.c | |
68 | !Efs/fs-writeback.c | |
69 | !Efs/block_dev.c | |
70 | </sect1> | |
71 | </chapter> | |
72 | ||
73 | <chapter id="proc"> | |
74 | <title>The proc filesystem</title> | |
75 | ||
5c3b4474 | 76 | <sect1 id="sysctl_interface"><title>sysctl interface</title> |
bbb5bbb0 RD |
77 | !Ekernel/sysctl.c |
78 | </sect1> | |
79 | ||
5c3b4474 | 80 | <sect1 id="proc_filesystem_interface"><title>proc filesystem interface</title> |
bbb5bbb0 RD |
81 | !Ifs/proc/base.c |
82 | </sect1> | |
83 | </chapter> | |
84 | ||
36182185 RD |
85 | <chapter id="fs_events"> |
86 | <title>Events based on file descriptors</title> | |
87 | !Efs/eventfd.c | |
88 | </chapter> | |
89 | ||
bbb5bbb0 RD |
90 | <chapter id="sysfs"> |
91 | <title>The Filesystem for Exporting Kernel Objects</title> | |
92 | !Efs/sysfs/file.c | |
93 | !Efs/sysfs/symlink.c | |
bbb5bbb0 RD |
94 | </chapter> |
95 | ||
96 | <chapter id="debugfs"> | |
97 | <title>The debugfs filesystem</title> | |
98 | ||
5c3b4474 | 99 | <sect1 id="debugfs_interface"><title>debugfs interface</title> |
bbb5bbb0 RD |
100 | !Efs/debugfs/inode.c |
101 | !Efs/debugfs/file.c | |
102 | </sect1> | |
103 | </chapter> | |
104 | ||
733b72c3 RD |
105 | <chapter id="LinuxJDBAPI"> |
106 | <chapterinfo> | |
107 | <title>The Linux Journalling API</title> | |
108 | ||
109 | <authorgroup> | |
110 | <author> | |
111 | <firstname>Roger</firstname> | |
112 | <surname>Gammans</surname> | |
113 | <affiliation> | |
114 | <address> | |
115 | <email>rgammans@computer-surgery.co.uk</email> | |
116 | </address> | |
117 | </affiliation> | |
118 | </author> | |
119 | </authorgroup> | |
120 | ||
121 | <authorgroup> | |
122 | <author> | |
123 | <firstname>Stephen</firstname> | |
124 | <surname>Tweedie</surname> | |
125 | <affiliation> | |
126 | <address> | |
127 | <email>sct@redhat.com</email> | |
128 | </address> | |
129 | </affiliation> | |
130 | </author> | |
131 | </authorgroup> | |
132 | ||
133 | <copyright> | |
134 | <year>2002</year> | |
135 | <holder>Roger Gammans</holder> | |
136 | </copyright> | |
137 | </chapterinfo> | |
138 | ||
139 | <title>The Linux Journalling API</title> | |
140 | ||
5c3b4474 | 141 | <sect1 id="journaling_overview"> |
733b72c3 | 142 | <title>Overview</title> |
5c3b4474 | 143 | <sect2 id="journaling_details"> |
733b72c3 RD |
144 | <title>Details</title> |
145 | <para> | |
146 | The journalling layer is easy to use. First of all you | |
147 | need to create a journal_t data structure. There are | |
148 | two calls to do this, depending on how you decide to allocate the physical | |
149 | media on which the journal resides. The journal_init_inode() call | |
150 | is for journals stored in filesystem inodes, while the journal_init_dev() | |
151 | call can be used for a journal stored on a raw device (in a contiguous range | |
152 | of blocks). Both return a pointer to a journal_t, a typedef for a struct, so when | |
153 | you are finally finished make sure you call journal_destroy() on it | |
154 | to free up any used kernel memory. | |
155 | </para> | |
156 | ||
157 | <para> | |
158 | Once you have got your journal_t object you need to 'mount' or load the journal | |
159 | file, unless of course you haven't initialised it yet - in which case you | |
160 | need to call journal_create(). | |
161 | </para> | |
162 | ||
163 | <para> | |
164 | Most of the time, however, your journal file will already have been created, but | |
165 | before you load it you must call journal_wipe() to empty the journal file. | |
166 | Hang on, you say, what if the filesystem wasn't cleanly umount()'d? Well, it is the | |
167 | job of the client file system to detect this and skip the call to journal_wipe(). | |
168 | </para> | |
169 | ||
170 | <para> | |
171 | In either case the next call should be to journal_load() which prepares the | |
172 | journal file for use. Note that journal_wipe(..,0) calls journal_skip_recovery() | |
173 | for you if it detects any outstanding transactions in the journal and similarly | |
174 | journal_load() will call journal_recover() if necessary. | |
175 | I would advise reading fs/ext3/super.c for examples of this stage. | |
176 | [RGG: Why is the journal_wipe() call necessary - doesn't this needlessly | |
177 | complicate the API? Or isn't it a good idea for the journal layer to hide | |
178 | dirty mounts from the client fs?] | |
179 | </para> | |
180 | ||
181 | <para> | |
182 | Now you can go ahead and start modifying the underlying | |
183 | filesystem. Almost. | |
184 | </para> | |
185 | ||
186 | <para> | |
187 | ||
188 | You still need to actually journal your filesystem changes; this | |
189 | is done by wrapping them into transactions. Additionally, you | |
190 | need to wrap the modification of each of the buffers | |
191 | with calls to the journal layer, so it knows which modifications | |
192 | you are actually making. To begin a transaction use journal_start(), which | |
193 | returns a transaction handle. | |
194 | </para> | |
195 | ||
196 | <para> | |
197 | journal_start() | |
198 | and its counterpart journal_stop(), which indicates the end of a transaction, | |
199 | are nestable calls, so you can reenter a transaction if necessary, | |
200 | but remember you must call journal_stop() the same number of times as | |
201 | journal_start() before the transaction is completed (or, more accurately, | |
202 | leaves the update phase). Ext3/VFS makes use of this feature to simplify | |
203 | quota support. | |
204 | </para> | |
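<para>
The nesting rule can be pictured with a small userspace model. This is
only a sketch of the reference counting idea described above, not jbd
code: struct model_handle, model_start(), model_stop() and
current_handle are hypothetical names invented for the illustration, and
the real handle_t carries considerably more state.
</para>

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a transaction handle; NOT the jbd handle_t. */
struct model_handle {
	int refs;        /* outstanding model_start() calls */
	int update_done; /* set once the outermost model_stop() has run */
};

/* Hypothetical per-task slot: one outstanding transaction per task. */
static struct model_handle *current_handle;

/* Start a transaction, or re-enter the one this task already has open. */
static struct model_handle *model_start(struct model_handle *storage)
{
	if (current_handle) {
		/* Nested call: take another reference on the same handle. */
		current_handle->refs++;
		return current_handle;
	}
	storage->refs = 1;
	storage->update_done = 0;
	current_handle = storage;
	return storage;
}

/* Drop one reference; only the outermost stop ends the update phase. */
static void model_stop(struct model_handle *h)
{
	assert(h == current_handle && h->refs > 0);
	if (--h->refs == 0) {
		h->update_done = 1;
		current_handle = NULL;
	}
}
```

<para>
In this model a second model_start() returns the same handle with refs
raised to 2, and update_done stays 0 until model_stop() has been called
twice, which mirrors the requirement to pair every journal_start() with
a journal_stop().
</para>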
205 | ||
206 | <para> | |
207 | Inside each transaction you need to wrap the modifications to the | |
208 | individual buffers (blocks). Before you start to modify a buffer you | |
209 | need to call journal_get_{create,write,undo}_access() as appropriate; | |
210 | this allows the journalling layer to copy the unmodified data if it | |
211 | needs to. After all, the buffer may be part of a previous, uncommitted | |
212 | transaction. | |
213 | At this point you are at last ready to modify a buffer, and once | |
214 | you have done so you need to call journal_dirty_{meta,}data(). | |
215 | Or, if you've asked for access to a buffer that you now know is no longer | |
216 | required to be pushed back to the device, you can call journal_forget() | |
217 | in much the same way as you might have used bforget() in the past. | |
218 | </para> | |
219 | ||
220 | <para> | |
221 | A journal_flush() may be called at any time to commit and checkpoint | |
222 | all your transactions. | |
223 | </para> | |
224 | ||
225 | <para> | |
34e5053f AB |
226 | Then at umount time, in your put_super(), you can call journal_destroy() |
227 | to clean up your in-core journal object. | |
733b72c3 RD |
228 | </para> |
229 | ||
230 | <para> | |
231 | Unfortunately there are a couple of ways the journal layer can cause a deadlock. | |
232 | The first thing to note is that each task can only have | |
233 | a single outstanding transaction at any one time; remember, nothing | |
234 | commits until the outermost journal_stop(). This means | |
235 | you must complete the transaction at the end of each file/inode/address | |
236 | etc. operation you perform, so that the journalling system isn't re-entered | |
237 | on another journal: transactions can't be nested/batched | |
238 | across differing journals, and a filesystem other than | |
239 | yours (say ext3) may be modified in a later syscall. | |
240 | </para> | |
241 | ||
242 | <para> | |
243 | The second case to bear in mind is that journal_start() can | |
244 | block if there isn't enough space in the journal for your transaction | |
245 | (based on the passed nblocks param) - when it blocks it merely(!) needs to | |
246 | wait for transactions from other tasks to complete and be committed, | |
247 | so essentially we are waiting for journal_stop(). This means | |
248 | you must treat journal_start/stop() as if they | |
249 | were semaphores and include them in your semaphore ordering rules to prevent | |
250 | deadlocks. Note that journal_extend() has similar blocking behaviour to | |
251 | journal_start() so you can deadlock here just as easily as on journal_start(). | |
252 | </para> | |
253 | ||
254 | <para> | |
255 | Try to reserve the right number of blocks the first time ;-). This will | |
256 | be the maximum number of blocks you are going to touch in this transaction. | |
257 | I advise having a look at ext3_jbd.h, at least, to see the basis on which | |
258 | ext3 makes these decisions. | |
259 | </para> | |
260 | ||
261 | <para> | |
262 | Another wrinkle to watch out for is your on-disk block allocation strategy. | |
263 | Why? Because, if you undo a delete, you need to ensure you haven't reused any | |
264 | of the freed blocks in a later transaction. One simple way of doing this | |
265 | is to make sure any blocks you allocate only have checkpointed transactions | |
266 | listed against them. Ext3 does this in ext3_test_allocatable(). | |
267 | </para> | |
268 | ||
269 | <para> | |
270 | Locking is also provided through journal_{un,}lock_updates(); | |
271 | ext3 uses this when it wants a window with a clean and stable fs for a moment, | |
272 | e.g. | |
273 | </para> | |
274 | ||
275 | <programlisting> | |
276 | ||
277 | journal_lock_updates() // stop new stuff happening.. | |
278 | journal_flush() // checkpoint everything. | |
279 | // ..do stuff on stable fs | |
280 | journal_unlock_updates() // carry on with filesystem use. | |
281 | </programlisting> | |
282 | ||
283 | <para> | |
284 | The opportunities for abuse and DoS attacks with this should be obvious, | |
285 | if you allow unprivileged userspace to trigger codepaths containing these | |
286 | calls. | |
287 | </para> | |
288 | ||
289 | <para> | |
290 | A new feature of jbd since 2.5.25 is commit callbacks. With the new | |
291 | journal_callback_set() function you can now ask the journalling layer | |
292 | to call you back when the transaction is finally committed to disk, so that | |
293 | you can do some of your own management. The key to this is the journal_callback | |
294 | struct; this maintains the internal callback information but you can | |
295 | extend it like this: | |
296 | </para> | |
297 | <programlisting> | |
298 | struct myfs_callback_s { | |
299 | /* Data structure element required by jbd. */ | |
300 | struct journal_callback for_jbd; | |
301 | /* Stuff for myfs, allocated together. */ | |
302 | myfs_inode *i_committed; | |
303 | ||
304 | }; | |
305 | </programlisting> | |
306 | ||
307 | <para> | |
308 | This would be useful if you needed to know when data was committed to a | |
309 | particular inode. | |
310 | </para> | |
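<para>
The embedding idea can be tried out in userspace. The sketch below is a
model under stated assumptions, not the jbd interface: struct
model_callback stands in for struct journal_callback, model_commit()
stands in for the journalling layer firing the callback, and the field
names are hypothetical. The signature of journal_callback_set() is not
reproduced here. It only shows how a callback that receives the embedded
struct can get back to the enclosing filesystem structure.
</para>

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct journal_callback. */
struct model_callback {
	void (*func)(struct model_callback *cb, int error);
};

/* Filesystem-private data allocated together with the callback. */
struct myfs_callback {
	struct model_callback for_jbd; /* element owned by the journal layer */
	int inode_no;                  /* hypothetical myfs bookkeeping */
	int committed;                 /* set when the commit is reported */
};

/* Commit callback: recover the enclosing struct from the embedded one. */
static void myfs_commit_done(struct model_callback *cb, int error)
{
	struct myfs_callback *mcb = (struct myfs_callback *)
		((char *)cb - offsetof(struct myfs_callback, for_jbd));

	mcb->committed = (error == 0);
}

/* Model of the journalling layer invoking the callback at commit time. */
static void model_commit(struct model_callback *cb, int error)
{
	assert(cb != NULL && cb->func != NULL);
	cb->func(cb, error);
}
```

<para>
Because the journal layer only ever sees the embedded member, the
filesystem is free to add whatever fields it needs after it, at the cost
of one pointer adjustment inside the callback.
</para>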
311 | ||
312 | </sect2> | |
313 | ||
5c3b4474 | 314 | <sect2 id="jbd_summary"> |
733b72c3 RD |
315 | <title>Summary</title> |
316 | <para> | |
317 | Using the journal is a matter of wrapping the different context changes, | |
318 | namely each mount, each modification (transaction) and each changed buffer, | |
319 | to tell the journalling layer about them. | |
320 | </para> | |
321 | ||
322 | <para> | |
323 | Here is some pseudo code to give you an idea of how it works, as | |
324 | an example. | |
325 | </para> | |
326 | ||
327 | <programlisting> | |
328 | journal_t *my_jrnl = journal_init_{dev,inode}(...); | |
329 | if (!initialised) journal_create(my_jrnl); /* new journal only */ | |
330 | if (clean) journal_wipe(my_jrnl, ...); | |
331 | journal_load(my_jrnl); | |
332 | ||
333 | foreach(transaction) { /* transactions must be | |
334 | completed before | |
335 | a syscall returns to | |
336 | userspace */ | |
337 | ||
338 | handle_t *xact = journal_start(my_jrnl, nblocks); | |
339 | foreach(bh) { | |
340 | journal_get_{create,write,undo}_access(xact, bh); | |
341 | if ( myfs_modify(bh) ) { /* returns true | |
342 | if makes changes */ | |
343 | journal_dirty_{meta,}data(xact, bh); | |
344 | } else { | |
345 | journal_forget(xact, bh); | |
346 | } | |
347 | } | |
348 | journal_stop(xact); | |
349 | } | |
350 | journal_destroy(my_jrnl); | |
351 | </programlisting> | |
352 | </sect2> | |
353 | ||
354 | </sect1> | |
355 | ||
5c3b4474 | 356 | <sect1 id="data_types"> |
733b72c3 RD |
357 | <title>Data Types</title> |
358 | <para> | |
359 | The journalling layer uses typedefs to 'hide' the concrete definitions | |
360 | of the structures used. As a client of the JBD layer you can | |
361 | just rely on using the pointer as a magic cookie of some sort. | |
362 | ||
363 | Obviously the hiding is not enforced as this is 'C'. | |
364 | </para> | |
5c3b4474 | 365 | <sect2 id="structures"><title>Structures</title> |
733b72c3 RD |
366 | !Iinclude/linux/jbd.h |
367 | </sect2> | |
368 | </sect1> | |
369 | ||
5c3b4474 | 370 | <sect1 id="functions"> |
733b72c3 RD |
371 | <title>Functions</title> |
372 | <para> | |
373 | The functions here are split into two groups: those that | |
374 | affect a journal as a whole, and those which are used to | |
375 | manage transactions. | |
376 | </para> | |
5c3b4474 | 377 | <sect2 id="journal_level"><title>Journal Level</title> |
733b72c3 RD |
378 | !Efs/jbd/journal.c |
379 | !Ifs/jbd/recovery.c | |
380 | </sect2> | |
5c3b4474 | 381 | <sect2 id="transaction_level"><title>Transaction Level</title> |
733b72c3 RD |
382 | !Efs/jbd/transaction.c |
383 | </sect2> | |
384 | </sect1> | |
5c3b4474 | 385 | <sect1 id="see_also"> |
733b72c3 RD |
386 | <title>See also</title> |
387 | <para> | |
388 | <citation> | |
96824f4b | 389 | <ulink url="http://kernel.org/pub/linux/kernel/people/sct/ext3/journal-design.ps.gz"> |
733b72c3 RD |
390 | Journaling the Linux ext2fs Filesystem, LinuxExpo 98, Stephen Tweedie |
391 | </ulink> | |
392 | </citation> | |
393 | </para> | |
394 | <para> | |
395 | <citation> | |
396 | <ulink url="http://olstrans.sourceforge.net/release/OLS2000-ext3/OLS2000-ext3.html"> | |
397 | Ext3 Journalling FileSystem, OLS 2000, Dr. Stephen Tweedie | |
398 | </ulink> | |
399 | </citation> | |
400 | </para> | |
401 | </sect1> | |
402 | ||
403 | </chapter> | |
404 | ||
073b86da RD |
405 | <chapter id="splice"> |
406 | <title>splice API</title> | |
407 | <para> | |
408 | splice is a method for moving blocks of data around inside the | |
409 | kernel, without continually transferring them between the kernel | |
410 | and user space. | |
411 | </para> | |
412 | !Ffs/splice.c | |
413 | </chapter> | |
414 | ||
415 | <chapter id="pipes"> | |
416 | <title>pipes API</title> | |
417 | <para> | |
418 | Pipe interfaces are all for in-kernel (builtin image) use. | |
419 | They are not exported for use by modules. | |
420 | </para> | |
421 | !Iinclude/linux/pipe_fs_i.h | |
422 | !Ffs/pipe.c | |
423 | </chapter> | |
424 | ||
bbb5bbb0 | 425 | </book> |