Commit | Line | Data |
---|---|---|
25b532ce MCC |
1 | ==================== |
2 | Changes since 2.5.0: | |
3 | ==================== | |
4 | ||
5 | --- | |
6 | ||
7 | **recommended** | |
8 | ||
9 | New helpers: sb_bread(), sb_getblk(), sb_find_get_block(), set_bh(), | |
10 | sb_set_blocksize() and sb_min_blocksize(). | |
11 | ||
12 | Use them. | |
13 | ||
14 | (sb_find_get_block() replaces 2.4's get_hash_table()) | |
15 | ||
16 | --- | |
17 | ||
18 | **recommended** | |
19 | ||
20 | New methods: ->alloc_inode() and ->destroy_inode(). | |
21 | ||
22 | Remove inode->u.foo_inode_i | |
23 | ||
24 | Declare:: | |
25 | ||
26 | struct foo_inode_info { | |
27 | /* fs-private stuff */ | |
28 | struct inode vfs_inode; | |
29 | }; | |
30 | static inline struct foo_inode_info *FOO_I(struct inode *inode) | |
31 | { | |
32 | return list_entry(inode, struct foo_inode_info, vfs_inode); | |
33 | } | |
34 | ||
35 | Use FOO_I(inode) instead of &inode->u.foo_inode_i; | |
36 | ||
37 | Add foo_alloc_inode() and foo_destroy_inode() - the former should allocate | |
38 | foo_inode_info and return the address of ->vfs_inode, the latter should free | |
39 | FOO_I(inode) (see in-tree filesystems for examples). | |
40 | ||
41 | Make them ->alloc_inode and ->destroy_inode in your super_operations. | |
42 | ||
43 | Keep in mind that now you need explicit initialization of private data | |
44 | typically between calling iget_locked() and unlocking the inode. | |
45 | ||
46 | At some point that will become mandatory. | |
47 | ||
8b9f3ac5 MS |
48 | **mandatory** |
49 | ||
50 | The foo_inode_info should always be allocated through alloc_inode_sb() rather | |
51 | than kmem_cache_alloc() or kmalloc(), so that the inode reclaim context is set up | |
52 | correctly. | |
53 | ||
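
The embedding pattern described above can be sketched in userspace C. This is only an illustration under stated assumptions: `struct inode` here is a one-field stand-in, `foo_container_of()` and `foo_private` are hypothetical names, and `calloc()`/`free()` stand in for the kernel allocator (in-tree code would go through alloc_inode_sb() and take a superblock argument).

```c
#include <stddef.h>
#include <stdlib.h>

/* Stand-in for the kernel's struct inode, so the sketch builds in userspace. */
struct inode {
	unsigned long i_ino;
};

/* Same idea as the kernel's container_of(): member pointer -> enclosing struct. */
#define foo_container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

struct foo_inode_info {
	int foo_private;		/* fs-private stuff (hypothetical field) */
	struct inode vfs_inode;		/* embedded VFS inode */
};

static inline struct foo_inode_info *FOO_I(struct inode *inode)
{
	return foo_container_of(inode, struct foo_inode_info, vfs_inode);
}

/* Role of ->alloc_inode: allocate the container, return the embedded inode. */
static struct inode *foo_alloc_inode(void)
{
	/* calloc() is an assumption; it replaces the kernel allocator here. */
	struct foo_inode_info *fi = calloc(1, sizeof(*fi));

	if (!fi)
		return NULL;
	fi->foo_private = 42;	/* explicit initialization of private data */
	return &fi->vfs_inode;
}

/* Role of ->destroy_inode: free the container, not the embedded member. */
static void foo_destroy_inode(struct inode *inode)
{
	free(FOO_I(inode));
}
```

The point of the design is that the VFS only ever sees `struct inode *`, while the filesystem recovers its private data with pointer arithmetic instead of a union inside the inode.
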
25b532ce MCC |
54 | --- |
55 | ||
56 | **mandatory** | |
57 | ||
58 | Change of file_system_type method (->read_super to ->get_sb) | |
59 | ||
60 | ->read_super() is no more. Ditto for DECLARE_FSTYPE and DECLARE_FSTYPE_DEV. | |
61 | ||
62 | Turn your foo_read_super() into a function that would return 0 in case of | |
63 | success and negative number in case of error (-EINVAL unless you have more | |
64 | informative error value to report). Call it foo_fill_super(). Now declare:: | |
65 | ||
66 | int foo_get_sb(struct file_system_type *fs_type, | |
67 | int flags, const char *dev_name, void *data, struct vfsmount *mnt) | |
68 | { | |
69 | return get_sb_bdev(fs_type, flags, dev_name, data, foo_fill_super, | |
70 | mnt); | |
71 | } | |
72 | ||
73 | (or similar with s/bdev/nodev/ or s/bdev/single/, depending on the kind of | |
74 | filesystem). | |
75 | ||
76 | Replace DECLARE_FSTYPE... with explicit initializer and have ->get_sb set as | |
77 | foo_get_sb. | |
78 | ||
79 | --- | |
80 | ||
81 | **mandatory** | |
82 | ||
83 | Locking change: ->s_vfs_rename_sem is taken only by cross-directory renames. | |
84 | Most likely there is no need to change anything, but if you relied on | |
85 | global exclusion between renames for some internal purpose - you need to | |
86 | change your internal locking. Otherwise exclusion warranties remain the | |
87 | same (i.e. parents and victim are locked, etc.). | |
88 | ||
89 | --- | |
90 | ||
91 | **informational** | |
92 | ||
93 | Now we have the exclusion between ->lookup() and directory removal (by | |
94 | ->rmdir() and ->rename()). If you used to need that exclusion and provided | |
95 | it by internal locking (most filesystems couldn't care less) - you | |
96 | can relax your locking. | |
97 | ||
98 | --- | |
99 | ||
100 | **mandatory** | |
101 | ||
102 | ->lookup(), ->truncate(), ->create(), ->unlink(), ->mknod(), ->mkdir(), | |
103 | ->rmdir(), ->link(), ->lseek(), ->symlink(), ->rename() | |
104 | and ->readdir() are called without BKL now. Grab it on entry, drop upon return | |
105 | - that will guarantee the same locking you used to have. If your method or its | |
106 | parts do not need BKL - better yet, now you can shift lock_kernel() and | |
107 | unlock_kernel() so that they would protect exactly what needs to be | |
108 | protected. | |
109 | ||
110 | --- | |
111 | ||
112 | **mandatory** | |
113 | ||
114 | BKL is also moved from around sb operations. BKL should have been shifted into | |
115 | individual fs sb_op functions. If you don't need it, remove it. | |
116 | ||
117 | --- | |
118 | ||
119 | **informational** | |
120 | ||
121 | check for ->link() target not being a directory is done by callers. Feel | |
122 | free to drop it... | |
123 | ||
124 | --- | |
125 | ||
126 | **informational** | |
127 | ||
128 | ->link() callers hold ->i_mutex on the object we are linking to. Some of your | |
129 | problems might be over... | |
130 | ||
131 | --- | |
132 | ||
133 | **mandatory** | |
134 | ||
135 | new file_system_type method - kill_sb(superblock). If you are converting | |
136 | an existing filesystem, set it according to ->fs_flags:: | |
137 | ||
138 | FS_REQUIRES_DEV - kill_block_super | |
139 | FS_LITTER - kill_litter_super | |
140 | neither - kill_anon_super | |
141 | ||
142 | FS_LITTER is gone - just remove it from fs_flags. | |
143 | ||
144 | --- | |
145 | ||
146 | **mandatory** | |
147 | ||
148 | FS_SINGLE is gone (actually, that had happened back when ->get_sb() | |
149 | went in - and hadn't been documented ;-/). Just remove it from fs_flags | |
150 | (and see ->get_sb() entry for other actions). | |
151 | ||
152 | --- | |
153 | ||
154 | **mandatory** | |
155 | ||
156 | ->setattr() is called without BKL now. Caller _always_ holds ->i_mutex, so | |
157 | watch for ->i_mutex-grabbing code that might be used by your ->setattr(). | |
158 | Callers of notify_change() need ->i_mutex now. | |
159 | ||
160 | --- | |
161 | ||
162 | **recommended** | |
163 | ||
164 | New super_block field ``struct export_operations *s_export_op`` for | |
165 | explicit support for exporting, e.g. via NFS. The structure is fully | |
166 | documented at its declaration in include/linux/fs.h, and in | |
9195c3e8 | 167 | Documentation/filesystems/nfs/exporting.rst. |
25b532ce MCC |
168 | |
169 | Briefly it allows for the definition of decode_fh and encode_fh operations | |
170 | to encode and decode filehandles, and allows the filesystem to use | |
171 | a standard helper function for decode_fh, and provide file-system specific | |
172 | support for this helper, particularly get_parent. | |
173 | ||
174 | It is planned that this will be required for exporting once the code | |
175 | settles down a bit. | |
176 | ||
177 | **mandatory** | |
178 | ||
179 | s_export_op is now required for exporting a filesystem. | |
d56b699d | 180 | isofs, ext2, ext3, reiserfs, fat |
25b532ce MCC |
181 | can be used as examples of very different filesystems. |
182 | ||
183 | --- | |
184 | ||
185 | **mandatory** | |
186 | ||
187 | iget4() and the read_inode2 callback have been superseded by iget5_locked() | |
188 | which has the following prototype:: | |
189 | ||
190 | struct inode *iget5_locked(struct super_block *sb, unsigned long ino, | |
191 | int (*test)(struct inode *, void *), | |
192 | int (*set)(struct inode *, void *), | |
193 | void *data); | |
194 | ||
195 | 'test' is an additional function that can be used when the inode | |
196 | number is not sufficient to identify the actual file object. 'set' | |
197 | should be a non-blocking function that initializes those parts of a | |
198 | newly created inode to allow the test function to succeed. 'data' is | |
199 | passed as an opaque value to both test and set functions. | |
200 | ||
201 | When the inode has been created by iget5_locked(), it will be returned with the | |
202 | I_NEW flag set and will still be locked. The filesystem then needs to finalize | |
203 | the initialization. Once the inode is initialized it must be unlocked by | |
204 | calling unlock_new_inode(). | |
205 | ||
206 | The filesystem is responsible for setting (and possibly testing) i_ino | |
207 | when appropriate. There is also a simpler iget_locked function that | |
208 | just takes the superblock and inode number as arguments and does the | |
209 | test and set for you. | |
210 | ||
211 | e.g.:: | |
212 | ||
213 | inode = iget_locked(sb, ino); | |
214 | if (inode->i_state & I_NEW) { | |
215 | err = read_inode_from_disk(inode); | |
216 | if (err < 0) { | |
217 | iget_failed(inode); | |
218 | return err; | |
219 | } | |
220 | unlock_new_inode(inode); | |
221 | } | |
222 | ||
223 | Note that if the process of setting up a new inode fails, then iget_failed() | |
224 | should be called on the inode to render it dead, and an appropriate error | |
225 | should be passed back to the caller. | |
226 | ||
227 | --- | |
228 | ||
229 | **recommended** | |
230 | ||
231 | ->getattr() finally getting used. See instances in nfs, minix, etc. | |
232 | ||
233 | --- | |
234 | ||
235 | **mandatory** | |
236 | ||
237 | ->revalidate() is gone. If your filesystem had it - provide ->getattr() | |
238 | and let it call whatever you had as ->revalidate() + (for symlinks that | |
239 | had ->revalidate()) add calls in ->follow_link()/->readlink(). | |
240 | ||
241 | --- | |
242 | ||
243 | **mandatory** | |
244 | ||
245 | ->d_parent changes are not protected by BKL anymore. Read access is safe | |
246 | if at least one of the following is true: | |
247 | ||
248 | * filesystem has no cross-directory rename() | |
249 | * we know that parent had been locked (e.g. we are looking at | |
250 | ->d_parent of ->lookup() argument). | |
251 | * we are called from ->rename(). | |
252 | * the child's ->d_lock is held | |
253 | ||
254 | Audit your code and add locking if needed. Notice that any place that is | |
255 | not protected by the conditions above is risky even in the old tree - you | |
256 | had been relying on BKL and that's prone to screwups. Old tree had quite | |
257 | a few holes of that kind - unprotected access to ->d_parent leading to | |
258 | anything from oops to silent memory corruption. | |
259 | ||
260 | --- | |
261 | ||
262 | **mandatory** | |
263 | ||
264 | FS_NOMOUNT is gone. If you use it - just set SB_NOUSER in flags | |
265 | (see rootfs for one kind of solution and bdev/socket/pipe for another). | |
266 | ||
267 | --- | |
268 | ||
269 | **recommended** | |
270 | ||
271 | Use bdev_read_only(bdev) instead of is_read_only(kdev). The latter | |
272 | is still alive, but only because of the mess in drivers/s390/block/dasd.c. | |
273 | As soon as it gets fixed is_read_only() will die. | |
274 | ||
275 | --- | |
276 | ||
277 | **mandatory** | |
278 | ||
279 | ->permission() is called without BKL now. Grab it on entry, drop upon | |
280 | return - that will guarantee the same locking you used to have. If | |
281 | your method or its parts do not need BKL - better yet, now you can | |
282 | shift lock_kernel() and unlock_kernel() so that they would protect | |
283 | exactly what needs to be protected. | |
284 | ||
285 | --- | |
286 | ||
287 | **mandatory** | |
288 | ||
289 | ->statfs() is now called without BKL held. BKL should have been | |
290 | shifted into individual fs sb_op functions where it's not clear that | |
291 | it's safe to remove it. If you don't need it, remove it. | |
292 | ||
293 | --- | |
294 | ||
295 | **mandatory** | |
296 | ||
297 | is_read_only() is gone; use bdev_read_only() instead. | |
298 | ||
299 | --- | |
300 | ||
301 | **mandatory** | |
302 | ||
303 | destroy_buffers() is gone; use invalidate_bdev(). | |
304 | ||
305 | --- | |
306 | ||
307 | **mandatory** | |
308 | ||
309 | fsync_dev() is gone; use fsync_bdev(). NOTE: lvm breakage is | |
310 | deliberate; as soon as struct block_device * is propagated in a reasonable | |
311 | way by that code fixing will become trivial; until then nothing can be | |
312 | done. | |
313 | ||
314 | **mandatory** | |
315 | ||
316 | block truncation on error exit from ->write_begin, and ->direct_IO | |
317 | moved from generic methods (block_write_begin, cont_write_begin, | |
318 | nobh_write_begin, blockdev_direct_IO*) to callers. Take a look at | |
319 | ext2_write_failed and callers for an example. | |
320 | ||
321 | **mandatory** | |
322 | ||
323 | ->truncate is gone. The whole truncate sequence needs to be | |
324 | implemented in ->setattr, which is now mandatory for filesystems | |
325 | implementing on-disk size changes. Start with a copy of the old inode_setattr | |
326 | and vmtruncate, and then reorder the vmtruncate + foofs_vmtruncate sequence to | |
327 | be in order of zeroing blocks using block_truncate_page or similar helpers, | |
328 | size update and finally on-disk truncation, which should not fail. | |
329 | setattr_prepare (which used to be inode_change_ok) now includes the size checks | |
330 | for ATTR_SIZE and must be called in the beginning of ->setattr unconditionally. | |
331 | ||
332 | **mandatory** | |
333 | ||
334 | ->clear_inode() and ->delete_inode() are gone; ->evict_inode() should | |
335 | be used instead. It gets called whenever the inode is evicted, whether it has | |
336 | remaining links or not. Caller does *not* evict the pagecache or inode-associated | |
337 | metadata buffers; the method has to use truncate_inode_pages_final() to get rid | |
338 | of those. Caller makes sure async writeback cannot be running for the inode while | |
339 | (or after) ->evict_inode() is called. | |
340 | ||
341 | ->drop_inode() returns int now; it's called on final iput() with | |
342 | inode->i_lock held and it returns true if the filesystem wants the inode to be | |
343 | dropped. As before, generic_drop_inode() is still the default and it's been | |
344 | updated appropriately. generic_delete_inode() is also alive and it consists | |
345 | simply of return 1. Note that all actual eviction work is done by caller after | |
346 | ->drop_inode() returns. | |
347 | ||
348 | As before, clear_inode() must be called exactly once on each call of | |
349 | ->evict_inode() (as it used to be for each call of ->delete_inode()). Unlike | |
350 | before, if you are using inode-associated metadata buffers (i.e. | |
351 | mark_buffer_dirty_inode()), it's your responsibility to call | |
352 | invalidate_inode_buffers() before clear_inode(). | |
353 | ||
354 | NOTE: checking i_nlink in the beginning of ->write_inode() and bailing out | |
355 | if it's zero is not *and* *never* *had* *been* enough. Final unlink() and iput() | |
356 | may happen while the inode is in the middle of ->write_inode(); e.g. if you blindly | |
357 | free the on-disk inode, you may end up doing that while ->write_inode() is writing | |
358 | to it. | |
359 | ||
360 | --- | |
361 | ||
362 | **mandatory** | |
363 | ||
364 | .d_delete() now only advises the dcache as to whether or not to cache | |
365 | unreferenced dentries, and is now only called when the dentry refcount goes to | |
366 | 0. Even on 0 refcount transition, it must be able to tolerate being called 0, | |
367 | 1, or more times (e.g. constant, idempotent). | |
368 | ||
369 | --- | |
370 | ||
371 | **mandatory** | |
372 | ||
373 | .d_compare() calling convention and locking rules are significantly | |
374 | changed. Read updated documentation in Documentation/filesystems/vfs.rst (and | |
375 | look at examples of other filesystems) for guidance. | |
376 | ||
377 | --- | |
378 | ||
379 | **mandatory** | |
380 | ||
381 | .d_hash() calling convention and locking rules are significantly | |
382 | changed. Read updated documentation in Documentation/filesystems/vfs.rst (and | |
383 | look at examples of other filesystems) for guidance. | |
384 | ||
385 | --- | |
386 | ||
387 | **mandatory** | |
388 | ||
389 | dcache_lock is gone, replaced by fine grained locks. See fs/dcache.c | |
390 | for details of what locks to replace dcache_lock with in order to protect | |
391 | particular things. Most of the time, a filesystem only needs ->d_lock, which | |
392 | protects *all* the dcache state of a given dentry. | |
393 | ||
394 | --- | |
395 | ||
396 | **mandatory** | |
397 | ||
398 | Filesystems must RCU-free their inodes, if they can have been accessed | |
399 | via rcu-walk path walk (basically, if the file can have had a path name in the | |
400 | vfs namespace). | |
401 | ||
402 | Even though i_dentry and i_rcu share storage in a union, we will | |
403 | initialize the former in inode_init_always(), so just leave it alone in | |
404 | the callback. It used to be necessary to clean it there, but not anymore | |
405 | (starting at 3.2). | |
406 | ||
407 | --- | |
408 | ||
409 | **recommended** | |
410 | ||
411 | vfs now tries to do path walking in "rcu-walk mode", which avoids | |
412 | atomic operations and scalability hazards on dentries and inodes (see | |
413 | Documentation/filesystems/path-lookup.txt). d_hash and d_compare changes | |
414 | (above) are examples of the changes required to support this. For more complex | |
415 | filesystem callbacks, the vfs drops out of rcu-walk mode before the fs call, so | |
416 | no changes are required to the filesystem. However, this is costly and loses | |
417 | the benefits of rcu-walk mode. We will begin to add filesystem callbacks that | |
418 | are rcu-walk aware, shown below. Filesystems should take advantage of this | |
419 | where possible. | |
420 | ||
421 | --- | |
422 | ||
423 | **mandatory** | |
424 | ||
425 | d_revalidate is a callback that is made on every path element (if | |
426 | the filesystem provides it), which requires dropping out of rcu-walk mode. This | |
427 | may now be called in rcu-walk mode (nd->flags & LOOKUP_RCU). -ECHILD should be | |
428 | returned if the filesystem cannot handle rcu-walk. See | |
429 | Documentation/filesystems/vfs.rst for more details. | |
430 | ||
431 | permission is an inode permission check that is called on many or all | |
432 | directory inodes on the way down a path walk (to check for exec permission). It | |
433 | must now be rcu-walk aware (mask & MAY_NOT_BLOCK). See | |
434 | Documentation/filesystems/vfs.rst for more details. | |
435 | ||
436 | --- | |
437 | ||
438 | **mandatory** | |
439 | ||
440 | In ->fallocate() you must check the mode option passed in. If your | |
441 | filesystem does not support hole punching (deallocating space in the middle of a | |
442 | file) you must return -EOPNOTSUPP if FALLOC_FL_PUNCH_HOLE is set in mode. | |
443 | Currently you can only have FALLOC_FL_PUNCH_HOLE with FALLOC_FL_KEEP_SIZE set, | |
444 | so the i_size should not change when hole punching, even when punching the end of | |
445 | a file off. | |
446 | ||
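
A minimal sketch of the mode check, for a filesystem that does not support hole punching. The function name and the `FOO_`-prefixed constants are hypothetical stand-ins chosen so the fragment compiles on its own; a real ->fallocate() would use the FALLOC_FL_* definitions from the kernel headers and would go on to do the actual allocation.

```c
#include <errno.h>

/* Hypothetical local stand-ins for the FALLOC_FL_* mode bits. */
#define FOO_FALLOC_FL_KEEP_SIZE		0x01
#define FOO_FALLOC_FL_PUNCH_HOLE	0x02

/*
 * Mode check for a filesystem without hole-punching support:
 * reject any request that has the punch-hole bit set.
 */
static int foo_fallocate_check_mode(int mode)
{
	if (mode & FOO_FALLOC_FL_PUNCH_HOLE)
		return -EOPNOTSUPP;
	return 0;	/* other modes are handled by the rest of ->fallocate() */
}
```
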
447 | --- | |
448 | ||
449 | **mandatory** | |
450 | ||
451 | ->get_sb() is gone. Switch to use of ->mount(). Typically it's just | |
452 | a matter of switching from calling ``get_sb_``... to ``mount_``... and changing | |
453 | the function type. If you were doing it manually, just switch from setting | |
454 | ->mnt_root to some pointer to returning that pointer. On errors return | |
455 | ERR_PTR(...). | |
456 | ||
457 | --- | |
458 | ||
459 | **mandatory** | |
460 | ||
461 | ->permission() and generic_permission() have lost flags | |
462 | argument; instead of passing IPERM_FLAG_RCU we add MAY_NOT_BLOCK into mask. | |
463 | ||
464 | generic_permission() has also lost the check_acl argument; ACL checking | |
cac2f8b8 CB |
465 | has been taken to VFS and filesystems need to provide a non-NULL |
466 | ->i_op->get_inode_acl to read an ACL from disk. | |
25b532ce MCC |
467 | |
468 | --- | |
469 | ||
470 | **mandatory** | |
471 | ||
472 | If you implement your own ->llseek() you must handle SEEK_HOLE and | |
d56b699d | 473 | SEEK_DATA. You can handle this by returning -EINVAL, but it would be nicer to |
25b532ce MCC |
474 | support it in some way. The generic handler assumes that the entire file is |
475 | data and there is a virtual hole at the end of the file. So if the provided | |
476 | offset is less than i_size and SEEK_DATA is specified, return the same offset. | |
477 | If the above is true for the offset and you are given SEEK_HOLE, return the end | |
478 | of the file. If the offset is i_size or greater return -ENXIO in either case. | |
479 | ||
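
The rules for the generic handler can be written down as a small pure function. This is a sketch, not the in-tree implementation: `foo_seek_data_hole()` and `enum foo_whence` are hypothetical names standing in for the SEEK_DATA/SEEK_HOLE branch of an ->llseek() method, and rejecting negative offsets with -ENXIO is an assumption of the sketch.

```c
#include <errno.h>

/* Hypothetical stand-ins for SEEK_DATA / SEEK_HOLE. */
enum foo_whence { FOO_SEEK_DATA, FOO_SEEK_HOLE };

/*
 * Treat the whole file as data with one virtual hole at i_size.
 * Returns the resulting offset, or -ENXIO when the offset is at or
 * past the end of the file.
 */
static long long foo_seek_data_hole(long long offset, long long i_size,
				    enum foo_whence whence)
{
	if (offset < 0 || offset >= i_size)
		return -ENXIO;
	if (whence == FOO_SEEK_DATA)
		return offset;	/* already inside data */
	return i_size;		/* the only hole starts at end of file */
}
```
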
480 | **mandatory** | |
481 | ||
482 | If you have your own ->fsync() you must make sure to call | |
483 | filemap_write_and_wait_range() so that all dirty pages are synced out properly. | |
484 | You must also keep in mind that ->fsync() is not called with i_mutex held | |
485 | anymore, so if you require i_mutex locking you must make sure to take it and | |
486 | release it yourself. | |
487 | ||
488 | --- | |
489 | ||
490 | **mandatory** | |
491 | ||
492 | d_alloc_root() is gone, along with a lot of bugs caused by code | |
493 | misusing it. Replacement: d_make_root(inode). On success d_make_root(inode) | |
494 | allocates and returns a new dentry instantiated with the passed in inode. | |
495 | On failure NULL is returned and the passed in inode is dropped so the reference | |
496 | to inode is consumed in all cases and failure handling need not do any cleanup | |
497 | for the inode. If d_make_root(inode) is passed a NULL inode it returns NULL | |
498 | and also requires no further error handling. Typical usage is:: | |
499 | ||
500 | inode = foofs_new_inode(....); | |
501 | s->s_root = d_make_root(inode); | |
502 | if (!s->s_root) | |
503 | /* Nothing needed for the inode cleanup */ | |
504 | return -ENOMEM; | |
505 | ... | |
506 | ||
507 | --- | |
508 | ||
509 | **mandatory** | |
510 | ||
511 | The witch is dead! Well, 2/3 of it, anyway. ->d_revalidate() and | |
512 | ->lookup() do *not* take struct nameidata anymore; just the flags. | |
513 | ||
514 | --- | |
515 | ||
516 | **mandatory** | |
517 | ||
518 | ->create() doesn't take ``struct nameidata *``; unlike the previous | |
519 | two, it gets "is it an O_EXCL or equivalent?" boolean argument. Note that | |
d56b699d | 520 | local filesystems can ignore this argument - they are guaranteed that the |
25b532ce MCC |
521 | object doesn't exist. It's remote/distributed ones that might care... |
522 | ||
523 | --- | |
524 | ||
525 | **mandatory** | |
526 | ||
527 | FS_REVAL_DOT is gone; if you used to have it, add ->d_weak_revalidate() | |
528 | in your dentry operations instead. | |
529 | ||
530 | --- | |
531 | ||
532 | **mandatory** | |
533 | ||
534 | vfs_readdir() is gone; switch to iterate_dir() instead | |
535 | ||
536 | --- | |
537 | ||
538 | **mandatory** | |
539 | ||
3e327154 | 540 | ->readdir() is gone now; switch to ->iterate_shared() |
25b532ce MCC |
541 | |
542 | **mandatory** | |
543 | ||
544 | vfs_follow_link has been removed. Filesystems must use nd_set_link | |
545 | from ->follow_link for normal symlinks, or nd_jump_link for magic | |
546 | /proc/<pid> style links. | |
547 | ||
548 | --- | |
549 | ||
550 | **mandatory** | |
551 | ||
552 | iget5_locked()/ilookup5()/ilookup5_nowait() test() callback used to be | |
553 | called with both ->i_lock and inode_hash_lock held; the former is *not* | |
554 | taken anymore, so verify that your callbacks do not rely on it (none | |
555 | of the in-tree instances did). inode_hash_lock is still held, | |
556 | of course, so they are still serialized wrt removal from inode hash, | |
557 | as well as wrt set() callback of iget5_locked(). | |
558 | ||
559 | --- | |
560 | ||
561 | **mandatory** | |
562 | ||
563 | d_materialise_unique() is gone; d_splice_alias() does everything you | |
564 | need now. Remember that they have opposite orders of arguments ;-/ | |
565 | ||
566 | --- | |
567 | ||
568 | **mandatory** | |
569 | ||
570 | f_dentry is gone; use f_path.dentry, or, better yet, see if you can avoid | |
571 | it entirely. | |
572 | ||
573 | --- | |
574 | ||
575 | **mandatory** | |
576 | ||
577 | never call ->read() and ->write() directly; use __vfs_{read,write} or | |
578 | wrappers; instead of checking for ->write or ->read being NULL, look for | |
579 | FMODE_CAN_{WRITE,READ} in file->f_mode. | |
580 | ||
581 | --- | |
582 | ||
583 | **mandatory** | |
584 | ||
585 | do _not_ use new_sync_{read,write} for ->read/->write; leave it NULL | |
586 | instead. | |
587 | ||
588 | --- | |
589 | ||
590 | **mandatory** | |
591 | ->aio_read/->aio_write are gone. Use ->read_iter/->write_iter. | |
592 | ||
593 | --- | |
594 | ||
595 | **recommended** | |
596 | ||
597 | for embedded ("fast") symlinks just set inode->i_link to wherever the | |
598 | symlink body is and use simple_follow_link() as ->follow_link(). | |
599 | ||
600 | --- | |
601 | ||
602 | **mandatory** | |
603 | ||
604 | calling conventions for ->follow_link() have changed. Instead of returning | |
605 | cookie and using nd_set_link() to store the body to traverse, we return | |
606 | the body to traverse and store the cookie using explicit void ** argument. | |
607 | nameidata isn't passed at all - nd_jump_link() doesn't need it and | |
608 | nd_[gs]et_link() is gone. | |
609 | ||
610 | --- | |
611 | ||
612 | **mandatory** | |
613 | ||
614 | calling conventions for ->put_link() have changed. It gets inode instead of | |
615 | dentry, it does not get nameidata at all and it gets called only when cookie | |
616 | is non-NULL. Note that link body isn't available anymore, so if you need it, | |
617 | store it as cookie. | |
618 | ||
619 | --- | |
620 | ||
621 | **mandatory** | |
622 | ||
623 | any symlink that might use page_follow_link_light/page_put_link() must | |
624 | have inode_nohighmem(inode) called before anything might start playing with | |
625 | its pagecache. No highmem pages should end up in the pagecache of such | |
626 | symlinks. That includes any preseeding that might be done during symlink | |
56f5746c | 627 | creation. page_symlink() will honour the mapping gfp flags, so once |
25b532ce MCC |
628 | you've done inode_nohighmem() it's safe to use, but if you allocate and |
629 | insert the page manually, make sure to use the right gfp flags. | |
630 | ||
631 | --- | |
632 | ||
633 | **mandatory** | |
634 | ||
635 | ->follow_link() is replaced with ->get_link(); same API, except that | |
636 | ||
637 | * ->get_link() gets inode as a separate argument | |
638 | * ->get_link() may be called in RCU mode - in that case NULL | |
639 | dentry is passed | |
640 | ||
641 | --- | |
642 | ||
643 | **mandatory** | |
644 | ||
645 | ->get_link() gets struct delayed_call ``*done`` now, and should do | |
646 | set_delayed_call() where it used to set ``*cookie``. | |
647 | ||
648 | ->put_link() is gone - just give the destructor to set_delayed_call() | |
649 | in ->get_link(). | |
650 | ||
651 | --- | |
652 | ||
653 | **mandatory** | |
654 | ||
655 | ->getxattr() and xattr_handler.get() get dentry and inode passed separately. | |
656 | dentry might be yet to be attached to inode, so do _not_ use its ->d_inode | |
657 | in the instances. Rationale: !@#!@# security_d_instantiate() needs to be | |
658 | called before we attach dentry to inode. | |
659 | ||
660 | --- | |
661 | ||
662 | **mandatory** | |
663 | ||
664 | symlinks are no longer the only inodes that do *not* have i_bdev/i_cdev/ | |
665 | i_pipe/i_link union zeroed out at inode eviction. As the result, you can't | |
666 | assume that non-NULL value in ->i_link at ->destroy_inode() implies that | |
667 | it's a symlink. Checking ->i_mode is really needed now. In-tree we had | |
668 | to fix shmem_destroy_callback() that used to take that kind of shortcut; | |
669 | watch out, since that shortcut is no longer valid. | |
670 | ||
671 | --- | |
672 | ||
673 | **mandatory** | |
674 | ||
675 | ->i_mutex is replaced with ->i_rwsem now. inode_lock() et al. work as | |
676 | they used to - they just take it exclusive. However, ->lookup() may be | |
677 | called with parent locked shared. Its instances must not | |
678 | ||
679 | * use d_instantiate() and d_rehash() separately - use d_add() or | |
680 | d_splice_alias() instead. | |
681 | * use d_rehash() alone - call d_add(new_dentry, NULL) instead. | |
682 | * in the unlikely case when (read-only) access to filesystem | |
683 | data structures needs exclusion for some reason, arrange it | |
684 | yourself. None of the in-tree filesystems needed that. | |
685 | * rely on ->d_parent and ->d_name not changing after dentry has | |
686 | been fed to d_add() or d_splice_alias(). Again, none of the | |
687 | in-tree instances relied upon that. | |
688 | ||
689 | We are guaranteed that lookups of the same name in the same directory | |
690 | will not happen in parallel ("same" in the sense of your ->d_compare()). | |
691 | Lookups on different names in the same directory can and do happen in | |
692 | parallel now. | |
693 | ||
694 | --- | |
695 | ||
3e327154 | 696 | **mandatory** |
25b532ce | 697 | |
3e327154 | 698 | ->iterate_shared() is added. |
25b532ce MCC |
699 | Exclusion on struct file level is still provided (as well as that |
700 | between it and lseek on the same struct file), but if your directory | |
701 | has been opened several times, you can get these called in parallel. | |
702 | Exclusion between that method and all directory-modifying ones is | |
703 | still provided, of course. | |
704 | ||
3e327154 LT |
705 | If you have any per-inode or per-dentry in-core data structures modified |
706 | by ->iterate_shared(), you might need something to serialize the access | |
707 | to them. If you do dcache pre-seeding, you'll need to switch to | |
708 | d_alloc_parallel() for that; look for in-tree examples. | |
25b532ce MCC |
709 | |
710 | --- | |
711 | ||
712 | **mandatory** | |
713 | ||
714 | ->atomic_open() calls without O_CREAT may happen in parallel. | |
715 | ||
716 | --- | |
717 | ||
718 | **mandatory** | |
719 | ||
720 | ->setxattr() and xattr_handler.set() get dentry and inode passed separately. | |
e65ce2a5 CB |
721 | The xattr_handler.set() gets passed the user namespace of the mount the inode |
722 | is seen from so filesystems can idmap the i_uid and i_gid accordingly. | |
25b532ce MCC |
723 | dentry might be yet to be attached to inode, so do _not_ use its ->d_inode |
724 | in the instances. Rationale: !@#!@# security_d_instantiate() needs to be | |
725 | called before we attach dentry to inode and !@#!@##!@$!$#!@#$!@$!@$ smack | |
726 | ->d_instantiate() uses not just ->getxattr() but ->setxattr() as well. | |
727 | ||
728 | --- | |
729 | ||
730 | **mandatory** | |
731 | ||
732 | ->d_compare() doesn't get parent as a separate argument anymore. If you | |
733 | used it for finding the struct super_block involved, dentry->d_sb will | |
734 | work just as well; if it's something more complicated, use dentry->d_parent. | |
735 | Just be careful not to assume that fetching it more than once will yield | |
736 | the same value - in RCU mode it could change under you. | |
737 | ||
738 | --- | |
739 | ||
740 | **mandatory** | |
741 | ||
742 | ->rename() has an added flags argument. Any flags not handled by the | |
743 | filesystem should result in -EINVAL being returned. | |
744 | ||
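
One way to meet this requirement is a mask check at the top of the method. The sketch below is illustrative only: the `FOO_RENAME_*` constants and `foo_rename_check_flags()` are hypothetical names, and the choice of supporting only a "no replace" flag is an assumption made for the example.

```c
#include <errno.h>

/* Hypothetical stand-ins for the rename flag bits. */
#define FOO_RENAME_NOREPLACE	(1u << 0)
#define FOO_RENAME_EXCHANGE	(1u << 1)
#define FOO_RENAME_WHITEOUT	(1u << 2)

/* Flags this example filesystem knows how to handle. */
#define FOO_RENAME_SUPPORTED	FOO_RENAME_NOREPLACE

/* Reject any flag bit outside the supported set. */
static int foo_rename_check_flags(unsigned int flags)
{
	if (flags & ~FOO_RENAME_SUPPORTED)
		return -EINVAL;
	return 0;
}
```

Masking against a supported set, rather than testing individual unsupported bits, means flags added later are rejected by default.
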
745 | --- | |
746 | ||
747 | ||
748 | **recommended** | |
749 | ||
750 | ->readlink is optional for symlinks. Don't set, unless filesystem needs | |
751 | to fake something for readlink(2). | |
752 | ||
753 | --- | |
754 | ||
755 | **mandatory** | |
756 | ||
757 | ->getattr() is now passed a struct path rather than a vfsmount and | |
758 | dentry separately, and it now has request_mask and query_flags arguments | |
759 | to specify the fields and sync type requested by statx. Filesystems not | |
760 | supporting any statx-specific features may ignore the new arguments. | |
761 | ||
762 | --- | |
763 | ||
764 | **mandatory** | |
765 | ||
766 | ->atomic_open() calling conventions have changed. Gone is ``int *opened``, | |
767 | along with FILE_OPENED/FILE_CREATED. In place of those we have | |
768 | FMODE_OPENED/FMODE_CREATED, set in file->f_mode. Additionally, return | |
769 | value for 'called finish_no_open(), open it yourself' case has become | |
770 | 0, not 1. Since finish_no_open() itself is returning 0 now, that part | |
771 | does not need any changes in ->atomic_open() instances. | |
772 | ||
773 | --- | |
774 | ||
775 | **mandatory** | |
776 | ||
777 | alloc_file() has become static now; two wrappers are to be used instead. | |
778 | alloc_file_pseudo(inode, vfsmount, name, flags, ops) is for the cases | |
779 | when dentry needs to be created; that's the majority of old alloc_file() | |
780 | users. Calling conventions: on success a reference to new struct file | |
781 | is returned and the caller's reference to inode is subsumed by that. On | |
782 | failure, ERR_PTR() is returned and no caller's references are affected, | |
783 | so the caller needs to drop the inode reference it held. | |
784 | alloc_file_clone(file, flags, ops) does not affect any caller's references. | |
785 | On success you get a new struct file sharing the mount/dentry with the | |
786 | original, on failure - ERR_PTR(). | |
787 | ||
788 | --- | |
789 | ||
790 | **mandatory** | |
791 | ||
792 | ->clone_file_range() and ->dedupe_file_range have been replaced with | |
793 | ->remap_file_range(). See Documentation/filesystems/vfs.rst for more | |
794 | information. | |
795 | ||
796 | --- | |
797 | ||
798 | **recommended** | |
799 | ||
800 | ->lookup() instances doing an equivalent of:: | |
801 | ||
802 | if (IS_ERR(inode)) | |
803 | return ERR_CAST(inode); | |
804 | return d_splice_alias(inode, dentry); | |
805 | ||
806 | don't need to bother with the check - d_splice_alias() will do the | |
807 | right thing when given ERR_PTR(...) as inode. Moreover, passing NULL | |
808 | inode to d_splice_alias() will also do the right thing (equivalent of | |
809 | d_add(dentry, NULL); return NULL;), so that kind of special case | |
810 | doesn't need separate treatment either. | |
811 | ||
812 | --- | |
813 | ||
814 | **strongly recommended** | |
815 | ||
816 | Move the RCU-delayed parts of ->destroy_inode() into a new method - | |
817 | ->free_inode(). If ->destroy_inode() becomes empty - all the better, | |
818 | just get rid of it. Synchronous work (e.g. the stuff that can't | |
819 | be done from an RCU callback, or any WARN_ON() where we want the | |
820 | stack trace) *might* be movable to ->evict_inode(); however, | |
821 | that goes only for the things that are not needed to balance something | |
822 | done by ->alloc_inode(). IOW, if it's cleaning up the stuff that | |
823 | might have accumulated over the life of in-core inode, ->evict_inode() | |
824 | might be a fit. | |
825 | ||
826 | Rules for inode destruction: | |
827 | ||
828 | * if ->destroy_inode() is non-NULL, it gets called | |
829 | * if ->free_inode() is non-NULL, it gets scheduled by call_rcu() | |
830 | * combination of NULL ->destroy_inode and NULL ->free_inode is | |
831 | treated as NULL/free_inode_nonrcu, to preserve the compatibility. | |
832 | ||
833 | Note that the callback (be it via ->free_inode() or explicit call_rcu() | |
834 | in ->destroy_inode()) is *NOT* ordered wrt superblock destruction; | |
835 | as a matter of fact, the superblock and all associated structures | |
836 | might be already gone. The filesystem driver is guaranteed to be still | |
837 | there, but that's it. Freeing memory in the callback is fine; doing | |
838 | more than that is possible, but requires a lot of care and is best | |
839 | avoided. | |
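
The destruction rules above can be sketched as a userspace model. All names are mock stand-ins, not kernel definitions, and the "RCU callback" is called directly instead of being deferred. The early return for a non-NULL ->destroy_inode() with a NULL ->free_inode() is an assumption of this sketch, not something the rules above spell out.

```c
#include <stddef.h>

/* Mock stand-ins; not the kernel's struct inode / super_operations. */
struct mock_inode {
	int destroyed;		/* set by the mock ->destroy_inode() */
	int rcu_freed_by_fs;	/* set by the mock ->free_inode() */
	int rcu_freed_default;	/* stands in for free_inode_nonrcu() */
};

struct mock_super_ops {
	void (*destroy_inode)(struct mock_inode *);
	void (*free_inode)(struct mock_inode *);
};

/* Model of the RCU callback; a real one would run after a grace period. */
static void mock_rcu_callback(struct mock_inode *inode,
			      void (*free_inode)(struct mock_inode *))
{
	if (free_inode)
		free_inode(inode);
	else
		inode->rcu_freed_default = 1;
}

static void mock_destroy_inode(struct mock_inode *inode,
			       const struct mock_super_ops *ops)
{
	if (ops->destroy_inode) {
		ops->destroy_inode(inode);
		if (!ops->free_inode)
			return;	/* assumed: ->destroy_inode() freed it itself */
	}
	mock_rcu_callback(inode, ops->free_inode);
}

/* Hypothetical methods of a filesystem "foo". */
static void foo_destroy(struct mock_inode *inode) { inode->destroyed = 1; }
static void foo_free(struct mock_inode *inode) { inode->rcu_freed_by_fs = 1; }
```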
840 | ||
841 | --- | |
842 | ||
843 | **mandatory** | |
844 | ||
845 | DCACHE_RCUACCESS is gone; having an RCU delay on dentry freeing is the | |
846 | default. DCACHE_NORCU opts out, and only d_alloc_pseudo() has any | |
847 | business doing so. | |
848 | ||
849 | --- | |
850 | ||
851 | **mandatory** | |
852 | ||
853 | d_alloc_pseudo() is internal-only; uses outside of alloc_file_pseudo() are | |
854 | very suspect (and won't work in modules). Such uses are very likely to | |
855 | be misspelled d_alloc_anon(). | |
d9a9f484 AV |
856 | |
857 | --- | |
858 | ||
859 | **mandatory** | |
860 | ||
861 | [should've been added in 2016] stale comment in finish_open() notwithstanding, | |
862 | failure exits in ->atomic_open() instances should *NOT* fput() the file, | |
863 | no matter what. Everything is handled by the caller. | |
df820f8d MS |
864 | |
865 | --- | |
866 | ||
867 | **mandatory** | |
868 | ||
869 | clone_private_mount() returns a longterm mount now, so the proper destructor of | |
870 | its result is kern_unmount() or kern_unmount_array(). | |
9b2e0016 PB |
871 | |
872 | --- | |
873 | ||
874 | **mandatory** | |
875 | ||
876 | Zero-length bvec segments are disallowed; they must be filtered out before | |
877 | being passed on to an iterator. | |
c42bca92 PB |
878 | |
879 | --- | |
880 | ||
881 | **mandatory** | |
882 | ||
883 | For bvec-based iterators bio_iov_iter_get_pages() no longer copies bvecs but | |
884 | uses the one provided. Anyone issuing kiocb-I/O should ensure that the bvec and | |
885 | page references stay until I/O has completed, i.e. until ->ki_complete() has | |
886 | been called or returned with non -EIOCBQUEUED code. | |
5ceabb60 LT |
887 | |
888 | --- | |
889 | ||
890 | **mandatory** | |
891 | ||
14e43bf4 EB |
892 | mnt_want_write_file() can now only be paired with mnt_drop_write_file(), |
893 | whereas previously it could be paired with mnt_drop_write() as well. | |
f0b65f39 AV |
894 | |
895 | --- | |
896 | ||
897 | **mandatory** | |
898 | ||
899 | iov_iter_copy_from_user_atomic() is gone; use copy_page_from_iter_atomic(). | |
900 | The difference is copy_page_from_iter_atomic() advances the iterator and | |
901 | you don't need iov_iter_advance() after it. However, if you decide to use | |
902 | only a part of obtained data, you should do iov_iter_revert(). | |
58ec9059 LT |
903 | |
904 | --- | |
905 | ||
906 | **mandatory** | |
907 | ||
ffb37ca3 AV |
908 | Calling conventions for file_open_root() changed; now it takes struct path * |
909 | instead of passing mount and dentry separately. For callers that used to | |
910 | pass <mnt, mnt->mnt_root> pair (i.e. the root of given mount), a new helper | |
911 | is provided - file_open_root_mnt(). In-tree users adjusted. | |
868941b1 JD |
912 | |
913 | --- | |
914 | ||
915 | **mandatory** | |
916 | ||
917 | no_llseek is gone; don't set .llseek to that - just leave it NULL instead. | |
918 | Checks for "does that file have llseek(2), or should it fail with ESPIPE" | |
919 | should be done by looking at FMODE_LSEEK in file->f_mode. | |
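
The FMODE_LSEEK check can be sketched as a userspace model. ``MOCK_FMODE_LSEEK``, ``struct mock_file`` and ``mock_can_llseek()`` are hypothetical stand-ins for illustration; only the shape of the test against ->f_mode is taken from the text above.

```c
#include <errno.h>

/* Mock stand-ins for this sketch; the real definitions are in the kernel. */
#define MOCK_FMODE_LSEEK 0x4u

struct mock_file {
	unsigned int f_mode;
};

/* Returns 0 if llseek(2) may proceed, -ESPIPE if the file is not seekable. */
static int mock_can_llseek(const struct mock_file *file)
{
	if (!(file->f_mode & MOCK_FMODE_LSEEK))
		return -ESPIPE;
	return 0;
}
```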
25885a35 AV |
920 | |
921 | --- | |
922 | ||
923 | **mandatory** | |
924 | ||
925 | filldir_t (readdir callbacks) calling conventions have changed. Instead of | |
926 | returning 0 or -E... it returns bool now. false means "no more" (as -E... used | |
927 | to) and true - "keep going" (as 0 in old calling conventions). Rationale: | |
3e327154 LT |
928 | callers never looked at specific -E... values anyway. ->iterate_shared() |
929 | instances require no changes at all, all filldir_t ones in the tree | |
930 | converted. | |
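
The new return convention can be sketched as a userspace model. ``struct mock_ctx``, ``mock_filldir()`` and ``mock_emit_all()`` are hypothetical names with a simplified signature; a real filldir_t callback takes more arguments.

```c
#include <stdbool.h>
#include <stddef.h>

#define MOCK_MAX_ENTRIES 2

/* Hypothetical per-call context; real callbacks embed struct dir_context. */
struct mock_ctx {
	const char *names[MOCK_MAX_ENTRIES];
	size_t count;
};

/* New-style callback: true means "keep going", false means "no more". */
static bool mock_filldir(struct mock_ctx *ctx, const char *name)
{
	if (ctx->count == MOCK_MAX_ENTRIES)
		return false;	/* old convention returned -E... here */
	ctx->names[ctx->count++] = name;
	return true;		/* old convention returned 0 here */
}

/* Model of a caller: emit entries until the callback says stop.
 * Returns the number of entries the callback accepted. */
static size_t mock_emit_all(struct mock_ctx *ctx,
			    const char *const *entries, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (!mock_filldir(ctx, entries[i]))
			break;
	return i;
}
```

The caller only tests the result for truth, which matches the rationale that specific -E... values were never examined.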
f721d24e LT |
931 | |
932 | --- | |
933 | ||
863f144f MS |
934 | **mandatory** |
935 | ||
936 | Calling conventions for ->tmpfile() have changed. It now takes a struct | |
937 | file pointer instead of struct dentry pointer. d_tmpfile() is similarly | |
938 | changed to simplify callers. The passed file is in a non-open state and on | |
939 | success must be opened before returning (e.g. by calling | |
940 | finish_open_simple()). | |
40d49a3c MWO |
941 | |
942 | --- | |
943 | ||
944 | **mandatory** | |
945 | ||
946 | Calling convention for ->huge_fault has changed. It now takes a page | |
947 | order instead of an enum page_entry_size, and it may be called without the | |
948 | mmap_lock held. All in-tree users have been audited and do not seem to | |
949 | depend on the mmap_lock being held, but out of tree users should verify | |
950 | for themselves. If they do need it, they can return VM_FAULT_RETRY to | |
951 | be called with the mmap_lock held. | |
2ba0dd65 CB |
952 | |
953 | --- | |
954 | ||
955 | **mandatory** | |
956 | ||
957 | The order of opening block devices and matching or creating superblocks has | |
958 | changed. | |
959 | ||
960 | The old logic opened block devices first and then tried to find a | |
961 | suitable superblock to reuse based on the block device pointer. | |
962 | ||
963 | The new logic tries to find a suitable superblock first based on the device | |
964 | number, and opens the block device afterwards. | |
965 | ||
966 | Since opening block devices cannot happen under s_umount because of lock | |
967 | ordering requirements, s_umount is now dropped while opening block devices and | |
968 | reacquired before calling fill_super(). | |
969 | ||
970 | In the old logic concurrent mounters would find the superblock on the list of | |
971 | superblocks for the filesystem type. Since the first opener of the block device | |
972 | would hold s_umount they would wait until the superblock became either born or | |
973 | was discarded due to initialization failure. | |
974 | ||
975 | Since the new logic drops s_umount, concurrent mounters could grab s_umount and | |
976 | would spin. Instead they are now made to wait using an explicit wait-wake | |
977 | mechanism without having to hold s_umount. | |
060e6c7d CB |
978 | |
979 | --- | |
980 | ||
981 | **mandatory** | |
982 | ||
983 | The holder of a block device is now the superblock. | |
984 | ||
985 | The holder of a block device used to be the file_system_type which wasn't | |
986 | particularly useful. It wasn't possible to go from block device to owning | |
987 | superblock without matching on the device pointer stored in the superblock. | |
988 | This mechanism would only work for a single device so the block layer couldn't | |
989 | find the owning superblock of any additional devices. | |
990 | ||
991 | In the old mechanism reusing or creating a superblock for a racing mount(2) and | |
992 | umount(2) relied on the file_system_type as the holder. This was severely | |
993 | underdocumented however: | |
994 | ||
995 | (1) Any concurrent mounter that managed to grab an active reference on an | |
996 | existing superblock was made to wait until the superblock either became | |
997 | ready or until the superblock was removed from the list of superblocks of | |
998 | the filesystem type. If the superblock is ready the caller would simply | |
999 | reuse it. | |
1000 | ||
1001 | (2) If the mounter came after deactivate_locked_super() but before | |
1002 | the superblock had been removed from the list of superblocks of the | |
1003 | filesystem type the mounter would wait until the superblock was shut down, | |
1004 | reuse the block device and allocate a new superblock. | |
1005 | ||
1006 | (3) If the mounter came after deactivate_locked_super() and after | |
1007 | the superblock had been removed from the list of superblocks of the | |
1008 | filesystem type the mounter would reuse the block device and allocate a new | |
1009 | superblock (the bd_holder pointer may still be set to the filesystem type). | |
1010 | ||
1011 | Because the holder of the block device was the file_system_type any concurrent | |
1012 | mounter could open the block devices of any superblock of the same | |
1013 | file_system_type without risking seeing EBUSY because the block device was | |
1014 | still in use by another superblock. | |
1015 | ||
1016 | Making the superblock the owner of the block device changes this as the holder | |
1017 | is now a unique superblock and thus block devices associated with it cannot be | |
1018 | reused by concurrent mounters. So a concurrent mounter in (2) could suddenly | |
1019 | see EBUSY when trying to open a block device whose holder was a different | |
1020 | superblock. | |
1021 | ||
1022 | The new logic thus waits until the superblock and the devices are shut down in | |
1023 | ->kill_sb(). Removal of the superblock from the list of superblocks of the | |
1024 | filesystem type is now moved to a later point when the devices are closed: | |
1025 | ||
1026 | (1) Any concurrent mounter managing to grab an active reference on an existing | |
1027 | superblock is made to wait until the superblock is either ready or until | |
1028 | the superblock and all devices are shut down in ->kill_sb(). If the | |
1029 | superblock is ready the caller will simply reuse it. | |
1030 | ||
1031 | (2) If the mounter comes after deactivate_locked_super() but before | |
1032 | the superblock has been removed from the list of superblocks of the | |
1033 | filesystem type the mounter is made to wait until the superblock and the | |
1034 | devices are shut down in ->kill_sb() and the superblock is removed from the | |
1035 | list of superblocks of the filesystem type. The mounter will allocate a new | |
1036 | superblock and grab ownership of the block device (the bd_holder pointer of | |
1037 | the block device will be set to the newly allocated superblock). | |
1038 | ||
1039 | (3) This case is now collapsed into (2) as the superblock is left on the list | |
1040 | of superblocks of the filesystem type until all devices are shut down in | |
1041 | ->kill_sb(). In other words, if the superblock isn't on the list of | |
1042 | superblocks of the filesystem type anymore then it has given up ownership of | |
1043 | all associated block devices (the bd_holder pointer is NULL). | |
1044 | ||
1045 | As this is a VFS level change it has no practical consequences for filesystems | |
1046 | other than that all of them must use one of the provided kill_litter_super(), | |
1047 | kill_anon_super(), or kill_block_super() helpers. |