linux-block.git
2 years agoxfs: don't leak xfs_buf_cancel structures when recovery fails
Darrick J. Wong [Fri, 27 May 2022 00:26:38 +0000 (10:26 +1000)]
xfs: don't leak xfs_buf_cancel structures when recovery fails

If log recovery fails, we free the memory used by the buffer
cancellation buckets, but we don't actually traverse each bucket list to
free the individual xfs_buf_cancel objects.  This leads to a memory
leak, as reported by kmemleak in xfs/051:

unreferenced object 0xffff888103629560 (size 32):
  comm "mount", pid 687045, jiffies 4296935916 (age 10.752s)
  hex dump (first 32 bytes):
    08 d3 0a 01 00 00 00 00 08 00 00 00 01 00 00 00  ................
    d0 f5 0b 92 81 88 ff ff 80 64 64 25 81 88 ff ff  .........dd%....
  backtrace:
    [<ffffffffa0317c83>] kmem_alloc+0x73/0x140 [xfs]
    [<ffffffffa03234a9>] xlog_recover_buf_commit_pass1+0x139/0x200 [xfs]
    [<ffffffffa032dc27>] xlog_recover_commit_trans+0x307/0x350 [xfs]
    [<ffffffffa032df15>] xlog_recovery_process_trans+0xa5/0xe0 [xfs]
    [<ffffffffa032e12d>] xlog_recover_process_data+0x8d/0x140 [xfs]
    [<ffffffffa032e49d>] xlog_do_recovery_pass+0x19d/0x740 [xfs]
    [<ffffffffa032f22d>] xlog_do_log_recovery+0x6d/0x150 [xfs]
    [<ffffffffa032f343>] xlog_do_recover+0x33/0x1d0 [xfs]
    [<ffffffffa032faba>] xlog_recover+0xda/0x190 [xfs]
    [<ffffffffa03194bc>] xfs_log_mount+0x14c/0x360 [xfs]
    [<ffffffffa030bfed>] xfs_mountfs+0x50d/0xa60 [xfs]
    [<ffffffffa03124b5>] xfs_fs_fill_super+0x6a5/0x950 [xfs]
    [<ffffffff812b92a5>] get_tree_bdev+0x175/0x280
    [<ffffffff812b7c3a>] vfs_get_tree+0x1a/0x80
    [<ffffffff812e366f>] path_mount+0x6ff/0xaa0
    [<ffffffff812e3b13>] __x64_sys_mount+0x103/0x140

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2 years agoxfs: refactor buffer cancellation table allocation
Darrick J. Wong [Fri, 27 May 2022 00:26:17 +0000 (10:26 +1000)]
xfs: refactor buffer cancellation table allocation

Move the code that allocates and frees the buffer cancellation tables
used by log recovery into the file that actually uses the tables.  This
is a precursor to some cleanups and a memory leak fix.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2 years agoMerge branch 'guilt/xfs-5.19-misc-3' into xfs-5.19-for-next
Dave Chinner [Sun, 22 May 2022 22:55:09 +0000 (08:55 +1000)]
Merge branch 'guilt/xfs-5.19-misc-3' into xfs-5.19-for-next

2 years agoxfs: share xattr name and value buffers when logging xattr updates
Darrick J. Wong [Sun, 22 May 2022 22:43:46 +0000 (08:43 +1000)]
xfs: share xattr name and value buffers when logging xattr updates

While running xfs/297 and generic/642, I noticed a crash in
xfs_attri_item_relog when it tries to copy the attr name to the new
xattri log item.  I think what happened here was that we called
->iop_commit on the old attri item (which nulls out the pointers) as
part of a log force at the same time that a chained attr operation was
ongoing.  The system was busy enough that at some later point, the defer
ops operation decided it was necessary to relog the attri log item, but
as we've detached the name buffer from the old attri log item, we can't
copy it to the new one, and kaboom.

I think there's a broader refcounting problem with LARP mode -- the
setxattr code can return to userspace before the CIL actually formats
and commits the log item, which results in a UAF bug.  Therefore, the
xattr log item needs to be able to retain a reference to the name and
value buffers until the log items have completely cleared the log.
Furthermore, each time we create an intent log item, we allocate new
memory and (re)copy the contents; sharing here would be very useful.

Solve the UAF and the unnecessary memory allocations by having the log
code create a single refcounted buffer to contain the name and value
contents.  This buffer can be passed from old to new during a relog
operation, and the logging code can (optionally) attach it to the
xfs_attr_item for reuse when LARP mode is enabled.

This also fixes a problem where the xfs_attri_log_item objects weren't
being freed back to the same cache where they came from.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2 years agoxfs: do not use logged xattr updates on V4 filesystems
Darrick J. Wong [Sun, 22 May 2022 22:41:03 +0000 (08:41 +1000)]
xfs: do not use logged xattr updates on V4 filesystems

V4 superblocks do not contain the log_incompat feature bit, which means
that we cannot protect xattr log items against kernels that are too old
to know how to recover them.  Turn off the log items for such
filesystems and adjust the "delayed" name to reflect what it's really
controlling.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2 years agoxfs: Remove duplicate include
Jiapeng Chong [Sun, 22 May 2022 06:47:17 +0000 (16:47 +1000)]
xfs: Remove duplicate include

Clean up the following includecheck warning:

./fs/xfs/xfs_attr_item.c: xfs_inode.h is included more than once.

Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2 years agoxfs: reduce IOCB_NOWAIT judgment for retry exclusive unaligned DIO
Kaixu Xia [Sun, 22 May 2022 06:47:11 +0000 (16:47 +1000)]
xfs: reduce IOCB_NOWAIT judgment for retry exclusive unaligned DIO

Retry unaligned DIO with exclusive blocking semantics only when the
IOCB_NOWAIT flag is not set. If we are doing nonblocking user I/O,
propagate the error directly.

Signed-off-by: Kaixu Xia <kaixuxia@tencent.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2 years agoxfs: Remove dead code
Jiapeng Chong [Sun, 22 May 2022 06:46:57 +0000 (16:46 +1000)]
xfs: Remove dead code

Remove tht entire xlog_recover_check_summary() function, this entire
function is dead code and has been for 12 years.

Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2 years agoxfs: fix typo in comment
Julia Lawall [Sun, 22 May 2022 06:46:38 +0000 (16:46 +1000)]
xfs: fix typo in comment

Spelling mistake (triple letters) in comment.
Detected with the help of Coccinelle.

Signed-off-by: Julia Lawall <Julia.Lawall@inria.fr>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2 years agoxfs: rename struct xfs_attr_item to xfs_attr_intent
Darrick J. Wong [Sun, 22 May 2022 06:00:26 +0000 (16:00 +1000)]
xfs: rename struct xfs_attr_item to xfs_attr_intent

Everywhere else in XFS, structures that capture the state of an ongoing
deferred work item all have names that end with "_intent".  The new
extended attribute deferred work items are not named as such, so fix it
to follow the naming convention used elsewhere.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2 years agoxfs: clean up state variable usage in xfs_attr_node_remove_attr
Darrick J. Wong [Sun, 22 May 2022 05:59:48 +0000 (15:59 +1000)]
xfs: clean up state variable usage in xfs_attr_node_remove_attr

The state variable is now a local variable pointing to a heap
allocation, so we don't need to zero-initialize it, nor do we need the
conditional to decide if we should free it.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2 years agoxfs: put attr[id] log item cache init with the others
Darrick J. Wong [Sun, 22 May 2022 05:59:48 +0000 (15:59 +1000)]
xfs: put attr[id] log item cache init with the others

Initialize and destroy the xattr log item caches in the same places that
we do all the other log item caches.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2 years agoxfs: remove struct xfs_attr_item.xattri_flags
Darrick J. Wong [Sun, 22 May 2022 05:59:48 +0000 (15:59 +1000)]
xfs: remove struct xfs_attr_item.xattri_flags

Nobody uses this field, so get rid of it and the unused flag definition.
Rearrange the structure layout to reduce its size from 104 to 96 bytes.
This gets us from 39 to 42 objects per page.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2 years agoxfs: use a separate slab cache for deferred xattr work state
Darrick J. Wong [Sun, 22 May 2022 05:59:48 +0000 (15:59 +1000)]
xfs: use a separate slab cache for deferred xattr work state

Create a separate slab cache for struct xfs_attr_item objects, since we
can pack the (104-byte) intent items more tightly than we can with the
general slab cache objects.  On x86, this means 39 intents per memory
page instead of 32.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2 years agoxfs: put the xattr intent item op flags in their own namespace
Darrick J. Wong [Sun, 22 May 2022 05:59:48 +0000 (15:59 +1000)]
xfs: put the xattr intent item op flags in their own namespace

The flags that are stored in the extended attr intent log item really
should have a separate namespace from the rest of the XFS_ATTR_* flags.
Give them one to make it a little more obvious that they're intent item
flags.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2 years agoxfs: clean up xfs_attr_node_hasname
Darrick J. Wong [Sun, 22 May 2022 05:59:34 +0000 (15:59 +1000)]
xfs: clean up xfs_attr_node_hasname

The calling conventions of this function are a mess -- callers /can/
provide a pointer to a pointer to a state structure, but it's not
required, and as evidenced by the last two patches, the callers that do
weren't be careful enough about how to deal with an existing da state.

Push the allocation and freeing responsibilty to the callers, which
means that callers from the xattr node state machine steps now have the
visibility to allocate or free the da state structure as they please.
As a bonus, the node remove/add paths for larp-mode replaces can reset
the da state structure instead of freeing and immediately reallocating
it.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2 years agoxfs: free xfs_attrd_log_items correctly
Darrick J. Wong [Fri, 20 May 2022 04:42:49 +0000 (14:42 +1000)]
xfs: free xfs_attrd_log_items correctly

Technically speaking, objects allocated out of a specific slab cache are
supposed to be freed to that slab cache.  The popular slab backends will
take care of this for us, but SLOB famously doesn't.  Fix this, even if
slob + xfs are not that common of a combination.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2 years agoxfs: validate xattr name earlier in recovery
Darrick J. Wong [Fri, 20 May 2022 04:42:36 +0000 (14:42 +1000)]
xfs: validate xattr name earlier in recovery

When we're validating a recovered xattr log item during log recovery, we
should check the name before starting to allocate resources.  This isn't
strictly necessary on its own, but it means that we won't bother with
huge memory allocations during recovery if the attr name is garbage,
which will simplify the changes in the next patch.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2 years agoxfs: reject unknown xattri log item filter flags during recovery
Darrick J. Wong [Fri, 20 May 2022 04:42:15 +0000 (14:42 +1000)]
xfs: reject unknown xattri log item filter flags during recovery

Make sure we screen the "attr flags" field of recovered xattr intent log
items to reject flag bits that we don't know about.  This is really the
attr *filter* field from xfs_da_args, so rename the field and create
a mask to make checking for invalid bits easier.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2 years agoxfs: reject unknown xattri log item operation flags during recovery
Darrick J. Wong [Fri, 20 May 2022 04:41:47 +0000 (14:41 +1000)]
xfs: reject unknown xattri log item operation flags during recovery

Make sure we screen the op flags field of recovered xattr intent log
items to reject flag bits that we don't know about.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2 years agoxfs: don't leak the retained da state when doing a leaf to node conversion
Darrick J. Wong [Fri, 20 May 2022 04:41:42 +0000 (14:41 +1000)]
xfs: don't leak the retained da state when doing a leaf to node conversion

If a setxattr operation finds an xattr structure in leaf format, adding
the attr can fail due to lack of space and hence requires an upgrade to
node format.  After this happens, we'll roll the transaction and
re-enter the state machine, at which time we need to perform a second
lookup of the attribute name to find its new location.  This lookup
attaches a new da state structure to the xfs_attr_item but doesn't free
the old one (from the leaf lookup) and leaks it.  Fix that.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2 years agoxfs: don't leak da state when freeing the attr intent item
Darrick J. Wong [Fri, 20 May 2022 04:41:34 +0000 (14:41 +1000)]
xfs: don't leak da state when freeing the attr intent item

kmemleak reported that we lost an xfs_da_state while removing xattrs in
generic/020:

unreferenced object 0xffff88801c0e4b40 (size 480):
  comm "attr", pid 30515, jiffies 4294931061 (age 5.960s)
  hex dump (first 32 bytes):
    78 bc 65 07 00 c9 ff ff 00 30 60 1c 80 88 ff ff  x.e......0`.....
    02 00 00 00 00 00 00 00 80 18 83 4e 80 88 ff ff  ...........N....
  backtrace:
    [<ffffffffa023ef4a>] xfs_da_state_alloc+0x1a/0x30 [xfs]
    [<ffffffffa021b6f3>] xfs_attr_node_hasname+0x23/0x90 [xfs]
    [<ffffffffa021c6f1>] xfs_attr_set_iter+0x441/0xa30 [xfs]
    [<ffffffffa02b5104>] xfs_xattri_finish_update+0x44/0x80 [xfs]
    [<ffffffffa02b515e>] xfs_attr_finish_item+0x1e/0x40 [xfs]
    [<ffffffffa0244744>] xfs_defer_finish_noroll+0x184/0x740 [xfs]
    [<ffffffffa02a6473>] __xfs_trans_commit+0x153/0x3e0 [xfs]
    [<ffffffffa021d149>] xfs_attr_set+0x469/0x7e0 [xfs]
    [<ffffffffa02a78d9>] xfs_xattr_set+0x89/0xd0 [xfs]
    [<ffffffff812e6512>] __vfs_removexattr+0x52/0x70
    [<ffffffff812e6a08>] __vfs_removexattr_locked+0xb8/0x150
    [<ffffffff812e6af6>] vfs_removexattr+0x56/0x100
    [<ffffffff812e6bf8>] removexattr+0x58/0x90
    [<ffffffff812e6cce>] path_removexattr+0x9e/0xc0
    [<ffffffff812e6d44>] __x64_sys_lremovexattr+0x14/0x20
    [<ffffffff81786b35>] do_syscall_64+0x35/0x80

I think this is a consequence of xfs_attr_node_removename_setup
attaching a new da(btree) state to xfs_attr_item and never freeing it.
I /think/ it's the case that the remove paths could detach the da state
earlier in the remove state machine since nothing else accesses the
state.  However, let's future-proof the new xattr code by adding a
catch-all when we free the xfs_attr_item to make sure we never leak the
da state.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2 years agoMerge branch 'xfs-5.19-quota-warn-remove' into xfs-5.19-for-next
Dave Chinner [Thu, 12 May 2022 05:23:07 +0000 (15:23 +1000)]
Merge branch 'xfs-5.19-quota-warn-remove' into xfs-5.19-for-next

2 years agoxfs: can't use kmem_zalloc() for attribute buffers
Dave Chinner [Thu, 12 May 2022 05:12:57 +0000 (15:12 +1000)]
xfs: can't use kmem_zalloc() for attribute buffers

Because heap allocation of 64kB buffers will fail:

....
 XFS: fs_mark(8414) possible memory allocation deadlock size 65768 in kmem_alloc (mode:0x2d40)
 XFS: fs_mark(8417) possible memory allocation deadlock size 65768 in kmem_alloc (mode:0x2d40)
 XFS: fs_mark(8409) possible memory allocation deadlock size 65768 in kmem_alloc (mode:0x2d40)
 XFS: fs_mark(8428) possible memory allocation deadlock size 65768 in kmem_alloc (mode:0x2d40)
 XFS: fs_mark(8430) possible memory allocation deadlock size 65768 in kmem_alloc (mode:0x2d40)
 XFS: fs_mark(8437) possible memory allocation deadlock size 65768 in kmem_alloc (mode:0x2d40)
 XFS: fs_mark(8433) possible memory allocation deadlock size 65768 in kmem_alloc (mode:0x2d40)
 XFS: fs_mark(8406) possible memory allocation deadlock size 65768 in kmem_alloc (mode:0x2d40)
 XFS: fs_mark(8412) possible memory allocation deadlock size 65768 in kmem_alloc (mode:0x2d40)
 XFS: fs_mark(8432) possible memory allocation deadlock size 65768 in kmem_alloc (mode:0x2d40)
 XFS: fs_mark(8424) possible memory allocation deadlock size 65768 in kmem_alloc (mode:0x2d40)
....

I'd use kvmalloc() instead, but....

- 48.19% xfs_attr_create_intent
  - 46.89% xfs_attri_init
     - kvmalloc_node
- 46.04% __kmalloc_node
   - kmalloc_large_node
      - 45.99% __alloc_pages
 - 39.39% __alloc_pages_slowpath.constprop.0
    - 38.89% __alloc_pages_direct_compact
       - 38.71% try_to_compact_pages
  - compact_zone_order
  - compact_zone
     - 21.09% isolate_migratepages_block
  10.31% PageHuge
  5.82% set_pfnblock_flags_mask
  0.86% get_pfnblock_flags_mask
     - 4.48% __reset_isolation_suitable
  4.44% __reset_isolation_pfn
     - 3.56% __pageblock_pfn_to_page
  1.33% pfn_to_online_page
       2.83% get_pfnblock_flags_mask
     - 0.87% migrate_pages
  0.86% compaction_alloc
       0.84% find_suitable_fallback
 - 6.60% get_page_from_freelist
      4.99% clear_page_erms
    - 1.19% _raw_spin_lock_irqsave
       - do_raw_spin_lock
    __pv_queued_spin_lock_slowpath
- 0.86% __vmalloc_node_range
     0.65% __alloc_pages_bulk

.... this is just yet another reminder of how much kvmalloc() sucks.
So lift xlog_cil_kvmalloc(), rename it to xlog_kvmalloc() and use
that instead....

We also clean up the attribute name and value lengths as they no
longer need to be rounded out to sizes compatible with log vectors.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2 years agoxfs: detect empty attr leaf blocks in xfs_attr3_leaf_verify
Dave Chinner [Thu, 12 May 2022 05:12:57 +0000 (15:12 +1000)]
xfs: detect empty attr leaf blocks in xfs_attr3_leaf_verify

xfs_repair flags these as a corruption error, so the verifier should
catch software bugs that result in empty leaf blocks being written
to disk, too.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2 years agoxfs: ATTR_REPLACE algorithm with LARP enabled needs rework
Dave Chinner [Thu, 12 May 2022 05:12:56 +0000 (15:12 +1000)]
xfs: ATTR_REPLACE algorithm with LARP enabled needs rework

We can't use the same algorithm for replacing an existing attribute
when logging attributes. The existing algorithm is essentially:

1. create new attr w/ INCOMPLETE
2. atomically flip INCOMPLETE flags between old + new attribute
3. remove old attr which is marked w/ INCOMPLETE

This algorithm guarantees that we see either the old or new
attribute, and if we fail after the atomic flag flip, we don't have
to recover the removal of the old attr because we never see
INCOMPLETE attributes in lookups.

For logged attributes, however, this does not work. The logged
attribute intents do not track the work that has been done as the
transaction rolls, and hence the only recovery mechanism we have is
"run the replace operation from scratch".

This is further exacerbated by the attempt to avoid needing the
INCOMPLETE flag to create an atomic swap. This means we can create
a second active attribute of the same name before we remove the
original. If we fail at any point after the create but before the
removal has completed, we end up with duplicate attributes in
the attr btree and recovery only tries to replace one of them.

There are several other failure modes where we can leave partially
allocated remote attributes that expose stale data, partially free
remote attributes that enable UAF based stale data exposure, etc.

TO fix this, we need a different algorithm for replace operations
when LARP is enabled. Luckily, it's not that complex if we take the
right first step. That is, the first thing we log is the attri
intent with the new name/value pair and mark the old attr as
INCOMPLETE in the same transaction.

From there, we then remove the old attr and keep relogging the
new name/value in the intent, such that we always know that we have
to create the new attr in recovery. Once the old attr is removed,
we then run a normal ATTR_CREATE operation relogging the intent as
we go. If the new attr is local, then it gets created in a single
atomic transaction that also logs the final intent done. If the new
attr is remote, the we set INCOMPLETE on the new attr while we
allocate and set the remote value, and then we clear the INCOMPLETE
flag at in the last transaction taht logs the final intent done.

If we fail at any point in this algorithm, log recovery will always
see the same state on disk: the new name/value in the intent, and
either an INCOMPLETE attr or no attr in the attr btree. If we find
an INCOMPLETE attr, we run the full replace starting with removing
the INCOMPLETE attr. If we don't find it, then we simply create the
new attr.

Notably, recovery of a failed create that has an INCOMPLETE flag set
is now the same - we start with the lookup of the INCOMPLETE attr,
and if that exists then we do the full replace recovery process,
otherwise we just create the new attr.

Hence changing the way we do the replace operation when LARP is
enabled allows us to use the same log recovery algorithm for both
the ATTR_CREATE and ATTR_REPLACE operations. This is also the same
algorithm we use for runtime ATTR_REPLACE operations (except for the
step setting up the initial conditions).

The result is that:

- ATTR_CREATE uses the same algorithm regardless of whether LARP is
  enabled or not
- ATTR_REPLACE with larp=0 is identical to the old algorithm
- ATTR_REPLACE with larp=1 runs an unmodified attr removal algorithm
  from the larp=0 code and then runs the unmodified ATTR_CREATE
  code.
- log recovery when larp=1 runs the same ATTR_REPLACE algorithm as
  it uses at runtime.

Because the state machine is now quite clean, changing the algorithm
is really just a case of changing the initial state and how the
states link together for the ATTR_REPLACE case. Hence it's not a
huge amount of code for what is a fairly substantial rework
of the attr logging and recovery algorithm....

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2 years agoxfs: use XFS_DA_OP flags in deferred attr ops
Dave Chinner [Thu, 12 May 2022 05:12:56 +0000 (15:12 +1000)]
xfs: use XFS_DA_OP flags in deferred attr ops

We currently store the high level attr operation in
args->attr_flags. This field contains what the VFS is telling us to
do, but don't necessarily match what we are doing in the low level
modification state machine. e.g. XATTR_REPLACE implies both
XFS_DA_OP_ADDNAME and XFS_DA_OP_RENAME because it is doing both a
remove and adding a new attr.

However, deep in the individual state machine operations, we check
errors against this high level VFS op flags, not the low level
XFS_DA_OP flags. Indeed, we don't even have a low level flag for
a REMOVE operation, so the only way we know we are doing a remove
is the complete absence of XATTR_REPLACE, XATTR_CREATE,
XFS_DA_OP_ADDNAME and XFS_DA_OP_RENAME. And because there are other
flags in these fields, this is a pain to check if we need to.

As the XFS_DA_OP flags are only needed once the deferred operations
are set up, set these flags appropriately when we set the initial
operation state. We also introduce a XFS_DA_OP_REMOVE flag to make
it easy to know that we are doing a remove operation.

With these, we can remove the use of XATTR_REPLACE and XATTR_CREATE
in low level lookup operations, and manipulate the low level flags
according to the low level context that is operating. e.g. log
recovery does not have a VFS xattr operation state to copy into
args->attr_flags, and the low level state machine ops we do for
recovery do not match the high level VFS operations that were in
progress when the system failed...

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2 years agoxfs: remove xfs_attri_remove_iter
Dave Chinner [Thu, 12 May 2022 05:12:56 +0000 (15:12 +1000)]
xfs: remove xfs_attri_remove_iter

xfs_attri_remove_iter is not used anymore, so remove it and all the
infrastructure it uses and is needed to drive it. THe
xfs_attr_refillstate() function now throws an unused warning, so
isolate the xfs_attr_fillstate()/xfs_attr_refillstate() code pair
with an #if 0 and a comment explaining why we want to keep this code
and restore the optimisation it provides in the near future.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Allison Henderson<allison.henderson@oracle.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2 years agoxfs: switch attr remove to xfs_attri_set_iter
Dave Chinner [Thu, 12 May 2022 05:12:56 +0000 (15:12 +1000)]
xfs: switch attr remove to xfs_attri_set_iter

Now that xfs_attri_set_iter() has initial states for removing
attributes, switch the pure attribute removal code over to using it.
This requires attrs being removed to always be marked as INCOMPLETE
before we start the removal due to the fact we look up the attr to
remove again in xfs_attr_node_remove_attr().

Note: this drops the fillstate/refillstate optimisations from
the remove path that avoid having to look up the path again after
setting the incomplete flag and removing remote attrs. Restoring
that optimisation to this path is future Dave's problem.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2 years agoxfs: introduce attr remove initial states into xfs_attr_set_iter
Dave Chinner [Thu, 12 May 2022 05:12:56 +0000 (15:12 +1000)]
xfs: introduce attr remove initial states into xfs_attr_set_iter

We need to merge the add and remove code paths to enable safe
recovery of replace operations. Hoist the initial remove states from
xfs_attr_remove_iter into xfs_attr_set_iter. We will make use of
them in the next patches.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Allison Henderson<allison.henderson@oracle.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2 years agoxfs: xfs_attr_set_iter() does not need to return EAGAIN
Dave Chinner [Thu, 12 May 2022 05:12:55 +0000 (15:12 +1000)]
xfs: xfs_attr_set_iter() does not need to return EAGAIN

Now that the full xfs_attr_set_iter() state machine always
terminates with either the state being XFS_DAS_DONE on success or
an error on failure, we can get rid of the need for it to return
-EAGAIN whenever it needs to roll the transaction before running
the next state.

That is, we don't need to spray -EAGAIN return states everywhere,
the caller just check the state machine state for completion to
determine what action should be taken next. This greatly simplifies
the code within the state machine implementation as it now only has
to handle 0 for success or -errno for error and it doesn't need to
tell the caller to retry.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Allison Henderson<allison.henderson@oracle.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2 years agoxfs: clean up final attr removal in xfs_attr_set_iter
Dave Chinner [Thu, 12 May 2022 05:12:55 +0000 (15:12 +1000)]
xfs: clean up final attr removal in xfs_attr_set_iter

Clean up the final leaf/node states in xfs_attr_set_iter() to
further simplify the high level state machine and to set the
completion state correctly. As we are adding a separate state
for node format removal, we need to ensure that node formats
are collapsed back to shortform or empty correctly.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Allison Henderson<allison.henderson@oracle.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2 years agoxfs: remote xattr removal in xfs_attr_set_iter() is conditional
Dave Chinner [Thu, 12 May 2022 05:12:55 +0000 (15:12 +1000)]
xfs: remote xattr removal in xfs_attr_set_iter() is conditional

We may not have a remote value for the old xattr we have to remove,
so skip over the remote value removal states and go straight to
the xattr name removal in the leaf/node block.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Allison Henderson<allison.henderson@oracle.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2 years agoxfs: XFS_DAS_LEAF_REPLACE state only needed if !LARP
Dave Chinner [Thu, 12 May 2022 05:12:55 +0000 (15:12 +1000)]
xfs: XFS_DAS_LEAF_REPLACE state only needed if !LARP

We can skip the REPLACE state when LARP is enabled, but that means
the XFS_DAS_FLIP_LFLAG state is now poorly named - it indicates
something that has been done rather than what the state is going to
do. Rename it to "REMOVE_OLD" to indicate that we are now going to
perform removal of the old attr.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Allison Henderson<allison.henderson@oracle.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2 years agoxfs: split remote attr setting out from replace path
Dave Chinner [Thu, 12 May 2022 05:12:55 +0000 (15:12 +1000)]
xfs: split remote attr setting out from replace path

When we set a new xattr, we have three exit paths:

1. nothing else to do
2. allocate and set the remote xattr value
3. perform the rest of a replace operation

Currently we push both 2 and 3 into the same state, regardless of
whether we just set a remote attribute or not. Once we've set the
remote xattr, we have two exit states:

1. nothing else to do
2. perform the rest of a replace operation

Hence we can split the remote xattr allocation and setting into
their own states and factor it out of xfs_attr_set_iter() to further
clean up the state machine and the implementation of the state
machine.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Allison Henderson<allison.henderson@oracle.com>
Reviewed-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2 years agoxfs: consolidate leaf/node states in xfs_attr_set_iter
Dave Chinner [Thu, 12 May 2022 05:12:54 +0000 (15:12 +1000)]
xfs: consolidate leaf/node states in xfs_attr_set_iter

The operations performed from XFS_DAS_FOUND_LBLK through to
XFS_DAS_RM_LBLK are now identical to XFS_DAS_FOUND_NBLK through to
XFS_DAS_RM_NBLK. We can collapse these down into a single set of
code.

To do this, define the states that leaf and node run through as
separate sets of sequential states. Then as we move to the next
state, we can use increments rather than specific state assignments
to move through the states. This means the state progression is set
by the initial state that enters the series and we don't need to
duplicate the code anymore.

At the exit point of the series we need to select the correct leaf
or node state, but that can also be done by state increment rather
than assignment.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Allison Henderson<allison.henderson@oracle.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2 years agoxfs: kill XFS_DAC_LEAF_ADDNAME_INIT
Dave Chinner [Thu, 12 May 2022 05:12:54 +0000 (15:12 +1000)]
xfs: kill XFS_DAC_LEAF_ADDNAME_INIT

We re-enter the XFS_DAS_FOUND_LBLK state when we have to allocate
multiple extents for a remote xattr. We currently have a flag
called XFS_DAC_LEAF_ADDNAME_INIT to avoid running the remote attr
hole finding code more than once.

However, for the node format tree, we have a separate state for this
so we never reenter the state machine at XFS_DAS_FOUND_NBLK and so
it does not need a special flag to skip over the remote attr hold
finding code.

Convert the leaf block code to use the same state machine as the
node blocks and kill the  XFS_DAC_LEAF_ADDNAME_INIT flag.

This further points out that this "ALLOC" state is only traversed
if we have remote xattrs or we are doing a rename operation. Rename
both the leaf and node alloc states to _ALLOC_RMT to indicate they
are iterating to do allocation of remote xattr blocks.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Allison Henderson<allison.henderson@oracle.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2 years agoxfs: separate out initial attr_set states
Dave Chinner [Thu, 12 May 2022 05:12:52 +0000 (15:12 +1000)]
xfs: separate out initial attr_set states

We current use XFS_DAS_UNINIT for several steps in the attr_set
state machine. We use it for setting shortform xattrs, converting
from shortform to leaf, leaf add, leaf-to-node and leaf add. All of
these things are essentially known before we start the state machine
iterating, so we really should separate them out:

XFS_DAS_SF_ADD:
- tries to do a shortform add
- on success -> done
- on ENOSPC converts to leaf, -> XFS_DAS_LEAF_ADD
- on error, dies.

XFS_DAS_LEAF_ADD:
- tries to do leaf add
- on success:
- inline attr -> done
- remote xattr || REPLACE -> XFS_DAS_FOUND_LBLK
- on ENOSPC converts to node, -> XFS_DAS_NODE_ADD
- on error, dies

XFS_DAS_NODE_ADD:
- tries to do node add
- on success:
- inline attr -> done
- remote xattr || REPLACE -> XFS_DAS_FOUND_NBLK
- on error, dies

This makes it easier to understand how the state machine starts
up and sets us up on the path to further state machine
simplifications.

This also converts the DAS state tracepoints to use strings rather
than numbers, as converting between enums and numbers requires
manual counting rather than just reading the name.

This also introduces a XFS_DAS_DONE state so that we can trace
successful operation completions easily.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Allison Henderson<allison.henderson@oracle.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2 years agoxfs: don't set quota warning values
Catherine Hoang [Tue, 10 May 2022 20:28:00 +0000 (13:28 -0700)]
xfs: don't set quota warning values

Having just dropped support for quota warning limits and warning
counters, the warning fields no longer have any meaning. Prevent these
fields from being set by removing QC_WARNS_MASK from XFS_QC_SETINFO_MASK
and XFS_QC_MASK.

Signed-off-by: Catherine Hoang <catherine.hoang@oracle.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2 years agoxfs: remove warning counters from struct xfs_dquot_res
Catherine Hoang [Tue, 10 May 2022 20:27:59 +0000 (13:27 -0700)]
xfs: remove warning counters from struct xfs_dquot_res

Warning counts are not used anywhere in the kernel. In addition, there
are no use cases, test coverage, or documentation for this functionality.
Remove the 'warnings' field from struct xfs_dquot_res and any other
related code.

Signed-off-by: Catherine Hoang <catherine.hoang@oracle.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2 years agoxfs: remove quota warning limit from struct xfs_quota_limits
Catherine Hoang [Tue, 10 May 2022 20:27:58 +0000 (13:27 -0700)]
xfs: remove quota warning limit from struct xfs_quota_limits

Warning limits in xfs quota is an unused feature that is currently
documented as unimplemented, and it is unclear what the intended
behavior of these limits are. Remove the â€˜warn’ field from struct
xfs_quota_limits and any other related code.

Signed-off-by: Catherine Hoang <catherine.hoang@oracle.com>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2 years agoxfs: rework deferred attribute operation setup
Dave Chinner [Wed, 11 May 2022 07:05:23 +0000 (17:05 +1000)]
xfs: rework deferred attribute operation setup

Logged attribute intents only have set and remove types - there is
no separate intent type for a replace operation. We should have a
separate type for a replace operation, as it needs to perform
operations that neither SET or REMOVE can perform.

Add this type to the intent items and rearrange the deferred
operation setup to reflect the different operations we are
performing.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Allison Henderson<allison.henderson@oracle.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2 years agoxfs: make xattri_leaf_bp more useful
Dave Chinner [Wed, 11 May 2022 07:04:23 +0000 (17:04 +1000)]
xfs: make xattri_leaf_bp more useful

We currently set it and hold it when converting from short to leaf
form, then release it only to immediately look it back up again
to do the leaf insert.

Do a bit of refactoring to xfs_attr_leaf_try_add() to avoid this
messy handling of the newly allocated leaf buffer.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Allison Henderson<allison.henderson@oracle.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2 years agoxfs: initialise attrd item to zero
Dave Chinner [Wed, 11 May 2022 07:03:23 +0000 (17:03 +1000)]
xfs: initialise attrd item to zero

On the first allocation of a attrd item, xfs_trans_add_item() fires
an assert like so:

 XFS (pmem0): EXPERIMENTAL logged extended attributes feature added. Use at your own risk!
 XFS: Assertion failed: !test_bit(XFS_LI_DIRTY, &lip->li_flags), file: fs/xfs/xfs_trans.c, line: 683
 ------------[ cut here ]------------
 kernel BUG at fs/xfs/xfs_message.c:102!
 Call Trace:
  <TASK>
  xfs_trans_add_item+0x17e/0x190
  xfs_trans_get_attrd+0x67/0x90
  xfs_attr_create_done+0x13/0x20
  xfs_defer_finish_noroll+0x100/0x690
  __xfs_trans_commit+0x144/0x330
  xfs_trans_commit+0x10/0x20
  xfs_attr_set+0x3e2/0x4c0
  xfs_initxattrs+0xaa/0xe0
  security_inode_init_security+0xb0/0x130
  xfs_init_security+0x18/0x20
  xfs_generic_create+0x13a/0x340
  xfs_vn_create+0x17/0x20
  path_openat+0xff3/0x12f0
  do_filp_open+0xb2/0x150

The attrd log item is allocated via kmem_cache_alloc, and
xfs_log_item_init() does not zero the entire log item structure - it
assumes that the structure is already all zeros as it only
initialises non-zero fields. Fix the attr items to be allocated
via the *zalloc methods.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Allison Henderson<allison.henderson@oracle.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2 years agoxfs: avoid empty xattr transaction when attrs are inline
Dave Chinner [Wed, 11 May 2022 07:02:23 +0000 (17:02 +1000)]
xfs: avoid empty xattr transaction when attrs are inline

generic/642 triggered a reproducable assert failure in
xlog_cil_commit() that resulted from a xfs_attr_set() committing
an empty but dirty transaction. When the CIL is empty and this
occurs, xlog_cil_commit() tries a background push and this triggers
a "pushing an empty CIL" assert.

XFS: Assertion failed: !list_empty(&cil->xc_cil), file: fs/xfs/xfs_log_cil.c, line: 1274
Call Trace:
 <TASK>
 xlog_cil_commit+0xa5a/0xad0
 __xfs_trans_commit+0xb8/0x330
 xfs_trans_commit+0x10/0x20
 xfs_attr_set+0x3e2/0x4c0
 xfs_xattr_set+0x8d/0xe0
 __vfs_setxattr+0x6b/0x90
 __vfs_setxattr_noperm+0x76/0x220
 __vfs_setxattr_locked+0xdf/0x100
 vfs_setxattr+0x94/0x170
 setxattr+0x110/0x200
 path_setxattr+0xbf/0xe0
 __x64_sys_setxattr+0x2b/0x30
 do_syscall_64+0x35/0x80

The problem is related to the breakdown of attribute addition in
xfs_attr_set_iter() and how it is called from deferred operations.
When we have a pure leaf xattr insert, we add the xattr to the leaf
and set the next state to XFS_DAS_FOUND_LBLK and return -EAGAIN.
This requeues the xattr defered work, rolls the transaction and
runs xfs_attr_set_iter() again. This then checks the xattr for
being remote (it's not) and whether a replace op is being done (this
is a create op) and if neither are true it returns without having
done anything.

xfs_xattri_finish_update() then unconditionally sets the transaction
dirty, and the deferops finishes and returns to __xfs_trans_commit()
which sees the transaction dirty and tries to commit it by calling
xlog_cil_commit(). The transaction is empty, and then the assert
fires if this happens when the CIL is empty.

This patch addresses the structure of xfs_attr_set_iter() that
requires re-entry on leaf add even when nothing will be done. This
gets rid of the trailing empty transaction and so doesn't trigger
the XFS_TRANS_DIRTY assignment in xfs_xattri_finish_update()
incorrectly. Addressing that is for a different patch.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Allison Henderson<allison.henderson@oracle.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2 years agoxfs: add leaf to node error tag
Allison Henderson [Wed, 11 May 2022 07:01:23 +0000 (17:01 +1000)]
xfs: add leaf to node error tag

Add an error tag on xfs_attr3_leaf_to_node to test log attribute
recovery and replay.

Signed-off-by: Catherine Hoang <catherine.hoang@oracle.com>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
Signed-off-by: Allison Henderson <allison.henderson@oracle.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2 years agoxfs: add leaf split error tag
Allison Henderson [Wed, 11 May 2022 07:01:23 +0000 (17:01 +1000)]
xfs: add leaf split error tag

Add an error tag on xfs_da3_split to test log attribute recovery
and replay.

Signed-off-by: Catherine Hoang <catherine.hoang@oracle.com>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
Signed-off-by: Allison Henderson <allison.henderson@oracle.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2 years agoxfs: Add helper function xfs_init_attr_trans
Allison Henderson [Wed, 11 May 2022 07:01:23 +0000 (17:01 +1000)]
xfs: Add helper function xfs_init_attr_trans

Quick helper function to collapse duplicate code to initialize
transactions for attributes

Signed-off-by: Allison Henderson <allison.henderson@oracle.com>
Suggested-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2 years agoxfs: Add helper function xfs_attr_leaf_addname
Allison Henderson [Wed, 11 May 2022 07:01:22 +0000 (17:01 +1000)]
xfs: Add helper function xfs_attr_leaf_addname

This patch adds a helper function xfs_attr_leaf_addname.  While this
does help to break down xfs_attr_set_iter, it does also hoist out some
of the state management.  This patch has been moved to the end of the
clean up series for further discussion.

Suggested-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2 years agoxfs: Merge xfs_delattr_context into xfs_attr_item
Allison Henderson [Wed, 11 May 2022 07:01:22 +0000 (17:01 +1000)]
xfs: Merge xfs_delattr_context into xfs_attr_item

This is a clean up patch that merges xfs_delattr_context into
xfs_attr_item.  Now that the refactoring is complete and the delayed
operation infrastructure is in place, we can combine these to eliminate
the extra struct

Signed-off-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2 years agoxfs: Add larp debug option
Allison Henderson [Wed, 11 May 2022 07:01:22 +0000 (17:01 +1000)]
xfs: Add larp debug option

This patch adds a debug option to enable log attribute replay. Eventually
this can be removed when delayed attrs becomes permanent.

Signed-off-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2 years agoxfs: Add log attribute error tag
Allison Henderson [Wed, 11 May 2022 07:01:22 +0000 (17:01 +1000)]
xfs: Add log attribute error tag

This patch adds an error tag that we can use to test log attribute
recovery and replay

Signed-off-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2 years agoxfs: Remove unused xfs_attr_*_args
Allison Henderson [Wed, 11 May 2022 07:01:22 +0000 (17:01 +1000)]
xfs: Remove unused xfs_attr_*_args

Remove xfs_attr_set_args, xfs_attr_remove_args, and xfs_attr_trans_roll.
These high level loops are now driven by the delayed operations code,
and can be removed.

Additionally collapse in the leaf_bp parameter of xfs_attr_set_iter
since we only have one caller that passes dac->leaf_bp

Signed-off-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2 years agoxfs: Add xfs_attr_set_deferred and xfs_attr_remove_deferred
Allison Henderson [Wed, 11 May 2022 07:01:13 +0000 (17:01 +1000)]
xfs: Add xfs_attr_set_deferred and xfs_attr_remove_deferred

These routines set up and queue a new deferred attribute operations.
These functions are meant to be called by any routine needing to
initiate a deferred attribute operation as opposed to the existing
inline operations. New helper function xfs_attr_item_init also added.

Finally enable delayed attributes in xfs_attr_set and xfs_attr_remove.

Signed-off-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2 years agoxfs: Skip flip flags for delayed attrs
Allison Henderson [Mon, 9 May 2022 09:09:10 +0000 (19:09 +1000)]
xfs: Skip flip flags for delayed attrs

This is a clean up patch that skips the flip flag logic for delayed attr
renames.  Since the log replay keeps the inode locked, we do not need to
worry about race windows with attr lookups.  So we can skip over
flipping the flag and the extra transaction roll for it

Signed-off-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2 years agoxfs: Implement attr logging and replay
Allison Henderson [Mon, 9 May 2022 09:09:07 +0000 (19:09 +1000)]
xfs: Implement attr logging and replay

This patch adds the needed routines to create, log and recover logged
extended attribute intents.

Signed-off-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2 years agoxfs: Set up infrastructure for log attribute replay
Allison Henderson [Wed, 4 May 2022 02:41:02 +0000 (12:41 +1000)]
xfs: Set up infrastructure for log attribute replay

Currently attributes are modified directly across one or more
transactions. But they are not logged or replayed in the event of an
error. The goal of log attr replay is to enable logging and replaying
of attribute operations using the existing delayed operations
infrastructure.  This will later enable the attributes to become part of
larger multi part operations that also must first be recorded to the
log.  This is mostly of interest in the scheme of parent pointers which
would need to maintain an attribute containing parent inode information
any time an inode is moved, created, or removed.  Parent pointers would
then be of interest to any feature that would need to quickly derive an
inode path from the mount point. Online scrub, nfs lookups and fs grow
or shrink operations are all features that could take advantage of this.

This patch adds two new log item types for setting or removing
attributes as deferred operations.  The xfs_attri_log_item will log an
intent to set or remove an attribute.  The corresponding
xfs_attrd_log_item holds a reference to the xfs_attri_log_item and is
freed once the transaction is done.  Both log items use a generic
xfs_attr_log_format structure that contains the attribute name, value,
flags, inode, and an op_flag that indicates if the operations is a set
or remove.

[dchinner: added extra little bits needed for intent whiteouts]

Signed-off-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2 years agoxfs: Return from xfs_attr_set_iter if there are no more rmtblks to process
Allison Henderson [Wed, 4 May 2022 02:40:02 +0000 (12:40 +1000)]
xfs: Return from xfs_attr_set_iter if there are no more rmtblks to process

During an attr rename operation, blocks are saved for later removal
as rmtblkno2. The rmtblkno is used in the case of needing to alloc
more blocks if not enough were available.  However, in the case
that no further blocks need to be added or removed, we can return as soon
as xfs_attr_node_addname completes, rather than rolling the transaction
with an -EAGAIN return.  This extra loop does not hurt anything right
now, but it will be a problem later when we get into log items because
we end up with an empty log transaction.  So, add a simple check to
cut out the unneeded iteration.

Signed-off-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2 years agoxfs: Fix double unlock in defer capture code
Allison Henderson [Wed, 4 May 2022 02:39:02 +0000 (12:39 +1000)]
xfs: Fix double unlock in defer capture code

The new deferred attr patch set uncovered a double unlock in the
recent port of the defer ops capture and continue code.  During log
recovery, we're allowed to hold buffers to a transaction that's being
used to replay an intent item.  When we capture the resources as part
of scheduling a continuation of an intent chain, we call xfs_buf_hold
to retain our reference to the buffer beyond the transaction commit,
but we do /not/ call xfs_trans_bhold to maintain the buffer lock.
This means that xfs_defer_ops_continue needs to relock the buffers
before xfs_defer_restore_resources joins then tothe new transaction.

Additionally, the buffers should not be passed back via the dres
structure since they need to remain locked unlike the inodes.  So
simply set dr_bufs to zero after populating the dres structure.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2 years agoMerge branch 'guilt/xfs-5.19-fuzz-fixes' into xfs-5.19-for-next
Dave Chinner [Wed, 4 May 2022 02:38:02 +0000 (12:38 +1000)]
Merge branch 'guilt/xfs-5.19-fuzz-fixes' into xfs-5.19-for-next

2 years agoMerge tag 'reflink-speedups-5.19_2022-04-28' of git://git.kernel.org/pub/scm/linux...
Dave Chinner [Wed, 4 May 2022 02:37:40 +0000 (12:37 +1000)]
Merge tag 'reflink-speedups-5.19_2022-04-28' of git://git./linux/kernel/git/djwong/xfs-linux into xfs-5.19-for-next

xfs: fix reflink inefficiencies

As Dave Chinner has complained about on IRC, there are a couple of
things about reflink that are very inefficient.  First of all, we
limited the size of all bunmapi operations to avoid flooding the log
with defer ops in the worst case, but recent changes to the defer
ops code have solved that problem, so get rid of the bunmapi length
clamp.

Second, the log reservations for reflink operations are far far
larger than they need to be.  Shrink them to exactly what we need to
handle each deferred RUI and CUI log item, and no more.  Also reduce
logcount because we don't need 8 rolls per operation.  Introduce a
transaction reservation compatibility layer to avoid changing the
minimum log size calculations.

Signed-off-by: Dave Chinner <david@fromorbit.com>
2 years agoMerge tag 'rmap-speedups-5.19_2022-04-28' of git://git.kernel.org/pub/scm/linux/kerne...
Dave Chinner [Wed, 4 May 2022 02:37:18 +0000 (12:37 +1000)]
Merge tag 'rmap-speedups-5.19_2022-04-28' of git://git./linux/kernel/git/djwong/xfs-linux into xfs-5.19-for-next

xfs: fix rmap inefficiencies

Reduce the performance impact of the reverse mapping btree when
reflink is enabled by using the much faster non-overlapped btree
lookup functions when we're searching the rmap index with a fully
specified key.  If we find the exact record we're looking for,
great!  We don't have to perform the full overlapped scan.  For
filesystems with high sharing factors this reduces the xfs_scrub
runtime by a good 15%%.

This has been shown to reduce the fstests runtime for realtime rmap
configurations by 30%%, since the lack of AGs severely limits
scalability.

Signed-off-by: Dave Chinner <david@fromorbit.com>
2 years agoMerge branch 'guilt/xlog-intent-whiteouts' into xfs-5.19-for-next
Dave Chinner [Wed, 4 May 2022 02:37:02 +0000 (12:37 +1000)]
Merge branch 'guilt/xlog-intent-whiteouts' into xfs-5.19-for-next

2 years agoMerge branch 'guilt/xfs-5.19-misc-2' into xfs-5.19-for-next
Dave Chinner [Wed, 4 May 2022 02:36:54 +0000 (12:36 +1000)]
Merge branch 'guilt/xfs-5.19-misc-2' into xfs-5.19-for-next

2 years agoxfs: validate v5 feature fields
Dave Chinner [Wed, 4 May 2022 02:17:18 +0000 (12:17 +1000)]
xfs: validate v5 feature fields

We don't check that the v4 feature flags taht v5 requires to be set
are actually set anywhere. Do this check when we see that the
filesystem is a v5 filesystem.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2 years agoxfs: set XFS_FEAT_NLINK correctly
Dave Chinner [Wed, 4 May 2022 02:14:13 +0000 (12:14 +1000)]
xfs: set XFS_FEAT_NLINK correctly

While xfs_has_nlink() is not used in kernel, it is used in userspace
(e.g. by xfs_db) so we need to set the XFS_FEAT_NLINK flag correctly
in xfs_sb_version_to_features().

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2 years agoxfs: validate inode fork size against fork format
Dave Chinner [Wed, 4 May 2022 02:13:53 +0000 (12:13 +1000)]
xfs: validate inode fork size against fork format

xfs_repair catches fork size/format mismatches, but the in-kernel
verifier doesn't, leading to null pointer failures when attempting
to perform operations on the fork. This can occur in the
xfs_dir_is_empty() where the in-memory fork format does not match
the size and so the fork data pointer is accessed incorrectly.

Note: this causes new failures in xfs/348 which is testing mode vs
ftype mismatches. We now detect a regular file that has been changed
to a directory or symlink mode as being corrupt because the data
fork is for a symlink or directory should be in local form when
there are only 3 bytes of data in the data fork. Hence the inode
verify for the regular file now fires w/ -EFSCORRUPTED because
the inode fork format does not match the format the corrupted mode
says it should be in.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2 years agoxfs: detect self referencing btree sibling pointers
Dave Chinner [Wed, 4 May 2022 02:13:35 +0000 (12:13 +1000)]
xfs: detect self referencing btree sibling pointers

To catch the obvious graph cycle problem and hence potential endless
looping.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2 years agoxfs: intent item whiteouts
Dave Chinner [Wed, 4 May 2022 01:50:29 +0000 (11:50 +1000)]
xfs: intent item whiteouts

When we log modifications based on intents, we add both intent
and intent done items to the modification being made. These get
written to the log to ensure that the operation is re-run if the
intent done is not found in the log.

However, for operations that complete wholly within a single
checkpoint, the change in the checkpoint is atomic and will never
need replay. In this case, we don't need to actually write the
intent and intent done items to the journal because log recovery
will never need to manually restart this modification.

Log recovery currently handles intent/intent done matching by
inserting the intent into the AIL, then removing it when a matching
intent done item is found. Hence for all the intent-based operations
that complete within a checkpoint, we spend all that time parsing
the intent/intent done items just to cancel them and do nothing with
them.

Hence it follows that the only time we actually need intents in the
log is when the modification crosses checkpoint boundaries in the
log and so may only be partially complete in the journal. Hence if
we commit and intent done item to the CIL and the intent item is in
the same checkpoint, we don't actually have to write them to the
journal because log recovery will always cancel the intents.

We've never really worried about the overhead of logging intents
unnecessarily like this because the intents we log are generally
very much smaller than the change being made. e.g. freeing an extent
involves modifying at lease two freespace btree blocks and the AGF,
so the EFI/EFD overhead is only a small increase in space and
processing time compared to the overall cost of freeing an extent.

However, delayed attributes change this cost equation dramatically,
especially for inline attributes. In the case of adding an inline
attribute, we only log the inode core and attribute fork at present.
With delayed attributes, we now log the attr intent which includes
the name and value, the inode core adn attr fork, and finally the
attr intent done item. We increase the number of items we log from 1
to 3, and the number of log vectors (regions) goes up from 3 to 7.
Hence we tripple the number of objects that the CIL has to process,
and more than double the number of log vectors that need to be
written to the journal.

At scale, this means delayed attributes cause a non-pipelined CIL to
become CPU bound processing all the extra items, resulting in a > 40%
performance degradation on 16-way file+xattr create worklaods.
Pipelining the CIL (as per 5.15) reduces the performance degradation
to 20%, but now the limitation is the rate at which the log items
can be written to the iclogs and iclogs be dispatched for IO and
completed.

Even log IO completion is slowed down by these intents, because it
now has to process 3x the number of items in the checkpoint.
Processing completed intents is especially inefficient here, because
we first insert the intent into the AIL, then remove it from the AIL
when the intent done is processed. IOWs, we are also doing expensive
operations in log IO completion we could completely avoid if we
didn't log completed intent/intent done pairs.

Enter log item whiteouts.

When an intent done is committed, we can check to see if the
associated intent is in the same checkpoint as we are currently
committing the intent done to. If so, we can mark the intent log
item with a whiteout and immediately free the intent done item
rather than committing it to the CIL. We can basically skip the
entire formatting and CIL insertion steps for the intent done item.

However, we cannot remove the intent item from the CIL at this point
because the unlocked per-cpu CIL item lists do not permit removal
without holding the CIL context lock exclusively. Transaction commit
only holds the context lock shared, hence the best we can do is mark
the intent item with a whiteout so that the CIL push can release it
rather than writing it to the log.

This means we never write the intent to the log if the intent done
has also been committed to the same checkpoint, but we'll always
write the intent if the intent done has not been committed or has
been committed to a different checkpoint. This will result in
correct log recovery behaviour in all cases, without the overhead of
logging unnecessary intents.

This intent whiteout concept is generic - we can apply it to all
intent/intent done pairs that have a direct 1:1 relationship. The
way deferred ops iterate and relog intents mean that all intents
currently have a 1:1 relationship with their done intent, and hence
we can apply this cancellation to all existing intent/intent done
implementations.

For delayed attributes with a 16-way 64kB xattr create workload,
whiteouts reduce the amount of journalled metadata from ~2.5GB/s
down to ~600MB/s and improve the creation rate from 9000/s to
14000/s.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2 years agoxfs: whiteouts release intents that are not in the AIL
Dave Chinner [Wed, 4 May 2022 01:46:47 +0000 (11:46 +1000)]
xfs: whiteouts release intents that are not in the AIL

When we release an intent that a whiteout applies to, it will not
have been committed to the journal and so won't be in the AIL. Hence
when we drop the last reference to the intent, we do not want to try
to remove it from the AIL as that will trigger a filesystem
shutdown. Hence make the removal of intents from the AIL conditional
on them actually being in the AIL so we do the correct thing.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2 years agoxfs: add log item method to return related intents
Dave Chinner [Wed, 4 May 2022 01:46:39 +0000 (11:46 +1000)]
xfs: add log item method to return related intents

To apply a whiteout to an intent item when an intent done item is
committed, we need to be able to retrieve the intent item from the
the intent done item. Add a log item op method for doing this, and
wire all the intent done items up to it.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2 years agoxfs: factor and move some code in xfs_log_cil.c
Dave Chinner [Wed, 4 May 2022 01:46:30 +0000 (11:46 +1000)]
xfs: factor and move some code in xfs_log_cil.c

In preparation for adding support for intent item whiteouts.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2 years agoxfs: tag transactions that contain intent done items
Dave Chinner [Wed, 4 May 2022 01:46:21 +0000 (11:46 +1000)]
xfs: tag transactions that contain intent done items

Intent whiteouts will require extra work to be done during
transaction commit if the transaction contains an intent done item.

To determine if a transaction contains an intent done item, we want
to avoid having to walk all the items in the transaction to check if
they are intent done items. Hence when we add an intent done item to
a transaction, tag the transaction to indicate that it contains such
an item.

We don't tag the transaction when the defer ops is relogging an
intent to move it forward in the log. Whiteouts will never apply to
these cases, so we don't need to bother looking for them.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2 years agoxfs: add log item flags to indicate intents
Dave Chinner [Wed, 4 May 2022 01:46:09 +0000 (11:46 +1000)]
xfs: add log item flags to indicate intents

We currently have a couple of helper functions that try to infer
whether the log item is an intent or intent done item from the
combinations of operations it supports.  This is incredibly fragile
and not very efficient as it requires checking specific combinations
of ops.

We need to be able to identify intent and intent done items quickly
and easily in upcoming patches, so simply add intent and intent done
type flags to the log item ops flags. These are static flags to
begin with, so intent items should have been typed like this from
the start.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2 years agoxfs: don't commit the first deferred transaction without intents
Dave Chinner [Wed, 4 May 2022 01:46:00 +0000 (11:46 +1000)]
xfs: don't commit the first deferred transaction without intents

If the first operation in a string of defer ops has no intents,
then there is no reason to commit it before running the first call
to xfs_defer_finish_one(). This allows the defer ops to be used
effectively for non-intent based operations without requiring an
unnecessary extra transaction commit when first called.

This fixes a regression in per-attribute modification transaction
count when delayed attributes are not being used.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2 years agoxfs: hide log iovec alignment constraints
Dave Chinner [Wed, 4 May 2022 01:45:50 +0000 (11:45 +1000)]
xfs: hide log iovec alignment constraints

Callers currently have to round out the size of buffers to match the
aligment constraints of log iovecs and xlog_write(). They should not
need to know this detail, so introduce a new function to calculate
the iovec length (for use in ->iop_size implementations). Also
modify xlog_finish_iovec() to round up the length to the correct
alignment so the callers don't need to do this, either.

Convert the only user - inode forks - of this alignment rounding to
use the new interface.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2 years agoxfs: fix potential log item leak
Dave Chinner [Wed, 4 May 2022 01:45:11 +0000 (11:45 +1000)]
xfs: fix potential log item leak

Ever since we added shadown format buffers to the log items, log
items need to handle the item being released with shadow buffers
attached. Due to the fact this requirement was added at the same
time we added new rmap/reflink intents, we missed the cleanup of
those items.

In theory, this means shadow buffers can be leaked in a very small
window when a shutdown is initiated. Testing with KASAN shows this
leak does not happen in practice - we haven't identified a single
leak in several years of shutdown testing since ~v4.8 kernels.

However, the intent whiteout cleanup mechanism results in every
cancelled intent in exactly the same state as this tiny race window
creates and so if intents down clean up shadow buffers on final
release we will leak the shadow buffer for just about every intent
we create.

Hence we start with this patch to close this condition off and
ensure that when whiteouts start to be used we don't leak lots of
memory.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2 years agoxfs: zero inode fork buffer at allocation
Dave Chinner [Wed, 4 May 2022 01:44:55 +0000 (11:44 +1000)]
xfs: zero inode fork buffer at allocation

When we first allocate or resize an inline inode fork, we round up
the allocation to 4 byte alingment to make journal alignment
constraints. We don't clear the unused bytes, so we can copy up to
three uninitialised bytes into the journal. Zero those bytes so we
only ever copy zeros into the journal.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2 years agoxfs: rename xfs_*alloc*_log_count to _block_count
Darrick J. Wong [Tue, 26 Apr 2022 01:38:24 +0000 (18:38 -0700)]
xfs: rename xfs_*alloc*_log_count to _block_count

These functions return the maximum number of blocks that could be logged
in a particular transaction.  "log count" is confusing since there's a
separate concept of a log (operation) count in the reservation code, so
let's change it to "block count" to be less confusing.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2 years agoxfs: rewrite xfs_reflink_end_cow to use intents
Darrick J. Wong [Tue, 26 Apr 2022 01:38:15 +0000 (18:38 -0700)]
xfs: rewrite xfs_reflink_end_cow to use intents

Currently, the code that performs CoW remapping after a write has this
odd behavior where it walks /backwards/ through the data fork to remap
extents in reverse order.  Earlier, we rewrote the reflink remap
function to use deferred bmap log items instead of trying to cram as
much into the first transaction that we could.  Now do the same for the
CoW remap code.  There doesn't seem to be any performance impact; we're
just making better use of code that we added for the benefit of reflink.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2 years agoxfs: reduce transaction reservations with reflink
Darrick J. Wong [Tue, 26 Apr 2022 01:38:14 +0000 (18:38 -0700)]
xfs: reduce transaction reservations with reflink

Before to the introduction of deferred refcount operations, reflink
would try to cram refcount btree updates into the same transaction as an
allocation or a free event.  Mainline XFS has never actually done that,
but we never refactored the transaction reservations to reflect that we
now do all refcount updates in separate transactions.  Fix this to
reduce the transaction reservation size even farther, so that between
this patch and the previous one, we reduce the tr_write and tr_itruncate
sizes by 66%.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2 years agoxfs: reduce the absurdly large log operation count
Darrick J. Wong [Tue, 26 Apr 2022 01:38:14 +0000 (18:38 -0700)]
xfs: reduce the absurdly large log operation count

Back in the early days of reflink and rmap development I set the
transaction reservation sizes to be overly generous for rmap+reflink
filesystems, and a little under-generous for rmap-only filesystems.

Since we don't need *eight* transaction rolls to handle three new log
intent items, decrease the logcounts to what we actually need, and amend
the shadow reservation computation function to reflect what we used to
do so that the minimum log size doesn't change.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2 years agoxfs: report "max_resp" used for min log size computation
Darrick J. Wong [Tue, 26 Apr 2022 01:38:13 +0000 (18:38 -0700)]
xfs: report "max_resp" used for min log size computation

Move the tracepoint that computes the size of the transaction used to
compute the minimum log size into xfs_log_get_max_trans_res so that we
only have to compute this stuff once.

Leave xfs_log_get_max_trans_res as a non-static function so that xfs_db
can call it to report the results of the userspace computation of the
same value to diagnose mkfs/kernel misinteractions.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2 years agoxfs: create shadow transaction reservations for computing minimum log size
Darrick J. Wong [Tue, 26 Apr 2022 01:38:13 +0000 (18:38 -0700)]
xfs: create shadow transaction reservations for computing minimum log size

Every time someone changes the transaction reservation sizes, they
introduce potential compatibility problems if the changes affect the
minimum log size that we validate at mount time.  If the minimum log
size gets larger (which should be avoided because doing so presents a
serious risk of log livelock), filesystems created with old mkfs will
not mount on a newer kernel; if the minimum size shrinks, filesystems
created with newer mkfs will not mount on older kernels.

Therefore, enable the creation of a shadow log reservation structure
where we can "undo" the effects of tweaks when computing minimum log
sizes.  These shadow reservations should never be used in practice, but
they insulate us from perturbations in minimum log size.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2 years agoxfs: remove a __xfs_bunmapi call from reflink
Darrick J. Wong [Tue, 26 Apr 2022 01:38:12 +0000 (18:38 -0700)]
xfs: remove a __xfs_bunmapi call from reflink

This raw call isn't necessary since we can always remove a full delalloc
extent.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2 years agoxfs: stop artificially limiting the length of bunmap calls
Darrick J. Wong [Tue, 26 Apr 2022 01:37:15 +0000 (18:37 -0700)]
xfs: stop artificially limiting the length of bunmap calls

In commit e1a4e37cc7b6, we clamped the length of bunmapi calls on the
data forks of shared files to avoid two failure scenarios: one where the
extent being unmapped is so sparsely shared that we exceed the
transaction reservation with the sheer number of refcount btree updates
and EFI intent items; and the other where we attach so many deferred
updates to the transaction that we pin the log tail and later the log
head meets the tail, causing the log to livelock.

We avoid triggering the first problem by tracking the number of ops in
the refcount btree cursor and forcing a requeue of the refcount intent
item any time we think that we might be close to overflowing.  This has
been baked into XFS since before the original e1a4 patch.

A recent patchset fixed the second problem by changing the deferred ops
code to finish all the work items created by each round of trying to
complete a refcount intent item, which eliminates the long chains of
deferred items (27dad); and causing long-running transactions to relog
their intent log items when space in the log gets low (74f4d).

Because this clamp affects /any/ unmapping request regardless of the
sharing factors of the component blocks, it degrades the performance of
all large unmapping requests -- whereas with an unshared file we can
unmap millions of blocks in one go, shared files are limited to
unmapping a few thousand blocks at a time, which causes the upper level
code to spin in a bunmapi loop even if it wasn't needed.

This also eliminates one more place where log recovery behavior can
differ from online behavior, because bunmapi operations no longer need
to requeue.  The fstest generic/447 was created to test the old fix, and
it still passes with this applied.

Partial-revert-of: e1a4e37cc7b6 ("xfs: try to avoid blowing out the transaction reservation when bunmaping a shared extent")
Depends: 27dada070d59 ("xfs: change the order in which child and parent defer ops ar finished")
Depends: 74f4d6a1e065 ("xfs: only relog deferred intent items if free space in the log gets low")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2 years agoxfs: count EFIs when deciding to ask for a continuation of a refcount update
Darrick J. Wong [Tue, 26 Apr 2022 22:29:54 +0000 (15:29 -0700)]
xfs: count EFIs when deciding to ask for a continuation of a refcount update

A long time ago, I added to XFS the ability to use deferred reference
count operations as part of a transaction chain.  This enabled us to
avoid blowing out the transaction reservation when the blocks in a
physical extent all had different reference counts because we could ask
the deferred operation manager for a continuation, which would get us a
clean transaction.

The refcount code asks for a continuation when the number of refcount
record updates reaches the point where we think that the transaction has
logged enough full btree blocks due to refcount (and free space) btree
shape changes and refcount record updates that we're in danger of
overflowing the transaction.

We did not previously count the EFIs logged to the refcount update
transaction because the clamps on the length of a bunmap operation were
sufficient to avoid overflowing the transaction reservation even in the
worst case situation where every other block of the unmapped extent is
shared.

Unfortunately, the restrictions on bunmap length avoid failure in the
worst case by imposing a maximum unmap length of ~3000 blocks, even for
non-pathological cases.  This seriously limits performance when freeing
large extents.

Therefore, track EFIs with the same counter as refcount record updates,
and use that information as input into when we should ask for a
continuation.  This enables the next patch to drop the clumsy bunmap
limitation.

Depends: 27dada070d59 ("xfs: change the order in which child and parent defer ops ar finished")
Depends: 74f4d6a1e065 ("xfs: only relog deferred intent items if free space in the log gets low")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2 years agoxfs: speed up write operations by using non-overlapped lookups when possible
Darrick J. Wong [Tue, 26 Apr 2022 01:37:06 +0000 (18:37 -0700)]
xfs: speed up write operations by using non-overlapped lookups when possible

Reverse mapping on a reflink-capable filesystem has some pretty high
overhead when performing file operations.  This is because the rmap
records for logically and physically adjacent extents might not be
adjacent in the rmap index due to data block sharing.  As a result, we
use expensive overlapped-interval btree search, which walks every record
that overlaps with the supplied key in the hopes of finding the record.

However, profiling data shows that when the index contains a record that
is an exact match for a query key, the non-overlapped btree search
function can find the record much faster than the overlapped version.
Try the non-overlapped lookup first when we're trying to find the left
neighbor rmap record for a given file mapping, which makes unwritten
extent conversion and remap operations run faster if data block sharing
is minimal in this part of the filesystem.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2 years agoxfs: speed up rmap lookups by using non-overlapped lookups when possible
Darrick J. Wong [Tue, 26 Apr 2022 01:37:06 +0000 (18:37 -0700)]
xfs: speed up rmap lookups by using non-overlapped lookups when possible

Reverse mapping on a reflink-capable filesystem has some pretty high
overhead when performing file operations.  This is because the rmap
records for logically and physically adjacent extents might not be
adjacent in the rmap index due to data block sharing.  As a result, we
use expensive overlapped-interval btree search, which walks every record
that overlaps with the supplied key in the hopes of finding the record.

However, profiling data shows that when the index contains a record that
is an exact match for a query key, the non-overlapped btree search
function can find the record much faster than the overlapped version.
Try the non-overlapped lookup first, which will make scrub run much
faster.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2 years agoxfs: simplify xfs_rmap_lookup_le call sites
Darrick J. Wong [Tue, 26 Apr 2022 01:37:06 +0000 (18:37 -0700)]
xfs: simplify xfs_rmap_lookup_le call sites

Most callers of xfs_rmap_lookup_le will retrieve the btree record
immediately if the lookup succeeds.  The overlapped version of this
function (xfs_rmap_lookup_le_range) will return the record if the lookup
succeeds, so make the regular version do it too.  Get rid of the useless
len argument, since it's not part of the lookup key.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2 years agoxfs: capture buffer ops in the xfs_buf tracepoints
Darrick J. Wong [Tue, 26 Apr 2022 01:37:05 +0000 (18:37 -0700)]
xfs: capture buffer ops in the xfs_buf tracepoints

Record the buffer ops in the xfs_buf tracepoints so that we can monitor
the alleged type of the buffer.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2 years agoxfs: revert "xfs: actually bump warning counts when we send warnings"
Eric Sandeen [Tue, 26 Apr 2022 03:35:26 +0000 (13:35 +1000)]
xfs: revert "xfs: actually bump warning counts when we send warnings"

This reverts commit 4b8628d57b725b32616965e66975fcdebe008fe7.

XFS quota has had the concept of a "quota warning limit" since
the earliest Irix implementation, but a mechanism for incrementing
the warning counter was never implemented, as documented in the
xfs_quota(8) man page. We do know from the historical archive that
it was never incremented at runtime during quota reservation
operations.

With this commit, the warning counter quickly increments for every
allocation attempt after the user has crossed a quote soft
limit threshold, and this in turn transitions the user to hard
quota failures, rendering soft quota thresholds and timers useless.
This was reported as a regression by users.

Because the intended behavior of this warning counter has never been
understood or documented, and the result of this change is a regression
in soft quota functionality, revert this commit to make soft quota
limits and timers operable again.

Fixes: 4b8628d57b72 ("xfs: actually bump warning counts when we send warnings)
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2 years agoxfs: fix soft lockup via spinning in filestream ag selection loop
Brian Foster [Tue, 26 Apr 2022 03:34:54 +0000 (13:34 +1000)]
xfs: fix soft lockup via spinning in filestream ag selection loop

The filestream AG selection loop uses pagf data to aid in AG
selection, which depends on pagf initialization. If the in-core
structure is not initialized, the caller invokes the AGF read path
to do so and carries on. If another task enters the loop and finds
a pagf init already in progress, the AGF read returns -EAGAIN and
the task continues the loop. This does not increment the current ag
index, however, which means the task spins on the current AGF buffer
until unlocked.

If the AGF read I/O submitted by the initial task happens to be
delayed for whatever reason, this results in soft lockup warnings
via the spinning task. This is reproduced by xfs/170. To avoid this
problem, fix the AGF trylock failure path to properly iterate to the
next AG. If a task iterates all AGs without making progress, the
trylock behavior is dropped in favor of blocking locks and thus a
soft lockup is no longer possible.

Fixes: f48e2df8a877ca1c ("xfs: make xfs_*read_agf return EAGAIN to ALLOC_FLAG_TRYLOCK callers")
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2 years agoxfs: improve __xfs_set_acl
Yang Xu [Tue, 26 Apr 2022 03:34:42 +0000 (13:34 +1000)]
xfs: improve __xfs_set_acl

Provide a proper stub for the !CONFIG_XFS_POSIX_ACL case.

Also use a easy way for xfs_get_acl stub.

Suggested-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Signed-off-by: Yang Xu <xuyang2018.jy@fujitsu.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2 years agoMerge tag 'large-extent-counters-v9' of https://github.com/chandanr/linux into xfs...
Dave Chinner [Thu, 21 Apr 2022 06:46:17 +0000 (16:46 +1000)]
Merge tag 'large-extent-counters-v9' of https://github.com/chandanr/linux into xfs-5.19-for-next

xfs: Large extent counters

The commit xfs: fix inode fork extent count overflow
(3f8a4f1d876d3e3e49e50b0396eaffcc4ba71b08) mentions that 10 billion
data fork extents should be possible to create. However the
corresponding on-disk field has a signed 32-bit type. Hence this
patchset extends the per-inode data fork extent counter to 64 bits
(out of which 48 bits are used to store the extent count).

Also, XFS has an attribute fork extent counter which is 16 bits
wide. A workload that,
1. Creates 1 million 255-byte sized xattrs,
2. Deletes 50% of these xattrs in an alternating manner,
3. Tries to insert 400,000 new 255-byte sized xattrs
   causes the xattr extent counter to overflow.

Dave tells me that there are instances where a single file has more
than 100 million hardlinks. With parent pointers being stored in
xattrs, we will overflow the signed 16-bits wide attribute extent
counter when large number of hardlinks are created. Hence this
patchset extends the on-disk field to 32-bits.

The following changes are made to accomplish this,
1. A 64-bit inode field is carved out of existing di_pad and
   di_flushiter fields to hold the 64-bit data fork extent counter.
2. The existing 32-bit inode data fork extent counter will be used to
   hold the attribute fork extent counter.
3. A new incompat superblock flag to prevent older kernels from mounting
   the filesystem.

Signed-off-by: Chandan Babu R <chandan.babu@oracle.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2 years agoMerge branch 'guilt/xlog-write-rework' into xfs-5.19-for-next
Dave Chinner [Thu, 21 Apr 2022 06:45:52 +0000 (16:45 +1000)]
Merge branch 'guilt/xlog-write-rework' into xfs-5.19-for-next

2 years agoMerge branch 'guilt/xfs-unsigned-flags-5.18' into xfs-5.19-for-next
Dave Chinner [Thu, 21 Apr 2022 06:45:03 +0000 (16:45 +1000)]
Merge branch 'guilt/xfs-unsigned-flags-5.18' into xfs-5.19-for-next

2 years agoMerge branch 'guilt/5.19-miscellaneous' into xfs-5.19-for-next
Dave Chinner [Thu, 21 Apr 2022 01:40:17 +0000 (11:40 +1000)]
Merge branch 'guilt/5.19-miscellaneous' into xfs-5.19-for-next

2 years agoxfs: convert log ticket and iclog flags to unsigned.
Dave Chinner [Thu, 21 Apr 2022 00:48:01 +0000 (10:48 +1000)]
xfs: convert log ticket and iclog flags to unsigned.

5.18 w/ std=gnu11 compiled with gcc-5 wants flags stored in unsigned
fields to be unsigned.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2 years agoxfs: convert shutdown reasons to unsigned.
Dave Chinner [Thu, 21 Apr 2022 00:47:38 +0000 (10:47 +1000)]
xfs: convert shutdown reasons to unsigned.

5.18 w/ std=gnu11 compiled with gcc-5 wants flags stored in unsigned
fields to be unsigned.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>