git.kernel.dk Git - linux-2.6-block.git/log

btrfs: record new subvolume in parent dir earlier to avoid dir logging races

Instead of recording that a new subvolume was created in a directory after
we add the entry do the directory, record it before adding the entry. This
is to avoid races where after creating the entry and before recording the
new subvolume in the directory (the call to btrfs_record_new_subvolume()),
another task logs the directory, so we end up with a log tree where we
logged a directory that has an entry pointing to a root that was not yet
committed, resulting in an invalid entry if the log is persisted and
replayed later due to a power failure or crash.

Also state this requirement in the function comment for
btrfs_record_new_subvolume(), similar to what we do for the
btrfs_record_unlink_dir() and btrfs_record_snapshot_destroy().

Fixes: 45c4102f0d82 ("btrfs: avoid transaction commit on any fsync after subvolume creation")
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: fix inode lookup error handling during log replay

When replaying log trees we use read_one_inode() to get an inode, which is
just a wrapper around btrfs_iget_logging(), which in turn is a wrapper for
btrfs_iget(). But read_one_inode() always returns NULL for any error
that btrfs_iget_logging() / btrfs_iget() may return and this is a problem
because:

1) In many callers of read_one_inode() we convert the NULL into -EIO,
   which is not accurate since btrfs_iget() may return -ENOMEM and -ENOENT
   for example, besides -EIO and other errors. So during log replay we
   may end up reporting a false -EIO, which is confusing since we may
   not have had any IO error at all;

2) When replaying directory deletes, at replay_dir_deletes(), we assume
   the NULL returned from read_one_inode() means that the inode doesn't
   exist and then proceed as if no error had happened. This is wrong
   because unless btrfs_iget() returned ERR_PTR(-ENOENT), we had an
   actual error and the target inode may exist in the target subvolume
   root - this may later result in the log replay code failing at a
   later stage (if we are "lucky") or succeed but leaving some
   inconsistency in the filesystem.

So fix this by not ignoring errors from btrfs_iget_logging() and as
a consequence remove the read_one_inode() wrapper and just use
btrfs_iget_logging() directly. Also since btrfs_iget_logging() is
supposed to be called only against subvolume roots, just like
read_one_inode() which had a comment about it, add an assertion to
btrfs_iget_logging() to check that the target root corresponds to a
subvolume root.

Fixes: 5d4f98a28c7d ("Btrfs: Mixed back reference  (FORWARD ROLLING FORMAT CHANGE)")
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: fix iteration of extrefs during log replay

At __inode_add_ref() when processing extrefs, if we jump into the next
label we have an undefined value of victim_name.len, since we haven't
initialized it before we did the goto. This results in an invalid memory
access in the next iteration of the loop since victim_name.len was not
initialized to the length of the name of the current extref.

Fix this by initializing victim_name.len with the current extref's name
length.

Fixes: e43eec81c516 ("btrfs: use struct qstr instead of name and namelen pairs")
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: fix missing error handling when searching for inode refs during log replay

During log replay, at __add_inode_ref(), when we are searching for inode
ref keys we totally ignore if btrfs_search_slot() returns an error. This
may make a log replay succeed when there was an actual error and leave
some metadata inconsistency in a subvolume tree. Fix this by checking if
an error was returned from btrfs_search_slot() and if so, return it to
the caller.

Fixes: e02119d5a7b4 ("Btrfs: Add a write ahead tree log to optimize synchronous operations")
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: fix failure to rebuild free space tree using multiple transactions

If we are rebuilding a free space tree, while modifying the free space
tree we may need to allocate a new metadata block group.
If we end up using multiple transactions for the rebuild, when we call
btrfs_end_transaction() we enter btrfs_create_pending_block_groups()
which calls add_block_group_free_space() to add items to the free space
tree for the block group.

Then later during the free space tree rebuild, at
btrfs_rebuild_free_space_tree(), we may find such new block groups
and call populate_free_space_tree() for them, which fails with -EEXIST
because there are already items in the free space tree. Then we abort the
transaction with -EEXIST at btrfs_rebuild_free_space_tree().
Notice that we say "may find" the new block groups because a new block
group may be inserted in the block groups rbtree, which is being iterated
by the rebuild process, before or after the current node where the rebuild
process is currently at.

Syzbot recently reported such case which produces a trace like the
following:

  ------------[ cut here ]------------
  BTRFS: Transaction aborted (error -17)
  WARNING: CPU: 1 PID: 7626 at fs/btrfs/free-space-tree.c:1341 btrfs_rebuild_free_space_tree+0x470/0x54c fs/btrfs/free-space-tree.c:1341
  Modules linked in:
  CPU: 1 UID: 0 PID: 7626 Comm: syz.2.25 Not tainted 6.15.0-rc7-syzkaller-00085-gd7fa1af5b33e-dirty #0 PREEMPT
  Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 05/07/2025
  pstate: 60400005 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
  pc : btrfs_rebuild_free_space_tree+0x470/0x54c fs/btrfs/free-space-tree.c:1341
  lr : btrfs_rebuild_free_space_tree+0x470/0x54c fs/btrfs/free-space-tree.c:1341
  sp : ffff80009c4f7740
  x29: ffff80009c4f77b0 x28: ffff0000d4c3f400 x27: 0000000000000000
  x26: dfff800000000000 x25: ffff70001389eee8 x24: 0000000000000003
  x23: 1fffe000182b6e7b x22: 0000000000000000 x21: ffff0000c15b73d8
  x20: 00000000ffffffef x19: ffff0000c15b7378 x18: 1fffe0003386f276
  x17: ffff80008f31e000 x16: ffff80008adbe98c x15: 0000000000000001
  x14: 1fffe0001b281550 x13: 0000000000000000 x12: 0000000000000000
  x11: ffff60001b281551 x10: 0000000000000003 x9 : 1c8922000a902c00
  x8 : 1c8922000a902c00 x7 : ffff800080485878 x6 : 0000000000000000
  x5 : 0000000000000001 x4 : 0000000000000001 x3 : ffff80008047843c
  x2 : 0000000000000001 x1 : ffff80008b3ebc40 x0 : 0000000000000001
  Call trace:
   btrfs_rebuild_free_space_tree+0x470/0x54c fs/btrfs/free-space-tree.c:1341 (P)
   btrfs_start_pre_rw_mount+0xa78/0xe10 fs/btrfs/disk-io.c:3074
   btrfs_remount_rw fs/btrfs/super.c:1319 [inline]
   btrfs_reconfigure+0x828/0x2418 fs/btrfs/super.c:1543
   reconfigure_super+0x1d4/0x6f0 fs/super.c:1083
   do_remount fs/namespace.c:3365 [inline]
   path_mount+0xb34/0xde0 fs/namespace.c:4200
   do_mount fs/namespace.c:4221 [inline]
   __do_sys_mount fs/namespace.c:4432 [inline]
   __se_sys_mount fs/namespace.c:4409 [inline]
   __arm64_sys_mount+0x3e8/0x468 fs/namespace.c:4409
   __invoke_syscall arch/arm64/kernel/syscall.c:35 [inline]
   invoke_syscall+0x98/0x2b8 arch/arm64/kernel/syscall.c:49
   el0_svc_common+0x130/0x23c arch/arm64/kernel/syscall.c:132
   do_el0_svc+0x48/0x58 arch/arm64/kernel/syscall.c:151
   el0_svc+0x58/0x17c arch/arm64/kernel/entry-common.c:767
   el0t_64_sync_handler+0x78/0x108 arch/arm64/kernel/entry-common.c:786
   el0t_64_sync+0x198/0x19c arch/arm64/kernel/entry.S:600
  irq event stamp: 330
  hardirqs last  enabled at (329): [<ffff80008048590c>] raw_spin_rq_unlock_irq kernel/sched/sched.h:1525 [inline]
  hardirqs last  enabled at (329): [<ffff80008048590c>] finish_lock_switch+0xb0/0x1c0 kernel/sched/core.c:5130
  hardirqs last disabled at (330): [<ffff80008adb9e60>] el1_dbg+0x24/0x80 arch/arm64/kernel/entry-common.c:511
  softirqs last  enabled at (10): [<ffff8000801fbf10>] local_bh_enable+0x10/0x34 include/linux/bottom_half.h:32
  softirqs last disabled at (8): [<ffff8000801fbedc>] local_bh_disable+0x10/0x34 include/linux/bottom_half.h:19
  ---[ end trace 0000000000000000 ]---

Fix this by flagging new block groups which had their free space tree
entries already added and then skip them in the rebuild process. Also,
since the rebuild may be triggered when doing a remount, make sure that
when we clear an existing free space tree that we clear such flag from
every existing block group, otherwise we would skip those block groups
during the rebuild.

Reported-by: syzbot+d0014fb0fc39c5487ae5@syzkaller.appspotmail.com
Link: https://lore.kernel.org/linux-btrfs/68460a54.050a0220.daf97.0af5.GAE@google.com/
Fixes: 882af9f13e83 ("btrfs: handle free space tree rebuild in multiple transactions")
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: zoned: fix alloc_offset calculation for partly conventional block groups

When one of two zones composing a DUP block group is a conventional zone,
we have the zone_info[i]->alloc_offset = WP_CONVENTIONAL. That will, of
course, not match the write pointer of the other zone, and fails that
block group.

This commit solves that issue by properly recovering the emulated write
pointer from the last allocated extent. The offset for the SINGLE, DUP,
and RAID1 are straight-forward: it is same as the end of last allocated
extent. The RAID0 and RAID10 are a bit tricky that we need to do the math
of striping.

This is the kernel equivalent of Naohiro's user-space commit:
"btrfs-progs: zoned: fix alloc_offset calculation for partly
conventional block groups".

Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: handle csum tree error with rescue=ibadroots correctly

[BUG]
There is syzbot based reproducer that can crash the kernel, with the
following call trace: (With some debug output added)

DEBUG: rescue=ibadroots parsed
BTRFS: device fsid 14d642db-7b15-43e4-81e6-4b8fac6a25f8 devid 1 transid 8 /dev/loop0 (7:0) scanned by repro (1010)
BTRFS info (device loop0): first mount of filesystem 14d642db-7b15-43e4-81e6-4b8fac6a25f8
BTRFS info (device loop0): using blake2b (blake2b-256-generic) checksum algorithm
BTRFS info (device loop0): using free-space-tree
BTRFS warning (device loop0): checksum verify failed on logical 5312512 mirror 1 wanted 0xb043382657aede36608fd3386d6b001692ff406164733d94e2d9a180412c6003 found 0x810ceb2bacb7f0f9eb2bf3b2b15c02af867cb35ad450898169f3b1f0bd818651 level 0
DEBUG: read tree root path failed for tree csum, ret=-5
BTRFS warning (device loop0): checksum verify failed on logical 5328896 mirror 1 wanted 0x51be4e8b303da58e6340226815b70e3a93592dac3f30dd510c7517454de8567a found 0x51be4e8b303da58e634022a315b70e3a93592dac3f30dd510c7517454de8567a level 0
BTRFS warning (device loop0): checksum verify failed on logical 5292032 mirror 1 wanted 0x1924ccd683be9efc2fa98582ef58760e3848e9043db8649ee382681e220cdee4 found 0x0cb6184f6e8799d9f8cb335dccd1d1832da1071d12290dab3b85b587ecacca6e level 0
process 'repro' launched './file2' with NULL argv: empty string added
DEBUG: no csum root, idatacsums=0 ibadroots=134217728
Oops: general protection fault, probably for non-canonical address 0xdffffc0000000041: 0000 [#1] SMP KASAN NOPTI
KASAN: null-ptr-deref in range [0x0000000000000208-0x000000000000020f]
CPU: 5 UID: 0 PID: 1010 Comm: repro Tainted: G           OE       6.15.0-custom+ #249 PREEMPT(full)
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS unknown 02/02/2022
RIP: 0010:btrfs_lookup_csum+0x93/0x3d0 [btrfs]
Call Trace:
  <TASK>
  btrfs_lookup_bio_sums+0x47a/0xdf0 [btrfs]
  btrfs_submit_bbio+0x43e/0x1a80 [btrfs]
  submit_one_bio+0xde/0x160 [btrfs]
  btrfs_readahead+0x498/0x6a0 [btrfs]
  read_pages+0x1c3/0xb20
  page_cache_ra_order+0x4b5/0xc20
  filemap_get_pages+0x2d3/0x19e0
  filemap_read+0x314/0xde0
  __kernel_read+0x35b/0x900
  bprm_execve+0x62e/0x1140
  do_execveat_common.isra.0+0x3fc/0x520
  __x64_sys_execveat+0xdc/0x130
  do_syscall_64+0x54/0x1d0
  entry_SYSCALL_64_after_hwframe+0x76/0x7e
---[ end trace 0000000000000000 ]---

[CAUSE]
Firstly the fs has a corrupted csum tree root, thus to mount the fs we
have to go "ro,rescue=ibadroots" mount option.

Normally with that mount option, a bad csum tree root should set
BTRFS_FS_STATE_NO_DATA_CSUMS flag, so that any future data read will
ignore csum search.

But in this particular case, we have the following call trace that
caused NULL csum root, but not setting BTRFS_FS_STATE_NO_DATA_CSUMS:

load_global_roots_objectid():

ret = btrfs_search_slot();
/* Succeeded */
btrfs_item_key_to_cpu()
found = true;
/* We found the root item for csum tree. */
root = read_tree_root_path();
if (IS_ERR(root)) {
if (!btrfs_test_opt(fs_info, IGNOREBADROOTS))
/*
* Since we have rescue=ibadroots mount option,
* @ret is still 0.
*/
break;
if (!found || ret) {
/* @found is true, @ret is 0, error handling for csum
* tree is skipped.
*/
}

This means we completely skipped to set BTRFS_FS_STATE_NO_DATA_CSUMS if
the csum tree is corrupted, which results unexpected later csum lookup.

[FIX]
If read_tree_root_path() failed, always populate @ret to the error
number.

As at the end of the function, we need @ret to determine if we need to
do the extra error handling for csum tree.

Fixes: abed4aaae4f7 ("btrfs: track the csum, extent, and free space trees in a rb tree")
Reported-by: Zhiyu Zhang <zhiyuzhang999@gmail.com>
Reported-by: Longxing Li <coregee2000@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: fix race between async reclaim worker and close_ctree()

Syzbot reported an assertion failure due to an attempt to add a delayed
iput after we have set BTRFS_FS_STATE_NO_DELAYED_IPUT in the fs_info
state:

  WARNING: CPU: 0 PID: 65 at fs/btrfs/inode.c:3420 btrfs_add_delayed_iput+0x2f8/0x370 fs/btrfs/inode.c:3420
  Modules linked in:
  CPU: 0 UID: 0 PID: 65 Comm: kworker/u8:4 Not tainted 6.15.0-next-20250530-syzkaller #0 PREEMPT(full)
  Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 05/07/2025
  Workqueue: btrfs-endio-write btrfs_work_helper
  RIP: 0010:btrfs_add_delayed_iput+0x2f8/0x370 fs/btrfs/inode.c:3420
  Code: 4e ad 5d (...)
  RSP: 0018:ffffc9000213f780 EFLAGS: 00010293
  RAX: ffffffff83c635b7 RBX: ffff888058920000 RCX: ffff88801c769e00
  RDX: 0000000000000000 RSI: 0000000000000100 RDI: 0000000000000000
  RBP: 0000000000000001 R08: ffff888058921b67 R09: 1ffff1100b12436c
  R10: dffffc0000000000 R11: ffffed100b12436d R12: 0000000000000001
  R13: dffffc0000000000 R14: ffff88807d748000 R15: 0000000000000100
  FS:  0000000000000000(0000) GS:ffff888125c53000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 00002000000bd038 CR3: 000000006a142000 CR4: 00000000003526f0
  DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
  Call Trace:
   <TASK>
   btrfs_put_ordered_extent+0x19f/0x470 fs/btrfs/ordered-data.c:635
   btrfs_finish_one_ordered+0x11d8/0x1b10 fs/btrfs/inode.c:3312
   btrfs_work_helper+0x399/0xc20 fs/btrfs/async-thread.c:312
   process_one_work kernel/workqueue.c:3238 [inline]
   process_scheduled_works+0xae1/0x17b0 kernel/workqueue.c:3321
   worker_thread+0x8a0/0xda0 kernel/workqueue.c:3402
   kthread+0x70e/0x8a0 kernel/kthread.c:464
   ret_from_fork+0x3fc/0x770 arch/x86/kernel/process.c:148
   ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245
   </TASK>

This can happen due to a race with the async reclaim worker like this:

1) The async metadata reclaim worker enters shrink_delalloc(), which calls
   btrfs_start_delalloc_roots() with an nr_pages argument that has a value
   less than LONG_MAX, and that in turn enters start_delalloc_inodes(),
   which sets the local variable 'full_flush' to false because
   wbc->nr_to_write is less than LONG_MAX;

2) There it finds inode X in a root's delalloc list, grabs a reference for
   inode X (with igrab()), and triggers writeback for it with
   filemap_fdatawrite_wbc(), which creates an ordered extent for inode X;

3) The unmount sequence starts from another task, we enter close_ctree()
   and we flush the workqueue fs_info->endio_write_workers, which waits
   for the ordered extent for inode X to complete and when dropping the
   last reference of the ordered extent, with btrfs_put_ordered_extent(),
   when we call btrfs_add_delayed_iput() we don't add the inode to the
   list of delayed iputs because it has a refcount of 2, so we decrement
   it to 1 and return;

4) Shortly after at close_ctree() we call btrfs_run_delayed_iputs() which
   runs all delayed iputs, and then we set BTRFS_FS_STATE_NO_DELAYED_IPUT
   in the fs_info state;

5) The async reclaim worker, after calling filemap_fdatawrite_wbc(), now
   calls btrfs_add_delayed_iput() for inode X and there we trigger an
   assertion failure since the fs_info state has the flag
   BTRFS_FS_STATE_NO_DELAYED_IPUT set.

Fix this by setting BTRFS_FS_STATE_NO_DELAYED_IPUT only after we wait for
the async reclaim workers to finish, after we call cancel_work_sync() for
them at close_ctree(), and by running delayed iputs after wait for the
reclaim workers to finish and before setting the bit.

This race was recently introduced by commit 19e60b2a95f5 ("btrfs: add
extra warning if delayed iput is added when it's not allowed"). Without
the new validation at btrfs_add_delayed_iput(), this described scenario
was safe because close_ctree() later calls btrfs_commit_super(). That
will run any final delayed iputs added by reclaim workers in the window
between the btrfs_run_delayed_iputs() and the the reclaim workers being
shut down.

Reported-by: syzbot+0ed30ad435bf6f5b7a42@syzkaller.appspotmail.com
Link: https://lore.kernel.org/linux-btrfs/6840481c.a00a0220.d4325.000c.GAE@google.com/T/#u
Fixes: 19e60b2a95f5 ("btrfs: add extra warning if delayed iput is added when it's not allowed")
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: fix assertion when building free space tree

When building the free space tree with the block group tree feature
enabled, we can hit an assertion failure like this:

  BTRFS info (device loop0 state M): rebuilding free space tree
  assertion failed: ret == 0, in fs/btrfs/free-space-tree.c:1102
  ------------[ cut here ]------------
  kernel BUG at fs/btrfs/free-space-tree.c:1102!
  Internal error: Oops - BUG: 00000000f2000800 [#1]  SMP
  Modules linked in:
  CPU: 1 UID: 0 PID: 6592 Comm: syz-executor322 Not tainted 6.15.0-rc7-syzkaller-gd7fa1af5b33e #0 PREEMPT
  Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 05/07/2025
  pstate: 60400005 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
  pc : populate_free_space_tree+0x514/0x518 fs/btrfs/free-space-tree.c:1102
  lr : populate_free_space_tree+0x514/0x518 fs/btrfs/free-space-tree.c:1102
  sp : ffff8000a4ce7600
  x29: ffff8000a4ce76e0 x28: ffff0000c9bc6000 x27: ffff0000ddfff3d8
  x26: ffff0000ddfff378 x25: dfff800000000000 x24: 0000000000000001
  x23: ffff8000a4ce7660 x22: ffff70001499cecc x21: ffff0000e1d8c160
  x20: ffff0000e1cb7800 x19: ffff0000e1d8c0b0 x18: 00000000ffffffff
  x17: ffff800092f39000 x16: ffff80008ad27e48 x15: ffff700011e740c0
  x14: 1ffff00011e740c0 x13: 0000000000000004 x12: ffffffffffffffff
  x11: ffff700011e740c0 x10: 0000000000ff0100 x9 : 94ef24f55d2dbc00
  x8 : 94ef24f55d2dbc00 x7 : 0000000000000001 x6 : 0000000000000001
  x5 : ffff8000a4ce6f98 x4 : ffff80008f415ba0 x3 : ffff800080548ef0
  x2 : 0000000000000000 x1 : 0000000100000000 x0 : 000000000000003e
  Call trace:
   populate_free_space_tree+0x514/0x518 fs/btrfs/free-space-tree.c:1102 (P)
   btrfs_rebuild_free_space_tree+0x14c/0x54c fs/btrfs/free-space-tree.c:1337
   btrfs_start_pre_rw_mount+0xa78/0xe10 fs/btrfs/disk-io.c:3074
   btrfs_remount_rw fs/btrfs/super.c:1319 [inline]
   btrfs_reconfigure+0x828/0x2418 fs/btrfs/super.c:1543
   reconfigure_super+0x1d4/0x6f0 fs/super.c:1083
   do_remount fs/namespace.c:3365 [inline]
   path_mount+0xb34/0xde0 fs/namespace.c:4200
   do_mount fs/namespace.c:4221 [inline]
   __do_sys_mount fs/namespace.c:4432 [inline]
   __se_sys_mount fs/namespace.c:4409 [inline]
   __arm64_sys_mount+0x3e8/0x468 fs/namespace.c:4409
   __invoke_syscall arch/arm64/kernel/syscall.c:35 [inline]
   invoke_syscall+0x98/0x2b8 arch/arm64/kernel/syscall.c:49
   el0_svc_common+0x130/0x23c arch/arm64/kernel/syscall.c:132
   do_el0_svc+0x48/0x58 arch/arm64/kernel/syscall.c:151
   el0_svc+0x58/0x17c arch/arm64/kernel/entry-common.c:767
   el0t_64_sync_handler+0x78/0x108 arch/arm64/kernel/entry-common.c:786
   el0t_64_sync+0x198/0x19c arch/arm64/kernel/entry.S:600
  Code: f0047182 91178042 528089c3 9771d47b (d4210000)
  ---[ end trace 0000000000000000 ]---

This happens because we are processing an empty block group, which has
no extents allocated from it, there are no items for this block group,
including the block group item since block group items are stored in a
dedicated tree when using the block group tree feature. It also means
this is the block group with the highest start offset, so there are no
higher keys in the extent root, hence btrfs_search_slot_for_read()
returns 1 (no higher key found).

Fix this by asserting 'ret' is 0 only if the block group tree feature
is not enabled, in which case we should find a block group item for
the block group since it's stored in the extent root and block group
item keys are greater than extent item keys (the value for
BTRFS_BLOCK_GROUP_ITEM_KEY is 192 and for BTRFS_EXTENT_ITEM_KEY and
BTRFS_METADATA_ITEM_KEY the values are 168 and 169 respectively).
In case 'ret' is 1, we just need to add a record to the free space
tree which spans the whole block group, and we can achieve this by
making 'ret == 0' as the while loop's condition.

Reported-by: syzbot+36fae25c35159a763a2a@syzkaller.appspotmail.com
Link: https://lore.kernel.org/linux-btrfs/6841dca8.a00a0220.d4325.0020.GAE@google.com/
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: don't silently ignore unexpected extent type when replaying log

If there's an unexpected (invalid) extent type, we just silently ignore
it. This means a corruption or some bug somewhere, so instead return
-EUCLEAN to the caller, making log replay fail, and print an error message
with relevant information.

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: fix invalid inode pointer dereferences during log replay

In a few places where we call read_one_inode(), if we get a NULL pointer
we end up jumping into an error path, or fallthrough in case of
__add_inode_ref(), where we then do something like this:

iput(&inode->vfs_inode);

which results in an invalid inode pointer that triggers an invalid memory
access, resulting in a crash.

Fix this by making sure we don't do such dereferences.

Fixes: b4c50cbb01a1 ("btrfs: return a btrfs_inode from read_one_inode()")
CC: stable@vger.kernel.org # 6.15+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: fix double unlock of buffer_tree xarray when releasing subpage eb

If we break out of the loop because an extent buffer doesn't have the bit
EXTENT_BUFFER_TREE_REF set, we end up unlocking the xarray twice, once
before we tested for the bit and break out of the loop, and once again
after the loop.

Fix this by testing the bit and exiting before unlocking the xarray.
The time spent testing the bit is negligible and it's not worth trying
to do that outside the critical section delimited by the xarray lock due
to the code complexity required to avoid it (like using a local boolean
variable to track whether the xarray is locked or not). The xarray unlock
only needs to be done before calling release_extent_buffer(), as that
needs to lock the xarray (through xa_cmpxchg_irq()) and does a more
significant amount of work.

Fixes: 19d7f65f032f ("btrfs: convert the buffer_radix to an xarray")
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Link: https://lore.kernel.org/linux-btrfs/aDRNDU0GM1_D4Xnw@stanley.mountain/
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: update superblock's device bytes_used when dropping chunk

Each superblock contains a copy of the device item for that device. In a
transaction which drops a chunk but doesn't create any new ones, we were
correctly updating the device item in the chunk tree but not copying
over the new bytes_used value to the superblock.

This can be seen by doing the following:

  # dd if=/dev/zero of=test bs=4096 count=2621440
  # mkfs.btrfs test
  # mount test /root/temp

  # cd /root/temp
  # for i in {00..10}; do dd if=/dev/zero of=$i bs=4096 count=32768; done
  # sync
  # rm *
  # sync
  # btrfs balance start -dusage=0 .
  # sync

  # cd
  # umount /root/temp
  # btrfs check test

For btrfs-check to detect this, you will also need my patch at
https://github.com/kdave/btrfs-progs/pull/991.

Change btrfs_remove_dev_extents() so that it adds the devices to the
fs_info->post_commit_list if they're not there already. This causes
btrfs_commit_device_sizes() to be called, which updates the bytes_used
value in the superblock.

Fixes: bbbf7243d62d ("btrfs: combine device update operations during transaction commit")
CC: stable@vger.kernel.org # 5.10+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Mark Harmstone <maharmstone@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: fix a race between renames and directory logging

We have a race between a rename and directory inode logging that if it
happens and we crash/power fail before the rename completes, the next time
the filesystem is mounted, the log replay code will end up deleting the
file that was being renamed.

This is best explained following a step by step analysis of an interleaving
of steps that lead into this situation.

Consider the initial conditions:

1) We are at transaction N;

2) We have directories A and B created in a past transaction (< N);

3) We have inode X corresponding to a file that has 2 hardlinks, one in
   directory A and the other in directory B, so we'll name them as
   "A/foo_link1" and "B/foo_link2". Both hard links were persisted in a
   past transaction (< N);

4) We have inode Y corresponding to a file that as a single hard link and
   is located in directory A, we'll name it as "A/bar". This file was also
   persisted in a past transaction (< N).

The steps leading to a file loss are the following and for all of them we
are under transaction N:

1) Link "A/foo_link1" is removed, so inode's X last_unlink_trans field
    is updated to N, through btrfs_unlink() -> btrfs_record_unlink_dir();

2) Task A starts a rename for inode Y, with the goal of renaming from
    "A/bar" to "A/baz", so we enter btrfs_rename();

3) Task A inserts the new BTRFS_INODE_REF_KEY for inode Y by calling
    btrfs_insert_inode_ref();

4) Because the rename happens in the same directory, we don't set the
    last_unlink_trans field of directoty A's inode to the current
    transaction id, that is, we don't cal btrfs_record_unlink_dir();

5) Task A then removes the entries from directory A (BTRFS_DIR_ITEM_KEY
    and BTRFS_DIR_INDEX_KEY items) when calling __btrfs_unlink_inode()
    (actually the dir index item is added as a delayed item, but the
    effect is the same);

6) Now before task A adds the new entry "A/baz" to directory A by
    calling btrfs_add_link(), another task, task B is logging inode X;

7) Task B starts a fsync of inode X and after logging inode X, at
    btrfs_log_inode_parent() it calls btrfs_log_all_parents(), since
    inode X has a last_unlink_trans value of N, set at in step 1;

8) At btrfs_log_all_parents() we search for all parent directories of
    inode X using the commit root, so we find directories A and B and log
    them. Bu when logging direct A, we don't have a dir index item for
    inode Y anymore, neither the old name "A/bar" nor for the new name
    "A/baz" since the rename has deleted the old name but has not yet
    inserted the new name - task A hasn't called yet btrfs_add_link() to
    do that.

    Note that logging directory A doesn't fallback to a transaction
    commit because its last_unlink_trans has a lower value than the
    current transaction's id (see step 4);

9) Task B finishes logging directories A and B and gets back to
    btrfs_sync_file() where it calls btrfs_sync_log() to persist the log
    tree;

10) Task B successfully persisted the log tree, btrfs_sync_log() completed
    with success, and a power failure happened.

    We have a log tree without any directory entry for inode Y, so the
    log replay code deletes the entry for inode Y, name "A/bar", from the
    subvolume tree since it doesn't exist in the log tree and the log
    tree is authorative for its index (we logged a BTRFS_DIR_LOG_INDEX_KEY
    item that covers the index range for the dentry that corresponds to
    "A/bar").

    Since there's no other hard link for inode Y and the log replay code
    deletes the name "A/bar", the file is lost.

The issue wouldn't happen if task B synced the log only after task A
called btrfs_log_new_name(), which would update the log with the new name
for inode Y ("A/bar").

Fix this by pinning the log root during renames before removing the old
directory entry, and unpinning after btrfs_log_new_name() is called.

Fixes: 259c4b96d78d ("btrfs: stop doing unnecessary log updates during a rename")
CC: stable@vger.kernel.org # 5.18+
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: scrub: add prefix for the error messages

Add a "scrub: " prefix to all messages logged by scrub so that it's
easy to filter them from dmesg for analysis.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: warn if leaking delayed_nodes in btrfs_put_root()

Add a warning for leaked delayed_nodes when putting a root. We currently
do this for inodes, but not delayed_nodes.

Signed-off-by: Leo Martins <loemra.dev@gmail.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
[ Remove the changelog from the commit message. ]
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: fix delayed ref refcount leak in debug assertion

If the delayed_root is not empty we are increasing the number of
references to a delayed_node without decreasing it, causing a leak. Fix
by decrementing the delayed_node reference count.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Leo Martins <loemra.dev@gmail.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
[ Remove the changelog from the commit message. ]
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: include root in error message when unlinking inode

To help debugging include the root number in the error message, and since
this is a critical error that implies a metadata inconsistency and results
in a transaction abort change the log message level from "info" to
"critical", which is a much better fit.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: don't drop a reference if btrfs_check_write_meta_pointer() fails

In the zoned mode there's a bug in the extent buffer tree conversion to
xarray. The reference for eb is dropped and code continues but the
references get dropped by releasing the batch.

Reported-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Fixes: 19d7f65f032f ("btrfs: convert the buffer_radix to an xarray")
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Tested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: move misplaced comment of btrfs_path::keep_locks

Commit 925baeddc5b0 ("Btrfs: Start btree concurrency work.") added the
comment for the field keep_locks. This got moved later but without the
comment, so move it to the right place and fix the comment style.

Signed-off-by: Sun YangKai <sunk67188@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: remove standalone "nologreplay" mount option

Standalone "nologreplay" mount option has been marked deprecated since
commit 74ef00185eb8 ("btrfs: introduce "rescue=" mount option"), which
dates back to v5.9 (2020).

Furthermore there is no other filesystem with the same named mount
option, so this one is btrfs specific and we will not hit the same
problem when removing "norecovery" mount option.

So let's remove the standalone "nologreplay" mount option.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: use a single variable to track return value at btrfs_page_mkwrite()

We have two variables to track return values, ret and ret2, with types
vm_fault_t (an unsigned int type) and int, which makes it a bit confusing
and harder to keep track. So use a single variable, of type int, and under
the 'out' label return vmf_error(ret) in case ret contains an error,
otherwise return VM_FAULT_NOPAGE. This is equivalent to what we had before
and it's simpler.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: don't return VM_FAULT_SIGBUS on failure to set delalloc for mmap write

If the call to btrfs_set_extent_delalloc() fails we are always returning
VM_FAULT_SIGBUS, which is odd since the error means "bad access" and the
most likely cause for btrfs_set_extent_delalloc() is -ENOMEM, which should
be translated to VM_FAULT_OOM.

Instead of returning VM_FAULT_SIGBUS return vmf_error(ret2), which gives
us a more appropriate return value, and we use that everywhere else too.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: simplify early error checking in btrfs_page_mkwrite()

We have this entangled error checks early at btrfs_page_mkwrite():

1) Try to reserve delalloc space by calling btrfs_delalloc_reserve_space()
   and storing the return value in the ret2 variable;

2) If the reservation succeed, call file_update_time() and store the
   return value in ret2 and also set the local variable 'reserved' to
   true (1);

3) Then do an error check on ret2 to see if any of the previous calls
   failed and if so, jump either to the 'out' label or to the
   'out_noreserve' label, depending on whether 'reserved' is true or
   not.

This is unnecessarily complex. Instead change this to a simpler and
more straightforward approach:

1) Call btrfs_delalloc_reserve_space(), if that returns an error jump to
   the 'out_noreserve' label;

2) The call file_update_time() and if that returns an error jump to the
   'out' label.

Like this there's less nested if statements, no need to use a local
variable to track if space was reserved and if statements are used only
to check errors.

Also move the call to extent_changeset_free() out of the 'out_noreserve'
label and under the 'out' label  since the changeset is allocated only if
the call to reserve delalloc space succeeded.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: pass true to btrfs_delalloc_release_space() at btrfs_page_mkwrite()

In the last call to btrfs_delalloc_release_space() where the value of the
variable 'ret' is never zero, we pass the expression 'ret != 0' as the
value for the argument 'qgroup_free', which always evaluates to true.
Make this less confusing and more clear by explicitly passing true
instead.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: fix wrong start offset for delalloc space release during mmap write

If we're doing a mmap write against a folio that has i_size somewhere in
the middle and we have multiple sectors in the folio, we may have to
release excess space previously reserved, for the range going from the
rounded up (to sector size) i_size to the folio's end offset. We are
calculating the right amount to release and passing it to
btrfs_delalloc_release_space(), but we are passing the wrong start offset
of that range - we're passing the folio's start offset instead of the
end offset, plus 1, of the range for which we keep the reservation. This
may result in releasing more space then we should and eventually trigger
an underflow of the data space_info's bytes_may_use counter.

So fix this by passing the start offset as 'end + 1' instead of
'page_start' to btrfs_delalloc_release_space().

Fixes: d0b7da88f640 ("Btrfs: btrfs_page_mkwrite: Reserve space in sectorsized units")
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: fix harmless race getting delayed ref head count when running delayed refs

When running delayed references we are reading the number of ready delayed
ref heads without taking any lock which can make KCSAN report a race since
we can have concurrent tasks updating that number, such as for example
when freeing a tree block which will end up decrementing that counter or
when adding a new delayed ref while COWing a tree block which will
increment that counter.

This is a harmless race since running one more or one less delayed ref
head doesn't result in any problem, in the critical section of a
transaction commit we always run any remaining delayed refs and at that
point no one can create more.

So fix this harmless race by annotating the read with data_race().

Reported-by: cen zhang <zzzccc427@gmail.com>
Link: https://lore.kernel.org/linux-btrfs/CAFRLqsUCLMz0hY-GaPj1Z=fhkgRHjxVXHZ8kz0PvkFN0b=8L2Q@mail.gmail.com/
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: log error codes during failures when writing super blocks

When writing super blocks, at write_dev_supers(), we log an error message
when we get some error but we don't show which error we got and we have
that information. So enhance the error messages with the error codes.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: simplify error return logic when getting folio at prepare_one_folio()

There's no need to have special logic to return -EAGAIN in case the call
to __filemap_get_folio() fails, because when FGP_NOWAIT is passed to
__filemap_get_folio() it returns ERR_PTR(-EAGAIN) if it needs to do
something that would imply blocking.

The reason we have this logic is from the days before we migrated to the
folio interface, when we called pagecache_get_page() which would return
NULL instead of an error pointer.

So remove this special casing and always return the error that the call
to __filemap_get_folio() returned.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: return real error from __filemap_get_folio() calls

We have a few places that always assume a -ENOMEM error happened in case a
call to __filemap_get_folio() returns an error, which is just too much of
an assumption and even if it would be the case at some point in time, it's
not future proof and there's nothing in the documentation that guarantees
that only ERR_PTR(-ENOMEM) can be returned with the flags we are passing
to it.

So use the exact error returned by __filemap_get_folio() instead.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: remove superfluous return value check at btrfs_dio_iomap_begin()

In the if statement that checks the return value from
btrfs_check_data_free_space(), there's no point to check if 'ret' is not
zero in the else branch, since the main if branch checked that it's zero,
so in the else branch it necessarily has a non-zero value.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: fix invalid data space release when truncating block in NOCOW mode

If when truncating a block we fail to reserve data space and then we
proceed anyway because we can do a NOCOW write, if we later get an error
when trying to get the folio from the inode's mapping, we end up releasing
data space that we haven't reserved, screwing up the bytes_may_use counter
from the data space_info, eventually resulting in an underflow when all
other reservations done by other tasks are released, if any, or right away
if there are no other reservations at the moment.

This is because when we get an error when trying to grab the block's folio
we call btrfs_delalloc_release_space(), which releases metadata (which we
have reserved) and data (which we haven't reserved).

Fix this by calling btrfs_delalloc_release_space() only if we did reserve
data space, that is, if we aren't falling back to NOCOW, meaning the local
variable @only_release_metadata has a false value, otherwise release only
metadata by calling btrfs_delalloc_release_metadata().

Fixes: 6d4572a9d71d ("btrfs: allow btrfs_truncate_block() to fallback to nocow for data space reservation")
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: update Kconfig option descriptions

Expand what the options do and if they are OK to be enabled.

Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: update list of features built under experimental config

The list is out of date, the extent shrinker got fixed in 6.13. Add new
entries: the COW fixup warning in 6.15, rund robin policies in 6.14.

Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: send: remove btrfs_debug() calls

There are debugging prints for each emitted send command and other
related actions. This does not seem right as the number of commands can
be high and dumping that to the system log will likely hit some rate
limiting. This should be done by trace points that are more lightweight
and can keep up with high frequency.

Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: use boolean for delalloc argument to btrfs_free_reserved_extent()

We are using an integer for the 'delalloc' argument but all we need is a
boolean, so switch the type to 'bool' and rename the parameter to
'is_delalloc' to better match the fact that it's a boolean.

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: use boolean for delalloc argument to btrfs_free_reserved_bytes()

We are using an integer for the 'delalloc' argument but all we need is a
boolean, so switch the type to 'bool' and rename the parameter to
'is_delalloc' to better match the fact that it's a boolean.

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: fold error checks when allocating ordered extent and update comments

Instead of having an error check and return on each branch of the if
statement, move the error check to happen after that if branch, reducing
source code and object code sizes.

Before this change:

   $ size fs/btrfs/btrfs.ko
      text    data     bss     dec     hex filename
   1840174 163742   16136 2020052 1ed2d4 fs/btrfs/btrfs.ko

After this change:

   $ size fs/btrfs/btrfs.ko
      text    data     bss     dec     hex filename
   1840138 163742   16136 2020016 1ed2b0 fs/btrfs/btrfs.ko

While at it and moving the comments, update the comments to be more clear
about how qgroup reserved space is released and the intricacies of how
it's managed for COW writes.

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: check we grabbed inode reference when allocating an ordered extent

When allocating an ordered extent we call igrab() to get a reference on
the inode and attach it to the ordered extent. For an ordered extent we
always must have an inode reference since we during its life cycle we
need to access the inode for several things like for example:

* Inserting the ordered extent right after allocating it, when calling
  insert_ordered_extent() - we need to lock the inode's ordered_tree_lock;

* In the bio submission path we need to add checksums to the ordered
  extent and we end up at btrfs_add_ordered_sum(), where again we need
  to grab the inode from the ordered extent to lock the inode's
  ordered_tree_lock;

* When finishing an ordered extent, at btrfs_finish_ordered_extent(), we
  need again to access its inode in order to lock the inode's
  ordered_tree_lock;

* Etc etc etc.

Everywhere we deal with an ordered extent we always expect its inode to
be not NULL, the only exception being btrfs_put_ordered_extent() where
we check if it's NULL before calling btrfs_add_delayed_iput(), even though
we have already assumed it's not NULL when calling the tracepoint
trace_btrfs_ordered_extent_put() since the tracepoint dereferences the
inode to extract its number and root without ever checking it's NULL.

The igrab() call can return NULL if the inode is about to be freed or is
being freed (its state has I_FREEING or I_WILL_FREE set), and that's why
there's such check at btrfs_put_ordered_extent(). The igrab() and NULL
check were introduced in commit 5fd02043553b ("Btrfs: finish ordered
extents in their own thread") but even back then we always needed and
assumed igrab() returned a non-NULL pointer, since for example when
removing an ordered extent, at btrfs_remove_ordered_extent(), we assumed
the inode pointer was not NULL in order to access the inode's ordered
extent tree.

In fact whenever we allocate an ordered extent we are holding an inode
reference and the inode is not being freed or going to be freed (which
happens in the final iput), and since we depend on the inode for the
life cycle of the ordered extent, just make ordered extent allocation
to fail in case igrab() returns NULL and trigger a warning, to make it
clear it's not expected. This allows to remove the confusing NULL inode
check at btrfs_put_ordered_extent().

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: fix qgroup reservation leak on failure to allocate ordered extent

If we fail to allocate an ordered extent for a COW write we end up leaking
a qgroup data reservation since we called btrfs_qgroup_release_data() but
we didn't call btrfs_qgroup_free_refroot() (which would happen when
running the respective data delayed ref created by ordered extent
completion or when finishing the ordered extent in case an error happened).

So make sure we call btrfs_qgroup_free_refroot() if we fail to allocate an
ordered extent for a COW write.

Fixes: 7dbeaad0af7d ("btrfs: change timing for qgroup reserved space for ordered extents to fix reserved space leak")
CC: stable@vger.kernel.org # 6.1+
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: scrub: reduce memory usage of struct scrub_sector_verification

That structure records needed info for block verification (either data
checksum pointer, or expected tree block generation).

But there is also a boolean to tell if this block belongs to a metadata
or not, as the data checksum pointer and expected tree block generation
is already a union, we need a dedicated bit to tell if this block is a
metadata or not.

However such layout means we're wasting 63 bits for x86_64, which is a
huge memory waste.

Thanks to the recent bitmap aggregation, we can easily move this
single-bit-per-block member to a new sub-bitmap.
And since we already have six 16 bits long bitmaps, adding another
bitmap won't even increase any memory usage for x86_64, as we need two
64 bits long anyway.

This will reduce the following memory usages:

- sizeof(struct scrub_sector_verification)
  From 16 bytes to 8 bytes on x86_64.

- scrub_stripe::sectors
  From 16 * 16 to 16 * 8 bytes.

- Per-device scrub_ctx memory usage
  From 128 * (16 * 16) to 128 * (16 * 8), which saves 16KiB memory.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: handle aligned EOF truncation correctly for subpage cases

[BUG]
For the following fsx -e 1 run, the btrfs still fails the run on 64K
page size with 4K fs block size:

  READ BAD DATA: offset = 0x26b3a, size = 0xfafa, fname = /mnt/btrfs/junk
  OFFSET      GOOD    BAD     RANGE
  0x26b3a     0x0000  0x15b4  0x0
  operation# (mod 256) for the bad data may be 21
  [...]
  LOG DUMP (28 total operations):
  1(  1 mod 256): SKIPPED (no operation)
  2(  2 mod 256): SKIPPED (no operation)
  3(  3 mod 256): SKIPPED (no operation)
  4(  4 mod 256): SKIPPED (no operation)
  5(  5 mod 256): WRITE    0x1ea90 thru 0x285e0 (0x9b51 bytes) HOLE
  6(  6 mod 256): ZERO     0x1b1a8 thru 0x20bd4 (0x5a2d bytes)
  7(  7 mod 256): FALLOC   0x22b1a thru 0x272fa (0x47e0 bytes) INTERIOR
  8(  8 mod 256): WRITE    0x741d thru 0x13522 (0xc106 bytes)
  9(  9 mod 256): MAPWRITE 0x73ee thru 0xdeeb (0x6afe bytes)
  10( 10 mod 256): FALLOC   0xb719 thru 0xb994 (0x27b bytes) INTERIOR
  11( 11 mod 256): COPY 0x15ed8 thru 0x18be1 (0x2d0a bytes) to 0x25f6e thru 0x28c77
  12( 12 mod 256): ZERO     0x1615e thru 0x1770e (0x15b1 bytes)
  13( 13 mod 256): SKIPPED (no operation)
  14( 14 mod 256): DEDUPE 0x20000 thru 0x27fff (0x8000 bytes) to 0x1000 thru 0x8fff
  15( 15 mod 256): SKIPPED (no operation)
  16( 16 mod 256): CLONE 0xa000 thru 0xffff (0x6000 bytes) to 0x36000 thru 0x3bfff
  17( 17 mod 256): ZERO     0x14adc thru 0x1b78a (0x6caf bytes)
  18( 18 mod 256): TRUNCATE DOWN from 0x3c000 to 0x1e2e3 ******WWWW
  19( 19 mod 256): CLONE 0x4000 thru 0x11fff (0xe000 bytes) to 0x16000 thru 0x23fff
  20( 20 mod 256): FALLOC   0x311e1 thru 0x3681b (0x563a bytes) PAST_EOF
  21( 21 mod 256): FALLOC   0x351c5 thru 0x40000 (0xae3b bytes) EXTENDING
  22( 22 mod 256): WRITE    0x920 thru 0x7e51 (0x7532 bytes)
  23( 23 mod 256): COPY 0x2b58 thru 0xc508 (0x99b1 bytes) to 0x117b1 thru 0x1b161
  24( 24 mod 256): TRUNCATE DOWN from 0x40000 to 0x3c9a5
  25( 25 mod 256): SKIPPED (no operation)
  26( 26 mod 256): MAPWRITE 0x25020 thru 0x26b06 (0x1ae7 bytes)
  27( 27 mod 256): SKIPPED (no operation)
  28( 28 mod 256): READ     0x26b3a thru 0x36633 (0xfafa bytes) ***RRRR***

[CAUSE]
The involved operations are:

  fallocating to largest ever: 0x40000
  21 pollute_eof 0x24000 thru 0x2ffff (0xc000 bytes)
  21 falloc from 0x351c5 to 0x40000 (0xae3b bytes)
  28 read 0x26b3a thru 0x36633 (0xfafa bytes)

At operation #21 a pollute_eof is done, by memory mapped write into
range [0x24000, 0x2ffff).
At this stage, the inode size is 0x24000, which is block aligned.

Then fallocate happens, and since it's expanding the inode, it will call
btrfs_truncate_block() to truncate any unaligned range.

But since the inode size is already block aligned,
btrfs_truncate_block() does nothing and exits.

However remember the folio at 0x20000 has some range polluted already,
although it will not be written back to disk, it still affects the
page cache, resulting the later operation #28 to read out the polluted
value.

[FIX]
Instead of early exit from btrfs_truncate_block() if the range is
already block aligned, do extra filio zeroing if the fs block size is
smaller than the page size and we're truncating beyond EOF.

This is to address exactly the above case where memory mapped write can
still leave some garbage beyond EOF.

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: handle unaligned EOF truncation correctly for subpage cases

[BUG]
The following fsx sequence will fail on btrfs with 64K page size and 4K
fs block size:

  #fsx -d -e 1 -N 4 $mnt/junk -S 36386
  READ BAD DATA: offset = 0xe9ba, size = 0x6dd5, fname = /mnt/btrfs/junk
  OFFSET      GOOD    BAD     RANGE
  0xe9ba      0x0000  0x03ac  0x0
  operation# (mod 256) for the bad data may be 3
  ...
  LOG DUMP (4 total operations):
  1(  1 mod 256): WRITE    0x6c62 thru 0x1147d (0xa81c bytes) HOLE ***WWWW
  2(  2 mod 256): TRUNCATE DOWN from 0x1147e to 0x5448 ******WWWW
  3(  3 mod 256): ZERO     0x1c7aa thru 0x28fe2 (0xc839 bytes)
  4(  4 mod 256): MAPREAD  0xe9ba thru 0x1578e (0x6dd5 bytes) ***RRRR***

[CAUSE]
Only 2 operations are really involved in this case:

  3 pollute_eof 0x5448 thru 0xffff (0xabb8 bytes)
  3 zero from 0x1c7aa to 0x28fe3, (0xc839 bytes)
  4 mapread 0xe9ba thru 0x1578e (0x6dd5 bytes)

At operation 3, fsx pollutes beyond EOF, that is done by mmap()
and write into that mmap() range beyond EOF.

Such write will fill the range beyond EOF, but it will never reach disk
as ranges beyond EOF will not be marked dirty nor uptodate.

Then we zero_range for [0x1c7aa, 0x28fe3], and since the range is beyond
our isize (which was 0x5448), we should zero out any range beyond
EOF (0x5448).

During btrfs_zero_range(), we call btrfs_truncate_block() to dirty the
unaligned head block.
But that function only really zeroes out the block at [0x5000, 0x5fff], it
doesn't bother any range other that that block, since those ranges will
not be marked dirty nor written back.

So the range [0x6000, 0xffff] is still polluted, and later mapread()
will return the poisoned value.

[FIX]
Enhance btrfs_truncate_block() by:

- Pass a @start/@end pair to indicate the full truncation range
  This is to handle the following truncation case:

    Page size is 64K, fs block size is 4K, truncate range is
    [6K, 60K]

    0                      32K                    64K
    |   |///////////////////////////////////|     |
        6K                                  60K

    The range is not aligned for its head block, so we need to call
    btrfs_truncate_block() with @from = 6K, @front = 0, @len = 0.

    But with that information we only know to zero the range [6K, 8K),
    if we zero out the range [6K, 64K), the last block will also be
    zeroed, causing data loss.

  So here we need the full range we're truncating, so that we can avoid
  over-truncation.

- Rename @from to @offset
  As now the parameter is only utilized to locate a block, it's not
  really carrying the old @from meaning well.

- Remove @front parameter
  With the full truncate range passed in, we can determine if the
  @offset is at the head or tail block.

- Skip truncation if @offset is not in the head nor tail blocks
  The call site in hole punch unconditionally call
  btrfs_truncate_block() without even checking the range is aligned or
  not.
  If the @offset is neither in the head nor in tail block, it means we can
  safely ignore it.

- Skip truncate if the range inside the target block is already aligned

- Make btrfs_truncate_block() zero all blocks beyond EOF
  Since we have the original range, we know exactly if we're doing
  truncation beyond EOF (the @end will be (u64)-1).

  If we're doing truncation beyond EOF, then enlarge the truncation
  range to the folio end, to address the possibly polluted ranges.

  Otherwise still keep the zero range inside the block, as we can have
  large data folios soon, always truncating every blocks inside the same
  folio can be costly for large folios.

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: fix broken drop_caches on extent buffer folios

The (correct) commit e41c81d0d30e ("mm/truncate: Replace page_mapped()
call in invalidate_inode_page()") replaced the page_mapped(page) check
with a refcount check. However, this refcount check does not work as
expected with drop_caches for btrfs's metadata pages.

Btrfs has a per-sb metadata inode with cached pages, and when not in
active use by btrfs, they have a refcount of 3. One from the initial
call to alloc_pages(), one (nr_pages == 1) from filemap_add_folio(), and
one from folio_attach_private(). We would expect such pages to get dropped
by drop_caches. However, drop_caches calls into mapping_evict_folio() via
mapping_try_invalidate() which gets a reference on the folio with
find_lock_entries(). As a result, these pages have a refcount of 4, and
fail this check.

For what it's worth, such pages do get reclaimed under memory pressure,
so I would say that while this behavior is surprising, it is not really
dangerously broken.

When I asked the mm folks about the expected refcount in this case, I
was told that the correct thing to do is to donate the refcount from the
original allocation to the page cache after inserting it.

Therefore, attempt to fix this by adding a put_folio() to the critical
spot in alloc_extent_buffer() where we are sure that we have really
allocated and attached new pages. We must also adjust
folio_detach_private() to properly handle being the last reference to the
folio and not do a use-after-free after folio_detach_private().

extent_buffers allocated by clone_extent_buffer() and
alloc_dummy_extent_buffer() are unmapped, so this transfer of ownership
from allocation to insertion in the mapping does not apply to them.
However, we can still folio_put() them safely once they are finished
being allocated and have called folio_attach_private().

Finally, removing the generic put_folio() for the allocation from
btrfs_detach_extent_buffer_folios() means we need to be careful to do
the appropriate put_folio() in allocation failure paths in
alloc_extent_buffer(), clone_extent_buffer() and
alloc_dummy_extent_buffer().

Link: https://lore.kernel.org/linux-mm/ZrwhTXKzgDnCK76Z@casper.infradead.org/
Tested-by: Klara Modin <klarasmodin@gmail.com>
Reviewed-by: Daniel Vacek <neelx@suse.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: use verbose assert at peek_discard_list()

We now have a verbose variant of ASSERT() so that we can print the value
of the block group's discard_index. So use it for better problem analysis
in case the assertion is triggered.

Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Daniel Vacek <neelx@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: scrub: aggregate small bitmaps into a larger one

Currently we have several small bitmaps inside scrub_stripe:

- extent_sector_bitmap
- error_bitmap
- io_error_bitmap
- csum_error_bitmap
- meta_error_bitmap
- meta_gen_error_bitmap

All those bitmaps are at most 16 bits long, but unsigned long is
either 32 or 64 (more common) bits.

This means we're wasting 1/2 or 3/4 space for each bitmap.

And we can have 128 scrub_stripe for each device, such wasted space adds up
quickly.

Instead of using a single unsigned long for each bitmap, aggregate them
into a larger bitmap, just like what we're doing for subpage support.

This reduces 24 bytes from each scrub_stripe structure on x86_64
systems.

This will need a lot of macros converting direct bitmap/bit operations into
our scrub_stripe specific helpers, but all those helpers are very small
and can be inlined.

So overall the overhead shouldn't be that huge, and we save quite some
memory space.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: scrub: fix a wrong error type when metadata bytenr mismatches

When the bytenr doesn't match for a metadata tree block, we will report
it as an csum error, which is incorrect and should be reported as a
metadata error instead.

Fixes: a3ddbaebc7c9 ("btrfs: scrub: introduce a helper to verify one metadata block")
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: defrag: use list_last_entry() at defrag_collect_targets()

Instead of using list_entry() against the list's prev entry, use
list_last_entry(), which removes the need to know the last member is
accessed through the prev list pointer and the naming makes it easier
to reason about what we are doing.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: simplify csum list release at btrfs_put_ordered_extent()

Instead of extracting each element by grabbing the list's first member in
a local list_head variable, then extracting the csum with list_entry() and
iterating with a while loop checking for list emptyness, use the iteration
helper list_for_each_entry_safe(). This also removes the need to delete
elements from the list with list_del() since the ordered extent is freed
immediately after.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: simplify extracting delayed node at btrfs_first_prepared_delayed_node()

Instead of grabbing the next pointer from the list and then doing a
list_entry() call, we can simply use list_first_entry(), removing the need
for list_head variable.

Also there's no need to check if the list is empty before attempting to
extract the first element, we can use list_first_entry_or_null(), removing
the need for a special if statement and the 'out' label.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: simplify extracting delayed node at btrfs_first_delayed_node()

Instead of grabbing the next pointer from the list and then doing a
list_entry() call, we can simply use list_first_entry(), removing the need
for list_head variable.

Also there's no need to check if the list is empty before attempting to
extract the first element, we can use list_first_entry_or_null(), removing
the need for a special if statement and the 'out' label.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: raid56: use list_last_entry() at cache_rbio()

Instead of using list_entry() against the list's prev entry, use
list_last_entry(), which removes the need to know the last member is
accessed through the prev list pointer and the naming makes it easier
to reason about what we are doing.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: simplify cow only root list extraction during transaction commit

There's no need to keep a local variable to extract the first member of
the list and then do a list_entry() call, we can use list_first_entry()
instead, removing the need for the temporary variable and extracting the
first element in a single step.

Also, there's no need to do a list_del_init() followed by list_add_tail(),
instead we can use list_move_tail(). We are in transaction commit critical
section where we don't need to worry about concurrency and that's why we
don't take any locks and can use list_move_tail() (we do assert early at
commit_cowonly_roots() that we are in the critical section, that the
transaction's state is TRANS_STATE_COMMIT_DOING).

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: simplify getting and extracting previous transaction at clean_pinned_extents()

Instead of detecting if there is a previous transaction by comparing the
current transaction's list prev member to the head of the transaction
list (fs_info->trans_list), use the list_is_first() helper which contains
that logic and the naming makes sense since a new transaction is always
added to the end of the list fs_info->trans_list with list_add_tail().

We are also extracting the previous transaction with list_last_entry()
against the transaction, which is correct but confusing because that
function is usually meant to be used against a pointer to the start of a
list and not a member of a list. It is easier to reason by either calling
list_first_entry() against the list fs_info->trans_list, since we can
never have more than two transactions in the list, or by calling
list_prev_entry() against the transaction. So change that to use the later
method.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: simplify getting and extracting previous transaction during commit

Instead of detecting if there is a previous transaction by comparing the
current transaction's list prev member to the head of the transaction
list (fs_info->trans_list), use the list_is_first() helper which contains
that logic and the naming makes sense since a new transaction is always
added to the end of the list fs_info->trans_list with list_add_tail().

And instead of extracting the previous transaction with the more generic
list_entry() helper against the current transaction's list prev member,
use the more specific list_prev_entry() helper, which makes it clear what
we are doing and is shorter.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: move transaction aborts to the error site in add_to_free_space_tree()

Transaction aborts should be done next to the place the error happens,
which was not done in add_to_free_space_tree().

Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: move transaction aborts to the error site in remove_from_free_space_tree()

Transaction aborts should be done next to the place the error happens,
which was not done in remove_from_free_space_tree().

Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: move transaction aborts to the error site in convert_free_space_to_extents()

Transaction aborts should be done next to the place the error happens,
which was not done in convert_free_space_to_extents(). The DEBUG_WARN()
is removed because we get the abort message.

Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: move transaction aborts to the error site in convert_free_space_to_bitmaps()

Transaction aborts should be done next to the place the error happens,
which was not done in convert_free_space_to_bitmaps(). The DEBUG_WARN()
is removed because we get the abort message.

Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: scrub: move error reporting members to stack

Currently the following members of scrub_stripe are only utilized for
error reporting:

- init_error_bitmap
- init_nr_io_errors
- init_nr_csum_errors
- init_nr_meta_errors
- init_nr_meta_gen_errors

There is no need to put all those members into scrub_stripe, which take
24 bytes for each stripe, and we have 128 stripes for each device.

Instead introduce a structure, scrub_error_records, and move all above
members into that structure.

And allocate such structure from stack inside
scrub_stripe_read_repair_worker().
Since that function is called from a workqueue context, we have more
than enough stack space for just 24 bytes.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: scrub: update device stats when an error is detected

[BUG]
Since the migration to the new scrub_stripe interface, scrub no longer
updates the device stats when hitting an error, no matter if it's a read
or checksum mismatch error. E.g:

  BTRFS info (device dm-2): scrub: started on devid 1
  BTRFS error (device dm-2): unable to fixup (regular) error at logical 13631488 on dev /dev/mapper/test-scratch1 physical 13631488
  BTRFS warning (device dm-2): checksum error at logical 13631488 on dev /dev/mapper/test-scratch1, physical 13631488, root 5, inode 257, offset 0, length 4096, links 1 (path: file)
  BTRFS error (device dm-2): unable to fixup (regular) error at logical 13631488 on dev /dev/mapper/test-scratch1 physical 13631488
  BTRFS warning (device dm-2): checksum error at logical 13631488 on dev /dev/mapper/test-scratch1, physical 13631488, root 5, inode 257, offset 0, length 4096, links 1 (path: file)
  BTRFS info (device dm-2): scrub: finished on devid 1 with status: 0

Note there is no line showing the device stats error update.

[CAUSE]
In the migration to the new scrub_stripe interface, we no longer call
btrfs_dev_stat_inc_and_print().

[FIX]
- Introduce a new bitmap for metadata generation errors
  * A new bitmap
    @meta_gen_error_bitmap is introduced to record which blocks have
    metadata generation mismatch errors.

  * A new counter for that bitmap
    @init_nr_meta_gen_errors, is also introduced to store the number of
    generation mismatch errors that are found during the initial read.

    This is for the error reporting at scrub_stripe_report_errors().

  * New dedicated error message for unrepaired generation mismatches

  * Update @meta_gen_error_bitmap if a transid mismatch is hit

- Add btrfs_dev_stat_inc_and_print() calls to the following call sites
  * scrub_stripe_report_errors()
  * scrub_write_endio()
    This is only for the write errors.

This means there is a minor behavior change:

- The timing of device stats error message
  Since we concentrate the error messages at
  scrub_stripe_report_errors(), the device stats error messages will all
  show up in one go, after the detailed scrub error messages:

   BTRFS error (device dm-2): unable to fixup (regular) error at logical 13631488 on dev /dev/mapper/test-scratch1 physical 13631488
   BTRFS warning (device dm-2): checksum error at logical 13631488 on dev /dev/mapper/test-scratch1, physical 13631488, root 5, inode 257, offset 0, length 4096, links 1 (path: file)
   BTRFS error (device dm-2): unable to fixup (regular) error at logical 13631488 on dev /dev/mapper/test-scratch1 physical 13631488
   BTRFS warning (device dm-2): checksum error at logical 13631488 on dev /dev/mapper/test-scratch1, physical 13631488, root 5, inode 257, offset 0, length 4096, links 1 (path: file)
   BTRFS error (device dm-2): bdev /dev/mapper/test-scratch1 errs: wr 0, rd 0, flush 0, corrupt 1, gen 0
   BTRFS error (device dm-2): bdev /dev/mapper/test-scratch1 errs: wr 0, rd 0, flush 0, corrupt 2, gen 0

Fixes: e02ee89baa66 ("btrfs: scrub: switch scrub_simple_mirror() to scrub_stripe infrastructure")
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: add support for reclaiming from sub-space space_info

Modify btrfs_async_{data,metadata}_reclaim() to run the reclaim process
on the sub-spaces as well.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: add block reserve for treelog

We need to add a dedicated block_rsv for tree-log, because the block_rsv
serves for a tree node allocation in btrfs_alloc_tree_block(). Currently,
tree-log tree uses fs_info->empty_block_rsv, which is shared across trees
and points to the normal metadata space_info. Instead, we add a dedicated
block_rsv and that block_rsv can use the dedicated sub-space_info.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: use proper data space_info for zoned mode

Now that, we have data sub-space for the zoned mode. Tweak some space_info
functions to use proper space_info for a file.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: tweak extent/chunk allocation for space_info sub-space

Make the extent allocator and the chunk allocator aware of the sub-space.
It now uses BTRFS_SUB_GROUP_DATA_RELOC sub-space for data relocation block
group, and uses BTRFS_SUB_GROUP_TREELOG for metadata tree-log block group.

And, it needs to check the space_info is the right one when a block group
candidate is given. Also, new block group should now belong to the
specified one.

Now that, block_group->space_info is always set before
btrfs_add_bg_to_space_info(), we no longer need to "find" the space_info.
So, rename the variable name to address that as well.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: introduce tree-log sub-space_info

Introduce the tree-log sub-space_info, which is sub-space of
metadata space_info and dedicated for tree-log node allocation.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: introduce btrfs_space_info sub-group

Current code assumes we have only one space_info for each block group type
(DATA, METADATA, and SYSTEM). We sometime need multiple space infos to
manage special block groups.

One example is handling the data relocation block group for the zoned mode.
That block group is dedicated for writing relocated data and we cannot
allocate any regular extent from that block group, which is implemented in
the zoned extent allocator. This block group still belongs to the normal
data space_info. So, when all the normal data block groups are full and
there is some free space in the dedicated block group, the space_info
looks to have some free space, while it cannot allocate normal extent
anymore. That results in a strange ENOSPC error. We need to have a
space_info for the relocation data block group to represent the situation
properly.

Adds a basic infrastructure for having a "sub-group" of a space_info:
creation and removing. A sub-group space_info belongs to one of the
primary space_infos and has the same flags as its parent.

This commit first introduces the relocation data sub-space_info, and the
next commit will introduce tree-log sub-space_info. In the future, it could
be useful to implement tiered storage for btrfs e.g. by implementing a
sub-group space_info for block groups resides on a fast storage.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: add space_info parameter for block group creation

Add struct btrfs_space_info parameter to btrfs_make_block_group(), its
related functions and related struct. Passed space_info will have a new
block group.

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: add space_info argument to btrfs_chunk_alloc()

Take a btrfs_space_info argument in btrfs_chunk_alloc(). New block group
will belong to that space_info.

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: factor out check_removing_space_info() from btrfs_free_block_groups()

Factor out check_removing_space_info() from btrfs_free_block_groups(). It
sanity checks a to-be-removed space_info. There is no functional change.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: factor out do_async_reclaim_{data,metadata}_space()

Factor out the main part of btrfs_async_reclaim_data_space() to
do_async_reclaim_data_space(), so it can take data space_info parameter
it is working on. Do the same for metadata. There is no functional change.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: factor out init_space_info() from create_space_info()

Factor out initialization of the space_info struct, which is used in a
later patch. There is no functional change.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: pass struct btrfs_inode to btrfs_free_reserved_data_space_noquota()

As well as the last patch, pass struct btrfs_inode to the function and
let it distinguish which data space it is working on in a later patch.
There is no functional change with this commit.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: pass btrfs_space_info to btrfs_reserve_data_bytes()

Pass struct btrfs_space_info to btrfs_reserve_data_bytes() to allow
reserving the data from multiple data space_info candidates.

This is a preparation for the following commits and there is no functional
change.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: make extent unpinning more efficient when committing transaction

At btrfs_finish_extent_commit() we have this loop that keeps finding an
extent range to unpin in the transaction's pinned_extents io tree, caches
the extent state and then passes that cached extent state to
btrfs_clear_extent_dirty(), which will free that extent state since we
clear the only bit it can have set. So on each loop iteration we do a
full io tree search and the cached state is used only to avoid having
a tree search done by btrfs_clear_extent_dirty().

During the lifetime of a transaction we can pin many thousands of extents,
resulting in a large and deep rb tree that backs the io tree. For example,
for the following fs_mark run on a 12 cores boxes:

  $ cat test.sh
  #!/bin/bash

  DEV=/dev/nullb0
  MNT=/mnt/nullb0
  FILES=100000
  THREADS=$(nproc --all)

  echo "performance" | \
      tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor

  mkfs.btrfs -f $DEV
  mount $DEV $MNT

  OPTS="-S 0 -L 8 -n $FILES -s 0 -t $THREADS -k"
  for ((i = 1; i <= $THREADS; i++)); do
      OPTS="$OPTS -d $MNT/d$i"
  done

  fs_mark $OPTS

  umount $MNT

an histogram for the number of ranges (elements) in the pinned extents
io tree of a transaction was the following:

  Count: 76
  Range: 5440.000 - 51088.000; Mean: 27354.368; Median: 28312.000; Stddev: 9800.767
  Percentiles:  90th: 40486.000; 95th: 43322.000; 99th: 51088.000
   5440.000 -  6805.809:     1 ###
   6805.809 - 10652.034:     1 ###
  10652.034 - 13326.178:     3 ########
  13326.178 - 16671.590:     8 ######################
  16671.590 - 20856.773:     7 ####################
  20856.773 - 26092.528:    13 ####################################
  26092.528 - 32642.571:    19 #####################################################
  32642.571 - 40836.818:    17 ###############################################
  40836.818 - 51088.000:     7 ####################

We can improve on this by grabbing the next state before calling
btrfs_clear_extent_dirty(), avoiding a full tree search on the next
iteration which always has an O(log n) complexity while grabbing the next
element (rb_next() rbtree operation) is in the worst case O(log n) too,
but very often much less than that, making it more efficient.

Here follow histograms for the execution times, in nanoseconds, of
btrfs_finish_extent_commit() before and after applying this patch and all
the other patches in the same patchset.

Before patchset:

  Count: 32
  Range: 3925691.000 - 269990635.000; Mean: 133814526.906; Median: 122758052.000; Stddev: 65776550.375
  Percentiles:  90th: 228672087.000; 95th: 265187000.000; 99th: 269990635.000
    3925691.000 -   5993208.660:     1 ####
    5993208.660 -  75878537.656:     4 ##################
   75878537.656 - 115840974.514:    12 #####################################################
  115840974.514 - 176850157.761:     6 ###########################
  176850157.761 - 269990635.000:     9 ########################################

After patchset:

  Count: 32
  Range: 1849393.000 - 231491064.000; Mean: 126978584.625; Median: 123732897.000; Stddev: 58007821.806
  Percentiles:  90th: 203055491.000; 95th: 219952699.000; 99th: 231491064.000
    1849393.000 -   2997642.092:     1 ####
    2997642.092 -  88111637.071:     9 #####################################
   88111637.071 - 142818264.414:     9 #####################################
  142818264.414 - 231491064.000:    13 #####################################################

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: remove variable to track trimmed bytes at btrfs_finish_extent_commit()

We don't need to keep track of discarded (trimmed) bytes at
btrfs_finish_extent_commit() but we are declaring a local variable for
that and passing a reference to the btrfs_discard_extent() calls when we
are processing delete block groups. So instead pass NULL to
btrfs_discard_extent() and remove that variable.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: don't BUG_ON() when unpinning extents during transaction commit

In the final phase of a transaction commit, when unpinning extents at
btrfs_finish_extent_commit(), there's no need to BUG_ON() if we fail to
unpin an extent range. All that can happen is that we fail to return the
extent range to the in-memory free space cache, meaning no future space
allocations can reuse that extent range while the fs is mounted.

So instead return the error to the caller and make it abort the
transaction, so that the error is noticed and prevent misteriously leaking
space. We keep track of the first error we get while unpinning an extent
range and keep trying to unpin all the following extent ranges, so that
we attempt to do all discards. The transaction abort will deal with all
resource cleanups.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: remove unnecessary NULL checks before freeing extent state

When manipulating extent bits for an io tree range, we have this pattern:

if (prealloc)
btrfs_free_extent_state(prealloc);

but this is not needed nowadays since btrfs_free_extent_state() ignores
a NULL pointer argument, following the common pattern of kernel and btrfs
freeing functions, as well as libc and other user space libraries.
So remove the NULL checks, reducing source code and object size.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: avoid re-searching tree when setting bits in an extent range

When setting bits for an extent range (set_extent_bit()), if the current
extent state record starts after the target range, we always do a jump to
the 'search_again' label, which will cause us to do a full tree search for
the next state if the current state ends before the target range. Unless
we need to reschedule, we can just grab the next state and process it,
avoiding a full tree search, even if that next state is not contiguous, as
we'll allocate and insert a new prealloc state if needed.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: avoid repeated extent state processing when setting extent bits

When setting bits for an extent range, if we find an extent state with
its start offset greater than current start offset, we insert a new extent
state to cover the gap, with its end offset computed and stored in the
@this_end local variable, and after the insertion we update the current
start offset to @this_end + 1. However if the insert_state() call resulted
in an extent state merge then the end offset of the merged extent may be
greater than @this_end and if that's the case, since we jump to the
'search_again' label, we'll do a full tree search that will leave us in
the same extent state - this is harmless but wastes time by doing a
pointless tree search and extent state processing.

So improve on this by updating the current start offset to the end offset
of the inserted state plus 1. This also removes the use of the @this_end
variable and directly set the value in the prealloc extent state to avoid
any confusion and misuse in the future.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: simplify last record detection at set_extent_bit()

There's no need to compare the current extent state's end offset to
(u64)-1 to check if we have the last possible record and to check as
as well if after updating the start offset to the end offset of the
current record plus one we are still inside the target range.

Instead we can simplify and exit if the current extent state ends at or
after the target range and then remove the check for the (u64)-1 as well
as the check to see if the updated start offset (to last_end + 1) is still
inside the target range. Besides the simplification, this also avoids
seaching for the next extent state record (through next_state()) when the
current extent state record ends at the same offset as our target range,
which is pointless and only wastes times iterating through the rb tree.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: exit after state split error at set_extent_bit()

If split_state() returned an error we call extent_io_tree_panic() which
will trigger a BUG() call. However if CONFIG_BUG is disabled, which is an
uncommon and exotic scenario, then we fallthrough and hit a use after free
when calling set_state_bits() since the extent state record which the
local variable 'prealloc' points to was freed by split_state().

So jump to the label 'out' after calling extent_io_tree_panic() and set
the 'prealloc' pointer to NULL since split_state() has already freed it
when it hit an error.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: exit after state insertion failure at set_extent_bit()

If insert_state() state failed it returns an error pointer and we call
extent_io_tree_panic() which will trigger a BUG() call. However if
CONFIG_BUG is disabled, which is an uncommon and exotic scenario, then
we fallthrough and call cache_state() which will dereference the error
pointer, resulting in an invalid memory access.

So jump to the 'out' label after calling extent_io_tree_panic(), it also
makes the code more clear besides dealing with the exotic scenario where
CONFIG_BUG is disabled.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: simplify last record detection at btrfs_convert_extent_bit()

There's no need to compare the current extent state's end offset to
(u64)-1 to check if we have the last possible record and to check as
as well if after updating the start offset to the end offset of the
current record plus one we are still inside the target range.

Instead we can simplify and exit if the current extent state ends at or
after the target range and then remove the check for the (u64)-1 as well
as the check to see if the updated start offset (to last_end + 1) is still
inside the target range.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: avoid re-searching tree when converting bits in an extent range

When converting bits for an extent range (btrfs_convert_extent_bit()), if
the current extent state record starts after the target range, we always
do a jump to the 'search_again' label, which will cause us to do a full
tree search for the next state if the current state ends before the target
range. Unless we need to reschedule, we can just grab the next state and
process it, avoiding a full tree search, even if that next state is not
contiguous, as we'll allocate and insert a new prealloc state if needed.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: avoid repeated extent state processing when converting extent bits

When converting bits for an extent range, if we find an extent state with
its start offset greater than current start offset, we insert a new extent
state to cover the gap, with its end offset computed and stored in the
@this_end local variable, and after the insertion we update the current
start offset to @this_end + 1. However if the insert_state() call resulted
in an extent state merge then the end offset of the merged extent may be
greater than @this_end and if that's the case, since we jump to the
'search_again' label, we'll do a full tree search that will leave us in
the same extent state - this is harmless but wastes time by doing a
pointless tree search and extent state processing.

So improve on this by updating the current start offset to the end offset
of the inserted state plus 1. This also removes the use of the @this_end
variable and directly set the value in the prealloc extent state to avoid
any confusion and misuse in the future.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: avoid unnecessary next node searches when clearing bits from extent range

When clearing bits for a range in an io tree, at clear_state_bit(), we
always go search for the next node (through next_state() -> rb_next()) and
return it. However if the current extent state record ends at or after the
target range passed to btrfs_clear_extent_bit_changeset() or
btrfs_convert_extent_bit(), we are just wasting time finding that next
node since we won't use it in those functions.

Improve on this by skipping the next node search if the current node ends
at or after the target range.

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: exit after state insertion failure at btrfs_convert_extent_bit()

If insert_state() state failed it returns an error pointer and we call
extent_io_tree_panic() which will trigger a BUG() call. However if
CONFIG_BUG is disabled, which is an uncommon and exotic scenario, then
we fallthrough and call cache_state() which will dereference the error
pointer, resulting in an invalid memory access.

So jump to the 'out' label after calling extent_io_tree_panic(), it also
makes the code more clear besides dealing with the exotic scenario where
CONFIG_BUG is disabled.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: exit after state split error at btrfs_convert_extent_bit()

If split_state() returned an error we call extent_io_tree_panic() which
will trigger a BUG() call. However if CONFIG_BUG is disabled, which is an
uncommon and exotic scenario, then we fallthrough and hit a use after free
when calling set_state_bits() since the extent state record which the
local variable 'prealloc' points to was freed by split_state().

So jump to the label 'out' after calling extent_io_tree_panic() and set
the 'prealloc' pointer to NULL since split_state() has already freed it
when it hit an error.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: remove duplicate error check at btrfs_convert_extent_bit()

There's no need to check if split_state() returned an error twice, instead
unify into a single if statement after setting 'prealloc' to NULL, because
on error split_state() frees the 'prealloc' extent state record.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: simplify last record detection at btrfs_clear_extent_bit_changeset()

Instead of checking for an end offset of (u64)-1 (U64_MAX) for the current
extent state's end, and then checking after updating the current start
offset if it's now beyond the range's end offset, we can simply stop if
the current extent state's end is greater than or equals to our range's
end offset. This helps remove one comparison under the 'next' label and
allows to remove the if statement that checks if the start offset is
greater than the end offset under the 'search_again' label.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: avoid extra tree search at btrfs_clear_extent_bit_changeset()

When we find an extent state that starts before our range's start we
split it and jump into the 'search_again' label with our start offset
remaining the same, making us then go to the 'again' label and search
again for an extent state that contains the 'start' offset, and this
time it finds the same extent state but with its start offset set to
our range's start offset (due to the split). This is because we have
consumed the preallocated extent state record and we may need to split
again, and by jumping to 'again' we release the spinlock, allocate a new
prealloc state and restart the search.

However we may not need to restart and allocate a new extent state in
case we don't find extent states that cross our end offset, therefore
no need for further extent state splits, or we may be able to do an
atomic allocation (which is quick even if it fails). In these cases
it's a waste to restart the search.

So change the behaviour to do the restart only if we need to reschedule,
otherwise fall through - if we need to allocate an extent state for split
operations, we will try an atomic allocation and if that fails we will do
the restart as before.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: use bools for local variables at btrfs_clear_extent_bit_changeset()

Several variables are defined as integers but used as booleans, and the
'delete' variable can be made const since it's not changed after being
declared. So change them to proper booleans and simplify setting their
value.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: add missing error return to btrfs_clear_extent_bit_changeset()

We have a couple error branches where we have an error stored in the 'err'
variable and then jump to the 'out' label, however we don't return that
error, we just return 0. Normally this is not a problem since those error
branches call extent_io_tree_panic() which triggers a BUG() call, however
it's possible to have rather exotic kernel config with CONFIG_BUG disabled
in which case the BUG() call does nothing and we fallthrough. So make sure
to return the error, not just to fix that exotic case but also to make the
code less confusing. While at it also rename the 'err' variable to 'ret'
since this is the style we prefer and use more widely.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: exit after state split error at btrfs_clear_extent_bit_changeset()

If split_state() returned an error we call extent_io_tree_panic() which
will trigger a BUG() call. However if CONFIG_BUG is disabled, which is an
uncommon and exotic scenario, then we fallthrough and hit a use after free
when calling clear_state_bit() since the extent state record which the
local variable 'prealloc' points to was freed by split_state().

So jump to the label 'out' after calling extent_io_tree_panic() and set
the 'prealloc' pointer to NULL since split_state() has already freed it
when it hit an error.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: remove duplicate error check at btrfs_clear_extent_bit_changeset()

There's no need to check if split_state() returned an error twice, instead
unify into a single if statement after setting 'prealloc' to NULL, because
on error split_state() frees the 'prealloc' extent state record.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: get rid of btrfs_read_dev_super()

The function is introduced by commit a512bbf855ff ("Btrfs: superblock
duplication") at the beginning of btrfs.

It leaved a comment saying we'd need a special mount option to read all
super blocks, but it's never been implemented and there was not
need/request for it. The check/rescue tools are able to start from a
specific copy and use it as primary eventually.

This means btrfs_read_dev_super() is always reading the first super
block, making all the code finding the latest super block unnecessary.

Just remove that function and replace all call sites with
btrfs_read_disk_super(bdev, 0, false).

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: merge btrfs_read_dev_one_super() into btrfs_read_disk_super()

We have two functions to read a super block from a block device:

- btrfs_read_dev_one_super()
  Exported from disk-io.c

- btrfs_read_disk_super()
  Local to volumes.c

And they have some minor differences:

- btrfs_read_dev_one_super() uses @copy_num
  Meanwhile btrfs_read_disk_super() relies on the physical and expected
  bytenr passed from the caller.

  The parameter list of btrfs_read_dev_one_super() is more user
  friendly.

- btrfs_read_disk_super() makes sure the label is NUL terminated

We do not need two different functions doing the same job, so merge the
behavior into btrfs_read_disk_super() by:

- Remove btrfs_read_dev_one_super()

- Export btrfs_read_disk_super()
  The name pairs with btrfs_release_disk_super() perfectly.

- Change the parameter list of btrfs_read_disk_super() to mimic
  btrfs_read_dev_one_super()
  All existing callers are calculating the physical address and expect
  bytenr before calling btrfs_read_disk_super() already.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: get rid of goto in alloc_test_extent_buffer()

The `free_eb` label is used only once. Simplify by moving the code inplace.

Signed-off-by: Daniel Vacek <neelx@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: use buffer xarray for extent buffer writeback operations

Currently we have this ugly back and forth with the btree writeback
where we find the folio, find the eb associated with that folio, and
then attempt to writeback.  This results in two different paths for
subpage ebs and >= page size ebs.

Clean this up by adding our own infrastructure around looking up tagged
ebs and writing the ebs out directly.  This allows us to unify the
subpage and >= pagesize IO paths, resulting in a much cleaner writeback
path for extent buffers.

I ran this through fsperf on a VM with 8 CPUs and 16GiB of RAM.  I used
smallfiles100k, but reduced the files to 1k to make it run faster, the
results are as follows, with the statistically significant improvements
marked with *, there were no regressions.  fsperf was run with -n 10 for
both runs, so the baseline is the average 10 runs and the test is the
average of 10 runs.

smallfiles100k results
      metric           baseline       current        stdev            diff
================================================================================
avg_commit_ms               68.58         58.44          3.35   -14.79% *
commits                    270.60        254.70         16.24    -5.88%
dev_read_iops                  48            48             0     0.00%
dev_read_kbytes              1044          1044             0     0.00%
dev_write_iops          866117.90     850028.10      14292.20    -1.86%
dev_write_kbytes      10939976.40   10605701.20     351330.32    -3.06%
elapsed                     49.30            33          1.64   -33.06% *
end_state_mount_ns    41251498.80   35773220.70    2531205.32   -13.28% *
end_state_umount_ns      1.90e+09      1.50e+09   14186226.85   -21.38% *
max_commit_ms                 139        111.60          9.72   -19.71% *
sys_cpu                      4.90          3.86          0.88   -21.29%
write_bw_bytes        42935768.20   64318451.10    1609415.05    49.80% *
write_clat_ns_mean      366431.69     243202.60      14161.98   -33.63% *
write_clat_ns_p50        49203.20         20992        264.40   -57.34% *
write_clat_ns_p99          827392     653721.60      65904.74   -20.99% *
write_io_kbytes           2035940       2035940             0     0.00%
write_iops               10482.37      15702.75        392.92    49.80% *
write_lat_ns_max         1.01e+08      90516129    3910102.06   -10.29% *
write_lat_ns_mean       366556.19     243308.48      14154.51   -33.62% *

As you can see we get about a 33% decrease runtime, with a 50%
throughput increase, which is pretty significant.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>