Kyoji Ogasawara [Tue, 22 Jul 2025 15:38:37 +0000 (00:38 +0900)]
btrfs: fix incorrect log message for nobarrier mount option
Fix a wrong log message that appears when the "nobarrier" mount option
is unset. When "nobarrier" is unset, barrier is actually enabled.
However, the log incorrectly stated "turning off barriers".
Fixes:
eddb1a433f26 ("btrfs: add reconfigure callback for fs_context")
CC: stable@vger.kernel.org # 6.12+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Kyoji Ogasawara <sawara04.o@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Naohiro Aota [Mon, 11 Aug 2025 16:32:58 +0000 (01:32 +0900)]
btrfs: fix buffer index in wait_eb_writebacks()
The commit
f2cb97ee964a ("btrfs: index buffer_tree using node size")
changed the index of buffer_tree from "start >> sectorsize_bits" to "start
>> nodesize_bits". However, the change is not applied for
wait_eb_writebacks() and caused IO failures by writing in a full zone. Use
the index properly.
Fixes:
f2cb97ee964a ("btrfs: index buffer_tree using node size")
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Naohiro Aota [Thu, 31 Jul 2025 03:46:56 +0000 (12:46 +0900)]
btrfs: subpage: keep TOWRITE tag until folio is cleaned
btrfs_subpage_set_writeback() calls folio_start_writeback() the first time
a folio is written back, and it also clears the PAGECACHE_TAG_TOWRITE tag
even if there are still dirty blocks in the folio. This can break ordering
guarantees, such as those required by btrfs_wait_ordered_extents().
That ordering breakage leads to a real failure. For example, running
generic/464 on a zoned setup will hit the following ASSERT. This happens
because the broken ordering fails to flush existing dirty pages before the
file size is truncated.
assertion failed: !list_empty(&ordered->list) :: 0, in fs/btrfs/zoned.c:1899
------------[ cut here ]------------
kernel BUG at fs/btrfs/zoned.c:1899!
Oops: invalid opcode: 0000 [#1] SMP NOPTI
CPU: 2 UID: 0 PID:
1906169 Comm: kworker/u130:2 Kdump: loaded Not tainted 6.16.0-rc6-BTRFS-ZNS+ #554 PREEMPT(voluntary)
Hardware name: Supermicro Super Server/H12SSL-NT, BIOS 2.0 02/22/2021
Workqueue: btrfs-endio-write btrfs_work_helper [btrfs]
RIP: 0010:btrfs_finish_ordered_zoned.cold+0x50/0x52 [btrfs]
RSP: 0018:
ffffc9002efdbd60 EFLAGS:
00010246
RAX:
000000000000004c RBX:
ffff88811923c4e0 RCX:
0000000000000000
RDX:
0000000000000000 RSI:
ffffffff827e38b1 RDI:
00000000ffffffff
RBP:
ffff88810005d000 R08:
00000000ffffdfff R09:
ffffffff831051c8
R10:
ffffffff83055220 R11:
0000000000000000 R12:
ffff8881c2458c00
R13:
ffff88811923c540 R14:
ffff88811923c5e8 R15:
ffff8881c1bd9680
FS:
0000000000000000(0000) GS:
ffff88a04acd0000(0000) knlGS:
0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0:
0000000080050033
CR2:
00007f907c7a918c CR3:
0000000004024000 CR4:
0000000000350ef0
Call Trace:
<TASK>
? srso_return_thunk+0x5/0x5f
btrfs_finish_ordered_io+0x4a/0x60 [btrfs]
btrfs_work_helper+0xf9/0x490 [btrfs]
process_one_work+0x204/0x590
? srso_return_thunk+0x5/0x5f
worker_thread+0x1d6/0x3d0
? __pfx_worker_thread+0x10/0x10
kthread+0x118/0x230
? __pfx_kthread+0x10/0x10
ret_from_fork+0x205/0x260
? __pfx_kthread+0x10/0x10
ret_from_fork_asm+0x1a/0x30
</TASK>
Consider process A calling writepages() with WB_SYNC_NONE. In zoned mode or
for compressed writes, it locks several folios for delalloc and starts
writing them out. Let's call the last locked folio folio X. Suppose the
write range only partially covers folio X, leaving some pages dirty.
Process A calls btrfs_subpage_set_writeback() when building a bio. This
function call clears the TOWRITE tag of folio X, whose size = 8K and
the block size = 4K. It is following state.
0 4K 8K
|/////|/////| (flag: DIRTY, tag: DIRTY)
<-----> Process A will write this range.
Now suppose process B concurrently calls writepages() with WB_SYNC_ALL. It
calls tag_pages_for_writeback() to tag dirty folios with
PAGECACHE_TAG_TOWRITE. Since folio X is still dirty, it gets tagged. Then,
B collects tagged folios using filemap_get_folios_tag() and must wait for
folio X to be written before returning from writepages().
0 4K 8K
|/////|/////| (flag: DIRTY, tag: DIRTY|TOWRITE)
However, between tagging and collecting, process A may call
btrfs_subpage_set_writeback() and clear folio X's TOWRITE tag.
0 4K 8K
| |/////| (flag: DIRTY|WRITEBACK, tag: DIRTY)
As a result, process B won't see folio X in its batch, and returns without
waiting for it. This breaks the WB_SYNC_ALL ordering requirement.
Fix this by using btrfs_subpage_set_writeback_keepwrite(), which retains
the TOWRITE tag. We now manually clear the tag only after the folio becomes
clean, via the xas operation.
Fixes:
3470da3b7d87 ("btrfs: subpage: introduce helpers for writeback status")
CC: stable@vger.kernel.org # 6.12+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Qu Wenruo [Wed, 30 Jul 2025 22:50:01 +0000 (08:20 +0930)]
btrfs: clear TAG_TOWRITE from buffer tree when submitting a tree block
[POSSIBLE BUG]
After commit
5e121ae687b8 ("btrfs: use buffer xarray for extent buffer
writeback operations"), we have a dedicated xarray for extent buffers,
and a lot of tags are migrated to that buffer tree, like
PAGECACHE_TAG_TOWRITE/DIRTY/WRITEBACK.
This frees us from the limits of page flags, but there is a new
asymmetric behavior, we call buffer_tree_tag_for_writeback() to set
PAGECACHE_TAG_TOWRITE for the involved ranges, but there is no one to
clear that tag.
Before that rework, we relied on the page cache tag which was cleared
when folio_start_writeback() was called.
Although this has its own problems (e.g. the first one calling
folio_start_writeback() will clear the tag for the whole page), it at
least cleared the tag.
But now our real tags are stored in the buffer tree, no one is really
clearing the PAGECACHE_TAG_TOWRITE tag now.
[FIX]
Thankfully this is not going to cause any real bug, but just some
inefficiency iterating the extent buffers.
As if we hit an extent buffer which is not dirty but still has the
PAGECACHE_TAG_TOWRITE tag, lock_extent_buffer_for_io() will skip it so
we won't writeback the extent buffer again.
To properly fix the inefficiency, just clear the PAGECACHE_TAG_TOWRITE
inside lock_extent_buffer_for_io().
There is no error path between lock_extent_buffer_for_io() and
write_one_eb(), so we're safe to clear the tag there.
Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Fri, 1 Aug 2025 15:39:49 +0000 (16:39 +0100)]
btrfs: do not set mtime/ctime to current time when unlinking for log replay
If we are doing an unlink for log replay, we are updating the directory's
mtime and ctime to the current time, and this is incorrect since it should
stay with the mtime and ctime that were set when the directory was logged.
This is the same as when adding a link to an inode during log replay (with
btrfs_add_link()), where we want the mtime and ctime to be the values that
were in place when the inode was logged.
This was found with generic/547 using LOAD_FACTOR=20 and TIME_FACTOR=20,
where due to large log trees we have longer log replay times and fssum
could detect a mismatch of the mtime and ctime of a directory.
Fix this by skipping the mtime and ctime update at __btrfs_unlink_inode()
if we are in log replay context (just like btrfs_add_link()).
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Qu Wenruo [Tue, 29 Jul 2025 09:31:46 +0000 (19:01 +0930)]
btrfs: clear block dirty if btrfs_writepage_cow_fixup() failed
[BUG]
If btrfs_writepage_cow_fixup() failed (returning value -EUCLEAN),
the block will be kept dirty, but with its corresponding range finished
in the ordered extent.
Currently that error pattern is only possible for experimental builds,
which places extra check to ensure we shouldn't hit a dirty block
without a corresponding ordered extent.
This means if later a writeback happens again, we can hit the following
problems:
- ASSERT(block_start != EXTENT_MAP_HOLE) in submit_one_sector()
If the original extent map is a hole, then we can hit this case, as
the new ordered extent failed, we will drop the new extent map and
re-read one from the disk.
- DEBUG_WARN() in btrfs_writepage_cow_fixup()
This is because we no longer have an ordered extent for those dirty
blocks. The original for them is already finished with error.
[CAUSE]
The function btrfs_writepage_cow_fixup() is not following the regular
error handling of writeback. The common practice is to clear the folio
dirty, start and finish the writeback for the block.
This is normally done by extent_clear_unlock_delalloc() with
PAGE_START_WRITEBACK | PAGE_END_WRITEBACK flags during
run_delalloc_range().
So if we keep those failed blocks dirty, they will stay in the page
cache and wait for the next writeback.
And since the original ordered extent is already finished and removed,
depending on the original extent map, we either hit the ASSERT() inside
submit_one_sector(), or hit the DEBUG_WARN() in
btrfs_writepage_cow_fixup() again (and very ironic).
[FIX]
Follow the regular error handling to clear the dirty flag for the block
range, start and finish writeback for that block range instead.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Qu Wenruo [Tue, 29 Jul 2025 09:31:45 +0000 (19:01 +0930)]
btrfs: clear block dirty if submit_one_sector() failed
[BUG]
If submit_one_sector() failed, the block will be kept dirty, but with
their corresponding range finished in the ordered extent.
This means if a writeback happens later again, we can hit the following
problems:
- ASSERT(block_start != EXTENT_MAP_HOLE) in submit_one_sector()
If the original extent map is a hole, then we can hit this case, as
the new ordered extent failed, we will drop the new extent map and
re-read one from the disk.
- DEBUG_WARN() in btrfs_writepage_cow_fixup()
This is because we no longer have an ordered extent for those dirty
blocks. The original for them is already finished with error.
[CAUSE]
The function submit_one_sector() is not following the regular error
handling of writeback. The common practice is to clear the folio dirty,
start and finish the writeback for the block.
This is normally done by extent_clear_unlock_delalloc() with
PAGE_START_WRITEBACK | PAGE_END_WRITEBACK flags during
run_delalloc_range().
So if we keep those failed blocks dirty, they will stay in the page
cache and wait for the next writeback.
And since the original ordered extent is already finished and removed,
depending on the original extent map, we either hit the ASSERT() inside
submit_one_sector(), or hit the DEBUG_WARN() in
btrfs_writepage_cow_fixup().
[FIX]
Follow the regular error handling to clear the dirty flag for the block,
start and finish writeback for that block instead.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Naohiro Aota [Wed, 16 Jul 2025 07:59:55 +0000 (16:59 +0900)]
btrfs: zoned: limit active zones to max_open_zones
When there is no active zone limit, we can technically write into any
number of zones at the same time. However, exceeding the max open zones can
degrade performance. To prevent this, set the max_active_zones to
bdev_max_open_zones() if there is no active zone limit.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Naohiro Aota [Wed, 16 Jul 2025 07:59:54 +0000 (16:59 +0900)]
btrfs: zoned: fix write time activation failure for metadata block group
Since commit
13bb483d32ab ("btrfs: zoned: activate metadata block group on
write time"), we activate a metadata block group at the write time. If the
zone capacity is small enough, we can allocate the entire region before the
first write. Then, we hit the btrfs_zoned_bg_is_full() in
btrfs_zone_activate() and the activation fails.
For a data block group, we activate it at the allocation time and we should
check the fullness condition in the caller side. Add, a WARN to check the
fullness condition.
For a metadata block group, we don't need the fullness check because we
activate it at the write time. Instead, activating it once it is written
should be invalid. Catch that with a WARN too.
Fixes:
13bb483d32ab ("btrfs: zoned: activate metadata block group on write time")
CC: stable@vger.kernel.org # 6.6+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Naohiro Aota [Wed, 16 Jul 2025 07:59:53 +0000 (16:59 +0900)]
btrfs: zoned: fix data relocation block group reservation
btrfs_zoned_reserve_data_reloc_bg() is called on mount and at that point,
all data block groups belong to the primary data space_info. So, we don't
find anything in the data relocation space_info.
Also, the condition "bg->used > 0" can select a block group with full of
zone_unusable bytes for the candidate. As we cannot allocate from the block
group, it is useless to reserve it as the data relocation block group.
Furthermore, because of the space_info separation, we need to migrate the
selected block group to the data relocation space_info. If not, the extent
allocator cannot use the block group to do the allocation.
This commit fixes these three issues.
Fixes:
e606ff985ec7 ("btrfs: zoned: reserve data_reloc block group on mount")
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Johannes Thumshirn [Wed, 23 Jul 2025 13:38:10 +0000 (15:38 +0200)]
btrfs: zoned: skip ZONE FINISH of conventional zones
Don't call ZONE FINISH for conventional zones as this will result in I/O
errors. Instead check if the zone that needs finishing is a conventional
zone and if yes skip it.
Also factor out the actual handling of finishing a single zone into a
helper function, as do_zone_finish() is growing ever bigger and the
indentations levels are getting higher.
Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Boris Burkov [Wed, 30 Jul 2025 16:29:23 +0000 (09:29 -0700)]
btrfs: fix iteration bug in __qgroup_excl_accounting()
__qgroup_excl_accounting() uses the qgroup iterator machinery to
update the account of one qgroups usage for all its parent hierarchy,
when we either add or remove a relation and have only exclusive usage.
However, there is a small bug there: we loop with an extra iteration
temporary qgroup called `cur` but never actually refer to that in the
body of the loop. As a result, we redundantly account the same usage to
the first qgroup in the list.
This can be reproduced in the following way:
mkfs.btrfs -f -O squota <dev>
mount <dev> <mnt>
btrfs subvol create <mnt>/sv
dd if=/dev/zero of=<mnt>/sv/f bs=1M count=1
sync
btrfs qgroup create 1/100 <mnt>
btrfs qgroup create 2/200 <mnt>
btrfs qgroup assign 1/100 2/200 <mnt>
btrfs qgroup assign 0/256 1/100 <mnt>
btrfs qgroup show <mnt>
and the broken result is (note the 2MiB on 1/100 and 0Mib on 2/100):
Qgroupid Referenced Exclusive Path
-------- ---------- --------- ----
0/5 16.00KiB 16.00KiB <toplevel>
0/256 1.02MiB 1.02MiB sv
Qgroupid Referenced Exclusive Path
-------- ---------- --------- ----
0/5 16.00KiB 16.00KiB <toplevel>
0/256 1.02MiB 1.02MiB sv
1/100 2.03MiB 2.03MiB 2/100<1 member qgroup>
2/100 0.00B 0.00B <0 member qgroups>
With this fix, which simply re-uses `qgroup` as the iteration variable,
we see the expected result:
Qgroupid Referenced Exclusive Path
-------- ---------- --------- ----
0/5 16.00KiB 16.00KiB <toplevel>
0/256 1.02MiB 1.02MiB sv
Qgroupid Referenced Exclusive Path
-------- ---------- --------- ----
0/5 16.00KiB 16.00KiB <toplevel>
0/256 1.02MiB 1.02MiB sv
1/100 1.02MiB 1.02MiB 2/100<1 member qgroup>
2/100 1.02MiB 1.02MiB <0 member qgroups>
The existing fstests did not exercise two layer inheritance so this bug
was missed. I intend to add that testing there, as well.
Fixes:
a0bdc04b0732 ("btrfs: qgroup: use qgroup_iterator in __qgroup_excl_accounting()")
CC: stable@vger.kernel.org # 6.12+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
Naohiro Aota [Wed, 16 Jul 2025 07:59:52 +0000 (16:59 +0900)]
btrfs: zoned: do not select metadata BG as finish target
We call btrfs_zone_finish_one_bg() to zone finish one block group and make
room to activate another block group. Currently, we can choose a metadata
block group as a target. But, as we reserve an active metadata block group,
we no longer want to select a metadata block group. So, skip it in the
loop.
CC: stable@vger.kernel.org # 6.6+
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Qu Wenruo [Fri, 25 Jul 2025 11:03:25 +0000 (20:33 +0930)]
btrfs: do not allow relocation of partially dropped subvolumes
[BUG]
There is an internal report that balance triggered transaction abort,
with the following call trace:
item 85 key (
594509824 169 0) itemoff 12599 itemsize 33
extent refs 1 gen 197740 flags 2
ref#0: tree block backref root 7
item 86 key (
594558976 169 0) itemoff 12566 itemsize 33
extent refs 1 gen 197522 flags 2
ref#0: tree block backref root 7
...
BTRFS error (device loop0): extent item not found for insert, bytenr
594526208 num_bytes 16384 parent
449921024 root_objectid 934 owner 1 offset 0
BTRFS error (device loop0): failed to run delayed ref for logical
594526208 num_bytes 16384 type 182 action 1 ref_mod 1: -117
------------[ cut here ]------------
BTRFS: Transaction aborted (error -117)
WARNING: CPU: 1 PID: 6963 at ../fs/btrfs/extent-tree.c:2168 btrfs_run_delayed_refs+0xfa/0x110 [btrfs]
And btrfs check doesn't report anything wrong related to the extent
tree.
[CAUSE]
The cause is a little complex, firstly the extent tree indeed doesn't
have the backref for
594526208.
The extent tree only have the following two backrefs around that bytenr
on-disk:
item 65 key (
594509824 METADATA_ITEM 0) itemoff 13880 itemsize 33
refs 1 gen 197740 flags TREE_BLOCK
tree block skinny level 0
(176 0x7) tree block backref root CSUM_TREE
item 66 key (
594558976 METADATA_ITEM 0) itemoff 13847 itemsize 33
refs 1 gen 197522 flags TREE_BLOCK
tree block skinny level 0
(176 0x7) tree block backref root CSUM_TREE
But the such missing backref item is not an corruption on disk, as the
offending delayed ref belongs to subvolume 934, and that subvolume is
being dropped:
item 0 key (934 ROOT_ITEM 198229) itemoff 15844 itemsize 439
generation 198229 root_dirid 256 bytenr
10741039104 byte_limit 0 bytes_used
345571328
last_snapshot 198229 flags 0x1000000000001(RDONLY) refs 0
drop_progress key (206324 EXTENT_DATA
2711650304) drop_level 2
level 2 generation_v2 198229
And that offending tree block
594526208 is inside the dropped range of
that subvolume. That explains why there is no backref item for that
bytenr and why btrfs check is not reporting anything wrong.
But this also shows another problem, as btrfs will do all the orphan
subvolume cleanup at a read-write mount.
So half-dropped subvolume should not exist after an RW mount, and
balance itself is also exclusive to subvolume cleanup, meaning we
shouldn't hit a subvolume half-dropped during relocation.
The root cause is, there is no orphan item for this subvolume.
In fact there are 5 subvolumes from around 2021 that have the same
problem.
It looks like the original report has some older kernels running, and
caused those zombie subvolumes.
Thankfully upstream commit
8d488a8c7ba2 ("btrfs: fix subvolume/snapshot
deletion not triggered on mount") has long fixed the bug.
[ENHANCEMENT]
For repairing such old fs, btrfs-progs will be enhanced.
Considering how delayed the problem will show up (at run delayed ref
time) and at that time we have to abort transaction already, it is too
late.
Instead here we reject any half-dropped subvolume for reloc tree at the
earliest time, preventing confusion and extra time wasted on debugging
similar bugs.
CC: stable@vger.kernel.org # 5.15+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Wed, 16 Jul 2025 10:41:21 +0000 (11:41 +0100)]
btrfs: error on missing block group when unaccounting log tree extent buffers
Currently we only log an error message if we can't find the block group
for a log tree extent buffer when unaccounting it (while freeing a log
tree). A missing block group means something is seriously wrong and we
end up leaking space from the metadata space info. So return -ENOENT in
case we don't find the block group.
CC: stable@vger.kernel.org # 6.12+
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Qu Wenruo [Sun, 20 Jul 2025 05:31:39 +0000 (15:01 +0930)]
btrfs: fix wrong length parameter for btrfs_cleanup_ordered_extents()
Inside nocow_one_range(), if the checksum cloning for data reloc inode
failed, we call btrfs_cleanup_ordered_extents() to cleanup the just
allocated ordered extents.
But unlike extent_clear_unlock_delalloc(),
btrfs_cleanup_ordered_extents() requires a length, not an inclusive end
bytenr.
This can be problematic, as the @end is normally way larger than @len.
This means btrfs_cleanup_ordered_extents() can be called on folios
out of the correct range, and if the out-of-range folio is under
writeback, we can incorrectly clear the ordered flag of the folio, and
trigger the DEBUG_WARN() inside btrfs_writepage_cow_fixup().
Fix the wrong parameter with correct length instead.
Fixes:
94f6c5c17e52 ("btrfs: move ordered extent cleanup to where they are allocated")
CC: stable@vger.kernel.org # 6.15+
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Qu Wenruo [Sat, 19 Jul 2025 22:26:48 +0000 (07:56 +0930)]
btrfs: make btrfs_cleanup_ordered_extents() support large folios
When hitting a large folio, btrfs_cleanup_ordered_extents() will get the
same large folio multiple times, and clearing the same range again and
again.
Thankfully this is not causing anything wrong, just inefficiency.
This is caused by the fact that we're iterating folios using the old
page index, thus can hit the same large folio again and again.
Enhance it by increasing @index to the index of the folio end, and only
increase @index by 1 if we failed to grab a folio.
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Leo Martins [Mon, 21 Jul 2025 17:49:16 +0000 (10:49 -0700)]
btrfs: fix subpage deadlock in try_release_subpage_extent_buffer()
There is a potential deadlock that can happen in
try_release_subpage_extent_buffer() because the irq-safe xarray spin
lock fs_info->buffer_tree is being acquired before the irq-unsafe
eb->refs_lock.
This leads to the potential race:
// T1 (random eb->refs user) // T2 (release folio)
spin_lock(&eb->refs_lock);
// interrupt
end_bbio_meta_write()
btrfs_meta_folio_clear_writeback()
btree_release_folio()
folio_test_writeback() //false
try_release_extent_buffer()
try_release_subpage_extent_buffer()
xa_lock_irq(&fs_info->buffer_tree)
spin_lock(&eb->refs_lock); // blocked; held by T1
buffer_tree_clear_mark()
xas_lock_irqsave() // blocked; held by T2
I believe that the spin lock can safely be replaced by an rcu_read_lock.
The xa_for_each loop does not need the spin lock as it's already
internally protected by the rcu_read_lock. The extent buffer is also
protected by the rcu_read_lock so it won't be freed before we take the
eb->refs_lock and check the ref count.
The rcu_read_lock is taken and released every iteration, just like the
spin lock, which means we're not protected against concurrent
insertions into the xarray. This is fine because we rely on
folio->private to detect if there are any ebs remaining in the folio.
There is already some precedent for this with find_extent_buffer_nolock,
which loads an extent buffer from the xarray with only rcu_read_lock.
lockdep warning:
=====================================================
WARNING: HARDIRQ-safe -> HARDIRQ-unsafe lock order detected
6.16.0-0_fbk701_debug_rc0_123_g4c06e63b9203 #1 Tainted: G E N
-----------------------------------------------------
kswapd0/66 [HC0[0]:SC0[0]:HE0:SE1] is trying to acquire:
ffff000011ffd600 (&eb->refs_lock){+.+.}-{3:3}, at: try_release_extent_buffer+0x18c/0x560
and this task is already holding:
ffff0000c1d91b88 (&buffer_xa_class){-.-.}-{3:3}, at: try_release_extent_buffer+0x13c/0x560
which would create a new lock dependency:
(&buffer_xa_class){-.-.}-{3:3} -> (&eb->refs_lock){+.+.}-{3:3}
but this new dependency connects a HARDIRQ-irq-safe lock:
(&buffer_xa_class){-.-.}-{3:3}
... which became HARDIRQ-irq-safe at:
lock_acquire+0x178/0x358
_raw_spin_lock_irqsave+0x60/0x88
buffer_tree_clear_mark+0xc4/0x160
end_bbio_meta_write+0x238/0x398
btrfs_bio_end_io+0x1f8/0x330
btrfs_orig_write_end_io+0x1c4/0x2c0
bio_endio+0x63c/0x678
blk_update_request+0x1c4/0xa00
blk_mq_end_request+0x54/0x88
virtblk_request_done+0x124/0x1d0
blk_mq_complete_request+0x84/0xa0
virtblk_done+0x130/0x238
vring_interrupt+0x130/0x288
__handle_irq_event_percpu+0x1e8/0x708
handle_irq_event+0x98/0x1b0
handle_fasteoi_irq+0x264/0x7c0
generic_handle_domain_irq+0xa4/0x108
gic_handle_irq+0x7c/0x1a0
do_interrupt_handler+0xe4/0x148
el1_interrupt+0x30/0x50
el1h_64_irq_handler+0x14/0x20
el1h_64_irq+0x6c/0x70
_raw_spin_unlock_irq+0x38/0x70
__run_timer_base+0xdc/0x5e0
run_timer_softirq+0xa0/0x138
handle_softirqs.llvm.
13542289750107964195+0x32c/0xbd0
____do_softirq.llvm.
17674514681856217165+0x18/0x28
call_on_irq_stack+0x24/0x30
__irq_exit_rcu+0x164/0x430
irq_exit_rcu+0x18/0x88
el1_interrupt+0x34/0x50
el1h_64_irq_handler+0x14/0x20
el1h_64_irq+0x6c/0x70
arch_local_irq_enable+0x4/0x8
do_idle+0x1a0/0x3b8
cpu_startup_entry+0x60/0x80
rest_init+0x204/0x228
start_kernel+0x394/0x3f0
__primary_switched+0x8c/0x8958
to a HARDIRQ-irq-unsafe lock:
(&eb->refs_lock){+.+.}-{3:3}
... which became HARDIRQ-irq-unsafe at:
...
lock_acquire+0x178/0x358
_raw_spin_lock+0x4c/0x68
free_extent_buffer_stale+0x2c/0x170
btrfs_read_sys_array+0x1b0/0x338
open_ctree+0xeb0/0x1df8
btrfs_get_tree+0xb60/0x1110
vfs_get_tree+0x8c/0x250
fc_mount+0x20/0x98
btrfs_get_tree+0x4a4/0x1110
vfs_get_tree+0x8c/0x250
do_new_mount+0x1e0/0x6c0
path_mount+0x4ec/0xa58
__arm64_sys_mount+0x370/0x490
invoke_syscall+0x6c/0x208
el0_svc_common+0x14c/0x1b8
do_el0_svc+0x4c/0x60
el0_svc+0x4c/0x160
el0t_64_sync_handler+0x70/0x100
el0t_64_sync+0x168/0x170
other info that might help us debug this:
Possible interrupt unsafe locking scenario:
CPU0 CPU1
---- ----
lock(&eb->refs_lock);
local_irq_disable();
lock(&buffer_xa_class);
lock(&eb->refs_lock);
<Interrupt>
lock(&buffer_xa_class);
*** DEADLOCK ***
2 locks held by kswapd0/66:
#0:
ffff800085506e40 (fs_reclaim){+.+.}-{0:0}, at: balance_pgdat+0xe8/0xe50
#1:
ffff0000c1d91b88 (&buffer_xa_class){-.-.}-{3:3}, at: try_release_extent_buffer+0x13c/0x560
Link: https://www.kernel.org/doc/Documentation/locking/lockdep-design.rst#:~:text=Multi%2Dlock%20dependency%20rules%3A
Fixes:
19d7f65f032f ("btrfs: convert the buffer_radix to an xarray")
CC: stable@vger.kernel.org # 6.16+
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Leo Martins <loemra.dev@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Wed, 30 Jul 2025 18:18:37 +0000 (19:18 +0100)]
btrfs: fix log tree replay failure due to file with 0 links and extents
If we log a new inode (not persisted in a past transaction) that has 0
links and extents, then log another inode with an higher inode number, we
end up with failing to replay the log tree with -EINVAL. The steps for
this are:
1) create new file A
2) write some data to file A
3) open an fd on file A
4) unlink file A
5) fsync file A using the previously open fd
6) create file B (has higher inode number than file A)
7) fsync file B
8) power fail before current transaction commits
Now when attempting to mount the fs, the log replay will fail with
-ENOENT at replay_one_extent() when attempting to replay the first
extent of file A. The failure comes when trying to open the inode for
file A in the subvolume tree, since it doesn't exist.
Before commit
5f61b961599a ("btrfs: fix inode lookup error handling
during log replay"), the returned error was -EIO instead of -ENOENT,
since we converted any errors when attempting to read an inode during
log replay to -EIO.
The reason for this is that the log replay procedure fails to ignore
the current inode when we are at the stage LOG_WALK_REPLAY_ALL, our
current inode has 0 links and last inode we processed in the previous
stage has a non 0 link count. In other words, the issue is that at
replay_one_extent() we only update wc->ignore_cur_inode if the current
replay stage is LOG_WALK_REPLAY_INODES.
Fix this by updating wc->ignore_cur_inode whenever we find an inode item
regardless of the current replay stage. This is a simple solution and easy
to backport, but later we can do other alternatives like avoid logging
extents or inode items other than the inode item for inodes with a link
count of 0.
The problem with the wc->ignore_cur_inode logic has been around since
commit
f2d72f42d5fa ("Btrfs: fix warning when replaying log after fsync
of a tmpfile") but it only became frequent to hit since the more recent
commit
5e85262e542d ("btrfs: fix fsync of files with no hard links not
persisting deletion"), because we stopped skipping inodes with a link
count of 0 when logging, while before the problem would only be triggered
if trying to replay a log tree created with an older kernel which has a
logged inode with 0 links.
A test case for fstests will be submitted soon.
Reported-by: Peter Jung <ptr1337@cachyos.org>
Link: https://lore.kernel.org/linux-btrfs/fce139db-4458-4788-bb97-c29acf6cb1df@cachyos.org/
Reported-by: burneddi <burneddi@protonmail.com>
Link: https://lore.kernel.org/linux-btrfs/lh4W-Lwc0Mbk-QvBhhQyZxf6VbM3E8VtIvU3fPIQgweP_Q1n7wtlUZQc33sYlCKYd-o6rryJQfhHaNAOWWRKxpAXhM8NZPojzsJPyHMf2qY=@protonmail.com/#t
Reported-by: Russell Haley <yumpusamongus@gmail.com>
Link: https://lore.kernel.org/linux-btrfs/598ecc75-eb80-41b3-83c2-f2317fbb9864@gmail.com/
Fixes:
f2d72f42d5fa ("Btrfs: fix warning when replaying log after fsync of a tmpfile")
CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Fri, 18 Jul 2025 12:07:29 +0000 (13:07 +0100)]
btrfs: send: use fallocate for hole punching with send stream v2
Currently holes are sent as writes full of zeroes, which results in
unnecessarily using disk space at the receiving end and increasing the
stream size.
In some cases we avoid sending writes of zeroes, like during a full
send operation where we just skip writes for holes.
But for some cases we fill previous holes with writes of zeroes too, like
in this scenario:
1) We have a file with a hole in the range [2M, 3M), we snapshot the
subvolume and do a full send. The range [2M, 3M) stays as a hole at
the receiver since we skip sending write commands full of zeroes;
2) We punch a hole for the range [3M, 4M) in our file, so that now it
has a 2M hole in the range [2M, 4M), and snapshot the subvolume.
Now if we do an incremental send, we will send write commands full
of zeroes for the range [2M, 4M), removing the hole for [2M, 3M) at
the receiver.
We could improve cases such as this last one by doing additional
comparisons of file extent items (or their absence) between the parent
and send snapshots, but that's a lot of code to add plus additional CPU
and IO costs.
Since the send stream v2 already has a fallocate command and btrfs-progs
implements a callback to execute fallocate since the send stream v2
support was added to it, update the kernel to use fallocate for punching
holes for V2+ streams.
Test coverage is provided by btrfs/284 which is a version of btrfs/007
that exercises send stream v2 instead of v1, using fsstress with random
operations and fssum to verify file contents.
Link: https://github.com/kdave/btrfs-progs/issues/1001
CC: stable@vger.kernel.org # 6.1+
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Fri, 18 Jul 2025 17:14:40 +0000 (18:14 +0100)]
btrfs: unfold transaction aborts when writing dirty block groups
We have a single transaction abort call that can be due to an error from
one of two calls to update_block_group_item(). Unfold the transaction
abort calls so that if they happen we know which update_block_group_item()
call failed.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Fri, 11 Jul 2025 19:59:36 +0000 (20:59 +0100)]
btrfs: use saner variable type and name to indicate extrefs at add_inode_ref()
We are using a variable named 'log_ref_ver' of type int to indicate if we
are processing an extref item or not, using a value of 1 if so, otherwise
0. This is an odd name and type, so rename it to 'is_extref_item' and
change its type to bool.
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Fri, 11 Jul 2025 19:48:23 +0000 (20:48 +0100)]
btrfs: don't skip remaining extrefs if dir not found during log replay
During log replay, at add_inode_ref(), if we have an extref item that
contains multiple extrefs and one of them points to a directory that does
not exist in the subvolume tree, we are supposed to ignore it and process
the remaining extrefs encoded in the extref item, since each extref can
point to a different parent inode. However when that happens we just
return from the function and ignore the remaining extrefs.
The problem has been around since extrefs were introduced, in commit
f186373fef00 ("btrfs: extended inode refs"), but it's hard to hit in
practice because getting extref items encoding multiple extref requires
getting a hash collision when computing the offset of the extref's
key. The offset if computed like this:
key.offset = btrfs_extref_hash(dir_ino, name->name, name->len);
and btrfs_extref_hash() is just a wrapper around crc32c().
Fix this by moving to next iteration of the loop when we don't find
the parent directory that an extref points to.
Fixes:
f186373fef00 ("btrfs: extended inode refs")
CC: stable@vger.kernel.org # 6.1+
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Fri, 11 Jul 2025 19:21:28 +0000 (20:21 +0100)]
btrfs: don't ignore inode missing when replaying log tree
During log replay, at add_inode_ref(), we return -ENOENT if our current
inode isn't found on the subvolume tree or if a parent directory isn't
found. The error comes from btrfs_iget_logging() <- btrfs_iget() <-
btrfs_read_locked_inode().
The single caller of add_inode_ref(), replay_one_buffer(), ignores an
-ENOENT error because it expects that error to mean only that a parent
directory wasn't found and that is ok.
Before commit
5f61b961599a ("btrfs: fix inode lookup error handling during
log replay") we were converting any error when getting a parent directory
to -ENOENT and any error when getting the current inode to -EIO, so our
caller would fail log replay in case we can't find the current inode.
After that commit however in case the current inode is not found we return
-ENOENT to the caller and therefore it ignores the critical fact that the
current inode was not found in the subvolume tree.
Fix this by converting -ENOENT to 0 when we don't find a parent directory,
returning -ENOENT when we don't find the current inode and making the
caller, replay_one_buffer(), not ignore -ENOENT anymore.
Fixes:
5f61b961599a ("btrfs: fix inode lookup error handling during log replay")
CC: stable@vger.kernel.org # 6.16
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Qu Wenruo [Wed, 16 Jul 2025 07:25:39 +0000 (16:55 +0930)]
btrfs: enable large data folios for data reloc inode
For data reloc inodes, they are a special type of inodes that are not
exposed to user space, and are only utilized during data block groups
relocation.
They do not go under regular read-write operations, but have their file
extents manually created to have the same layout of a block group, then
its content is read from the original block group, and written back to
the new location which is in a new block group.
Previously all the handling was done in page units, and commit
c2832898126f ("btrfs: make relocate_one_page() handle subpage case")
changed the handling to subpage blocks.
On the other hand, data reloc inodes are a perfect match for large data
folios, as each relocation cluster represents one or more data extents
that are contiguous in their logical addresses.
This patch enables large folios for data reloc inodes by:
- Remove the special handling of data reloc inodes when setting folio
order
- Change relocate_one_folio() to return the file offset of the next
folio
Originally it's designed to handle fixed page sized blocks, but with
large folios, we can handle a large folio, thus we have to return the
end of the current folio.
- Remove the warning on folio_order()
- Use folio_size() to replace fixed PAGE_SIZE usage
- Use file_offset as iterator inside relocate_file_extent_cluster
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Qu Wenruo [Wed, 16 Jul 2025 08:13:06 +0000 (17:43 +0930)]
btrfs: output more info when btrfs_subpage_assert() failed
The function btrfs_subpage_assert() is a very commonly utilized assert
to make sure the range passed in is correct inside the folio.
And when some code is not properly subpage/large folio compatible
btrfs_subpage_assert() will be the first to be triggered.
E.g. when I incorrectly enabled large folios for data reloc inodes, it
immediately triggered btrfs_subpage_assert().
In that case, outputting all the involved members will be very helpful,
this includes:
- start
- len
- folio position inside the mapping
- folio size
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Qu Wenruo [Tue, 15 Jul 2025 03:48:39 +0000 (13:18 +0930)]
btrfs: reloc: unconditionally invalidate the page cache for each cluster
Commit
9d9ea1e68a05 ("btrfs: subpage: fix relocation potentially
overwriting last page data") fixed a bug when relocating data block
groups for subpage cases.
However for the incoming large folios for data reloc inode, we can hit
the same situation where block size is the same as page size, but the
folio we got is still larger than a block.
In that case, the old subpage specific check is no longer reliable.
Here we have to enhance the handling by:
- Unconditionally invalidate the page cache for the current cluster
We set the @flush to true so that any dirty folios are properly
written back first.
And this time instead of dropping the whole page cache, just drop the
range covered by the current cluster.
This will bring some minor performance drop, as for a large folio, the
heading half will be read twice (read by previous cluster, then
invalidated, then read again by the current cluster).
However that is required to support large folios, and this gets rid of
the kinda tricky manual uptodate flag clearing for each block.
- Remove the special handling of writing back the whole page cache
filemap_invalidate_inode() handles the write back already, and since
we're invalidating all pages in the range, we no longer need to
manually clear the uptodate flags for involved blocks.
Thus there is no need to manually write back the whole page cache.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Wed, 9 Jul 2025 14:29:17 +0000 (16:29 +0200)]
btrfs: defrag: add flag to force no-compression
Currently the defrag ioctl cannot rewrite the extents without
compression. Add a new flag for that, as setting compression to 0 (or
"no compression") means to do no changes to compression so take what is
the current default, like mount options or properties.
The defrag setting overrides mount or properties. The compression
BTRFS_DEFRAG_DONT_COMPRESS is only used for in-memory operations and
does not need to have a fixed value.
Mount with zstd:9, copy test file from /usr/bin/ (about 260KB):
$ mount -o compress=zstd:9 /dev/vda /mnt
$ filefrag -vsb testfile
filefrag: -b needs a blocksize option, assuming 1024-byte blocks.
Filesystem type is:
9123683e
File size of testfile is 297704 (292 blocks of 1024 bytes)
ext: logical_offset: physical_offset: length: expected: flags:
0: 0.. 127: 13312.. 13439: 128: encoded
1: 128.. 255: 13364.. 13491: 128: 13440: encoded
2: 256.. 291: 13424.. 13459: 36: 13492: last,encoded,eof
testfile: 3 extents found
$ compsize testfile
Processed 1 file, 3 regular extents (3 refs), 0 inline, 1 fragments.
Type Perc Disk Usage Uncompressed Referenced
TOTAL 42% 124K 292K 292K
zstd 42% 124K 292K 292K
Defrag to uncompressed:
$ btrfs fi defrag --nocomp testfile
$ filefrag -vsb testfile
filefrag: -b needs a blocksize option, assuming 1024-byte blocks.
Filesystem type is:
9123683e
File size of testfile is 297704 (292 blocks of 1024 bytes)
ext: logical_offset: physical_offset: length: expected: flags:
0: 0.. 291: 291840.. 292131: 292: last,eof
testfile: 1 extent found
$ compsize testfile
Processed 1 file, 1 regular extents (1 refs), 0 inline, 1 fragments.
Type Perc Disk Usage Uncompressed Referenced
TOTAL 100% 292K 292K 292K
none 100% 292K 292K 292K
Compress again with LZO:
$ btrfs fi defrag -clzo testfile
$ filefrag -vsb testfile
filefrag: -b needs a blocksize option, assuming 1024-byte blocks.
Filesystem type is:
9123683e
File size of testfile is 297704 (292 blocks of 1024 bytes)
ext: logical_offset: physical_offset: length: expected: flags:
0: 0.. 127: 13312.. 13439: 128: encoded
1: 128.. 255: 13392.. 13519: 128: 13440: encoded
2: 256.. 291: 13480.. 13515: 36: 13520: last,encoded,eof
testfile: 3 extents found
$ compsize testfile
Processed 1 file, 3 regular extents (3 refs), 0 inline, 1 fragments.
Type Perc Disk Usage Uncompressed Referenced
TOTAL 64% 188K 292K 292K
lzo 64% 188K 292K 292K
Signed-off-by: David Sterba <dsterba@suse.com>
Boris Burkov [Mon, 14 Jul 2025 23:44:28 +0000 (16:44 -0700)]
btrfs: fix ssd_spread overallocation
If the ssd_spread mount option is enabled, then we run the so called
clustered allocator for data block groups. In practice, this results in
creating a btrfs_free_cluster which caches a block_group and borrows its
free extents for allocation.
Since the introduction of allocation size classes in 6.1, there has been
a bug in the interaction between that feature and ssd_spread.
find_free_extent() has a number of nested loops. The loop going over the
allocation stages, stored in ffe_ctl->loop and managed by
find_free_extent_update_loop(), the loop over the raid levels, and the
loop over all the block_groups in a space_info. The size class feature
relies on the block_group loop to ensure it gets a chance to see a
block_group of a given size class. However, the clustered allocator
uses the cached cluster block_group and breaks that loop. Each call to
do_allocation() will really just go back to the same cached block_group.
Normally, this is OK, as the allocation either succeeds and we don't
want to loop any more or it fails, and we clear the cluster and return
its space to the block_group.
But with size classes, the allocation can succeed, then later fail,
outside of do_allocation() due to size class mismatch. That latter
failure is not properly handled due to the highly complex multi loop
logic. The result is a painful loop where we continue to allocate the
same num_bytes from the cluster in a tight loop until it fails and
releases the cluster and lets us try a new block_group. But by then, we
have skipped great swaths of the available block_groups and are likely
to fail to allocate, looping the outer loop. In pathological cases like
the reproducer below, the cached block_group is often the very last one,
in which case we don't perform this tight bg loop but instead rip
through the ffe stages to LOOP_CHUNK_ALLOC and allocate a chunk, which
is now the last one, and we enter the tight inner loop until an
allocation failure. Then allocation succeeds on the final block_group
and if the next allocation is a size mismatch, the exact same thing
happens again.
Triggering this is as easy as mounting with -o ssd_spread and then
running:
mount -o ssd_spread $dev $mnt
dd if=/dev/zero of=$mnt/big bs=16M count=1 &>/dev/null
dd if=/dev/zero of=$mnt/med bs=4M count=1 &>/dev/null
sync
if you do the two writes + sync in a loop, you can force btrfs to spin
an excessive amount on semi-successful clustered allocations, before
ultimately failing and advancing to the stage where we force a chunk
allocation. This results in 2G of data allocated per iteration, despite
only using ~20M of data. By using a small size classed extent, the inner
loop takes longer and we can spin for longer.
The simplest, shortest term fix to unbreak this is to make the clustered
allocator size_class aware in the dumbest way, where it fails on size
class mismatch. This may hinder the operation of the clustered
allocator, but better hindered than completely broken and terribly
overallocating.
Further re-design improvements are also in the works.
Fixes:
52bb7a2166af ("btrfs: introduce size class to block group allocator")
CC: stable@vger.kernel.org # 6.1+
Reported-by: David Sterba <dsterba@suse.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
Naohiro Aota [Sun, 29 Jun 2025 14:18:29 +0000 (23:18 +0900)]
btrfs: zoned: requeue to unused block group list if zone finish failed
btrfs_zone_finish() can fail for several reason. If it is -EAGAIN, we need
to try it again later. So, put the block group to the retry list properly.
Failing to do so will keep the removable block group intact until remount
and can causes unnecessary ENOSPC.
Fixes:
74e91b12b115 ("btrfs: zoned: zone finish unused block group")
CC: stable@vger.kernel.org # 6.1+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Naohiro Aota [Sun, 29 Jun 2025 14:07:42 +0000 (23:07 +0900)]
btrfs: zoned: do not remove unwritten non-data block group
There are some reports of "unable to find chunk map for logical
2147483648
length 16384" error message appears in dmesg. This means some IOs are
occurring after a block group is removed.
When a metadata tree node is cleaned on a zoned setup, we keep that node
still dirty and write it out not to create a write hole. However, this can
make a block group's used bytes == 0 while there is a dirty region left.
Such an unused block group is moved into the unused_bg list and processed
for removal. When the removal succeeds, the block group is removed from the
transaction->dirty_bgs list, so the unused dirty nodes in the block group
are not sent at the transaction commit time. It will be written at some
later time e.g, sync or umount, and causes "unable to find chunk map"
errors.
This can happen relatively easy on SMR whose zone size is 256MB. However,
calling do_zone_finish() on such block group returns -EAGAIN and keep that
block group intact, which is why the issue is hidden until now.
Fixes:
afba2bc036b0 ("btrfs: zoned: implement active zone tracking")
CC: stable@vger.kernel.org # 6.1+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Fri, 11 Jul 2025 08:46:16 +0000 (09:46 +0100)]
btrfs: remove btrfs_clear_extent_bits()
It's just a simple wrapper around btrfs_clear_extent_bit() that passes a
NULL for its last argument (a cached extent state record), plus there is
not counter part - we have a btrfs_set_extent_bit() but we do not have a
btrfs_set_extent_bits() (plural version). So just remove it and make all
callers use btrfs_clear_extent_bit() directly.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Fri, 11 Jul 2025 08:23:09 +0000 (09:23 +0100)]
btrfs: use cached state when falling back from NOCoW write to CoW write
We have a cached extent state record from the previous extent locking so
we can use when setting the EXTENT_NORESERVE in the range, allowing the
operation to be faster if the extent io tree is relatively large.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Tue, 8 Jul 2025 15:32:33 +0000 (16:32 +0100)]
btrfs: set EXTENT_NORESERVE before range unlock in btrfs_truncate_block()
Set the EXTENT_NORESERVE bit in the io tree before unlocking the range so
that we can use the cached state and speedup the operation, since the
unlock operation releases the cached state.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Johannes Thumshirn [Tue, 8 Jul 2025 06:55:03 +0000 (08:55 +0200)]
btrfs: don't print relocation messages from auto reclaim
When BTRFS is doing automatic block-group reclaim, it is spamming the
kernel log messages a lot.
Add a 'verbose' parameter to btrfs_relocate_chunk() and
btrfs_relocate_block_group() to control the verbosity of these log
message. This way the old behaviour of printing log messages on a
user-space initiated balance operation can be kept while excessive log
spamming due to auto reclaim is mitigated.
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Johannes Thumshirn [Tue, 8 Jul 2025 06:55:02 +0000 (08:55 +0200)]
btrfs: remove redundant auto reclaim log message
Remove the log message before reclaiming a chunk in
btrfs_reclaim_bgs_work(). Especially with automatic block-group
reclaiming these messages spam the kernel log.
Note there is also a tracepoint for the same condition to ease debugging.
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Wed, 9 Jul 2025 15:34:20 +0000 (16:34 +0100)]
btrfs: make btrfs_check_nocow_lock() check more than one extent
Currently btrfs_check_nocow_lock() stops at the first extent it finds and
that extent may be smaller than the target range we want to NOCOW into.
But we can have multiple consecutive extents which we can NOCOW into, so
by stopping at the first one we find we just make the caller do more work
by splitting the write into multiple ones, or in the case of mmap writes
with large folios we fail with -ENOSPC in case the folio's range is
covered by more than one extent (the fallback to NOCOW for mmap writes in
case there's no available data space to reserve/allocate was recently
added by the patch "btrfs: fix -ENOSPC mmap write failure on NOCOW
files/extents").
Improve on this by checking for multiple consecutive extents.
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Wed, 9 Jul 2025 15:26:13 +0000 (16:26 +0100)]
btrfs: assert we can NOCOW the range in btrfs_truncate_block()
We call btrfs_check_nocow_lock() to see if we can NOCOW a block sized
range but we don't check later if we can NOCOW the whole range.
It's unexpected to be able to NOCOW a range smaller than blocksize, so
add an assertion to check the NOCOW range matches the blocksize.
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Wed, 9 Jul 2025 14:03:54 +0000 (15:03 +0100)]
btrfs: update function comment for btrfs_check_nocow_lock()
The documentation for the @nowait parameter is missing, so add it.
The @nowait parameter was added in commit
80f9d24130e4 ("btrfs: make
btrfs_check_nocow_lock nowait compatible"), which forgot to update the
function comment.
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Tue, 8 Jul 2025 15:28:44 +0000 (16:28 +0100)]
btrfs: use btrfs_inode local variable at btrfs_page_mkwrite()
Most of the time we want to use the btrfs_inode, so change the local inode
variable to be a btrfs_inode instead of a VFS inode, reducing verbosity
by eliminating a lot of BTRFS_I() calls.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Tue, 8 Jul 2025 15:23:02 +0000 (16:23 +0100)]
btrfs: use variable for io_tree when clearing range in btrfs_page_mkwrite()
We have the inode's io_tree already stored in a local variable, so use it
instead of grabbing it again in the call to btrfs_clear_extent_bit().
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Tue, 8 Jul 2025 15:01:19 +0000 (16:01 +0100)]
btrfs: fix -ENOSPC mmap write failure on NOCOW files/extents
If we attempt a mmap write into a NOCOW file or a prealloc extent when
there is no more available data space (or unallocated space to allocate a
new data block group) and we can do a NOCOW write (there are no reflinks
for the target extent or snapshots), we always fail due to -ENOSPC, unlike
for the regular buffered write and direct IO paths where we check that we
can do a NOCOW write in case we can't reserve data space.
Simple reproducer:
$ cat test.sh
#!/bin/bash
DEV=/dev/sdi
MNT=/mnt/sdi
umount $DEV &> /dev/null
mkfs.btrfs -f -b $((512 * 1024 * 1024)) $DEV
mount $DEV $MNT
touch $MNT/foobar
# Make it a NOCOW file.
chattr +C $MNT/foobar
# Add initial data to file.
xfs_io -c "pwrite -S 0xab 0 1M" $MNT/foobar
# Fill all the remaining data space and unallocated space with data.
dd if=/dev/zero of=$MNT/filler bs=4K &> /dev/null
# Overwrite the file with a mmap write. Should succeed.
xfs_io -c "mmap -w 0 1M" \
-c "mwrite -S 0xcd 0 1M" \
-c "munmap" \
$MNT/foobar
# Unmount, mount again and verify the new data was persisted.
umount $MNT
mount $DEV $MNT
od -A d -t x1 $MNT/foobar
umount $MNT
Running this:
$ ./test.sh
(...)
wrote
1048576/
1048576 bytes at offset 0
1 MiB, 256 ops; 0.0008 sec (1.188 GiB/sec and 311435.5231 ops/sec)
./test.sh: line 24: 234865 Bus error xfs_io -c "mmap -w 0 1M" -c "mwrite -S 0xcd 0 1M" -c "munmap" $MNT/foobar
0000000 ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab
*
1048576
Fix this by not failing in case we can't allocate data space and we can
NOCOW into the target extent - reserving only metadata space in this case.
After this change the test passes:
$ ./test.sh
(...)
wrote
1048576/
1048576 bytes at offset 0
1 MiB, 256 ops; 0.0007 sec (1.262 GiB/sec and 330749.3540 ops/sec)
0000000 cd cd cd cd cd cd cd cd cd cd cd cd cd cd cd cd
*
1048576
A test case for fstests will be added soon.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Tue, 8 Jul 2025 14:49:52 +0000 (16:49 +0200)]
btrfs: use clear_and_wake_up_bit() where open coded
There are two cases open coding the clear and wake up pattern, we can
use the helper.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Sun, 6 Jul 2025 18:23:58 +0000 (20:23 +0200)]
btrfs: accessors: rename variable for folio offset
There used to be 'oip' short for offset in page, which got changed
during conversion to folios, the name is a bit confusing so rename it.
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Tue, 1 Jul 2025 17:23:54 +0000 (19:23 +0200)]
btrfs: accessors: factor out split memcpy with two sources
The case of a reading the bytes from 2 folios needs two memcpy()s, the
compiler does not emit calls but two inline loops.
Factoring out the code makes some improvement (stack, code) and in the
future will provide an optimized implementation as well. (The analogical
version with two destinations is not done as it increases stack usage
but can be done if needed.)
The address of the second folio is reordered before the first memcpy,
which leads to an optimization reusing the vmemmap_base and
page_offset_base (implementing folio_address()).
Stack usage reduction:
btrfs_get_32 -8 (32 -> 24)
btrfs_get_64 -8 (32 -> 24)
Code size reduction:
text data bss dec hex filename
1454279 115665 16088
1586032 183370 pre/btrfs.ko
1454229 115665 16088
1585982 18333e post/btrfs.ko
DELTA: -50
As this is the last patch in this series, here's the overall diff
starting and including commit "btrfs: accessors: simplify folio bounds
checks":
Stack:
btrfs_set_16 -72 (88 -> 16)
btrfs_get_32 -56 (80 -> 24)
btrfs_set_8 -72 (88 -> 16)
btrfs_set_64 -64 (88 -> 24)
btrfs_get_8 -72 (80 -> 8)
btrfs_get_16 -64 (80 -> 16)
btrfs_set_32 -64 (88 -> 24)
btrfs_get_64 -56 (80 -> 24)
NEW (48):
report_setget_bounds 48
LOST/NEW DELTA: +48
PRE/POST DELTA: -472
Code:
text data bss dec hex filename
1456601 115665 16088
1588354 183c82 pre/btrfs.ko
1454229 115665 16088
1585982 18333e post/btrfs.ko
DELTA: -2372
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Tue, 1 Jul 2025 17:23:53 +0000 (19:23 +0200)]
btrfs: accessors: set target address at initialization
The target address for the read/write can be simplified as it's the same
expression for the first folio. This improves the generated code as the
folio address does not have to be cached on stack.
Stack usage reduction:
btrfs_set_32 -8 (32 -> 24)
btrfs_set_64 -8 (32 -> 24)
btrfs_get_16 -8 (24 -> 16)
Code size reduction:
text data bss dec hex filename
1454459 115665 16088
1586212 183424 pre/btrfs.ko
1454279 115665 16088
1586032 183370 post/btrfs.ko
DELTA: -180
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Tue, 1 Jul 2025 17:23:52 +0000 (19:23 +0200)]
btrfs: accessors: compile-time fast path for u16
Reading/writing 2 bytes (u16) may need 2 folios to be written to, each
time it's just one byte so using memcpy for that is an overkill.
Add a branch for the split case so that memcpy is now used for u32 and
u64. Another side effect is that the u16 types now don't need additional
stack space, everything fits to registers.
Stack usage is reduced:
btrfs_get_16 -8 (32 -> 24)
btrfs_set_16 -16 (32 -> 16)
Code size reduction:
text data bss dec hex filename
1454691 115665 16088
1586444 18350c pre/btrfs.ko
1454459 115665 16088
1586212 183424 post/btrfs.ko
DELTA: -232
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Tue, 1 Jul 2025 17:23:51 +0000 (19:23 +0200)]
btrfs: accessors: compile-time fast path for u8
Reading/writing 1 byte (u8) is a special case compared to the others as
it's always contained in the folio we find, so the split memcpy will
never be needed. Turn it to a compile-time check that the memcpy part
can be optimized out.
The stack usage is reduced:
btrfs_set_8 -16 (32 -> 16)
btrfs_get_8 -16 (24 -> 8)
Code size reduction:
text data bss dec hex filename
1454951 115665 16088
1586704 183610 pre/btrfs.ko
1454691 115665 16088
1586444 18350c post/btrfs.ko
DELTA: -260
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Tue, 1 Jul 2025 17:23:50 +0000 (19:23 +0200)]
btrfs: accessors: inline eb bounds check and factor out the error report
There's a check in each set/get helper if the requested range is within
extent buffer bounds, and if it's not then report it. This was in an
ASSERT statement so with CONFIG_BTRFS_ASSERT this crashes right away, on
other configs this is only reported but reading out of the bounds is
done anyway. There are currently no known reports of this particular
condition failing.
There are some drawbacks though. The behaviour dependence on the
assertions being compiled in or not and a less visible effect of
inlining report_setget_bounds() into each helper.
As the bounds check is expected to succeed almost always it's ok to
inline it but make the report a function and move it out of the helper
completely (__cold puts it to a different section). This also skips
reading/writing the requested range in case it fails.
This improves stack usage significantly:
btrfs_get_16 -48 (80 -> 32)
btrfs_get_32 -48 (80 -> 32)
btrfs_get_64 -48 (80 -> 32)
btrfs_get_8 -48 (72 -> 24)
btrfs_set_16 -56 (88 -> 32)
btrfs_set_32 -56 (88 -> 32)
btrfs_set_64 -56 (88 -> 32)
btrfs_set_8 -48 (80 -> 32)
NEW (48):
report_setget_bounds 48
LOST/NEW DELTA: +48
PRE/POST DELTA: -360
Same as .ko size:
text data bss dec hex filename
1456079 115665 16088
1587832 183a78 pre/btrfs.ko
1454951 115665 16088
1586704 183610 post/btrfs.ko
DELTA: -1128
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Tue, 1 Jul 2025 17:23:49 +0000 (19:23 +0200)]
btrfs: accessors: use type sizeof constants directly
Now unit_size is used only once, so use it directly in 'part'
calculation. Don't cache sizeof(type) in a variable. While this is a
compile-time constant, forcing the type 'int' generates worse code as it
leads to additional conversion from 32 to 64 bit type on x86_64.
The sizeof() is used only a few times and it does not make the code that
harder to read, so use it directly and let the compiler utilize the
immediate constants in the context it needs. The .ko code size slightly
increases (+50) but further patches will reduce that again.
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Tue, 1 Jul 2025 17:23:48 +0000 (19:23 +0200)]
btrfs: accessors: simplify folio bounds checks
As we can a have non-contiguous range in the eb->folios, any item can be
straddling two folios and we need to check if it can be read in one go
or in two parts. For that there's a check which is not implemented in
the simplest way:
offset in folio + size <= folio size
With a simple expression transformation:
oil + size <= unit_size
size <= unit_size - oil
sizeof() <= part
this can be simplified and reusing existing run-time or compile-time
constants.
Add likely() annotation for this expression as this is the fast path and
compiler sometimes reorders that after the followup block with the
memcpy (observed in practice with other simplifications).
Overall effect on stack consumption:
btrfs_get_8 -8 (80 -> 72)
btrfs_set_8 -8 (88 -> 80)
And .ko size (due to optimizations making use of the direct constants):
text data bss dec hex filename
1456601 115665 16088
1588354 183c82 pre/btrfs.ko
1456093 115665 16088
1587846 183a86 post/btrfs.ko
DELTA: -508
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Wed, 25 Jun 2025 13:37:26 +0000 (15:37 +0200)]
btrfs: remove struct rcu_string
The only use for device name has been removed so we can kill the RCU
string API.
Reviewed-by: Daniel Vacek <neelx@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Wed, 25 Jun 2025 13:37:19 +0000 (15:37 +0200)]
btrfs: open code RCU for device name
The RCU protected string is only used for a device name, and RCU is used
so we can print the name and eventually synchronize against the rare
device rename in device_list_add().
We don't need the whole API just for that. Open code all the helpers and
access to the string itself.
Notable change is in device_list_add() when the device name is changed,
which is the only place that can actually happen at the same time as
message prints using the device name under RCU read lock.
Previously there was kfree_rcu() which used the embedded rcu_head to
delay freeing the object depending on the RCU mechanism. Now there's
kfree_rcu_mightsleep() which does not need the rcu_head and waits for
the grace period.
Sleeping is safe in this context and as this is a rare event it won't
interfere with the rest as it's holding the device_list_mutex.
Straightforward changes:
- rcu_string_strdup -> kstrdup
- rcu_str_deref -> rcu_dereference
- drop ->str from safe contexts and use rcu_dereference_raw() so it does
not trigger RCU validators
Historical notes:
Introduced in
606686eeac45 ("Btrfs: use rcu to protect device->name")
with a vague reference of the potential problem described in
https://lore.kernel.org/all/
20120531155304.GF11775@ZenIV.linux.org.uk/ .
The RCU protection looks like the easiest and most lightweight way of
protecting the rare event of device rename racing device_list_add()
with a random printk() that uses the device name.
Alternatives: a spin lock would require to protect the printk
anyway, a fixed buffer for the name would be eventually wrong in case
the new name is overwritten when being printed, an array switching
pointers and cleaning them up eventually resembles RCU too much.
The cleanups up to this patch should hide special case of RCU to the
minimum that only the name needs rcu_dereference(), which can be further
cleaned up to use btrfs_dev_name().
Reviewed-by: Daniel Vacek <neelx@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Daniel Vacek [Fri, 4 Jul 2025 16:07:02 +0000 (18:07 +0200)]
btrfs: index buffer_tree using node size
So far we've been deriving the buffer tree index using the sector size.
But each extent buffer covers multiple sectors. This makes the buffer
tree rather sparse.
For example the typical and quite common configuration uses sector size
of 4KiB and node size of 16KiB. In this case it means the buffer tree is
using up to the maximum of 25% of it's slots. Or in other words at least
75% of the tree slots are wasted as never used.
We can score significant memory savings on the required tree nodes by
indexing the tree using the node size instead. As a result far less
slots are wasted and the tree can now use up to all 100% of it's slots
this way.
Note: This works even with unaligned tree blocks as we can still get
unique index by doing eb->start >> nodesize_shift.
Getting some stats from running fio write test, there is a bit of
variance. The values presented in the table below are medians from 5
test runs. The numbers are:
- # of allocated ebs in the tree
- # of leaf tree nodes
- highest index in the tree (radix tree width)):
ebs / leaves / Index | bare for-next | with fix
---------------------+--------------------+-------------------
post mount | 16 / 11 / 10e5c | 16 / 10 / 4240
post test | 5810 / 891 / 11cfc | 4420 / 252 / 473a
post rm | 574 / 300 / 10ef0 | 540 / 163 / 46e9
In this case (10GiB filesystem) the height of the tree is still 3 levels
but the 4x width reduction is clearly visible as expected. But since the
tree is more dense we can see the 54-72% reduction of leaf nodes. That's
very close to ideal with this test. It means the tree is getting really
dense with this kind of workload.
Also, the fio results show no performance change.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Daniel Vacek <neelx@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Wed, 26 Feb 2025 16:18:25 +0000 (16:18 +0000)]
btrfs: send: directly return strcmp() result when comparing recorded refs
There's no point in converting the return values from strcmp() as all we
need is that it returns a negative value if the first argument is less
than the second, a positive value if it's greater and 0 if equal. We do
not have a need for -1 instead of any other negative value and no need
for 1 instead of any other positive value - that's all that rb_find()
needs and no where else we need specific negative and positive values.
So remove the intermediate local variable and checks and return directly
the result from strcmp().
This also reduces the module's text size.
Before:
$ size fs/btrfs/btrfs.ko
text data bss dec hex filename
1888116 161347 16136
2065599 1f84bf fs/btrfs/btrfs.ko
After:
$ size fs/btrfs/btrfs.ko
text data bss dec hex filename
1888052 161347 16136
2065535 1f847f fs/btrfs/btrfs.ko
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Tue, 1 Jul 2025 21:45:47 +0000 (22:45 +0100)]
btrfs: set search_commit_root to false in iterate_inodes_from_logical()
There's no point in checking at iterate_inodes_from_logical() if the path
has search_commit_root set, the only caller never sets search_commit_root
to true and it doesn't make sense for it ever to be true for the current
use case (logical_to_ino ioctl). So stop checking for that and since the
only caller allocates the path just for it to be used by
iterate_inodes_from_logical(), move the path allocation into that function.
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Tue, 1 Jul 2025 22:01:56 +0000 (23:01 +0100)]
btrfs: reduce size of struct tree_mod_elem
Several members are used for specific types of tree mod log operations so
they can be placed in a union in order to reduce the structure's size.
This reduces the structure size from 112 bytes to 88 bytes on x86_64,
so we can now use the kmalloc-96 slab instead of the kmalloc-128 slab.
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Tue, 1 Jul 2025 21:20:19 +0000 (22:20 +0100)]
btrfs: avoid logging tree mod log elements for irrelevant extent buffers
We are logging tree mod log operations for extent buffers from any tree
but we only need to log for the extent tree and subvolume tree, since
the tree mod log is used to get a consistent view, within a transaction,
of extents and their backrefs. So it's pointless to log operations for
trees such as the csum tree, free space tree, root tree, chunk tree,
log trees, data relocation tree, etc, as these trees are not used for
backref walking and all tree mod log users are about backref walking.
So skip extent buffers that don't belong neither to the extent nor to
subvolume trees. This avoids unnecessary memory allocations and having a
larger tree mod log rbtree with nodes that are never needed.
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Boris Burkov [Thu, 5 Jun 2025 00:46:58 +0000 (17:46 -0700)]
btrfs: use readahead_expand() on compressed extents
We recently received a report of poor performance doing sequential
buffered reads of a file with compressed extents. With bs=128k, a naive
sequential dd ran as fast on a compressed file as on an uncompressed
(1.2GB/s on my reproducing system) while with bs<32k, this performance
tanked down to ~300MB/s.
i.e., slow:
dd if=some-compressed-file of=/dev/null bs=4k count=X
vs fast:
dd if=some-compressed-file of=/dev/null bs=128k count=Y
The cause of this slowness is overhead to do with looking up extent_maps
to enable readahead pre-caching on compressed extents
(add_ra_bio_pages()), as well as some overhead in the generic VFS
readahead code we hit more in the slow case. Notably, the main
difference between the two read sizes is that in the large sized request
case, we call btrfs_readahead() relatively rarely while in the smaller
request we call it for every compressed extent. So the fast case stays
in the btrfs readahead loop:
while ((folio = readahead_folio(rac)) != NULL)
btrfs_do_readpage(folio, &em_cached, &bio_ctrl, &prev_em_start);
where the slower one breaks out of that loop every time. This results in
calling add_ra_bio_pages a lot, doing lots of extent_map lookups,
extent_map locking, etc.
This happens because although add_ra_bio_pages() does add the
appropriate un-compressed file pages to the cache, it does not
communicate back to the ractl in any way. To solve this, we should be
using readahead_expand() to signal to readahead to expand the readahead
window.
This change passes the readahead_control into the btrfs_bio_ctrl and in
the case of compressed reads sets the expansion to the size of the
extent_map we already looked up anyway. It skips the subpage case as
that one already doesn't do add_ra_bio_pages().
With this change, whether we use bs=4k or bs=128k, btrfs expands the
readahead window up to the largest compressed extent we have seen so far
(in the trivial example: 128k) and the call stacks of the two modes look
identical. Notably, we barely call add_ra_bio_pages at all. And the
performance becomes identical as well. So this change certainly "fixes"
this performance problem.
Of course, it does seem to beg a few questions:
1. Will this waste too much page cache with a too large ra window?
2. Will this somehow cause bugs prevented by the more thoughtful
checking in add_ra_bio_pages?
3. Should we delete add_ra_bio_pages?
My stabs at some answers:
1. Hard to say. See attempts at generic performance testing below. Is
there a "readahead_shrink" we should be using? Should we expand more
slowly, by half the remaining em size each time?
2. I don't think so. Since the new behavior is indistinguishable from
reading the file with a larger read size passed in, I don't see why
one would be safe but not the other.
3. Probably! I tested that and it was fine in fstests, and it seems like
the pages would get re-used just as well in the readahead case.
However, it is possible some reads that use page cache but not
btrfs_readahead() could suffer. I will investigate this further as a
follow up.
I tested the performance implications of this change in 3 ways (using
compress-force=zstd:3 for compression):
Directly test the affected workload of small sequential reads on a
compressed file (improved from ~250MB/s to ~1.2GB/s)
==========for-next==========
dd /mnt/lol/non-cmpr 4k
1048576+0 records in
1048576+0 records out
4294967296 bytes (4.3 GB, 4.0 GiB) copied, 6.02983 s, 712 MB/s
dd /mnt/lol/non-cmpr 128k
32768+0 records in
32768+0 records out
4294967296 bytes (4.3 GB, 4.0 GiB) copied, 5.92403 s, 725 MB/s
dd /mnt/lol/cmpr 4k
1048576+0 records in
1048576+0 records out
4294967296 bytes (4.3 GB, 4.0 GiB) copied, 17.8832 s, 240 MB/s
dd /mnt/lol/cmpr 128k
32768+0 records in
32768+0 records out
4294967296 bytes (4.3 GB, 4.0 GiB) copied, 3.71001 s, 1.2 GB/s
==========ra-expand==========
dd /mnt/lol/non-cmpr 4k
1048576+0 records in
1048576+0 records out
4294967296 bytes (4.3 GB, 4.0 GiB) copied, 6.09001 s, 705 MB/s
dd /mnt/lol/non-cmpr 128k
32768+0 records in
32768+0 records out
4294967296 bytes (4.3 GB, 4.0 GiB) copied, 6.07664 s, 707 MB/s
dd /mnt/lol/cmpr 4k
1048576+0 records in
1048576+0 records out
4294967296 bytes (4.3 GB, 4.0 GiB) copied, 3.79531 s, 1.1 GB/s
dd /mnt/lol/cmpr 128k
32768+0 records in
32768+0 records out
4294967296 bytes (4.3 GB, 4.0 GiB) copied, 3.69533 s, 1.2 GB/s
Built the linux kernel from clean (no change)
Ran fsperf. Mostly neutral results with some improvements and
regressions here and there.
Reported-by: Dimitrios Apostolou <jimis@gmx.net>
Link: https://lore.kernel.org/linux-btrfs/34601559-6c16-6ccc-1793-20a97ca0dbba@gmx.net/
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
Qu Wenruo [Wed, 2 Jul 2025 05:38:13 +0000 (15:08 +0930)]
btrfs: populate otime when logging an inode item
[TEST FAILURE WITH EXPERIMENTAL FEATURES]
When running test case generic/508, the test case will fail with the new
btrfs shutdown support:
generic/508 - output mismatch (see /home/adam/xfstests/results//generic/508.out.bad)
--- tests/generic/508.out 2022-05-11 11:25:30.
806666664 +0930
+++ /home/adam/xfstests/results//generic/508.out.bad 2025-07-02 14:53:22.
401824212 +0930
@@ -1,2 +1,6 @@
QA output created by 508
Silence is golden
+Before:
+After : stat.btime = Thu Jan 1 09:30:00 1970
+Before:
+After : stat.btime = Wed Jul 2 14:53:22 2025
...
(Run 'diff -u /home/adam/xfstests/tests/generic/508.out /home/adam/xfstests/results//generic/508.out.bad' to see the entire diff)
Ran: generic/508
Failures: generic/508
Failed 1 of 1 tests
Please note that the test case requires shutdown support, thus the test
case will be skipped using the current upstream kernel, as it doesn't
have shutdown ioctl support.
[CAUSE]
The direct cause the 0 time stamp in the log tree:
leaf
30507008 items 2 free space 16057 generation 9 owner TREE_LOG
leaf
30507008 flags 0x1(WRITTEN) backref revision 1
checksum stored
e522548d
checksum calced
e522548d
fs uuid
57d45451-481e-43e4-aa93-
289ad707a3a0
chunk uuid
d52bd3fd-5163-4337-98a7-
7986993ad398
item 0 key (257 INODE_ITEM 0) itemoff 16123 itemsize 160
generation 9 transid 9 size 0 nbytes 0
block group 0 mode 100644 links 1 uid 0 gid 0 rdev 0
sequence 1 flags 0x0(none)
atime
1751432947.
492000000 (2025-07-02 14:39:07)
ctime
1751432947.
492000000 (2025-07-02 14:39:07)
mtime
1751432947.
492000000 (2025-07-02 14:39:07)
otime 0.0 (1970-01-01 09:30:00) <<<
But the old fs tree has all the correct time stamp:
btrfs-progs v6.12
fs tree key (FS_TREE ROOT_ITEM 0)
leaf
30425088 items 2 free space 16061 generation 5 owner FS_TREE
leaf
30425088 flags 0x1(WRITTEN) backref revision 1
checksum stored
48f6c57e
checksum calced
48f6c57e
fs uuid
57d45451-481e-43e4-aa93-
289ad707a3a0
chunk uuid
d52bd3fd-5163-4337-98a7-
7986993ad398
item 0 key (256 INODE_ITEM 0) itemoff 16123 itemsize 160
generation 3 transid 0 size 0 nbytes 16384
block group 0 mode 40755 links 1 uid 0 gid 0 rdev 0
sequence 0 flags 0x0(none)
atime
1751432947.0 (2025-07-02 14:39:07)
ctime
1751432947.0 (2025-07-02 14:39:07)
mtime
1751432947.0 (2025-07-02 14:39:07)
otime
1751432947.0 (2025-07-02 14:39:07) <<<
The root cause is that fill_inode_item() in tree-log.c is only
populating a/c/m time, not the otime (or btime in statx output).
Part of the reason is that, the vfs inode only has a/c/m time, no native
btime support yet.
[FIX]
Thankfully btrfs has its otime stored in btrfs_inode::i_otime_sec and
btrfs_inode::i_otime_nsec.
So what we really need is just fill the otime time stamp in
fill_inode_item() of tree-log.c
There is another fill_inode_item() in inode.c, which is doing the proper
otime population.
Fixes:
94edf4ae43a5 ("Btrfs: don't bother committing delayed inode updates when fsyncing")
CC: stable@vger.kernel.org
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Tue, 1 Jul 2025 14:25:14 +0000 (15:25 +0100)]
btrfs: qgroup: use btrfs_qgroup_enabled() in ioctls
We have a publicly exported btrfs_qgroup_enabled() and an ioctl.c private
qgroup_enabled() helper. Both of these test if qgroups are enabled, the
first check if the flag BTRFS_FS_QUOTA_ENABLED is set in fs_info->flags
while the second checks if fs_info->quota_root is not NULL while holding
the mutex fs_info->qgroup_ioctl_lock.
We can get away with the private ioctl.c:qgroup_enabled(), as all entry
points into the qgroup code check if fs_info->quota_root is NULL or not
while holding the mutex fs_info->qgroup_ioctl_lock, and returning the
error -ENOTCONN in case it's NULL.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Tue, 1 Jul 2025 14:44:16 +0000 (15:44 +0100)]
btrfs: qgroup: fix qgroup create ioctl returning success after quotas disabled
When quotas are disabled qgroup ioctls are supposed to return -ENOTCONN,
but the qgroup create ioctl stopped doing that when it races with a quota
disable operation, returning 0 instead. This change of behaviour happened
in commit
6ed05643ddb1 ("btrfs: create qgroup earlier in snapshot
creation").
The issue happens as follows:
1) Task A enters btrfs_ioctl_qgroup_create(), qgroups are enabled and so
qgroup_enabled() returns true since fs_info->quota_root is not NULL;
2) Task B enters btrfs_ioctl_quota_ctl() -> btrfs_quota_disable() and
disables qgroups, so now fs_info->quota_root is NULL;
3) Task A enters btrfs_create_qgroup() and calls btrfs_qgroup_mode(),
which returns BTRFS_QGROUP_MODE_DISABLED since quotas are disabled,
and then btrfs_create_qgroup() returns 0 to the caller, which makes
the ioctl return 0 instead of -ENOTCONN.
The check for fs_info->quota_root and returning -ENOTCONN if it's NULL
is made only after the call btrfs_qgroup_mode().
Fix this by moving the check for disabled quotas with btrfs_qgroup_mode()
into transaction.c:create_pending_snapshot(), so that we don't abort the
transaction if btrfs_create_qgroup() returns -ENOTCONN and quotas are
disabled.
Fixes:
6ed05643ddb1 ("btrfs: create qgroup earlier in snapshot creation")
CC: stable@vger.kernel.org # 6.12+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Tue, 1 Jul 2025 10:39:44 +0000 (11:39 +0100)]
btrfs: qgroup: set quota enabled bit if quota disable fails flushing reservations
Before waiting for the rescan worker to finish and flushing reservations,
we clear the BTRFS_FS_QUOTA_ENABLED flag from fs_info. If we fail flushing
reservations we leave with the flag not set which is not correct since
quotas are still enabled - we must set back the flag on error paths, such
as when we fail to start a transaction, except for error paths that abort
a transaction. The reservation flushing happens very early before we do
any operation that actually disables quotas and before we start a
transaction, so set back BTRFS_FS_QUOTA_ENABLED if it fails.
Fixes:
af0e2aab3b70 ("btrfs: qgroup: flush reservations during quota disable")
CC: stable@vger.kernel.org # 6.12+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Qu Wenruo [Fri, 4 Jul 2025 09:38:03 +0000 (19:08 +0930)]
btrfs: restrict writes to opened btrfs devices
[FLAG EXCLUSION]
Commit
ead622674df5 ("btrfs: Do not restrict writes to btrfs devices")
removes the BLK_OPEN_RESTRICT_WRITES flag when opening the devices
during mount. This was an exception at the time as it depended on other
patches.
[REASON TO EXCLUDE THAT FLAG]
Btrfs needs to call btrfs_scan_one_device() to determine the fsid, no
matter if we're mounting a new fs or an existing one.
But if a fs is already mounted and the BLK_OPEN_RESTRICT_WRITES is
honored, meaning no other write open is allowed for the block device.
Then we want to mount a subvolume of the mounted fs to another mount
point, we will call btrfs_scan_one_device() again, but it will fail due
to the BLK_OPEN_RESTRICT_WRITES flag (no more write open allowed),
causing only one mount point for the fs.
Thus at that time, we had to exclude the BLK_OPEN_RESTRICT_WRITES to
allow multiple mount points for one fs.
[WHY IT'S SAFE NOW]
The root problem is, we do not need to nor should use BLK_OPEN_WRITE for
btrfs_scan_one_device().
That function is only to read out the super block, no write at all, and
BLK_OPEN_WRITE is only going to cause problems for such usage.
The root problem has been fixed by patch "btrfs: always open the device
read-only in btrfs_scan_one_device", so btrfs_scan_one_device() will
always work no matter if the device is opened with
BLK_OPEN_RESTRICT_WRITES.
[ENHANCEMENT]
Just remove the btrfs_open_mode(), as the only call site can be replaced
with regular sb_open_mode().
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Qu Wenruo [Thu, 19 Jun 2025 07:18:44 +0000 (16:48 +0930)]
btrfs: use fs_holder_ops for all opened devices
Since we have btrfs_fs_info::sb (struct super_block) as our block device
holder, we can safely use fs_holder_ops for all of our block devices.
This enables freezing/thawing the filesystem from the underlying block
devices.
This may enhance hibernation/suspend support since previously
freezing/thawing a block device managed by btrfs won't do anything btrfs
specific, but only syncing the block device.
Thus before this change, freezing the underlying block devices won't
prevent future writes into the filesystem, thus may cause problems for
hibernation/suspend when btrfs is involved.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Christoph Hellwig [Wed, 11 Jun 2025 10:03:03 +0000 (12:03 +0200)]
btrfs: use the super_block as holder when mounting file systems
The file system type is not a very useful holder as it doesn't allow us
to go back to the actual file system instance. Pass the super_block
instead which is useful when passed back to the file system driver.
This matches what is done for all other block device based file systems,
and allows us to remove btrfs_fs_info::bdev_holder completely.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Qu Wenruo [Mon, 16 Jun 2025 22:11:14 +0000 (07:41 +0930)]
btrfs: delay btrfs_open_devices() until super block is created
Currently we always call btrfs_open_devices() before creating the
super block.
It's fine for now because:
- No blk_holder_ops is provided
- btrfs_fs_type is used as a holder
This means no matter who wins the device opening race, the holder will be
the same thus not affecting the later sget_fc() race.
And since no blk_holder_ops is provided, no bdev operation is depending on
the holder.
But this will no longer be true if we want to implement a proper
blk_holder_ops using fs_holder_ops.
This means we will need a proper super block as the bdev holder.
To prepare for such change:
- Add btrfs_fs_devices::holding member
This will prevent btrfs_free_stale_devices() and btrfs_close_device()
from deleting the fs_devices when there is another process trying to
mount the fs.
Along with the new member, here come the two helpers,
btrfs_fs_devices_inc_holding() and btrfs_fs_devices_dec_holding().
This will allow us to hold fs_devices without opening it.
This is needed because we cannot hold uuid_mutex while calling
sget_fc(), this will reverse the lock sequence with s_umount, causing
a lockdep warning.
- Delay btrfs_open_devices() until a super block is returned
This means we have to hold the initial fs_devices first, then unlock
uuid_mutex, call sget_fc(), then re-lock uuid_mutex, and decrease the
holding number.
For new super block case, we continue to btrfs_open_devices() with
uuid_mutex hold.
For existing super block case, we can unlock uuid_mutex and continue.
Although this means a more complex error handling path, as if we
didn't call btrfs_open_devices() (either got an existing sb, or
sget_fc() failed), we cannot let btrfs_put_fs_info() cleanup the
fs_devices, as it can be freed at any time after we decrease the hold
on fs_devices and unlock uuid_mutex.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Qu Wenruo [Mon, 16 Jun 2025 00:12:20 +0000 (09:42 +0930)]
btrfs: call bdev_fput() to reclaim the blk_holder immediately
As part of the preparation for btrfs blk_holder_ops, we want to ensure
the holder of a block device has a proper lifespan.
However btrfs is always using fput() to close a block device, which has
one problem:
- fput() is deferred
Meaning we can have a block device with invalid (aka, freed) holder.
To avoid the problem and align the behavior to other code, just call
bdev_fput() instead.
There is some extra requirement on the locking, but that's all resolved
by previous patches and we should be safe to call bdev_fput().
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Christoph Hellwig [Wed, 11 Jun 2025 10:03:00 +0000 (12:03 +0200)]
btrfs: call btrfs_close_devices() from ->kill_sb
Although btrfs is not yet implementing blk_holder_ops, there is a
requirement for proper blk_holder_ops:
- blkdev_put() must not be called under sb->s_umount
The blkdev_put()/bdev_fput() must not be called under sb->s_umount to
avoid lock order reversal with disk->open_mutex.
This is for the proper blk_holder_ops callbacks.
Currently we're fine because we call regular fput() which defers the
blk holder reclaiming.
To prepare for the future of blk_holder_ops, move the
btrfs_close_devices() calls into btrfs_free_fs_info().
That will be called from kill_sb() callbacks, which is also called for
error handing during mount failures, or there is already an existing
super block.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Qu Wenruo [Sun, 15 Jun 2025 22:44:25 +0000 (08:14 +0930)]
btrfs: add assertions to make super block creation more clear
When calling sget_fc(), there are 3 different situations:
a) Critical error
No super block created.
b) A new super block is created
The fc->s_fs_info is transferred to the super block, and fc->s_fs_info
is reset to NULL.
In this case sb->s_root should still be NULL, and needs to be properly
initialized later by btrfs_fill_super().
c) An existing super block is returned
The fc->s_fs_info is untouched, and anything related to that fs_info
should be properly cleaned up.
This is not obvious even with the extra comments at sget_fc().
Enhance the situation by:
- Add comments for case b) and c)
Especially for case c), the fs_info and fs_devices cleanup happens at
different timing, thus needs extra explanation.
- Move the comments closer to case b) and case c)
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Qu Wenruo [Thu, 12 Jun 2025 05:44:35 +0000 (15:14 +0930)]
btrfs: get rid of re-entering of btrfs_get_tree()
[EXISTING PROBLEM]
Currently btrfs mount is split into two parts:
- btrfs_get_tree_subvol()
Which sets up the very basic fs_info, and eventually calls
mount_subvol() to mount the target subvolume.
- btrfs_get_tree_super()
This is the part doing super block allocation and if there is no
existing super block, do the real open_ctree() to open the fs.
However currently we're doing this in a complex re-entering way:
vfs_get_tree()
|- btrfs_get_tree()
|- btrfs_get_tree_subvol()
|- vfs_get_tree()
| |- btrfs_get_tree()
| |- btrfs_get_tree_super()
|- mount_subvol()
This is definitely not that easy to grasp.
[ENHANCEMENT]
The function vfs_get_tree() is only doing the following work:
- Call get_tree() call back
- Call super_wake()
- Call security_sb_set_mnt_opts()
In our case, super_wake() can be skipped, as after
btrfs_get_tree_subvol() finishes, vfs_get_tree() will call super_wake()
on the super block we got anyway.
The same applies to security_sb_set_mnt_opts(), as long as we do not
free the security from our original fc in btrfs_get_tree_subvol(), the
first vfs_get_tree() call will handle the security correctly.
So here we only need to:
- Replace vfs_get_tree() call with btrfs_get_tree_super()
- Keep the existing fc->security for vfs_get_tree() to handle the
security
This will remove the re-entering behavior and make thing much easier to
follow.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Christoph Hellwig [Wed, 11 Jun 2025 10:02:59 +0000 (12:02 +0200)]
btrfs: always open the device read-only in btrfs_scan_one_device()
btrfs_scan_one_device() opens the block device only to read the super
block. Instead of passing a blk_mode_t argument to sometimes open
it for writing, just hard code BLK_OPEN_READ as it will never write
to the device or hand the block_device out to someone else.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Caleb Sander Mateos [Thu, 19 Jun 2025 19:27:45 +0000 (13:27 -0600)]
btrfs: don't skip accounting in early ENOTTY return in btrfs_uring_encoded_read()
btrfs_uring_encoded_read() returns early with -ENOTTY if the uring_cmd
is issued with IO_URING_F_COMPAT but the kernel doesn't support compat
syscalls. However, this early return bypasses the syscall accounting.
Go to out_acct instead to ensure the syscall is counted.
Fixes:
34310c442e17 ("btrfs: add io_uring command for encoded reads (ENCODED_READ ioctl)")
CC: stable@vger.kernel.org # 6.15+
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Thu, 26 Jun 2025 14:30:12 +0000 (16:30 +0200)]
btrfs: rename inode number parameter passed to btrfs_check_dir_item_collision()
The name 'dir' is misleading as it's the inode number of the directory,
so rename it accordingly.
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Thu, 26 Jun 2025 14:30:11 +0000 (16:30 +0200)]
btrfs: pass bool to indicate subvolume/snapshot creation type
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Thu, 26 Jun 2025 14:30:10 +0000 (16:30 +0200)]
btrfs: pass dentry to btrfs_mksubvol() and btrfs_mksnapshot()
There's no reason to pass 'struct path' to btrfs_mksubvol(), though it's
been like the since the first commit
76dda93c6ae2c1 ("Btrfs: add
snapshot/subvolume destroy ioctl"). We only use the dentry so we should
pass it directly.
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Thu, 26 Jun 2025 14:30:09 +0000 (16:30 +0200)]
btrfs: use struct qstr for subvolume ioctl helpers
We pass name and length of subvolumes separately to the related
functions, while this can be a struct qstr which is otherwise used for
dentry interfaces.
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
Brahmajit Das [Fri, 20 Jun 2025 16:49:57 +0000 (22:19 +0530)]
btrfs: replace strcpy() with strscpy()
strcpy() is discouraged from use due to lack of bounds checking.
Replaces it with strscpy(), the recommended alternative for null
terminated strings, to follow best practices.
There are instances where strscpy() cannot be used such as where both
the source and destination are character pointers. In that instance we
can use sysfs_emit().
Link: https://github.com/KSPP/linux/issues/88
Suggested-by: Anthony Iliopoulos <ailiop@suse.com>
Signed-off-by: Brahmajit Das <bdas@suse.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Fri, 27 Jun 2025 14:03:53 +0000 (16:03 +0200)]
btrfs: accessors: delete token versions of set/get helpers
Once upon a time there was a need to cache address of extent buffer
pages, as it was a costly operation (map_private_extent_buffer(),
cfed81a04eb555 ("Btrfs: add the ability to cache a pointer into the
eb")). This was not even due to use of HIGHMEM, this had been removed
before that due to possible locking issues (
a65917156e3459 ("Btrfs:
stop using highmem for extent_buffers")).
Over time the amount of work in the set/get helpers got reduced and
became quite straightforward bounds checking with an unaligned
read/write, commit
db3756c879773c ("btrfs: remove unused
map_private_extent_buffer").
The actual caching of the page_address()/folio_address() in the token
was more work for very little gain. This depended on subsequent access
into the same page/folio, otherwise the cached pointer had to be
updated.
For metadata-heavy operations this showed up in the 'perf top' profile
where the btrfs_get_token_32() calls were at the top, on my testing
machine consuming about 2-3%. The other generic 32/64 bit helpers also
appeared in the profile with similar fraction.
After removing use of the token helpers we can remove them completely,
this leads to reduction of btrfs.ko by 6.7KiB on release config.
text data bss dec hex filename
1463289 115665 16088
1595042 1856a2 pre/btrfs.ko
1456601 115665 16088
1588354 183c82 post/btrfs.ko
DELTA: -6688
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Fri, 27 Jun 2025 14:03:52 +0000 (16:03 +0200)]
btrfs: tree-log: don't use token set/get accessors in fill_inode_item()
The token versions of set/get accessors will be removed, use the normal
helpers.
There's additional overhead of the token helpers that update the cached
address in case it moves to another page/folio. The normal versions
don't need to do that.
Note this is similar to fill_inode_item() in inode.c but with slight
differences. The two functions could be deduplicated eventually.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Fri, 27 Jun 2025 14:03:51 +0000 (16:03 +0200)]
btrfs: don't use token set/get accessors in inode.c:fill_inode_item()
The token versions of set/get accessors will be removed, use the normal
helpers.
There's additional overhead of the token helpers that update the cached
address in case it moves to another page/folio. The normal versions
don't need to do that.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Fri, 27 Jun 2025 14:03:50 +0000 (16:03 +0200)]
btrfs: don't use token set/get accessors for btrfs_item members
The token versions of set/get accessors will be removed, use the normal
helpers. The btrfs_item members use that interface the most but there
are no real benefits anymore.
This reduces stack consumption on x86_64 release config:
setup_items_for_insert -32 (144 -> 112)
__push_leaf_left -32 (176 -> 144)
btrfs_extend_item -16 (104 -> 88)
copy_for_split -32 (144 -> 112)
btrfs_del_items -16 (144 -> 128)
btrfs_truncate_item -24 (152 -> 128)
__push_leaf_right -24 (168 -> 144)
and module size:
text data bss dec hex filename
1463615 115665 16088
1595368 1857e8 pre/btrfs.ko
1463413 115665 16088
1595166 18571e post/btrfs.ko
DELTA: -202
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Mon, 30 Jun 2025 12:30:44 +0000 (13:30 +0100)]
btrfs: qgroup: remove no longer used fs_info->qgroup_ulist
It's not used anymore after commit
091344508249 ("btrfs: qgroup: use
qgroup_iterator in qgroup_convert_meta()"), so remove it.
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Mon, 30 Jun 2025 12:19:20 +0000 (13:19 +0100)]
btrfs: qgroup: fix race between quota disable and quota rescan ioctl
There's a race between a task disabling quotas and another running the
rescan ioctl that can result in a use-after-free of qgroup records from
the fs_info->qgroup_tree rbtree.
This happens as follows:
1) Task A enters btrfs_ioctl_quota_rescan() -> btrfs_qgroup_rescan();
2) Task B enters btrfs_quota_disable() and calls
btrfs_qgroup_wait_for_completion(), which does nothing because at that
point fs_info->qgroup_rescan_running is false (it wasn't set yet by
task A);
3) Task B calls btrfs_free_qgroup_config() which starts freeing qgroups
from fs_info->qgroup_tree without taking the lock fs_info->qgroup_lock;
4) Task A enters qgroup_rescan_zero_tracking() which starts iterating
the fs_info->qgroup_tree tree while holding fs_info->qgroup_lock,
but task B is freeing qgroup records from that tree without holding
the lock, resulting in a use-after-free.
Fix this by taking fs_info->qgroup_lock at btrfs_free_qgroup_config().
Also at btrfs_qgroup_rescan() don't start the rescan worker if quotas
were already disabled.
Reported-by: cen zhang <zzzccc427@gmail.com>
Link: https://lore.kernel.org/linux-btrfs/CAFRLqsV+cMDETFuzqdKSHk_FDm6tneea45krsHqPD6B3FetLpQ@mail.gmail.com/
CC: stable@vger.kernel.org # 6.1+
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Mon, 30 Jun 2025 09:50:46 +0000 (10:50 +0100)]
btrfs: clear dirty status from extent buffer on error at insert_new_root()
If we failed to insert the tree mod log operation, we are not removing the
dirty status from the allocated and dirtied extent buffer before we free
it. Removing the dirty status is needed for several reasons such as to
adjust the fs_info->dirty_metadata_bytes counter and remove the dirty
status from the respective folios. So add the missing call to
btrfs_clear_buffer_dirty().
Fixes:
f61aa7ba08ab ("btrfs: do not BUG_ON() on tree mod log failure at insert_new_root()")
CC: stable@vger.kernel.org # 6.6+
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Johannes Thumshirn [Mon, 30 Jun 2025 14:47:35 +0000 (16:47 +0200)]
btrfs: change dump_block_groups() in btrfs_dump_space_info() from int to bool
btrfs_dump_space_info()'s parameter dump_block_groups is used as a boolean
although it is defined as an integer.
Change it from int to bool.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Daniel Vacek <neelx@suse.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Fri, 27 Jun 2025 11:01:17 +0000 (13:01 +0200)]
btrfs: use pgoff_t for page index variables
Any conversion of offsets in the logical or the physical mapping space
of the pages is done by a shift and the target type should be pgoff_t
(type of struct page::index). Fix the locations where it's still
unsigned long.
Signed-off-by: David Sterba <dsterba@suse.com>
George Hu [Sat, 28 Jun 2025 05:21:30 +0000 (13:21 +0800)]
btrfs: replace nested usage of min & max with clamp in btrfs_compress_set_level()
Refactor the btrfs_compress_set_level() function by replacing the
nested usage of min() and max() macro with clamp() to simplify the
code and improve readability.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: George Hu <integral@archlinux.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Dmitry Antipov [Fri, 27 Jun 2025 08:51:17 +0000 (11:51 +0300)]
btrfs: send: avoid extra calls to strlen() in gen_unique_name()
Since 'snprintf()' returns the number of characters which would
be emitted and output truncation is handled by 'ASSERT()', it
should be safe to use that return value instead of the subsequent
calls to 'strlen()' in 'gen_unique_name()'.
This also reduces the module's text size.
Before:
$ size fs/btrfs/btrfs.ko
text data bss dec hex filename
1897006 161571 16136
2074713 1fa859 fs/btrfs/btrfs.ko
After:
$ size fs/btrfs/btrfs.ko
text data bss dec hex filename
1896848 161571 16136
2074555 1fa7bb fs/btrfs/btrfs.ko
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Dmitry Antipov <dmantipov@yandex.ru>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Thu, 26 Jun 2025 15:57:02 +0000 (16:57 +0100)]
btrfs: qgroup: avoid memory allocation if qgroups are not enabled
At btrfs_qgroup_inherit() we allocate a qgroup record even if qgroups are
not enabled, which is unnecessary overhead and can result in subvolume
creation to fail with -ENOMEM, as create_subvol() calls this function.
Improve on this by making btrfs_qgroup_inherit() check if qgroups are
enabled earlier and return if they are not, avoiding the unnecessary
memory allocation and taking some locks.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Wed, 25 Jun 2025 15:16:05 +0000 (16:16 +0100)]
btrfs: qgroup: remove pointless error check for add_qgroup_rb() call
The add_qgroup_rb() function never returns an error pointer anymore since
commit
8d54518b5e52 ("btrfs: qgroup: pre-allocate btrfs_qgroup to reduce
GFP_ATOMIC usage"), so checking for an error pointer result at
btrfs_quota_enable() is pointless.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Mon, 23 Jun 2025 12:15:58 +0000 (13:15 +0100)]
btrfs: split btrfs_is_fstree() into multiple if statements for readability
Instead of a single if statement with multiple conditions, split it into
several if statements testing only one condition at a time and return true
or false immediately after. This makes it more immediate to reason.
The module's text size also slightly decreases, at least with gcc 14.2.0
on x86_64.
Before:
$ size fs/btrfs/btrfs.ko
text data bss dec hex filename
1897138 161583 16136
2074857 1fa8e9 fs/btrfs/btrfs.ko
After:
$ size fs/btrfs/btrfs.ko
text data bss dec hex filename
1896976 161583 16136
2074695 1fa847 fs/btrfs/btrfs.ko
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Mon, 23 Jun 2025 12:13:23 +0000 (13:13 +0100)]
btrfs: add btrfs prefix to is_fstree() and make it return bool
This is an exported function and therefore it should have a 'btrfs_'
prefix, to make it clear it's btrfs specific, avoid future name collisions
with code outside btrfs, and make its naming consistent with most other
btrfs exported functions.
So add a 'btrfs_' prefix to it and make it return bool instead of int,
since all we need is to return true or false.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Mon, 23 Jun 2025 11:54:40 +0000 (12:54 +0100)]
btrfs: split inode extref processing from __add_inode_ref() into a helper
The __add_inode_ref() function is quite big and with too much nesting, so
move the code that processes inode extrefs into a helper function, to make
the function easier to read and reduce the level of indentation too.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Mon, 23 Jun 2025 10:22:48 +0000 (11:22 +0100)]
btrfs: split inode ref processing from __add_inode_ref() into a helper
The __add_inode_ref() function is quite big and with too much nesting, so
move the code that processes inode refs into a helper function, to make
the function easier to read and reduce the level of indentation too.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Fri, 20 Jun 2025 15:58:48 +0000 (16:58 +0100)]
btrfs: use btrfs inodes in btrfs_rmdir() to avoid so much usage of BTRFS_I()
Almost everywhere we want to use a btrfs inode and therefore we have a
lot of calls to BTRFS_I(), making the code more verbose. Instead use btrfs
inode local variables to avoid so much use of BTRFS_I().
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Fri, 20 Jun 2025 15:50:31 +0000 (16:50 +0100)]
btrfs: use inode already stored in local variable at btrfs_rmdir()
There's no need to call d_inode(dentry) when calling btrfs_unlink_inode()
since we have already stored that in a local inode variable. So just use
the local variable to make the code less verbose.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Fri, 20 Jun 2025 16:06:45 +0000 (18:06 +0200)]
btrfs: use our message helpers instead of pr_err/pr_warn/pr_info
Our message helpers accept NULL for the fs_info in the context that does
not provide and print the common header of the message. The use of pr_*
helpers is only for special reasons, like module loading, device
scanning or multi-line output (print-tree).
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Sun YangKai [Thu, 12 Jun 2025 08:32:23 +0000 (16:32 +0800)]
btrfs: remove partial support for lowest level from btrfs_search_forward()
Commit
323ac95bce44 ("Btrfs: don't read leaf blocks containing only
checksums during truncate") changed the condition from `level == 0` to
`level == path->lowest_level`, while its original purpose was just to do
some leaf node handling (calling btrfs_item_key_to_cpu()) and skip some
code that doesn't fit leaf nodes.
After changing the condition, the code path:
1. Also handles the non-leaf nodes when path->lowest_level is nonzero,
which is wrong. However btrfs_search_forward() is never called with a
nonzero path->lowest_level, which makes this bug not found before.
2. Makes the later if block with the same condition, which was originally
used to handle non-leaf node (calling btrfs_node_key_to_cpu()) when
lowest_level is not zero, dead code.
Since btrfs_search_forward() is never called for a path with a
lowest_level different from zero, just completely remove the partial
support for a non-zero lowest_level, simplifying a bit the code, and
assert that lowest_level is zero at the start of the function.
Suggested-by: Qu Wenruo <wqu@suse.com>
Fixes:
323ac95bce44 ("Btrfs: don't read leaf blocks containing only checksums during truncate")
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Sun YangKai <sunk67188@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Qianfeng Rong [Thu, 19 Jun 2025 10:15:01 +0000 (18:15 +0800)]
btrfs: use folio_next_index() helper in check_range_has_page()
Simplify code pattern of 'folio->index + folio_nr_pages(folio)' by using
the existing helper folio_next_index().
Signed-off-by: Qianfeng Rong <rongqianfeng@vivo.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>