Josef Bacik [Thu, 10 Feb 2022 22:44:24 +0000 (17:44 -0500)]
btrfs: do not double complete bio on errors during compressed reads
I hit some weird panics while fixing up the error handling from
btrfs_lookup_bio_sums(). Turns out the compression path will complete
the bio we use if we set up any of the compression bios and then return
an error, and then btrfs_submit_data_bio() will also call bio_endio() on
the bio.
Fix this by making btrfs_submit_compressed_read() responsible for
calling bio_endio() on the bio if there are any errors. Currently it
was only doing it if we created the compression bios, otherwise it was
depending on btrfs_submit_data_bio() to do the right thing. This
creates the above problem, so fix up btrfs_submit_compressed_read() to
always call bio_endio() in case of an error, and then simply return from
btrfs_submit_data_bio() if we had to call
btrfs_submit_compressed_read().
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Josef Bacik [Thu, 10 Feb 2022 22:44:23 +0000 (17:44 -0500)]
btrfs: track compressed bio errors as blk_status_t
Right now we just have a binary "errors" flag, so any error we get on
the compressed bio's gets translated to EIO. This isn't necessarily a
bad thing, but if we get an ENOMEM it may be nice to know that's what
happened instead of an EIO. Track our errors as a blk_status_t, and do
the appropriate setting of the errors accordingly.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Josef Bacik [Thu, 10 Feb 2022 22:44:22 +0000 (17:44 -0500)]
btrfs: remove the bio argument from finish_compressed_bio_read
This bio is usually one of the compressed bio's, and we don't actually
need it in this function, so remove the argument and stop passing it
around.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Josef Bacik [Thu, 10 Feb 2022 22:44:21 +0000 (17:44 -0500)]
btrfs: check correct bio in finish_compressed_bio_read
Commit
c09abff87f90 ("btrfs: cloned bios must not be iterated by
bio_for_each_segment_all") added ASSERT()'s to make sure we weren't
calling bio_for_each_segment_all() on a RAID5/6 bio. However it was
checking the bio that the compression code passed in, not the
cb->orig_bio that we actually iterate over, so adjust this ASSERT() to
check the correct bio.
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Josef Bacik [Thu, 10 Feb 2022 22:44:20 +0000 (17:44 -0500)]
btrfs: handle csum lookup errors properly on reads
Currently any error we get while trying to lookup csums during reads
shows up as a missing csum, and then on the read completion side we spit
out an error saying there was a csum mismatch and we increase the device
corruption count.
However we could have gotten an EIO from the lookup. We could also be
inside of a memory constrained container and gotten a ENOMEM while
trying to do the read. In either case we don't want to make this look
like a file system corruption problem, we want to make it look like the
actual error it is. Capture any negative value, convert it to the
appropriate blk_sts_t, free the csum array if we have one and bail.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Josef Bacik [Thu, 10 Feb 2022 22:44:19 +0000 (17:44 -0500)]
btrfs: make search_csum_tree return 0 if we get -EFBIG
We can either fail to find a csum entry at all and return -ENOENT, or we
can find a range that is close, but return -EFBIG. In essence these
both mean the same thing when we are doing a lookup for a csum in an
existing range, we didn't find a csum. We want to treat both of these
errors the same way, complain loudly that there wasn't a csum. This
currently happens anyway because we do
count = search_csum_tree();
if (count <= 0) {
// reloc and error handling
}
however it forces us to incorrectly treat EIO or ENOMEM errors as on
disk corruption. Fix this by returning 0 if we get either -ENOENT or
-EFBIG from btrfs_lookup_csum() so we can do proper error handling.
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Qu Wenruo [Fri, 11 Feb 2022 06:41:42 +0000 (14:41 +0800)]
btrfs: defrag: use control structure in btrfs_defrag_file()
Use defrag control structure and replace function arguments. This brings
the following benefits:
- No more strange range->start update to indicate last scanned bytenr
We have btrfs_defrag_ctrl::last_scanned (exclusive) for it directly.
- No more return value to indicate defragged sectors
Now btrfs_defrag_file() will just return 0 if no error happened.
And btrfs_defrag_ctrl::sectors_defragged will show that value.
- Fewer parameters to carry around
Now most defrag_* functions only need to fetch their policy parameters
from btrfs_defrag_ctrl directly.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Qu Wenruo [Fri, 11 Feb 2022 06:41:41 +0000 (14:41 +0800)]
btrfs: defrag: introduce control structure for later use
Currently btrfs_defrag_file() accepts not only
btrfs_ioctl_defrag_range_args but also other parameters like @newer_than
and @max_sectors_to_defrag for extra policies.
Those extra values are hidden from defrag ioctl and even caused bugs in
the past due to different behavior based on those extra values.
Here we introduce a new structure, btrfs_defrag_ctrl, to include:
- all members in btrfs_ioctl_defrag_range_args
- @max_sectors_to_defrag and @newer_than
- Extra values which callers of btrfs_defrag_file() may care
Like @sectors_defragged and @last_scanned.
With the new structure, also introduce a new helper,
btrfs_defrag_ioctl_args_to_ctrl() to:
- Do extra sanity check on @compress and @flags
- Do range alignment when possible
- Set default values.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Qu Wenruo [Fri, 11 Feb 2022 06:41:40 +0000 (14:41 +0800)]
btrfs: uapi: introduce BTRFS_DEFRAG_RANGE_MASK for later sanity check
Add a mask for all supported defrag flags. Replace the hard coded bit
constants with ones defined by bit shifts.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Qu Wenruo [Fri, 28 Jan 2022 07:21:22 +0000 (15:21 +0800)]
btrfs: defrag: remove an ambiguous condition for rejection
From the very beginning of btrfs defrag, there is a check to reject
extents which meet both conditions:
- Physically adjacent
We may want to defrag physically adjacent extents to reduce the number
of extents or the size of subvolume tree.
- Larger than 128K
This may be there for compressed extents, but unfortunately 128K is
exactly the max capacity for compressed extents.
And the check is > 128K, thus it never rejects compressed extents.
Furthermore, the compressed extent capacity bug is fixed by previous
patch, there is no reason for that check anymore.
The original check has a very small ranges to reject (the target extent
size is > 128K, and default extent threshold is 256K), and for
compressed extent it doesn't work at all.
So it's better just to remove the rejection, and allow us to defrag
physically adjacent extents.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Qu Wenruo [Fri, 28 Jan 2022 07:21:21 +0000 (15:21 +0800)]
btrfs: defrag: don't defrag extents which are already at max capacity
[BUG]
For compressed extents, defrag ioctl will always try to defrag any
compressed extents, wasting not only IO but also CPU time to
compress/decompress:
mkfs.btrfs -f $DEV
mount -o compress $DEV $MNT
xfs_io -f -c "pwrite -S 0xab 0 128K" $MNT/foobar
sync
xfs_io -f -c "pwrite -S 0xcd 128K 128K" $MNT/foobar
sync
echo "=== before ==="
xfs_io -c "fiemap -v" $MNT/foobar
btrfs filesystem defrag $MNT/foobar
sync
echo "=== after ==="
xfs_io -c "fiemap -v" $MNT/foobar
Then it shows the 2 128K extents just get COW for no extra benefit, with
extra IO/CPU spent:
=== before ===
/mnt/btrfs/file1:
EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS
0: [0..255]: 26624..26879 256 0x8
1: [256..511]: 26632..26887 256 0x9
=== after ===
/mnt/btrfs/file1:
EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS
0: [0..255]: 26640..26895 256 0x8
1: [256..511]: 26648..26903 256 0x9
This affects not only v5.16 (after the defrag rework), but also v5.15
(before the defrag rework).
[CAUSE]
From the very beginning, btrfs defrag never checks if one extent is
already at its max capacity (128K for compressed extents, 128M
otherwise).
And the default extent size threshold is 256K, which is already beyond
the compressed extent max size.
This means, by default btrfs defrag ioctl will mark all compressed
extent which is not adjacent to a hole/preallocated range for defrag.
[FIX]
Introduce a helper to grab the maximum extent size, and then in
defrag_collect_targets() and defrag_check_next_extent(), reject extents
which are already at their max capacity.
Reported-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Qu Wenruo [Fri, 28 Jan 2022 07:21:20 +0000 (15:21 +0800)]
btrfs: defrag: don't try to merge regular extents with preallocated extents
[BUG]
With older kernels (before v5.16), btrfs will defrag preallocated extents.
While with newer kernels (v5.16 and newer) btrfs will not defrag
preallocated extents, but it will defrag the extent just before the
preallocated extent, even it's just a single sector.
This can be exposed by the following small script:
mkfs.btrfs -f $dev > /dev/null
mount $dev $mnt
xfs_io -f -c "pwrite 0 4k" -c sync -c "falloc 4k 16K" $mnt/file
xfs_io -c "fiemap -v" $mnt/file
btrfs fi defrag $mnt/file
sync
xfs_io -c "fiemap -v" $mnt/file
The output looks like this on older kernels:
/mnt/btrfs/file:
EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS
0: [0..7]: 26624..26631 8 0x0
1: [8..39]: 26632..26663 32 0x801
/mnt/btrfs/file:
EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS
0: [0..39]: 26664..26703 40 0x1
Which defrags the single sector along with the preallocated extent, and
replace them with an regular extent into a new location (caused by data
COW).
This wastes most of the data IO just for the preallocated range.
On the other hand, v5.16 is slightly better:
/mnt/btrfs/file:
EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS
0: [0..7]: 26624..26631 8 0x0
1: [8..39]: 26632..26663 32 0x801
/mnt/btrfs/file:
EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS
0: [0..7]: 26664..26671 8 0x0
1: [8..39]: 26632..26663 32 0x801
The preallocated range is not defragged, but the sector before it still
gets defragged, which has no need for it.
[CAUSE]
One of the function reused by the old and new behavior is
defrag_check_next_extent(), it will determine if we should defrag
current extent by checking the next one.
It only checks if the next extent is a hole or inlined, but it doesn't
check if it's preallocated.
On the other hand, out of the function, both old and new kernel will
reject preallocated extents.
Such inconsistent behavior causes above behavior.
[FIX]
- Also check if next extent is preallocated
If so, don't defrag current extent.
- Add comments for each branch why we reject the extent
This will reduce the IO caused by defrag ioctl and autodefrag.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Qu Wenruo [Fri, 11 Feb 2022 06:41:39 +0000 (14:41 +0800)]
btrfs: defrag: allow defrag_one_cluster() to skip large extent which is not a target
In the rework of btrfs_defrag_file(), we always call
defrag_one_cluster() and increase the offset by cluster size, which is
only 256K.
But there are cases where we have a large extent (e.g. 128M) which
doesn't need to be defragged at all.
Before the refactor, we can directly skip the range, but now we have to
scan that extent map again and again until the cluster moves after the
non-target extent.
Fix the problem by allow defrag_one_cluster() to increase
btrfs_defrag_ctrl::last_scanned to the end of an extent, if and only if
the last extent of the cluster is not a target.
The test script looks like this:
mkfs.btrfs -f $dev > /dev/null
mount $dev $mnt
# As btrfs ioctl uses 32M as extent_threshold
xfs_io -f -c "pwrite 0 64M" $mnt/file1
sync
# Some fragemented range to defrag
xfs_io -s -c "pwrite 65548k 4k" \
-c "pwrite 65544k 4k" \
-c "pwrite 65540k 4k" \
-c "pwrite 65536k 4k" \
$mnt/file1
sync
echo "=== before ==="
xfs_io -c "fiemap -v" $mnt/file1
echo "=== after ==="
btrfs fi defrag $mnt/file1
sync
xfs_io -c "fiemap -v" $mnt/file1
umount $mnt
With extra ftrace put into defrag_one_cluster(), before the patch it
would result tons of loops:
(As defrag_one_cluster() is inlined, the function name is its caller)
btrfs-126062 [005] ..... 4682.816026: btrfs_defrag_file: r/i=5/257 start=0 len=262144
btrfs-126062 [005] ..... 4682.816027: btrfs_defrag_file: r/i=5/257 start=262144 len=262144
btrfs-126062 [005] ..... 4682.816028: btrfs_defrag_file: r/i=5/257 start=524288 len=262144
btrfs-126062 [005] ..... 4682.816028: btrfs_defrag_file: r/i=5/257 start=786432 len=262144
btrfs-126062 [005] ..... 4682.816028: btrfs_defrag_file: r/i=5/257 start=
1048576 len=262144
...
btrfs-126062 [005] ..... 4682.816043: btrfs_defrag_file: r/i=5/257 start=
67108864 len=262144
But with this patch there will be just one loop, then directly to the
end of the extent:
btrfs-130471 [014] ..... 5434.029558: defrag_one_cluster: r/i=5/257 start=0 len=262144
btrfs-130471 [014] ..... 5434.029559: defrag_one_cluster: r/i=5/257 start=
67108864 len=16384
CC: stable@vger.kernel.org # 5.16
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Sidong Yang [Fri, 11 Feb 2022 13:48:29 +0000 (13:48 +0000)]
btrfs: qgroup: remove outdated TODO comments
These comments are old, outdated and not very specific. It seems that it
doesn't help to inspire anybody to work on that. So we remove them.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Sidong Yang <realwakka@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Sidong Yang [Sun, 6 Feb 2022 12:52:48 +0000 (12:52 +0000)]
btrfs: qgroup: remove duplicated check in adding qgroup relations
Removes duplicated check when adding qgroup relations.
btrfs_add_qgroup_relations function adds relations by calling
add_relation_rb(). add_relation_rb() checks that member/parentid exists
in current qgroup_tree. But it already checked before calling the
function. It seems that we don't need to double check.
Add new function __add_relation_rb() that adds relations with
qgroup structures and makes old function use the new one. And it makes
btrfs_add_qgroup_relation() function work without double checks by
calling the new function.
Signed-off-by: Sidong Yang <realwakka@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add comments ]
Signed-off-by: David Sterba <dsterba@suse.com>
Dāvis Mosāns [Wed, 2 Feb 2022 21:44:54 +0000 (23:44 +0200)]
btrfs: add lzo workspace buffer length constants
It makes it more readable for length checking and is be used repeatedly.
Signed-off-by: Dāvis Mosāns <davispuh@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Dāvis Mosāns [Wed, 2 Feb 2022 21:44:55 +0000 (23:44 +0200)]
btrfs: prevent copying too big compressed lzo segment
Compressed length can be corrupted to be a lot larger than memory
we have allocated for buffer.
This will cause memcpy in copy_compressed_segment to write outside
of allocated memory.
This mostly results in stuck read syscall but sometimes when using
btrfs send can get #GP
kernel: general protection fault, probably for non-canonical address 0x841551d5c1000: 0000 [#1] PREEMPT SMP NOPTI
kernel: CPU: 17 PID: 264 Comm: kworker/u256:7 Tainted: P OE 5.17.0-rc2-1 #12
kernel: Workqueue: btrfs-endio btrfs_work_helper [btrfs]
kernel: RIP: 0010:lzo_decompress_bio (./include/linux/fortify-string.h:225 fs/btrfs/lzo.c:322 fs/btrfs/lzo.c:394) btrfs
Code starting with the faulting instruction
===========================================
0:* 48 8b 06 mov (%rsi),%rax <-- trapping instruction
3: 48 8d 79 08 lea 0x8(%rcx),%rdi
7: 48 83 e7 f8 and $0xfffffffffffffff8,%rdi
b: 48 89 01 mov %rax,(%rcx)
e: 44 89 f0 mov %r14d,%eax
11: 48 8b 54 06 f8 mov -0x8(%rsi,%rax,1),%rdx
kernel: RSP: 0018:
ffffb110812efd50 EFLAGS:
00010212
kernel: RAX:
0000000000001000 RBX:
000000009ca264c8 RCX:
ffff98996e6d8ff8
kernel: RDX:
0000000000000064 RSI:
000841551d5c1000 RDI:
ffffffff9500435d
kernel: RBP:
ffff989a3be856c0 R08:
0000000000000000 R09:
0000000000000000
kernel: R10:
0000000000000000 R11:
0000000000001000 R12:
ffff98996e6d8000
kernel: R13:
0000000000000008 R14:
0000000000001000 R15:
000841551d5c1000
kernel: FS:
0000000000000000(0000) GS:
ffff98a09d640000(0000) knlGS:
0000000000000000
kernel: CS: 0010 DS: 0000 ES: 0000 CR0:
0000000080050033
kernel: CR2:
00001e9f984d9ea8 CR3:
000000014971a000 CR4:
00000000003506e0
kernel: Call Trace:
kernel: <TASK>
kernel: end_compressed_bio_read (fs/btrfs/compression.c:104 fs/btrfs/compression.c:1363 fs/btrfs/compression.c:323) btrfs
kernel: end_workqueue_fn (fs/btrfs/disk-io.c:1923) btrfs
kernel: btrfs_work_helper (fs/btrfs/async-thread.c:326) btrfs
kernel: process_one_work (./arch/x86/include/asm/jump_label.h:27 ./include/linux/jump_label.h:212 ./include/trace/events/workqueue.h:108 kernel/workqueue.c:2312)
kernel: worker_thread (./include/linux/list.h:292 kernel/workqueue.c:2455)
kernel: ? process_one_work (kernel/workqueue.c:2397)
kernel: kthread (kernel/kthread.c:377)
kernel: ? kthread_complete_and_exit (kernel/kthread.c:332)
kernel: ret_from_fork (arch/x86/entry/entry_64.S:301)
kernel: </TASK>
CC: stable@vger.kernel.org # 4.9+
Signed-off-by: Dāvis Mosāns <davispuh@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Dāvis Mosāns [Sat, 5 Feb 2022 18:48:23 +0000 (20:48 +0200)]
btrfs: send: in case of IO error log it
Currently if we get IO error while doing send then we abort without
logging information about which file caused issue. So log it to help
with debugging.
CC: stable@vger.kernel.org # 4.9+
Signed-off-by: Dāvis Mosāns <davispuh@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Wed, 2 Feb 2022 15:26:09 +0000 (15:26 +0000)]
btrfs: get rid of warning on transaction commit when using flushoncommit
When using the flushoncommit mount option, during almost every transaction
commit we trigger a warning from __writeback_inodes_sb_nr():
$ cat fs/fs-writeback.c:
(...)
static void __writeback_inodes_sb_nr(struct super_block *sb, ...
{
(...)
WARN_ON(!rwsem_is_locked(&sb->s_umount));
(...)
}
(...)
The trace produced in dmesg looks like the following:
[947.473890] WARNING: CPU: 5 PID: 930 at fs/fs-writeback.c:2610 __writeback_inodes_sb_nr+0x7e/0xb3
[947.481623] Modules linked in: nfsd nls_cp437 cifs asn1_decoder cifs_arc4 fscache cifs_md4 ipmi_ssif
[947.489571] CPU: 5 PID: 930 Comm: btrfs-transacti Not tainted
95.16.3-srb-asrock-00001-g36437ad63879 #186
[947.497969] RIP: 0010:__writeback_inodes_sb_nr+0x7e/0xb3
[947.502097] Code: 24 10 4c 89 44 24 18 c6 (...)
[947.519760] RSP: 0018:
ffffc90000777e10 EFLAGS:
00010246
[947.523818] RAX:
0000000000000000 RBX:
0000000000963300 RCX:
0000000000000000
[947.529765] RDX:
0000000000000000 RSI:
000000000000fa51 RDI:
ffffc90000777e50
[947.535740] RBP:
ffff888101628a90 R08:
ffff888100955800 R09:
ffff888100956000
[947.541701] R10:
0000000000000002 R11:
0000000000000001 R12:
ffff888100963488
[947.547645] R13:
ffff888100963000 R14:
ffff888112fb7200 R15:
ffff888100963460
[947.553621] FS:
0000000000000000(0000) GS:
ffff88841fd40000(0000) knlGS:
0000000000000000
[947.560537] CS: 0010 DS: 0000 ES: 0000 CR0:
0000000080050033
[947.565122] CR2:
0000000008be50c4 CR3:
000000000220c000 CR4:
00000000001006e0
[947.571072] Call Trace:
[947.572354] <TASK>
[947.573266] btrfs_commit_transaction+0x1f1/0x998
[947.576785] ? start_transaction+0x3ab/0x44e
[947.579867] ? schedule_timeout+0x8a/0xdd
[947.582716] transaction_kthread+0xe9/0x156
[947.585721] ? btrfs_cleanup_transaction.isra.0+0x407/0x407
[947.590104] kthread+0x131/0x139
[947.592168] ? set_kthread_struct+0x32/0x32
[947.595174] ret_from_fork+0x22/0x30
[947.597561] </TASK>
[947.598553] ---[ end trace
644721052755541c ]---
This is because we started using writeback_inodes_sb() to flush delalloc
when committing a transaction (when using -o flushoncommit), in order to
avoid deadlocks with filesystem freeze operations. This change was made
by commit
ce8ea7cc6eb313 ("btrfs: don't call btrfs_start_delalloc_roots
in flushoncommit"). After that change we started producing that warning,
and every now and then a user reports this since the warning happens too
often, it spams dmesg/syslog, and a user is unsure if this reflects any
problem that might compromise the filesystem's reliability.
We can not just lock the sb->s_umount semaphore before calling
writeback_inodes_sb(), because that would at least deadlock with
filesystem freezing, since at fs/super.c:freeze_super() sync_filesystem()
is called while we are holding that semaphore in write mode, and that can
trigger a transaction commit, resulting in a deadlock. It would also
trigger the same type of deadlock in the unmount path. Possibly, it could
also introduce some other locking dependencies that lockdep would report.
To fix this call try_to_writeback_inodes_sb() instead of
writeback_inodes_sb(), because that will try to read lock sb->s_umount
and then will only call writeback_inodes_sb() if it was able to lock it.
This is fine because the cases where it can't read lock sb->s_umount
are during a filesystem unmount or during a filesystem freeze - in those
cases sb->s_umount is write locked and sync_filesystem() is called, which
calls writeback_inodes_sb(). In other words, in all cases where we can't
take a read lock on sb->s_umount, writeback is already being triggered
elsewhere.
An alternative would be to call btrfs_start_delalloc_roots() with a
number of pages different from LONG_MAX, for example matching the number
of delalloc bytes we currently have, in which case we would end up
starting all delalloc with filemap_fdatawrite_wbc() and not with an
async flush via filemap_flush() - that is only possible after the rather
recent commit
e076ab2a2ca70a ("btrfs: shrink delalloc pages instead of
full inodes"). However that creates a whole new can of worms due to new
lock dependencies, which lockdep complains, like for example:
[ 8948.247280] ======================================================
[ 8948.247823] WARNING: possible circular locking dependency detected
[ 8948.248353] 5.17.0-rc1-btrfs-next-111 #1 Not tainted
[ 8948.248786] ------------------------------------------------------
[ 8948.249320] kworker/u16:18/933570 is trying to acquire lock:
[ 8948.249812]
ffff9b3de1591690 (sb_internal#2){.+.+}-{0:0}, at: find_free_extent+0x141e/0x1590 [btrfs]
[ 8948.250638]
but task is already holding lock:
[ 8948.251140]
ffff9b3e09c717d8 (&root->delalloc_mutex){+.+.}-{3:3}, at: start_delalloc_inodes+0x78/0x400 [btrfs]
[ 8948.252018]
which lock already depends on the new lock.
[ 8948.252710]
the existing dependency chain (in reverse order) is:
[ 8948.253343]
-> #2 (&root->delalloc_mutex){+.+.}-{3:3}:
[ 8948.253950] __mutex_lock+0x90/0x900
[ 8948.254354] start_delalloc_inodes+0x78/0x400 [btrfs]
[ 8948.254859] btrfs_start_delalloc_roots+0x194/0x2a0 [btrfs]
[ 8948.255408] btrfs_commit_transaction+0x32f/0xc00 [btrfs]
[ 8948.255942] btrfs_mksubvol+0x380/0x570 [btrfs]
[ 8948.256406] btrfs_mksnapshot+0x81/0xb0 [btrfs]
[ 8948.256870] __btrfs_ioctl_snap_create+0x17f/0x190 [btrfs]
[ 8948.257413] btrfs_ioctl_snap_create_v2+0xbb/0x140 [btrfs]
[ 8948.257961] btrfs_ioctl+0x1196/0x3630 [btrfs]
[ 8948.258418] __x64_sys_ioctl+0x83/0xb0
[ 8948.258793] do_syscall_64+0x3b/0xc0
[ 8948.259146] entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 8948.259709]
-> #1 (&fs_info->delalloc_root_mutex){+.+.}-{3:3}:
[ 8948.260330] __mutex_lock+0x90/0x900
[ 8948.260692] btrfs_start_delalloc_roots+0x97/0x2a0 [btrfs]
[ 8948.261234] btrfs_commit_transaction+0x32f/0xc00 [btrfs]
[ 8948.261766] btrfs_set_free_space_cache_v1_active+0x38/0x60 [btrfs]
[ 8948.262379] btrfs_start_pre_rw_mount+0x119/0x180 [btrfs]
[ 8948.262909] open_ctree+0x1511/0x171e [btrfs]
[ 8948.263359] btrfs_mount_root.cold+0x12/0xde [btrfs]
[ 8948.263863] legacy_get_tree+0x30/0x50
[ 8948.264242] vfs_get_tree+0x28/0xc0
[ 8948.264594] vfs_kern_mount.part.0+0x71/0xb0
[ 8948.265017] btrfs_mount+0x11d/0x3a0 [btrfs]
[ 8948.265462] legacy_get_tree+0x30/0x50
[ 8948.265851] vfs_get_tree+0x28/0xc0
[ 8948.266203] path_mount+0x2d4/0xbe0
[ 8948.266554] __x64_sys_mount+0x103/0x140
[ 8948.266940] do_syscall_64+0x3b/0xc0
[ 8948.267300] entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 8948.267790]
-> #0 (sb_internal#2){.+.+}-{0:0}:
[ 8948.268322] __lock_acquire+0x12e8/0x2260
[ 8948.268733] lock_acquire+0xd7/0x310
[ 8948.269092] start_transaction+0x44c/0x6e0 [btrfs]
[ 8948.269591] find_free_extent+0x141e/0x1590 [btrfs]
[ 8948.270087] btrfs_reserve_extent+0x14b/0x280 [btrfs]
[ 8948.270588] cow_file_range+0x17e/0x490 [btrfs]
[ 8948.271051] btrfs_run_delalloc_range+0x345/0x7a0 [btrfs]
[ 8948.271586] writepage_delalloc+0xb5/0x170 [btrfs]
[ 8948.272071] __extent_writepage+0x156/0x3c0 [btrfs]
[ 8948.272579] extent_write_cache_pages+0x263/0x460 [btrfs]
[ 8948.273113] extent_writepages+0x76/0x130 [btrfs]
[ 8948.273573] do_writepages+0xd2/0x1c0
[ 8948.273942] filemap_fdatawrite_wbc+0x68/0x90
[ 8948.274371] start_delalloc_inodes+0x17f/0x400 [btrfs]
[ 8948.274876] btrfs_start_delalloc_roots+0x194/0x2a0 [btrfs]
[ 8948.275417] flush_space+0x1f2/0x630 [btrfs]
[ 8948.275863] btrfs_async_reclaim_data_space+0x108/0x1b0 [btrfs]
[ 8948.276438] process_one_work+0x252/0x5a0
[ 8948.276829] worker_thread+0x55/0x3b0
[ 8948.277189] kthread+0xf2/0x120
[ 8948.277506] ret_from_fork+0x22/0x30
[ 8948.277868]
other info that might help us debug this:
[ 8948.278548] Chain exists of:
sb_internal#2 --> &fs_info->delalloc_root_mutex --> &root->delalloc_mutex
[ 8948.279601] Possible unsafe locking scenario:
[ 8948.280102] CPU0 CPU1
[ 8948.280508] ---- ----
[ 8948.280915] lock(&root->delalloc_mutex);
[ 8948.281271] lock(&fs_info->delalloc_root_mutex);
[ 8948.281915] lock(&root->delalloc_mutex);
[ 8948.282487] lock(sb_internal#2);
[ 8948.282800]
*** DEADLOCK ***
[ 8948.283333] 4 locks held by kworker/u16:18/933570:
[ 8948.283750] #0:
ffff9b3dc00a9d48 ((wq_completion)events_unbound){+.+.}-{0:0}, at: process_one_work+0x1d2/0x5a0
[ 8948.284609] #1:
ffffa90349dafe70 ((work_completion)(&fs_info->async_data_reclaim_work)){+.+.}-{0:0}, at: process_one_work+0x1d2/0x5a0
[ 8948.285637] #2:
ffff9b3e14db5040 (&fs_info->delalloc_root_mutex){+.+.}-{3:3}, at: btrfs_start_delalloc_roots+0x97/0x2a0 [btrfs]
[ 8948.286674] #3:
ffff9b3e09c717d8 (&root->delalloc_mutex){+.+.}-{3:3}, at: start_delalloc_inodes+0x78/0x400 [btrfs]
[ 8948.287596]
stack backtrace:
[ 8948.287975] CPU: 3 PID: 933570 Comm: kworker/u16:18 Not tainted 5.17.0-rc1-btrfs-next-111 #1
[ 8948.288677] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
[ 8948.289649] Workqueue: events_unbound btrfs_async_reclaim_data_space [btrfs]
[ 8948.290298] Call Trace:
[ 8948.290517] <TASK>
[ 8948.290700] dump_stack_lvl+0x59/0x73
[ 8948.291026] check_noncircular+0xf3/0x110
[ 8948.291375] ? start_transaction+0x228/0x6e0 [btrfs]
[ 8948.291826] __lock_acquire+0x12e8/0x2260
[ 8948.292241] lock_acquire+0xd7/0x310
[ 8948.292714] ? find_free_extent+0x141e/0x1590 [btrfs]
[ 8948.293241] ? lock_is_held_type+0xea/0x140
[ 8948.293601] start_transaction+0x44c/0x6e0 [btrfs]
[ 8948.294055] ? find_free_extent+0x141e/0x1590 [btrfs]
[ 8948.294518] find_free_extent+0x141e/0x1590 [btrfs]
[ 8948.294957] ? _raw_spin_unlock+0x29/0x40
[ 8948.295312] ? btrfs_get_alloc_profile+0x124/0x290 [btrfs]
[ 8948.295813] btrfs_reserve_extent+0x14b/0x280 [btrfs]
[ 8948.296270] cow_file_range+0x17e/0x490 [btrfs]
[ 8948.296691] btrfs_run_delalloc_range+0x345/0x7a0 [btrfs]
[ 8948.297175] ? find_lock_delalloc_range+0x247/0x270 [btrfs]
[ 8948.297678] writepage_delalloc+0xb5/0x170 [btrfs]
[ 8948.298123] __extent_writepage+0x156/0x3c0 [btrfs]
[ 8948.298570] extent_write_cache_pages+0x263/0x460 [btrfs]
[ 8948.299061] extent_writepages+0x76/0x130 [btrfs]
[ 8948.299495] do_writepages+0xd2/0x1c0
[ 8948.299817] ? sched_clock_cpu+0xd/0x110
[ 8948.300160] ? lock_release+0x155/0x4a0
[ 8948.300494] filemap_fdatawrite_wbc+0x68/0x90
[ 8948.300874] ? do_raw_spin_unlock+0x4b/0xa0
[ 8948.301243] start_delalloc_inodes+0x17f/0x400 [btrfs]
[ 8948.301706] ? lock_release+0x155/0x4a0
[ 8948.302055] btrfs_start_delalloc_roots+0x194/0x2a0 [btrfs]
[ 8948.302564] flush_space+0x1f2/0x630 [btrfs]
[ 8948.302970] btrfs_async_reclaim_data_space+0x108/0x1b0 [btrfs]
[ 8948.303510] process_one_work+0x252/0x5a0
[ 8948.303860] ? process_one_work+0x5a0/0x5a0
[ 8948.304221] worker_thread+0x55/0x3b0
[ 8948.304543] ? process_one_work+0x5a0/0x5a0
[ 8948.304904] kthread+0xf2/0x120
[ 8948.305184] ? kthread_complete_and_exit+0x20/0x20
[ 8948.305598] ret_from_fork+0x22/0x30
[ 8948.305921] </TASK>
It all comes from the fact that btrfs_start_delalloc_roots() takes the
delalloc_root_mutex, in the transaction commit path we are holding a
read lock on one of the superblock's freeze semaphores (via
sb_start_intwrite()), the async reclaim task can also do a call to
btrfs_start_delalloc_roots(), which ends up triggering writeback with
calls to filemap_fdatawrite_wbc(), resulting in extent allocation which
in turn can call btrfs_start_transaction(), which will result in taking
the freeze semaphore via sb_start_intwrite(), forming a nasty dependency
on all those locks which can be taken in different orders by different
code paths.
So just adopt the simple approach of calling try_to_writeback_inodes_sb()
at btrfs_start_delalloc_flush().
Link: https://lore.kernel.org/linux-btrfs/20220130005258.GA7465@cuci.nl/
Link: https://lore.kernel.org/linux-btrfs/43acc426-d683-d1b6-729d-c6bc4a2fff4d@gmail.com/
Link: https://lore.kernel.org/linux-btrfs/6833930a-08d7-6fbc-0141-eb9cdfd6bb4d@gmail.com/
Link: https://lore.kernel.org/linux-btrfs/20190322041731.GF16651@hungrycats.org/
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
[ add more link reports ]
Signed-off-by: David Sterba <dsterba@suse.com>
Qu Wenruo [Tue, 8 Feb 2022 06:54:05 +0000 (14:54 +0800)]
btrfs: defrag: don't try to defrag extents which are under writeback
Once we start writeback (have called btrfs_run_delalloc_range()), we
allocate an extent, create an extent map point to that extent, with a
generation of (u64)-1, created the ordered extent and then clear the
DELALLOC bit from the range in the inode's io tree.
Such extent map can pass the first call of defrag_collect_targets(), as
its generation is (u64)-1, meets any possible minimal generation check.
And the range will not have DELALLOC bit, also passing the DELALLOC bit
check.
It will only be re-checked in the second call of
defrag_collect_targets(), which will wait for writeback.
But at that stage we have already spent our time waiting for some IO we
may or may not want to defrag.
Let's reject such extents early so we won't waste our time.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Qu Wenruo [Tue, 8 Feb 2022 05:31:19 +0000 (13:31 +0800)]
btrfs: populate extent_map::generation when reading from disk
When btrfs_get_extent() tries to get some file extent from disk, it
never populates extent_map::generation, leaving the value to be 0.
On the other hand, for extent map generated by IO, it will get its
generation properly set at finish_ordered_io()
finish_ordered_io()
|- unpin_extent_cache(gen = trans->transid)
|- em->generation = gen;
[CAUSE]
Since extent_map::generation is mostly used by fsync code, and for fsync
they only care about modified extents, which all have their
em::generation > 0.
Thus it's fine to not populate em read from disk for fsync.
[CORNER CASE]
However autodefrag also relies on em::generation to determine if one
extent needs to be defragged.
This unpopulated extent_map::generation can prevent the following
autodefrag case from working:
mkfs.btrfs -f $dev
mount $dev $mnt -o autodefrag
# initial write to queue the inode for autodefrag
xfs_io -f -c "pwrite 0 4k" $mnt/file
sync
# Real fragmented write
xfs_io -f -s -c "pwrite -b 4096 0 32k" $mnt/file
sync
echo "=== before autodefrag ==="
xfs_io -c "fiemap -v" $mnt/file
# Drop cache to force em to be read from disk
echo 3 > /proc/sys/vm/drop_caches
mount -o remount,commit=1 $mnt
sleep 3
sync
echo "=== After autodefrag ==="
xfs_io -c "fiemap -v" $mnt/file
umount $mnt
The result looks like this:
=== before autodefrag ===
/mnt/btrfs/file:
EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS
0: [0..15]: 26672..26687 16 0x0
1: [16..31]: 26656..26671 16 0x0
2: [32..47]: 26640..26655 16 0x0
3: [48..63]: 26624..26639 16 0x1
=== After autodefrag ===
/mnt/btrfs/file:
EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS
0: [0..15]: 26672..26687 16 0x0
1: [16..31]: 26656..26671 16 0x0
2: [32..47]: 26640..26655 16 0x0
3: [48..63]: 26624..26639 16 0x1
This fragmented 32K will not be defragged by autodefrag.
[FIX]
To make things less weird, just populate extent_map::generation when
reading file extents from disk.
This would make above fragmented extents to be properly defragged:
== before autodefrag ===
/mnt/btrfs/file:
EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS
0: [0..15]: 26672..26687 16 0x0
1: [16..31]: 26656..26671 16 0x0
2: [32..47]: 26640..26655 16 0x0
3: [48..63]: 26624..26639 16 0x1
=== After autodefrag ===
/mnt/btrfs/file:
EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS
0: [0..63]: 26688..26751 64 0x1
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Thu, 3 Feb 2022 15:36:45 +0000 (15:36 +0000)]
btrfs: assert we have a write lock when removing and replacing extent maps
Removing or replacing an extent map requires holding a write lock on the
extent map's tree. We currently do that everywhere, except in one of the
self tests, where it's harmless since there's no concurrency.
In order to catch possible races in the future, assert that we are holding
a write lock on the extent map tree before removing or replacing an extent
map in the tree, and update the self test to obtain a write lock before
removing extent maps.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Thu, 3 Feb 2022 15:36:44 +0000 (15:36 +0000)]
btrfs: remove no longer used counter when reading data page
After commit
92082d40976ed0 ("btrfs: integrate page status update for
data read path into begin/end_page_read"), the 'nr' counter at
btrfs_do_readpage() is no longer used, we increment it but we never
read from it. So just remove it.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Thu, 3 Feb 2022 15:36:43 +0000 (15:36 +0000)]
btrfs: fix lost error return value when reading a data page
At btrfs_do_readpage(), if we get an error when trying to lookup for an
extent map, we end up marking the page with the error bit, clearing
the uptodate bit on it, and doing everything else that should be done.
However we return success (0) to the caller, when we should return the
error encoded in the extent map pointer. So fix that by returning the
error encoded in the pointer.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Thu, 3 Feb 2022 15:36:42 +0000 (15:36 +0000)]
btrfs: stop checking for NULL return from btrfs_get_extent()
At extent_io.c, in the read page and write page code paths, we are testing
if the return value from btrfs_get_extent() can be NULL. However that is
not possible, as btrfs_get_extent() always returns either an error pointer
or a (non-NULL) pointer to an extent map structure.
Everywhere else outside extent_io.c we never check for NULL, we always
treat any returned value as a non-NULL pointer if it does not encode an
error.
So check only for the IS_ERR() case at extent_io.c.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Thu, 3 Feb 2022 14:55:50 +0000 (14:55 +0000)]
btrfs: prepare extents to be logged before locking a log tree path
When we want to log an extent, in the fast fsync path, we obtain a path
to the leaf that will hold the file extent item either through a deletion
search, via btrfs_drop_extents(), or through an insertion search using
btrfs_insert_empty_item(). After that we fill the file extent item's
fields one by one directly on the leaf.
Instead of doing that, we could prepare the file extent item before
obtaining a btree path, and then copy the prepared extent item with a
single operation once we get the path. This helps avoid some contention
on the log tree, since we are holding write locks for longer than
necessary, especially in the case where the path is obtained via
btrfs_drop_extents() through a deletion search, which always keeps a
write lock on the nodes at levels 1 and 2 (besides the leaf).
This change does that, we prepare the file extent item that is going to
be inserted before acquiring a path, and then copy it into a leaf using
a single copy operation once we get a path.
This change if part of a patchset that is comprised of the following
patches:
1/6 btrfs: remove unnecessary leaf free space checks when pushing items
2/6 btrfs: avoid unnecessary COW of leaves when deleting items from a leaf
3/6 btrfs: avoid unnecessary computation when deleting items from a leaf
4/6 btrfs: remove constraint on number of visited leaves when replacing extents
5/6 btrfs: remove useless path release in the fast fsync path
6/6 btrfs: prepare extents to be logged before locking a log tree path
The following test was run to measure the impact of the whole patchset:
$ cat test.sh
#!/bin/bash
DEV=/dev/sdi
MNT=/mnt/sdi
MOUNT_OPTIONS="-o ssd"
MKFS_OPTIONS="-R free-space-tree -O no-holes"
NUM_JOBS=8
FILE_SIZE=128M
RUN_TIME=200
cat <<EOF > /tmp/fio-job.ini
[writers]
rw=randwrite
fsync=1
fallocate=none
group_reporting=1
direct=0
bssplit=4k/20:8k/20:16k/20:32k/10:64k/10:128k/5:256k/5:512k/5:1m/5
ioengine=sync
filesize=$FILE_SIZE
runtime=$RUN_TIME
time_based
directory=$MNT
numjobs=$NUM_JOBS
thread
EOF
echo "performance" | \
tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
echo
echo "Using config:"
echo
cat /tmp/fio-job.ini
echo
umount $MNT &> /dev/null
mkfs.btrfs -f $MKFS_OPTIONS $DEV
mount $MOUNT_OPTIONS $DEV $MNT
fio /tmp/fio-job.ini
umount $MNT
The test ran inside a VM (8 cores, 32G of RAM) with the target disk
mapping to a raw NVMe device, and using a non-debug kernel config
(Debian's default config).
Before the patchset:
WRITE: bw=116MiB/s (122MB/s), 116MiB/s-116MiB/s (122MB/s-122MB/s), io=22.7GiB (24.4GB), run=200013-200013msec
After the patchset:
WRITE: bw=125MiB/s (131MB/s), 125MiB/s-125MiB/s (131MB/s-131MB/s), io=24.3GiB (26.1GB), run=200007-200007msec
A 7.8% gain on throughput and +7.0% more IO done in the same period of
time (200 seconds).
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Thu, 3 Feb 2022 14:55:49 +0000 (14:55 +0000)]
btrfs: remove useless path release in the fast fsync path
There's no point in calling btrfs_release_path() after finishing the loop
that logs the modified extents, since log_one_extent() returns with the
path released. In case the list of extents is empty, the path is already
released, so there's no need for that case as well.
So just remove that unnecessary btrfs_release_path() call.
This change if part of a patchset that is comprised of the following
patches:
1/6 btrfs: remove unnecessary leaf free space checks when pushing items
2/6 btrfs: avoid unnecessary COW of leaves when deleting items from a leaf
3/6 btrfs: avoid unnecessary computation when deleting items from a leaf
4/6 btrfs: remove constraint on number of visited leaves when replacing extents
5/6 btrfs: remove useless path release in the fast fsync path
6/6 btrfs: prepare extents to be logged before locking a log tree path
The last patch in the series has some performance test result in its
changelog.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Thu, 3 Feb 2022 14:55:48 +0000 (14:55 +0000)]
btrfs: remove constraint on number of visited leaves when replacing extents
At btrfs_drop_extents(), we try to replace a range of file extent items
with a new file extent in a single btree search, to avoid the need to do
a search for deletion, followed by a path release and followed by yet
another search for insertion.
When I originally added that optimization, in commit
1acae57b161ef1
("Btrfs: faster file extent item replace operations"), I left a constraint
to do the fast replace only if we visited a single leaf. That was because
in the most common case we find all file extent items that need to be
deleted (or trimmed) in a single leaf, however it can work for other
common cases like when we need to delete a few file extent items located
at the end of a leaf and a few more located at the beginning of the next
leaf. The key for the new file extent item is greater than the key of
any deleted or trimmed file extent item from previous leaves, so we are
fine to use the last leaf that we found as long as we are holding a
write lock on it - even if the new key ends up at slot 0, as if that's
the case, the btree search has obtained a write lock on any upper nodes
that need to have a key pointer updated.
So removed the constraint that limits the optimization to the case where
we visited only a single leaf.
This change if part of a patchset that is comprised of the following
patches:
1/6 btrfs: remove unnecessary leaf free space checks when pushing items
2/6 btrfs: avoid unnecessary COW of leaves when deleting items from a leaf
3/6 btrfs: avoid unnecessary computation when deleting items from a leaf
4/6 btrfs: remove constraint on number of visited leaves when replacing extents
5/6 btrfs: remove useless path release in the fast fsync path
6/6 btrfs: prepare extents to be logged before locking a log tree path
The last patch in the series has some performance test result in its
changelog.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Thu, 3 Feb 2022 14:55:47 +0000 (14:55 +0000)]
btrfs: avoid unnecessary computation when deleting items from a leaf
When deleting items from a leaf, we always compute the sum of the data
sizes of the items that are going to be deleted. However we only use
that sum when the last item to delete is behind the last item in the
leaf. This unnecessarily wastes CPU time when we are deleting either
the whole leaf or from some slot > 0 up to the last item in the leaf,
and both of these cases are common (e.g. truncation operation, either
as a result of truncate(2) or when logging inodes, deleting checksums
after removing a large enough extent, etc).
So compute only the sum of the data sizes if the last item to be
deleted does not match the last item in the leaf.
This change if part of a patchset that is comprised of the following
patches:
1/6 btrfs: remove unnecessary leaf free space checks when pushing items
2/6 btrfs: avoid unnecessary COW of leaves when deleting items from a leaf
3/6 btrfs: avoid unnecessary computation when deleting items from a leaf
4/6 btrfs: remove constraint on number of visited leaves when replacing extents
5/6 btrfs: remove useless path release in the fast fsync path
6/6 btrfs: prepare extents to be logged before locking a log tree path
The last patch in the series has some performance test result in its
changelog.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Thu, 3 Feb 2022 14:55:46 +0000 (14:55 +0000)]
btrfs: avoid unnecessary COW of leaves when deleting items from a leaf
When we delete items from a leaf, if we end up with more than two thirds
of unused leaf space, we try to delete the leaf by moving all its items
into its left and right neighbour leaves. Sometimes that is not possible
because there is not enough free space in the left and right leaves, and
in that case we end up not deleting our leaf.
The way we are doing this is not ideal and can be improved in the
following ways:
1) When we call push_leaf_left(), we pass a value of 1 byte to the data
size parameter of push_leaf_left(). This is not realistic value because
no item can have a size less than 25 bytes, which is the size of struct
btrfs_item. This means that means that if the left leaf has not enough
free space to push any item, we end up COWing it even if we end up not
changing its content at all.
COWing that leaf means allocating a new metadata extent, marking it
dirty and doing more IO when committing a transaction or when syncing a
log tree. For a log tree case, it's particularly more important to
avoid the useless COW operation, as more IO can imply a higher latency
for an fsync operation.
So instead of passing 1 as the minimum data size for push_leaf_left(),
pass the size of the first item in our leaf, as we don't want to COW
the left leaf if we can't at least push the first item of our leaf;
2) When we call push_leaf_right(), we also pass a value of 1 byte as the
data size parameter of push_leaf_right(). Like the previous case, it
will also result in COWing the right leaf even if we are not able to
move any items into it, since there can't be any item with a size
smaller than 25 bytes (the size of struct btrfs_item).
So instead of passing 1 as the minimum data size to push_leaf_right(),
pass a size that corresponds to the sum of the size of all the
remaining items in our leaf. We are not interested in moving less than
that, because if we do, we are not able to delete our leaf and we have
COWed the right leaf for nothing. Plus, moving only some of the items
of our leaf, it means an even less balanced tree.
Just like the previous case, we want to avoid the useless COW of the
right leaf, this way we don't have to spend time allocating one new
metadata extent, and doing more IO when committing a transaction or
syncing a log tree. For the log tree case it's specially more important
because more IO can result in a higher latency for a fsync operation.
So adjust the minimum data size passed to push_leaf_left() and
push_leaf_right() as mentioned above.
This change if part of a patchset that is comprised of the following
patches:
1/6 btrfs: remove unnecessary leaf free space checks when pushing items
2/6 btrfs: avoid unnecessary COW of leaves when deleting items from a leaf
3/6 btrfs: avoid unnecessary computation when deleting items from a leaf
4/6 btrfs: remove constraint on number of visited leaves when replacing extents
5/6 btrfs: remove useless path release in the fast fsync path
6/6 btrfs: prepare extents to be logged before locking a log tree path
Not being able to delete a leaf that became less than 1/3 full after
deleting items from it is actually common. For example, for the fio test
mentioned in the changelog of patch 6/6, we are only able to delete a
leaf at btrfs_del_items() about 5.3% of the time, due to its left and
right neighbour leaves not having enough free space to push all the
remaining items into them.
The last patch in the series has some performance test result in its
changelog.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Thu, 3 Feb 2022 14:55:45 +0000 (14:55 +0000)]
btrfs: remove unnecessary leaf free space checks when pushing items
When trying to push items from a leaf into its left and right neighbours,
we lock the left or right leaf, check if it has the required minimum free
space, COW the leaf and then check again if it has the minimum required
free space. This second check is pointless:
1) Most and foremost because it's not needed. We have a write lock on the
leaf and on its parent node, so no one can come in and change either
the pre-COW or post-COW version of the leaf for the whole duration of
the push_leaf_left() and push_leaf_right() calls;
2) The call to btrfs_leaf_free_space() is not trivial, it has a fair
amount of arithmetic operations and access to fields in the leaf's
header and items, so it's not very cheap.
So remove the duplicated free space checks.
This change if part of a patchset that is comprised of the following
patches:
1/6 btrfs: remove unnecessary leaf free space checks when pushing items
2/6 btrfs: avoid unnecessary COW of leaves when deleting items from a leaf
3/6 btrfs: avoid unnecessary computation when deleting items from a leaf
4/6 btrfs: remove constraint on number of visited leaves when replacing extents
5/6 btrfs: remove useless path release in the fast fsync path
6/6 btrfs: prepare extents to be logged before locking a log tree path
The last patch in the series has some performance test result in its
changelog.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Johannes Thumshirn [Fri, 4 Feb 2022 12:06:27 +0000 (04:06 -0800)]
btrfs: stop checking for NULL return from btrfs_get_extent_fiemap()
In get_extent_skip_holes() we're checking the return of
btrfs_get_extent_fiemap() for an error pointer or NULL, but
btrfs_get_extent_fiemap() will never return NULL, only error pointers or
a valid extent_map.
The other caller of btrfs_get_extent_fiemap(), find_desired_extent(),
correctly only checks for error-pointers.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Pankaj Raghav [Fri, 4 Feb 2022 12:00:22 +0000 (13:00 +0100)]
btrfs: zoned: remove redundant assignment in btrfs_check_zoned_mode
Remove the redundant assignment to zone_info variable in
btrfs_check_zoned_mode function.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Tue, 1 Feb 2022 14:42:07 +0000 (15:42 +0100)]
btrfs: replace BUILD_BUG_ON by static_assert
The static_assert introduced in
6bab69c65013 ("build_bug.h: add wrapper
for _Static_assert") has been supported by compilers for a long time
(gcc 4.6, clang 3.0) and can be used in header files. We don't need to
put BUILD_BUG_ON to random functions but rather keep it next to the
definition.
The exception here is the UAPI header btrfs_tree.h that could be
potentially included by userspace code and the static assert is not
defined (nor used in any other header).
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Qu Wenruo [Sun, 30 Jan 2022 12:53:15 +0000 (20:53 +0800)]
btrfs: don't hold CPU for too long when defragging a file
There is a user report about "btrfs filesystem defrag" causing 120s
timeout problem.
For btrfs_defrag_file() it will iterate all file extents if called from
defrag ioctl, thus it can take a long time.
There is no reason not to release the CPU during such a long operation.
Add cond_resched() after defragged one cluster.
CC: stable@vger.kernel.org # 5.16
Link: https://lore.kernel.org/linux-btrfs/10e51417-2203-f0a4-2021-86c8511cc367@gmx.com
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Johannes Thumshirn [Wed, 26 Jan 2022 13:46:23 +0000 (05:46 -0800)]
btrfs: zoned: allow DUP on meta-data block groups
Allow creating or reading block-groups on a zoned device with DUP as a
meta-data profile.
This works because we're using the zoned_meta_io_lock and REQ_OP_WRITE
operations for meta-data on zoned btrfs, so all writes to meta-data zones
are aligned to the zone's write-pointer.
Upon loading of the block-group, it is ensured both zones do have the same
zone capacity and write-pointer offsets, so no extra machinery is needed
to keep the write-pointers in sync for the meta-data zones. If this
prerequisite is not met, loading of the block-group is refused.
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Johannes Thumshirn [Wed, 26 Jan 2022 13:46:22 +0000 (05:46 -0800)]
btrfs: zoned: prepare for allowing DUP on zoned
Allow for a block-group to be placed on more than one physical zone.
This is a preparation for allowing DUP profiles for meta-data on a zoned
file-system.
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Johannes Thumshirn [Wed, 26 Jan 2022 13:46:21 +0000 (05:46 -0800)]
btrfs: zoned: make zone finishing multi stripe capable
Currently finishing of a zone only works if the block group isn't
spanning more than one zone.
This limitation is purely artificial and can be easily expanded to block
groups being places across multiple zones.
This is a preparation for allowing DUP and later more complex block-group
profiles on zoned btrfs.
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Johannes Thumshirn [Wed, 26 Jan 2022 13:46:20 +0000 (05:46 -0800)]
btrfs: zoned: make zone activation multi stripe capable
Currently activation of a zone only works if the block group isn't
spanning more than one zone.
This limitation is purely artificial and can be easily expanded to block
groups being places across multiple zones.
This is a preparation for allowing DUP and later more complex block-group
profiles on zoned btrfs.
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Josef Bacik [Wed, 15 Dec 2021 20:40:08 +0000 (15:40 -0500)]
btrfs: add support for multiple global roots
With extent tree v2 you will be able to create multiple csum, extent,
and free space trees. They will be used based on the block group, which
will now use the block_group_item->chunk_objectid to point to the set of
global roots that it will use. When allocating new block groups we'll
simply mod the gigabyte offset of the block group against the number of
global roots we have and that will be the block groups global id.
>From there we can take the bytenr that we're modifying in the respective
tree, look up the block group and get that block groups corresponding
global root id. From there we can get to the appropriate global root
for that bytenr.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Josef Bacik [Wed, 15 Dec 2021 20:40:07 +0000 (15:40 -0500)]
btrfs: add code to support the block group root
This code adds the on disk structures for the block group root, which
will hold the block group items for extent tree v2.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Josef Bacik [Wed, 15 Dec 2021 20:40:06 +0000 (15:40 -0500)]
btrfs: abstract out loading the tree root
We're going to be adding more roots that need to be loaded from the
super block, so abstract out the code to read the tree_root from the
super block, and use this helper for the chunk root as well. This will
make it simpler to load the new trees in the future.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Josef Bacik [Wed, 15 Dec 2021 20:40:05 +0000 (15:40 -0500)]
btrfs: tree-checker: don't fail on empty extent roots for extent tree v2
For extent tree v2 we can definitely have empty extent roots, so skip
this particular check if we have that set.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Josef Bacik [Wed, 15 Dec 2021 20:40:04 +0000 (15:40 -0500)]
btrfs: disable space cache related mount options for extent tree v2
We cannot fall back on the slow caching for extent tree v2, which means
we can't just arbitrarily clear the free space trees at mount time.
Furthermore we can't do v1 space cache with extent tree v2. Simply
ignore these mount options for extent tree v2 as they aren't relevant.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Josef Bacik [Wed, 15 Dec 2021 20:40:03 +0000 (15:40 -0500)]
btrfs: disable snapshot creation/deletion for extent tree v2
When we stop tracking metadata blocks all of snapshotting will break, so
disable it until I add the snapshot root and drop tree support.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Josef Bacik [Wed, 15 Dec 2021 20:40:02 +0000 (15:40 -0500)]
btrfs: disable scrub for extent-tree-v2
Scrub depends on extent references for every block, and with extent tree
v2 we won't have that, so disable scrub until we can add back the proper
code to handle extent-tree-v2.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Josef Bacik [Wed, 15 Dec 2021 20:40:01 +0000 (15:40 -0500)]
btrfs: disable qgroups in extent tree v2
Backref lookups are going to be drastically different with extent tree
v2, disable qgroups until we do the work to add this support for extent
tree v2.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Josef Bacik [Wed, 15 Dec 2021 20:40:00 +0000 (15:40 -0500)]
btrfs: disable device manipulation ioctl's EXTENT_TREE_V2
Device add, remove, and replace all require balance, which doesn't work
right now on extent tree v2, so disable these for now.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Josef Bacik [Wed, 15 Dec 2021 20:39:59 +0000 (15:39 -0500)]
btrfs: disable balance for extent tree v2 for now
With global root id's it makes it problematic to do backref lookups for
balance. This isn't hard to deal with, but future changes are going to
make it impossible to lookup backrefs on any COWonly roots, so go ahead
and disable balance for now on extent tree v2 until we can add balance
support back in future patches.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Josef Bacik [Wed, 15 Dec 2021 20:39:58 +0000 (15:39 -0500)]
btrfs: add definition for EXTENT_TREE_V2
This adds the initial definition of the EXTENT_TREE_V2 incompat feature
flag. This also hides the support behind CONFIG_BTRFS_DEBUG.
THIS IS A IN DEVELOPMENT FORMAT CHANGE, DO NOT USE UNLESS YOU ARE A
DEVELOPER OR A TESTER.
The format is in flux and will be added in stages, any fs will need to
be re-made between updates to the format.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Thu, 20 Jan 2022 11:00:11 +0000 (11:00 +0000)]
btrfs: use single variable to track return value at btrfs_log_inode()
At btrfs_log_inode(), we have two variables to track errors and the
return value of the function, named 'ret' and 'err'. In some places we
use 'ret' and if gets a non-zero value we assign its value to 'err'
and then jump to the 'out' label, while in other places we use 'err'
directly without 'ret' as an intermediary. This is inconsistent, error
prone and not necessary. So change that to use only the 'ret' variable,
making this consistent with most functions in btrfs.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Thu, 20 Jan 2022 11:00:10 +0000 (11:00 +0000)]
btrfs: avoid inode logging during rename and link when possible
During a rename or link operation, we need to determine if an inode was
previously logged or not, and if it was, do some update to the logged
inode. We used to rely exclusively on the logged_trans field of struct
btrfs_inode to determine that, but that was not reliable because the
value of that field is not persisted in the inode item, so it's lost
when an inode is evicted and loaded back again. That led to several
issues in the past, such as not persisting deletions (such as the case
fixed by commit
803f0f64d17769 ("Btrfs: fix fsync not persisting dentry
deletions due to inode evictions")), or resulting in losing a file
after an inode eviction followed by a rename (commit
ecc64fab7d49c6
("btrfs: fix lost inode on log replay after mix of fsync, rename and
inode eviction")), besides other issues.
So the inode_logged() helper was introduced and used to determine if an
inode was possibly logged before in the current transaction, with the
caveat that it could return false positives, in the sense that even if an
inode was not logged before in the current transaction, it could still
return true, but never to return false in case the inode was logged.
>From a functional point of view that is fine, but from a performance
perspective it can introduce significant latencies to rename and link
operations, as they will end up doing inode logging even when it is not
necessary.
Recently on a 5.15 kernel, an openSUSE Tumbleweed user reported package
installations and upgrades, with the zypper tool, were often taking a
long time to complete. With strace it could be observed that zypper was
spending about 99% of its time on rename operations, and then with
further analysis we checked that directory logging was happening too
frequently. Taking into account that installation/upgrade of some of the
packages needed a few thousand file renames, the slowdown was very
noticeable for the user.
The issue was caused indirectly due to an excessive number of inode
evictions on a 5.15 kernel, about 100x more compared to a 5.13, 5.14 or
a 5.16-rc8 kernel. While triggering the inode evictions if something
outside btrfs' control, btrfs could still behave better by eliminating
the false positives from the inode_logged() helper.
So change inode_logged() to actually eliminate such false positives caused
by inode eviction and when an inode was never logged since the filesystem
was mounted, as both cases relate to when the logged_trans field of struct
btrfs_inode has a value of zero. When it can not determine if the inode
was logged based only on the logged_trans value, lookup for the existence
of the inode item in the log tree - if it's there then we known the inode
was logged, if it's not there then it can not have been logged in the
current transaction. Once we determine if the inode was logged, update
the logged_trans value to avoid future calls to have to search in the log
tree again.
Alternatively, we could start storing logged_trans in the on disk inode
item structure (struct btrfs_inode_item) in the unused space it still has,
but that would be a bit odd because:
1) We only care about logged_trans since the filesystem was mounted, we
don't care about its value from a previous mount. Having it persisted
in the inode item structure would not make the best use of the precious
unused space;
2) In order to get logged_trans persisted before inode eviction, we would
have to update the delayed inode when we finish logging the inode and
update its logged_trans in struct btrfs_inode, which makes it a bit
cumbersome since we need to check if the delayed inode exists, if not
create it and populate it and deal with any errors (-ENOMEM mostly).
This change is part of a patchset comprised of the following patches:
1/5 btrfs: add helper to delete a dir entry from a log tree
2/5 btrfs: pass the dentry to btrfs_log_new_name() instead of the inode
3/5 btrfs: avoid logging all directory changes during renames
4/5 btrfs: stop doing unnecessary log updates during a rename
5/5 btrfs: avoid inode logging during rename and link when possible
The following test script mimics part of what the zypper tool does during
package installations/upgrades. It does not triggers inode evictions, but
it's similar because it triggers false positives from the inode_logged()
helper, because the inodes have a logged_trans of 0, there's a log tree
due to a fsync of an unrelated file and the directory inode has its
last_trans field set to the current transaction:
$ cat test.sh
#!/bin/bash
DEV=/dev/nvme0n1
MNT=/mnt/nvme0n1
NUM_FILES=10000
mkfs.btrfs -f $DEV
mount $DEV $MNT
mkdir $MNT/testdir
for ((i = 1; i <= $NUM_FILES; i++)); do
echo -n > $MNT/testdir/file_$i
done
sync
# Now do some change to an unrelated file and fsync it.
# This is just to create a log tree to make sure that inode_logged()
# does not return false when called against "testdir".
xfs_io -f -c "pwrite 0 4K" -c "fsync" $MNT/foo
# Do some change to testdir. This is to make sure inode_logged()
# will return true when called against "testdir", because its
# logged_trans is 0, it was changed in the current transaction
# and there's a log tree.
echo -n > $MNT/testdir/file_$((NUM_FILES + 1))
echo "Renaming $NUM_FILES files..."
start=$(date +%s%N)
for ((i = 1; i <= $NUM_FILES; i++)); do
mv $MNT/testdir/file_$i $MNT/testdir/file_$i-RPMDELETE
done
end=$(date +%s%N)
dur=$(( (end - start) /
1000000 ))
echo "Renames took $dur milliseconds"
umount $MNT
Testing this change on a box using a non-debug kernel (Debian's default
kernel config) gave the following results:
NUM_FILES=10000, before patchset: 27837 ms
NUM_FILES=10000, after patches 1/5 to 4/5 applied: 9236 ms (-66.8%)
NUM_FILES=10000, after whole patchset applied: 8902 ms (-68.0%)
NUM_FILES=5000, before patchset: 9127 ms
NUM_FILES=5000, after patches 1/5 to 4/5 applied: 4640 ms (-49.2%)
NUM_FILES=5000, after whole patchset applied: 4441 ms (-51.3%)
NUM_FILES=2000, before patchset: 2528 ms
NUM_FILES=2000, after patches 1/5 to 4/5 applied: 1983 ms (-21.6%)
NUM_FILES=2000, after whole patchset applied: 1747 ms (-30.9%)
NUM_FILES=1000, before patchset: 1085 ms
NUM_FILES=1000, after patches 1/5 to 4/5 applied: 893 ms (-17.7%)
NUM_FILES=1000, after whole patchset applied: 867 ms (-20.1%)
Running dbench on the same physical machine with the following script:
$ cat run-dbench.sh
#!/bin/bash
NUM_JOBS=$(nproc --all)
DEV=/dev/nvme0n1
MNT=/mnt/nvme0n1
MOUNT_OPTIONS="-o ssd"
MKFS_OPTIONS="-O no-holes -R free-space-tree"
echo "performance" | \
tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
mkfs.btrfs -f $MKFS_OPTIONS $DEV
mount $MOUNT_OPTIONS $DEV $MNT
dbench -D $MNT -t 120 $NUM_JOBS
umount $MNT
Before patchset:
Operation Count AvgLat MaxLat
----------------------------------------
NTCreateX
3761352 0.032 143.843
Close
2762770 0.002 2.273
Rename 159304 0.291 67.037
Unlink 759784 0.207 143.998
Deltree 72 4.028 15.977
Mkdir 36 0.003 0.006
Qpathinfo
3409780 0.013 9.678
Qfileinfo 596772 0.001 0.878
Qfsinfo 625189 0.003 1.245
Sfileinfo 306443 0.006 1.840
Find
1318106 0.063 19.798
WriteX
1871137 0.021 8.532
ReadX
5897325 0.003 3.567
LockX 12252 0.003 0.258
UnlockX 12252 0.002 0.100
Flush 263666 3.327 155.632
Throughput 980.047 MB/sec 12 clients 12 procs max_latency=155.636 ms
After whole patchset applied:
Operation Count AvgLat MaxLat
----------------------------------------
NTCreateX
4195584 0.033 107.742
Close
3081932 0.002 1.935
Rename 177641 0.218 14.905
Unlink 847333 0.166 107.822
Deltree 118 5.315 15.247
Mkdir 59 0.004 0.048
Qpathinfo
3802612 0.014 10.302
Qfileinfo 666748 0.001 1.034
Qfsinfo 697329 0.003 0.944
Sfileinfo 341712 0.006 2.099
Find
1470365 0.065 9.359
WriteX
2093921 0.021 8.087
ReadX
6576234 0.003 3.407
LockX 13660 0.003 0.308
UnlockX 13660 0.002 0.114
Flush 294090 2.906 115.539
Throughput 1093.11 MB/sec 12 clients 12 procs max_latency=115.544 ms
+11.5% throughput -25.8% max latency rename max latency -77.8%
Link: https://bugzilla.opensuse.org/show_bug.cgi?id=1193549
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Thu, 20 Jan 2022 11:00:09 +0000 (11:00 +0000)]
btrfs: stop doing unnecessary log updates during a rename
During a rename, we call __btrfs_unlink_inode(), which will call
btrfs_del_inode_ref_in_log() and btrfs_del_dir_entries_in_log(), in order
to remove an inode reference and a directory entry from the log. These
are necessary when __btrfs_unlink_inode() is called from the unlink path,
but not necessary when it's called from a rename context, because:
1) For the btrfs_del_inode_ref_in_log() call, it's pointless to delete the
inode reference related to the old name, because later in the rename
path we call btrfs_log_new_name(), which will drop all inode references
from the log and copy all inode references from the subvolume tree to
the log tree. So we are doing one unnecessary btree operation which
adds additional latency and lock contention in case there are other
tasks accessing the log tree;
2) For the btrfs_del_dir_entries_in_log() call, we are now doing the
equivalent at btrfs_log_new_name() since the previous patch in the
series, that has the subject "btrfs: avoid logging all directory
changes during renames". In fact, having __btrfs_unlink_inode() call
this function not only adds additional latency and lock contention due
to the extra btree operation, but also can make btrfs_log_new_name()
unnecessarily log a range item to track the deletion of the old name,
since it has no way to known that the directory entry related to the
old name was previously logged and already deleted by
__btrfs_unlink_inode() through its call to
btrfs_del_dir_entries_in_log().
So skip those calls at __btrfs_unlink_inode() when we are doing a rename.
Skipping them also allows us now to reduce the duration of time we are
pinning a log transaction during renames, which is always beneficial as
it's not delaying so much other tasks trying to sync the log tree, in
particular we end up not holding the log transaction pinned while adding
the new name (adding inode ref, directory entry, etc).
This change is part of a patchset comprised of the following patches:
1/5 btrfs: add helper to delete a dir entry from a log tree
2/5 btrfs: pass the dentry to btrfs_log_new_name() instead of the inode
3/5 btrfs: avoid logging all directory changes during renames
4/5 btrfs: stop doing unnecessary log updates during a rename
5/5 btrfs: avoid inode logging during rename and link when possible
Just like the previous patch in the series, "btrfs: avoid logging all
directory changes during renames", the following script mimics part of
what a package installation/upgrade with zypper does, which is basically
renaming a lot of files, in some directory under /usr, to a name with a
suffix of "-RPMDELETE":
$ cat test.sh
#!/bin/bash
DEV=/dev/nvme0n1
MNT=/mnt/nvme0n1
NUM_FILES=10000
mkfs.btrfs -f $DEV
mount $DEV $MNT
mkdir $MNT/testdir
for ((i = 1; i <= $NUM_FILES; i++)); do
echo -n > $MNT/testdir/file_$i
done
sync
# Do some change to testdir and fsync it.
echo -n > $MNT/testdir/file_$((NUM_FILES + 1))
xfs_io -c "fsync" $MNT/testdir
echo "Renaming $NUM_FILES files..."
start=$(date +%s%N)
for ((i = 1; i <= $NUM_FILES; i++)); do
mv $MNT/testdir/file_$i $MNT/testdir/file_$i-RPMDELETE
done
end=$(date +%s%N)
dur=$(( (end - start) /
1000000 ))
echo "Renames took $dur milliseconds"
umount $MNT
Testing this change on box a using a non-debug kernel (Debian's default
kernel config) gave the following results:
NUM_FILES=10000, before patchset: 27399 ms
NUM_FILES=10000, after patches 1/5 to 3/5 applied: 9093 ms (-66.8%)
NUM_FILES=10000, after patches 1/5 to 4/5 applied: 9016 ms (-67.1%)
NUM_FILES=5000, before patchset: 9241 ms
NUM_FILES=5000, after patches 1/5 to 3/5 applied: 4642 ms (-49.8%)
NUM_FILES=5000, after patches 1/5 to 4/5 applied: 4553 ms (-50.7%)
NUM_FILES=2000, before patchset: 2550 ms
NUM_FILES=2000, after patches 1/5 to 3/5 applied: 1788 ms (-29.9%)
NUM_FILES=2000, after patches 1/5 to 4/5 applied: 1767 ms (-30.7%)
NUM_FILES=1000, before patchset: 1088 ms
NUM_FILES=1000, after patches 1/5 to 3/5 applied: 905 ms (-16.9%)
NUM_FILES=1000, after patches 1/5 to 4/5 applied: 883 ms (-18.8%)
The next patch in the series (5/5), also contains dbench results after
applying to whole patchset.
Link: https://bugzilla.opensuse.org/show_bug.cgi?id=1193549
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Thu, 20 Jan 2022 11:00:08 +0000 (11:00 +0000)]
btrfs: avoid logging all directory changes during renames
When doing a rename of a file, if the file or its old parent directory
were logged before, we log the new name of the file and then make sure
we log the old parent directory, to ensure that after a log replay the
old name of the file is deleted and the new name added.
The logging of the old parent directory can take some time, because it
will scan all leaves modified in the current transaction, check which
directory entries were already logged, copy the ones that were not
logged before, etc. In this rename context all we need to do is make
sure that the old name of the file is deleted on log replay, so instead
of triggering a directory log operation, we can just delete the old
directory entry from the log if it's there, or in case it isn't there,
just log a range item to signal log replay that the old name must be
deleted. So change btrfs_log_new_name() to do that.
This scenario is actually not uncommon to trigger, and recently on a
5.15 kernel, an openSUSE Tumbleweed user reported package installations
and upgrades, with the zypper tool, were often taking a long time to
complete, much more than usual. With strace it could be observed that
zypper was spending over 99% of its time on rename operations, and then
with further analysis we checked that directory logging was happening
too frequently and causing high latencies for the rename operations.
Taking into account that installation/upgrade of some of these packages
needed about a few thousand file renames, the slowdown was very noticeable
for the user.
The issue was caused indirectly due to an excessive number of inode
evictions on a 5.15 kernel, about 100x more compared to a 5.13, 5.14
or a 5.16-rc8 kernel. After an inode eviction we can't tell for sure,
in an efficient way, if an inode was previously logged in the current
transaction, so we are pessimistic and assume it was, because in case
it was we need to update the logged inode. More details on that in one
of the patches in the same series (subject "btrfs: avoid inode logging
during rename and link when possible"). Either way, in case the parent
directory was logged before, we currently do more work then necessary
during a rename, and this change minimizes that amount of work.
The following script mimics part of what a package installation/upgrade
with zypper does, which is basically renaming a lot of files, in some
directory under /usr, to a name with a suffix of "-RPMDELETE":
$ cat test.sh
#!/bin/bash
DEV=/dev/nvme0n1
MNT=/mnt/nvme0n1
NUM_FILES=10000
mkfs.btrfs -f $DEV
mount $DEV $MNT
mkdir $MNT/testdir
for ((i = 1; i <= $NUM_FILES; i++)); do
echo -n > $MNT/testdir/file_$i
done
sync
# Do some change to testdir and fsync it.
echo -n > $MNT/testdir/file_$((NUM_FILES + 1))
xfs_io -c "fsync" $MNT/testdir
echo "Renaming $NUM_FILES files..."
start=$(date +%s%N)
for ((i = 1; i <= $NUM_FILES; i++)); do
mv $MNT/testdir/file_$i $MNT/testdir/file_$i-RPMDELETE
done
end=$(date +%s%N)
dur=$(( (end - start) /
1000000 ))
echo "Renames took $dur milliseconds"
umount $MNT
Testing this change on box using a non-debug kernel (Debian's default
kernel config) gave the following results:
NUM_FILES=10000, before this patch: 27399 ms
NUM_FILES=10000, after this patch: 9093 ms (-66.8%)
NUM_FILES=5000, before this patch: 9241 ms
NUM_FILES=5000, after this patch: 4642 ms (-49.8%)
NUM_FILES=2000, before this patch: 2550 ms
NUM_FILES=2000, after this patch: 1788 ms (-29.9%)
NUM_FILES=1000, before this patch: 1088 ms
NUM_FILES=1000, after this patch: 905 ms (-16.9%)
Link: https://bugzilla.opensuse.org/show_bug.cgi?id=1193549
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Thu, 20 Jan 2022 11:00:07 +0000 (11:00 +0000)]
btrfs: pass the dentry to btrfs_log_new_name() instead of the inode
In the next patch in the series, there will be the need to access the old
name, and its length, of an inode when logging the inode during a rename.
So instead of passing the inode to btrfs_log_new_name() pass the dentry,
because from the dentry we can get the inode, the name and its length.
This will avoid passing 3 new parameters to btrfs_log_new_name() in the
next patch - the name, its length and an index number. This way we end
up passing only 1 new parameter, the index number.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Thu, 20 Jan 2022 11:00:06 +0000 (11:00 +0000)]
btrfs: add helper to delete a dir entry from a log tree
Move the code that finds and deletes a logged dir entry out of
btrfs_del_dir_entries_in_log() into a helper function. This new helper
function will be used by another patch in the same series, and serves
to avoid having duplicated logic.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Minghao Chi [Tue, 11 Jan 2022 01:57:16 +0000 (01:57 +0000)]
btrfs: send: remove redundant ret variable in fs_path_copy
Return value from fs_path_add_path() directly instead of taking this in
another redundant variable.
Reported-by: Zeal Robot <zealci@zte.com.cn>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Minghao Chi <chi.minghao@zte.com.cn>
Signed-off-by: CGEL ZTE <cgel.zte@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Nikolay Borisov [Thu, 13 Jan 2022 15:16:18 +0000 (17:16 +0200)]
btrfs: move QUOTA_ENABLED check to rescan_should_stop from btrfs_qgroup_rescan_worker
Instead of having 2 places that short circuit the qgroup leaf scan have
everything in the qgroup_rescan_leaf function. In addition to that, also
ensure that the inconsistent qgroup flag is set when rescan_should_stop
returns true. This both retains the old behavior when -EINTR was set in
the body of the loop and at the same time also extends this behavior
when scanning is interrupted due to remount or unmount operations.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Jiapeng Chong [Fri, 21 Jan 2022 11:42:24 +0000 (19:42 +0800)]
btrfs: scrub: remove redundant initialization of increment
increment is being initialized to map->stripe_len but this is never
read as increment is overwritten later on. Remove the redundant
initialization.
Cleans up the following clang-analyzer warning:
fs/btrfs/scrub.c:3193:6: warning: Value stored to 'increment' during its
initialization is never read [clang-analyzer-deadcode.DeadStores].
Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Jiapeng Chong [Fri, 21 Jan 2022 11:43:51 +0000 (19:43 +0800)]
btrfs: zoned: remove redundant initialization of to_add
to_add is being initialized to len but this is never read as to_add is
overwritten later on. Remove the redundant initialization.
Cleans up the following clang-analyzer warning:
fs/btrfs/extent-tree.c:2769:8: warning: Value stored to 'to_add' during
its initialization is never read [clang-analyzer-deadcode.DeadStores].
Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Anand Jain [Thu, 13 Jan 2022 04:44:10 +0000 (12:44 +0800)]
btrfs: cleanup temporary variables when finding rotational device status
The pointer to struct request_queue is used only to get device type
rotating or the non-rotating. So use it directly.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Anand Jain [Wed, 12 Jan 2022 05:06:02 +0000 (13:06 +0800)]
btrfs: use dev_t to match device in device_matched
Commit "btrfs: add device major-minor info in the struct btrfs_device"
saved the device major-minor number in the struct btrfs_device upon
discovering it.
So no need to lookup_bdev() again just match, which means
device_matched() can go away.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Anand Jain [Wed, 12 Jan 2022 05:06:01 +0000 (13:06 +0800)]
btrfs: add device major-minor info in the struct btrfs_device
Internally it is common to use the major-minor number to identify a
device and, at a few locations in btrfs, we use the major-minor number
to match the device.
So when we identify a new btrfs device through device add or device
replace or device-scan/ready save the device's major-minor (dev_t) in the
struct btrfs_device so that we don't have to call lookup_bdev() again.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Anand Jain [Wed, 12 Jan 2022 05:06:00 +0000 (13:06 +0800)]
btrfs: match stale devices by dev_t
After the commit "btrfs: harden identification of the stale device", we
don't have to match the device path anymore. Instead, we match the dev_t.
So pass in the dev_t instead of the device path, in the call chain
btrfs_forget_devices()->btrfs_free_stale_devices().
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Anand Jain [Wed, 12 Jan 2022 05:05:59 +0000 (13:05 +0800)]
btrfs: harden identification of a stale device
Identifying and removing the stale device from the fs_uuids list is done
by btrfs_free_stale_devices(). btrfs_free_stale_devices() in turn
depends on device_path_matched() to check if the device appears in more
than one btrfs_device structure.
The matching of the device happens by its path, the device path. However,
when device mapper is in use, the dm device paths are nothing but a link
to the actual block device, which leads to the device_path_matched()
failing to match.
Fix this by matching the dev_t as provided by lookup_bdev() instead of
plain string compare of the device paths.
Reported-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Anand Jain [Mon, 17 Jan 2022 15:50:39 +0000 (23:50 +0800)]
btrfs: simplify fs_devices member access in btrfs_init_dev_replace_tgtdev
In btrfs_init_dev_replace_tgtdev() we dereference fs_info to get
fs_devices many times, instead save a point to the fs_devices.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Sahil Kang [Sun, 16 Jan 2022 02:48:47 +0000 (18:48 -0800)]
btrfs: reuse existing inode from btrfs_ioctl
btrfs_ioctl extracts inode from file so we can pass that into the
callbacks.
Signed-off-by: Sahil Kang <sahil.kang@asilaycomputing.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Nikolay Borisov [Tue, 11 Jan 2022 16:00:26 +0000 (18:00 +0200)]
btrfs: move missing device handling in a dedicate function
This simplifies the code flow in read_one_chunk and makes error handling
when handling missing devices a bit simpler by reducing it to a single
check if something went wrong. No functional changes.
Reviewed-by: Su Yue <l@damenly.su>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Wed, 15 Dec 2021 12:20:01 +0000 (12:20 +0000)]
btrfs: stop trying to log subdirectories created in past transactions
When logging a directory we are trying to log subdirectories that were
changed in the current transaction and created in a past transaction.
This type of behaviour was introduced by commit
2f2ff0ee5e4303 ("Btrfs:
fix metadata inconsistencies after directory fsync"), to fix some metadata
inconsistencies that in the meanwhile no longer need this behaviour due to
numerous other changes that happened throughout the years.
This behaviour, besides not needed anymore, it's also undesirable because:
1) It's not reliable because it's only triggered for the directories
of dentries (dir items) that happen to be present on a leaf that
was changed in the current transaction. If a dentry that points to
a directory resides on a leaf that was not changed in the current
transaction, then it's not logged, as at log_dir_items() and
log_new_dir_dentries() we use btrfs_search_forward();
2) It's not required by posix or any standard, it's undefined territory.
The only way to guarantee a subdirectory is logged, it to explicitly
fsync it;
Making the behaviour guaranteed would require scanning all directory
items, check which point to a directory, and then fsync each subdirectory
which was modified in the current transaction. This could be very
expensive for large directories with many subdirectories and/or large
subdirectories.
So remove that obsolete logic.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Wed, 15 Dec 2021 12:20:00 +0000 (12:20 +0000)]
btrfs: stop copying old dir items when logging a directory
When logging a directory, we go over every leaf of the subvolume tree that
was changed in the current transaction and copy all its dir index keys to
the log tree.
That includes copying dir index keys created in past transactions. This is
done mostly for simplicity, as after logging the keys we log an item that
specifies the start and end ranges of the keys we logged. That item is
then used during log replay to figure out which keys need to be deleted -
every key in that range that we find in the subvolume tree and is not in
the log tree, needs to be deleted.
Now that we log only dir index keys, and not dir item keys anymore, when
we remove dentries from a directory (due to unlink and rename operations),
we can get entire leaves that we changed only for deleting old dir index
keys, or that have few dir index keys that are new - this is due to the
fact that the offset for new index keys comes from a monotonically
increasing counter.
We can avoid logging dir index keys from past transactions, and in order
to track the deletions, only log range items (BTRFS_DIR_LOG_INDEX_KEY key
type) when we find gaps between consecutive index keys. This massively
reduces the amount of logged metadata when we have deleted directory
entries, even if it's a small percentage of the total number of entries.
The reduction comes from both less items that are logged and instead of
logging many dir index items (struct btrfs_dir_item), which have a size
of 30 bytes plus a file name, we typically log just a few range items
(struct btrfs_dir_log_item), which take only 8 bytes each.
Even if no entries were deleted from a directory and only new entries
were added, we typically still get a reduction on the amount of logged
metadata, because it's very likely the first leaf that got the new
dir index entries also has several old dir index entries.
So change the logging logic to not log dir index keys created in past
transactions and log a range item for every gap it finds between each
pair of consecutive index keys, to ensure deletions are tracked and
replayed on log replay.
This patch is part of a patchset comprised of the following patches:
1/4 btrfs: don't log unnecessary boundary keys when logging directory
2/4 btrfs: put initial index value of a directory in a constant
3/4 btrfs: stop copying old dir items when logging a directory
4/4 btrfs: stop trying to log subdirectories created in past transactions
The following test was run on a branch without this patchset and on a
branch with the first three patches applied:
$ cat test.sh
#!/bin/bash
DEV=/dev/nvme0n1
MNT=/mnt/nvme0n1
NUM_FILES=
1000000
NUM_FILE_DELETES=10000
MKFS_OPTIONS="-O no-holes -R free-space-tree"
MOUNT_OPTIONS="-o ssd"
mkfs.btrfs -f $MKFS_OPTIONS $DEV
mount $MOUNT_OPTIONS $DEV $MNT
mkdir $MNT/testdir
for ((i = 1; i <= $NUM_FILES; i++)); do
echo -n > $MNT/testdir/file_$i
done
sync
del_inc=$(( $NUM_FILES / $NUM_FILE_DELETES ))
for ((i = 1; i <= $NUM_FILES; i += $del_inc)); do
rm -f $MNT/testdir/file_$i
done
start=$(date +%s%N)
xfs_io -c "fsync" $MNT/testdir
end=$(date +%s%N)
dur=$(( (end - start) /
1000000 ))
echo "dir fsync took $dur ms after deleting $NUM_FILE_DELETES files"
echo
umount $MNT
The test was run on a non-debug kernel (Debian's default kernel config),
and the results were the following for various values of NUM_FILES and
NUM_FILE_DELETES:
** before, NUM_FILES = 1 000 000, NUM_FILE_DELETES = 10 000 **
dir fsync took 585 ms after deleting 10000 files
** after, NUM_FILES = 1 000 000, NUM_FILE_DELETES = 10 000 **
dir fsync took 34 ms after deleting 10000 files (-94.2%)
** before, NUM_FILES = 100 000, NUM_FILE_DELETES = 1 000 **
dir fsync took 50 ms after deleting 1000 files
** after, NUM_FILES = 100 000, NUM_FILE_DELETES = 1 000 **
dir fsync took 7 ms after deleting 1000 files (-86.0%)
** before, NUM_FILES = 10 000, NUM_FILE_DELETES = 100 **
dir fsync took 9 ms after deleting 100 files
** after, NUM_FILES = 10 000, NUM_FILE_DELETES = 100 **
dir fsync took 5 ms after deleting 100 files (-44.4%)
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Wed, 15 Dec 2021 12:19:59 +0000 (12:19 +0000)]
btrfs: put initial index value of a directory in a constant
At btrfs_set_inode_index_count() we refer twice to the number 2 as the
initial index value for a directory (when it's empty), with a proper
comment explaining the reason for that value. In the next patch I'll
have to use that magic value in the directory logging code, so put
the value in a #define at btrfs_inode.h, to avoid hardcoding the
magic value again at tree-log.c.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Wed, 15 Dec 2021 12:19:58 +0000 (12:19 +0000)]
btrfs: don't log unnecessary boundary keys when logging directory
Before we start to log dir index keys from a leaf, we check if there is a
previous index key, which normally is at the end of a leaf that was not
changed in the current transaction. Then we log that key and set the start
of logged range (item of type BTRFS_DIR_LOG_INDEX_KEY) to the offset of
that key. This is to ensure that if there were deleted index keys between
that key and the first key we are going to log, those deletions are
replayed in case we need to replay to the log after a power failure.
However we really don't need to log that previous key, we can just set the
start of the logged range to that key's offset plus 1. This achieves the
same and avoids logging one dir index key.
The same logic is performed when we finish logging the index keys of a
leaf and we find that the next leaf has index keys and was not changed in
the current transaction. We are logging the first key of that next leaf
and use its offset as the end of range we log. This is just to ensure that
if there were deleted index keys between the last index key we logged and
the first key of that next leaf, those index keys are deleted if we end
up replaying the log. However that is not necessary, we can avoid logging
that first index key of the next leaf and instead set the end of the
logged range to match the offset of that index key minus 1.
So avoid logging those index keys at the boundaries and adjust the start
and end offsets of the logged ranges as described above.
This patch is part of a patchset comprised of the following patches:
1/4 btrfs: don't log unnecessary boundary keys when logging directory
2/4 btrfs: put initial index value of a directory in a constant
3/4 btrfs: stop copying old dir items when logging a directory
4/4 btrfs: stop trying to log subdirectories created in past transactions
Performance test results are listed in the changelog of patch 3/4.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Sahil Kang [Wed, 5 Jan 2022 08:30:06 +0000 (00:30 -0800)]
btrfs: reuse existing pointers from btrfs_ioctl
btrfs_ioctl already contains pointers to the inode and btrfs_root
structs, so we can pass them into the subfunctions instead of the
toplevel struct file.
Signed-off-by: Sahil Kang <sahil.kang@asilaycomputing.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Tue, 4 Jan 2022 12:53:41 +0000 (12:53 +0000)]
btrfs: remove write and wait of struct walk_control
The ->write and ->wait fields of struct walk_control, used for log trees,
are not used since 2008, more specifically since commit
d0c803c4049c5c
("Btrfs: Record dirty pages tree-log pages in an extent_io tree") and
since commit
d0c803c4049c5c ("Btrfs: Record dirty pages tree-log pages in
an extent_io tree"). So just remove them, along with the function
btrfs_write_tree_block(), which is also not used anymore after removing
the ->write member.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Linus Torvalds [Sun, 13 Feb 2022 20:13:30 +0000 (12:13 -0800)]
Linux 5.17-rc4
Linus Torvalds [Sun, 13 Feb 2022 19:58:11 +0000 (11:58 -0800)]
Merge tag 'kbuild-fixes-v5.17-2' of git://git./linux/kernel/git/masahiroy/linux-kbuild
Pull Kbuild fixes from Masahiro Yamada:
- Fix the truncated path issue for HAVE_GCC_PLUGINS test in Kconfig
- Move -Wunsligned-access to W=1 builds to avoid sprinkling warnings
for the latest Clang
- Fix missing fclose() in Kconfig
- Fix Kconfig to touch dep headers correctly when KCONFIG_AUTOCONFIG is
overridden.
* tag 'kbuild-fixes-v5.17-2' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild:
kconfig: fix failing to generate auto.conf
kconfig: fix missing fclose() on error paths
Makefile.extrawarn: Move -Wunaligned-access to W=1
kconfig: let 'shell' return enough output for deep path names
Linus Torvalds [Sun, 13 Feb 2022 18:06:40 +0000 (10:06 -0800)]
Merge tag 'irq-urgent-2022-02-13' of git://git./linux/kernel/git/tip/tip
Pull irq fixes from Thomas Gleixner:
"Interrupt chip driver fixes:
- Don't install an hotplug notifier for GICV3-ITS on systems which do
not need it to prevent a warning in the notifier about inconsistent
state
- Add the missing device tree matching for the T-HEAD PLIC variant so
the related SoC is properly supported"
* tag 'irq-urgent-2022-02-13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
irqchip/sifive-plic: Add missing thead,c900-plic match string
dt-bindings: update riscv plic compatible string
irqchip/gic-v3-its: Skip HP notifier when no ITS is registered
Linus Torvalds [Sun, 13 Feb 2022 17:43:34 +0000 (09:43 -0800)]
Merge tag 'objtool_urgent_for_v5.17_rc4' of git://git./linux/kernel/git/tip/tip
Pull objtool fix from Borislav Petkov:
"Fix a case where objtool would mistakenly warn about instructions
being unreachable"
* tag 'objtool_urgent_for_v5.17_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/bug: Merge annotate_reachable() into _BUG_FLAGS() asm
Linus Torvalds [Sun, 13 Feb 2022 17:27:26 +0000 (09:27 -0800)]
Merge tag 'sched_urgent_for_v5.17_rc4' of git://git./linux/kernel/git/tip/tip
Pull scheduler fix from Borislav Petkov:
"Fix a NULL-ptr dereference when recalculating a sched entity's weight"
* tag 'sched_urgent_for_v5.17_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/fair: Fix fault in reweight_entity
Linus Torvalds [Sun, 13 Feb 2022 17:25:26 +0000 (09:25 -0800)]
Merge tag 'perf_urgent_for_v5.17_rc4' of git://git./linux/kernel/git/tip/tip
Pull perf fix from Borislav Petkov:
"Prevent cgroup event list corruption when switching events"
* tag 'perf_urgent_for_v5.17_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf: Fix list corruption in perf_cgroup_switch()
Linus Torvalds [Sun, 13 Feb 2022 17:22:52 +0000 (09:22 -0800)]
Merge tag 'x86_urgent_for_v5.17_rc4' of git://git./linux/kernel/git/tip/tip
Pull x86 fix from Borislav Petkov:
"Prevent softlockups when tearing down large SGX enclaves"
* tag 'x86_urgent_for_v5.17_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/sgx: Silence softlockup detection when releasing large enclaves
Linus Torvalds [Sun, 13 Feb 2022 17:16:45 +0000 (09:16 -0800)]
Merge tag '5.17-rc4-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6
Pull cifs fixes from Steve French:
"Three small smb3 reconnect fixes and an error log clarification"
* tag '5.17-rc4-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6:
cifs: mark sessions for reconnection in helper function
cifs: call helper functions for marking channels for reconnect
cifs: call cifs_reconnect when a connection is marked
[smb3] improve error message when mount options conflict with posix
Thomas Gleixner [Sun, 13 Feb 2022 13:16:23 +0000 (14:16 +0100)]
Merge tag 'irqchip-fixes-5.17-2' of git://git./linux/kernel/git/maz/arm-platforms into irq/urgent
Pull irqchip fixes from Marc Zyngier:
- Don't register a hotplug notifier on GICv3 systems that advertise
LPI support, but have no ITS to make use of it
- Add missing DT matching for the thead,c900-plic variant of the
SiFive PLIC
Link: https://lore.kernel.org/r/20220211110038.1179155-1-maz@kernel.org
Linus Torvalds [Sat, 12 Feb 2022 18:29:02 +0000 (10:29 -0800)]
Merge tag 'scsi-fixes' of git://git./linux/kernel/git/jejb/scsi
Pull SCSI fixes from James Bottomley:
"Two minor fixes in the lpfc driver. One changing the classification of
trace messages and the other fixing a build issue when NVME_FC is
disabled"
* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
scsi: lpfc: Reduce log messages seen after firmware download
scsi: lpfc: Remove NVMe support if kernel has NVME_FC disabled
Linus Torvalds [Sat, 12 Feb 2022 18:16:32 +0000 (10:16 -0800)]
Merge tag 'char-misc-5.17-rc4' of git://git./linux/kernel/git/gregkh/char-misc
Pull char/misc driver fixes from Greg KH:
"Here are a small number of char/misc driver fixes for 5.17-rc4 for
reported issues. They contain:
- phy driver fixes
- iio driver fix
- eeprom driver fix
- speakup regression fix
- fastrpc fix
All of these have been in linux-next with no reported issues"
* tag 'char-misc-5.17-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc:
iio: buffer: Fix file related error handling in IIO_BUFFER_GET_FD_IOCTL
speakup-dectlk: Restore pitch setting
bus: mhi: pci_generic: Add mru_default for Cinterion MV31-W
bus: mhi: pci_generic: Add mru_default for Foxconn SDX55
eeprom: ee1004: limit i2c reads to I2C_SMBUS_BLOCK_MAX
misc: fastrpc: avoid double fput() on failed usercopy
phy: dphy: Correct clk_pre parameter
phy: phy-mtk-tphy: Fix duplicated argument in phy-mtk-tphy
phy: stm32: fix a refcount leak in stm32_usbphyc_pll_enable()
phy: xilinx: zynqmp: Fix bus width setting for SGMII
phy: cadence: Sierra: fix error handling bugs in probe()
phy: ti: Fix missing sentinel for clk_div_table
phy: broadcom: Kconfig: Fix PHY_BRCM_USB config option
phy: usb: Leave some clocks running during suspend
Linus Torvalds [Sat, 12 Feb 2022 18:10:35 +0000 (10:10 -0800)]
Merge tag 'staging-5.17-rc4' of git://git./linux/kernel/git/gregkh/staging
Pullstaging driver fixes from Greg KH:
"Here are two staging driver fixes for 5.17-rc4. These are:
- fbtft error path fix
- vc04_services rcu dereference fix
Both of these have been in linux-next for a while with no reported
issues"
* tag 'staging-5.17-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging:
staging: fbtft: Fix error path in fbtft_driver_module_init()
staging: vc04_services: Fix RCU dereference check
Linus Torvalds [Sat, 12 Feb 2022 18:01:55 +0000 (10:01 -0800)]
Merge tag 'tty-5.17-rc4' of git://git./linux/kernel/git/gregkh/tty
Pull tty/serial fixes from Greg KH:
"Here are four small tty/serial fixes for 5.17-rc4. They are:
- 8250_pericom change revert to fix a reported regression
- two speculation fixes for vt_ioctl
- n_tty regression fix for polling
All of these have been in linux-next for a while with no reported
issues"
* tag 'tty-5.17-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty:
vt_ioctl: add array_index_nospec to VT_ACTIVATE
vt_ioctl: fix array_index_nospec in vt_setactivate
serial: 8250_pericom: Revert "Re-enable higher baud rates"
n_tty: wake up poll(POLLRDNORM) on receiving data
Linus Torvalds [Sat, 12 Feb 2022 17:56:18 +0000 (09:56 -0800)]
Merge tag 'usb-5.17-rc4' of git://git./linux/kernel/git/gregkh/usb
Pull USB fixes from Greg KH:
"Here are some small USB driver fixes for 5.17-rc4 that resolve some
reported issues and add new device ids:
- usb-serial new device ids
- ulpi cleanup fixes
- f_fs use-after-free fix
- dwc3 driver fixes
- ax88179_178a usb network driver fix
- usb gadget fixes
There is a revert at the end of this series to resolve a build problem
that 0-day found yesterday. Most of these have been in linux-next,
except for the last few, and all have now passed 0-day tests"
* tag 'usb-5.17-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb:
Revert "usb: dwc2: drd: fix soft connect when gadget is unconfigured"
usb: dwc2: drd: fix soft connect when gadget is unconfigured
usb: gadget: rndis: check size of RNDIS_MSG_SET command
USB: gadget: validate interface OS descriptor requests
usb: core: Unregister device on component_add() failure
net: usb: ax88179_178a: Fix out-of-bounds accesses in RX fixup
usb: dwc3: gadget: Prevent core from processing stale TRBs
USB: serial: cp210x: add CPI Bulk Coin Recycler id
USB: serial: cp210x: add NCR Retail IO box id
USB: serial: ftdi_sio: add support for Brainboxes US-159/235/320
usb: gadget: f_uac2: Define specific wTerminalType
usb: gadget: udc: renesas_usb3: Fix host to USB_ROLE_NONE transition
usb: raw-gadget: fix handling of dual-direction-capable endpoints
usb: usb251xb: add boost-up property support
usb: ulpi: Call of_node_put correctly
usb: ulpi: Move of_node_put to ulpi_dev_release
USB: serial: option: add ZTE MF286D modem
USB: serial: ch341: add support for GW Instek USB2.0-Serial devices
usb: f_fs: Fix use-after-free for epfile
usb: dwc3: xilinx: fix uninitialized return value
Linus Torvalds [Sat, 12 Feb 2022 17:12:44 +0000 (09:12 -0800)]
Merge tag 's390-5.17-4' of git://git./linux/kernel/git/s390/linux
Pull s390 updates from Vasily Gorbik:
"Maintainers and reviewers changes:
- Add Alexander Gordeev as maintainer for s390.
- Christian Borntraeger will focus on s390 KVM maintainership and
stays as s390 reviewer.
Fixes:
- Fix clang build of modules loader KUnit test.
- Fix kernel panic in CIO code on FCES path-event when no driver is
attached to a device or the driver does not provide the path_event
function"
* tag 's390-5.17-4' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux:
s390/cio: verify the driver availability for path_event call
s390/module: fix building test_modules_helpers.o with clang
MAINTAINERS: downgrade myself to Reviewer for s390
MAINTAINERS: add Alexander Gordeev as maintainer for s390
Linus Torvalds [Sat, 12 Feb 2022 17:08:57 +0000 (09:08 -0800)]
Merge tag 'for-linus-5.17a-rc4-tag' of git://git./linux/kernel/git/xen/tip
Pull xen fixes from Juergen Gross:
- Two small cleanups
- Another fix for addressing the EFI framebuffer above 4GB when running
as Xen dom0
- A patch to let Xen guests use reserved bits in MSI- and IO-APIC-
registers for extended APIC-IDs the same way KVM guests are doing it
already
* tag 'for-linus-5.17a-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
xen/pci: Make use of the helper macro LIST_HEAD()
xen/x2apic: Fix inconsistent indenting
xen/x86: detect support for extended destination ID
xen/x86: obtain full video frame buffer address for Dom0 also under EFI
Linus Torvalds [Sat, 12 Feb 2022 17:04:05 +0000 (09:04 -0800)]
Merge tag 'seccomp-v5.17-rc4' of git://git./linux/kernel/git/kees/linux
Pull seccomp fixes from Kees Cook:
"This fixes a corner case of fatal SIGSYS being ignored since v5.15.
Along with the signal fix is a change to seccomp so that seeing
another syscall after a fatal filter result will cause seccomp to kill
the process harder.
Summary:
- Force HANDLER_EXIT even for SIGNAL_UNKILLABLE
- Make seccomp self-destruct after fatal filter results
- Update seccomp samples for easier behavioral demonstration"
* tag 'seccomp-v5.17-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
samples/seccomp: Adjust sample to also provide kill option
seccomp: Invalidate seccomp mode to catch death failures
signal: HANDLER_EXIT should clear SIGNAL_UNKILLABLE
Linus Torvalds [Sat, 12 Feb 2022 16:57:37 +0000 (08:57 -0800)]
Merge branch 'akpm' (patches from Andrew)
Merge misc fixes from Andrew Morton:
"5 patches.
Subsystems affected by this patch series: binfmt, procfs, and mm
(vmscan, memcg, and kfence)"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
kfence: make test case compatible with run time set sample interval
mm: memcg: synchronize objcg lists with a dedicated spinlock
mm: vmscan: remove deadlock due to throttling failing to make progress
fs/proc: task_mmu.c: don't read mapcount for migration entry
fs/binfmt_elf: fix PT_LOAD p_align values for loaders
Jing Leng [Fri, 11 Feb 2022 09:27:36 +0000 (17:27 +0800)]
kconfig: fix failing to generate auto.conf
When the KCONFIG_AUTOCONFIG is specified (e.g. export \
KCONFIG_AUTOCONFIG=output/config/auto.conf), the directory of
include/config/ will not be created, so kconfig can't create deps
files in it and auto.conf can't be generated.
Signed-off-by: Jing Leng <jleng@ambarella.com>
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Greg Kroah-Hartman [Sat, 12 Feb 2022 09:08:54 +0000 (10:08 +0100)]
Revert "usb: dwc2: drd: fix soft connect when gadget is unconfigured"
This reverts commit
269cbcf7b72de6f0016806d4a0cec1d689b55a87.
It causes build errors as reported by the kernel test robot.
Link: https://lore.kernel.org/r/202202112236.AwoOTtHO-lkp@intel.com
Reported-by: kernel test robot <lkp@intel.com>
Fixes:
269cbcf7b72d ("usb: dwc2: drd: fix soft connect when gadget is unconfigured")
Cc: stable@kernel.org
Cc: Amelie Delaunay <amelie.delaunay@foss.st.com>
Cc: Minas Harutyunyan <Minas.Harutyunyan@synopsys.com>
Cc: Fabrice Gasnier <fabrice.gasnier@foss.st.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Peng Liu [Sat, 12 Feb 2022 00:32:35 +0000 (16:32 -0800)]
kfence: make test case compatible with run time set sample interval
The parameter kfence_sample_interval can be set via boot parameter and
late shell command, which is convenient for automated tests and KFENCE
parameter optimization. However, KFENCE test case just uses
compile-time CONFIG_KFENCE_SAMPLE_INTERVAL, which will make KFENCE test
case not run as users desired. Export kfence_sample_interval, so that
KFENCE test case can use run-time-set sample interval.
Link: https://lkml.kernel.org/r/20220207034432.185532-1-liupeng256@huawei.com
Signed-off-by: Peng Liu <liupeng256@huawei.com>
Reviewed-by: Marco Elver <elver@google.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Sumit Semwal <sumit.semwal@linaro.org>
Cc: Christian Knig <christian.koenig@amd.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Roman Gushchin [Sat, 12 Feb 2022 00:32:32 +0000 (16:32 -0800)]
mm: memcg: synchronize objcg lists with a dedicated spinlock
Alexander reported a circular lock dependency revealed by the mmap1 ltp
test:
LOCKDEP_CIRCULAR (suite: ltp, case: mtest06 (mmap1))
WARNING: possible circular locking dependency detected
5.17.0-
20220113.rc0.git0.
f2211f194038.300.fc35.s390x+debug #1 Not tainted
------------------------------------------------------
mmap1/202299 is trying to acquire lock:
00000001892c0188 (css_set_lock){..-.}-{2:2}, at: obj_cgroup_release+0x4a/0xe0
but task is already holding lock:
00000000ca3b3818 (&sighand->siglock){-.-.}-{2:2}, at: force_sig_info_to_task+0x38/0x180
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #1 (&sighand->siglock){-.-.}-{2:2}:
__lock_acquire+0x604/0xbd8
lock_acquire.part.0+0xe2/0x238
lock_acquire+0xb0/0x200
_raw_spin_lock_irqsave+0x6a/0xd8
__lock_task_sighand+0x90/0x190
cgroup_freeze_task+0x2e/0x90
cgroup_migrate_execute+0x11c/0x608
cgroup_update_dfl_csses+0x246/0x270
cgroup_subtree_control_write+0x238/0x518
kernfs_fop_write_iter+0x13e/0x1e0
new_sync_write+0x100/0x190
vfs_write+0x22c/0x2d8
ksys_write+0x6c/0xf8
__do_syscall+0x1da/0x208
system_call+0x82/0xb0
-> #0 (css_set_lock){..-.}-{2:2}:
check_prev_add+0xe0/0xed8
validate_chain+0x736/0xb20
__lock_acquire+0x604/0xbd8
lock_acquire.part.0+0xe2/0x238
lock_acquire+0xb0/0x200
_raw_spin_lock_irqsave+0x6a/0xd8
obj_cgroup_release+0x4a/0xe0
percpu_ref_put_many.constprop.0+0x150/0x168
drain_obj_stock+0x94/0xe8
refill_obj_stock+0x94/0x278
obj_cgroup_charge+0x164/0x1d8
kmem_cache_alloc+0xac/0x528
__sigqueue_alloc+0x150/0x308
__send_signal+0x260/0x550
send_signal+0x7e/0x348
force_sig_info_to_task+0x104/0x180
force_sig_fault+0x48/0x58
__do_pgm_check+0x120/0x1f0
pgm_check_handler+0x11e/0x180
other info that might help us debug this:
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(&sighand->siglock);
lock(css_set_lock);
lock(&sighand->siglock);
lock(css_set_lock);
*** DEADLOCK ***
2 locks held by mmap1/202299:
#0:
00000000ca3b3818 (&sighand->siglock){-.-.}-{2:2}, at: force_sig_info_to_task+0x38/0x180
#1:
00000001892ad560 (rcu_read_lock){....}-{1:2}, at: percpu_ref_put_many.constprop.0+0x0/0x168
stack backtrace:
CPU: 15 PID: 202299 Comm: mmap1 Not tainted 5.17.0-
20220113.rc0.git0.
f2211f194038.300.fc35.s390x+debug #1
Hardware name: IBM 3906 M04 704 (LPAR)
Call Trace:
dump_stack_lvl+0x76/0x98
check_noncircular+0x136/0x158
check_prev_add+0xe0/0xed8
validate_chain+0x736/0xb20
__lock_acquire+0x604/0xbd8
lock_acquire.part.0+0xe2/0x238
lock_acquire+0xb0/0x200
_raw_spin_lock_irqsave+0x6a/0xd8
obj_cgroup_release+0x4a/0xe0
percpu_ref_put_many.constprop.0+0x150/0x168
drain_obj_stock+0x94/0xe8
refill_obj_stock+0x94/0x278
obj_cgroup_charge+0x164/0x1d8
kmem_cache_alloc+0xac/0x528
__sigqueue_alloc+0x150/0x308
__send_signal+0x260/0x550
send_signal+0x7e/0x348
force_sig_info_to_task+0x104/0x180
force_sig_fault+0x48/0x58
__do_pgm_check+0x120/0x1f0
pgm_check_handler+0x11e/0x180
INFO: lockdep is turned off.
In this example a slab allocation from __send_signal() caused a
refilling and draining of a percpu objcg stock, resulted in a releasing
of another non-related objcg. Objcg release path requires taking the
css_set_lock, which is used to synchronize objcg lists.
This can create a circular dependency with the sighandler lock, which is
taken with the locked css_set_lock by the freezer code (to freeze a
task).
In general it seems that using css_set_lock to synchronize objcg lists
makes any slab allocations and deallocation with the locked css_set_lock
and any intervened locks risky.
To fix the problem and make the code more robust let's stop using
css_set_lock to synchronize objcg lists and use a new dedicated spinlock
instead.
Link: https://lkml.kernel.org/r/Yfm1IHmoGdyUR81T@carbon.dhcp.thefacebook.com
Fixes:
bf4f059954dc ("mm: memcg/slab: obj_cgroup API")
Signed-off-by: Roman Gushchin <guro@fb.com>
Reported-by: Alexander Egorenkov <egorenar@linux.ibm.com>
Tested-by: Alexander Egorenkov <egorenar@linux.ibm.com>
Reviewed-by: Waiman Long <longman@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Reviewed-by: Jeremy Linton <jeremy.linton@arm.com>
Tested-by: Jeremy Linton <jeremy.linton@arm.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Mel Gorman [Sat, 12 Feb 2022 00:32:29 +0000 (16:32 -0800)]
mm: vmscan: remove deadlock due to throttling failing to make progress
A soft lockup bug in kcompactd was reported in a private bugzilla with
the following visible in dmesg;
watchdog: BUG: soft lockup - CPU#33 stuck for 26s! [kcompactd0:479]
watchdog: BUG: soft lockup - CPU#33 stuck for 52s! [kcompactd0:479]
watchdog: BUG: soft lockup - CPU#33 stuck for 78s! [kcompactd0:479]
watchdog: BUG: soft lockup - CPU#33 stuck for 104s! [kcompactd0:479]
The machine had 256G of RAM with no swap and an earlier failed
allocation indicated that node 0 where kcompactd was run was potentially
unreclaimable;
Node 0 active_anon:29355112kB inactive_anon:2913528kB active_file:0kB
inactive_file:0kB unevictable:64kB isolated(anon):0kB isolated(file):0kB
mapped:8kB dirty:0kB writeback:0kB shmem:26780kB shmem_thp:
0kB shmem_pmdmapped: 0kB anon_thp: 23480320kB writeback_tmp:0kB
kernel_stack:2272kB pagetables:24500kB all_unreclaimable? yes
Vlastimil Babka investigated a crash dump and found that a task
migrating pages was trying to drain PCP lists;
PID: 52922 TASK:
ffff969f820e5000 CPU: 19 COMMAND: "kworker/u128:3"
Call Trace:
__schedule
schedule
schedule_timeout
wait_for_completion
__flush_work
__drain_all_pages
__alloc_pages_slowpath.constprop.114
__alloc_pages
alloc_migration_target
migrate_pages
migrate_to_node
do_migrate_pages
cpuset_migrate_mm_workfn
process_one_work
worker_thread
kthread
ret_from_fork
This failure is specific to CONFIG_PREEMPT=n builds. The root of the
problem is that kcompact0 is not rescheduling on a CPU while a task that
has isolated a large number of the pages from the LRU is waiting on
kcompact0 to reschedule so the pages can be released. While
shrink_inactive_list() only loops once around too_many_isolated, reclaim
can continue without rescheduling if sc->skipped_deactivate == 1 which
could happen if there was no file LRU and the inactive anon list was not
low.
Link: https://lkml.kernel.org/r/20220203100326.GD3301@suse.de
Fixes:
d818fca1cac3 ("mm/vmscan: throttle reclaim and compaction when too may pages are isolated")
Signed-off-by: Mel Gorman <mgorman@suse.de>
Debugged-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Yang Shi [Sat, 12 Feb 2022 00:32:26 +0000 (16:32 -0800)]
fs/proc: task_mmu.c: don't read mapcount for migration entry
The syzbot reported the below BUG:
kernel BUG at include/linux/page-flags.h:785!
invalid opcode: 0000 [#1] PREEMPT SMP KASAN
CPU: 1 PID: 4392 Comm: syz-executor560 Not tainted 5.16.0-rc6-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
RIP: 0010:PageDoubleMap include/linux/page-flags.h:785 [inline]
RIP: 0010:__page_mapcount+0x2d2/0x350 mm/util.c:744
Call Trace:
page_mapcount include/linux/mm.h:837 [inline]
smaps_account+0x470/0xb10 fs/proc/task_mmu.c:466
smaps_pte_entry fs/proc/task_mmu.c:538 [inline]
smaps_pte_range+0x611/0x1250 fs/proc/task_mmu.c:601
walk_pmd_range mm/pagewalk.c:128 [inline]
walk_pud_range mm/pagewalk.c:205 [inline]
walk_p4d_range mm/pagewalk.c:240 [inline]
walk_pgd_range mm/pagewalk.c:277 [inline]
__walk_page_range+0xe23/0x1ea0 mm/pagewalk.c:379
walk_page_vma+0x277/0x350 mm/pagewalk.c:530
smap_gather_stats.part.0+0x148/0x260 fs/proc/task_mmu.c:768
smap_gather_stats fs/proc/task_mmu.c:741 [inline]
show_smap+0xc6/0x440 fs/proc/task_mmu.c:822
seq_read_iter+0xbb0/0x1240 fs/seq_file.c:272
seq_read+0x3e0/0x5b0 fs/seq_file.c:162
vfs_read+0x1b5/0x600 fs/read_write.c:479
ksys_read+0x12d/0x250 fs/read_write.c:619
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x44/0xae
The reproducer was trying to read /proc/$PID/smaps when calling
MADV_FREE at the mean time. MADV_FREE may split THPs if it is called
for partial THP. It may trigger the below race:
CPU A CPU B
----- -----
smaps walk: MADV_FREE:
page_mapcount()
PageCompound()
split_huge_page()
page = compound_head(page)
PageDoubleMap(page)
When calling PageDoubleMap() this page is not a tail page of THP anymore
so the BUG is triggered.
This could be fixed by elevated refcount of the page before calling
mapcount, but that would prevent it from counting migration entries, and
it seems overkilling because the race just could happen when PMD is
split so all PTE entries of tail pages are actually migration entries,
and smaps_account() does treat migration entries as mapcount == 1 as
Kirill pointed out.
Add a new parameter for smaps_account() to tell this entry is migration
entry then skip calling page_mapcount(). Don't skip getting mapcount
for device private entries since they do track references with mapcount.
Pagemap also has the similar issue although it was not reported. Fixed
it as well.
[shy828301@gmail.com: v4]
Link: https://lkml.kernel.org/r/20220203182641.824731-1-shy828301@gmail.com
[nathan@kernel.org: avoid unused variable warning in pagemap_pmd_range()]
Link: https://lkml.kernel.org/r/20220207171049.1102239-1-nathan@kernel.org
Link: https://lkml.kernel.org/r/20220120202805.3369-1-shy828301@gmail.com
Fixes:
e9b61f19858a ("thp: reintroduce split_huge_page()")
Signed-off-by: Yang Shi <shy828301@gmail.com>
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Reported-by: syzbot+1f52b3a18d5633fa7f82@syzkaller.appspotmail.com
Acked-by: David Hildenbrand <david@redhat.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Jann Horn <jannh@google.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Mike Rapoport [Sat, 12 Feb 2022 00:32:22 +0000 (16:32 -0800)]
fs/binfmt_elf: fix PT_LOAD p_align values for loaders
Rui Salvaterra reported that Aisleroit solitaire crashes with "Wrong
__data_start/_end pair" assertion from libgc after update to v5.17-rc1.
Bisection pointed to commit
9630f0d60fec ("fs/binfmt_elf: use PT_LOAD
p_align values for static PIE") that fixed handling of static PIEs, but
made the condition that guards load_bias calculation to exclude loader
binaries.
Restoring the check for presence of interpreter fixes the problem.
Link: https://lkml.kernel.org/r/20220202121433.3697146-1-rppt@kernel.org
Fixes:
9630f0d60fec ("fs/binfmt_elf: use PT_LOAD p_align values for static PIE")
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Reported-by: Rui Salvaterra <rsalvaterra@gmail.com>
Tested-by: Rui Salvaterra <rsalvaterra@gmail.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: "H.J. Lu" <hjl.tools@gmail.com>
Cc: Kees Cook <keescook@chromium.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Linus Torvalds [Fri, 11 Feb 2022 21:40:03 +0000 (13:40 -0800)]
Merge tag 'soc-fixes-5.17-1' of git://git./linux/kernel/git/soc/soc
Pull ARM SoC fixes from Arnd Bergmann:
"This is a fairly large set of bugfixes, most of which had been sent a
while ago but only now made it into the soc tree:
Maintainer file updates:
- Claudiu Beznea now co-maintains the at91 soc family, replacing
Ludovic Desroches.
- Michael Walle maintains the sl28cpld drivers
- Alain Volmat and Raphael Gallais-Pou take over some drivers for ST
platforms
- Alim Akhtar is an additional reviewer for Samsung platforms
Code fixes:
- Op-tee had a problem with object lifetime that needs a slightly
complex fix, as well as another bug with error handling.
- Several minor issues for the OMAP platform, including a regression
with the timer
- A Kconfig change to fix a build-time issue on Intel SoCFPGA
Device tree fixes:
- The Amlogic Meson platform fixes a boot regression on am1-odroid, a
spurious interrupt, and a problem with reserved memory regions
- In the i.MX platform, several bug fixes are needed to make devices
work correctly: SD card detection, alarmtimer, and sound card on
some board. One patch for the GPU got in there by accident and gets
reverted again.
- TI K3 needs a fix for J721S2 serial port numbers
- ux500 needs a fix to mount the SD card as root on the Skomer phone"
* tag 'soc-fixes-5.17-1' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc: (46 commits)
Revert "arm64: dts: imx8mn-venice-gw7902: disable gpu"
arm64: Remove ARCH_VULCAN
MAINTAINERS: add myself as a maintainer for the sl28cpld
MAINTAINERS: add IRC to ARM sub-architectures and Devicetree
MAINTAINERS: arm: samsung: add Git tree and IRC
ARM: dts: Fix boot regression on Skomer
ARM: dts: spear320: Drop unused and undocumented 'irq-over-gpio' property
soc: aspeed: lpc-ctrl: Block error printing on probe defer cases
docs/ABI: testing: aspeed-uart-routing: Escape asterisk
MAINTAINERS: update drm/stm drm/sti and cec/sti maintainers
MAINTAINERS: Update Benjamin Gaignard maintainer status
ARM: socfpga: fix missing RESET_CONTROLLER
arm64: dts: meson-sm1-odroid: fix boot loop after reboot
arm64: dts: meson-g12: drop BL32 region from SEI510/SEI610
arm64: dts: meson-g12: add ATF BL32 reserved-memory region
arm64: dts: meson-gx: add ATF BL32 reserved-memory region
arm64: dts: meson-sm1-bananapi-m5: fix wrong GPIO domain for GPIOE_2
arm64: dts: meson-sm1-odroid: use correct enable-gpio pin for tf-io regulator
arm64: dts: meson-g12b-odroid-n2: fix typo 'dio2133'
optee: use driver internal tee_context for some rpc
...