Josef Bacik [Fri, 18 Feb 2022 15:03:29 +0000 (10:03 -0500)]
btrfs: do not clean up repair bio if submit fails
The submit helper will always run bio_endio() on the bio if it fails to
submit, so cleaning up the bio just leads to a variety of use-after-free
and NULL pointer dereference bugs because we race with the endio
function that is cleaning up the bio. Instead just return BLK_STS_OK as
the repair function has to continue to process the rest of the pages,
and the endio for the repair bio will do the appropriate cleanup for the
page that it was given.
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Josef Bacik [Fri, 18 Feb 2022 15:03:28 +0000 (10:03 -0500)]
btrfs: do not try to repair bio that has no mirror set
If we fail to submit a bio for whatever reason, we may not have setup a
mirror_num for that bio. This means we shouldn't try to do the repair
workflow, if we do we'll hit an BUG_ON(!failrec->this_mirror) in
clean_io_failure. Instead simply skip the repair workflow if we have no
mirror set, and add an assert to btrfs_check_repairable() to make it
easier to catch what is happening in the future.
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Josef Bacik [Fri, 18 Feb 2022 15:03:27 +0000 (10:03 -0500)]
btrfs: do not double complete bio on errors during compressed reads
I hit some weird panics while fixing up the error handling from
btrfs_lookup_bio_sums(). Turns out the compression path will complete
the bio we use if we set up any of the compression bios and then return
an error, and then btrfs_submit_data_bio() will also call bio_endio() on
the bio.
Fix this by making btrfs_submit_compressed_read() responsible for
calling bio_endio() on the bio if there are any errors. Currently it
was only doing it if we created the compression bios, otherwise it was
depending on btrfs_submit_data_bio() to do the right thing. This
creates the above problem, so fix up btrfs_submit_compressed_read() to
always call bio_endio() in case of an error, and then simply return from
btrfs_submit_data_bio() if we had to call
btrfs_submit_compressed_read().
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Josef Bacik [Fri, 18 Feb 2022 15:03:26 +0000 (10:03 -0500)]
btrfs: track compressed bio errors as blk_status_t
Right now we just have a binary "errors" flag, so any error we get on
the compressed bio's gets translated to EIO. This isn't necessarily a
bad thing, but if we get an ENOMEM it may be nice to know that's what
happened instead of an EIO. Track our errors as a blk_status_t, and do
the appropriate setting of the errors accordingly.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Josef Bacik [Fri, 18 Feb 2022 15:03:25 +0000 (10:03 -0500)]
btrfs: remove the bio argument from finish_compressed_bio_read
This bio is usually one of the compressed bio's, and we don't actually
need it in this function, so remove the argument and stop passing it
around.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Josef Bacik [Fri, 18 Feb 2022 15:03:24 +0000 (10:03 -0500)]
btrfs: check correct bio in finish_compressed_bio_read
Commit
c09abff87f90 ("btrfs: cloned bios must not be iterated by
bio_for_each_segment_all") added ASSERT()'s to make sure we weren't
calling bio_for_each_segment_all() on a RAID5/6 bio. However it was
checking the bio that the compression code passed in, not the
cb->orig_bio that we actually iterate over, so adjust this ASSERT() to
check the correct bio.
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Josef Bacik [Fri, 18 Feb 2022 15:03:23 +0000 (10:03 -0500)]
btrfs: handle csum lookup errors properly on reads
Currently any error we get while trying to lookup csums during reads
shows up as a missing csum, and then on the read completion side we
print an error saying there was a csum mismatch and we increase the
device corruption count.
However we could have gotten an EIO from the lookup. We could also be
inside of a memory constrained container and gotten a ENOMEM while
trying to do the read. In either case we don't want to make this look
like a file system corruption problem, we want to make it look like the
actual error it is. Capture any negative value, convert it to the
appropriate blk_status_t, free the csum array if we have one and bail.
Note: a possible improvement would be to make the relocation code look
up the owning inode and see if it's marked as NODATASUM and set
EXTENT_NODATASUM there, that way if there's corruption and there isn't a
checksum when we want it we can fail here rather than later.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Josef Bacik [Fri, 18 Feb 2022 15:03:22 +0000 (10:03 -0500)]
btrfs: make search_csum_tree return 0 if we get -EFBIG
We can either fail to find a csum entry at all and return -ENOENT, or we
can find a range that is close, but return -EFBIG. In essence these
both mean the same thing when we are doing a lookup for a csum in an
existing range, we didn't find a csum. We want to treat both of these
errors the same way, complain loudly that there wasn't a csum. This
currently happens anyway because we do
count = search_csum_tree();
if (count <= 0) {
// reloc and error handling
}
However it forces us to incorrectly treat EIO or ENOMEM errors as on
disk corruption. Fix this by returning 0 if we get either -ENOENT or
-EFBIG from btrfs_lookup_csum() so we can do proper error handling.
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Omar Sandoval [Tue, 13 Aug 2019 23:00:02 +0000 (16:00 -0700)]
btrfs: add BTRFS_IOC_ENCODED_WRITE
The implementation resembles direct I/O: we have to flush any ordered
extents, invalidate the page cache, and do the io tree/delalloc/extent
map/ordered extent dance. From there, we can reuse the compression code
with a minor modification to distinguish the write from writeback. This
also creates inline extents when possible.
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Omar Sandoval [Thu, 10 Oct 2019 00:59:07 +0000 (17:59 -0700)]
btrfs: add BTRFS_IOC_ENCODED_READ ioctl
There are 4 main cases:
1. Inline extents: we copy the data straight out of the extent buffer.
2. Hole/preallocated extents: we fill in zeroes.
3. Regular, uncompressed extents: we read the sectors we need directly
from disk.
4. Regular, compressed extents: we read the entire compressed extent
from disk and indicate what subset of the decompressed extent is in
the file.
This initial implementation simplifies a few things that can be improved
in the future:
- Cases 1, 3, and 4 allocate temporary memory to read into before
copying out to userspace.
- We don't do read repair, because it turns out that read repair is
currently broken for compressed data.
- We hold the inode lock during the operation.
Note that we don't need to hold the mmap lock. We may race with
btrfs_page_mkwrite() and read the old data from before the page was
dirtied:
btrfs_page_mkwrite btrfs_encoded_read
---------------------------------------------------
(enter) (enter)
btrfs_wait_ordered_range
lock_extent_bits
btrfs_page_set_dirty
unlock_extent_cached
(exit)
lock_extent_bits
read extent (dirty page hasn't been flushed,
so this is the old data)
unlock_extent_cached
(exit)
we read the old data from before the page was dirtied. But, that's true
even if we were to hold the mmap lock:
btrfs_page_mkwrite btrfs_encoded_read
-------------------------------------------------------------------
(enter) (enter)
btrfs_inode_lock(BTRFS_ILOCK_MMAP)
down_read(i_mmap_lock) (blocked)
btrfs_wait_ordered_range
lock_extent_bits
read extent (page hasn't been dirtied,
so this is the old data)
unlock_extent_cached
btrfs_inode_unlock(BTRFS_ILOCK_MMAP)
down_read(i_mmap_lock) returns
lock_extent_bits
btrfs_page_set_dirty
unlock_extent_cached
In other words, this is inherently racy, so it's fine that we return the
old data in this tiny window.
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Omar Sandoval [Mon, 16 Aug 2021 22:58:29 +0000 (15:58 -0700)]
btrfs: add definitions and documentation for encoded I/O ioctls
In order to allow sending and receiving compressed data without
decompressing it, we need an interface to write pre-compressed data
directly to the filesystem and the matching interface to read compressed
data without decompressing it. This adds the definitions for ioctls to
do that and detailed explanations of how to use them.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Omar Sandoval [Thu, 7 Nov 2019 23:19:16 +0000 (15:19 -0800)]
btrfs: optionally extend i_size in cow_file_range_inline()
Currently, an inline extent is always created after i_size is extended
from btrfs_dirty_pages(). However, for encoded writes, we only want to
update i_size after we successfully created the inline extent. Add an
update_i_size parameter to cow_file_range_inline() and
insert_inline_extent() and pass in the size of the extent rather than
determining it from i_size.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ reformat comment ]
Signed-off-by: David Sterba <dsterba@suse.com>
Omar Sandoval [Tue, 16 Nov 2021 22:03:45 +0000 (14:03 -0800)]
btrfs: clean up cow_file_range_inline()
The start parameter to cow_file_range_inline() (and
insert_inline_extent()) is always 0, so get rid of it and simplify the
logic in those two functions. Pass btrfs_inode to insert_inline_extent()
and remove the redundant root parameter. Also document the requirements
for creating an inline extent. No functional change.
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Omar Sandoval [Tue, 19 Nov 2019 06:45:55 +0000 (22:45 -0800)]
btrfs: support different disk extent size for delalloc
Currently, we always reserve the same extent size in the file and extent
size on disk for delalloc because the former is the worst case for the
latter. For BTRFS_IOC_ENCODED_WRITE writes, we know the exact size of
the extent on disk, which may be less than or greater than (for
bookends) the size in the file. Add a disk_num_bytes parameter to
btrfs_delalloc_reserve_metadata() so that we can reserve the correct
amount of csum bytes. No functional change.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Omar Sandoval [Wed, 6 Nov 2019 20:11:56 +0000 (12:11 -0800)]
btrfs: add ram_bytes and offset to btrfs_ordered_extent
Currently, we only create ordered extents when ram_bytes == num_bytes
and offset == 0. However, BTRFS_IOC_ENCODED_WRITE writes may create
extents which only refer to a subset of the full unencoded extent, so we
need to plumb these fields through the ordered extent infrastructure and
pass them down to insert_reserved_file_extent().
Since we're changing the btrfs_add_ordered_extent* signature, let's get
rid of the trivial wrappers and add a kernel-doc.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Omar Sandoval [Wed, 6 Nov 2019 23:38:43 +0000 (15:38 -0800)]
btrfs: don't advance offset for compressed bios in btrfs_csum_one_bio()
btrfs_csum_one_bio() loops over each filesystem block in the bio while
keeping a cursor of its current logical position in the file in order to
look up the ordered extent to add the checksums to. However, this
doesn't make much sense for compressed extents, as a sector on disk does
not correspond to a sector of decompressed file data. It happens to work
because:
1) the compressed bio always covers one ordered extent
2) the size of the bio is always less than the size of the ordered
extent
However, the second point will not always be true for encoded writes.
Let's add a boolean parameter to btrfs_csum_one_bio() to indicate that
it can assume that the bio only covers one ordered extent. Since we're
already changing the signature, let's get rid of the contig parameter
and make it implied by the offset parameter, similar to the change we
recently made to btrfs_lookup_bio_sums(). Additionally, let's rename
nr_sectors to blockcount to make it clear that it's the number of
filesystem blocks, not the number of 512-byte sectors.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Omar Sandoval [Thu, 12 Aug 2021 22:34:57 +0000 (15:34 -0700)]
fs: export variant of generic_write_checks without iov_iter
Encoded I/O in Btrfs needs to check a write with a given logical size
without an iov_iter that matches that size (because the iov_iter we have
is for the compressed data). So, factor out the parts of
generic_write_check() that don't need an iov_iter into a new
generic_write_checks_count() function and export that.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Omar Sandoval [Wed, 4 Sep 2019 19:13:25 +0000 (12:13 -0700)]
fs: export rw_verify_area()
I'm adding btrfs ioctls to read and write compressed data, and rather
than duplicating the checks in rw_verify_area(), let's just export it.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Sidong Yang [Fri, 11 Feb 2022 13:48:29 +0000 (13:48 +0000)]
btrfs: qgroup: remove outdated TODO comments
These comments are old, outdated and not very specific. It seems that it
doesn't help to inspire anybody to work on that. So we remove them.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Sidong Yang <realwakka@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Sidong Yang [Sun, 6 Feb 2022 12:52:48 +0000 (12:52 +0000)]
btrfs: qgroup: remove duplicated check in adding qgroup relations
Removes duplicated check when adding qgroup relations.
btrfs_add_qgroup_relations function adds relations by calling
add_relation_rb(). add_relation_rb() checks that member/parentid exists
in current qgroup_tree. But it already checked before calling the
function. It seems that we don't need to double check.
Add new function __add_relation_rb() that adds relations with
qgroup structures and makes old function use the new one. And it makes
btrfs_add_qgroup_relation() function work without double checks by
calling the new function.
Signed-off-by: Sidong Yang <realwakka@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add comments ]
Signed-off-by: David Sterba <dsterba@suse.com>
Dāvis Mosāns [Wed, 2 Feb 2022 21:44:54 +0000 (23:44 +0200)]
btrfs: add lzo workspace buffer length constants
It makes it more readable for length checking and is be used repeatedly.
Signed-off-by: Dāvis Mosāns <davispuh@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Qu Wenruo [Tue, 8 Feb 2022 05:31:19 +0000 (13:31 +0800)]
btrfs: populate extent_map::generation when reading from disk
When btrfs_get_extent() tries to get some file extent from disk, it
never populates extent_map::generation, leaving the value to be 0.
On the other hand, for extent map generated by IO, it will get its
generation properly set at finish_ordered_io()
finish_ordered_io()
|- unpin_extent_cache(gen = trans->transid)
|- em->generation = gen;
[CAUSE]
Since extent_map::generation is mostly used by fsync code, and for fsync
they only care about modified extents, which all have their
em::generation > 0.
Thus it's fine to not populate em read from disk for fsync.
[CORNER CASE]
However autodefrag also relies on em::generation to determine if one
extent needs to be defragged.
This unpopulated extent_map::generation can prevent the following
autodefrag case from working:
mkfs.btrfs -f $dev
mount $dev $mnt -o autodefrag
# initial write to queue the inode for autodefrag
xfs_io -f -c "pwrite 0 4k" $mnt/file
sync
# Real fragmented write
xfs_io -f -s -c "pwrite -b 4096 0 32k" $mnt/file
sync
echo "=== before autodefrag ==="
xfs_io -c "fiemap -v" $mnt/file
# Drop cache to force em to be read from disk
echo 3 > /proc/sys/vm/drop_caches
mount -o remount,commit=1 $mnt
sleep 3
sync
echo "=== After autodefrag ==="
xfs_io -c "fiemap -v" $mnt/file
umount $mnt
The result looks like this:
=== before autodefrag ===
/mnt/btrfs/file:
EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS
0: [0..15]: 26672..26687 16 0x0
1: [16..31]: 26656..26671 16 0x0
2: [32..47]: 26640..26655 16 0x0
3: [48..63]: 26624..26639 16 0x1
=== After autodefrag ===
/mnt/btrfs/file:
EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS
0: [0..15]: 26672..26687 16 0x0
1: [16..31]: 26656..26671 16 0x0
2: [32..47]: 26640..26655 16 0x0
3: [48..63]: 26624..26639 16 0x1
This fragmented 32K will not be defragged by autodefrag.
[FIX]
To make things less weird, just populate extent_map::generation when
reading file extents from disk.
This would make above fragmented extents to be properly defragged:
== before autodefrag ===
/mnt/btrfs/file:
EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS
0: [0..15]: 26672..26687 16 0x0
1: [16..31]: 26656..26671 16 0x0
2: [32..47]: 26640..26655 16 0x0
3: [48..63]: 26624..26639 16 0x1
=== After autodefrag ===
/mnt/btrfs/file:
EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS
0: [0..63]: 26688..26751 64 0x1
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Thu, 3 Feb 2022 15:36:45 +0000 (15:36 +0000)]
btrfs: assert we have a write lock when removing and replacing extent maps
Removing or replacing an extent map requires holding a write lock on the
extent map's tree. We currently do that everywhere, except in one of the
self tests, where it's harmless since there's no concurrency.
In order to catch possible races in the future, assert that we are holding
a write lock on the extent map tree before removing or replacing an extent
map in the tree, and update the self test to obtain a write lock before
removing extent maps.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Thu, 3 Feb 2022 15:36:44 +0000 (15:36 +0000)]
btrfs: remove no longer used counter when reading data page
After commit
92082d40976ed0 ("btrfs: integrate page status update for
data read path into begin/end_page_read"), the 'nr' counter at
btrfs_do_readpage() is no longer used, we increment it but we never
read from it. So just remove it.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Thu, 3 Feb 2022 15:36:43 +0000 (15:36 +0000)]
btrfs: fix lost error return value when reading a data page
At btrfs_do_readpage(), if we get an error when trying to lookup for an
extent map, we end up marking the page with the error bit, clearing
the uptodate bit on it, and doing everything else that should be done.
However we return success (0) to the caller, when we should return the
error encoded in the extent map pointer. So fix that by returning the
error encoded in the pointer.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Thu, 3 Feb 2022 15:36:42 +0000 (15:36 +0000)]
btrfs: stop checking for NULL return from btrfs_get_extent()
At extent_io.c, in the read page and write page code paths, we are testing
if the return value from btrfs_get_extent() can be NULL. However that is
not possible, as btrfs_get_extent() always returns either an error pointer
or a (non-NULL) pointer to an extent map structure.
Everywhere else outside extent_io.c we never check for NULL, we always
treat any returned value as a non-NULL pointer if it does not encode an
error.
So check only for the IS_ERR() case at extent_io.c.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Thu, 3 Feb 2022 14:55:50 +0000 (14:55 +0000)]
btrfs: prepare extents to be logged before locking a log tree path
When we want to log an extent, in the fast fsync path, we obtain a path
to the leaf that will hold the file extent item either through a deletion
search, via btrfs_drop_extents(), or through an insertion search using
btrfs_insert_empty_item(). After that we fill the file extent item's
fields one by one directly on the leaf.
Instead of doing that, we could prepare the file extent item before
obtaining a btree path, and then copy the prepared extent item with a
single operation once we get the path. This helps avoid some contention
on the log tree, since we are holding write locks for longer than
necessary, especially in the case where the path is obtained via
btrfs_drop_extents() through a deletion search, which always keeps a
write lock on the nodes at levels 1 and 2 (besides the leaf).
This change does that, we prepare the file extent item that is going to
be inserted before acquiring a path, and then copy it into a leaf using
a single copy operation once we get a path.
This change if part of a patchset that is comprised of the following
patches:
1/6 btrfs: remove unnecessary leaf free space checks when pushing items
2/6 btrfs: avoid unnecessary COW of leaves when deleting items from a leaf
3/6 btrfs: avoid unnecessary computation when deleting items from a leaf
4/6 btrfs: remove constraint on number of visited leaves when replacing extents
5/6 btrfs: remove useless path release in the fast fsync path
6/6 btrfs: prepare extents to be logged before locking a log tree path
The following test was run to measure the impact of the whole patchset:
$ cat test.sh
#!/bin/bash
DEV=/dev/sdi
MNT=/mnt/sdi
MOUNT_OPTIONS="-o ssd"
MKFS_OPTIONS="-R free-space-tree -O no-holes"
NUM_JOBS=8
FILE_SIZE=128M
RUN_TIME=200
cat <<EOF > /tmp/fio-job.ini
[writers]
rw=randwrite
fsync=1
fallocate=none
group_reporting=1
direct=0
bssplit=4k/20:8k/20:16k/20:32k/10:64k/10:128k/5:256k/5:512k/5:1m/5
ioengine=sync
filesize=$FILE_SIZE
runtime=$RUN_TIME
time_based
directory=$MNT
numjobs=$NUM_JOBS
thread
EOF
echo "performance" | \
tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
echo
echo "Using config:"
echo
cat /tmp/fio-job.ini
echo
umount $MNT &> /dev/null
mkfs.btrfs -f $MKFS_OPTIONS $DEV
mount $MOUNT_OPTIONS $DEV $MNT
fio /tmp/fio-job.ini
umount $MNT
The test ran inside a VM (8 cores, 32G of RAM) with the target disk
mapping to a raw NVMe device, and using a non-debug kernel config
(Debian's default config).
Before the patchset:
WRITE: bw=116MiB/s (122MB/s), 116MiB/s-116MiB/s (122MB/s-122MB/s), io=22.7GiB (24.4GB), run=200013-200013msec
After the patchset:
WRITE: bw=125MiB/s (131MB/s), 125MiB/s-125MiB/s (131MB/s-131MB/s), io=24.3GiB (26.1GB), run=200007-200007msec
A 7.8% gain on throughput and +7.0% more IO done in the same period of
time (200 seconds).
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Thu, 3 Feb 2022 14:55:49 +0000 (14:55 +0000)]
btrfs: remove useless path release in the fast fsync path
There's no point in calling btrfs_release_path() after finishing the loop
that logs the modified extents, since log_one_extent() returns with the
path released. In case the list of extents is empty, the path is already
released, so there's no need for that case as well.
So just remove that unnecessary btrfs_release_path() call.
This change if part of a patchset that is comprised of the following
patches:
1/6 btrfs: remove unnecessary leaf free space checks when pushing items
2/6 btrfs: avoid unnecessary COW of leaves when deleting items from a leaf
3/6 btrfs: avoid unnecessary computation when deleting items from a leaf
4/6 btrfs: remove constraint on number of visited leaves when replacing extents
5/6 btrfs: remove useless path release in the fast fsync path
6/6 btrfs: prepare extents to be logged before locking a log tree path
The last patch in the series has some performance test result in its
changelog.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Thu, 3 Feb 2022 14:55:48 +0000 (14:55 +0000)]
btrfs: remove constraint on number of visited leaves when replacing extents
At btrfs_drop_extents(), we try to replace a range of file extent items
with a new file extent in a single btree search, to avoid the need to do
a search for deletion, followed by a path release and followed by yet
another search for insertion.
When I originally added that optimization, in commit
1acae57b161ef1
("Btrfs: faster file extent item replace operations"), I left a constraint
to do the fast replace only if we visited a single leaf. That was because
in the most common case we find all file extent items that need to be
deleted (or trimmed) in a single leaf, however it can work for other
common cases like when we need to delete a few file extent items located
at the end of a leaf and a few more located at the beginning of the next
leaf. The key for the new file extent item is greater than the key of
any deleted or trimmed file extent item from previous leaves, so we are
fine to use the last leaf that we found as long as we are holding a
write lock on it - even if the new key ends up at slot 0, as if that's
the case, the btree search has obtained a write lock on any upper nodes
that need to have a key pointer updated.
So removed the constraint that limits the optimization to the case where
we visited only a single leaf.
This change if part of a patchset that is comprised of the following
patches:
1/6 btrfs: remove unnecessary leaf free space checks when pushing items
2/6 btrfs: avoid unnecessary COW of leaves when deleting items from a leaf
3/6 btrfs: avoid unnecessary computation when deleting items from a leaf
4/6 btrfs: remove constraint on number of visited leaves when replacing extents
5/6 btrfs: remove useless path release in the fast fsync path
6/6 btrfs: prepare extents to be logged before locking a log tree path
The last patch in the series has some performance test result in its
changelog.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Thu, 3 Feb 2022 14:55:47 +0000 (14:55 +0000)]
btrfs: avoid unnecessary computation when deleting items from a leaf
When deleting items from a leaf, we always compute the sum of the data
sizes of the items that are going to be deleted. However we only use
that sum when the last item to delete is behind the last item in the
leaf. This unnecessarily wastes CPU time when we are deleting either
the whole leaf or from some slot > 0 up to the last item in the leaf,
and both of these cases are common (e.g. truncation operation, either
as a result of truncate(2) or when logging inodes, deleting checksums
after removing a large enough extent, etc).
So compute only the sum of the data sizes if the last item to be
deleted does not match the last item in the leaf.
This change if part of a patchset that is comprised of the following
patches:
1/6 btrfs: remove unnecessary leaf free space checks when pushing items
2/6 btrfs: avoid unnecessary COW of leaves when deleting items from a leaf
3/6 btrfs: avoid unnecessary computation when deleting items from a leaf
4/6 btrfs: remove constraint on number of visited leaves when replacing extents
5/6 btrfs: remove useless path release in the fast fsync path
6/6 btrfs: prepare extents to be logged before locking a log tree path
The last patch in the series has some performance test result in its
changelog.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Thu, 3 Feb 2022 14:55:46 +0000 (14:55 +0000)]
btrfs: avoid unnecessary COW of leaves when deleting items from a leaf
When we delete items from a leaf, if we end up with more than two thirds
of unused leaf space, we try to delete the leaf by moving all its items
into its left and right neighbour leaves. Sometimes that is not possible
because there is not enough free space in the left and right leaves, and
in that case we end up not deleting our leaf.
The way we are doing this is not ideal and can be improved in the
following ways:
1) When we call push_leaf_left(), we pass a value of 1 byte to the data
size parameter of push_leaf_left(). This is not realistic value because
no item can have a size less than 25 bytes, which is the size of struct
btrfs_item. This means that means that if the left leaf has not enough
free space to push any item, we end up COWing it even if we end up not
changing its content at all.
COWing that leaf means allocating a new metadata extent, marking it
dirty and doing more IO when committing a transaction or when syncing a
log tree. For a log tree case, it's particularly more important to
avoid the useless COW operation, as more IO can imply a higher latency
for an fsync operation.
So instead of passing 1 as the minimum data size for push_leaf_left(),
pass the size of the first item in our leaf, as we don't want to COW
the left leaf if we can't at least push the first item of our leaf;
2) When we call push_leaf_right(), we also pass a value of 1 byte as the
data size parameter of push_leaf_right(). Like the previous case, it
will also result in COWing the right leaf even if we are not able to
move any items into it, since there can't be any item with a size
smaller than 25 bytes (the size of struct btrfs_item).
So instead of passing 1 as the minimum data size to push_leaf_right(),
pass a size that corresponds to the sum of the size of all the
remaining items in our leaf. We are not interested in moving less than
that, because if we do, we are not able to delete our leaf and we have
COWed the right leaf for nothing. Plus, moving only some of the items
of our leaf, it means an even less balanced tree.
Just like the previous case, we want to avoid the useless COW of the
right leaf, this way we don't have to spend time allocating one new
metadata extent, and doing more IO when committing a transaction or
syncing a log tree. For the log tree case it's specially more important
because more IO can result in a higher latency for a fsync operation.
So adjust the minimum data size passed to push_leaf_left() and
push_leaf_right() as mentioned above.
This change if part of a patchset that is comprised of the following
patches:
1/6 btrfs: remove unnecessary leaf free space checks when pushing items
2/6 btrfs: avoid unnecessary COW of leaves when deleting items from a leaf
3/6 btrfs: avoid unnecessary computation when deleting items from a leaf
4/6 btrfs: remove constraint on number of visited leaves when replacing extents
5/6 btrfs: remove useless path release in the fast fsync path
6/6 btrfs: prepare extents to be logged before locking a log tree path
Not being able to delete a leaf that became less than 1/3 full after
deleting items from it is actually common. For example, for the fio test
mentioned in the changelog of patch 6/6, we are only able to delete a
leaf at btrfs_del_items() about 5.3% of the time, due to its left and
right neighbour leaves not having enough free space to push all the
remaining items into them.
The last patch in the series has some performance test result in its
changelog.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Thu, 3 Feb 2022 14:55:45 +0000 (14:55 +0000)]
btrfs: remove unnecessary leaf free space checks when pushing items
When trying to push items from a leaf into its left and right neighbours,
we lock the left or right leaf, check if it has the required minimum free
space, COW the leaf and then check again if it has the minimum required
free space. This second check is pointless:
1) Most and foremost because it's not needed. We have a write lock on the
leaf and on its parent node, so no one can come in and change either
the pre-COW or post-COW version of the leaf for the whole duration of
the push_leaf_left() and push_leaf_right() calls;
2) The call to btrfs_leaf_free_space() is not trivial, it has a fair
amount of arithmetic operations and access to fields in the leaf's
header and items, so it's not very cheap.
So remove the duplicated free space checks.
This change if part of a patchset that is comprised of the following
patches:
1/6 btrfs: remove unnecessary leaf free space checks when pushing items
2/6 btrfs: avoid unnecessary COW of leaves when deleting items from a leaf
3/6 btrfs: avoid unnecessary computation when deleting items from a leaf
4/6 btrfs: remove constraint on number of visited leaves when replacing extents
5/6 btrfs: remove useless path release in the fast fsync path
6/6 btrfs: prepare extents to be logged before locking a log tree path
The last patch in the series has some performance test result in its
changelog.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Johannes Thumshirn [Fri, 4 Feb 2022 12:06:27 +0000 (04:06 -0800)]
btrfs: stop checking for NULL return from btrfs_get_extent_fiemap()
In get_extent_skip_holes() we're checking the return of
btrfs_get_extent_fiemap() for an error pointer or NULL, but
btrfs_get_extent_fiemap() will never return NULL, only error pointers or
a valid extent_map.
The other caller of btrfs_get_extent_fiemap(), find_desired_extent(),
correctly only checks for error-pointers.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Pankaj Raghav [Fri, 4 Feb 2022 12:00:22 +0000 (13:00 +0100)]
btrfs: zoned: remove redundant assignment in btrfs_check_zoned_mode
Remove the redundant assignment to zone_info variable in
btrfs_check_zoned_mode function.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Tue, 1 Feb 2022 14:42:07 +0000 (15:42 +0100)]
btrfs: replace BUILD_BUG_ON by static_assert
The static_assert introduced in
6bab69c65013 ("build_bug.h: add wrapper
for _Static_assert") has been supported by compilers for a long time
(gcc 4.6, clang 3.0) and can be used in header files. We don't need to
put BUILD_BUG_ON to random functions but rather keep it next to the
definition.
The exception here is the UAPI header btrfs_tree.h that could be
potentially included by userspace code and the static assert is not
defined (nor used in any other header).
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Johannes Thumshirn [Wed, 26 Jan 2022 13:46:23 +0000 (05:46 -0800)]
btrfs: zoned: allow DUP on meta-data block groups
Allow creating or reading block-groups on a zoned device with DUP as a
meta-data profile.
This works because we're using the zoned_meta_io_lock and REQ_OP_WRITE
operations for meta-data on zoned btrfs, so all writes to meta-data zones
are aligned to the zone's write-pointer.
Upon loading of the block-group, it is ensured both zones do have the same
zone capacity and write-pointer offsets, so no extra machinery is needed
to keep the write-pointers in sync for the meta-data zones. If this
prerequisite is not met, loading of the block-group is refused.
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Johannes Thumshirn [Wed, 26 Jan 2022 13:46:22 +0000 (05:46 -0800)]
btrfs: zoned: prepare for allowing DUP on zoned
Allow for a block-group to be placed on more than one physical zone.
This is a preparation for allowing DUP profiles for meta-data on a zoned
file-system.
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Johannes Thumshirn [Wed, 26 Jan 2022 13:46:21 +0000 (05:46 -0800)]
btrfs: zoned: make zone finishing multi stripe capable
Currently finishing of a zone only works if the block group isn't
spanning more than one zone.
This limitation is purely artificial and can be easily expanded to block
groups being places across multiple zones.
This is a preparation for allowing DUP and later more complex block-group
profiles on zoned btrfs.
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Johannes Thumshirn [Wed, 26 Jan 2022 13:46:20 +0000 (05:46 -0800)]
btrfs: zoned: make zone activation multi stripe capable
Currently activation of a zone only works if the block group isn't
spanning more than one zone.
This limitation is purely artificial and can be easily expanded to block
groups being places across multiple zones.
This is a preparation for allowing DUP and later more complex block-group
profiles on zoned btrfs.
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Josef Bacik [Wed, 15 Dec 2021 20:40:08 +0000 (15:40 -0500)]
btrfs: add support for multiple global roots
With extent tree v2 you will be able to create multiple csum, extent,
and free space trees. They will be used based on the block group, which
will now use the block_group_item->chunk_objectid to point to the set of
global roots that it will use. When allocating new block groups we'll
simply mod the gigabyte offset of the block group against the number of
global roots we have and that will be the block groups global id.
>From there we can take the bytenr that we're modifying in the respective
tree, look up the block group and get that block groups corresponding
global root id. From there we can get to the appropriate global root
for that bytenr.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Josef Bacik [Wed, 15 Dec 2021 20:40:07 +0000 (15:40 -0500)]
btrfs: add code to support the block group root
This code adds the on disk structures for the block group root, which
will hold the block group items for extent tree v2.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Josef Bacik [Wed, 15 Dec 2021 20:40:06 +0000 (15:40 -0500)]
btrfs: abstract out loading the tree root
We're going to be adding more roots that need to be loaded from the
super block, so abstract out the code to read the tree_root from the
super block, and use this helper for the chunk root as well. This will
make it simpler to load the new trees in the future.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Josef Bacik [Wed, 15 Dec 2021 20:40:05 +0000 (15:40 -0500)]
btrfs: tree-checker: don't fail on empty extent roots for extent tree v2
For extent tree v2 we can definitely have empty extent roots, so skip
this particular check if we have that set.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Josef Bacik [Wed, 15 Dec 2021 20:40:04 +0000 (15:40 -0500)]
btrfs: disable space cache related mount options for extent tree v2
We cannot fall back on the slow caching for extent tree v2, which means
we can't just arbitrarily clear the free space trees at mount time.
Furthermore we can't do v1 space cache with extent tree v2. Simply
ignore these mount options for extent tree v2 as they aren't relevant.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Josef Bacik [Wed, 15 Dec 2021 20:40:03 +0000 (15:40 -0500)]
btrfs: disable snapshot creation/deletion for extent tree v2
When we stop tracking metadata blocks all of snapshotting will break, so
disable it until I add the snapshot root and drop tree support.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Josef Bacik [Wed, 15 Dec 2021 20:40:02 +0000 (15:40 -0500)]
btrfs: disable scrub for extent-tree-v2
Scrub depends on extent references for every block, and with extent tree
v2 we won't have that, so disable scrub until we can add back the proper
code to handle extent-tree-v2.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Josef Bacik [Wed, 15 Dec 2021 20:40:01 +0000 (15:40 -0500)]
btrfs: disable qgroups in extent tree v2
Backref lookups are going to be drastically different with extent tree
v2, disable qgroups until we do the work to add this support for extent
tree v2.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Josef Bacik [Wed, 15 Dec 2021 20:40:00 +0000 (15:40 -0500)]
btrfs: disable device manipulation ioctl's EXTENT_TREE_V2
Device add, remove, and replace all require balance, which doesn't work
right now on extent tree v2, so disable these for now.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Josef Bacik [Wed, 15 Dec 2021 20:39:59 +0000 (15:39 -0500)]
btrfs: disable balance for extent tree v2 for now
With global root id's it makes it problematic to do backref lookups for
balance. This isn't hard to deal with, but future changes are going to
make it impossible to lookup backrefs on any COWonly roots, so go ahead
and disable balance for now on extent tree v2 until we can add balance
support back in future patches.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Josef Bacik [Wed, 15 Dec 2021 20:39:58 +0000 (15:39 -0500)]
btrfs: add definition for EXTENT_TREE_V2
This adds the initial definition of the EXTENT_TREE_V2 incompat feature
flag. This also hides the support behind CONFIG_BTRFS_DEBUG.
THIS IS A IN DEVELOPMENT FORMAT CHANGE, DO NOT USE UNLESS YOU ARE A
DEVELOPER OR A TESTER.
The format is in flux and will be added in stages, any fs will need to
be re-made between updates to the format.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Thu, 20 Jan 2022 11:00:11 +0000 (11:00 +0000)]
btrfs: use single variable to track return value at btrfs_log_inode()
At btrfs_log_inode(), we have two variables to track errors and the
return value of the function, named 'ret' and 'err'. In some places we
use 'ret' and if gets a non-zero value we assign its value to 'err'
and then jump to the 'out' label, while in other places we use 'err'
directly without 'ret' as an intermediary. This is inconsistent, error
prone and not necessary. So change that to use only the 'ret' variable,
making this consistent with most functions in btrfs.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Thu, 20 Jan 2022 11:00:10 +0000 (11:00 +0000)]
btrfs: avoid inode logging during rename and link when possible
During a rename or link operation, we need to determine if an inode was
previously logged or not, and if it was, do some update to the logged
inode. We used to rely exclusively on the logged_trans field of struct
btrfs_inode to determine that, but that was not reliable because the
value of that field is not persisted in the inode item, so it's lost
when an inode is evicted and loaded back again. That led to several
issues in the past, such as not persisting deletions (such as the case
fixed by commit
803f0f64d17769 ("Btrfs: fix fsync not persisting dentry
deletions due to inode evictions")), or resulting in losing a file
after an inode eviction followed by a rename (commit
ecc64fab7d49c6
("btrfs: fix lost inode on log replay after mix of fsync, rename and
inode eviction")), besides other issues.
So the inode_logged() helper was introduced and used to determine if an
inode was possibly logged before in the current transaction, with the
caveat that it could return false positives, in the sense that even if an
inode was not logged before in the current transaction, it could still
return true, but never to return false in case the inode was logged.
>From a functional point of view that is fine, but from a performance
perspective it can introduce significant latencies to rename and link
operations, as they will end up doing inode logging even when it is not
necessary.
Recently on a 5.15 kernel, an openSUSE Tumbleweed user reported package
installations and upgrades, with the zypper tool, were often taking a
long time to complete. With strace it could be observed that zypper was
spending about 99% of its time on rename operations, and then with
further analysis we checked that directory logging was happening too
frequently. Taking into account that installation/upgrade of some of the
packages needed a few thousand file renames, the slowdown was very
noticeable for the user.
The issue was caused indirectly due to an excessive number of inode
evictions on a 5.15 kernel, about 100x more compared to a 5.13, 5.14 or
a 5.16-rc8 kernel. While triggering the inode evictions if something
outside btrfs' control, btrfs could still behave better by eliminating
the false positives from the inode_logged() helper.
So change inode_logged() to actually eliminate such false positives caused
by inode eviction and when an inode was never logged since the filesystem
was mounted, as both cases relate to when the logged_trans field of struct
btrfs_inode has a value of zero. When it can not determine if the inode
was logged based only on the logged_trans value, lookup for the existence
of the inode item in the log tree - if it's there then we known the inode
was logged, if it's not there then it can not have been logged in the
current transaction. Once we determine if the inode was logged, update
the logged_trans value to avoid future calls to have to search in the log
tree again.
Alternatively, we could start storing logged_trans in the on disk inode
item structure (struct btrfs_inode_item) in the unused space it still has,
but that would be a bit odd because:
1) We only care about logged_trans since the filesystem was mounted, we
don't care about its value from a previous mount. Having it persisted
in the inode item structure would not make the best use of the precious
unused space;
2) In order to get logged_trans persisted before inode eviction, we would
have to update the delayed inode when we finish logging the inode and
update its logged_trans in struct btrfs_inode, which makes it a bit
cumbersome since we need to check if the delayed inode exists, if not
create it and populate it and deal with any errors (-ENOMEM mostly).
This change is part of a patchset comprised of the following patches:
1/5 btrfs: add helper to delete a dir entry from a log tree
2/5 btrfs: pass the dentry to btrfs_log_new_name() instead of the inode
3/5 btrfs: avoid logging all directory changes during renames
4/5 btrfs: stop doing unnecessary log updates during a rename
5/5 btrfs: avoid inode logging during rename and link when possible
The following test script mimics part of what the zypper tool does during
package installations/upgrades. It does not triggers inode evictions, but
it's similar because it triggers false positives from the inode_logged()
helper, because the inodes have a logged_trans of 0, there's a log tree
due to a fsync of an unrelated file and the directory inode has its
last_trans field set to the current transaction:
$ cat test.sh
#!/bin/bash
DEV=/dev/nvme0n1
MNT=/mnt/nvme0n1
NUM_FILES=10000
mkfs.btrfs -f $DEV
mount $DEV $MNT
mkdir $MNT/testdir
for ((i = 1; i <= $NUM_FILES; i++)); do
echo -n > $MNT/testdir/file_$i
done
sync
# Now do some change to an unrelated file and fsync it.
# This is just to create a log tree to make sure that inode_logged()
# does not return false when called against "testdir".
xfs_io -f -c "pwrite 0 4K" -c "fsync" $MNT/foo
# Do some change to testdir. This is to make sure inode_logged()
# will return true when called against "testdir", because its
# logged_trans is 0, it was changed in the current transaction
# and there's a log tree.
echo -n > $MNT/testdir/file_$((NUM_FILES + 1))
echo "Renaming $NUM_FILES files..."
start=$(date +%s%N)
for ((i = 1; i <= $NUM_FILES; i++)); do
mv $MNT/testdir/file_$i $MNT/testdir/file_$i-RPMDELETE
done
end=$(date +%s%N)
dur=$(( (end - start) /
1000000 ))
echo "Renames took $dur milliseconds"
umount $MNT
Testing this change on a box using a non-debug kernel (Debian's default
kernel config) gave the following results:
NUM_FILES=10000, before patchset: 27837 ms
NUM_FILES=10000, after patches 1/5 to 4/5 applied: 9236 ms (-66.8%)
NUM_FILES=10000, after whole patchset applied: 8902 ms (-68.0%)
NUM_FILES=5000, before patchset: 9127 ms
NUM_FILES=5000, after patches 1/5 to 4/5 applied: 4640 ms (-49.2%)
NUM_FILES=5000, after whole patchset applied: 4441 ms (-51.3%)
NUM_FILES=2000, before patchset: 2528 ms
NUM_FILES=2000, after patches 1/5 to 4/5 applied: 1983 ms (-21.6%)
NUM_FILES=2000, after whole patchset applied: 1747 ms (-30.9%)
NUM_FILES=1000, before patchset: 1085 ms
NUM_FILES=1000, after patches 1/5 to 4/5 applied: 893 ms (-17.7%)
NUM_FILES=1000, after whole patchset applied: 867 ms (-20.1%)
Running dbench on the same physical machine with the following script:
$ cat run-dbench.sh
#!/bin/bash
NUM_JOBS=$(nproc --all)
DEV=/dev/nvme0n1
MNT=/mnt/nvme0n1
MOUNT_OPTIONS="-o ssd"
MKFS_OPTIONS="-O no-holes -R free-space-tree"
echo "performance" | \
tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
mkfs.btrfs -f $MKFS_OPTIONS $DEV
mount $MOUNT_OPTIONS $DEV $MNT
dbench -D $MNT -t 120 $NUM_JOBS
umount $MNT
Before patchset:
Operation Count AvgLat MaxLat
----------------------------------------
NTCreateX
3761352 0.032 143.843
Close
2762770 0.002 2.273
Rename 159304 0.291 67.037
Unlink 759784 0.207 143.998
Deltree 72 4.028 15.977
Mkdir 36 0.003 0.006
Qpathinfo
3409780 0.013 9.678
Qfileinfo 596772 0.001 0.878
Qfsinfo 625189 0.003 1.245
Sfileinfo 306443 0.006 1.840
Find
1318106 0.063 19.798
WriteX
1871137 0.021 8.532
ReadX
5897325 0.003 3.567
LockX 12252 0.003 0.258
UnlockX 12252 0.002 0.100
Flush 263666 3.327 155.632
Throughput 980.047 MB/sec 12 clients 12 procs max_latency=155.636 ms
After whole patchset applied:
Operation Count AvgLat MaxLat
----------------------------------------
NTCreateX
4195584 0.033 107.742
Close
3081932 0.002 1.935
Rename 177641 0.218 14.905
Unlink 847333 0.166 107.822
Deltree 118 5.315 15.247
Mkdir 59 0.004 0.048
Qpathinfo
3802612 0.014 10.302
Qfileinfo 666748 0.001 1.034
Qfsinfo 697329 0.003 0.944
Sfileinfo 341712 0.006 2.099
Find
1470365 0.065 9.359
WriteX
2093921 0.021 8.087
ReadX
6576234 0.003 3.407
LockX 13660 0.003 0.308
UnlockX 13660 0.002 0.114
Flush 294090 2.906 115.539
Throughput 1093.11 MB/sec 12 clients 12 procs max_latency=115.544 ms
+11.5% throughput -25.8% max latency rename max latency -77.8%
Link: https://bugzilla.opensuse.org/show_bug.cgi?id=1193549
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Thu, 20 Jan 2022 11:00:09 +0000 (11:00 +0000)]
btrfs: stop doing unnecessary log updates during a rename
During a rename, we call __btrfs_unlink_inode(), which will call
btrfs_del_inode_ref_in_log() and btrfs_del_dir_entries_in_log(), in order
to remove an inode reference and a directory entry from the log. These
are necessary when __btrfs_unlink_inode() is called from the unlink path,
but not necessary when it's called from a rename context, because:
1) For the btrfs_del_inode_ref_in_log() call, it's pointless to delete the
inode reference related to the old name, because later in the rename
path we call btrfs_log_new_name(), which will drop all inode references
from the log and copy all inode references from the subvolume tree to
the log tree. So we are doing one unnecessary btree operation which
adds additional latency and lock contention in case there are other
tasks accessing the log tree;
2) For the btrfs_del_dir_entries_in_log() call, we are now doing the
equivalent at btrfs_log_new_name() since the previous patch in the
series, that has the subject "btrfs: avoid logging all directory
changes during renames". In fact, having __btrfs_unlink_inode() call
this function not only adds additional latency and lock contention due
to the extra btree operation, but also can make btrfs_log_new_name()
unnecessarily log a range item to track the deletion of the old name,
since it has no way to known that the directory entry related to the
old name was previously logged and already deleted by
__btrfs_unlink_inode() through its call to
btrfs_del_dir_entries_in_log().
So skip those calls at __btrfs_unlink_inode() when we are doing a rename.
Skipping them also allows us now to reduce the duration of time we are
pinning a log transaction during renames, which is always beneficial as
it's not delaying so much other tasks trying to sync the log tree, in
particular we end up not holding the log transaction pinned while adding
the new name (adding inode ref, directory entry, etc).
This change is part of a patchset comprised of the following patches:
1/5 btrfs: add helper to delete a dir entry from a log tree
2/5 btrfs: pass the dentry to btrfs_log_new_name() instead of the inode
3/5 btrfs: avoid logging all directory changes during renames
4/5 btrfs: stop doing unnecessary log updates during a rename
5/5 btrfs: avoid inode logging during rename and link when possible
Just like the previous patch in the series, "btrfs: avoid logging all
directory changes during renames", the following script mimics part of
what a package installation/upgrade with zypper does, which is basically
renaming a lot of files, in some directory under /usr, to a name with a
suffix of "-RPMDELETE":
$ cat test.sh
#!/bin/bash
DEV=/dev/nvme0n1
MNT=/mnt/nvme0n1
NUM_FILES=10000
mkfs.btrfs -f $DEV
mount $DEV $MNT
mkdir $MNT/testdir
for ((i = 1; i <= $NUM_FILES; i++)); do
echo -n > $MNT/testdir/file_$i
done
sync
# Do some change to testdir and fsync it.
echo -n > $MNT/testdir/file_$((NUM_FILES + 1))
xfs_io -c "fsync" $MNT/testdir
echo "Renaming $NUM_FILES files..."
start=$(date +%s%N)
for ((i = 1; i <= $NUM_FILES; i++)); do
mv $MNT/testdir/file_$i $MNT/testdir/file_$i-RPMDELETE
done
end=$(date +%s%N)
dur=$(( (end - start) /
1000000 ))
echo "Renames took $dur milliseconds"
umount $MNT
Testing this change on box a using a non-debug kernel (Debian's default
kernel config) gave the following results:
NUM_FILES=10000, before patchset: 27399 ms
NUM_FILES=10000, after patches 1/5 to 3/5 applied: 9093 ms (-66.8%)
NUM_FILES=10000, after patches 1/5 to 4/5 applied: 9016 ms (-67.1%)
NUM_FILES=5000, before patchset: 9241 ms
NUM_FILES=5000, after patches 1/5 to 3/5 applied: 4642 ms (-49.8%)
NUM_FILES=5000, after patches 1/5 to 4/5 applied: 4553 ms (-50.7%)
NUM_FILES=2000, before patchset: 2550 ms
NUM_FILES=2000, after patches 1/5 to 3/5 applied: 1788 ms (-29.9%)
NUM_FILES=2000, after patches 1/5 to 4/5 applied: 1767 ms (-30.7%)
NUM_FILES=1000, before patchset: 1088 ms
NUM_FILES=1000, after patches 1/5 to 3/5 applied: 905 ms (-16.9%)
NUM_FILES=1000, after patches 1/5 to 4/5 applied: 883 ms (-18.8%)
The next patch in the series (5/5), also contains dbench results after
applying to whole patchset.
Link: https://bugzilla.opensuse.org/show_bug.cgi?id=1193549
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Thu, 20 Jan 2022 11:00:08 +0000 (11:00 +0000)]
btrfs: avoid logging all directory changes during renames
When doing a rename of a file, if the file or its old parent directory
were logged before, we log the new name of the file and then make sure
we log the old parent directory, to ensure that after a log replay the
old name of the file is deleted and the new name added.
The logging of the old parent directory can take some time, because it
will scan all leaves modified in the current transaction, check which
directory entries were already logged, copy the ones that were not
logged before, etc. In this rename context all we need to do is make
sure that the old name of the file is deleted on log replay, so instead
of triggering a directory log operation, we can just delete the old
directory entry from the log if it's there, or in case it isn't there,
just log a range item to signal log replay that the old name must be
deleted. So change btrfs_log_new_name() to do that.
This scenario is actually not uncommon to trigger, and recently on a
5.15 kernel, an openSUSE Tumbleweed user reported package installations
and upgrades, with the zypper tool, were often taking a long time to
complete, much more than usual. With strace it could be observed that
zypper was spending over 99% of its time on rename operations, and then
with further analysis we checked that directory logging was happening
too frequently and causing high latencies for the rename operations.
Taking into account that installation/upgrade of some of these packages
needed about a few thousand file renames, the slowdown was very noticeable
for the user.
The issue was caused indirectly due to an excessive number of inode
evictions on a 5.15 kernel, about 100x more compared to a 5.13, 5.14
or a 5.16-rc8 kernel. After an inode eviction we can't tell for sure,
in an efficient way, if an inode was previously logged in the current
transaction, so we are pessimistic and assume it was, because in case
it was we need to update the logged inode. More details on that in one
of the patches in the same series (subject "btrfs: avoid inode logging
during rename and link when possible"). Either way, in case the parent
directory was logged before, we currently do more work then necessary
during a rename, and this change minimizes that amount of work.
The following script mimics part of what a package installation/upgrade
with zypper does, which is basically renaming a lot of files, in some
directory under /usr, to a name with a suffix of "-RPMDELETE":
$ cat test.sh
#!/bin/bash
DEV=/dev/nvme0n1
MNT=/mnt/nvme0n1
NUM_FILES=10000
mkfs.btrfs -f $DEV
mount $DEV $MNT
mkdir $MNT/testdir
for ((i = 1; i <= $NUM_FILES; i++)); do
echo -n > $MNT/testdir/file_$i
done
sync
# Do some change to testdir and fsync it.
echo -n > $MNT/testdir/file_$((NUM_FILES + 1))
xfs_io -c "fsync" $MNT/testdir
echo "Renaming $NUM_FILES files..."
start=$(date +%s%N)
for ((i = 1; i <= $NUM_FILES; i++)); do
mv $MNT/testdir/file_$i $MNT/testdir/file_$i-RPMDELETE
done
end=$(date +%s%N)
dur=$(( (end - start) /
1000000 ))
echo "Renames took $dur milliseconds"
umount $MNT
Testing this change on box using a non-debug kernel (Debian's default
kernel config) gave the following results:
NUM_FILES=10000, before this patch: 27399 ms
NUM_FILES=10000, after this patch: 9093 ms (-66.8%)
NUM_FILES=5000, before this patch: 9241 ms
NUM_FILES=5000, after this patch: 4642 ms (-49.8%)
NUM_FILES=2000, before this patch: 2550 ms
NUM_FILES=2000, after this patch: 1788 ms (-29.9%)
NUM_FILES=1000, before this patch: 1088 ms
NUM_FILES=1000, after this patch: 905 ms (-16.9%)
Link: https://bugzilla.opensuse.org/show_bug.cgi?id=1193549
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Thu, 20 Jan 2022 11:00:07 +0000 (11:00 +0000)]
btrfs: pass the dentry to btrfs_log_new_name() instead of the inode
In the next patch in the series, there will be the need to access the old
name, and its length, of an inode when logging the inode during a rename.
So instead of passing the inode to btrfs_log_new_name() pass the dentry,
because from the dentry we can get the inode, the name and its length.
This will avoid passing 3 new parameters to btrfs_log_new_name() in the
next patch - the name, its length and an index number. This way we end
up passing only 1 new parameter, the index number.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Thu, 20 Jan 2022 11:00:06 +0000 (11:00 +0000)]
btrfs: add helper to delete a dir entry from a log tree
Move the code that finds and deletes a logged dir entry out of
btrfs_del_dir_entries_in_log() into a helper function. This new helper
function will be used by another patch in the same series, and serves
to avoid having duplicated logic.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Minghao Chi [Tue, 11 Jan 2022 01:57:16 +0000 (01:57 +0000)]
btrfs: send: remove redundant ret variable in fs_path_copy
Return value from fs_path_add_path() directly instead of taking this in
another redundant variable.
Reported-by: Zeal Robot <zealci@zte.com.cn>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Minghao Chi <chi.minghao@zte.com.cn>
Signed-off-by: CGEL ZTE <cgel.zte@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Nikolay Borisov [Thu, 13 Jan 2022 15:16:18 +0000 (17:16 +0200)]
btrfs: move QUOTA_ENABLED check to rescan_should_stop from btrfs_qgroup_rescan_worker
Instead of having 2 places that short circuit the qgroup leaf scan have
everything in the qgroup_rescan_leaf function. In addition to that, also
ensure that the inconsistent qgroup flag is set when rescan_should_stop
returns true. This both retains the old behavior when -EINTR was set in
the body of the loop and at the same time also extends this behavior
when scanning is interrupted due to remount or unmount operations.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Jiapeng Chong [Fri, 21 Jan 2022 11:42:24 +0000 (19:42 +0800)]
btrfs: scrub: remove redundant initialization of increment
increment is being initialized to map->stripe_len but this is never
read as increment is overwritten later on. Remove the redundant
initialization.
Cleans up the following clang-analyzer warning:
fs/btrfs/scrub.c:3193:6: warning: Value stored to 'increment' during its
initialization is never read [clang-analyzer-deadcode.DeadStores].
Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Jiapeng Chong [Fri, 21 Jan 2022 11:43:51 +0000 (19:43 +0800)]
btrfs: zoned: remove redundant initialization of to_add
to_add is being initialized to len but this is never read as to_add is
overwritten later on. Remove the redundant initialization.
Cleans up the following clang-analyzer warning:
fs/btrfs/extent-tree.c:2769:8: warning: Value stored to 'to_add' during
its initialization is never read [clang-analyzer-deadcode.DeadStores].
Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Anand Jain [Thu, 13 Jan 2022 04:44:10 +0000 (12:44 +0800)]
btrfs: cleanup temporary variables when finding rotational device status
The pointer to struct request_queue is used only to get device type
rotating or the non-rotating. So use it directly.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Anand Jain [Wed, 12 Jan 2022 05:06:02 +0000 (13:06 +0800)]
btrfs: use dev_t to match device in device_matched
Commit "btrfs: add device major-minor info in the struct btrfs_device"
saved the device major-minor number in the struct btrfs_device upon
discovering it.
So no need to lookup_bdev() again just match, which means
device_matched() can go away.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Anand Jain [Wed, 12 Jan 2022 05:06:01 +0000 (13:06 +0800)]
btrfs: add device major-minor info in the struct btrfs_device
Internally it is common to use the major-minor number to identify a
device and, at a few locations in btrfs, we use the major-minor number
to match the device.
So when we identify a new btrfs device through device add or device
replace or device-scan/ready save the device's major-minor (dev_t) in the
struct btrfs_device so that we don't have to call lookup_bdev() again.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Anand Jain [Wed, 12 Jan 2022 05:06:00 +0000 (13:06 +0800)]
btrfs: match stale devices by dev_t
After the commit "btrfs: harden identification of the stale device", we
don't have to match the device path anymore. Instead, we match the dev_t.
So pass in the dev_t instead of the device path, in the call chain
btrfs_forget_devices()->btrfs_free_stale_devices().
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Anand Jain [Wed, 12 Jan 2022 05:05:59 +0000 (13:05 +0800)]
btrfs: harden identification of a stale device
Identifying and removing the stale device from the fs_uuids list is done
by btrfs_free_stale_devices(). btrfs_free_stale_devices() in turn
depends on device_path_matched() to check if the device appears in more
than one btrfs_device structure.
The matching of the device happens by its path, the device path. However,
when device mapper is in use, the dm device paths are nothing but a link
to the actual block device, which leads to the device_path_matched()
failing to match.
Fix this by matching the dev_t as provided by lookup_bdev() instead of
plain string compare of the device paths.
Reported-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Anand Jain [Mon, 17 Jan 2022 15:50:39 +0000 (23:50 +0800)]
btrfs: simplify fs_devices member access in btrfs_init_dev_replace_tgtdev
In btrfs_init_dev_replace_tgtdev() we dereference fs_info to get
fs_devices many times, instead save a point to the fs_devices.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Sahil Kang [Sun, 16 Jan 2022 02:48:47 +0000 (18:48 -0800)]
btrfs: reuse existing inode from btrfs_ioctl
btrfs_ioctl extracts inode from file so we can pass that into the
callbacks.
Signed-off-by: Sahil Kang <sahil.kang@asilaycomputing.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Nikolay Borisov [Tue, 11 Jan 2022 16:00:26 +0000 (18:00 +0200)]
btrfs: move missing device handling in a dedicate function
This simplifies the code flow in read_one_chunk and makes error handling
when handling missing devices a bit simpler by reducing it to a single
check if something went wrong. No functional changes.
Reviewed-by: Su Yue <l@damenly.su>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Wed, 15 Dec 2021 12:20:01 +0000 (12:20 +0000)]
btrfs: stop trying to log subdirectories created in past transactions
When logging a directory we are trying to log subdirectories that were
changed in the current transaction and created in a past transaction.
This type of behaviour was introduced by commit
2f2ff0ee5e4303 ("Btrfs:
fix metadata inconsistencies after directory fsync"), to fix some metadata
inconsistencies that in the meanwhile no longer need this behaviour due to
numerous other changes that happened throughout the years.
This behaviour, besides not needed anymore, it's also undesirable because:
1) It's not reliable because it's only triggered for the directories
of dentries (dir items) that happen to be present on a leaf that
was changed in the current transaction. If a dentry that points to
a directory resides on a leaf that was not changed in the current
transaction, then it's not logged, as at log_dir_items() and
log_new_dir_dentries() we use btrfs_search_forward();
2) It's not required by posix or any standard, it's undefined territory.
The only way to guarantee a subdirectory is logged, it to explicitly
fsync it;
Making the behaviour guaranteed would require scanning all directory
items, check which point to a directory, and then fsync each subdirectory
which was modified in the current transaction. This could be very
expensive for large directories with many subdirectories and/or large
subdirectories.
So remove that obsolete logic.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Wed, 15 Dec 2021 12:20:00 +0000 (12:20 +0000)]
btrfs: stop copying old dir items when logging a directory
When logging a directory, we go over every leaf of the subvolume tree that
was changed in the current transaction and copy all its dir index keys to
the log tree.
That includes copying dir index keys created in past transactions. This is
done mostly for simplicity, as after logging the keys we log an item that
specifies the start and end ranges of the keys we logged. That item is
then used during log replay to figure out which keys need to be deleted -
every key in that range that we find in the subvolume tree and is not in
the log tree, needs to be deleted.
Now that we log only dir index keys, and not dir item keys anymore, when
we remove dentries from a directory (due to unlink and rename operations),
we can get entire leaves that we changed only for deleting old dir index
keys, or that have few dir index keys that are new - this is due to the
fact that the offset for new index keys comes from a monotonically
increasing counter.
We can avoid logging dir index keys from past transactions, and in order
to track the deletions, only log range items (BTRFS_DIR_LOG_INDEX_KEY key
type) when we find gaps between consecutive index keys. This massively
reduces the amount of logged metadata when we have deleted directory
entries, even if it's a small percentage of the total number of entries.
The reduction comes from both less items that are logged and instead of
logging many dir index items (struct btrfs_dir_item), which have a size
of 30 bytes plus a file name, we typically log just a few range items
(struct btrfs_dir_log_item), which take only 8 bytes each.
Even if no entries were deleted from a directory and only new entries
were added, we typically still get a reduction on the amount of logged
metadata, because it's very likely the first leaf that got the new
dir index entries also has several old dir index entries.
So change the logging logic to not log dir index keys created in past
transactions and log a range item for every gap it finds between each
pair of consecutive index keys, to ensure deletions are tracked and
replayed on log replay.
This patch is part of a patchset comprised of the following patches:
1/4 btrfs: don't log unnecessary boundary keys when logging directory
2/4 btrfs: put initial index value of a directory in a constant
3/4 btrfs: stop copying old dir items when logging a directory
4/4 btrfs: stop trying to log subdirectories created in past transactions
The following test was run on a branch without this patchset and on a
branch with the first three patches applied:
$ cat test.sh
#!/bin/bash
DEV=/dev/nvme0n1
MNT=/mnt/nvme0n1
NUM_FILES=
1000000
NUM_FILE_DELETES=10000
MKFS_OPTIONS="-O no-holes -R free-space-tree"
MOUNT_OPTIONS="-o ssd"
mkfs.btrfs -f $MKFS_OPTIONS $DEV
mount $MOUNT_OPTIONS $DEV $MNT
mkdir $MNT/testdir
for ((i = 1; i <= $NUM_FILES; i++)); do
echo -n > $MNT/testdir/file_$i
done
sync
del_inc=$(( $NUM_FILES / $NUM_FILE_DELETES ))
for ((i = 1; i <= $NUM_FILES; i += $del_inc)); do
rm -f $MNT/testdir/file_$i
done
start=$(date +%s%N)
xfs_io -c "fsync" $MNT/testdir
end=$(date +%s%N)
dur=$(( (end - start) /
1000000 ))
echo "dir fsync took $dur ms after deleting $NUM_FILE_DELETES files"
echo
umount $MNT
The test was run on a non-debug kernel (Debian's default kernel config),
and the results were the following for various values of NUM_FILES and
NUM_FILE_DELETES:
** before, NUM_FILES = 1 000 000, NUM_FILE_DELETES = 10 000 **
dir fsync took 585 ms after deleting 10000 files
** after, NUM_FILES = 1 000 000, NUM_FILE_DELETES = 10 000 **
dir fsync took 34 ms after deleting 10000 files (-94.2%)
** before, NUM_FILES = 100 000, NUM_FILE_DELETES = 1 000 **
dir fsync took 50 ms after deleting 1000 files
** after, NUM_FILES = 100 000, NUM_FILE_DELETES = 1 000 **
dir fsync took 7 ms after deleting 1000 files (-86.0%)
** before, NUM_FILES = 10 000, NUM_FILE_DELETES = 100 **
dir fsync took 9 ms after deleting 100 files
** after, NUM_FILES = 10 000, NUM_FILE_DELETES = 100 **
dir fsync took 5 ms after deleting 100 files (-44.4%)
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Wed, 15 Dec 2021 12:19:59 +0000 (12:19 +0000)]
btrfs: put initial index value of a directory in a constant
At btrfs_set_inode_index_count() we refer twice to the number 2 as the
initial index value for a directory (when it's empty), with a proper
comment explaining the reason for that value. In the next patch I'll
have to use that magic value in the directory logging code, so put
the value in a #define at btrfs_inode.h, to avoid hardcoding the
magic value again at tree-log.c.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Wed, 15 Dec 2021 12:19:58 +0000 (12:19 +0000)]
btrfs: don't log unnecessary boundary keys when logging directory
Before we start to log dir index keys from a leaf, we check if there is a
previous index key, which normally is at the end of a leaf that was not
changed in the current transaction. Then we log that key and set the start
of logged range (item of type BTRFS_DIR_LOG_INDEX_KEY) to the offset of
that key. This is to ensure that if there were deleted index keys between
that key and the first key we are going to log, those deletions are
replayed in case we need to replay to the log after a power failure.
However we really don't need to log that previous key, we can just set the
start of the logged range to that key's offset plus 1. This achieves the
same and avoids logging one dir index key.
The same logic is performed when we finish logging the index keys of a
leaf and we find that the next leaf has index keys and was not changed in
the current transaction. We are logging the first key of that next leaf
and use its offset as the end of range we log. This is just to ensure that
if there were deleted index keys between the last index key we logged and
the first key of that next leaf, those index keys are deleted if we end
up replaying the log. However that is not necessary, we can avoid logging
that first index key of the next leaf and instead set the end of the
logged range to match the offset of that index key minus 1.
So avoid logging those index keys at the boundaries and adjust the start
and end offsets of the logged ranges as described above.
This patch is part of a patchset comprised of the following patches:
1/4 btrfs: don't log unnecessary boundary keys when logging directory
2/4 btrfs: put initial index value of a directory in a constant
3/4 btrfs: stop copying old dir items when logging a directory
4/4 btrfs: stop trying to log subdirectories created in past transactions
Performance test results are listed in the changelog of patch 3/4.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Sahil Kang [Wed, 5 Jan 2022 08:30:06 +0000 (00:30 -0800)]
btrfs: reuse existing pointers from btrfs_ioctl
btrfs_ioctl already contains pointers to the inode and btrfs_root
structs, so we can pass them into the subfunctions instead of the
toplevel struct file.
Signed-off-by: Sahil Kang <sahil.kang@asilaycomputing.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Tue, 4 Jan 2022 12:53:41 +0000 (12:53 +0000)]
btrfs: remove write and wait of struct walk_control
The ->write and ->wait fields of struct walk_control, used for log trees,
are not used since 2008, more specifically since commit
d0c803c4049c5c
("Btrfs: Record dirty pages tree-log pages in an extent_io tree") and
since commit
d0c803c4049c5c ("Btrfs: Record dirty pages tree-log pages in
an extent_io tree"). So just remove them, along with the function
btrfs_write_tree_block(), which is also not used anymore after removing
the ->write member.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Linus Torvalds [Sun, 27 Feb 2022 22:36:33 +0000 (14:36 -0800)]
Linux 5.17-rc6
Linus Torvalds [Sun, 27 Feb 2022 21:07:40 +0000 (13:07 -0800)]
Merge tag 'irq-urgent-2022-02-27' of git://git./linux/kernel/git/tip/tip
Pull irq fix from Thomas Gleixner:
"A single fix for a regression caused by the recent PCI/MSI rework
which resulted in a recursive locking problem in the VMD driver.
The cure is to cache the relevant information upfront instead of
retrieving it at runtime"
* tag 'irq-urgent-2022-02-27' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
PCI: vmd: Prevent recursive locking on interrupt allocation
Linus Torvalds [Sun, 27 Feb 2022 20:42:37 +0000 (12:42 -0800)]
Merge tag 'dma-mapping-5.17-1' of git://git.infradead.org/users/hch/dma-mapping
Pull dma-mapping fix from Christoph Hellwig:
- fix a swiotlb info leak (Halil Pasic)
* tag 'dma-mapping-5.17-1' of git://git.infradead.org/users/hch/dma-mapping:
swiotlb: fix info leak with DMA_FROM_DEVICE
Linus Torvalds [Sun, 27 Feb 2022 20:30:54 +0000 (12:30 -0800)]
Merge tag 'pinctrl-v5-17-3' of git://git./linux/kernel/git/linusw/linux-pinctrl
Pull pin control fixes from Linus Walleij:
- Fix some drive strength and pull-up code in the K210 driver.
- Add the Alder Lake-M ACPI ID so it starts to work properly.
- Use a static name for the StarFive GPIO irq_chip, forestalling an
upcoming fixes series from Marc Zyngier.
- Fix an ages old bug in the Tegra 186 driver where we were indexing at
random into struct and being lucky getting the right member.
* tag 'pinctrl-v5-17-3' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl:
gpio: tegra186: Fix chip_data type confusion
pinctrl: starfive: Use a static name for the GPIO irq_chip
pinctrl: tigerlake: Revert "Add Alder Lake-M ACPI ID"
pinctrl: k210: Fix bias-pull-up
pinctrl: fix loop in k210_pinconf_get_drive()
Linus Torvalds [Sat, 26 Feb 2022 20:10:17 +0000 (12:10 -0800)]
Merge tag 'trace-v5.17-rc4' of git://git./linux/kernel/git/rostedt/linux-trace
Pull tracing fixes from Steven Rostedt:
- rtla (Real-Time Linux Analysis tool):
- fix typo in man page
- Update API -e to -E before it is released
- Error message fix and memory leak fix
- Partially uninline trace event soft disable to shrink text
- Fix function graph start up test
- Have triggers affect the trace instance they are in and not top level
- Have osnoise sleep in the units it says it uses
- Remove unused ftrace stub function
- Remove event probe redundant info from event in the buffer
- Fix group ownership setting in tracefs
- Ensure trace buffer is minimum size to prevent crashes
* tag 'trace-v5.17-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
rtla/osnoise: Fix error message when failing to enable trace instance
rtla/osnoise: Free params at the exit
rtla/hist: Make -E the short version of --entries
tracing: Fix selftest config check for function graph start up test
tracefs: Set the group ownership in apply_options() not parse_options()
tracing/osnoise: Make osnoise_main to sleep for microseconds
ftrace: Remove unused ftrace_startup_enable() stub
tracing: Ensure trace buffer is at least 4096 bytes large
tracing: Uninline trace_trigger_soft_disabled() partly
eprobes: Remove redundant event type information
tracing: Have traceon and traceoff trigger honor the instance
tracing: Dump stacktrace trigger to the corresponding instance
rtla: Fix systme -> system typo on man page
Linus Torvalds [Sat, 26 Feb 2022 20:00:44 +0000 (12:00 -0800)]
Merge tag 'fixes-2022-02-26' of git://git./linux/kernel/git/rppt/memblock
Pull memblock fix from Mike Rapoport:
"Use kfree() to release kmalloced memblock regions
memblock.{reserved,memory}.regions may be allocated using kmalloc()
in memblock_double_array(). Use kfree() to release these kmalloced
regions"
* tag 'fixes-2022-02-26' of git://git.kernel.org/pub/scm/linux/kernel/git/rppt/memblock:
memblock: use kfree() to release kmalloced memblock regions
Linus Torvalds [Sat, 26 Feb 2022 19:52:14 +0000 (11:52 -0800)]
Merge branch 'akpm' (patches from Andrew)
Merge misc fixes from Andrew Morton:
"12 patches.
Subsystems affected by this patch series: MAINTAINERS, mailmap, memfd,
and mm (hugetlb, kasan, hugetlbfs, pagemap, selftests, memcg, and
slab)"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
selftests/memfd: clean up mapping in mfd_fail_write
mailmap: update Roman Gushchin's email
MAINTAINERS, SLAB: add Roman as reviewer, git tree
MAINTAINERS: add Shakeel as a memcg co-maintainer
MAINTAINERS: remove Vladimir from memcg maintainers
MAINTAINERS: add Roman as a memcg co-maintainer
selftest/vm: fix map_fixed_noreplace test failure
mm: fix use-after-free bug when mm->mmap is reused after being freed
hugetlbfs: fix a truncation issue in hugepages parameter
kasan: test: prevent cache merging in kmem_cache_double_destroy
mm/hugetlb: fix kernel crash with hugetlb mremap
MAINTAINERS: add sysctl-next git tree
Linus Torvalds [Sat, 26 Feb 2022 18:26:24 +0000 (10:26 -0800)]
Merge tag 'riscv-for-linus-5.17-rc6' of git://git./linux/kernel/git/riscv/linux
Pull RISC-V fixes from Palmer Dabbelt:
- A fix for the K210 sdcard defconfig, to avoid using a
fixed delay for the root FS
- A fix to make sure there's a proper call frame for
trace_hardirqs_{on,off}().
* tag 'riscv-for-linus-5.17-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux:
riscv: fix oops caused by irqsoff latency tracer
riscv: fix nommu_k210_sdcard_defconfig
Linus Torvalds [Sat, 26 Feb 2022 17:53:19 +0000 (09:53 -0800)]
Merge tag 'xfs-5.17-fixes-2' of git://git./fs/xfs/xfs-linux
Pull xfs fixes from Darrick Wong:
"Nothing exciting, just more fixes for not returning sync_filesystem
error values (and eliding it when it's not necessary).
Summary:
- Only call sync_filesystem when we're remounting the filesystem
readonly readonly, and actually check its return value"
* tag 'xfs-5.17-fixes-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
xfs: only bother with sync_filesystem during readonly remount
Mike Kravetz [Sat, 26 Feb 2022 03:11:26 +0000 (19:11 -0800)]
selftests/memfd: clean up mapping in mfd_fail_write
Running the memfd script ./run_hugetlbfs_test.sh will often end in error
as follows:
memfd-hugetlb: CREATE
memfd-hugetlb: BASIC
memfd-hugetlb: SEAL-WRITE
memfd-hugetlb: SEAL-FUTURE-WRITE
memfd-hugetlb: SEAL-SHRINK
fallocate(ALLOC) failed: No space left on device
./run_hugetlbfs_test.sh: line 60: 166855 Aborted (core dumped) ./memfd_test hugetlbfs
opening: ./mnt/memfd
fuse: DONE
If no hugetlb pages have been preallocated, run_hugetlbfs_test.sh will
allocate 'just enough' pages to run the test. In the SEAL-FUTURE-WRITE
test the mfd_fail_write routine maps the file, but does not unmap. As a
result, two hugetlb pages remain reserved for the mapping. When the
fallocate call in the SEAL-SHRINK test attempts allocate all hugetlb
pages, it is short by the two reserved pages.
Fix by making sure to unmap in mfd_fail_write.
Link: https://lkml.kernel.org/r/20220219004340.56478-1-mike.kravetz@oracle.com
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Roman Gushchin [Sat, 26 Feb 2022 03:11:23 +0000 (19:11 -0800)]
mailmap: update Roman Gushchin's email
I'm moving to a @linux.dev account. Map my old addresses.
Link: https://lkml.kernel.org/r/20220221200006.416377-1-roman.gushchin@linux.dev
Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Vlastimil Babka [Sat, 26 Feb 2022 03:11:20 +0000 (19:11 -0800)]
MAINTAINERS, SLAB: add Roman as reviewer, git tree
The slab code has an overlap with kmem accounting, where Roman has done
a lot of work recently and it would be useful to make sure he's CC'd on
patches that potentially affect it. Thus add him as a reviewer for the
SLAB subsystem.
Also while at it, add the link to slab git tree.
Link: https://lkml.kernel.org/r/20220222103104.13241-1-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: David Rientjes <rientjes@google.com>
Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Shakeel Butt [Sat, 26 Feb 2022 03:11:17 +0000 (19:11 -0800)]
MAINTAINERS: add Shakeel as a memcg co-maintainer
I have been contributing and reviewing to the memcg codebase for last
couple of years. So, making it official.
Link: https://lkml.kernel.org/r/20220224060148.4092228-1-shakeelb@google.com
Signed-off-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Vladimir Davydov [Sat, 26 Feb 2022 03:11:14 +0000 (19:11 -0800)]
MAINTAINERS: remove Vladimir from memcg maintainers
Link: https://lkml.kernel.org/r/4ad1f8da49d7b71c84a0c15bd5347f5ce704e730.1645608825.git.vdavydov.dev@gmail.com
Signed-off-by: Vladimir Davydov <vdavydov.dev@gmail.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Roman Gushchin [Sat, 26 Feb 2022 03:11:11 +0000 (19:11 -0800)]
MAINTAINERS: add Roman as a memcg co-maintainer
Add myself as a memcg co-maintainer. My primary focus over last few
years was the kernel memory accounting stack, but I do work on some
other parts of the memory controller as well.
Link: https://lkml.kernel.org/r/20220221233951.659048-1-roman.gushchin@linux.dev
Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Aneesh Kumar K.V [Sat, 26 Feb 2022 03:11:08 +0000 (19:11 -0800)]
selftest/vm: fix map_fixed_noreplace test failure
On the latest RHEL the test fails due to executable mapped at 256MB
address
# ./map_fixed_noreplace
mmap() @ 0x10000000-0x10050000 p=0xffffffffffffffff result=File exists
10000000-
10010000 r-xp
00000000 fd:04
34905657 /root/rpmbuild/BUILD/kernel-5.14.0-56.el9/linux-5.14.0-56.el9.ppc64le/tools/testing/selftests/vm/map_fixed_noreplace
10010000-
10020000 r--p
00000000 fd:04
34905657 /root/rpmbuild/BUILD/kernel-5.14.0-56.el9/linux-5.14.0-56.el9.ppc64le/tools/testing/selftests/vm/map_fixed_noreplace
10020000-
10030000 rw-p
00010000 fd:04
34905657 /root/rpmbuild/BUILD/kernel-5.14.0-56.el9/linux-5.14.0-56.el9.ppc64le/tools/testing/selftests/vm/map_fixed_noreplace
10029b90000-
10029bc0000 rw-p
00000000 00:00 0 [heap]
7fffbb510000-
7fffbb750000 r-xp
00000000 fd:04 24534 /usr/lib64/libc.so.6
7fffbb750000-
7fffbb760000 r--p
00230000 fd:04 24534 /usr/lib64/libc.so.6
7fffbb760000-
7fffbb770000 rw-p
00240000 fd:04 24534 /usr/lib64/libc.so.6
7fffbb780000-
7fffbb7a0000 r--p
00000000 00:00 0 [vvar]
7fffbb7a0000-
7fffbb7b0000 r-xp
00000000 00:00 0 [vdso]
7fffbb7b0000-
7fffbb800000 r-xp
00000000 fd:04 24514 /usr/lib64/ld64.so.2
7fffbb800000-
7fffbb810000 r--p
00040000 fd:04 24514 /usr/lib64/ld64.so.2
7fffbb810000-
7fffbb820000 rw-p
00050000 fd:04 24514 /usr/lib64/ld64.so.2
7fffd93f0000-
7fffd9420000 rw-p
00000000 00:00 0 [stack]
Error: couldn't map the space we need for the test
Fix this by finding a free address using mmap instead of hardcoding
BASE_ADDRESS.
Link: https://lkml.kernel.org/r/20220217083417.373823-1-aneesh.kumar@linux.ibm.com
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Jann Horn <jannh@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Suren Baghdasaryan [Sat, 26 Feb 2022 03:11:05 +0000 (19:11 -0800)]
mm: fix use-after-free bug when mm->mmap is reused after being freed
oom reaping (__oom_reap_task_mm) relies on a 2 way synchronization with
exit_mmap. First it relies on the mmap_lock to exclude from unlock
path[1], page tables tear down (free_pgtables) and vma destruction.
This alone is not sufficient because mm->mmap is never reset.
For historical reasons[2] the lock is taken there is also MMF_OOM_SKIP
set for oom victims before.
The oom reaper only ever looks at oom victims so the whole scheme works
properly but process_mrelease can opearate on any task (with fatal
signals pending) which doesn't really imply oom victims. That means
that the MMF_OOM_SKIP part of the synchronization doesn't work and it
can see a task after the whole address space has been demolished and
traverse an already released mm->mmap list. This leads to use after
free as properly caught up by KASAN report.
Fix the issue by reseting mm->mmap so that MMF_OOM_SKIP synchronization
is not needed anymore. The MMF_OOM_SKIP is not removed from exit_mmap
yet but it acts mostly as an optimization now.
[1]
27ae357fa82b ("mm, oom: fix concurrent munlock and oom reaper unmap, v3")
[2]
212925802454 ("mm: oom: let oom_reap_task and exit_mmap run concurrently")
[mhocko@suse.com: changelog rewrite]
Link: https://lore.kernel.org/all/00000000000072ef2c05d7f81950@google.com/
Link: https://lkml.kernel.org/r/20220215201922.1908156-1-surenb@google.com
Fixes:
64591e8605d6 ("mm: protect free_pgtables with mmap_lock write lock in exit_mmap")
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Reported-by: syzbot+2ccf63a4bd07cf39cab0@syzkaller.appspotmail.com
Suggested-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Rik van Riel <riel@surriel.com>
Reviewed-by: Yang Shi <shy828301@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Rik van Riel <riel@surriel.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Jann Horn <jannh@google.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Cc: Florian Weimer <fweimer@redhat.com>
Cc: Jan Engelhardt <jengelh@inai.de>
Cc: Tim Murray <timmurray@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Liu Yuntao [Sat, 26 Feb 2022 03:11:02 +0000 (19:11 -0800)]
hugetlbfs: fix a truncation issue in hugepages parameter
When we specify a large number for node in hugepages parameter, it may
be parsed to another number due to truncation in this statement:
node = tmp;
For example, add following parameter in command line:
hugepagesz=1G hugepages=
4294967297:5
and kernel will allocate 5 hugepages for node 1 instead of ignoring it.
I move the validation check earlier to fix this issue, and slightly
simplifies the condition here.
Link: https://lkml.kernel.org/r/20220209134018.8242-1-liuyuntao10@huawei.com
Fixes:
b5389086ad7be0 ("hugetlbfs: extend the definition of hugepages parameter to support node allocation")
Signed-off-by: Liu Yuntao <liuyuntao10@huawei.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Andrey Konovalov [Sat, 26 Feb 2022 03:10:59 +0000 (19:10 -0800)]
kasan: test: prevent cache merging in kmem_cache_double_destroy
With HW_TAGS KASAN and kasan.stacktrace=off, the cache created in the
kmem_cache_double_destroy() test might get merged with an existing one.
Thus, the first kmem_cache_destroy() call won't actually destroy it but
will only decrease the refcount. This causes the test to fail.
Provide an empty constructor for the created cache to prevent the cache
from getting merged.
Link: https://lkml.kernel.org/r/b597bd434c49591d8af00ee3993a42c609dc9a59.1644346040.git.andreyknvl@google.com
Fixes:
f98f966cd750 ("kasan: test: add test case for double-kmem_cache_destroy()")
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Reviewed-by: Marco Elver <elver@google.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Aneesh Kumar K.V [Sat, 26 Feb 2022 03:10:56 +0000 (19:10 -0800)]
mm/hugetlb: fix kernel crash with hugetlb mremap
This fixes the below crash:
kernel BUG at include/linux/mm.h:2373!
cpu 0x5d: Vector: 700 (Program Check) at [
c00000003c6e76e0]
pc:
c000000000581a54: pmd_to_page+0x54/0x80
lr:
c00000000058d184: move_hugetlb_page_tables+0x4e4/0x5b0
sp:
c00000003c6e7980
msr:
9000000000029033
current = 0xc00000003bd8d980
paca = 0xc000200fff610100 irqmask: 0x03 irq_happened: 0x01
pid = 9349, comm = hugepage-mremap
kernel BUG at include/linux/mm.h:2373!
move_hugetlb_page_tables+0x4e4/0x5b0 (link register)
move_hugetlb_page_tables+0x22c/0x5b0 (unreliable)
move_page_tables+0xdbc/0x1010
move_vma+0x254/0x5f0
sys_mremap+0x7c0/0x900
system_call_exception+0x160/0x2c0
the kernel can't use huge_pte_offset before it set the pte entry because
a page table lookup check for huge PTE bit in the page table to
differentiate between a huge pte entry and a pointer to pte page. A
huge_pte_alloc won't mark the page table entry huge and hence kernel
should not use huge_pte_offset after a huge_pte_alloc.
Link: https://lkml.kernel.org/r/20220211063221.99293-1-aneesh.kumar@linux.ibm.com
Fixes:
550a7d60bd5e ("mm, hugepages: add mremap() support for hugepage backed vma")
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Mina Almasry <almasrymina@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Luis Chamberlain [Sat, 26 Feb 2022 03:10:53 +0000 (19:10 -0800)]
MAINTAINERS: add sysctl-next git tree
Add a git tree for sysctls as there's been quite a bit of work lately to
remove all the syctls out of kernel/sysctl.c and move to their respective
places, so coordination has been needed to avoid conflicts. This tree
will also help soak these changes on linux-next prior to getting to Linus.
Link: https://lkml.kernel.org/r/20220218182736.3694508-1-mcgrof@kernel.org
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Iurii Zaikin <yzaikin@google.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Daniel Bristot de Oliveira [Fri, 18 Feb 2022 17:57:09 +0000 (18:57 +0100)]
rtla/osnoise: Fix error message when failing to enable trace instance
When a trace instance creation fails, tools are printing:
Could not enable -> osnoiser <- tracer for tracing
Print the actual (and correct) name of the tracer it fails to enable.
Link: https://lkml.kernel.org/r/53ef0582605af91eca14b19dba9fc9febb95d4f9.1645206561.git.bristot@kernel.org
Fixes:
b1696371d865 ("rtla: Helper functions for rtla")
Cc: Jonathan Corbet <corbet@lwn.net>
Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Daniel Bristot de Oliveira [Fri, 18 Feb 2022 17:57:08 +0000 (18:57 +0100)]
rtla/osnoise: Free params at the exit
The variable that stores the parsed command line arguments are not
being free()d at the rtla osnoise top exit path.
Free params variable before exiting.
Link: https://lkml.kernel.org/r/0be31d8259c7c53b98a39769d60cfeecd8421785.1645206561.git.bristot@kernel.org
Fixes:
1eceb2fc2ca5 ("rtla/osnoise: Add osnoise top mode")
Cc: Jonathan Corbet <corbet@lwn.net>
Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Daniel Bristot de Oliveira [Fri, 18 Feb 2022 17:57:07 +0000 (18:57 +0100)]
rtla/hist: Make -E the short version of --entries
Currently, --entries uses -e as the short version in the hist mode of
timerlat and osnoise tools. But as -e is already used to enable events
on trace sessions by other tools, thus let's keep it available for the
same usage for all rtla tools.
Make -E the short version of --entries for hist mode on all tools.
Note: rtla was merged in this merge window, so rtla was not released yet.
Link: https://lkml.kernel.org/r/5dbf0cbe7364d3a05e708926b41a097c59a02b1e.1645206561.git.bristot@kernel.org
Cc: Jonathan Corbet <corbet@lwn.net>
Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Christophe Leroy [Mon, 20 Dec 2021 16:38:06 +0000 (16:38 +0000)]
tracing: Fix selftest config check for function graph start up test
CONFIG_DYNAMIC_FTRACE_WITH_DIRECT_CALLS is required to test
direct tramp.
Link: https://lkml.kernel.org/r/bdc7e594e13b0891c1d61bc8d56c94b1890eaed7.1640017960.git.christophe.leroy@csgroup.eu
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Steven Rostedt (Google) [Fri, 25 Feb 2022 20:34:26 +0000 (15:34 -0500)]
tracefs: Set the group ownership in apply_options() not parse_options()
Al Viro brought it to my attention that the dentries may not be filled
when the parse_options() is called, causing the call to set_gid() to
possibly crash. It should only be called if parse_options() succeeds
totally anyway.
He suggested the logical place to do the update is in apply_options().
Link: https://lore.kernel.org/all/20220225165219.737025658@goodmis.org/
Link: https://lkml.kernel.org/r/20220225153426.1c4cab6b@gandalf.local.home
Cc: stable@vger.kernel.org
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Reported-by: Al Viro <viro@zeniv.linux.org.uk>
Fixes:
48b27b6b5191 ("tracefs: Set all files to the same group ownership as the mount option")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>