Christoph Hellwig [Sun, 9 Feb 2025 04:33:43 +0000 (05:33 +0100)]
xfs: reflow xfs_dec_freecounter
Let the successful allocation be the main path through the function
with exception handling in branches to make the code easier to
follow.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Carlos Maiolino [Mon, 3 Mar 2025 12:01:06 +0000 (13:01 +0100)]
Merge branch 'vfs-6.15.iomap' of git://git./linux/kernel/git/vfs/vfs into xfs-6.15-merge
Christian Brauner [Thu, 27 Feb 2025 10:33:06 +0000 (11:33 +0100)]
Merge branch 'vfs-6.15.shared.iomap' of gitolite.pub/scm/linux/kernel/git/vfs/vfs
Pull in the VFS changes shared with xfs for making RWF_DONTCACHE work
with buffered writes for xfs.
Signed-off-by: Christian Brauner <brauner@kernel.org>
Christian Brauner [Thu, 27 Feb 2025 10:28:00 +0000 (11:28 +0100)]
Merge patch series "iomap: make buffered writes work with RWF_DONTCACHE"
Support buffered writes with RWF_DONTCACHE.
* patches from https://lore.kernel.org/r/
20250204184047.356762-2-axboe@kernel.dk:
xfs: flag as supporting FOP_DONTCACHE
iomap: make buffered writes work with RWF_DONTCACHE
Link: https://lore.kernel.org/r/20250204184047.356762-2-axboe@kernel.dk
Signed-off-by: Christian Brauner <brauner@kernel.org>
Jens Axboe [Tue, 4 Feb 2025 18:40:00 +0000 (11:40 -0700)]
xfs: flag as supporting FOP_DONTCACHE
Read side was already fully supported, and with the write side
appropriately punted to the worker queue, all that's needed now is
setting FOP_DONTCACHE in the file_operations structure to enable full
support for read and write uncached IO.
This provides similar benefits to using RWF_DONTCACHE with reads. Testing
buffered writes on 32 files:
writing bs 65536, uncached 0
1s: 196035MB/sec
2s: 132308MB/sec
3s: 132438MB/sec
4s: 116528MB/sec
5s: 103898MB/sec
6s: 108893MB/sec
7s: 99678MB/sec
8s: 106545MB/sec
9s: 106826MB/sec
10s: 101544MB/sec
11s: 111044MB/sec
12s: 124257MB/sec
13s: 116031MB/sec
14s: 114540MB/sec
15s: 115011MB/sec
16s: 115260MB/sec
17s: 116068MB/sec
18s: 116096MB/sec
where it's quite obvious where the page cache filled, and performance
dropped from to about half of where it started, settling in at around
115GB/sec. Meanwhile, 32 kswapds were running full steam trying to
reclaim pages.
Running the same test with uncached buffered writes:
writing bs 65536, uncached 1
1s: 198974MB/sec
2s: 189618MB/sec
3s: 193601MB/sec
4s: 188582MB/sec
5s: 193487MB/sec
6s: 188341MB/sec
7s: 194325MB/sec
8s: 188114MB/sec
9s: 192740MB/sec
10s: 189206MB/sec
11s: 193442MB/sec
12s: 189659MB/sec
13s: 191732MB/sec
14s: 190701MB/sec
15s: 191789MB/sec
16s: 191259MB/sec
17s: 190613MB/sec
18s: 191951MB/sec
and the behavior is fully predictable, performing the same throughout
even after the page cache would otherwise have fully filled with dirty
data. It's also about 65% faster, and using half the CPU of the system
compared to the normal buffered write.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Link: https://lore.kernel.org/r/20250204184047.356762-3-axboe@kernel.dk
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Jens Axboe [Tue, 4 Feb 2025 18:39:59 +0000 (11:39 -0700)]
iomap: make buffered writes work with RWF_DONTCACHE
Add iomap buffered write support for RWF_DONTCACHE. If RWF_DONTCACHE is
set for a write, mark the folios being written as uncached. Then
writeback completion will drop the pages. The write_iter handler simply
kicks off writeback for the pages, and writeback completion will take
care of the rest.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Link: https://lore.kernel.org/r/20250204184047.356762-2-axboe@kernel.dk
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Jens Axboe [Tue, 4 Feb 2025 18:40:00 +0000 (11:40 -0700)]
xfs: flag as supporting FOP_DONTCACHE
Read side was already fully supported, and with the write side
appropriately punted to the worker queue, all that's needed now is
setting FOP_DONTCACHE in the file_operations structure to enable full
support for read and write uncached IO.
This provides similar benefits to using RWF_DONTCACHE with reads. Testing
buffered writes on 32 files:
writing bs 65536, uncached 0
1s: 196035MB/sec
2s: 132308MB/sec
3s: 132438MB/sec
4s: 116528MB/sec
5s: 103898MB/sec
6s: 108893MB/sec
7s: 99678MB/sec
8s: 106545MB/sec
9s: 106826MB/sec
10s: 101544MB/sec
11s: 111044MB/sec
12s: 124257MB/sec
13s: 116031MB/sec
14s: 114540MB/sec
15s: 115011MB/sec
16s: 115260MB/sec
17s: 116068MB/sec
18s: 116096MB/sec
where it's quite obvious where the page cache filled, and performance
dropped from to about half of where it started, settling in at around
115GB/sec. Meanwhile, 32 kswapds were running full steam trying to
reclaim pages.
Running the same test with uncached buffered writes:
writing bs 65536, uncached 1
1s: 198974MB/sec
2s: 189618MB/sec
3s: 193601MB/sec
4s: 188582MB/sec
5s: 193487MB/sec
6s: 188341MB/sec
7s: 194325MB/sec
8s: 188114MB/sec
9s: 192740MB/sec
10s: 189206MB/sec
11s: 193442MB/sec
12s: 189659MB/sec
13s: 191732MB/sec
14s: 190701MB/sec
15s: 191789MB/sec
16s: 191259MB/sec
17s: 190613MB/sec
18s: 191951MB/sec
and the behavior is fully predictable, performing the same throughout
even after the page cache would otherwise have fully filled with dirty
data. It's also about 65% faster, and using half the CPU of the system
compared to the normal buffered write.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Link: https://lore.kernel.org/r/20250204184047.356762-3-axboe@kernel.dk
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Jens Axboe [Tue, 4 Feb 2025 18:39:59 +0000 (11:39 -0700)]
iomap: make buffered writes work with RWF_DONTCACHE
Add iomap buffered write support for RWF_DONTCACHE. If RWF_DONTCACHE is
set for a write, mark the folios being written as uncached. Then
writeback completion will drop the pages. The write_iter handler simply
kicks off writeback for the pages, and writeback completion will take
care of the rest.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Link: https://lore.kernel.org/r/20250204184047.356762-2-axboe@kernel.dk
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Christian Brauner [Wed, 26 Feb 2025 08:42:43 +0000 (09:42 +0100)]
Merge patch series "iomap: incremental advance conversion -- phase 2"
Brian Foster <bfoster@redhat.com> says:
Here's phase 2 of the incremental iter advance conversions. This updates
all remaining iomap operations to advance the iter within the operation
and thus removes the need to advance from the core iomap iterator. Once
all operations are switched over, the core advance code is removed and
the processed field is renamed to reflect that it is now a pure status
code.
For context, this was first introduced in a previous series [1] that
focused mainly on the core mechanism and iomap buffered write. This is
because original impetus was to facilitate a folio batch mechanism where
a filesystem can optionally provide a batch of folios to process for a
given mapping (i.e. zero range of an unwritten mapping with dirty folios
in pagecache). That is still WIP, but the broader point is that this was
originally intended as an optional mode until consensus that fell out of
discussion was that it would be preferable to convert over everything.
This presumably facilitates some other future work and simplifies
semantics in the core iteration code.
Patches 1-3 convert over iomap buffered read, direct I/O and various
other remaining ops (swap, etc.). Patches 4-9 convert over the various
DAX iomap operations. Finally, patches 10-12 introduce some cleanups now
that all iomap operations have updated iteration semantics.
* patches from https://lore.kernel.org/r/
20250224144757.237706-1-bfoster@redhat.com:
iomap: introduce a full map advance helper
iomap: rename iomap_iter processed field to status
iomap: remove unnecessary advance from iomap_iter()
dax: advance the iomap_iter on pte and pmd faults
dax: advance the iomap_iter on dedupe range
dax: advance the iomap_iter on unshare range
dax: advance the iomap_iter on zero range
dax: push advance down into dax_iomap_iter() for read and write
dax: advance the iomap_iter in the read/write path
iomap: convert misc simple ops to incremental advance
iomap: advance the iter on direct I/O
iomap: advance the iter directly on buffered read
Link: https://lore.kernel.org/r/20250224144757.237706-1-bfoster@redhat.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
Brian Foster [Mon, 24 Feb 2025 14:47:57 +0000 (09:47 -0500)]
iomap: introduce a full map advance helper
Various iomap_iter_advance() calls advance by the full mapping
length and thus have no need for the current length input or
post-advance remaining length output from the standard advance
function. Add an iomap_iter_advance_full() helper to clean up these
cases.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Link: https://lore.kernel.org/r/20250224144757.237706-13-bfoster@redhat.com
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Brian Foster [Mon, 24 Feb 2025 14:47:56 +0000 (09:47 -0500)]
iomap: rename iomap_iter processed field to status
The iter.processed field name is no longer appropriate now that
iomap operations do not return the number of bytes processed. Rename
the field to iter.status to reflect that a success or error code is
expected.
Also change the type to int as there is no longer a need for an s64.
This reduces the size of iomap_iter by 8 bytes due to a combination
of smaller type and reduction in structure padding. While here, fix
up the return types of various _iter() helpers to reflect the type
change.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Link: https://lore.kernel.org/r/20250224144757.237706-12-bfoster@redhat.com
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Brian Foster [Mon, 24 Feb 2025 14:47:55 +0000 (09:47 -0500)]
iomap: remove unnecessary advance from iomap_iter()
At this point, all iomap operations have been updated to advance the
iomap_iter directly before returning to iomap_iter(). Therefore, the
complexity of handling both the old and new semantics is no longer
required and can be removed from iomap_iter().
Update iomap_iter() to expect success or failure status in
iter.processed. As a precaution and developer hint to prevent
inadvertent use of old semantics, warn on a positive return code and
fail the operation. Remove the unnecessary advance and simplify the
termination logic.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Link: https://lore.kernel.org/r/20250224144757.237706-11-bfoster@redhat.com
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Brian Foster [Mon, 24 Feb 2025 14:47:54 +0000 (09:47 -0500)]
dax: advance the iomap_iter on pte and pmd faults
Advance the iomap_iter on PTE and PMD faults. Each of these
operations assign a hardcoded size to iter.processed. Replace those
with an advance and status return.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Link: https://lore.kernel.org/r/20250224144757.237706-10-bfoster@redhat.com
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Brian Foster [Mon, 24 Feb 2025 14:47:53 +0000 (09:47 -0500)]
dax: advance the iomap_iter on dedupe range
Advance the iter on successful dedupe. Dedupe range uses two iters
and iterates so long as both have outstanding work, so
correspondingly this needs to advance both on each iteration. Since
dax_range_compare_iter() now returns status instead of a byte count,
update the variable name in the caller as well.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Link: https://lore.kernel.org/r/20250224144757.237706-9-bfoster@redhat.com
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Brian Foster [Mon, 24 Feb 2025 14:47:52 +0000 (09:47 -0500)]
dax: advance the iomap_iter on unshare range
Advance the iter and return 0 or an error code for success or
failure.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Link: https://lore.kernel.org/r/20250224144757.237706-8-bfoster@redhat.com
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Brian Foster [Mon, 24 Feb 2025 14:47:51 +0000 (09:47 -0500)]
dax: advance the iomap_iter on zero range
Update the DAX zero range iomap iter handler to advance the iter
directly. Advance by the full length in the hole/unwritten case, or
otherwise advance incrementally in the zeroing loop. In either case,
return 0 or an error code for success or failure.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Link: https://lore.kernel.org/r/20250224144757.237706-7-bfoster@redhat.com
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Brian Foster [Mon, 24 Feb 2025 14:47:50 +0000 (09:47 -0500)]
dax: push advance down into dax_iomap_iter() for read and write
DAX read and write currently advances the iter after the
dax_iomap_iter() returns the number of bytes processed rather than
internally within the iter handler itself, as most other iomap
operations do. Push the advance down into dax_iomap_iter() and
update the function to return op status instead of bytes processed.
dax_iomap_iter() shortcuts reads from a hole or unwritten mapping by
directly zeroing the iov_iter, so advance the iomap_iter similarly
in that case.
The DAX processing loop can operate on a range slightly different
than defined by the iomap_iter depending on circumstances. For
example, a read may be truncated by inode size, a read or write
range can be increased due to page alignment, etc. Therefore, this
patch aims to retain as much of the existing logic as possible.
The loop control logic remains pos based, but is sampled from the
iomap_iter on each iteration after the advance instead of being
updated manually. Similarly, length is updated based on the output
of the advance instead of being updated manually. The advance itself
is based on the number of bytes transferred, which was previously
used to update the local copies of pos and length.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Link: https://lore.kernel.org/r/20250224144757.237706-6-bfoster@redhat.com
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Brian Foster [Mon, 24 Feb 2025 14:47:49 +0000 (09:47 -0500)]
dax: advance the iomap_iter in the read/write path
DAX reads and writes flow through dax_iomap_iter(), which has one or
more subtleties in terms of how it processes a range vs. what is
specified in the iomap_iter. To keep things simple and remove the
dependency on iomap_iter() advances, convert a positive return from
dax_iomap_iter() to the new advance and status return semantics. The
advance can be pushed further down in future patches.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Link: https://lore.kernel.org/r/20250224144757.237706-5-bfoster@redhat.com
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Brian Foster [Mon, 24 Feb 2025 14:47:48 +0000 (09:47 -0500)]
iomap: convert misc simple ops to incremental advance
Update several of the remaining iomap operations to advance the iter
directly rather than via return value. This includes page faults,
fiemap, seek data/hole and swapfile activation.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Link: https://lore.kernel.org/r/20250224144757.237706-4-bfoster@redhat.com
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Brian Foster [Mon, 24 Feb 2025 14:47:47 +0000 (09:47 -0500)]
iomap: advance the iter on direct I/O
Update iomap direct I/O to advance the iter directly rather than via
iter.processed. Update each mapping type helper to advance based on
the amount of data processed and return success or failure.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Link: https://lore.kernel.org/r/20250224144757.237706-3-bfoster@redhat.com
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Brian Foster [Mon, 24 Feb 2025 14:47:46 +0000 (09:47 -0500)]
iomap: advance the iter directly on buffered read
iomap buffered read advances the iter via iter.processed. To
continue separating iter advance from return status, update
iomap_readpage_iter() to advance the iter instead of returning the
number of bytes processed. In turn, drop the offset parameter and
sample the updated iter->pos at the start of the function. Update
the callers to loop based on remaining length in the current
iteration instead of number of bytes processed.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Link: https://lore.kernel.org/r/20250224144757.237706-2-bfoster@redhat.com
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Christoph Hellwig [Mon, 24 Feb 2025 23:48:55 +0000 (15:48 -0800)]
xfs: remove the XBF_STALE check from xfs_buf_rele_cached
xfs_buf_stale already set b_lru_ref to 0, and thus prevents the buffer
from moving to the LRU. Remove the duplicate check.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
Christoph Hellwig [Mon, 24 Feb 2025 23:48:54 +0000 (15:48 -0800)]
xfs: remove most in-flight buffer accounting
The buffer cache keeps a bt_io_count per-CPU counter to track all
in-flight I/O, which is used to ensure no I/O is in flight when
unmounting the file system.
For most I/O we already keep track of inflight I/O at higher levels:
- for synchronous I/O (xfs_buf_read/xfs_bwrite/xfs_buf_delwri_submit),
the caller has a reference and waits for I/O completions using
xfs_buf_iowait
- for xfs_buf_delwri_submit_nowait the only caller (AIL writeback)
tracks the log items that the buffer attached to
This only leaves only xfs_buf_readahead_map as a submitter of
asynchronous I/O that is not tracked by anything else. Replace the
bt_io_count per-cpu counter with a more specific bt_readahead_count
counter only tracking readahead I/O. This allows to simply increment
it when submitting readahead I/O and decrementing it when it completed,
and thus simplify xfs_buf_rele and remove the needed for the
XBF_NO_IOACCT flags and the XFS_BSTATE_IN_FLIGHT buffer state.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
Christoph Hellwig [Mon, 24 Feb 2025 23:48:53 +0000 (15:48 -0800)]
xfs: decouple buffer readahead from the normal buffer read path
xfs_buf_readahead_map is the only caller of xfs_buf_read_map and thus
_xfs_buf_read that is not synchronous. Split it from xfs_buf_read_map
so that the asynchronous path is self-contained and the now purely
synchronous xfs_buf_read_map / _xfs_buf_read implementation can be
simplified.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
Christoph Hellwig [Mon, 24 Feb 2025 23:48:52 +0000 (15:48 -0800)]
xfs: reduce context switches for synchronous buffered I/O
Currently all metadata I/O completions happen in the m_buf_workqueue
workqueue. But for synchronous I/O (i.e. all buffer reads) there is no
need for that, as there always is a called in process context that is
waiting for the I/O. Factor out the guts of xfs_buf_ioend into a
separate helper and call it from xfs_buf_iowait to avoid a double
an extra context switch to the workqueue.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
Christoph Hellwig [Thu, 6 Feb 2025 06:15:00 +0000 (07:15 +0100)]
xfs: flush inodegc before swapon
Fix the brand new xfstest that tries to swapon on a recently unshared
file and use the chance to document the other bit of magic in this
function.
The big comment is taken from a mailinglist post by Dave Chinner.
Fixes:
5e672cd69f0a53 ("xfs: introduce xfs_inodegc_push()")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
Christoph Hellwig [Thu, 6 Feb 2025 06:15:01 +0000 (07:15 +0100)]
xfs: rename xfs_iomap_swapfile_activate to xfs_vm_swap_activate
Match the method name and the naming convention or address_space
operations.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
Carlos Maiolino [Mon, 3 Feb 2025 13:04:57 +0000 (14:04 +0100)]
xfs: Do not allow norecovery mount with quotacheck
Mounting a filesystem that requires quota state changing will generate a
transaction.
We already check for a read-only device; we should do that for
norecovery too.
A quotacheck on a norecovery mount, and with the right log size, will cause
the mount process to hang on:
[<0>] xlog_grant_head_wait+0x5d/0x2a0 [xfs]
[<0>] xlog_grant_head_check+0x112/0x180 [xfs]
[<0>] xfs_log_reserve+0xe3/0x260 [xfs]
[<0>] xfs_trans_reserve+0x179/0x250 [xfs]
[<0>] xfs_trans_alloc+0x101/0x260 [xfs]
[<0>] xfs_sync_sb+0x3f/0x80 [xfs]
[<0>] xfs_qm_mount_quotas+0xe3/0x2f0 [xfs]
[<0>] xfs_mountfs+0x7ad/0xc20 [xfs]
[<0>] xfs_fs_fill_super+0x762/0xa50 [xfs]
[<0>] get_tree_bdev_flags+0x131/0x1d0
[<0>] vfs_get_tree+0x26/0xd0
[<0>] vfs_cmd_create+0x59/0xe0
[<0>] __do_sys_fsconfig+0x4e3/0x6b0
[<0>] do_syscall_64+0x82/0x160
[<0>] entry_SYSCALL_64_after_hwframe+0x76/0x7e
This is caused by a transaction running with bogus initialized head/tail
I initially hit this while running generic/050, with random log
sizes, but I managed to reproduce it reliably here with the steps
below:
mkfs.xfs -f -lsize=1025M -f -b size=4096 -m crc=1,reflink=1,rmapbt=1, -i
sparse=1 /dev/vdb2 > /dev/null
mount -o usrquota,grpquota,prjquota /dev/vdb2 /mnt
xfs_io -x -c 'shutdown -f' /mnt
umount /mnt
mount -o ro,norecovery,usrquota,grpquota,prjquota /dev/vdb2 /mnt
Last mount hangs up
As we add yet another validation if quota state is changing, this also
add a new helper named xfs_qm_validate_state_change(), factoring the
quota state changes out of xfs_qm_newmount() to reduce cluttering
within it.
Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
Lukas Herbolt [Mon, 3 Feb 2025 08:55:13 +0000 (09:55 +0100)]
xfs: do not check NEEDSREPAIR if ro,norecovery mount.
If there is corrutpion on the filesystem andxfs_repair
fails to repair it. The last resort of getting the data
is to use norecovery,ro mount. But if the NEEDSREPAIR is
set the filesystem cannot be mounted. The flag must be
cleared out manually using xfs_db, to get access to what
left over of the corrupted fs.
Signed-off-by: Lukas Herbolt <lukas@herbolt.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
Darrick J. Wong [Mon, 3 Feb 2025 00:50:14 +0000 (16:50 -0800)]
xfs: fix data fork format filtering during inode repair
Coverity noticed that xrep_dinode_bad_metabt_fork never runs because
XFS_DINODE_FMT_META_BTREE is always filtered out in the mode selection
switch of xrep_dinode_check_dfork.
Metadata btrees are allowed only in the data forks of regular files, so
add this case explicitly. I guess this got fubard during a refactoring
prior to 6.13 and I didn't notice until now. :/
Coverity-id:
1617714
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
Darrick J. Wong [Mon, 3 Feb 2025 00:50:14 +0000 (16:50 -0800)]
xfs: fix online repair probing when CONFIG_XFS_ONLINE_REPAIR=n
I received a report from the release engineering side of the house that
xfs_scrub without the -n flag (aka fix it mode) would try to fix a
broken filesystem even on a kernel that doesn't have online repair built
into it:
# xfs_scrub -dTvn /mnt/test
EXPERIMENTAL xfs_scrub program in use! Use at your own risk!
Phase 1: Find filesystem geometry.
/mnt/test: using 1 threads to scrub.
Phase 1: Memory used: 132k/0k (108k/25k), time: 0.00/ 0.00/ 0.00s
<snip>
Phase 4: Repair filesystem.
<snip>
Info: /mnt/test/some/victimdir directory entries: Attempting repair. (repair.c line 351)
Corruption: /mnt/test/some/victimdir directory entries: Repair unsuccessful; offline repair required. (repair.c line 204)
Source: https://blogs.oracle.com/linux/post/xfs-online-filesystem-repair
It is strange that xfs_scrub doesn't refuse to run, because the kernel
is supposed to return EOPNOTSUPP if we actually needed to run a repair,
and xfs_io's repair subcommand will perror that. And yet:
# xfs_io -x -c 'repair probe' /mnt/test
#
The first problem is commit
dcb660f9222fd9 (4.15) which should have had
xchk_probe set the CORRUPT OFLAG so that any of the repair machinery
will get called at all.
It turns out that some refactoring that happened in the 6.6-6.8 era
broke the operation of this corner case. What we *really* want to
happen is that all the predicates that would steer xfs_scrub_metadata()
towards calling xrep_attempt() should function the same way that they do
when repair is compiled in; and then xrep_attempt gets to return the
fatal EOPNOTSUPP error code that causes the probe to fail.
Instead, commit
8336a64eb75cba (6.6) started the failwhale swimming by
hoisting OFLAG checking logic into a helper whose non-repair stub always
returns false, causing scrub to return "repair not needed" when in fact
the repair is not supported. Prior to that commit, the oflag checking
that was open-coded in scrub.c worked correctly.
Similarly, in commit
4bdfd7d15747b1 (6.8) we hoisted the IFLAG_REPAIR
and ALREADY_FIXED logic into a helper whose non-repair stub always
returns false, so we never enter the if test body that would have called
xrep_attempt, let alone fail to decode the OFLAGs correctly.
The final insult (yes, we're doing The Naked Gun now) is commit
48a72f60861f79 (6.8) in which we hoisted the "are we going to try a
repair?" predicate into yet another function with a non-repair stub
always returns false.
Fix xchk_probe to trigger xrep_probe if repair is enabled, or return
EOPNOTSUPP directly if it is not. For all the other scrub types, we
need to fix the header predicates so that the ->repair functions (which
are all xrep_notsupported) get called to return EOPNOTSUPP. Commit
48a72 is tagged here because the scrub code prior to LTS 6.12 are
incomplete and not worth patching.
Reported-by: David Flynn <david.flynn@oracle.com>
Cc: <stable@vger.kernel.org> # v6.8
Fixes:
8336a64eb75c ("xfs: don't complain about unfixed metadata when repairs were injected")
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
Christian Brauner [Mon, 10 Feb 2025 11:46:43 +0000 (12:46 +0100)]
Merge patch series "iomap: incremental per-operation iter advance"
Brian Foster <bfoster@redhat.com> says:
This is a first pass at supporting more incremental, per-operation
iomap_iter advancement. The motivation for this is folio_batch support
for zero range, where the fs provides a batch of folios to process
in certain situations. Since the batch may not be logically contiguous,
processing loops require a bit more flexibility than the typical offset
based iteration.
The current iteration model basically has the operation _iter() handler
lift the pos/length wrt to the current iomap out of the iomap_iter,
process it locally, then return the result to be stored in
iter.processed. The latter is overloaded with error status, so the
handler must decide whether to return error or a partial completion
(i.e. consider a short write). iomap_iter() then uses the result to
advance the iter and look up the next iomap.
The updated model proposed in this series is to allow an operation to
advance the iter itself as subranges are processed and then return
success or failure in iter.processed. Note that at least initially, this
is implemented as an optional mode to minimize churn. This series
converts operations that use iomap_write_begin(): buffered write,
unshare, and zero range.
The main advantage of this is that the future folio_batch work can be
plumbed down into the folio get path more naturally, and the
associated codepath can advance the iter itself when appropriate rather
than require each operation to manage the gaps in the range being
processed. Some secondary advantages are a little less boilerplate code
for walking ranges and more clear semantics for partial completions in
the event of errors, etc.
* patches from https://lore.kernel.org/r/
20250207143253.314068-1-bfoster@redhat.com:
iomap: advance the iter directly on zero range
iomap: advance the iter directly on unshare range
iomap: advance the iter directly on buffered writes
iomap: support incremental iomap_iter advances
iomap: export iomap_iter_advance() and return remaining length
iomap: lift iter termination logic from iomap_iter_advance()
iomap: lift error code check out of iomap_iter_advance()
iomap: refactor iomap_iter() length check and tracepoint
iomap: split out iomap check and reset logic from iter advance
iomap: factor out iomap length helper
Link: https://lore.kernel.org/r/20250207143253.314068-1-bfoster@redhat.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
Brian Foster [Fri, 7 Feb 2025 14:32:53 +0000 (09:32 -0500)]
iomap: advance the iter directly on zero range
Modify zero range to advance the iter directly. Replace the local pos
and length calculations with direct advances and loop based on iter
state instead.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Link: https://lore.kernel.org/r/20250207143253.314068-11-bfoster@redhat.com
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Brian Foster [Fri, 7 Feb 2025 14:32:52 +0000 (09:32 -0500)]
iomap: advance the iter directly on unshare range
Modify unshare range to advance the iter directly. Replace the local
pos and length calculations with direct advances and loop based on
iter state instead.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Link: https://lore.kernel.org/r/20250207143253.314068-10-bfoster@redhat.com
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Brian Foster [Fri, 7 Feb 2025 14:32:51 +0000 (09:32 -0500)]
iomap: advance the iter directly on buffered writes
Modify the buffered write path to advance the iter directly. Replace
the local pos and length calculations with direct advances and loop
based on iter state instead.
Also remove the -EAGAIN return hack as it is no longer necessary now
that separate return channels exist for processing progress and error
returns. For example, the existing write handler must return either a
count of bytes written or error if the write is interrupted, but
presumably wants to return -EAGAIN directly in order to break the higher
level iomap_iter() loop.
Since the current iteration may have made some progress, it unwinds the
iter on the way out to return the error while ensuring that portion of
the write can be retried. If -EAGAIN occurs at any point beyond the
first iteration, iomap_file_buffered_write() will then observe progress
based on iter->pos to return a short write.
With incremental advances on the iomap_iter, iomap_write_iter() can
simply return the error. iomap_iter() completes whatever progress was
made based on iomap_iter position and still breaks out of the iter loop
based on the error code in iter.processed. The end result of the write
is similar in terms of being a short write if progress was made or error
return otherwise.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Link: https://lore.kernel.org/r/20250207143253.314068-9-bfoster@redhat.com
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Brian Foster [Fri, 7 Feb 2025 14:32:50 +0000 (09:32 -0500)]
iomap: support incremental iomap_iter advances
The current iomap_iter iteration model reads the mapping from the
filesystem, processes the subrange of the operation associated with
the current mapping, and returns the number of bytes processed back
to the iteration code. The latter advances the position and
remaining length of the iter in preparation for the next iteration.
At the _iter() handler level, this tends to produce a processing
loop where the local code pulls the current position and remaining
length out of the iter, iterates it locally based on file offset,
and then breaks out when the associated range has been fully
processed.
This works well enough for current handlers, but upcoming
enhancements require a bit more flexibility in certain situations.
Enhancements for zero range will lead to a situation where the
processing loop is no longer a pure ascending offset walk, but
rather dictated by pagecache state and folio lookup. Since folio
lookup and write preparation occur at different levels, it is more
difficult to manage position and length outside of the iter.
To provide more flexibility to certain iomap operations, introduce
support for incremental iomap_iter advances from within the
operation itself. This allows more granular advances for operations
that might not use the typical file offset based walk.
Note that the semantics for operations that use incremental advances
is slightly different than traditional operations. Operations that
advance the iter directly are expected to return success or failure
(i.e. 0 or negative error code) in iter.processed rather than the
number of bytes processed.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Link: https://lore.kernel.org/r/20250207143253.314068-8-bfoster@redhat.com
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Brian Foster [Fri, 7 Feb 2025 14:32:49 +0000 (09:32 -0500)]
iomap: export iomap_iter_advance() and return remaining length
As a final step for generic iter advance, export the helper and
update it to return the remaining length of the current iteration
after the advance. This will usually be 0 in the iomap_iter() case,
but will be useful for the various operations that iterate on their
own and will be updated to advance as they progress.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Link: https://lore.kernel.org/r/20250207143253.314068-7-bfoster@redhat.com
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Brian Foster [Fri, 7 Feb 2025 14:32:48 +0000 (09:32 -0500)]
iomap: lift iter termination logic from iomap_iter_advance()
The iter termination logic in iomap_iter_advance() is only needed by
iomap_iter() to determine whether to proceed with the next mapping
for an ongoing operation. The old logic sets ret to 1 and then
terminates if the operation is complete (iter->len == 0) or the
previous iteration performed no work and the mapping has not been
marked stale. The stale check exists to allow operations to
retry the current mapping if an inconsistency has been detected.
To further genericize iomap_iter_advance(), lift the termination
logic into iomap_iter() and update the former to return success (0)
or an error code. iomap_iter() continues on successful advance and
non-zero iter->len or otherwise terminates in the no progress (and
not stale) or error cases.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Link: https://lore.kernel.org/r/20250207143253.314068-6-bfoster@redhat.com
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Brian Foster [Fri, 7 Feb 2025 14:32:47 +0000 (09:32 -0500)]
iomap: lift error code check out of iomap_iter_advance()
The error code is only used to check whether iomap_iter() should
terminate due to an error returned in iter.processed. Lift the check
out of iomap_iter_advance() in preparation to make it more generic.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Link: https://lore.kernel.org/r/20250207143253.314068-5-bfoster@redhat.com
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Brian Foster [Fri, 7 Feb 2025 14:32:46 +0000 (09:32 -0500)]
iomap: refactor iomap_iter() length check and tracepoint
iomap_iter() checks iomap.length to skip individual code blocks not
appropriate for the initial case where there is no mapping in the
iter. To prepare for upcoming changes, refactor the code to jump
straight to the ->iomap_begin() handler in the initial case and move
the tracepoint to the top of the function so it always executes.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Link: https://lore.kernel.org/r/20250207143253.314068-4-bfoster@redhat.com
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Brian Foster [Fri, 7 Feb 2025 14:32:45 +0000 (09:32 -0500)]
iomap: split out iomap check and reset logic from iter advance
In preparation for more granular iomap_iter advancing, break out
some of the logic associated with higher level iteration from
iomap_advance_iter(). Specifically, factor the iomap reset code into
a separate helper and lift the iomap.length check into the calling
code, similar to how ->iomap_end() calls are handled.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Link: https://lore.kernel.org/r/20250207143253.314068-3-bfoster@redhat.com
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Brian Foster [Fri, 7 Feb 2025 14:32:44 +0000 (09:32 -0500)]
iomap: factor out iomap length helper
In preparation to support more granular iomap iter advancing, factor
the pos/len values as parameters to length calculation.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Link: https://lore.kernel.org/r/20250207143253.314068-2-bfoster@redhat.com
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Linus Torvalds [Sun, 9 Feb 2025 20:45:03 +0000 (12:45 -0800)]
Linux 6.14-rc2
Linus Torvalds [Sun, 9 Feb 2025 18:05:32 +0000 (10:05 -0800)]
Merge tag 'kbuild-fixes-v6.14' of git://git./linux/kernel/git/masahiroy/linux-kbuild
Pull Kbuild fixes from Masahiro Yamada:
- Suppress false-positive -Wformat-{overflow,truncation}-non-kprintf
warnings regardless of the W= option
- Avoid CONFIG_TRIM_UNUSED_KSYMS dropping symbols passed to symbol_get()
- Fix a build regression of the Debian linux-headers package
* tag 'kbuild-fixes-v6.14' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild:
kbuild: install-extmod-build: add missing quotation marks for CC variable
kbuild: fix misspelling in scripts/Makefile.lib
kbuild: keep symbols for symbol_get() even with CONFIG_TRIM_UNUSED_KSYMS
scripts/Makefile.extrawarn: Do not show clang's non-kprintf warnings at W=1
Linus Torvalds [Sun, 9 Feb 2025 17:47:06 +0000 (09:47 -0800)]
Merge tag 'pm-6.14-rc2-2' of git://git./linux/kernel/git/rafael/linux-pm
Pull power management fix from Rafael Wysocki:
"Fix a recently introduced kernel crash due to a NULL pointer
dereference during system-wide suspend (Rafael Wysocki)"
* tag 'pm-6.14-rc2-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
PM: sleep: core: Restrict power.set_active propagation
Linus Torvalds [Sun, 9 Feb 2025 17:41:38 +0000 (09:41 -0800)]
Merge tag 'for-linus' of git://git./virt/kvm/kvm
Pull kvm fixes from Paolo Bonzini:
"ARM:
- Correctly clean the BSS to the PoC before allowing EL2 to access it
on nVHE/hVHE/protected configurations
- Propagate ownership of debug registers in protected mode after the
rework that landed in 6.14-rc1
- Stop pretending that we can run the protected mode without a GICv3
being present on the host
- Fix a use-after-free situation that can occur if a vcpu fails to
initialise the NV shadow S2 MMU contexts
- Always evaluate the need to arm a background timer for fully
emulated guest timers
- Fix the emulation of EL1 timers in the absence of FEAT_ECV
- Correctly handle the EL2 virtual timer, specially when HCR_EL2.E2H==0
s390:
- move some of the guest page table (gmap) logic into KVM itself,
inching towards the final goal of completely removing gmap from the
non-kvm memory management code.
As an initial set of cleanups, move some code from mm/gmap into kvm
and start using __kvm_faultin_pfn() to fault-in pages as needed;
but especially stop abusing page->index and page->lru to aid in the
pgdesc conversion.
x86:
- Add missing check in the fix to defer starting the huge page
recovery vhost_task
- SRSO_USER_KERNEL_NO does not need SYNTHESIZED_F"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (31 commits)
KVM: x86/mmu: Ensure NX huge page recovery thread is alive before waking
KVM: remove kvm_arch_post_init_vm
KVM: selftests: Fix spelling mistake "initally" -> "initially"
kvm: x86: SRSO_USER_KERNEL_NO is not synthesized
KVM: arm64: timer: Don't adjust the EL2 virtual timer offset
KVM: arm64: timer: Correctly handle EL1 timer emulation when !FEAT_ECV
KVM: arm64: timer: Always evaluate the need for a soft timer
KVM: arm64: Fix nested S2 MMU structures reallocation
KVM: arm64: Fail protected mode init if no vgic hardware is present
KVM: arm64: Flush/sync debug state in protected mode
KVM: s390: selftests: Streamline uc_skey test to issue iske after sske
KVM: s390: remove the last user of page->index
KVM: s390: move PGSTE softbits
KVM: s390: remove useless page->index usage
KVM: s390: move gmap_shadow_pgt_lookup() into kvm
KVM: s390: stop using lists to keep track of used dat tables
KVM: s390: stop using page->index for non-shadow gmaps
KVM: s390: move some gmap shadowing functions away from mm/gmap.c
KVM: s390: get rid of gmap_translate()
KVM: s390: get rid of gmap_fault()
...
Rafael J. Wysocki [Sat, 8 Feb 2025 17:54:28 +0000 (18:54 +0100)]
PM: sleep: core: Restrict power.set_active propagation
Commit
3775fc538f53 ("PM: sleep: core: Synchronize runtime PM status of
parents and children") exposed an issue related to simple_pm_bus_pm_ops
that uses pm_runtime_force_suspend() and pm_runtime_force_resume() as
bus type PM callbacks for the noirq phases of system-wide suspend and
resume.
The problem is that pm_runtime_force_suspend() does not distinguish
runtime-suspended devices from devices for which runtime PM has never
been enabled, so if it sees a device with runtime PM status set to
RPM_ACTIVE, it will assume that runtime PM is enabled for that device
and so it will attempt to suspend it with the help of its runtime PM
callbacks which may not be ready for that. As it turns out, this
causes simple_pm_bus_runtime_suspend() to crash due to a NULL pointer
dereference.
Another problem related to the above commit and simple_pm_bus_pm_ops is
that setting runtime PM status of a device handled by the latter to
RPM_ACTIVE will actually prevent it from being resumed because
pm_runtime_force_resume() only resumes devices with runtime PM status
set to RPM_SUSPENDED.
To mitigate these issues, do not allow power.set_active to propagate
beyond the parent of the device with DPM_FLAG_SMART_SUSPEND set that
will need to be resumed, which should be a sufficient stop-gap for the
time being, but they will need to be properly addressed in the future
because in general during system-wide resume it is necessary to resume
all devices in a dependency chain in which at least one device is going
to be resumed.
Fixes:
3775fc538f53 ("PM: sleep: core: Synchronize runtime PM status of parents and children")
Closes: https://lore.kernel.org/linux-pm/
1c2433d4-7e0f-4395-b841-
b8eac7c25651@nvidia.com/
Reported-by: Jon Hunter <jonathanh@nvidia.com>
Tested-by: Johan Hovold <johan+linaro@kernel.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Link: https://patch.msgid.link/6137505.lOV4Wx5bFT@rjwysocki.net
Linus Torvalds [Sat, 8 Feb 2025 22:12:17 +0000 (14:12 -0800)]
Merge tag 'hardening-v6.14-rc2' of git://git./linux/kernel/git/kees/linux
Pull hardening fixes from Kees Cook:
"Address a KUnit stack initialization regression that got tickled on
m68k, and solve a Clang(v14 and earlier) bug found by 0day:
- Fix stackinit KUnit regression on m68k
- Use ARRAY_SIZE() for memtostr*()/strtomem*()"
* tag 'hardening-v6.14-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
string.h: Use ARRAY_SIZE() for memtostr*()/strtomem*()
compiler.h: Introduce __must_be_byte_array()
compiler.h: Move C string helpers into C-only kernel section
stackinit: Fix comment for test_small_end
stackinit: Keep selftest union size small on m68k
Linus Torvalds [Sat, 8 Feb 2025 22:04:21 +0000 (14:04 -0800)]
Merge tag 'seccomp-v6.14-rc2' of git://git./linux/kernel/git/kees/linux
Pull seccomp fix from Kees Cook:
"This is really a work-around for x86_64 having grown a syscall to
implement uretprobe, which has caused problems since v6.11.
This may change in the future, but for now, this fixes the unintended
seccomp filtering when uretprobe switched away from traps, and does so
with something that should be easy to backport.
- Allow uretprobe on x86_64 to avoid behavioral complications (Eyal
Birger)"
* tag 'seccomp-v6.14-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
selftests/seccomp: validate uretprobe syscall passes through seccomp
seccomp: passthrough uretprobe systemcall without filtering
Linus Torvalds [Sat, 8 Feb 2025 21:59:24 +0000 (13:59 -0800)]
Merge tag 'execve-v6.14-rc2' of git://git./linux/kernel/git/kees/linux
Pull execve fix from Kees Cook:
"This is an alpha-specific fix, but since it touched ELF I was asked to
carry it.
- alpha/elf: Fix misc/setarch test of util-linux by removing 32bit
support (Eric W. Biederman)"
* tag 'execve-v6.14-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
alpha/elf: Fix misc/setarch test of util-linux by removing 32bit support
Linus Torvalds [Sat, 8 Feb 2025 21:45:34 +0000 (13:45 -0800)]
Merge tag 'scsi-fixes' of git://git./linux/kernel/git/jejb/scsi
Pull SCSI fixes from James Bottomley:
"A number of fairly small fixes, mostly in drivers but two in the core
to change a retry for depopulation (a trendy new hdd thing that
reorganizes blocks away from failing elements) and one to fix a GFP_
annotation to avoid a lock dependency (the third core patch is all in
testing)"
* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
scsi: qla1280: Fix kernel oops when debug level > 2
scsi: ufs: core: Fix error return with query response
scsi: storvsc: Set correct data length for sending SCSI command without payload
scsi: ufs: core: Fix use-after free in init error and remove paths
scsi: core: Do not retry I/Os during depopulation
scsi: core: Use GFP_NOIO to avoid circular locking dependency
scsi: ufs: Fix toggling of clk_gating.state when clock gating is not allowed
scsi: ufs: core: Ensure clk_gating.lock is used only after initialization
scsi: ufs: core: Simplify temperature exception event handling
scsi: target: core: Add line break to status show
scsi: ufs: core: Fix the HIGH/LOW_TEMP Bit Definitions
scsi: core: Add passthrough tests for success and no failure definitions
Linus Torvalds [Sat, 8 Feb 2025 21:35:17 +0000 (13:35 -0800)]
Merge tag 'i2c-for-6.14-rc2' of git://git./linux/kernel/git/wsa/linux
Pull i2c reverts from Wolfram Sang:
"It turned out the new mechanism for handling created devices does not
handle all muxing cases.
Revert the changes to give a proper solution more time"
* tag 'i2c-for-6.14-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux:
Revert "i2c: Replace list-based mechanism for handling auto-detected clients"
Revert "i2c: Replace list-based mechanism for handling userspace-created clients"
Linus Torvalds [Sat, 8 Feb 2025 20:22:21 +0000 (12:22 -0800)]
Merge tag 'rust-fixes-6.14' of https://github.com/Rust-for-Linux/linux
Pull rust fixes from Miguel Ojeda:
- Do not export KASAN ODR symbols to avoid gendwarfksyms warnings
- Fix future Rust 1.86.0 (to be released 2025-04-03) x86_64 builds
- Clean future Rust 1.86.0 (to be released 2025-04-03) warning
- Fix future GCC 15 (to be released in a few months) builds
- Fix `rusttest` target in macOS
* tag 'rust-fixes-6.14' of https://github.com/Rust-for-Linux/linux:
x86: rust: set rustc-abi=x86-softfloat on rustc>=1.86.0
rust: kbuild: do not export generated KASAN ODR symbols
rust: kbuild: add -fzero-init-padding-bits to bindgen_skip_cflags
rust: init: use explicit ABI to clean warning in future compilers
rust: kbuild: use host dylib naming in rusttestlib-kernel
Linus Torvalds [Sat, 8 Feb 2025 20:18:02 +0000 (12:18 -0800)]
Merge tag 'ftrace-v6.14-rc1' of git://git./linux/kernel/git/trace/linux-trace
Pull ftrace fix from Steven Rostedt:
"Function graph fix of notrace functions.
When the function graph tracer was restructured to use the global
section of the meta data in the shadow stack, the bit logic was
changed. There's a TRACE_GRAPH_NOTRACE_BIT that is the bit number in
the mask that tells if the function graph tracer is currently in the
"notrace" mode. The TRACE_GRAPH_NOTRACE is the mask with that bit set.
But when the code we restructured, the TRACE_GRAPH_NOTRACE_BIT was
used when it should have been the TRACE_GRAPH_NOTRACE mask. This made
notrace not work properly"
* tag 'ftrace-v6.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
fgraph: Fix set_graph_notrace with setting TRACE_GRAPH_NOTRACE_BIT
Linus Torvalds [Sat, 8 Feb 2025 20:04:00 +0000 (12:04 -0800)]
Merge tag 'x86-urgent-2025-02-08' of git://git./linux/kernel/git/tip/tip
Pull x86 fix from Ingo Molnar:
"Fix a build regression on GCC 15 builds, caused by GCC changing the
default C version that is overriden in the main Makefile but not in
the x86 boot code Makefile"
* tag 'x86-urgent-2025-02-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/boot: Use '-std=gnu11' to fix build with GCC 15
Linus Torvalds [Sat, 8 Feb 2025 19:55:03 +0000 (11:55 -0800)]
Merge tag 'timers-urgent-2025-02-08' of git://git./linux/kernel/git/tip/tip
Pull timer fixes from Ingo Molnar:
"Fix a PREEMPT_RT bug in the clocksource verification code that caused
false positive warnings.
Also fix a timer migration setup bug when new CPUs are added"
* tag 'timers-urgent-2025-02-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
timers/migration: Fix off-by-one root mis-connection
clocksource: Use migrate_disable() to avoid calling get_random_u32() in atomic context
Linus Torvalds [Sat, 8 Feb 2025 19:16:22 +0000 (11:16 -0800)]
Merge tag 'sched-urgent-2025-02-08' of git://git./linux/kernel/git/tip/tip
Pull scheduler fixes from Ingo Molnar:
"Fix a cfs_rq->h_nr_runnable accounting bug that trips up a defensive
SCHED_WARN_ON() on certain workloads. The bug is believed to be
(accidentally) self-correcting, hence no behavioral side effects are
expected.
Also print se.slice in debug output, since this value can now be set
via the syscall ABI and can be useful to track"
* tag 'sched-urgent-2025-02-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/debug: Provide slice length for fair tasks
sched/fair: Fix inaccurate h_nr_runnable accounting with delayed dequeue
Linus Torvalds [Sat, 8 Feb 2025 19:05:54 +0000 (11:05 -0800)]
Merge tag 'irq-urgent-2025-02-08' of git://git./linux/kernel/git/tip/tip
Pull irq fix from Ingo Molnar:
"Another followup fix for the procps genirq output formatting
regression caused by an optimization"
* tag 'irq-urgent-2025-02-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
genirq: Remove leading space from irq_chip::irq_print_chip() callbacks
Linus Torvalds [Sat, 8 Feb 2025 18:54:11 +0000 (10:54 -0800)]
Merge tag 'locking-urgent-2025-02-08' of git://git./linux/kernel/git/tip/tip
Pull locking fix from Ingo Molnar:
"Fix a dangling pointer bug in the futex code used by the uring code.
It isn't causing problems at the moment due to uring ABI limitations
leaving it essentially unused in current usages, but is a good idea to
fix nevertheless"
* tag 'locking-urgent-2025-02-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
futex: Pass in task to futex_queue()
Steven Rostedt [Sat, 8 Feb 2025 05:15:11 +0000 (00:15 -0500)]
fgraph: Fix set_graph_notrace with setting TRACE_GRAPH_NOTRACE_BIT
The code was restructured where the function graph notrace code, that
would not trace a function and all its children is done by setting a
NOTRACE flag when the function that is not to be traced is hit.
There's a TRACE_GRAPH_NOTRACE_BIT which defines the bit in the flags and a
TRACE_GRAPH_NOTRACE which is the mask with that bit set. But the
restructuring used TRACE_GRAPH_NOTRACE_BIT when it should have used
TRACE_GRAPH_NOTRACE.
For example:
# cd /sys/kernel/tracing
# echo set_track_prepare stack_trace_save > set_graph_notrace
# echo function_graph > current_tracer
# cat trace
[..]
0) | __slab_free() {
0) | free_to_partial_list() {
0) | arch_stack_walk() {
0) | __unwind_start() {
0) 0.501 us | get_stack_info();
Where a non filter trace looks like:
# echo > set_graph_notrace
# cat trace
0) | free_to_partial_list() {
0) | set_track_prepare() {
0) | stack_trace_save() {
0) | arch_stack_walk() {
0) | __unwind_start() {
Where the filter should look like:
# cat trace
0) | free_to_partial_list() {
0) | _raw_spin_lock_irqsave() {
0) 0.350 us | preempt_count_add();
0) 0.351 us | do_raw_spin_lock();
0) 2.440 us | }
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/20250208001511.535be150@batman.local.home
Fixes:
b84214890a9bc ("function_graph: Move graph notrace bit to shadow stack global var")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Nathan Chancellor [Thu, 17 Oct 2024 17:09:22 +0000 (10:09 -0700)]
kbuild: Move -Wenum-enum-conversion to W=2
-Wenum-enum-conversion was strengthened in clang-19 to warn for C, which
caused the kernel to move it to W=1 in commit
75b5ab134bb5 ("kbuild:
Move -Wenum-{compare-conditional,enum-conversion} into W=1") because
there were numerous instances that would break builds with -Werror.
Unfortunately, this is not a full solution, as more and more developers,
subsystems, and distributors are building with W=1 as well, so they
continue to see the numerous instances of this warning.
Since the move to W=1, there have not been many new instances that have
appeared through various build reports and the ones that have appeared
seem to be following similar existing patterns, suggesting that most
instances of this warning will not be real issues. The only alternatives
for silencing this warning are adding casts (which is generally seen as
an ugly practice) or refactoring the enums to macro defines or a unified
enum (which may be undesirable because of type safety in other parts of
the code).
Move the warning to W=2, where warnings that occur frequently but may be
relevant should reside.
Cc: stable@vger.kernel.org
Fixes:
75b5ab134bb5 ("kbuild: Move -Wenum-{compare-conditional,enum-conversion} into W=1")
Link: https://lore.kernel.org/ZwRA9SOcOjjLJcpi@google.com/
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Linus Torvalds [Sat, 8 Feb 2025 03:23:06 +0000 (19:23 -0800)]
Merge tag 'v6.14rc1-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6
Pull smb client fixes from Steve French:
- Three DFS fixes: DFS mount fix, fix for noisy log msg and one to
remove some unused code
- SMB3 Lease fix
* tag 'v6.14rc1-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6:
smb: client: change lease epoch type from unsigned int to __u16
smb: client: get rid of kstrdup() in get_ses_refpath()
smb: client: fix noisy when tree connecting to DFS interlink targets
smb: client: don't trust DFSREF_STORAGE_SERVER bit
Linus Torvalds [Fri, 7 Feb 2025 20:21:54 +0000 (12:21 -0800)]
Merge tag 'drm-fixes-2025-02-08' of https://gitlab.freedesktop.org/drm/kernel
Pull drm fixes from Dave Airlie:
"Just regular drm fixes, amdgpu, xe and i915 mostly, but a few
scattered fixes. I think one of the i915 fixes fixes some build combos
that Guenter was seeing.
amdgpu:
- Add new tiling flag for DCC write compress disable
- Add BO metadata flag for DCC
- Fix potential out of bounds access in display
- Seamless boot fix
- CONFIG_FRAME_WARN fix
- PSR1 fix
xe:
- OA uAPI related fixes
- Fix SRIOV migration initialization
- Restore devcoredump to a sane state
i915:
- Fix the build error with clamp after WARN_ON on gcc 13.x+
- HDCP related fixes
- PMU fix zero delta busyness issue
- Fix page cleanup on DMA remap failure
- Drop 64bpp YUV formats from ICL+ SDR planes
- GuC log related fix
- DisplayPort related fixes
ivpu:
- Fix error handling
komeda:
- add return check
zynqmp:
- fix locking in DP code
ast:
- fix AST DP timeout
cec:
- fix broken CEC adapter check"
* tag 'drm-fixes-2025-02-08' of https://gitlab.freedesktop.org/drm/kernel: (29 commits)
drm/i915/dp: Fix potential infinite loop in 128b/132b SST
Revert "drm/amd/display: Use HW lock mgr for PSR1"
drm/amd/display: Respect user's CONFIG_FRAME_WARN more for dml files
accel/amdxdna: Add MODULE_FIRMWARE() declarations
drm/i915/dp: Iterate DSC BPP from high to low on all platforms
drm/xe: Fix and re-enable xe_print_blob_ascii85()
drm/xe/devcoredump: Move exec queue snapshot to Contexts section
drm/xe/oa: Set stream->pollin in xe_oa_buffer_check_unlocked
drm/xe/pf: Fix migration initialization
drm/xe/oa: Preserve oa_ctrl unused bits
drm/amd/display: Fix seamless boot sequence
drm/amd/display: Fix out-of-bound accesses
drm/amdgpu: add a BO metadata flag to disable write compression for Vulkan
drm/i915/backlight: Return immediately when scale() finds invalid parameters
drm/i915/dp: Return min bpc supported by source instead of 0
drm/i915/dp: fix the Adaptive sync Operation mode for SDP
drm/i915/guc: Debug print LRC state entries only if the context is pinned
drm/i915: Drop 64bpp YUV formats from ICL+ SDR planes
drm/i915: Fix page cleanup on DMA remap failure
drm/i915/pmu: Fix zero delta busyness issue
...
Linus Torvalds [Fri, 7 Feb 2025 19:05:50 +0000 (11:05 -0800)]
Merge tag 'stable/for-linus-6.14-rc1-tag' of git://git./linux/kernel/git/konrad/ibft
Pull ibft fixes from Konrad Rzeszutek Wilk:
"Two tiny fixes to IBFT code: one for Kconfig and another for IPv6"
* tag 'stable/for-linus-6.14-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/ibft:
iscsi_ibft: Fix UBSAN shift-out-of-bounds warning in ibft_attr_show_nic()
firmware: iscsi_ibft: fix ISCSI_IBFT Kconfig entry
Linus Torvalds [Fri, 7 Feb 2025 19:00:33 +0000 (11:00 -0800)]
Merge tag 'block-6.14-
20250207' of git://git.kernel.dk/linux
Pull block fixes from Jens Axboe:
- MD pull request via Song:
- fix an error handling path for md-linear
- NVMe pull request via Keith:
- Connection fixes for fibre channel transport (Daniel)
- Endian fixes (Keith, Christoph)
- Cleanup fix for host memory buffer (Francis)
- Platform specific power quirks (Georg)
- Target memory leak (Sagi)
- Use appropriate controller state accessor (Daniel)
- Fixup for a regression introduced last week, where sunvdc wasn't
updated for an API change, causing compilation failures on sparc64.
* tag 'block-6.14-
20250207' of git://git.kernel.dk/linux:
drivers/block/sunvdc.c: update the correct AIP call
md: Fix linear_set_limits()
nvme-fc: use ctrl state getter
nvme: make nvme_tls_attrs_group static
nvmet: add a missing endianess conversion in nvmet_execute_admin_connect
nvmet: the result field in nvmet_alloc_ctrl_args is little endian
nvmet: fix a memory leak in controller identify
nvme-fc: do not ignore connectivity loss during connecting
nvme: handle connectivity loss in nvme_set_queue_count
nvme-fc: go straight to connecting state when initializing
nvme-pci: Add TUXEDO IBP Gen9 to Samsung sleep quirk
nvme-pci: Add TUXEDO InfinityFlex to Samsung sleep quirk
nvme-pci: remove redundant dma frees in hmb
nvmet: fix rw control endian access
WangYuli [Fri, 7 Feb 2025 07:08:55 +0000 (15:08 +0800)]
kbuild: install-extmod-build: add missing quotation marks for CC variable
While attempting to build a Debian packages with CC="ccache gcc", I
saw the following error as builddeb builds linux-headers-$KERNELVERSION:
make HOSTCC=ccache gcc VPATH= srcroot=. -f ./scripts/Makefile.build obj=debian/linux-headers-6.14.0-rc1/usr/src/linux-headers-6.14.0-rc1/scripts
make[6]: *** No rule to make target 'gcc'. Stop.
Upon investigation, it seems that one instance of $(CC) variable reference
in ./scripts/package/install-extmod-build was missing quotation marks,
causing the above error.
Add the missing quotation marks around $(CC) to fix build.
Fixes:
5f73e7d0386d ("kbuild: refactor cross-compiling linux-headers package")
Co-developed-by: Mingcong Bai <jeffbai@aosc.io>
Signed-off-by: Mingcong Bai <jeffbai@aosc.io>
Tested-by: WangYuli <wangyuli@uniontech.com>
Signed-off-by: WangYuli <wangyuli@uniontech.com>
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Linus Torvalds [Fri, 7 Feb 2025 18:34:50 +0000 (10:34 -0800)]
Merge tag 'pm-6.14-rc2' of git://git./linux/kernel/git/rafael/linux-pm
Pull power management fixes from Rafael Wysocki:
"These fix a handful of issues in the amd-pstate driver, the airoha
cpufreq driver build, a (recently added) possible NULL pointer
dereference in the cpufreq code and a possible memory leak in the
power capping subsystem:
- Fix cpufreq_policy reference counting and prevent max_perf from
going above the current limit in amd-pstate, and drop a redundant
goto label from it (Dhananjay Ugwekar)
- Prevent the per-policy boost_enabled flag in amd-pstate from
getting out of sync with the actual state after boot failures
(Lifeng Zheng)
- Fix a recently added possible NULL pointer dereference in the
cpufreq core (Aboorva Devarajan)
- Fix a build issue related to CONFIG_OF and COMPILE_TEST
dependencies in the airoha cpufreq driver (Arnd Bergmann)
- Fix a possible memory leak in the power capping subsystem (Joe
Hattori)"
* tag 'pm-6.14-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
cpufreq/amd-pstate: Fix cpufreq_policy ref counting
cpufreq: prevent NULL dereference in cpufreq_online()
cpufreq: airoha: modify CONFIG_OF dependency
cpufreq/amd-pstate: Fix max_perf updation with schedutil
cpufreq/amd-pstate: Remove the goto label in amd_pstate_update_limits
cpufreq/amd-pstate: Fix per-policy boost flag incorrect when fail
powercap: call put_device() on an error path in powercap_register_control_type()
Linus Torvalds [Fri, 7 Feb 2025 18:08:25 +0000 (10:08 -0800)]
Merge tag 'acpi-6.14-rc2' of git://git./linux/kernel/git/rafael/linux-pm
Pull ACPI fixes from Rafael Wysocki:
"These fix three assorted issues, including one recent regression:
- Add an ACPI IRQ override quirk for Eluktronics MECH-17 to make the
internal keyboard work (Gannon Kolding)
- Make acpi_data_prop_read() reflect the OF counterpart behavior in
error cases (Andy Shevchenko)
- Remove recently added strict ACPI PRM handler address checks that
prevented PRM from working on some platforms in the field (Aubrey
Li)"
* tag 'acpi-6.14-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
ACPI: PRM: Remove unnecessary strict handler address checks
ACPI: resource: IRQ override for Eluktronics MECH-17
ACPI: property: Fix return value for nval == 0 in acpi_data_prop_read()
Linus Torvalds [Fri, 7 Feb 2025 17:50:33 +0000 (09:50 -0800)]
Merge tag 'gpio-fixes-for-v6.14-rc2' of git://git./linux/kernel/git/brgl/linux
Pull gpio fixes from Bartosz Golaszewski:
- fix interrupt support in gpio-pca953x
- fix configfs attribute locking in gpio-sim
- limit the visibility of the GPIO_GRGPIO Kconfig symbol to OF systems
only
- update MAINTAINERS
* tag 'gpio-fixes-for-v6.14-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux:
MAINTAINERS: Use my kernel.org address for ACPI GPIO work
gpio: GPIO_GRGPIO should depend on OF
gpio: sim: lock hog configfs items if present
gpio: pca953x: Improve interrupt support
Linus Torvalds [Fri, 7 Feb 2025 17:22:31 +0000 (09:22 -0800)]
Merge tag 'vfs-6.14-rc2.fixes' of git://git./linux/kernel/git/vfs/vfs
Pull vfs fixes from Christian Brauner:
- Fix fsnotify FMODE_NONOTIFY* handling.
This also disables fsnotify on all pseudo files by default apart from
very select exceptions. This carries a regression risk so we need to
watch out and adapt accordingly. However, it is overall a significant
improvement over the current status quo where every rando file can
get fsnotify enabled.
- Cleanup and simplify lockref_init() after recent lockref changes.
- Fix vboxfs build with gcc-15.
- Add an assert into inode_set_cached_link() to catch corrupt links.
- Allow users to also use an empty string check to detect whether a
given mount option string was empty or not.
- Fix how security options were appended to statmount()'s ->mnt_opt
field.
- Fix statmount() selftests to always check the returned mask.
- Fix uninitialized value in vfs_statx_path().
- Fix pidfs_ioctl() sanity checks to guard against ioctl() overloading
and preserve extensibility.
* tag 'vfs-6.14-rc2.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
vfs: sanity check the length passed to inode_set_cached_link()
pidfs: improve ioctl handling
fsnotify: disable pre-content and permission events by default
selftests: always check mask returned by statmount(2)
fsnotify: disable notification by default for all pseudo files
fs: fix adding security options to statmount.mnt_opt
fsnotify: use accessor to set FMODE_NONOTIFY_*
lockref: remove count argument of lockref_init
gfs2: switch to lockref_init(..., 1)
gfs2: use lockref_init for gl_lockref
statmount: let unset strings be empty
vboxsf: fix building with GCC 15
fs/stat.c: avoid harmless garbage value problem in vfs_statx_path()
Linus Torvalds [Fri, 7 Feb 2025 17:16:07 +0000 (09:16 -0800)]
Merge tag 'bcachefs-2025-02-06.2' of git://evilpiepirate.org/bcachefs
Pull bcachefs fixes from Kent Overstreet:
"Nothing major, things continue to be fairly quiet over here.
- add a SubmittingPatches to clarify that patches submitted for
bcachefs do, in fact, need to be tested
- discard path now correctly issues journal flushes when needed, this
fixes performance issues when the filesystem is nearly full and
we're bottlenecked on copygc
- fix a bug that could cause the pending rebalance work accounting to
be off when devices are being onlined/offlined; users should report
if they are still seeing this
- and a few more trivial ones"
* tag 'bcachefs-2025-02-06.2' of git://evilpiepirate.org/bcachefs:
bcachefs: bch2_bkey_sectors_need_rebalance() now only depends on bch_extent_rebalance
bcachefs: Fix rcu imbalance in bch2_fs_btree_key_cache_exit()
bcachefs: Fix discard path journal flushing
bcachefs: fix deadlock in journal_entry_open()
bcachefs: fix incorrect pointer check in __bch2_subvolume_delete()
bcachefs docs: SubmittingPatches.rst
Hector Martin [Thu, 6 Feb 2025 18:21:46 +0000 (03:21 +0900)]
MAINTAINERS: Remove myself
I no longer have any faith left in the kernel development process or
community management approach.
Apple/ARM platform development will continue downstream. If I feel like
sending some patches upstream in the future myself for whatever subtree
I may, or I may not. Anyone who feels like fighting the upstreaming
fight themselves is welcome to do so.
Signed-off-by: Hector Martin <marcan@marcan.st>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pavel Machek [Wed, 5 Feb 2025 18:42:01 +0000 (19:42 +0100)]
MAINTAINERS: Move Pavel to kernel.org address
I need to filter my emails better, switch to pavel@kernel.org address
to help with that.
Signed-off-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Jens Axboe [Fri, 7 Feb 2025 14:23:03 +0000 (07:23 -0700)]
Merge tag 'md-6.14-
20250206' of https://git./linux/kernel/git/mdraid/linux into block-6.14
Pull MD fix from Song:
"This patch, by Bart Van Assche, fixes an error handling path for
md-linear."
* tag 'md-6.14-
20250206' of https://git.kernel.org/pub/scm/linux/kernel/git/mdraid/linux:
md: Fix linear_set_limits()
Rafael J. Wysocki [Fri, 7 Feb 2025 12:06:31 +0000 (13:06 +0100)]
Merge branches 'acpi-property' and 'acpi-resource'
Merge a new ACPI IRQ override quirk for Eluktronics MECH-17 (Gannon
Kolding) and an acpi_data_prop_read() fix making it reflect the OF
counterpart behavior in error cases (Andy Shevchenko).
* acpi-property:
ACPI: property: Fix return value for nval == 0 in acpi_data_prop_read()
* acpi-resource:
ACPI: resource: IRQ override for Eluktronics MECH-17
Rafael J. Wysocki [Fri, 7 Feb 2025 11:43:58 +0000 (12:43 +0100)]
Merge branch 'pm-powercap'
Fix a possible memory leak in the power capping subsystem (Joe Hattori).
* pm-powercap:
powercap: call put_device() on an error path in powercap_register_control_type()
Mateusz Guzik [Tue, 4 Feb 2025 21:32:07 +0000 (22:32 +0100)]
vfs: sanity check the length passed to inode_set_cached_link()
This costs a strlen() call when instatianating a symlink.
Preferably it would be hidden behind VFS_WARN_ON (or compatible), but
there is no such facility at the moment. With the facility in place the
call can be patched out in production kernels.
In the meantime, since the cost is being paid unconditionally, use the
result to a fixup the bad caller.
This is not expected to persist in the long run (tm).
Sample splat:
bad length passed for symlink [/tmp/syz-imagegen43743633/file0/file0] (got 131109, expected 37)
[rest of WARN blurp goes here]
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Link: https://lore.kernel.org/r/20250204213207.337980-1-mjguzik@gmail.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
Christian Brauner [Tue, 4 Feb 2025 13:51:20 +0000 (14:51 +0100)]
pidfs: improve ioctl handling
Pidfs supports extensible and non-extensible ioctls. The extensible
ioctls need to check for the ioctl number itself not just the ioctl
command otherwise both backward- and forward compatibility are broken.
The pidfs ioctl handler also needs to look at the type of the ioctl
command to guard against cases where "[...] a daemon receives some
random file descriptor from a (potentially less privileged) client and
expects the FD to be of some specific type, it might call ioctl() on
this FD with some type-specific command and expect the call to fail if
the FD is of the wrong type; but due to the missing type check, the
kernel instead performs some action that userspace didn't expect."
(cf. [1]]
Link: https://lore.kernel.org/r/20250204-work-pidfs-ioctl-v1-1-04987d239575@kernel.org
Link: https://lore.kernel.org/r/CAG48ez2K9A5GwtgqO31u9ZL292we8ZwAA=TJwwEv7wRuJ3j4Lw@mail.gmail.com
Fixes:
8ce352818820 ("pidfs: check for valid ioctl commands")
Acked-by: Luca Boccassi <luca.boccassi@gmail.com>
Reported-by: Jann Horn <jannh@google.com>
Cc: stable@vger.kernel.org # v6.13; please backport with 8ce352818820 ("pidfs: check for valid ioctl commands")
Signed-off-by: Christian Brauner <brauner@kernel.org>
Christian Brauner [Tue, 4 Feb 2025 10:25:45 +0000 (11:25 +0100)]
Merge patch series "Fix for huge faults regression"
Amir Goldstein <amir73il@gmail.com> says:
The two Fix patches have been tested by Alex together and each one
independently.
I also verified that they pass the LTP inoityf/fanotify tests.
* patches from https://lore.kernel.org/r/
20250203223205.861346-1-amir73il@gmail.com:
fsnotify: disable pre-content and permission events by default
fsnotify: disable notification by default for all pseudo files
fsnotify: use accessor to set FMODE_NONOTIFY_*
Link: https://lore.kernel.org/r/20250203223205.861346-1-amir73il@gmail.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
Amir Goldstein [Mon, 3 Feb 2025 22:32:05 +0000 (23:32 +0100)]
fsnotify: disable pre-content and permission events by default
After introducing pre-content events, we had a regression related to
disabling huge faults on files that should never have pre-content events
enabled.
This happened because the default f_mode of allocated files (0) does
not disable pre-content events.
Pre-content events are disabled in file_set_fsnotify_mode_by_watchers()
but internal files may not get to call this helper.
Initialize f_mode to disable permission and pre-content events for all
files and if needed they will be enabled for the callers of
file_set_fsnotify_mode_by_watchers().
Fixes:
20bf82a898b6 ("mm: don't allow huge faults for files with pre content watches")
Reported-by: Alex Williamson <alex.williamson@redhat.com>
Closes: https://lore.kernel.org/linux-fsdevel/
20250131121703.
1e4d00a7.alex.williamson@redhat.com/
Tested-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Link: https://lore.kernel.org/r/20250203223205.861346-4-amir73il@gmail.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
Miklos Szeredi [Wed, 29 Jan 2025 16:06:41 +0000 (17:06 +0100)]
selftests: always check mask returned by statmount(2)
STATMOUNT_MNT_OPTS can actually be missing if there are no options. This
is a change of behavior since
75ead69a7173 ("fs: don't let statmount return
empty strings").
The other checks shouldn't actually trigger, but add them for correctness
and for easier debugging if the test fails.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Link: https://lore.kernel.org/r/20250129160641.35485-1-mszeredi@redhat.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
Amir Goldstein [Mon, 3 Feb 2025 22:32:04 +0000 (23:32 +0100)]
fsnotify: disable notification by default for all pseudo files
Most pseudo files are not applicable for fsnotify events at all,
let alone to the new pre-content events.
Disable notifications to all files allocated with alloc_file_pseudo()
and enable legacy inotify events for the specific cases of pipe and
socket, which have known users of inotify events.
Pre-content events are also kept disabled for sockets and pipes.
Fixes:
20bf82a898b6 ("mm: don't allow huge faults for files with pre content watches")
Reported-by: Alex Williamson <alex.williamson@redhat.com>
Closes: https://lore.kernel.org/linux-fsdevel/
20250131121703.
1e4d00a7.alex.williamson@redhat.com/
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/linux-fsdevel/CAHk-=wi2pThSVY=zhO=ZKxViBj5QCRX-=AS2+rVknQgJnHXDFg@mail.gmail.com/
Tested-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Link: https://lore.kernel.org/r/20250203223205.861346-3-amir73il@gmail.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
Miklos Szeredi [Wed, 29 Jan 2025 15:12:53 +0000 (16:12 +0100)]
fs: fix adding security options to statmount.mnt_opt
Prepending security options was made conditional on sb->s_op->show_options,
but security options are independent of sb options.
Fixes:
056d33137bf9 ("fs: prepend statmount.mnt_opts string with security_sb_mnt_opts()")
Fixes:
f9af549d1fd3 ("fs: export mount options via statmount()")
Cc: stable@vger.kernel.org # v6.11
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Link: https://lore.kernel.org/r/20250129151253.33241-1-mszeredi@redhat.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
Amir Goldstein [Mon, 3 Feb 2025 22:32:03 +0000 (23:32 +0100)]
fsnotify: use accessor to set FMODE_NONOTIFY_*
The FMODE_NONOTIFY_* bits are a 2-bits mode. Open coding manipulation
of those bits is risky. Use an accessor file_set_fsnotify_mode() to
set the mode.
Rename file_set_fsnotify_mode() => file_set_fsnotify_mode_from_watchers()
to make way for the simple accessor name.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Link: https://lore.kernel.org/r/20250203223205.861346-2-amir73il@gmail.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
Christian Brauner [Thu, 30 Jan 2025 15:04:42 +0000 (16:04 +0100)]
Merge patch series "further lockref cleanups"
Andreas Gruenbacher <agruenba@redhat.com> says:
Here's an updated version with an additional comment saying that
lockref_init() initializes count to 1.
* patches from https://lore.kernel.org/r/
20250130135624.
1899988-1-agruenba@redhat.com:
lockref: remove count argument of lockref_init
gfs2: switch to lockref_init(..., 1)
gfs2: use lockref_init for gl_lockref
Link: https://lore.kernel.org/r/20250130135624.1899988-1-agruenba@redhat.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
Andreas Gruenbacher [Thu, 30 Jan 2025 13:56:23 +0000 (14:56 +0100)]
lockref: remove count argument of lockref_init
All users of lockref_init() now initialize the count to 1, so hardcode
that and remove the count argument.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Link: https://lore.kernel.org/r/20250130135624.1899988-4-agruenba@redhat.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
Andreas Gruenbacher [Thu, 30 Jan 2025 13:56:22 +0000 (14:56 +0100)]
gfs2: switch to lockref_init(..., 1)
In qd_alloc(), initialize the lockref count to 1 to cover the common
case. Compensate for that in gfs2_quota_init() by adjusting the count
back down to 0; this only occurs when mounting the filesystem rw.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Link: https://lore.kernel.org/r/20250130135624.1899988-3-agruenba@redhat.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
Andreas Gruenbacher [Thu, 30 Jan 2025 13:56:21 +0000 (14:56 +0100)]
gfs2: use lockref_init for gl_lockref
Move the initialization of gl_lockref from gfs2_init_glock_once() to
gfs2_glock_get(). This allows to use lockref_init() there.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Link: https://lore.kernel.org/r/20250130135624.1899988-2-agruenba@redhat.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
Miklos Szeredi [Thu, 30 Jan 2025 12:15:00 +0000 (13:15 +0100)]
statmount: let unset strings be empty
Just like it's normal for unset values to be zero, unset strings should be
empty instead of containing random values.
It seems to be a typical mistake that the mask returned by statmount is not
checked, which can result in various bugs.
With this fix, these bugs are prevented, since it is highly likely that
userspace would just want to turn the missing mask case into an empty
string anyway (most of the recently found cases are of this type).
Link: https://lore.kernel.org/all/CAJfpegsVCPfCn2DpM8iiYSS5DpMsLB8QBUCHecoj6s0Vxf4jzg@mail.gmail.com/
Fixes:
68385d77c05b ("statmount: simplify string option retrieval")
Fixes:
46eae99ef733 ("add statmount(2) syscall")
Cc: stable@vger.kernel.org # v6.8
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Link: https://lore.kernel.org/r/20250130121500.113446-1-mszeredi@redhat.com
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Brahmajit Das [Tue, 21 Jan 2025 16:26:48 +0000 (21:56 +0530)]
vboxsf: fix building with GCC 15
Building with GCC 15 results in build error
fs/vboxsf/super.c:24:54: error: initializer-string for array of ‘unsigned char’ is too long [-Werror=unterminated-string-initialization]
24 | static const unsigned char VBSF_MOUNT_SIGNATURE[4] = "\000\377\376\375";
| ^~~~~~~~~~~~~~~~~~
cc1: all warnings being treated as errors
Due to GCC having enabled -Werror=unterminated-string-initialization[0]
by default. Separately initializing each array element of
VBSF_MOUNT_SIGNATURE to ensure NUL termination, thus satisfying GCC 15
and fixing the build error.
[0]: https://gcc.gnu.org/onlinedocs/gcc/Warning-Options.html#index-Wno-unterminated-string-initialization
Signed-off-by: Brahmajit Das <brahmajit.xyz@gmail.com>
Link: https://lore.kernel.org/r/20250121162648.1408743-1-brahmajit.xyz@gmail.com
Reviewed-by: Hans de Goede <hdegoede@redhat.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Su Hui [Sun, 19 Jan 2025 02:59:47 +0000 (10:59 +0800)]
fs/stat.c: avoid harmless garbage value problem in vfs_statx_path()
Clang static checker(scan-build) warning:
fs/stat.c:287:21: warning: The left expression of the compound assignment is
an uninitialized value. The computed value will also be garbage.
287 | stat->result_mask |= STATX_MNT_ID_UNIQUE;
| ~~~~~~~~~~~~~~~~~ ^
fs/stat.c:290:21: warning: The left expression of the compound assignment is
an uninitialized value. The computed value will also be garbage.
290 | stat->result_mask |= STATX_MNT_ID;
When vfs_getattr() failed because of security_inode_getattr(), 'stat' is
uninitialized. In this case, there is a harmless garbage problem in
vfs_statx_path(). It's better to return error directly when
vfs_getattr() failed, avoiding garbage value and more clearly.
Signed-off-by: Su Hui <suhui@nfschina.com>
Link: https://lore.kernel.org/r/20250119025946.1168957-1-suhui@nfschina.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
Frederic Weisbecker [Wed, 5 Feb 2025 16:02:20 +0000 (17:02 +0100)]
timers/migration: Fix off-by-one root mis-connection
Before attaching a new root to the old root, the children counter of the
new root is checked to verify that only the upcoming CPU's top group have
been connected to it. However since the recently added commit
b729cc1ec21a
("timers/migration: Fix another race between hotplug and idle entry/exit")
this check is not valid anymore because the old root is pre-accounted
as a child to the new root. Therefore after connecting the upcoming
CPU's top group to the new root, the children count to be expected must
be 2 and not 1 anymore.
This omission results in the old root to not be connected to the new
root. Then eventually the system may run with more than one top level,
which defeats the purpose of a single idle migrator.
Also the old root is pre-accounted but not connected upon the new root
creation. But it can be connected to the new root later on. Therefore
the old root may be accounted twice to the new root. The propagation of
such overcommit can end up creating a double final top-level root with a
groupmask incorrectly initialized. Although harmless given that the final
top level roots will never have a parent to walk up to, this oddity
opportunistically reported the core issue:
WARNING: CPU: 8 PID: 0 at kernel/time/timer_migration.c:543 tmigr_requires_handle_remote
CPU: 8 UID: 0 PID: 0 Comm: swapper/8
RIP: 0010:tmigr_requires_handle_remote
Call Trace:
<IRQ>
? tmigr_requires_handle_remote
? hrtimer_run_queues
update_process_times
tick_periodic
tick_handle_periodic
__sysvec_apic_timer_interrupt
sysvec_apic_timer_interrupt
</IRQ>
Fix the problem by taking the old root into account in the children count
of the new root so the connection is not omitted.
Also warn when more than one top level group exists to better detect
similar issues in the future.
Fixes:
b729cc1ec21a ("timers/migration: Fix another race between hotplug and idle entry/exit")
Reported-by: Matt Fleming <mfleming@cloudflare.com>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/all/20250205160220.39467-1-frederic@kernel.org
Geert Uytterhoeven [Wed, 5 Feb 2025 14:22:56 +0000 (15:22 +0100)]
genirq: Remove leading space from irq_chip::irq_print_chip() callbacks
The space separator was factored out from the multiple chip name prints,
but several irq_chip::irq_print_chip() callbacks still print a leading
space. Remove the superfluous double spaces.
Fixes:
9d9f204bdf7243bf ("genirq/proc: Add missing space separator back")
Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/893f7e9646d8933cd6786d5a1ef3eb076d263768.1738764803.git.geert+renesas@glider.be
Dave Airlie [Fri, 7 Feb 2025 05:37:12 +0000 (15:37 +1000)]
Merge tag 'drm-intel-fixes-2025-02-06' of https://gitlab.freedesktop.org/drm/i915/kernel into drm-fixes
- Fix the build error with clamp after WARN_ON on gcc 13.x+ (Guenter)
- HDCP related fixes (Suraj)
- PMU fix zero delta busyness issue (Umesh)
- Fix page cleanup on DMA remap failure (Brian)
- Drop 64bpp YUV formats from ICL+ SDR planes (Ville)
- GuC log related fix (Daniele)
- DisplayPort related fixes (Ankit, Jani)
Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Rodrigo Vivi <rodrigo.vivi@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/Z6TDHpgI6dnOc0KI@intel.com
Dave Airlie [Fri, 7 Feb 2025 05:27:23 +0000 (15:27 +1000)]
Merge tag 'drm-xe-fixes-2025-02-06' of https://gitlab.freedesktop.org/drm/xe/kernel into drm-fixes
UAPI Changes:
- OA uAPI related fixes (Ashutosh)
Driver Changes:
- Fix SRIOV migration initialization (Michal)
- Restore devcoredump to a sane state (Lucas)
Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Rodrigo Vivi <rodrigo.vivi@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/Z6S9rI1ScT_5Aw6_@intel.com
Dave Airlie [Fri, 7 Feb 2025 04:47:11 +0000 (14:47 +1000)]
Merge tag 'drm-misc-fixes-2025-02-06' of https://gitlab.freedesktop.org/drm/misc/kernel into drm-fixes
A couple of fixes for ivpu to error handling, komeda for format
handling, AST DP timeout fix when enabling the output, locking fix for
zynqmp DP support, tiled format handling in drm/client, and refcounting
fix for bochs
Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Maxime Ripard <mripard@redhat.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20250206-encouraging-judicious-quoll-adc1dc@houat
Dave Airlie [Fri, 7 Feb 2025 03:53:59 +0000 (13:53 +1000)]
Merge tag 'amd-drm-fixes-6.14-2025-02-05' of https://gitlab.freedesktop.org/agd5f/linux into drm-fixes
amd-drm-fixes-6.14-2025-02-05:
amdgpu:
- Add BO metadata flag for DCC
- Fix potential out of bounds access in display
- Seamless boot fix
- CONFIG_FRAME_WARN fix
- PSR1 fix
UAPI:
- Add new tiling flag for DCC write compress disable
Proposed userspace: https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/33255
Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Alex Deucher <alexander.deucher@amd.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20250205214910.3664690-1-alexander.deucher@amd.com
Kent Overstreet [Sun, 26 Jan 2025 02:29:45 +0000 (21:29 -0500)]
bcachefs: bch2_bkey_sectors_need_rebalance() now only depends on bch_extent_rebalance
Previously, bch2_bkey_sectors_need_rebalance() called
bch2_target_accepts_data(), checking whether the target is writable.
However, this means that adding or removing devices from a target would
change the value of bch2_bkey_sectors_need_rebalance() for an existing
extent; this needs to be invariant so that the extent trigger can
correctly maintain rebalance_work accounting.
Instead, check target_accepts_data() in io_opts_to_rebalance_opts(),
before creating the bch_extent_rebalance entry.
This fixes (one?) cause of rebalance_work accounting being off.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Mon, 3 Feb 2025 16:35:11 +0000 (11:35 -0500)]
bcachefs: Fix rcu imbalance in bch2_fs_btree_key_cache_exit()
Spotted by sparse.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Mon, 27 Jan 2025 06:21:44 +0000 (01:21 -0500)]
bcachefs: Fix discard path journal flushing
The discard path is supposed to issue journal flushes when there's too
many buckets empty buckets that need a journal commit before they can be
written to again, but at some point this code seems to have been lost.
Bring it back with a new optimization to make sure we don't issue too
many journal flushes: the journal now tracks the sequence number of the
most recent flush in progress, which the discard path uses when deciding
which buckets need a journal flush.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>