Jens Axboe [Tue, 27 May 2025 15:35:12 +0000 (09:35 -0600)]
iomap: don't lose folio dropbehind state for overwrites
DONTCACHE I/O must have the completion punted to a workqueue, just like
what is done for unwritten extents, as the completion needs task context
to perform the invalidation of the folio(s). However, if writeback is
started off filemap_fdatawrite_range() off generic_sync() and it's an
overwrite, then the DONTCACHE marking gets lost as iomap_add_to_ioend()
don't look at the folio being added and no further state is passed down
to help it know that this is a dropbehind/DONTCACHE write.
Check if the folio being added is marked as dropbehind, and set
IOMAP_IOEND_DONTCACHE if that is the case. Then XFS can factor this into
the decision making of completion context in xfs_submit_ioend().
Additionally include this ioend flag in the NOMERGE flags, to avoid
mixing it with unrelated IO.
Since this is the 3rd flag that will cause XFS to punt the completion to
a workqueue, add a helper so that each one of them can get appropriately
commented.
This fixes extra page cache being instantiated when the write performed
is an overwrite, rather than newly instantiated blocks.
Fixes:
b2cd5ae693a3 ("iomap: make buffered writes work with RWF_DONTCACHE")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Tue, 27 May 2025 13:15:17 +0000 (07:15 -0600)]
mm/filemap: unify dropbehind flag testing and clearing
The read and write side does this a bit differently, unify it such that
the _{read,write} helpers check the bit before locking, and the generic
handler is in charge of clearing the bit and invalidating, once under
the folio lock.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Tue, 27 May 2025 13:06:43 +0000 (07:06 -0600)]
mm/filemap: unify read/write dropbehind naming
The read side is filemap_end_dropbehind_read(), while the write side
used folio_ as the prefix rather than filemap_. The read side makes more
sense, unify the naming such that the write side follows that.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Tue, 27 May 2025 01:07:40 +0000 (19:07 -0600)]
Revert "Disable FOP_DONTCACHE for now due to bugs"
This reverts commit
478ad02d6844217cc7568619aeb0809d93ade43d.
Both the read and write side dirty && writeback races should be resolved
now, revert the commit that disabled FOP_DONTCACHE for filesystems.
Link: https://lore.kernel.org/linux-fsdevel/20250525083209.GS2023217@ZenIV/
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Tue, 27 May 2025 01:06:38 +0000 (19:06 -0600)]
mm/filemap: use filemap_end_dropbehind() for read invalidation
Use the filemap_end_dropbehind() helper rather than calling
folio_unmap_invalidate() directly, as we need to check if the folio has
been redirtied or marked for writeback once the folio lock has been
re-acquired.
Cc: stable@vger.kernel.org
Reported-by: Trond Myklebust <trondmy@hammerspace.com>
Fixes:
8026e49bff9b ("mm/filemap: add read support for RWF_DONTCACHE")
Link: https://lore.kernel.org/linux-fsdevel/ba8a9805331ce258a622feaca266b163db681a10.camel@hammerspace.com/
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Tue, 27 May 2025 01:03:27 +0000 (19:03 -0600)]
mm/filemap: gate dropbehind invalidate on folio !dirty && !writeback
It's possible for the folio to either get marked for writeback or
redirtied. Add a helper, filemap_end_dropbehind(), which guards the
folio_unmap_invalidate() call behind check for the folio being both
non-dirty and not under writeback AFTER the folio lock has been
acquired. Use this helper folio_end_dropbehind_write().
Cc: stable@vger.kernel.org
Reported-by: Al Viro <viro@zeniv.linux.org.uk>
Fixes:
fb7d3bc41493 ("mm/filemap: drop streaming/uncached pages when writeback completes")
Link: https://lore.kernel.org/linux-fsdevel/20250525083209.GS2023217@ZenIV/
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Linus Torvalds [Mon, 26 May 2025 21:38:57 +0000 (14:38 -0700)]
Merge tag 'powerpc-6.16-1' of git://git./linux/kernel/git/powerpc/linux
Pull powerpc updates from Madhavan Srinivasan:
- Support for dynamic preemption
- Migrate powerpc boards GPIO driver to new setter API
- Added new PMU for KVM host-wide measurement
- Enhancement to htmdump driver to support more functions
- Added character device for couple RTAS supported APIs
- Minor fixes and cleanup
Thanks to Amit Machhiwal, Athira Rajeev, Bagas Sanjaya, Bartosz
Golaszewski, Christophe Leroy, Eddie James, Gaurav Batra, Gautam
Menghani, Geert Uytterhoeven, Haren Myneni, Hari Bathini, Jiri Slaby
(SUSE), Linus Walleij, Michal Suchanek, Naveen N Rao (AMD), Nilay
Shroff, Ricardo B. Marlière, Ritesh Harjani (IBM), Sathvika Vasireddy,
Shrikanth Hegde, Stephen Rothwell, Sourabh Jain, Thorsten Blum, Vaibhav
Jain, Venkat Rao Bagalkote, and Viktor Malik.
* tag 'powerpc-6.16-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (52 commits)
MAINTAINERS: powerpc: Remove myself as a reviewer
powerpc/iommu: Use str_disabled_enabled() helper
powerpc/powermac: Use str_enabled_disabled() and str_on_off() helpers
powerpc/mm/fault: Use str_write_read() helper function
powerpc: Replace strcpy() with strscpy() in proc_ppc64_init()
powerpc/pseries/iommu: Fix kmemleak in TCE table userspace view
powerpc/kernel: Fix ppc_save_regs inclusion in build
powerpc: Transliterate author name and remove FIXME
powerpc/pseries/htmdump: Include header file to get is_kvm_guest() definition
KVM: PPC: Book3S HV: Fix IRQ map warnings with XICS on pSeries KVM Guest
powerpc/8xx: Reduce alignment constraint for kernel memory
powerpc/boot: Fix build with gcc 15
powerpc/pseries/htmdump: Add documentation for H_HTM debugfs interface
powerpc/pseries/htmdump: Add htm capabilities support to htmdump module
powerpc/pseries/htmdump: Add htm flags support to htmdump module
powerpc/pseries/htmdump: Add htm setup support to htmdump module
powerpc/pseries/htmdump: Add htm info support to htmdump module
powerpc/pseries/htmdump: Add htm status support to htmdump module
powerpc/pseries/htmdump: Add htm start support to htmdump module
powerpc/pseries/htmdump: Add htm configure support to htmdump module
...
Linus Torvalds [Mon, 26 May 2025 21:36:05 +0000 (14:36 -0700)]
Merge tag 's390-6.16-1' of git://git./linux/kernel/git/s390/linux
Pull s390 updates from Heiko Carstens:
- Large rework of the protected key crypto code to allow for
asynchronous handling without memory allocation
- Speed up system call entry/exit path by re-implementing lazy ASCE
handling
- Add module autoload support for the diag288_wdt watchdog device
driver
- Get rid of s390 specific strcpy() and strncpy() implementations, and
switch all remaining users to strscpy() when possible
- Various other small fixes and improvements
* tag 's390-6.16-1' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (51 commits)
s390/pci: Serialize device addition and removal
s390/pci: Allow re-add of a reserved but not yet removed device
s390/pci: Prevent self deletion in disable_slot()
s390/pci: Remove redundant bus removal and disable from zpci_release_device()
s390/crypto: Extend protected key conversion retry loop
s390/pci: Fix __pcilg_mio_inuser() inline assembly
s390/ptrace: Always inline regs_get_kernel_stack_nth() and regs_get_register()
s390/thread_info: Cleanup header includes
s390/extmem: Add workaround for DCSS unload diag
s390/crypto: Rework protected key AES for true asynch support
s390/cpacf: Rework cpacf_pcc() to return condition code
s390/mm: Fix potential use-after-free in __crst_table_upgrade()
s390/mm: Add mmap_assert_write_locked() check to crst_table_upgrade()
s390/string: Remove strcpy() implementation
s390/con3270: Use strscpy() instead of strcpy()
s390/boot: Use strspcy() instead of strcpy()
s390: Simple strcpy() to strscpy() conversions
s390/pkey/crypto: Introduce xflags param for pkey in-kernel API
s390/pkey: Provide and pass xflags within pkey and zcrypt layers
s390/uv: Remove uv_get_secret_metadata function
...
Linus Torvalds [Mon, 26 May 2025 21:29:44 +0000 (14:29 -0700)]
Merge tag 'linux_kselftest-kunit-6.16-rc1' of git://git./linux/kernel/git/shuah/linux-kselftest
Pull kunit updates from Shuah Khan:
- Enable qemu_config for riscv32, sparc 64-bit, PowerPC 32-bit BE and
64-bit LE
- Enable CONFIG_SPARC32 to clearly differentiate between sparc 32-bit
and 64-bit configurations
- Enable CONFIG_CPU_BIG_ENDIAN to clearly differentiate between powerpc
LE and BE configurations
- Add feature to list available architectures to kunit tool
- Fixes to bugs and changes to documentation
* tag 'linux_kselftest-kunit-6.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest:
kunit: Fix wrong parameter to kunit_deactivate_static_stub()
kunit: tool: add test counts to JSON output
Documentation: kunit: improve example on testing static functions
kunit: executor: Remove const from kunit_filter_suites() allocation type
kunit: qemu_configs: Disable faulting tests on 32-bit SPARC
kunit: qemu_configs: Add 64-bit SPARC configuration
kunit: qemu_configs: sparc: Explicitly enable CONFIG_SPARC32=y
kunit: qemu_configs: Add PowerPC 32-bit BE and 64-bit LE
kunit: qemu_configs: powerpc: Explicitly enable CONFIG_CPU_BIG_ENDIAN=y
kunit: tool: Implement listing of available architectures
kunit: qemu_configs: Add riscv32 config
kunit: configs: Enable CONFIG_INIT_STACK_ALL_PATTERN in all_tests
Linus Torvalds [Mon, 26 May 2025 21:25:23 +0000 (14:25 -0700)]
Merge tag 'linux_kselftest-next-6.16-rc1' of git://git./linux/kernel/git/shuah/linux-kselftest
Pull Kselftest updates from Shuah Khan:
"Fixes:
- cpufreq test to not double suspend in rtcwake case
- compile error in pid_namespace test
- run_kselftest.sh to use readlink if realpath is not available
- cpufreq basic read and update testcases
- ftrace to add poll to a gen_file so test can find it at run-time
- spelling errors in perf_events test"
* tag 'linux_kselftest-next-6.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest:
selftests/run_kselftest.sh: Use readlink if realpath is not available
selftests/timens: timerfd: Use correct clockid type in tclock_gettime()
selftests/timens: Make run_tests() functions static
selftests/timens: Print TAP headers
selftests: pid_namespace: add missing sys/mount.h include in pid_max.c
kselftest: cpufreq: Get rid of double suspend in rtcwake case
selftests/cpufreq: Fix cpufreq basic read and update testcases
selftests/ftrace: Convert poll to a gen_file
selftests/perf_events: Fix spelling mistake "sycnhronize" -> "synchronize"
Linus Torvalds [Mon, 26 May 2025 21:20:50 +0000 (14:20 -0700)]
Merge tag 'next.2025.05.17a' of git://git./linux/kernel/git/rcu/linux
Pull RCU updates from Joel Fernandes:
- Removed swake_up_one_online() workaround
- Reverted an incorrect rcuog wake-up fix from offline softirq
- Rust RCU Guard methods marked as inline
- Updated MAINTAINERS with Joel’s and Zqiang's new email address
- Replaced magic constant in rcu_seq_done_exact() with named constant
- Added warning mechanism to validate rcu_seq_done_exact()
- Switched SRCU polling API to use rcu_seq_done_exact()
- Commented on redundant delta check in rcu_seq_done_exact()
- Made ->gpwrap tests in rcutorture more frequent
- Fixed reuse of ARM64 images in rcutorture
- rcutorture improved to check Kconfig and reader conflict handling
- Extracted logic from rcu_torture_one_read() for clarity
- Updated LWN RCU API documentation links
- Enabled --do-rt in torture.sh for CONFIG_PREEMPT_RT
- Added tests for SRCU up/down reader primitives
- Added comments and delays checks in rcutorture
- Deprecated srcu_read_lock_lite() and srcu_read_unlock_lite() via checkpatch
- Added --do-normal and --do-no-normal to torture.sh
- Added RCU Rust binding tests to torture.sh
- Reduced CPU overcommit and removed MAXSMP/CPUMASK_OFFSTACK in TREE01
- Replaced kmalloc() with kcalloc() in rcuscale
- Refined listRCU example code for stale data elimination
- Fixed hardirq count bug for x86 in cpu_stall_cputime
- Added safety checks in rcu/nocb for offloaded rdp access
- Other miscellaneous changes
* tag 'next.2025.05.17a' of git://git.kernel.org/pub/scm/linux/kernel/git/rcu/linux: (27 commits)
rcutorture: Fix issue with re-using old images on ARM64
rcutorture: Remove MAXSMP and CPUMASK_OFFSTACK from TREE01
rcutorture: Reduce TREE01 CPU overcommit
torture: Check for "Call trace:" as well as "Call Trace:"
rcutorture: Perform more frequent testing of ->gpwrap
torture: Add testing of RCU's Rust bindings to torture.sh
torture: Add --do-{,no-}normal to torture.sh
checkpatch: Deprecate srcu_read_lock_lite() and srcu_read_unlock_lite()
rcutorture: Comment invocations of tick_dep_set_task()
rcu/nocb: Add Safe checks for access offloaded rdp
rcuscale: using kcalloc() to relpace kmalloc()
doc/RCU/listRCU: refine example code for eliminating stale data
doc: Update LWN RCU API links in whatisRCU.rst
Revert "rcu/nocb: Fix rcuog wake-up from offline softirq"
rust: sync: rcu: Mark Guard methods as inline
rcu/cpu_stall_cputime: fix the hardirq count for x86 architecture
rcu: Remove swake_up_one_online() bandaid
MAINTAINERS: Update Zqiang's email address
rcutorture: Make torture.sh --do-rt use CONFIG_PREEMPT_RT
srcu: Use rcu_seq_done_exact() for polling API
...
Linus Torvalds [Mon, 26 May 2025 21:12:31 +0000 (14:12 -0700)]
Merge tag 'tpmdd-next-6.16' of git://git./linux/kernel/git/jarkko/linux-tpmdd
Pull tpm updates from Jarkko Sakkinen:
"This is only a small pull request with fixes, as possible features
moved to +1 release"
* tag 'tpmdd-next-6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/jarkko/linux-tpmdd:
tpm_crb: ffa_tpm: fix/update comments describing the CRB over FFA ABI
tpm_crb_ffa: use dev_xx() macro to print log
tpm_ffa_crb: access tpm service over FF-A direct message request v2
tpm: remove kmalloc failure error message
Linus Torvalds [Mon, 26 May 2025 20:47:28 +0000 (13:47 -0700)]
Merge tag 'v6.16-p1' of git://git./linux/kernel/git/herbert/crypto-2.6
Pull crypto updates from Herbert Xu:
"API:
- Fix memcpy_sglist to handle partially overlapping SG lists
- Use memcpy_sglist to replace null skcipher
- Rename CRYPTO_TESTS to CRYPTO_BENCHMARK
- Flip CRYPTO_MANAGER_DISABLE_TEST into CRYPTO_SELFTESTS
- Hide CRYPTO_MANAGER
- Add delayed freeing of driver crypto_alg structures
Compression:
- Allocate large buffers on first use instead of initialisation in scomp
- Drop destination linearisation buffer in scomp
- Move scomp stream allocation into acomp
- Add acomp scatter-gather walker
- Remove request chaining
- Add optional async request allocation
Hashing:
- Remove request chaining
- Add optional async request allocation
- Move partial block handling into API
- Add ahash support to hmac
- Fix shash documentation to disallow usage in hard IRQs
Algorithms:
- Remove unnecessary SIMD fallback code on x86 and arm/arm64
- Drop avx10_256 xts(aes)/ctr(aes) on x86
- Improve avx-512 optimisations for xts(aes)
- Move chacha arch implementations into lib/crypto
- Move poly1305 into lib/crypto and drop unused Crypto API algorithm
- Disable powerpc/poly1305 as it has no SIMD fallback
- Move sha256 arch implementations into lib/crypto
- Convert deflate to acomp
- Set block size correctly in cbcmac
Drivers:
- Do not use sg_dma_len before mapping in sun8i-ss
- Fix warm-reboot failure by making shutdown do more work in qat
- Add locking in zynqmp-sha
- Remove cavium/zip
- Add support for PCI device 0x17D8 to ccp
- Add qat_6xxx support in qat
- Add support for RK3576 in rockchip-rng
- Add support for i.MX8QM in caam
Others:
- Fix irq_fpu_usable/kernel_fpu_begin inconsistency during CPU bring-up
- Add new SEV/SNP platform shutdown API in ccp"
* tag 'v6.16-p1' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (382 commits)
x86/fpu: Fix irq_fpu_usable() to return false during CPU onlining
crypto: qat - add missing header inclusion
crypto: api - Redo lookup on EEXIST
Revert "crypto: testmgr - Add hash export format testing"
crypto: marvell/cesa - Do not chain submitted requests
crypto: powerpc/poly1305 - add depends on BROKEN for now
Revert "crypto: powerpc/poly1305 - Add SIMD fallback"
crypto: ccp - Add missing tee info reg for teev2
crypto: ccp - Add missing bootloader info reg for pspv5
crypto: sun8i-ce - move fallback ahash_request to the end of the struct
crypto: octeontx2 - Use dynamic allocated memory region for lmtst
crypto: octeontx2 - Initialize cptlfs device info once
crypto: xts - Only add ecb if it is not already there
crypto: lrw - Only add ecb if it is not already there
crypto: testmgr - Add hash export format testing
crypto: testmgr - Use ahash for generic tfm
crypto: hmac - Add ahash support
crypto: testmgr - Ignore EEXIST on shash allocation
crypto: algapi - Add driver template support to crypto_inst_setname
crypto: shash - Set reqsize in shash_alg
...
Linus Torvalds [Mon, 26 May 2025 20:32:06 +0000 (13:32 -0700)]
Merge tag 'crc-for-linus' of git://git./linux/kernel/git/ebiggers/linux
Pull CRC updates from Eric Biggers:
"Cleanups for the kernel's CRC (cyclic redundancy check) code:
- Use __ro_after_init where appropriate
- Remove unnecessary static_key on s390
- Rename some source code files
- Rename the crc32 and crc32c crypto API modules
- Use subsys_initcall instead of arch_initcall
- Restore maintainers for crc_kunit.c
- Fold crc16_byte() into crc16.c
- Add some SPDX license identifiers"
* tag 'crc-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/linux:
lib/crc32: add SPDX license identifier
lib/crc16: unexport crc16_table and crc16_byte()
w1: ds2406: use crc16() instead of crc16_byte() loop
MAINTAINERS: add crc_kunit.c back to CRC LIBRARY
lib/crc: make arch-optimized code use subsys_initcall
crypto: crc32 - remove "generic" from file and module names
x86/crc: drop "glue" from filenames
sparc/crc: drop "glue" from filenames
s390/crc: drop "glue" from filenames
powerpc/crc: rename crc32-vpmsum_core.S to crc-vpmsum-template.S
powerpc/crc: drop "glue" from filenames
arm64/crc: drop "glue" from filenames
arm/crc: drop "glue" from filenames
s390/crc32: Remove no-op module init and exit functions
s390/crc32: Remove have_vxrs static key
lib/crc: make the CPU feature static keys __ro_after_init
Linus Torvalds [Mon, 26 May 2025 20:27:40 +0000 (13:27 -0700)]
Merge tag 'fscrypt-for-linus' of git://git./fs/fscrypt/linux
Pull fscrypt update from Eric Biggers:
"Add support for 'hardware-wrapped inline encryption keys' to fscrypt.
When enabled on supported platforms, this feature protects file
contents keys from certain attacks, such as cold boot attacks.
This feature uses the block layer support for wrapped keys which was
merged in 6.15. Wrapped key support has existed out-of-tree in Android
for a long time, and it's finally ready for upstream now that there is
a platform on which it works end-to-end with upstream.
Specifically, it works on the Qualcomm SM8650 HDK, using the Qualcomm
ICE (Inline Crypto Engine) and HWKM (Hardware Key Manager). The
corresponding driver support is included in the SCSI tree for 6.16.
Validation for this feature includes two new tests that were already
merged into xfstests (generic/368 and generic/369)"
* tag 'fscrypt-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/linux:
fscrypt: add support for hardware-wrapped keys
Linus Torvalds [Mon, 26 May 2025 19:56:01 +0000 (12:56 -0700)]
Merge tag 'xfs-merge-6.16' of git://git./fs/xfs/xfs-linux
Pull xfs updates from Carlos Maiolino:
- Atomic writes for XFS
- Remove experimental warnings for pNFS, scrub and parent pointers
* tag 'xfs-merge-6.16' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (26 commits)
xfs: add inode to zone caching for data placement
xfs: free the item in xfs_mru_cache_insert on failure
xfs: remove the EXPERIMENTAL warning for pNFS
xfs: remove some EXPERIMENTAL warnings
xfs: Remove deprecated xfs_bufd sysctl parameters
xfs: stop using set_blocksize
xfs: allow sysadmins to specify a maximum atomic write limit at mount time
xfs: update atomic write limits
xfs: add xfs_calc_atomic_write_unit_max()
xfs: add xfs_file_dio_write_atomic()
xfs: commit CoW-based atomic writes atomically
xfs: add large atomic writes checks in xfs_direct_write_iomap_begin()
xfs: add xfs_atomic_write_cow_iomap_begin()
xfs: refine atomic write size check in xfs_file_write_iter()
xfs: refactor xfs_reflink_end_cow_extent()
xfs: allow block allocator to take an alignment hint
xfs: ignore HW which cannot atomic write a single block
xfs: add helpers to compute transaction reservation for finishing intent items
xfs: add helpers to compute log item overhead
xfs: separate out setting buftarg atomic writes limits
...
Linus Torvalds [Mon, 26 May 2025 19:47:41 +0000 (12:47 -0700)]
Merge tag 'erofs-for-6.16-rc1' of git://git./linux/kernel/git/xiang/erofs
Pull erofs updates from Gao Xiang:
"In this cycle, Intel QAT hardware accelerators are supported to
improve DEFLATE decompression performance. I've tested it with the
enwik9 dataset of 1 MiB pclusters on our Intel Sapphire Rapids
bare-metal server and a PL0 ESSD, and the sequential read performance
even surpasses LZ4 software decompression on this setup.
In addition, a `fsoffset` mount option is introduced for file-backed
mounts to specify the filesystem offset in order to adapt customized
container formats.
And other improvements and minor cleanups. Summary:
- Add a `fsoffset` mount option to specify the filesystem offset
- Support Intel QAT accelerators to boost up the DEFLATE algorithm
- Initialize per-CPU workers and CPU hotplug hooks lazily to avoid
unnecessary overhead when EROFS is not mounted
- Fix file handle encoding for 64-bit NIDs
- Minor cleanups"
* tag 'erofs-for-6.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs:
erofs: support DEFLATE decompression by using Intel QAT
erofs: clean up erofs_{init,exit}_sysfs()
erofs: add 'fsoffset' mount option to specify filesystem offset
erofs: lazily initialize per-CPU workers and CPU hotplug hooks
erofs: refine readahead tracepoint
erofs: avoid using multiple devices with different type
erofs: fix file handle encoding for 64-bit NIDs
Linus Torvalds [Mon, 26 May 2025 19:43:30 +0000 (12:43 -0700)]
Merge tag 'bcachefs-2025-05-24' of git://evilpiepirate.org/bcachefs
Pull bcachefs updates from Kent Overstreet:
- Poisoned extents can now be moved: this lets us handle bitrotted data
without deleting it. For now, reading from poisoned extents only
returns -EIO: in the future we'll have an API for specifying "read
this data even if there were bitflips".
- Incompatible features may now be enabled at runtime, via
"opts/version_upgrade" in sysfs. Toggle it to incompatible, and then
toggle it back - option changes via the sysfs interface are
persistent.
- Various changes to support deployable disk images:
- RO mounts now use less memory
- Images may be stripped of alloc info, particularly useful for
slimming them down if they will primarily be mounted RO. Alloc
info will be automatically regenerated on first RW mount, and
this is quite fast
- Filesystem images generated with 'bcachefs image' will be
automatically resized the first time they're mounted on a larger
device
The images 'bcachefs image' generates with compression enabled have
been comparable in size to those generated by squashfs and erofs -
but you get a full RW capable filesystem
- Major error message improvements for btree node reads, data reads,
and elsewhere. We now build up a single error message that lists all
the errors encountered, actions taken to repair, and success/failure
of the IO. This extends to other error paths that may kick off other
actions, e.g. scheduling recovery passes: actions we took because of
an error are included in that error message, with
grouping/indentation so we can see what caused what.
- New option, 'rebalance_on_ac_only'. Does exactly what the name
suggests, quite handy with background compression.
- Repair/self healing:
- We can now kick off recovery passes and run them in the
background if we detect errors. Currently, this is just used by
code that walks backpointers. We now also check for missing
backpointers at runtime and run check_extents_to_backpointers if
required. The messy 6.14 upgrade left missing backpointers for
some users, and this will correct that automatically instead of
requiring a manual fsck - some users noticed this as copygc
spinning and not making progress.
In the future, as more recovery passes come online, we'll be able
to repair and recover from nearly anything - except for
unreadable btree nodes, and that's why you're using replication,
of course - without shutting down the filesystem.
- There's a new recovery pass, for checking the rebalance_work
btree, which tracks extents that rebalance will process later.
- Hardening:
- Close the last known hole in btree iterator/btree locking
assertions: path->should_be_locked paths must stay locked until
the end of the transaction. This shook out a few bugs, including
a performance issue that was causing unnecessary path_upgrade
transaction restarts.
- Performance:
- Faster snapshot deletion: this is an incompatible feature, as it
requires new sentinal values, for safety. Snapshot deletion no
longer has to do a full metadata scan, it now just scans the
inodes btree: if an extent/dirent/xattr is present for a given
snapshot ID, we already require that an inode be present with
that same snapshot ID.
If/when users hit scalability limits again (ridiculously huge
filesystems with lots of inodes, and many sparse snapshots), let
me know - the next step will be to add an index from snapshot ID
-> inode number, which won't be too hard.
- Faster device removal: the "scan for pointers to this device" no
longer does a full metadata scan, instead it walks backpointers.
Like fast snapshot deletion this is another incompat feature: it
also requires a new sentinal value, because we don't want to
reuse these device IDs until after a fsck.
- We're now coalescing redundant accounting updates prior to
transaction commit, taking some pressure off the journal. Shortly
we'll also be doing multiple extent updates in a transaction in
the main write path, which combined with the previous should
drastically cut down on the amount of metadata updates we have to
journal.
- Stack usage improvements: All allocator state has been moved off the
stack
- Debug improvements:
- enumerated refcounts: The debug code previously used for
filesystem write refs is now a small library, and used for other
heavily used refcounts. Different users of a refcount are
enumerated, making it much easier to debug refcount issues.
- Async object debugging: There's a new kconfig option that makes
various async objects (different types of bios, data updates,
write ops, etc.) visible in debugfs, and it should be fast enough
to leave on in production.
- Various sets of assertions no longer require
CONFIG_BCACHEFS_DEBUG, instead they're controlled by module
parameters and static keys, meaning users won't need to compile
custom kernels as often to help debug issues.
- bch2_trans_kmalloc() calls can be tracked (there's a new kconfig
option). With it on you can check the btree_transaction_stats in
debugfs to see the bch2_trans_kmalloc() calls a transaction did
when it used the most memory.
* tag 'bcachefs-2025-05-24' of git://evilpiepirate.org/bcachefs: (218 commits)
bcachefs: Don't mount bs > ps without TRANSPARENT_HUGEPAGE
bcachefs: Fix btree_iter_next_node() for new locking asserts
bcachefs: Ensure we don't use a blacklisted journal seq
bcachefs: Small check_fix_ptr fixes
bcachefs: Fix opts.recovery_pass_last
bcachefs: Fix allocate -> self healing path
bcachefs: Fix endianness in casefold check/repair
bcachefs: Path must be locked if trans->locked && should_be_locked
bcachefs: Simplify bch2_path_put()
bcachefs: Plumb btree_trans for more locking asserts
bcachefs: Clear trans->locked before unlock
bcachefs: Clear should_be_locked before unlock in key_cache_drop()
bcachefs: bch2_path_get() reuses paths if upgrade_fails & !should_be_locked
bcachefs: Give out new path if upgrade fails
bcachefs: Fix btree_path_get_locks when not doing trans restart
bcachefs: btree_node_locked_type_nowrite()
bcachefs: Kill bch2_path_put_nokeep()
bcachefs: bch2_journal_write_checksum()
bcachefs: Reduce stack usage in data_update_index_update()
bcachefs: bch2_trans_log_str()
...
Linus Torvalds [Mon, 26 May 2025 19:35:08 +0000 (12:35 -0700)]
Merge tag 'gfs2-for-6.16' of git://git./linux/kernel/git/gfs2/linux-gfs2
Pull gfs2 updates from Andreas Gruenbacher:
- Fix the long-standing warnings in inode_to_wb() when CONFIG_LOCKDEP
is enabled: gfs2 doesn't support cgroup writeback and so inode->i_wb
will never change. This is the counterpart of commit
9e888998ea4d
("writeback: fix false warning in inode_to_wb()")
- Fix a hang introduced by commit
8d391972ae2d ("gfs2: Remove
__gfs2_writepage()"): prevent gfs2_logd from creating transactions
for jdata pages while trying to flush the log
- Fix a race between gfs2_create_inode() and gfs2_evict_inode() by
deallocating partially created inodes on the gfs2_create_inode()
error path
- Fix a bug in the journal head lookup code that could cause mount to
fail after successful recovery
- Various smaller fixes and cleanups from various people
* tag 'gfs2-for-6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2: (23 commits)
gfs2: No more gfs2_find_jhead caching
gfs2: Get rid of duplicate log head lookup
gfs2: Simplify clean_journal
gfs2: Simplify gfs2_log_pointers_init
gfs2: Move gfs2_log_pointers_init
gfs2: Minor comments fix
gfs2: Don't start unnecessary transactions during log flush
gfs2: Move gfs2_trans_add_databufs
gfs2: Rename jdata_dirty_folio to gfs2_jdata_dirty_folio
gfs2: avoid inefficient use of crc32_le_shift()
gfs2: Do not call iomap_zero_range beyond eof
gfs: don't check for AOP_WRITEPAGE_ACTIVATE in gfs2_write_jdata_batch
gfs2: Fix usage of bio->bi_status in gfs2_end_log_write
gfs2: deallocate inodes in gfs2_create_inode
gfs2: Move GIF_ALLOC_FAILED check out of gfs2_ea_dealloc
gfs2: Move gfs2_dinode_dealloc
gfs2: Don't reread inodes unnecessarily
gfs2: gfs2_create_inode error handling fix
gfs2: Remove unnecessary NULL check before free_percpu()
gfs2: check sb_min_blocksize return value
...
Linus Torvalds [Mon, 26 May 2025 19:28:55 +0000 (12:28 -0700)]
Merge tag 'configfs-for-v6.16' of git://git./linux/kernel/git/a.hindborg/linux
Pull configfs updates from Andreas Hindborg:
- Allow creation of rw files with custom permissions. This allows
drivers to better protect secrets written through configfs
- Fix a bug where an error condition did not cause an early return
while populating attributes
- Report ENOMEM rather than EFAULT when kvasprintf() fails in
config_item_set_name()
- Add a Rust API for configfs. This allows Rust drivers to use configfs
through a memory safe interface
* tag 'configfs-for-v6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/a.hindborg/linux:
MAINTAINERS: add configfs Rust abstractions
rust: configfs: add a sample demonstrating configfs usage
rust: configfs: introduce rust support for configfs
configfs: Correct error value returned by API config_item_set_name()
configfs: Do not override creating attribute file failure in populate_attrs()
configfs: Delete semicolon from macro type_print() definition
configfs: Add CONFIGFS_ATTR_PERM helper
Linus Torvalds [Mon, 26 May 2025 19:24:43 +0000 (12:24 -0700)]
Merge tag 'for-6.16-tag' of git://git./linux/kernel/git/kdave/linux
Pull btrfs updates from David Sterba:
"Apart from numerous cleanups, there are some performance improvements
and one minor mount option update. There's one more radix-tree
conversion (one remaining), and continued work towards enabling large
folios (almost finished).
Performance:
- extent buffer conversion to xarray gains throughput and runtime
improvements on metadata heavy operations doing writeback (sample
test shows +50% throughput, -33% runtime)
- extent io tree cleanups lead to performance improvements by
avoiding unnecessary searches or repeated searches
- more efficient extent unpinning when committing transaction
(estimated run time improvement 3-5%)
User visible changes:
- remove standalone mount option 'nologreplay', deprecated in 5.9,
replacement is 'rescue=nologreplay'
- in scrub, update reporting, add back device stats message after
detected errors (accidentally removed during recent refactoring)
Core:
- convert extent buffer radix tree to xarray
- in subpage mode, move block perfect compression out of experimental
build
- in zoned mode, introduce sub block groups to allow managing special
block groups, like the one for relocation or tree-log, to handle
some corner cases of ENOSPC
- in scrub, simplify bitmaps for block tracking status
- continued preparations for large folios:
- remove assertions for folio order 0
- add support where missing: compression, buffered write, defrag,
hole punching, subpage, send
- fix fsync of files with no hard links not persisting deletion
- reject tree blocks which are not nodesize aligned, a precaution
from 4.9 times
- move transaction abort calls closer to the error sites
- remove usage of some struct bio_vec internals
- simplifications in extent map
- extent IO cleanups and optimizations
- error handling improvements
- enhanced ASSERT() macro with optional format strings
- cleanups:
- remove unused code
- naming unifications, dropped __, added prefix
- merge similar functions
- use common helpers for various data structures"
* tag 'for-6.16-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (198 commits)
btrfs: move misplaced comment of btrfs_path::keep_locks
btrfs: remove standalone "nologreplay" mount option
btrfs: use a single variable to track return value at btrfs_page_mkwrite()
btrfs: don't return VM_FAULT_SIGBUS on failure to set delalloc for mmap write
btrfs: simplify early error checking in btrfs_page_mkwrite()
btrfs: pass true to btrfs_delalloc_release_space() at btrfs_page_mkwrite()
btrfs: fix wrong start offset for delalloc space release during mmap write
btrfs: fix harmless race getting delayed ref head count when running delayed refs
btrfs: log error codes during failures when writing super blocks
btrfs: simplify error return logic when getting folio at prepare_one_folio()
btrfs: return real error from __filemap_get_folio() calls
btrfs: remove superfluous return value check at btrfs_dio_iomap_begin()
btrfs: fix invalid data space release when truncating block in NOCOW mode
btrfs: update Kconfig option descriptions
btrfs: update list of features built under experimental config
btrfs: send: remove btrfs_debug() calls
btrfs: use boolean for delalloc argument to btrfs_free_reserved_extent()
btrfs: use boolean for delalloc argument to btrfs_free_reserved_bytes()
btrfs: fold error checks when allocating ordered extent and update comments
btrfs: check we grabbed inode reference when allocating an ordered extent
...
Linus Torvalds [Mon, 26 May 2025 19:13:22 +0000 (12:13 -0700)]
Merge tag 'for-6.16/io_uring-
20250523' of git://git.kernel.dk/linux
Pull io_uring updates from Jens Axboe:
- Avoid indirect function calls in io-wq for executing and freeing
work.
The design of io-wq is such that it can be a generic mechanism, but
as it's just used by io_uring now, may as well avoid these indirect
calls
- Clean up registered buffers for networking
- Add support for IORING_OP_PIPE. Pretty straight forward, allows
creating pipes with io_uring, particularly useful for having these be
instantiated as direct descriptors
- Clean up the coalescing support fore registered buffers
- Add support for multiple interface queues for zero-copy rx
networking. As this feature was merged for 6.15 it supported just a
single ifq per ring
- Clean up the eventfd support
- Add dma-buf support to zero-copy rx
- Clean up and improving the request draining support
- Clean up provided buffer support, most notably with an eye toward
making the legacy support less intrusive
- Minor fdinfo cleanups, dropping support for dumping what credentials
are registered
- Improve support for overflow CQE handling, getting rid of GFP_ATOMIC
for allocating overflow entries where possible
- Improve detection of cases where io-wq doesn't need to spawn a new
worker unnecessarily
- Various little cleanups
* tag 'for-6.16/io_uring-
20250523' of git://git.kernel.dk/linux: (59 commits)
io_uring/cmd: warn on reg buf imports by ineligible cmds
io_uring/io-wq: only create a new worker if it can make progress
io_uring/io-wq: ignore non-busy worker going to sleep
io_uring/io-wq: move hash helpers to the top
trace/io_uring: fix io_uring_local_work_run ctx documentation
io_uring: finish IOU_OK -> IOU_COMPLETE transition
io_uring: add new helpers for posting overflows
io_uring: pass in struct io_big_cqe to io_alloc_ocqe()
io_uring: make io_alloc_ocqe() take a struct io_cqe pointer
io_uring: split alloc and add of overflow
io_uring: open code io_req_cqe_overflow()
io_uring/fdinfo: get rid of dumping credentials
io_uring/fdinfo: only compile if CONFIG_PROC_FS is set
io_uring/kbuf: unify legacy buf provision and removal
io_uring/kbuf: refactor __io_remove_buffers
io_uring/kbuf: don't compute size twice on prep
io_uring/kbuf: drop extra vars in io_register_pbuf_ring
io_uring/kbuf: use mem_is_zero()
io_uring/kbuf: account ring io_buffer_list memory
io_uring: drain based on allocates reqs
...
Linus Torvalds [Mon, 26 May 2025 18:39:36 +0000 (11:39 -0700)]
Merge tag 'for-6.16/block-
20250523' of git://git.kernel.dk/linux
Pull block updates from Jens Axboe:
- ublk updates:
- Add support for updating the size of a ublk instance
- Zero-copy improvements
- Auto-registering of buffers for zero-copy
- Series simplifying and improving GET_DATA and request lookup
- Series adding quiesce support
- Lots of selftests additions
- Various cleanups
- NVMe updates via Christoph:
- add per-node DMA pools and use them for PRP/SGL allocations
(Caleb Sander Mateos, Keith Busch)
- nvme-fcloop refcounting fixes (Daniel Wagner)
- support delayed removal of the multipath node and optionally
support the multipath node for private namespaces (Nilay Shroff)
- support shared CQs in the PCI endpoint target code (Wilfred
Mallawa)
- support admin-queue only authentication (Hannes Reinecke)
- use the crc32c library instead of the crypto API (Eric Biggers)
- misc cleanups (Christoph Hellwig, Marcelo Moreira, Hannes
Reinecke, Leon Romanovsky, Gustavo A. R. Silva)
- MD updates via Yu:
- Fix that normal IO can be starved by sync IO, found by mkfs on
newly created large raid5, with some clean up patches for bdev
inflight counters
- Clean up brd, getting rid of atomic kmaps and bvec poking
- Add loop driver specifically for zoned IO testing
- Eliminate blk-rq-qos calls with a static key, if not enabled
- Improve hctx locking for when a plug has IO for multiple queues
pending
- Remove block layer bouncing support, which in turn means we can
remove the per-node bounce stat as well
- Improve blk-throttle support
- Improve delay support for blk-throttle
- Improve brd discard support
- Unify IO scheduler switching. This should also fix a bunch of lockdep
warnings we've been seeing, after enabling lockdep support for queue
freezing/unfreezeing
- Add support for block write streams via FDP (flexible data placement)
on NVMe
- Add a bunch of block helpers, facilitating the removal of a bunch of
duplicated boilerplate code
- Remove obsolete BLK_MQ pci and virtio Kconfig options
- Add atomic/untorn write support to blktrace
- Various little cleanups and fixes
* tag 'for-6.16/block-
20250523' of git://git.kernel.dk/linux: (186 commits)
selftests: ublk: add test for UBLK_F_QUIESCE
ublk: add feature UBLK_F_QUIESCE
selftests: ublk: add test case for UBLK_U_CMD_UPDATE_SIZE
traceevent/block: Add REQ_ATOMIC flag to block trace events
ublk: run auto buf unregisgering in same io_ring_ctx with registering
io_uring: add helper io_uring_cmd_ctx_handle()
ublk: remove io argument from ublk_auto_buf_reg_fallback()
ublk: handle ublk_set_auto_buf_reg() failure correctly in ublk_fetch()
selftests: ublk: add test for covering UBLK_AUTO_BUF_REG_FALLBACK
selftests: ublk: support UBLK_F_AUTO_BUF_REG
ublk: support UBLK_AUTO_BUF_REG_FALLBACK
ublk: register buffer to local io_uring with provided buf index via UBLK_F_AUTO_BUF_REG
ublk: prepare for supporting to register request buffer automatically
ublk: convert to refcount_t
selftests: ublk: make IO & device removal test more stressful
nvme: rename nvme_mpath_shutdown_disk to nvme_mpath_remove_disk
nvme: introduce multipath_always_on module param
nvme-multipath: introduce delayed removal of the multipath head node
nvme-pci: derive and better document max segments limits
nvme-pci: use struct_size for allocation struct nvme_dev
...
Linus Torvalds [Mon, 26 May 2025 18:32:28 +0000 (11:32 -0700)]
Merge tag 'vfs-6.16-rc1.selftests' of git://git./linux/kernel/git/vfs/vfs
Pull vfs selftests updates from Christian Brauner:
"This contains various cleanups, fixes, and extensions for out
filesystem selftests"
* tag 'vfs-6.16-rc1.selftests' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
selftests/fs/mount-notify: add a test variant running inside userns
selftests/filesystems: create setup_userns() helper
selftests/filesystems: create get_unique_mnt_id() helper
selftests/fs/mount-notify: build with tools include dir
selftests/mount_settattr: remove duplicate syscall definitions
selftests/pidfd: move syscall definitions into wrappers.h
selftests/fs/statmount: build with tools include dir
selftests/filesystems: move wrapper.h out of overlayfs subdir
selftests/mount_settattr: ensure that ext4 filesystem can be created
selftests/mount_settattr: add missing STATX_MNT_ID_UNIQUE define
selftests/mount_settattr: don't define sys_open_tree() twice
Linus Torvalds [Mon, 26 May 2025 18:28:42 +0000 (11:28 -0700)]
Merge tag 'vfs-6.16-rc1.iomap' of git://git./linux/kernel/git/vfs/vfs
Pull iomap updates from Christian Brauner:
- More fallout and preparatory work associated with the folio batch
prototype posted a while back.
Mainly this just cleans up some of the helpers and pushes some
pos/len trimming further down in the write begin path.
- Add missing flag descriptions to the iomap documentation
* tag 'vfs-6.16-rc1.iomap' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
iomap: rework iomap_write_begin() to return folio offset and length
iomap: push non-large folio check into get folio path
iomap: helper to trim pos/bytes to within folio
iomap: drop pos param from __iomap_[get|put]_folio()
iomap: drop unnecessary pos param from iomap_write_[begin|end]
iomap: resample iter->pos after iomap_write_begin() calls
iomap: trace: Add missing flags to [IOMAP_|IOMAP_F_]FLAGS_STRINGS
Documentation: iomap: Add missing flags description
Linus Torvalds [Mon, 26 May 2025 18:17:01 +0000 (11:17 -0700)]
Merge tag 'vfs-6.16-rc1.coredump' of git://git./linux/kernel/git/vfs/vfs
Pull coredump updates from Christian Brauner:
"This adds support for sending coredumps over an AF_UNIX socket. It
also makes (implicit) use of the new SO_PEERPIDFD ability to hand out
pidfds for reaped peer tasks
The new coredump socket will allow userspace to not have to rely on
usermode helpers for processing coredumps and provides a saf way to
handle them instead of relying on super privileged coredumping helpers
This will also be significantly more lightweight since the kernel
doens't have to do a fork()+exec() for each crashing process to spawn
a usermodehelper. Instead the kernel just connects to the AF_UNIX
socket and userspace can process it concurrently however it sees fit.
Support for userspace is incoming starting with systemd-coredump
There's more work coming in that direction next cycle. The rest below
goes into some details and background
Coredumping currently supports two modes:
(1) Dumping directly into a file somewhere on the filesystem.
(2) Dumping into a pipe connected to a usermode helper process
spawned as a child of the system_unbound_wq or kthreadd
For simplicity I'm mostly ignoring (1). There's probably still some
users of (1) out there but processing coredumps in this way can be
considered adventurous especially in the face of set*id binaries
The most common option should be (2) by now. It works by allowing
userspace to put a string into /proc/sys/kernel/core_pattern like:
|/usr/lib/systemd/systemd-coredump %P %u %g %s %t %c %h
The "|" at the beginning indicates to the kernel that a pipe must be
used. The path following the pipe indicator is a path to a binary that
will be spawned as a usermode helper process. Any additional
parameters pass information about the task that is generating the
coredump to the binary that processes the coredump
In the example the core_pattern shown causes the kernel to spawn
systemd-coredump as a usermode helper. There's various conceptual
consequences of this (non-exhaustive list):
- systemd-coredump is spawned with file descriptor number 0 (stdin)
connected to the read-end of the pipe. All other file descriptors
are closed. That specifically includes 1 (stdout) and 2 (stderr).
This has already caused bugs because userspace assumed that this
cannot happen (Whether or not this is a sane assumption is
irrelevant)
- systemd-coredump will be spawned as a child of system_unbound_wq.
So it is not a child of any userspace process and specifically not
a child of PID 1. It cannot be waited upon and is in a weird hybrid
upcall which are difficult for userspace to control correctly
- systemd-coredump is spawned with full kernel privileges. This
necessitates all kinds of weird privilege dropping excercises in
userspace to make this safe
- A new usermode helper has to be spawned for each crashing process
This adds a new mode:
(3) Dumping into an AF_UNIX socket
Userspace can set /proc/sys/kernel/core_pattern to:
@/path/to/coredump.socket
The "@" at the beginning indicates to the kernel that an AF_UNIX
coredump socket will be used to process coredumps
The coredump socket must be located in the initial mount namespace.
When a task coredumps it opens a client socket in the initial network
namespace and connects to the coredump socket:
- The coredump server uses SO_PEERPIDFD to get a stable handle on the
connected crashing task. The retrieved pidfd will provide a stable
reference even if the crashing task gets SIGKILLed while generating
the coredump. That is a huge attack vector right now
- By setting core_pipe_limit non-zero userspace can guarantee that
the crashing task cannot be reaped behind it's back and thus
process all necessary information in /proc/<pid>. The SO_PEERPIDFD
can be used to detect whether /proc/<pid> still refers to the same
process
The core_pipe_limit isn't used to rate-limit connections to the
socket. This can simply be done via AF_UNIX socket directly
- The pidfd for the crashing task will contain information how the
task coredumps. The PIDFD_GET_INFO ioctl gained a new flag
PIDFD_INFO_COREDUMP which can be used to retreive the coredump
information
If the coredump gets a new coredump client connection the kernel
guarantees that PIDFD_INFO_COREDUMP information is available.
Currently the following information is provided in the new
@coredump_mask extension to struct pidfd_info:
* PIDFD_COREDUMPED is raised if the task did actually coredump
* PIDFD_COREDUMP_SKIP is raised if the task skipped coredumping
(e.g., undumpable)
* PIDFD_COREDUMP_USER is raised if this is a regular coredump and
doesn't need special care by the coredump server
* PIDFD_COREDUMP_ROOT is raised if the generated coredump should
be treated as sensitive and the coredump server should restrict
access to the generated coredump to sufficiently privileged
users"
* tag 'vfs-6.16-rc1.coredump' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
mips, net: ensure that SOCK_COREDUMP is defined
selftests/coredump: add tests for AF_UNIX coredumps
selftests/pidfd: add PIDFD_INFO_COREDUMP infrastructure
coredump: validate socket name as it is written
coredump: show supported coredump modes
pidfs, coredump: add PIDFD_INFO_COREDUMP
coredump: add coredump socket
coredump: reflow dump helpers a little
coredump: massage do_coredump()
coredump: massage format_corename()
Linus Torvalds [Mon, 26 May 2025 17:30:02 +0000 (10:30 -0700)]
Merge tag 'vfs-6.16-rc1.pidfs' of git://git./linux/kernel/git/vfs/vfs
Pull pidfs updates from Christian Brauner:
"Features:
- Allow handing out pidfds for reaped tasks for AF_UNIX SO_PEERPIDFD
socket option
SO_PEERPIDFD is a socket option that allows to retrieve a pidfd for
the process that called connect() or listen(). This is heavily used
to safely authenticate clients in userspace avoiding security bugs
due to pid recycling races (dbus, polkit, systemd, etc.)
SO_PEERPIDFD currently doesn't support handing out pidfds if the
sk->sk_peer_pid thread-group leader has already been reaped. In
this case it currently returns EINVAL. Userspace still wants to get
a pidfd for a reaped process to have a stable handle it can pass
on. This is especially useful now that it is possible to retrieve
exit information through a pidfd via the PIDFD_GET_INFO ioctl()'s
PIDFD_INFO_EXIT flag
Another summary has been provided by David Rheinsberg:
> A pidfd can outlive the task it refers to, and thus user-space
> must already be prepared that the task underlying a pidfd is
> gone at the time they get their hands on the pidfd. For
> instance, resolving the pidfd to a PID via the fdinfo must be
> prepared to read `-1`.
>
> Despite user-space knowing that a pidfd might be stale, several
> kernel APIs currently add another layer that checks for this. In
> particular, SO_PEERPIDFD returns `EINVAL` if the peer-task was
> already reaped, but returns a stale pidfd if the task is reaped
> immediately after the respective alive-check.
>
> This has the unfortunate effect that user-space now has two ways
> to check for the exact same scenario: A syscall might return
> EINVAL/ESRCH/... *or* the pidfd might be stale, even though
> there is no particular reason to distinguish both cases. This
> also propagates through user-space APIs, which pass on pidfds.
> They must be prepared to pass on `-1` *or* the pidfd, because
> there is no guaranteed way to get a stale pidfd from the kernel.
>
> Userspace must already deal with a pidfd referring to a reaped
> task as the task may exit and get reaped at any time will there
> are still many pidfds referring to it
In order to allow handing out reaped pidfd SO_PEERPIDFD needs to
ensure that PIDFD_INFO_EXIT information is available whenever a
pidfd for a reaped task is created by PIDFD_INFO_EXIT. The uapi
promises that reaped pidfds are only handed out if it is guaranteed
that the caller sees the exit information:
TEST_F(pidfd_info, success_reaped)
{
struct pidfd_info info = {
.mask = PIDFD_INFO_CGROUPID | PIDFD_INFO_EXIT,
};
/*
* Process has already been reaped and PIDFD_INFO_EXIT been set.
* Verify that we can retrieve the exit status of the process.
*/
ASSERT_EQ(ioctl(self->child_pidfd4, PIDFD_GET_INFO, &info), 0);
ASSERT_FALSE(!!(info.mask & PIDFD_INFO_CREDS));
ASSERT_TRUE(!!(info.mask & PIDFD_INFO_EXIT));
ASSERT_TRUE(WIFEXITED(info.exit_code));
ASSERT_EQ(WEXITSTATUS(info.exit_code), 0);
}
To hand out pidfds for reaped processes we thus allocate a pidfs
entry for the relevant sk->sk_peer_pid at the time the
sk->sk_peer_pid is stashed and drop it when the socket is
destroyed. This guarantees that exit information will always be
recorded for the sk->sk_peer_pid task and we can hand out pidfds
for reaped processes
- Hand a pidfd to the coredump usermode helper process
Give userspace a way to instruct the kernel to install a pidfd for
the crashing process into the process started as a usermode helper.
There's still tricky race-windows that cannot be easily or
sometimes not closed at all by userspace. There's various ways like
looking at the start time of a process to make sure that the
usermode helper process is started after the crashing process but
it's all very very brittle and fraught with peril
The crashed-but-not-reaped process can be killed by userspace
before coredump processing programs like systemd-coredump have had
time to manually open a PIDFD from the PID the kernel provides
them, which means they can be tricked into reading from an
arbitrary process, and they run with full privileges as they are
usermode helper processes
Even if that specific race-window wouldn't exist it's still the
safest and cleanest way to let the kernel provide the pidfd
directly instead of requiring userspace to do it manually. In
parallel with this commit we already have systemd adding support
for this in [1]
When the usermode helper process is forked we install a pidfd file
descriptor three into the usermode helper's file descriptor table
so it's available to the exec'd program
Since usermode helpers are either children of the system_unbound_wq
workqueue or kthreadd we know that the file descriptor table is
empty and can thus always use three as the file descriptor number
Note, that we'll install a pidfd for the thread-group leader even
if a subthread is calling do_coredump(). We know that task linkage
hasn't been removed yet and even if this @current isn't the actual
thread-group leader we know that the thread-group leader cannot be
reaped until
@current has exited
- Allow telling when a task has not been found from finding the wrong
task when creating a pidfd
We currently report EINVAL whenever a struct pid has no tasked
attached anymore thereby conflating two concepts:
(1) The task has already been reaped
(2) The caller requested a pidfd for a thread-group leader but the
pid actually references a struct pid that isn't used as a
thread-group leader
This is causing issues for non-threaded workloads as in where they
expect ESRCH to be reported, not EINVAL
So allow userspace to reliably distinguish between (1) and (2)
- Make it possible to detect when a pidfs entry would outlive the
struct pid it pinned
- Add a range of new selftests
Cleanups:
- Remove unneeded NULL check from pidfd_prepare() for passed struct
pid
- Avoid pointless reference count bump during release_task()
Fixes:
- Various fixes to the pidfd and coredump selftests
- Fix error handling for replace_fd() when spawning coredump usermode
helper"
* tag 'vfs-6.16-rc1.pidfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
pidfs: detect refcount bugs
coredump: hand a pidfd to the usermode coredump helper
coredump: fix error handling for replace_fd()
pidfs: move O_RDWR into pidfs_alloc_file()
selftests: coredump: Raise timeout to 2 minutes
selftests: coredump: Fix test failure for slow machines
selftests: coredump: Properly initialize pointer
net, pidfs: enable handing out pidfds for reaped sk->sk_peer_pid
pidfs: get rid of __pidfd_prepare()
net, pidfs: prepare for handing out pidfds for reaped sk->sk_peer_pid
pidfs: register pid in pidfs
net, pidfd: report EINVAL for ESRCH
release_task: kill the no longer needed get/put_pid(thread_pid)
pidfs: ensure consistent ENOENT/ESRCH reporting
exit: move wake_up_all() pidfd waiters into __unhash_process()
selftest/pidfd: add test for thread-group leader pidfd open for thread
pidfd: improve uapi when task isn't found
pidfd: remove unneeded NULL check from pidfd_prepare()
selftests/pidfd: adapt to recent changes
Linus Torvalds [Mon, 26 May 2025 16:55:56 +0000 (09:55 -0700)]
Merge tag 'vfs-6.16-rc1.mount' of git://git./linux/kernel/git/vfs/vfs
Pull vfs mount updates from Christian Brauner:
"This contains minor mount updates for this cycle:
- mnt->mnt_devname can never be NULL so simplify the code handling
that case
- Add a comment about concurrent changes during statmount() and
listmount()
- Update the STATMOUNT_SUPPORTED macro
- Convert mount flags to an enum"
* tag 'vfs-6.16-rc1.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
statmount: update STATMOUNT_SUPPORTED macro
fs: convert mount flags to enum
->mnt_devname is never NULL
mount: add a comment about concurrent changes with statmount()/listmount()
Linus Torvalds [Mon, 26 May 2025 16:33:44 +0000 (09:33 -0700)]
Merge tag 'vfs-6.16-rc1.super' of git://git./linux/kernel/git/vfs/vfs
Pull vfs freezing updates from Christian Brauner:
"This contains various filesystem freezing related work for this cycle:
- Allow the power subsystem to support filesystem freeze for suspend
and hibernate.
Now all the pieces are in place to actually allow the power
subsystem to freeze/thaw filesystems during suspend/resume.
Filesystems are only frozen and thawed if the power subsystem does
actually own the freeze.
If the filesystem is already frozen by the time we've frozen all
userspace processes we don't care to freeze it again. That's
userspace's job once the process resumes. We only actually freeze
filesystems if we absolutely have to and we ignore other failures
to freeze.
We could bubble up errors and fail suspend/resume if the error
isn't EBUSY (aka it's already frozen) but I don't think that this
is worth it. Filesystem freezing during suspend/resume is
best-effort. If the user has 500 ext4 filesystems mounted and 4
fail to freeze for whatever reason then we simply skip them.
What we have now is already a big improvement and let's see how we
fare with it before making our lives even harder (and uglier) than
we have to.
- Allow efivars to support freeze and thaw
Allow efivarfs to partake to resync variable state during system
hibernation and suspend. Add freeze/thaw support.
This is a pretty straightforward implementation. We simply add
regular freeze/thaw support for both userspace and the kernel.
efivars is the first pseudofilesystem that adds support for
filesystem freezing and thawing.
The simplicity comes from the fact that we simply always resync
variable state after efivarfs has been frozen. It doesn't matter
whether that's because of suspend, userspace initiated freeze or
hibernation. Efivars is simple enough that it doesn't matter that
we walk all dentries. There are no directories and there aren't
insane amounts of entries and both freeze/thaw are already
heavy-handed operations. If userspace initiated a freeze/thaw cycle
they would need CAP_SYS_ADMIN in the initial user namespace (as
that's where efivarfs is mounted) so it can't be triggered by
random userspace. IOW, we really really don't care"
* tag 'vfs-6.16-rc1.super' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
f2fs: fix freezing filesystem during resize
kernfs: add warning about implementing freeze/thaw
efivarfs: support freeze/thaw
power: freeze filesystems during suspend/resume
libfs: export find_next_child()
super: add filesystem freezing helpers for suspend and hibernate
gfs2: pass through holder from the VFS for freeze/thaw
super: use common iterator (Part 2)
super: use a common iterator (Part 1)
super: skip dying superblocks early
super: simplify user_get_super()
super: remove pointless s_root checks
fs: allow all writers to be frozen
locking/percpu-rwsem: add freezable alternative to down_read
Linus Torvalds [Mon, 26 May 2025 16:02:39 +0000 (09:02 -0700)]
Merge tag 'vfs-6.16-rc1.misc' of git://git./linux/kernel/git/vfs/vfs
Pull misc vfs updates from Christian Brauner:
"This contains the usual selections of misc updates for this cycle.
Features:
- Use folios for symlinks in the page cache
FUSE already uses folios for its symlinks. Mirror that conversion
in the generic code and the NFS code. That lets us get rid of a few
folio->page->folio conversions in this path, and some of the few
remaining users of read_cache_page() / read_mapping_page()
- Try and make a few filesystem operations killable on the VFS
inode->i_mutex level
- Add sysctl vfs_cache_pressure_denom for bulk file operations
Some workloads need to preserve more dentries than we currently
allow through out sysctl interface
A HDFS servers with 12 HDDs per server, on a HDFS datanode startup
involves scanning all files and caching their metadata (including
dentries and inodes) in memory. Each HDD contains approximately 2
million files, resulting in a total of ~20 million cached dentries
after initialization
To minimize dentry reclamation, they set vfs_cache_pressure to 1.
Despite this configuration, memory pressure conditions can still
trigger reclamation of up to 50% of cached dentries, reducing the
cache from 20 million to approximately 10 million entries. During
the subsequent cache rebuild period, any HDFS datanode restart
operation incurs substantial latency penalties until full cache
recovery completes
To maintain service stability, more dentries need to be preserved
during memory reclamation. The current minimum reclaim ratio (1/100
of total dentries) remains too aggressive for such workload. This
patch introduces vfs_cache_pressure_denom for more granular cache
pressure control
The configuration [vfs_cache_pressure=1,
vfs_cache_pressure_denom=10000] effectively maintains the full 20
million dentry cache under memory pressure, preventing datanode
restart performance degradation
- Avoid some jumps in inode_permission() using likely()/unlikely()
- Avid a memory access which is most likely a cache miss when
descending into devcgroup_inode_permission()
- Add fastpath predicts for stat() and fdput()
- Anonymous inodes currently don't come with a proper mode causing
issues in the kernel when we want to add useful VFS debug assert.
Fix that by giving them a proper mode and masking it off when we
report it to userspace which relies on them not having any mode
- Anonymous inodes currently allow to change inode attributes because
the VFS falls back to simple_setattr() if i_op->setattr isn't
implemented. This means the ownership and mode for every single
user of anon_inode_inode can be changed. Block that as it's either
useless or actively harmful. If specific ownership is needed the
respective subsystem should allocate anonymous inodes from their
own private superblock
- Raise SB_I_NODEV and SB_I_NOEXEC on the anonymous inode superblock
- Add proper tests for anonymous inode behavior
- Make it easy to detect proper anonymous inodes and to ensure that
we can detect them in codepaths such as readahead()
Cleanups:
- Port pidfs to the new anon_inode_{g,s}etattr() helpers
- Try to remove the uselib() system call
- Add unlikely branch hint return path for poll
- Add unlikely branch hint on return path for core_sys_select
- Don't allow signals to interrupt getdents copying for fuse
- Provide a size hint to dir_context for during readdir()
- Use writeback_iter directly in mpage_writepages
- Update compression and mtime descriptions in initramfs
documentation
- Update main netfs API document
- Remove useless plus one in super_cache_scan()
- Remove unnecessary NULL-check guards during setns()
- Add separate separate {get,put}_cgroup_ns no-op cases
Fixes:
- Fix typo in root= kernel parameter description
- Use KERN_INFO for infof()|info_plog()|infofc()
- Correct comments of fs_validate_description()
- Mark an unlikely if condition with unlikely() in
vfs_parse_monolithic_sep()
- Delete macro fsparam_u32hex()
- Remove unused and problematic validate_constant_table()
- Fix potential unsigned integer underflow in fs_name()
- Make file-nr output the total allocated file handles"
* tag 'vfs-6.16-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (43 commits)
fs: Pass a folio to page_put_link()
nfs: Use a folio in nfs_get_link()
fs: Convert __page_get_link() to use a folio
fs/read_write: make default_llseek() killable
fs/open: make do_truncate() killable
fs/open: make chmod_common() and chown_common() killable
include/linux/fs.h: add inode_lock_killable()
readdir: supply dir_context.count as readdir buffer size hint
vfs: Add sysctl vfs_cache_pressure_denom for bulk file operations
fuse: don't allow signals to interrupt getdents copying
Documentation: fix typo in root= kernel parameter description
include/cgroup: separate {get,put}_cgroup_ns no-op case
kernel/nsproxy: remove unnecessary guards
fs: use writeback_iter directly in mpage_writepages
fs: remove useless plus one in super_cache_scan()
fs: add S_ANON_INODE
fs: remove uselib() system call
device_cgroup: avoid access to ->i_rdev in the common case in devcgroup_inode_permission()
fs/fs_parse: Remove unused and problematic validate_constant_table()
fs: touch up predicts in inode_permission()
...
Linus Torvalds [Mon, 26 May 2025 15:46:59 +0000 (08:46 -0700)]
Merge tag 'vfs-6.16-rc1.mount.api' of git://git./linux/kernel/git/vfs/vfs
Pull vfs mount api conversions from Christian Brauner:
"This converts the bfs and omfs filesystems to the new mount api"
* tag 'vfs-6.16-rc1.mount.api' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
omfs: convert to new mount API
bfs: convert bfs to use the new mount api
Linus Torvalds [Mon, 26 May 2025 15:23:09 +0000 (08:23 -0700)]
Merge tag 'vfs-6.16-rc1.writepage' of git://git./linux/kernel/git/vfs/vfs
Pull final writepage conversion from Christian Brauner:
"This converts vboxfs from ->writepage() to ->writepages().
This was the last user of the ->writepage() method. So remove
->writepage() completely and all references to it"
* tag 'vfs-6.16-rc1.writepage' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
fs: Remove aops->writepage
mm: Remove swap_writepage() and shmem_writepage()
ttm: Call shmem_writeout() from ttm_backup_backup_page()
i915: Use writeback_iter()
shmem: Add shmem_writeout()
writeback: Remove writeback_use_writepage()
migrate: Remove call to ->writepage
vboxsf: Convert to writepages
9p: Add a migrate_folio method
Linus Torvalds [Mon, 26 May 2025 15:02:43 +0000 (08:02 -0700)]
Merge tag 'vfs-6.16-rc1.async.dir' of git://git./linux/kernel/git/vfs/vfs
Pull vfs directory lookup updates from Christian Brauner:
"This contains cleanups for the lookup_one*() family of helpers.
We expose a set of functions with names containing "lookup_one_len"
and others without the "_len". This difference has nothing to do with
"len". It's rater a historical accident that can be confusing.
The functions without "_len" take a "mnt_idmap" pointer. This is found
in the "vfsmount" and that is an important question when choosing
which to use: do you have a vfsmount, or are you "inside" the
filesystem. A related question is "is permission checking relevant
here?".
nfsd and cachefiles *do* have a vfsmount but *don't* use the non-_len
functions. They pass nop_mnt_idmap and refuse to work on filesystems
which have any other idmap.
This work changes nfsd and cachefile to use the lookup_one family of
functions and to explictily pass &nop_mnt_idmap which is consistent
with all other vfs interfaces used where &nop_mnt_idmap is explicitly
passed.
The remaining uses of the "_one" functions do not require permission
checks so these are renamed to be "_noperm" and the permission
checking is removed.
This series also changes these lookup function to take a qstr instead
of separate name and len. In many cases this simplifies the call"
* tag 'vfs-6.16-rc1.async.dir' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
VFS: change lookup_one_common and lookup_noperm_common to take a qstr
Use try_lookup_noperm() instead of d_hash_and_lookup() outside of VFS
VFS: rename lookup_one_len family to lookup_noperm and remove permission check
cachefiles: Use lookup_one() rather than lookup_one_len()
nfsd: Use lookup_one() rather than lookup_one_len()
VFS: improve interface for lookup_one functions
Eric Biggers [Sun, 18 May 2025 19:32:12 +0000 (12:32 -0700)]
x86/fpu: Fix irq_fpu_usable() to return false during CPU onlining
irq_fpu_usable() incorrectly returned true before the FPU is
initialized. The x86 CPU onlining code can call sha256() to checksum
AMD microcode images, before the FPU is initialized. Since sha256()
recently gained a kernel-mode FPU optimized code path, a crash occurred
in kernel_fpu_begin_mask() during hotplug CPU onlining.
(The crash did not occur during boot-time CPU onlining, since the
optimized sha256() code is not enabled until subsys_initcalls run.)
Fix this by making irq_fpu_usable() return false before fpu__init_cpu()
has run. To do this without adding any additional overhead to
irq_fpu_usable(), replace the existing per-CPU bool in_kernel_fpu with
kernel_fpu_allowed which tracks both initialization and usage rather
than just usage. The initial state is false; FPU initialization sets it
to true; kernel-mode FPU sections toggle it to false and then back to
true; and CPU offlining restores it to the initial state of false.
Fixes:
11d7956d526f ("crypto: x86/sha256 - implement library instead of shash")
Reported-by: Ayush Jain <Ayush.Jain3@amd.com>
Closes: https://lore.kernel.org/r/
20250516112217.GBaCcf6Yoc6LkIIryP@fat_crate.local
Signed-off-by: Eric Biggers <ebiggers@google.com>
Tested-by: Ayush Jain <Ayush.Jain3@amd.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Linus Torvalds [Sun, 25 May 2025 23:09:23 +0000 (16:09 -0700)]
Linux 6.15
Linus Torvalds [Sun, 25 May 2025 22:43:36 +0000 (15:43 -0700)]
Disable FOP_DONTCACHE for now due to bugs
This is kind of last-minute, but Al Viro reported that the new
FOP_DONTCACHE flag causes memory corruption due to use-after-free
issues.
This was triggered by commit
974c5e6139db ("xfs: flag as supporting
FOP_DONTCACHE"), but that is not the underlying bug - it is just the
first user of the flag.
Vlastimil Babka suspects the underlying problem stems from the
folio_end_writeback() logic introduced in commit
fb7d3bc414939
("mm/filemap: drop streaming/uncached pages when writeback completes").
The most straightforward fix would be to just revert the commit that
exposed this, but Matthew Wilcox points out that other filesystems are
also starting to enable the FOP_DONTCACHE logic, so this instead
disables that bit globally for now.
The fix will hopefully end up being trivial and we can just re-enable
this logic after more testing, but until such a time we'll have to
disable the new FOP_DONTCACHE flag.
Reported-by: Al Viro <viro@zeniv.linux.org.uk>
Link: https://lore.kernel.org/all/20250525083209.GS2023217@ZenIV/
Triggered-by: 974c5e6139db ("xfs: flag as supporting FOP_DONTCACHE")
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Darrick J. Wong <djwong@kernel.org>
Cc: Christian Brauner <brauner@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Linus Torvalds [Sun, 25 May 2025 14:48:35 +0000 (07:48 -0700)]
Merge tag 'mm-hotfixes-stable-2025-05-25-00-58' of git://git./linux/kernel/git/akpm/mm
Pull hotfixes from Andrew Morton:
"22 hotfixes.
13 are cc:stable and the remainder address post-6.14 issues or aren't
considered necessary for -stable kernels. 19 are for MM"
* tag 'mm-hotfixes-stable-2025-05-25-00-58' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (22 commits)
mailmap: add Jarkko's employer email address
mm: fix copy_vma() error handling for hugetlb mappings
memcg: always call cond_resched() after fn()
mm/hugetlb: fix kernel NULL pointer dereference when replacing free hugetlb folios
mm: vmalloc: only zero-init on vrealloc shrink
mm: vmalloc: actually use the in-place vrealloc region
alloc_tag: allocate percpu counters for module tags dynamically
module: release codetag section when module load fails
mm/cma: make detection of highmem_start more robust
MAINTAINERS: add mm memory policy section
MAINTAINERS: add mm ksm section
kasan: avoid sleepable page allocation from atomic context
highmem: add folio_test_partial_kmap()
MAINTAINERS: add hung-task detector section
taskstats: fix struct taskstats breaks backward compatibility since version 15
mm/truncate: fix out-of-bounds when doing a right-aligned split
MAINTAINERS: add mm reclaim section
MAINTAINERS: update page allocator section
mm: fix VM_UFFD_MINOR == VM_SHADOW_STACK on USERFAULTFD=y && ARM64_GCS=y
mm: mmap: map MAP_STACK to VM_NOHUGEPAGE only if THP is enabled
...
Jarkko Sakkinen [Fri, 23 May 2025 12:11:04 +0000 (15:11 +0300)]
mailmap: add Jarkko's employer email address
Add the current employer email address to mailmap.
Link: https://lkml.kernel.org/r/20250523121105.15850-1-jarkko@kernel.org
Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org>
Cc: Alexander Sverdlin <alexander.sverdlin@gmail.com>
Cc: Antonio Quartulli <antonio@openvpn.net>
Cc: Carlos Bilbao <carlos.bilbao@kernel.org>
Cc: Kees Cook <kees@kernel.org>
Cc: Simon Wunderlich <sw@simonwunderlich.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Ricardo Cañuelo Navarro [Fri, 23 May 2025 12:19:10 +0000 (14:19 +0200)]
mm: fix copy_vma() error handling for hugetlb mappings
If, during a mremap() operation for a hugetlb-backed memory mapping,
copy_vma() fails after the source vma has been duplicated and opened (ie.
vma_link() fails), the error is handled by closing the new vma. This
updates the hugetlbfs reservation counter of the reservation map which at
this point is referenced by both the source vma and the new copy. As a
result, once the new vma has been freed and copy_vma() returns, the
reservation counter for the source vma will be incorrect.
This patch addresses this corner case by clearing the hugetlb private page
reservation reference for the new vma and decrementing the reference
before closing the vma, so that vma_close() won't update the reservation
counter. This is also what copy_vma_and_data() does with the source vma
if copy_vma() succeeds, so a helper function has been added to do the
fixup in both functions.
The issue was reported by a private syzbot instance and can be reproduced
using the C reproducer in [1]. It's also a possible duplicate of public
syzbot report [2]. The WARNING report is:
============================================================
page_counter underflow: -1024 nr_pages=1024
WARNING: CPU: 0 PID: 3287 at mm/page_counter.c:61 page_counter_cancel+0xf6/0x120
Modules linked in:
CPU: 0 UID: 0 PID: 3287 Comm: repro__WARNING_ Not tainted 6.15.0-rc7+ #54 NONE
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS
rel-1.16.3-2-gc13ff2cd-prebuilt.qemu.org 04/01/2014
RIP: 0010:page_counter_cancel+0xf6/0x120
Code: ff 5b 41 5e 41 5f 5d c3 cc cc cc cc e8 f3 4f 8f ff c6 05 64 01 27 06 01 48 c7 c7 60 15 f8 85 48 89 de 4c 89 fa e8 2a a7 51 ff <0f> 0b e9 66 ff ff ff 44 89 f9 80 e1 07 38 c1 7c 9d 4c 81
RSP: 0018:
ffffc900025df6a0 EFLAGS:
00010246
RAX:
2edfc409ebb44e00 RBX:
fffffffffffffc00 RCX:
ffff8880155f0000
RDX:
0000000000000000 RSI:
0000000000000001 RDI:
0000000000000000
RBP:
dffffc0000000000 R08:
ffffffff81c4a23c R09:
1ffff1100330482a
R10:
dffffc0000000000 R11:
ffffed100330482b R12:
0000000000000000
R13:
ffff888058a882c0 R14:
ffff888058a882c0 R15:
0000000000000400
FS:
0000000000000000(0000) GS:
ffff88808fc53000(0000) knlGS:
0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0:
0000000080050033
CR2:
00000000004b33e0 CR3:
00000000076d6000 CR4:
00000000000006f0
Call Trace:
<TASK>
page_counter_uncharge+0x33/0x80
hugetlb_cgroup_uncharge_counter+0xcb/0x120
hugetlb_vm_op_close+0x579/0x960
? __pfx_hugetlb_vm_op_close+0x10/0x10
remove_vma+0x88/0x130
exit_mmap+0x71e/0xe00
? __pfx_exit_mmap+0x10/0x10
? __mutex_unlock_slowpath+0x22e/0x7f0
? __pfx_exit_aio+0x10/0x10
? __up_read+0x256/0x690
? uprobe_clear_state+0x274/0x290
? mm_update_next_owner+0xa9/0x810
__mmput+0xc9/0x370
exit_mm+0x203/0x2f0
? __pfx_exit_mm+0x10/0x10
? taskstats_exit+0x32b/0xa60
do_exit+0x921/0x2740
? do_raw_spin_lock+0x155/0x3b0
? __pfx_do_exit+0x10/0x10
? __pfx_do_raw_spin_lock+0x10/0x10
? _raw_spin_lock_irq+0xc5/0x100
do_group_exit+0x20c/0x2c0
get_signal+0x168c/0x1720
? __pfx_get_signal+0x10/0x10
? schedule+0x165/0x360
arch_do_signal_or_restart+0x8e/0x7d0
? __pfx_arch_do_signal_or_restart+0x10/0x10
? __pfx___se_sys_futex+0x10/0x10
syscall_exit_to_user_mode+0xb8/0x2c0
do_syscall_64+0x75/0x120
entry_SYSCALL_64_after_hwframe+0x76/0x7e
RIP: 0033:0x422dcd
Code: Unable to access opcode bytes at 0x422da3.
RSP: 002b:
00007ff266cdb208 EFLAGS:
00000246 ORIG_RAX:
00000000000000ca
RAX:
0000000000000001 RBX:
00007ff266cdbcdc RCX:
0000000000422dcd
RDX:
00000000000f4240 RSI:
0000000000000081 RDI:
00000000004c7bec
RBP:
00007ff266cdb220 R08:
203a6362696c6720 R09:
203a6362696c6720
R10:
0000200000c00000 R11:
0000000000000246 R12:
ffffffffffffffd0
R13:
0000000000000002 R14:
00007ffe1cb5f520 R15:
00007ff266cbb000
</TASK>
============================================================
Link: https://lkml.kernel.org/r/20250523-warning_in_page_counter_cancel-v2-1-b6df1a8cfefd@igalia.com
Link: https://people.igalia.com/rcn/kernel_logs/20250422__WARNING_in_page_counter_cancel__repro.c
Link: https://lore.kernel.org/all/67000a50.050a0220.49194.048d.GAE@google.com/
Signed-off-by: Ricardo Cañuelo Navarro <rcn@igalia.com>
Suggested-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Florent Revest <revest@google.com>
Cc: Jann Horn <jannh@google.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Breno Leitao [Fri, 23 May 2025 17:21:06 +0000 (10:21 -0700)]
memcg: always call cond_resched() after fn()
I am seeing soft lockup on certain machine types when a cgroup OOMs. This
is happening because killing the process in certain machine might be very
slow, which causes the soft lockup and RCU stalls. This happens usually
when the cgroup has MANY processes and memory.oom.group is set.
Example I am seeing in real production:
[462012.244552] Memory cgroup out of memory: Killed process
3370438 (crosvm) ....
....
[462037.318059] Memory cgroup out of memory: Killed process
4171372 (adb) ....
[462037.348314] watchdog: BUG: soft lockup - CPU#64 stuck for 26s! [stat_manager-ag:
1618982]
....
Quick look at why this is so slow, it seems to be related to serial flush
for certain machine types. For all the crashes I saw, the target CPU was
at console_flush_all().
In the case above, there are thousands of processes in the cgroup, and it
is soft locking up before it reaches the 1024 limit in the code (which
would call the cond_resched()). So, cond_resched() in 1024 blocks is not
sufficient.
Remove the counter-based conditional rescheduling logic and call
cond_resched() unconditionally after each task iteration, after fn() is
called. This avoids the lockup independently of how slow fn() is.
Link: https://lkml.kernel.org/r/20250523-memcg_fix-v1-1-ad3eafb60477@debian.org
Fixes:
ade81479c7dd ("memcg: fix soft lockup in the OOM process")
Signed-off-by: Breno Leitao <leitao@debian.org>
Suggested-by: Rik van Riel <riel@surriel.com>
Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Michael van der Westhuizen <rmikey@meta.com>
Cc: Usama Arif <usamaarif642@gmail.com>
Cc: Pavel Begunkov <asml.silence@gmail.com>
Cc: Chen Ridong <chenridong@huawei.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Ge Yang [Thu, 22 May 2025 03:22:17 +0000 (11:22 +0800)]
mm/hugetlb: fix kernel NULL pointer dereference when replacing free hugetlb folios
A kernel crash was observed when replacing free hugetlb folios:
BUG: kernel NULL pointer dereference, address:
0000000000000028
PGD 0 P4D 0
Oops: Oops: 0000 [#1] SMP NOPTI
CPU: 28 UID: 0 PID: 29639 Comm: test_cma.sh Tainted 6.15.0-rc6-zp #41 PREEMPT(voluntary)
RIP: 0010:alloc_and_dissolve_hugetlb_folio+0x1d/0x1f0
RSP: 0018:
ffffc9000b30fa90 EFLAGS:
00010286
RAX:
0000000000000000 RBX:
0000000000342cca RCX:
ffffea0043000000
RDX:
ffffc9000b30fb08 RSI:
ffffea0043000000 RDI:
0000000000000000
RBP:
ffffc9000b30fb20 R08:
0000000000001000 R09:
0000000000000000
R10:
ffff88886f92eb00 R11:
0000000000000000 R12:
ffffea0043000000
R13:
0000000000000000 R14:
00000000010c0200 R15:
0000000000000004
FS:
00007fcda5f14740(0000) GS:
ffff8888ec1d8000(0000) knlGS:
0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0:
0000000080050033
CR2:
0000000000000028 CR3:
0000000391402000 CR4:
0000000000350ef0
Call Trace:
<TASK>
replace_free_hugepage_folios+0xb6/0x100
alloc_contig_range_noprof+0x18a/0x590
? srso_return_thunk+0x5/0x5f
? down_read+0x12/0xa0
? srso_return_thunk+0x5/0x5f
cma_range_alloc.constprop.0+0x131/0x290
__cma_alloc+0xcf/0x2c0
cma_alloc_write+0x43/0xb0
simple_attr_write_xsigned.constprop.0.isra.0+0xb2/0x110
debugfs_attr_write+0x46/0x70
full_proxy_write+0x62/0xa0
vfs_write+0xf8/0x420
? srso_return_thunk+0x5/0x5f
? filp_flush+0x86/0xa0
? srso_return_thunk+0x5/0x5f
? filp_close+0x1f/0x30
? srso_return_thunk+0x5/0x5f
? do_dup2+0xaf/0x160
? srso_return_thunk+0x5/0x5f
ksys_write+0x65/0xe0
do_syscall_64+0x64/0x170
entry_SYSCALL_64_after_hwframe+0x76/0x7e
There is a potential race between __update_and_free_hugetlb_folio() and
replace_free_hugepage_folios():
CPU1 CPU2
__update_and_free_hugetlb_folio replace_free_hugepage_folios
folio_test_hugetlb(folio)
-- It's still hugetlb folio.
__folio_clear_hugetlb(folio)
hugetlb_free_folio(folio)
h = folio_hstate(folio)
-- Here, h is NULL pointer
When the above race condition occurs, folio_hstate(folio) returns NULL,
and subsequent access to this NULL pointer will cause the system to crash.
To resolve this issue, execute folio_hstate(folio) under the protection
of the hugetlb_lock lock, ensuring that folio_hstate(folio) does not
return NULL.
Link: https://lkml.kernel.org/r/1747884137-26685-1-git-send-email-yangge1116@126.com
Fixes:
04f13d241b8b ("mm: replace free hugepage folios after migration")
Signed-off-by: Ge Yang <yangge1116@126.com>
Reviewed-by: Muchun Song <muchun.song@linux.dev>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Barry Song <21cnbao@gmail.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Kees Cook [Thu, 15 May 2025 21:42:16 +0000 (14:42 -0700)]
mm: vmalloc: only zero-init on vrealloc shrink
The common case is to grow reallocations, and since init_on_alloc will
have already zeroed the whole allocation, we only need to zero when
shrinking the allocation.
Link: https://lkml.kernel.org/r/20250515214217.619685-2-kees@kernel.org
Fixes:
a0309faf1cb0 ("mm: vmalloc: support more granular vrealloc() sizing")
Signed-off-by: Kees Cook <kees@kernel.org>
Tested-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Cc: Danilo Krummrich <dakr@kernel.org>
Cc: Eduard Zingerman <eddyz87@gmail.com>
Cc: "Erhard F." <erhard_f@mailbox.org>
Cc: Shung-Hsi Yu <shung-hsi.yu@suse.com>
Cc: "Uladzislau Rezki (Sony)" <urezki@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Kees Cook [Thu, 15 May 2025 21:42:15 +0000 (14:42 -0700)]
mm: vmalloc: actually use the in-place vrealloc region
Patch series "mm: vmalloc: Actually use the in-place vrealloc region".
This fixes a performance regression[1] with vrealloc()[1].
The refactoring to not build a new vmalloc region only actually worked
when shrinking. Actually return the resized area when it grows. Ugh.
Link: https://lkml.kernel.org/r/20250515214217.619685-1-kees@kernel.org
Fixes:
a0309faf1cb0 ("mm: vmalloc: support more granular vrealloc() sizing")
Signed-off-by: Kees Cook <kees@kernel.org>
Reported-by: Shung-Hsi Yu <shung-hsi.yu@suse.com>
Closes: https://lore.kernel.org/all/
20250515-bpf-verifier-slowdown-vwo2meju4cgp2su5ckj@6gi6ssxbnfqg [1]
Tested-by: Eduard Zingerman <eddyz87@gmail.com>
Tested-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Tested-by: Shung-Hsi Yu <shung-hsi.yu@suse.com>
Reviewed-by: "Uladzislau Rezki (Sony)" <urezki@gmail.com>
Reviewed-by: Danilo Krummrich <dakr@kernel.org>
Cc: "Erhard F." <erhard_f@mailbox.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Suren Baghdasaryan [Sat, 17 May 2025 00:07:39 +0000 (17:07 -0700)]
alloc_tag: allocate percpu counters for module tags dynamically
When a module gets unloaded it checks whether any of its tags are still in
use and if so, we keep the memory containing module's allocation tags
alive until all tags are unused. However percpu counters referenced by
the tags are freed by free_module(). This will lead to UAF if the memory
allocated by a module is accessed after module was unloaded.
To fix this we allocate percpu counters for module allocation tags
dynamically and we keep it alive for tags which are still in use after
module unloading. This also removes the requirement of a larger
PERCPU_MODULE_RESERVE when memory allocation profiling is enabled because
percpu memory for counters does not need to be reserved anymore.
Link: https://lkml.kernel.org/r/20250517000739.5930-1-surenb@google.com
Fixes:
0db6f8d7820a ("alloc_tag: load module tags into separate contiguous memory")
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Reported-by: David Wang <00107082@163.com>
Closes: https://lore.kernel.org/all/
20250516131246.6244-1-
00107082@163.com/
Tested-by: David Wang <00107082@163.com>
Cc: Christoph Lameter (Ampere) <cl@gentwo.org>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: Kent Overstreet <kent.overstreet@linux.dev>
Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
David Wang [Mon, 19 May 2025 16:38:23 +0000 (00:38 +0800)]
module: release codetag section when module load fails
When module load fails after memory for codetag section is ready, codetag
section memory will not be properly released. This causes memory leak,
and if next module load happens to get the same module address, codetag
may pick the uninitialized section when manipulating tags during module
unload, and leads to "unable to handle page fault" BUG.
Link: https://lkml.kernel.org/r/20250519163823.7540-1-00107082@163.com
Fixes:
0db6f8d7820a ("alloc_tag: load module tags into separate contiguous memory")
Closes: https://lore.kernel.org/all/
20250516131246.6244-1-
00107082@163.com/
Signed-off-by: David Wang <00107082@163.com>
Acked-by: Suren Baghdasaryan <surenb@google.com>
Cc: Petr Pavlu <petr.pavlu@suse.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Mike Rapoport (Microsoft) [Mon, 19 May 2025 17:18:05 +0000 (20:18 +0300)]
mm/cma: make detection of highmem_start more robust
Pratyush Yadav reports the following crash:
------------[ cut here ]------------
kernel BUG at arch/x86/mm/physaddr.c:23!
ception 0x06 IP 10:
ffffffff812ebbf8 error 0 cr2 0xffff88903ffff000
CPU: 0 UID: 0 PID: 0 Comm: swapper Not tainted 6.15.0-rc6+ #231 PREEMPT(undef)
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Arch Linux 1.16.3-1-1 04/01/2014
RIP: 0010:__phys_addr+0x58/0x60
Code: 01 48 89 c2 48 d3 ea 48 85 d2 75 05 e9 91 52 cf 00 0f 0b 48 3d ff ff ff 1f 77 0f 48 8b 05 20 54 55 01 48 01 d0 e9 78 52 cf 00 <0f> 0b 90 0f 1f 44 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90
RSP: 0000:
ffffffff82803dd8 EFLAGS:
00010006 ORIG_RAX:
0000000000000000
RAX:
000000007fffffff RBX:
00000000ffffffff RCX:
0000000000000000
RDX:
000000007fffffff RSI:
0000000280000000 RDI:
ffffffffffffffff
RBP:
ffffffff82803e68 R08:
0000000000000000 R09:
0000000000000000
R10:
ffffffff83153180 R11:
ffffffff82803e48 R12:
ffffffff83c9aed0
R13:
0000000000000000 R14:
0000001040000000 R15:
0000000000000000
FS:
0000000000000000(0000) GS:
0000000000000000(0000) knlGS:
0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0:
0000000080050033
CR2:
ffff88903ffff000 CR3:
0000000002838000 CR4:
00000000000000b0
Call Trace:
<TASK>
? __cma_declare_contiguous_nid+0x6e/0x340
? cma_declare_contiguous_nid+0x33/0x70
? dma_contiguous_reserve_area+0x2f/0x70
? setup_arch+0x6f1/0x870
? start_kernel+0x52/0x4b0
? x86_64_start_reservations+0x29/0x30
? x86_64_start_kernel+0x7c/0x80
? common_startup_64+0x13e/0x141
The reason is that __cma_declare_contiguous_nid() does:
highmem_start = __pa(high_memory - 1) + 1;
If dma_contiguous_reserve_area() (or any other CMA declaration) is
called before free_area_init(), high_memory is uninitialized. Without
CONFIG_DEBUG_VIRTUAL, it will likely work but use the wrong value for
highmem_start.
The issue occurs because commit
e120d1bc12da ("arch, mm: set high_memory
in free_area_init()") moved initialization of high_memory after the call
to dma_contiguous_reserve() -> __cma_declare_contiguous_nid() on several
architectures.
In the case CONFIG_HIGHMEM is enabled, some architectures that actually
support HIGHMEM (arm, powerpc and x86) have initialization of high_memory
before a possible call to __cma_declare_contiguous_nid() and some
initialized high_memory late anyway (arc, csky, microblase, mips, sparc,
xtensa) even before the commit
e120d1bc12da so they are fine with using
uninitialized value of high_memory.
And in the case CONFIG_HIGHMEM is disabled high_memory essentially becomes
the first address after memory end, so instead of relying on high_memory
to calculate highmem_start use memblock_end_of_DRAM() and eliminate the
dependency of CMA area creation on high_memory in majority of
configurations.
Link: https://lkml.kernel.org/r/20250519171805.1288393-1-rppt@kernel.org
Fixes:
e120d1bc12da ("arch, mm: set high_memory in free_area_init()")
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Reported-by: Pratyush Yadav <ptyadav@amazon.de>
Tested-by: Pratyush Yadav <ptyadav@amazon.de>
Tested-by: Alexandre Ghiti <alexghiti@rivosinc.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Bo Liu [Thu, 22 May 2025 09:49:31 +0000 (05:49 -0400)]
erofs: support DEFLATE decompression by using Intel QAT
This patch introduces the use of the Intel QAT to offload EROFS data
decompression, aiming to improve the decompression performance.
A 285MiB dataset is used with the following command to create EROFS
images with different cluster sizes:
$ mkfs.erofs -zdeflate,level=9 -C{4096,16384,65536,131072,262144}
Fio is used to test the following read patterns:
$ fio -filename=testfile -bs=4k -rw=read -name=job1
$ fio -filename=testfile -bs=4k -rw=randread -name=job1
$ fio -filename=testfile -bs=4k -rw=randread --io_size=14m -name=job1
Here are some performance numbers for reference:
Processors: Intel(R) Xeon(R) 6766E (144 cores)
Memory: 512 GiB
|-----------------------------------------------------------------------------|
| | Cluster size | sequential read | randread | small randread(5%) |
|-----------|--------------|-----------------|-----------|--------------------|
| Intel QAT | 4096 | 538 MiB/s | 112 MiB/s | 20.76 MiB/s |
| Intel QAT | 16384 | 699 MiB/s | 158 MiB/s | 21.02 MiB/s |
| Intel QAT | 65536 | 917 MiB/s | 278 MiB/s | 20.90 MiB/s |
| Intel QAT | 131072 | 1056 MiB/s | 351 MiB/s | 23.36 MiB/s |
| Intel QAT | 262144 | 1145 MiB/s | 431 MiB/s | 26.66 MiB/s |
| deflate | 4096 | 499 MiB/s | 108 MiB/s | 21.50 MiB/s |
| deflate | 16384 | 422 MiB/s | 125 MiB/s | 18.94 MiB/s |
| deflate | 65536 | 452 MiB/s | 159 MiB/s | 13.02 MiB/s |
| deflate | 131072 | 452 MiB/s | 177 MiB/s | 11.44 MiB/s |
| deflate | 262144 | 466 MiB/s | 194 MiB/s | 10.60 MiB/s |
Signed-off-by: Bo Liu <liubo03@inspur.com>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20250522094931.28956-1-liubo03@inspur.com
[ Gao Xiang: refine the commit message. ]
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Linus Torvalds [Sun, 25 May 2025 01:54:18 +0000 (18:54 -0700)]
Merge tag 'input-for-v6.15-rc7' of git://git./linux/kernel/git/dtor/input
Pull input fixes from Dmitry Torokhov:
- even more Xbox controllers added to xpad driver: Turtle Beach Recon
Wired Controller, Turtle Beach Stealth Ultra, and PowerA Wired
Controller
- a fix to Synaptics RMI driver to not crash if controller reports
unsupported version of F34 (firmware flash) function
* tag 'input-for-v6.15-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input:
Input: synaptics-rmi - fix crash with unsupported versions of F34
Input: xpad - add more controllers
Linus Torvalds [Sun, 25 May 2025 01:48:17 +0000 (18:48 -0700)]
Merge tag 'spi-fix-v6.15-rc7' of git://git./linux/kernel/git/broonie/spi
Pull spi fixes from Mark Brown:
"A few final fixes for v6.15, some driver fixes for the Freescale DSPI
driver pulled over from their vendor code and another instance of the
fixes Greg has been sending throughout the kernel for constification
of the bus_type in driver core match() functions"
* tag 'spi-fix-v6.15-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi:
spi: spi-fsl-dspi: Reset SR flags before sending a new message
spi: spi-fsl-dspi: Halt the module after a new message transfer
spi: spi-fsl-dspi: restrict register range for regmap access
spi: use container_of_cont() for to_spi_device()
Linus Torvalds [Sat, 24 May 2025 16:01:41 +0000 (09:01 -0700)]
Merge tag 'iommu-fixes-v6.15-rc7' of git://git./linux/kernel/git/iommu/linux
Pull iommu fix from Joerg Roedel:
- core: skip PASID validation for devices without PASID support
* tag 'iommu-fixes-v6.15-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/iommu/linux:
iommu: Skip PASID validation for devices without PASID capability
Kent Overstreet [Sat, 24 May 2025 01:59:12 +0000 (21:59 -0400)]
bcachefs: Don't mount bs > ps without TRANSPARENT_HUGEPAGE
Large folios aren't supported without TRANSPARENT_HUGEPAGE
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Sat, 24 May 2025 00:11:43 +0000 (20:11 -0400)]
bcachefs: Fix btree_iter_next_node() for new locking asserts
We can't unlock a should_be_locked path unless we're in a transaction
restart.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Fri, 23 May 2025 18:03:06 +0000 (14:03 -0400)]
bcachefs: Ensure we don't use a blacklisted journal seq
Different versions differ on the size of the blacklist range; it is
theoretically possible that we could end up with blacklisted journal
sequence numbers newer than the newest seq we find in the journal, and
pick a new start seq that's blacklisted.
Explicitly check for this in bch2_fs_journal_start().
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Fri, 23 May 2025 18:19:25 +0000 (14:19 -0400)]
bcachefs: Small check_fix_ptr fixes
We don't want to change the bucket gen, on gen mismatch: it's possible
to have multiple btree nodes with different gens in the same bucket that
we want to keep, if we have to recover from btree node scan.
It's also not necessary to set g->gen_valid; add a comment to that
effect.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Fri, 23 May 2025 22:31:53 +0000 (18:31 -0400)]
bcachefs: Fix opts.recovery_pass_last
This was lost in the giant recovery pass rework - but it's used heavily
by bcachefs subcommand utilities.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Fri, 23 May 2025 22:30:10 +0000 (18:30 -0400)]
bcachefs: Fix allocate -> self healing path
When we go to allocate and find taht a bucket in the freespace btree is
actually allocated, we're supposed to return nonzero to tell the
allocator to skip it.
This fixes an emergency read only due to a bucket/ptr gen mismatch - we
also don't return the correct bucket gen when this happens.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Fri, 23 May 2025 17:13:44 +0000 (13:13 -0400)]
bcachefs: Fix endianness in casefold check/repair
Fixes:
010c89468134 ("bcachefs: Check for casefolded dirents in non casefolded dirs")
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Linus Torvalds [Fri, 23 May 2025 22:17:55 +0000 (15:17 -0700)]
Merge tag 'drm-fixes-2025-05-24' of https://gitlab.freedesktop.org/drm/kernel
Pull drm fixes from Dave Airlie:
"Weekly drm fixes pull, on target to be quiet, just one amdgpu, one
edid and a few minor xe fixes.
edid:
- fix HDR metadata reset
amdgpu:
- Hibernate fix
xe:
- Make sure to check all forcewakes when dumping mocs
- Fix wrong use of read64 on 32b register
- Synchronize Panther Lake PCI IDs"
* tag 'drm-fixes-2025-05-24' of https://gitlab.freedesktop.org/drm/kernel:
drm/xe/ptl: Update the PTL pci id table
drm/xe: Use xe_mmio_read32() to read mtcfg register
drm/xe/mocs: Check if all domains awake
Revert "drm/amd: Keep display off while going into S4"
drm/edid: fixed the bug that hdr metadata was not reset
Dave Airlie [Fri, 23 May 2025 21:42:16 +0000 (07:42 +1000)]
Merge tag 'drm-xe-fixes-2025-05-23' of https://gitlab.freedesktop.org/drm/xe/kernel into drm-fixes
Driver Changes:
- Make sure to check all forcewakes when dumping mocs
- Fix wrong use of read64 on 32b register
- Synchronize Panther Lake PCI IDs
Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Lucas De Marchi <lucas.demarchi@intel.com>
Link: https://lore.kernel.org/r/uixp5cq7emz32lmwwvq4vbujppugfozhyj3cm2aqzx4lcg7ivn@m2khvf4kvz5p
Dave Airlie [Fri, 23 May 2025 21:41:55 +0000 (07:41 +1000)]
Merge tag 'amd-drm-fixes-6.15-2025-05-22' of https://gitlab.freedesktop.org/agd5f/linux into drm-fixes
amd-drm-fixes-6.15-2025-05-22:
amdgpu:
- Hibernate fix
Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Alex Deucher <alexander.deucher@amd.com>
Link: https://lore.kernel.org/r/20250522183941.9606-1-alexander.deucher@amd.com
Dave Airlie [Fri, 23 May 2025 21:37:45 +0000 (07:37 +1000)]
Merge tag 'drm-misc-fixes-2025-05-22' of https://gitlab.freedesktop.org/drm/misc/kernel into drm-fixes
Short summary of fixes pull:
edid:
- fix HDR metadata reset
Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Thomas Zimmermann <tzimmermann@suse.de>
Link: https://lore.kernel.org/r/20250522113902.GA7000@localhost.localdomain
Linus Torvalds [Fri, 23 May 2025 16:47:43 +0000 (09:47 -0700)]
Merge tag 'thermal-6.15-rc8' of git://git./linux/kernel/git/rafael/linux-pm
Pull thermal control fix from Rafael Wysocki:
"This fixes a coding mistake in the x86_pkg_temp_thermal Intel thermal
driver that was introduced by an incorrect conflict resolution during
a merge (Zhang Rui)"
* tag 'thermal-6.15-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
thermal: intel: x86_pkg_temp_thermal: Fix bogus trip temperature
Stuart Yoder [Wed, 30 Apr 2025 15:47:23 +0000 (10:47 -0500)]
tpm_crb: ffa_tpm: fix/update comments describing the CRB over FFA ABI
-Fix the comment describing the 'start' function, which was a cut/paste
mistake for a different function.
-The comment for DIRECT_REQ and DIRECT_RESP only mentioned AArch32
and listed 32-bit function IDs. Update to include 64-bit.
Signed-off-by: Stuart Yoder <stuart.yoder@arm.com>
Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org>
Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org>
Yeoreum Yun [Tue, 15 Apr 2025 18:50:13 +0000 (19:50 +0100)]
tpm_crb_ffa: use dev_xx() macro to print log
Instead of pr_xxx() macro, use dev_xxx() to print log.
This patch changes some error log level to warn log level when
the tpm_crb_ffa secure partition doesn't support properly but
system can run without it.
(i.e) unsupport of direct message ABI or unsupported ABI version
Signed-off-by: Yeoreum Yun <yeoreum.yun@arm.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org>
Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org>
Yeoreum Yun [Tue, 15 Apr 2025 18:50:12 +0000 (19:50 +0100)]
tpm_ffa_crb: access tpm service over FF-A direct message request v2
For secure partition with multi service, tpm_ffa_crb can access tpm
service with direct message request v2 interface according to chapter 3.3,
TPM Service Command Response Buffer Interface Over FF-A specificationi v1.0 BET.
This patch reflects this spec to access tpm service over
FF-A direct message request v2 ABI.
Acked-by: Sudeep Holla <sudeep.holla@arm.com>
Signed-off-by: Yeoreum Yun <yeoreum.yun@arm.com>
Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org>
Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org>
Colin Ian King [Wed, 30 Apr 2025 08:34:35 +0000 (09:34 +0100)]
tpm: remove kmalloc failure error message
The kmalloc failure message is just noise. Remove it and replace -EFAULT
with -ENOMEM as standard for out of memory allocation error returns.
Link: https://lore.kernel.org/linux-integrity/20250430083435.860146-1-colin.i.king@gmail.com/
Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org>
Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org>
Linus Torvalds [Fri, 23 May 2025 15:42:29 +0000 (08:42 -0700)]
Merge tag 'v6.15-rc8-ksmbd-server-fixes' of git://git.samba.org/ksmbd
Pull smb server fixes from Steve French:
- Fix for rename regression due to the recent VFS lookup changes
- Fix write failure
- locking fix for oplock handling
* tag 'v6.15-rc8-ksmbd-server-fixes' of git://git.samba.org/ksmbd:
ksmbd: use list_first_entry_or_null for opinfo_get_list()
ksmbd: fix rename failure
ksmbd: fix stream write failure
Ming Lei [Thu, 22 May 2025 16:35:21 +0000 (00:35 +0800)]
selftests: ublk: add test for UBLK_F_QUIESCE
Add test generic_11 for covering new control command of
UBLK_U_CMD_QUIESCE_DEV.
Add 'quiesce -n dev_id' sub-command on ublk utility for transitioning
device state to quiesce states, then verify the feature via generic_10
by doing quiesce and recovery.
Cc: Yoav Cohen <yoav@nvidia.com>
Link: https://lore.kernel.org/linux-block/DM4PR12MB632807AB7CDCE77D1E5AB7D0A9B92@DM4PR12MB6328.namprd12.prod.outlook.com/
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250522163523.406289-4-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Thu, 22 May 2025 16:35:20 +0000 (00:35 +0800)]
ublk: add feature UBLK_F_QUIESCE
Add feature UBLK_F_QUIESCE, which adds control command `UBLK_U_CMD_QUIESCE_DEV`
for quiescing device, then device state can become `UBLK_S_DEV_QUIESCED`
or `UBLK_S_DEV_FAIL_IO` finally from ublk_ch_release() with ublk server
cooperation.
This feature can help to support to upgrade ublk server application by
shutting down ublk server gracefully, meantime keep ublk block device
persistent during the upgrading period.
The feature is only available for UBLK_F_USER_RECOVERY.
Suggested-by: Yoav Cohen <yoav@nvidia.com>
Link: https://lore.kernel.org/linux-block/DM4PR12MB632807AB7CDCE77D1E5AB7D0A9B92@DM4PR12MB6328.namprd12.prod.outlook.com/
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250522163523.406289-3-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Thu, 22 May 2025 16:35:19 +0000 (00:35 +0800)]
selftests: ublk: add test case for UBLK_U_CMD_UPDATE_SIZE
Add test generic_10 for covering new control command of UBLK_U_CMD_UPDATE_SIZE.
Add 'update_size -s|--size size_in_bytes' sub-command on ublk utility for
supporting this feature, then verify the feature via generic_10.
Cc: Omri Mann <omri@nvidia.com>
Cc: Jared Holzman <jholzman@nvidia.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250522163523.406289-2-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ritesh Harjani (IBM) [Thu, 22 May 2025 13:51:10 +0000 (19:21 +0530)]
traceevent/block: Add REQ_ATOMIC flag to block trace events
Filesystems like XFS can implement atomic write I/O using either
REQ_ATOMIC flag set in the bio or via CoW operation. It will be useful
if we have a flag in trace events to distinguish between the two. This
patch adds char 'U' (Untorn writes) to rwbs field of the trace events
if REQ_ATOMIC flag is set in the bio.
<W/ REQ_ATOMIC>
=================
xfs_io-4238 [009] ..... 4148.126843: block_rq_issue: 259,0 WFSU 16384 () 768 + 32 none,0,0 [xfs_io]
<idle>-0 [009] d.h1. 4148.129864: block_rq_complete: 259,0 WFSU () 768 + 32 none,0,0 [0]
<W/O REQ_ATOMIC>
===============
xfs_io-4237 [010] ..... 4143.325616: block_rq_issue: 259,0 WS 16384 () 768 + 32 none,0,0 [xfs_io]
<idle>-0 [010] d.H1. 4143.329138: block_rq_complete: 259,0 WS () 768 + 32 none,0,0 [0]
Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Link: https://lore.kernel.org/r/44317cb2ec4588f6a2c1501a96684e6a1196e8ba.1747921498.git.ritesh.list@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Linus Torvalds [Fri, 23 May 2025 15:04:13 +0000 (08:04 -0700)]
Merge tag 'soc-fixes-6.15-3' of git://git./linux/kernel/git/soc/soc
Pull SoC fixes from Arnd Bergmann:
"A few last minute fixes:
- two driver fixes for samsung/google platforms, both addressing
mistakes in changes from the 6.15 merge window
- a revert for an allwinner devicetree change that caused problems
- a fix for an older regression with the LEDs on Marvell Armada 3720
- a defconfig change to enable chacha20 again after a crypto
subsystem change in 6.15 inadventently turned it off"
* tag 'soc-fixes-6.15-3' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc:
arm64: defconfig: Ensure CRYPTO_CHACHA20_NEON is selected
arm64: dts: marvell: uDPU: define pinctrl state for alarm LEDs
Revert "arm64: dts: allwinner: h6: Use RSB for AXP805 PMIC connection"
soc: samsung: usi: prevent wrong bits inversion during unconfiguring
firmware: exynos-acpm: check saved RX before bailing out on empty RX queue
Linus Torvalds [Fri, 23 May 2025 14:59:39 +0000 (07:59 -0700)]
Merge tag 'platform-drivers-x86-v6.15-6' of git://git./linux/kernel/git/pdx86/platform-drivers-x86
Pull x86 platform drivers fixes from Ilpo Järvinen:
- dell-wmi-sysman: Avoid buffer overflow in current_password_store()
- fujitsu-laptop: Support Lifebook S2110 hotkeys
- intel/pmc: Fix Arrow Lake U/H NPU PCI ID
- think-lmi: Fix attribute name usage for non-compliant items
- thinkpad_acpi: Ignore battery threshold change event notification
* tag 'platform-drivers-x86-v6.15-6' of git://git.kernel.org/pub/scm/linux/kernel/git/pdx86/platform-drivers-x86:
platform/x86/intel/pmc: Fix Arrow Lake U/H NPU PCI ID
platform/x86: think-lmi: Fix attribute name usage for non-compliant items
platform/x86: thinkpad_acpi: Ignore battery threshold change event notification
platform/x86: dell-wmi-sysman: Avoid buffer overflow in current_password_store()
platform/x86: fujitsu-laptop: Support Lifebook S2110 hotkeys
Linus Torvalds [Fri, 23 May 2025 14:51:05 +0000 (07:51 -0700)]
Merge tag 'vfs-6.15-rc8.fixes' of git://git./linux/kernel/git/vfs/vfs
Pull vfs fixes from Christian Brauner:
"This contains a small set of fixes for the blocking buffer lookup
conversion done earlier this cycle.
It adds a missing conversion in the getblk slowpath and a few minor
optimizations and cleanups"
* tag 'vfs-6.15-rc8.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
fs/buffer: optimize discard_buffer()
fs/buffer: remove superfluous statements
fs/buffer: avoid redundant lookup in getblk slowpath
fs/buffer: use sleeping lookup in __getblk_slowpath()
Pavel Begunkov [Fri, 23 May 2025 09:04:46 +0000 (10:04 +0100)]
io_uring/cmd: warn on reg buf imports by ineligible cmds
For IORING_URING_CMD_FIXED-less commands io_uring doesn't pull buf_index
from the sqe, so imports might succeed if the index coincide, e.g. when
it's 0, but otherwise it's error prone. Warn if someone tries to import
without the flag.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Anuj Gupta <anuj20.g@samsung.com>
Link: https://lore.kernel.org/r/a1c2c88e53c3fe96978f23d50c6bc66c2c79c337.1747991070.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Dmitry V. Levin [Sun, 11 May 2025 22:49:53 +0000 (01:49 +0300)]
statmount: update STATMOUNT_SUPPORTED macro
According to commit
8f6116b5b77b ("statmount: add a new supported_mask
field"), STATMOUNT_SUPPORTED macro shall be updated whenever a new flag
is added.
Fixes:
7a54947e727b ("Merge patch series "fs: allow changing idmappings"")
Signed-off-by: "Dmitry V. Levin" <ldv@strace.io>
Link: https://lore.kernel.org/20250511224953.GA17849@strace.io
Signed-off-by: Christian Brauner <brauner@kernel.org>
Stephen Brennan [Wed, 7 May 2025 22:34:01 +0000 (15:34 -0700)]
fs: convert mount flags to enum
In prior kernel versions (5.8-6.8), commit
9f6c61f96f2d9 ("proc/mounts:
add cursor") introduced MNT_CURSOR, a flag used by readers from
/proc/mounts to keep their place while reading the file. Later, commit
2eea9ce4310d8 ("mounts: keep list of mounts in an rbtree") removed this
flag and its value has since been repurposed.
For debuggers iterating over the list of mounts, cursors should be
skipped as they are irrelevant. Detecting whether an element is a cursor
can be difficult. Since the MNT_CURSOR flag is a preprocessor constant,
it's not present in debuginfo, and since its value is repurposed, we
cannot hard-code it. For this specific issue, cursors are possible to
detect in other ways, but ideally, we would be able to read the mount
flag definitions out of the debuginfo. For that reason, convert the
mount flags to an enum.
Link: https://github.com/osandov/drgn/pull/496
Signed-off-by: Stephen Brennan <stephen.s.brennan@oracle.com>
Link: https://lore.kernel.org/20250507223402.2795029-1-stephen.s.brennan@oracle.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
Al Viro [Mon, 21 Apr 2025 03:35:09 +0000 (04:35 +0100)]
->mnt_devname is never NULL
Not since
8f2918898eb5 "new helpers: vfs_create_mount(), fc_mount()"
back in 2018. Get rid of the dead checks...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Link: https://lore.kernel.org/20250421033509.GV2023217@ZenIV
Signed-off-by: Christian Brauner <brauner@kernel.org>
Christian Brauner [Wed, 16 Apr 2025 08:19:05 +0000 (10:19 +0200)]
mount: add a comment about concurrent changes with statmount()/listmount()
Add some comments in there highlighting a few non-obvious assumptions.
Link: https://lore.kernel.org/20250416-zerknirschen-aluminium-14a55639076f@brauner
Signed-off-by: Christian Brauner <brauner@kernel.org>
Jens Axboe [Fri, 23 May 2025 12:08:49 +0000 (06:08 -0600)]
io_uring/io-wq: only create a new worker if it can make progress
Hashed work is serialized by io-wq, intended to be used for cases like
serializing buffered writes to a regular file, where the file system
will serialize the workers anyway with a mutex or similar. Since they
would be forcibly serialized and blocked, it's more efficient for io-wq
to handle these individually rather than issue them in parallel.
If a worker is currently handling a hashed work item and gets blocked,
don't create a new worker if the next work item is also hashed and
mapped to the same bucket. That new worker would not be able to make any
progress anyway.
Reported-by: Fengnan Chang <changfengnan@bytedance.com>
Reported-by: Diangang Li <lidiangang@bytedance.com>
Link: https://lore.kernel.org/io-uring/20250522090909.73212-1-changfengnan@bytedance.com/
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Fri, 23 May 2025 12:05:37 +0000 (06:05 -0600)]
io_uring/io-wq: ignore non-busy worker going to sleep
When an io-wq worker goes to sleep, it checks if there's work to do.
If there is, it'll create a new worker. But if this worker is currently
idle, it'll either get woken right back up immediately, or someone
else has already created the necessary worker to handle this work.
Only go through the worker creation logic if the current worker is
currently handling a work item. That means it's being scheduled out as
part of handling that work, not just going to sleep on its own.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Fri, 23 May 2025 12:04:29 +0000 (06:04 -0600)]
io_uring/io-wq: move hash helpers to the top
Just in preparation for using them higher up in the function, move
these generic helpers.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Kent Overstreet [Wed, 10 Apr 2024 03:53:57 +0000 (23:53 -0400)]
bcachefs: Path must be locked if trans->locked && should_be_locked
If path->should_be_locked is true, that means user code (of the btree
API) has seen, in this transaction, something guarded by the node this
path has locked, and we have to keep it locked until the end of the
transaction.
Assert that we're not violating this; should_be_locked should also be
cleared only in _very_ special situations.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Thu, 22 May 2025 19:52:15 +0000 (15:52 -0400)]
bcachefs: Simplify bch2_path_put()
Simplify the "do we need to keep this locked?" checks.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Thu, 22 May 2025 19:33:14 +0000 (15:33 -0400)]
bcachefs: Plumb btree_trans for more locking asserts
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Thu, 22 May 2025 20:04:15 +0000 (16:04 -0400)]
bcachefs: Clear trans->locked before unlock
We're adding new should_be_locked assertions: it's going to be illegal
to unlock a should_be_locked path when trans->locked is true.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Thu, 22 May 2025 20:03:08 +0000 (16:03 -0400)]
bcachefs: Clear should_be_locked before unlock in key_cache_drop()
We're adding new should_be_locked assertions, also add a comment
explaining why clearing should_be_locked is safe here.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Thu, 22 May 2025 22:12:54 +0000 (18:12 -0400)]
bcachefs: bch2_path_get() reuses paths if upgrade_fails & !should_be_locked
Small additional optimization over the previous patch, bringing us
closer to the original behaviour, except when we need to clone to avoid
a transaction restart.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Thu, 22 May 2025 22:00:45 +0000 (18:00 -0400)]
bcachefs: Give out new path if upgrade fails
Avoid transaction restarts due to failure to upgrade - we can traverse a
new iterator without a transaction restart.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Thu, 22 May 2025 20:54:31 +0000 (16:54 -0400)]
bcachefs: Fix btree_path_get_locks when not doing trans restart
btree_path_get_locks, on failure, shouldn't unlock if we're not issuing
a transaction restart: we might drop locks we're not supposed to (if
path->should_be_locked is set).
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Thu, 22 May 2025 22:03:32 +0000 (18:03 -0400)]
bcachefs: btree_node_locked_type_nowrite()
Small helper to improve locking assertions.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Thu, 22 May 2025 19:40:24 +0000 (15:40 -0400)]
bcachefs: Kill bch2_path_put_nokeep()
bch2_path_put_nokeep() was intended for paths we wouldn't need to
preserve for a transaction restart - it always frees them right away
when the ref hits 0.
But since paths are shared, freeing unconditionally is a bug, the path
might have been used elsewhere and have should_be_locked set, i.e. we
need to keep it locked until the end of the transaction.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Arnd Bergmann [Thu, 22 May 2025 10:21:41 +0000 (11:21 +0100)]
crypto: qat - add missing header inclusion
Without this header, the build of the new qat_6xxx driver fails when
CONFIG_PCI_IOV is not set:
In file included from drivers/crypto/intel/qat/qat_common/adf_gen6_shared.c:7:
drivers/crypto/intel/qat/qat_common/adf_gen4_pfvf.h: In function 'adf_gen4_init_pf_pfvf_ops':
drivers/crypto/intel/qat/qat_common/adf_gen4_pfvf.h:13:34: error: 'adf_pfvf_comms_disabled' undeclared (first use in this function)
13 | pfvf_ops->enable_comms = adf_pfvf_comms_disabled;
| ^~~~~~~~~~~~~~~~~~~~~~~
Fixes:
17fd7514ae68 ("crypto: qat - add qat_6xxx driver")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Giovanni Cabiddu <giovanni.cabiddu@intel.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Herbert Xu [Mon, 19 May 2025 10:29:38 +0000 (18:29 +0800)]
crypto: api - Redo lookup on EEXIST
When two crypto algorithm lookups occur at the same time with
different names for the same algorithm, e.g., ctr(aes-generic)
and ctr(aes), they will both be instantiated. However, only one
of them can be registered. The second instantiation will fail
with EEXIST.
Avoid failing the second lookup by making it retry, but only once
because there are tricky names such as gcm_base(ctr(aes),ghash)
that will always fail, despite triggering instantiation and EEXIST.
Reported-by: Ingo Franzki <ifranzki@linux.ibm.com>
Fixes:
2825982d9d66 ("[CRYPTO] api: Added event notification")
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Herbert Xu [Fri, 23 May 2025 09:20:59 +0000 (17:20 +0800)]
Revert "crypto: testmgr - Add hash export format testing"
This reverts commit
18c438b228558e05ede7dccf947a6547516fc0c7.
The s390 hmac and sha3 algorithms are failing the test. Revert
the change until they have been fixed.
Reported-by: Ingo Franzki <ifranzki@linux.ibm.com>
Link: https://lore.kernel.org/all/623a7fcb-b4cb-48e6-9833-57ad2b32a252@linux.ibm.com/
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Todd Brandt [Tue, 20 May 2025 10:45:55 +0000 (03:45 -0700)]
platform/x86/intel/pmc: Fix Arrow Lake U/H NPU PCI ID
The ARL requires that the GMA and NPU devices both be in D3Hot in order
for PC10 and S0iX to be achieved in S2idle. The original ARL-H/U addition
to the intel_pmc_core driver attempted to do this by switching them to D3
in the init and resume calls of the intel_pmc_core driver.
The problem is the ARL-H/U have a different NPU device and thus are not
being properly set and thus S0iX does not work properly in ARL-H/U. This
patch creates a new ARL-H specific device id that is correct and also
adds the D3 fixup to the suspend callback. This way if the PCI devies
drop from D3 to D0 after resume they can be corrected for the next
suspend. Thus there is no dropout in S0iX.
Fixes:
bd820906ea9d ("platform/x86/intel/pmc: Add Arrow Lake U/H support to intel_pmc_core driver")
Signed-off-by: Todd Brandt <todd.e.brandt@intel.com>
Link: https://lore.kernel.org/r/a61f78be45c13f39e122dcc684b636f4b21e79a0.1747737446.git.todd.e.brandt@intel.com
Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Christian Brauner [Fri, 23 May 2025 08:47:06 +0000 (10:47 +0200)]
mips, net: ensure that SOCK_COREDUMP is defined
For historical reasons mips has to override the socket enum values but
the defines are all the same. So simply move the ARCH_HAS_SOCKET_TYPES
scope.
Fixes:
a9194f88782a ("coredump: add coredump socket")
Suggested-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Matt Atwood [Tue, 20 May 2025 19:57:49 +0000 (12:57 -0700)]
drm/xe/ptl: Update the PTL pci id table
Update to current bspec table.
Bspec: 72574
Signed-off-by: Matt Atwood <matthew.s.atwood@intel.com>
Reviewed-by: Tejas Upadhyay <tejas.upadhyay@intel.com>
Reviewed-by: Clint Taylor <Clinton.A.Taylor@intel.com>
Link: https://lore.kernel.org/r/20250520195749.371748-1-matthew.s.atwood@intel.com
Signed-off-by: Matt Roper <matthew.d.roper@intel.com>
(cherry picked from commit
49c6dc74b5968885f421f9f1b45eb4890b955870)
Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
Shuicheng Lin [Tue, 13 May 2025 15:30:10 +0000 (15:30 +0000)]
drm/xe: Use xe_mmio_read32() to read mtcfg register
The mtcfg register is a 32-bit register and should therefore be
accessed using xe_mmio_read32().
Other 3 changes per codestyle suggestion:
"
xe_mmio.c:83: CHECK: Alignment should match open parenthesis
xe_mmio.c:131: CHECK: Comparison to NULL could be written "!xe->mmio.regs"
xe_mmio.c:315: CHECK: line length of 103 exceeds 100 columns
"
Fixes:
dd08ebf6c352 ("drm/xe: Introduce a new DRM driver for Intel GPUs")
Reviewed-by: Tejas Upadhyay <tejas.upadhyay@intel.com>
Cc: Matt Roper <matthew.d.roper@intel.com>
Signed-off-by: Shuicheng Lin <shuicheng.lin@intel.com>
Link: https://lore.kernel.org/r/20250513153010.3464767-1-shuicheng.lin@intel.com
Signed-off-by: Matt Roper <matthew.d.roper@intel.com>
(cherry picked from commit
d2662cf8f44a68deb6c76ad9f1d9f29dbf7ba601)
Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
Tejas Upadhyay [Tue, 6 May 2025 14:23:00 +0000 (19:53 +0530)]
drm/xe/mocs: Check if all domains awake
Check if all domains are awake specially for
LNCF regs
Fixes:
298661cd9cea ("drm/xe: Fix MOCS debugfs LNCF readout")
Improvements-suggested-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
Reviewed-by: Badal Nilawar <badal.nilawar@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20250506142300.1865783-1-tejas.upadhyay@intel.com
Signed-off-by: Tejas Upadhyay <tejas.upadhyay@intel.com>
(cherry picked from commit
a383cf218ef8bb35d4c03958bd956573b65cf778)
Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>