Christoph Hellwig [Wed, 14 May 2025 10:50:37 +0000 (10:50 +0000)]
xfs: free the item in xfs_mru_cache_insert on failure
Call the provided free_func when xfs_mru_cache_insert as that's what
the callers need to do anyway.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
Christoph Hellwig [Wed, 14 May 2025 04:44:20 +0000 (06:44 +0200)]
xfs: remove the EXPERIMENTAL warning for pNFS
The pNFS layout support has been around for 10 years without major
issues, drop the EXPERIMENTAL warning.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
Darrick J. Wong [Tue, 13 May 2025 14:35:29 +0000 (07:35 -0700)]
xfs: remove some EXPERIMENTAL warnings
Online fsck was finished a year ago, in Linux 6.10. The exchange-range
syscall and parent pointers were merged in the same cycle. None of
these have encountered any serious errors in the year that they've been
in the kernel (or the many many years they've been under development) so
let's drop the shouty warnings.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
Carlos Maiolino [Wed, 14 May 2025 10:38:53 +0000 (12:38 +0200)]
Merge branch 'atomic_writes-6.16' into xfs-6.16-merge
Required update due to conflict with patch:
xfs: stop using set_blocksize
Conflicts:
fs/xfs/xfs_buf.c
Signed-off-by: Carlos Maiolino <cem@kernel.org>
Zizhi Wo [Tue, 6 May 2025 01:15:40 +0000 (09:15 +0800)]
xfs: Remove deprecated xfs_bufd sysctl parameters
Commit
64af7a6ea5a4 ("xfs: remove deprecated sysctls") removed the
deprecated xfsbufd-related sysctl interface, but forgot to delete the
corresponding parameters: "xfs_buf_timer" and "xfs_buf_age".
This patch removes those parameters and makes no other changes.
Signed-off-by: Zizhi Wo <wozizhi@huawei.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
Darrick J. Wong [Wed, 23 Apr 2025 19:54:13 +0000 (12:54 -0700)]
xfs: stop using set_blocksize
XFS has its own buffer cache for metadata that uses submit_bio, which
means that it no longer uses the block device pagecache for anything.
Create a more lightweight helper that runs the blocksize checks and
flushes dirty data and use that instead. No more truncating the
pagecache because XFS does not use it or care about it.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
Carlos Maiolino [Wed, 14 May 2025 10:20:57 +0000 (12:20 +0200)]
Merge branch 'block-6.15' of git://git./linux/kernel/git/axboe/linux-block into xfs-6.16-merge
Merging block tree into XFS because of some dependencies like
bdev_validate_blocksize()
Signed-off-by: Carlos Maiolino <cem@kernel.org>
Keith Busch [Fri, 9 May 2025 15:38:02 +0000 (08:38 -0700)]
block: always allocate integrity buffer when required
Many nvme metadata formats can not strip or generate the metadata on the
controller side. For these formats, a host provided integrity buffer is
mandatory even if it isn't checked.
The block integrity read_verify and write_generate attributes prevent
allocating the metadata buffer, but we need it when the format requires
it, otherwise reads and writes will be rejected by the driver with IO
errors.
Assume the integrity buffer can be offloaded to the controller if the
metadata size is the same as the protection information size. Otherwise
provide an unchecked host buffer when the read verify or write
generation attributes are disabled. This fixes the following nvme
warning:
------------[ cut here ]------------
WARNING: CPU: 1 PID: 371 at drivers/nvme/host/core.c:1036 nvme_setup_rw+0x122/0x210
...
RIP: 0010:nvme_setup_rw+0x122/0x210
...
Call Trace:
<TASK>
nvme_setup_cmd+0x1b4/0x280
nvme_queue_rqs+0xc4/0x1f0 [nvme]
blk_mq_dispatch_queue_requests+0x24a/0x430
blk_mq_flush_plug_list+0x50/0x140
__blk_flush_plug+0xc1/0x100
__submit_bio+0x1c1/0x360
? submit_bio_noacct_nocheck+0x2d6/0x3c0
submit_bio_noacct_nocheck+0x2d6/0x3c0
? submit_bio_noacct+0x47/0x4c0
submit_bio_wait+0x48/0xa0
__blkdev_direct_IO_simple+0xee/0x210
? current_time+0x1d/0x100
? current_time+0x1d/0x100
? __bio_clone+0xb0/0xb0
blkdev_read_iter+0xbb/0x140
vfs_read+0x239/0x310
ksys_read+0x58/0xc0
do_syscall_64+0x6c/0x180
entry_SYSCALL_64_after_hwframe+0x4b/0x53
Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20250509153802.3482493-1-kbusch@meta.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Carlos Maiolino [Fri, 9 May 2025 07:15:39 +0000 (09:15 +0200)]
Merge tag 'atomic-writes-6.16_2025-05-07' of https://git./linux/kernel/git/djwong/xfs-linux into atomic_writes
large atomic writes for xfs [v12.1]
Currently atomic write support for xfs is limited to writing a single
block as we have no way to guarantee alignment and that the write covers
a single extent.
This series introduces a method to issue atomic writes via a
software-based method.
The software-based method is used as a fallback for when attempting to
issue an atomic write over misaligned or multiple extents.
For xfs, this support is based on reflink CoW support.
The basic idea of this CoW method is to alloc a range in the CoW fork,
write the data, and atomically update the mapping.
Initial mysql performance testing has shown this method to perform ok.
However, there we are only using 16K atomic writes (and 4K block size),
so typically - and thankfully - this software fallback method won't be
used often.
For other FSes which want large atomics writes and don't support CoW, I
think that they can follow the example in [0].
Catherine is currently working on further xfstests for this feature,
which we hope to share soon.
About 17/17, maybe it can be omitted as there is no strong demand to have
it included.
Based on
bfecc4091e07 (xfs/next-rc, xfs/for-next) xfs: allow ro mounts
if rtdev or logdev are read-only
[0] https://lore.kernel.org/linux-xfs/
20250102140411.14617-1-john.g.garry@oracle.com/
Differences to v12:
- add more review tags
Differences to v11:
- split "xfs: ignore ..." patch
- inline sync_blockdev() in xfs_alloc_buftarg() (Christoph)
- fix xfs_calc_rtgroup_awu_max() for 0 block count (Darrick)
- Add RB tag from Christoph (thanks!)
Differences to v10:
- add "xfs: only call xfs_setsize_buftarg once ..." by Darrick
- symbol renames in "xfs: ignore HW which cannot..." by Darrick
Differences to v9:
- rework "ignore HW which cannot .." patch by Darrick
- Ensure power-of-2 max always for unit min/max when no HW support
With a bit of luck, this should all go splendidly.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Jens Axboe [Thu, 8 May 2025 15:08:23 +0000 (09:08 -0600)]
Merge tag 'nvme-6.15-2025-05-08' of git://git.infradead.org/nvme into block-6.15
Pull NVMe fix from Christoph:
"nvme fixes for linux 6.15
- unblock ctrl state transition for firmware update (Daniel Wagner)"
* tag 'nvme-6.15-2025-05-08' of git://git.infradead.org/nvme:
nvme: unblock ctrl state transition for firmware update
Aaron Lu [Thu, 8 May 2025 08:30:36 +0000 (16:30 +0800)]
block: remove test of incorrect io priority level
Ever since commit
eca2040972b4("scsi: block: ioprio: Clean up interface
definition"), the macro IOPRIO_PRIO_LEVEL() will mask the level value to
something between 0 and 7 so necessarily, level will always be lower than
IOPRIO_NR_LEVELS(8).
Remove this obsolete check.
Reported-by: Kexin Wei <ys.weikexin@h3c.com>
Cc: Damien Le Moal <dlemoal@kernel.org>
Signed-off-by: Aaron Lu <ziqianlu@bytedance.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20250508083018.GA769554@bytedance
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Darrick J. Wong [Wed, 7 May 2025 21:18:34 +0000 (14:18 -0700)]
xfs: allow sysadmins to specify a maximum atomic write limit at mount time
Introduce a mount option to allow sysadmins to specify the maximum size
of an atomic write. If the filesystem can work with the supplied value,
that becomes the new guaranteed maximum.
The value mustn't be too big for the existing filesystem geometry (max
write size, max AG/rtgroup size). We dynamically recompute the
tr_atomic_write transaction reservation based on the given block size,
check that the current log size isn't less than the new minimum log size
constraints, and set a new maximum.
The actual software atomic write max is still computed based off of
tr_atomic_ioend the same way it has for the past few commits. Note also
that xfs_calc_atomic_write_log_geometry is non-static because mkfs will
need that.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: John Garry <john.g.garry@oracle.com>
Reviewed-by: John Garry <john.g.garry@oracle.com>
John Garry [Wed, 7 May 2025 21:18:33 +0000 (14:18 -0700)]
xfs: update atomic write limits
Update the limits returned from xfs_get_atomic_write_{min, max, max_opt)().
No reflink support always means no CoW-based atomic writes.
For updating xfs_get_atomic_write_min(), we support blocksize only and that
depends on HW or reflink support.
For updating xfs_get_atomic_write_max(), for no reflink, we are limited to
blocksize but only if HW support. Otherwise we are limited to combined
limit in mp->m_atomic_write_unit_max.
For updating xfs_get_atomic_write_max_opt(), ultimately we are limited by
the bdev atomic write limit. If xfs_get_atomic_write_max() does not report
> 1x blocksize, then just continue to report 0 as before.
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
[djwong: update comments in the helper functions]
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: John Garry <john.g.garry@oracle.com>
John Garry [Wed, 7 May 2025 21:18:32 +0000 (14:18 -0700)]
xfs: add xfs_calc_atomic_write_unit_max()
Now that CoW-based atomic writes are supported, update the max size of an
atomic write for the data device.
The limit of a CoW-based atomic write will be the limit of the number of
logitems which can fit into a single transaction.
In addition, the max atomic write size needs to be aligned to the agsize.
Limit the size of atomic writes to the greatest power-of-two factor of the
agsize so that allocations for an atomic write will always be aligned
compatibly with the alignment requirements of the storage.
Function xfs_atomic_write_logitems() is added to find the limit the number
of log items which can fit in a single transaction.
Amend the max atomic write computation to create a new transaction
reservation type, and compute the maximum size of an atomic write
completion (in fsblocks) based on this new transaction reservation.
Initially, tr_atomic_write is a clone of tr_itruncate, which provides a
reasonable level of parallelism. In the next patch, we'll add a mount
option so that sysadmins can configure their own limits.
[djwong: use a new reservation type for atomic write ioends, refactor
group limit calculations]
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
[jpg: rounddown power-of-2 always]
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: John Garry <john.g.garry@oracle.com>
John Garry [Wed, 7 May 2025 21:18:31 +0000 (14:18 -0700)]
xfs: add xfs_file_dio_write_atomic()
Add xfs_file_dio_write_atomic() for dedicated handling of atomic writes.
Now HW offload will not be required to support atomic writes and is
an optional feature.
CoW-based atomic writes will be supported with out-of-places write and
atomic extent remapping.
Either mode of operation may be used for an atomic write, depending on the
circumstances.
The preferred method is HW offload as it will be faster. If HW offload is
not available then we always use the CoW-based method. If HW offload is
available but not possible to use, then again we use the CoW-based method.
If available, HW offload would not be possible for the write length
exceeding the HW offload limit, the write spanning multiple extents,
unaligned disk blocks, etc.
Apart from the write exceeding the HW offload limit, other conditions for
HW offload usage can only be detected in the iomap handling for the write.
As such, we use a fallback method to issue the write if we detect in the
->iomap_begin() handler that HW offload is not possible. Special code
-ENOPROTOOPT is returned from ->iomap_begin() to inform that HW offload is
not possible.
In other words, atomic writes are supported on any filesystem that can
perform out of place write remapping atomically (i.e. reflink) up to
some fairly large size. If the conditions are right (a single correctly
aligned overwrite mapping) then the filesystem will use any available
hardware support to avoid the filesystem metadata updates.
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: John Garry <john.g.garry@oracle.com>
John Garry [Wed, 7 May 2025 21:18:31 +0000 (14:18 -0700)]
xfs: commit CoW-based atomic writes atomically
When completing a CoW-based write, each extent range mapping update is
covered by a separate transaction.
For a CoW-based atomic write, all mappings must be changed at once, so
change to use a single transaction.
Note that there is a limit on the amount of log intent items which can be
fit into a single transaction, but this is being ignored for now since
the count of items for a typical atomic write would be much less than is
typically supported. A typical atomic write would be expected to be 64KB
or less, which means only 16 possible extents unmaps, which is quite
small.
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
[djwong: add tr_atomic_ioend]
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: John Garry <john.g.garry@oracle.com>
John Garry [Wed, 7 May 2025 21:18:30 +0000 (14:18 -0700)]
xfs: add large atomic writes checks in xfs_direct_write_iomap_begin()
For when large atomic writes (> 1x FS block) are supported, there will be
various occasions when HW offload may not be possible.
Such instances include:
- unaligned extent mapping wrt write length
- extent mappings which do not cover the full write, e.g. the write spans
sparse or mixed-mapping extents
- the write length is greater than HW offload can support
- no hardware support at all
In those cases, we need to fallback to the CoW-based atomic write mode. For
this, report special code -ENOPROTOOPT to inform the caller that HW
offload-based method is not possible.
In addition to the occasions mentioned, if the write covers an unallocated
range, we again judge that we need to rely on the CoW-based method when we
would need to allocate anything more than 1x block. This is because if we
allocate less blocks that is required for the write, then again HW
offload-based method would not be possible. So we are taking a pessimistic
approach to writes covering unallocated space.
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
[djwong: various cleanups]
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: John Garry <john.g.garry@oracle.com>
John Garry [Wed, 7 May 2025 21:18:29 +0000 (14:18 -0700)]
xfs: add xfs_atomic_write_cow_iomap_begin()
For CoW-based atomic writes, reuse the infrastructure for reflink CoW fork
support.
Add ->iomap_begin() callback xfs_atomic_write_cow_iomap_begin() to create
staging mappings in the CoW fork for atomic write updates.
The general steps in the function are as follows:
- find extent mapping in the CoW fork for the FS block range being written
- if part or full extent is found, proceed to process found extent
- if no extent found, map in new blocks to the CoW fork
- convert unwritten blocks in extent if required
- update iomap extent mapping and return
The bulk of this function is quite similar to the processing in
xfs_reflink_allocate_cow(), where we try to find an extent mapping; if
none exists, then allocate a new extent in the CoW fork, convert unwritten
blocks, and return a mapping.
Performance testing has shown the XFS_ILOCK_EXCL locking to be quite
a bottleneck, so this is an area which could be optimised in future.
Christoph Hellwig contributed almost all of the code in
xfs_atomic_write_cow_iomap_begin().
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
[djwong: add a new xfs_can_sw_atomic_write to convey intent better]
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: John Garry <john.g.garry@oracle.com>
John Garry [Wed, 7 May 2025 21:18:28 +0000 (14:18 -0700)]
xfs: refine atomic write size check in xfs_file_write_iter()
Currently the size of atomic write allowed is fixed at the blocksize.
To start to lift this restriction, partly refactor
xfs_report_atomic_write() to into helpers -
xfs_get_atomic_write_{min, max}() - and use those helpers to find the
per-inode atomic write limits and check according to that.
Also add xfs_get_atomic_write_max_opt() to return the optimal limit, and
just return 0 since large atomics aren't supported yet.
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: John Garry <john.g.garry@oracle.com>
John Garry [Wed, 7 May 2025 21:18:28 +0000 (14:18 -0700)]
xfs: refactor xfs_reflink_end_cow_extent()
Refactor xfs_reflink_end_cow_extent() into separate parts which process
the CoW range and commit the transaction.
This refactoring will be used in future for when it is required to commit
a range of extents as a single transaction, similar to how it was done
pre-commit
d6f215f359637.
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: John Garry <john.g.garry@oracle.com>
John Garry [Wed, 7 May 2025 21:18:27 +0000 (14:18 -0700)]
xfs: allow block allocator to take an alignment hint
Add a BMAPI flag to provide a hint to the block allocator to align extents
according to the extszhint.
This will be useful for atomic writes to ensure that we are not being
allocated extents which are not suitable (for atomic writes).
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: John Garry <john.g.garry@oracle.com>
Darrick J. Wong [Wed, 7 May 2025 21:18:26 +0000 (14:18 -0700)]
xfs: ignore HW which cannot atomic write a single block
Currently only HW which can write at least 1x block is supported.
For supporting atomic writes > 1x block, a CoW-based method will also be
used and this will not be resticted to using HW which can write >= 1x
block.
However for deciding if HW-based atomic writes can be used, we need to
start adding checks for write length < HW min, which complicates the
code. Indeed, a statx field similar to unit_max_opt should also be
added for this minimum, which is undesirable.
HW which can only write > 1x blocks would be uncommon and quite weird,
so let's just not support it.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: John Garry <john.g.garry@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Wed, 7 May 2025 21:18:25 +0000 (14:18 -0700)]
xfs: add helpers to compute transaction reservation for finishing intent items
In the transaction reservation code, hoist the logic that computes the
reservation needed to finish one log intent item into separate helper
functions. These will be used in subsequent patches to estimate the
number of blocks that an online repair can commit to reaping in the same
transaction as the change committing the new data structure.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: John Garry <john.g.garry@oracle.com>
Signed-off-by: John Garry <john.g.garry@oracle.com>
Darrick J. Wong [Wed, 7 May 2025 21:18:25 +0000 (14:18 -0700)]
xfs: add helpers to compute log item overhead
Add selected helpers to estimate the transaction reservation required to
write various log intent and buffer items to the log. These helpers
will be used by the online repair code for more precise estimations of
how much work can be done in a single transaction.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: John Garry <john.g.garry@oracle.com>
Signed-off-by: John Garry <john.g.garry@oracle.com>
Darrick J. Wong [Wed, 7 May 2025 21:18:24 +0000 (14:18 -0700)]
xfs: separate out setting buftarg atomic writes limits
Separate out setting buftarg atomic writes limits into a dedicated
function, xfs_configure_buftarg_atomic_writes(), to keep the specific
functionality self-contained.
For naming consistency, rename xfs_setsize_buftarg() ->
xfs_configure_buftarg().
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
[jpg: separate out from patch "xfs: ignore HW which ..."]
Signed-off-by: John Garry <john.g.garry@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
John Garry [Wed, 7 May 2025 21:18:23 +0000 (14:18 -0700)]
xfs: rename xfs_inode_can_atomicwrite() -> xfs_inode_can_hw_atomic_write()
In future we will want to be able to check if specifically HW offload-based
atomic writes are possible, so rename xfs_inode_can_atomicwrite() ->
xfs_inode_can_hw_atomicwrite().
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
[djwong: add an underscore to be consistent with everything else]
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: John Garry <john.g.garry@oracle.com>
Darrick J. Wong [Wed, 7 May 2025 21:18:22 +0000 (14:18 -0700)]
xfs: only call xfs_setsize_buftarg once per buffer target
It's silly to call xfs_setsize_buftarg from xfs_alloc_buftarg with the
block device LBA size because we don't need to ask the block layer to
validate a geometry number that it provided us. Instead, set the
preliminary bt_meta_sector* fields to the LBA size in preparation for
reading the primary super.
However, we still want to flush and invalidate the pagecache for all
three block devices before we start reading metadata from those devices,
so call sync_blockdev() per bdev in xfs_alloc_buftarg().
This will enable a subsequent patch to validate hw atomic write geometry
against the filesystem geometry.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: John Garry <john.g.garry@oracle.com>
[jpg: call sync_blockdev() from xfs_alloc_buftarg()]
Signed-off-by: John Garry <john.g.garry@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
John Garry [Wed, 7 May 2025 21:18:21 +0000 (14:18 -0700)]
fs: add atomic write unit max opt to statx
XFS will be able to support large atomic writes (atomic write > 1x block)
in future. This will be achieved by using different operating methods,
depending on the size of the write.
Specifically a new method of operation based in FS atomic extent remapping
will be supported in addition to the current HW offload-based method.
The FS method will generally be appreciably slower performing than the
HW-offload method. However the FS method will be typically able to
contribute to achieving a larger atomic write unit max limit.
XFS will support a hybrid mode, where HW offload method will be used when
possible, i.e. HW offload is used when the length of the write is
supported, and for other times FS-based atomic writes will be used.
As such, there is an atomic write length at which the user may experience
appreciably slower performance.
Advertise this limit in a new statx field, stx_atomic_write_unit_max_opt.
When zero, it means that there is no such performance boundary.
Masks STATX{_ATTR}_WRITE_ATOMIC can be used to get this new field. This is
ok for older kernels which don't support this new field, as they would
report 0 in this field (from zeroing in cp_statx()) already. Furthermore
those older kernels don't support large atomic writes - apart from block
fops, but there would be consistent performance there for atomic writes
in range [unit min, unit max].
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Acked-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: John Garry <john.g.garry@oracle.com>
Daniel Wagner [Fri, 2 May 2025 08:58:00 +0000 (10:58 +0200)]
nvme: unblock ctrl state transition for firmware update
The original nvme subsystem design didn't have a CONNECTING state; the
state machine allowed transitions from RESETTING to LIVE directly.
With the introduction of nvme fabrics the CONNECTING state was
introduce. Over time the nvme-pci started to use the CONNECTING state as
well.
Eventually, a bug fix for the nvme-fc started to depend that the only
valid transition to LIVE was from CONNECTING. Though this change didn't
update the firmware update handler which was still depending on
RESETTING to LIVE transition.
The simplest way to address it for the time being is to switch into
CONNECTING state before going to LIVE state.
Fixes:
d2fe192348f9 ("nvme: only allow entering LIVE from CONNECTING state")
Reported-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Daniel Wagner <wagi@kernel.org>
Closes: https://lore.kernel.org/all/
0134ea15-8d5f-41f7-9e9a-
d7e6d82accaa@roeck-us.net
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Guenter Roeck <linux@roeck-us.net>
Johannes Thumshirn [Tue, 6 May 2025 11:27:30 +0000 (13:27 +0200)]
block: only update request sector if needed
In case of a ZONE APPEND write, regardless of native ZONE APPEND or the
emulation layer in the zone write plugging code, the sector the data got
written to by the device needs to be updated in the bio.
At the moment, this is done for every native ZONE APPEND write and every
request that is flagged with 'BIO_ZONE_WRITE_PLUGGING'. But thus
superfluously updates the sector for regular writes to a zoned block
device.
Check if a bio is a native ZONE APPEND write or if the bio is flagged as
'BIO_EMULATES_ZONE_APPEND', meaning the block layer's zone write plugging
code handles the ZONE APPEND and translates it into a regular write and
back. Only if one of these two criterion is met, update the sector in the
bio upon completion.
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/dea089581cb6b777c1cd1500b38ac0b61df4b2d1.1746530748.git.jth@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Lizhi Xu [Mon, 28 Apr 2025 14:36:26 +0000 (22:36 +0800)]
loop: Add sanity check for read/write_iter
Some file systems do not support read_iter/write_iter, such as selinuxfs
in this issue.
So before calling them, first confirm that the interface is supported and
then call it.
It is releavant in that vfs_iter_read/write have the check, and removal
of their used caused szybot to be able to hit this issue.
Fixes:
f2fed441c69b ("loop: stop using vfs_iter__{read,write} for buffered I/O")
Reported-by: syzbot+6af973a3b8dfd2faefdc@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=
6af973a3b8dfd2faefdc
Signed-off-by: Lizhi Xu <lizhi.xu@windriver.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20250428143626.3318717-1-lizhi.xu@windriver.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Dave Chinner [Wed, 30 Apr 2025 23:27:24 +0000 (09:27 +1000)]
xfs: don't assume perags are initialised when trimming AGs
When running fstrim immediately after mounting a V4 filesystem,
the fstrim fails to trim all the free space in the filesystem. It
only trims the first extent in the by-size free space tree in each
AG and then returns. If a second fstrim is then run, it runs
correctly and the entire free space in the filesystem is iterated
and discarded correctly.
The problem lies in the setup of the trim cursor - it assumes that
pag->pagf_longest is valid without either reading the AGF first or
checking if xfs_perag_initialised_agf(pag) is true or not.
As a result, when a filesystem is mounted without reading the AGF
(e.g. a clean mount on a v4 filesystem) and the first operation is a
fstrim call, pag->pagf_longest is zero and so the free extent search
starts at the wrong end of the by-size btree and exits after
discarding the first record in the tree.
Fix this by deferring the initialisation of tcur->count to after
we have locked the AGF and guaranteed that the perag is properly
initialised. We trigger this on tcur->count == 0 after locking the
AGF, as this will only occur on the first call to
xfs_trim_gather_extents() for each AG. If we need to iterate,
tcur->count will be set to the length of the record we need to
restart at, so we can use this to ensure we only sample a valid
pag->pagf_longest value for the iteration.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Bill O'Donnell <bodonnel@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Fixes:
89cfa899608f ("xfs: reduce AGF hold times during fstrim operations")
Cc: <stable@vger.kernel.org> # v6.6
Signed-off-by: Carlos Maiolino <cem@kernel.org>
Jens Axboe [Thu, 1 May 2025 13:56:02 +0000 (07:56 -0600)]
Merge tag 'nvme-6.15-2025-05-01' of git://git.infradead.org/nvme into block-6.15
Pull NVMe fixes from Christoph:
"nvme fixes for Linux 6.15
- fix queue unquiesce check on PCI slot_reset (Keith Busch)
- fix premature queue removal and I/O failover in nvme-tcp
(Michael Liang)
- don't restore null sk_state_change (Alistair Francis)
- select CONFIG_TLS where needed (Alistair Francis)
- always free derived key data (Hannes Reinecke)
- more quirks (Wentao Guan)"
* tag 'nvme-6.15-2025-05-01' of git://git.infradead.org/nvme:
nvmet-auth: always free derived key data
nvmet-tcp: don't restore null sk_state_change
nvmet-tcp: select CONFIG_TLS from CONFIG_NVME_TARGET_TCP_TLS
nvme-tcp: select CONFIG_TLS from CONFIG_NVME_TCP_TLS
nvme-tcp: fix premature queue removal and I/O failover
nvme-pci: add quirks for WDC Blue SN550 15b7:5009
nvme-pci: add quirks for device 126f:1001
nvme-pci: fix queue unquiesce check on slot_reset
Hans Holmberg [Wed, 30 Apr 2025 08:35:34 +0000 (08:35 +0000)]
xfs: allow ro mounts if rtdev or logdev are read-only
Allow read-only mounts on rtdevs and logdevs that are marked as
read-only and make sure those mounts can't be remounted read-write.
Use the sb_open_mode helper to make sure that we don't try to open
devices with write access enabled for read-only mounts.
Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
Hannes Reinecke [Fri, 25 Apr 2025 09:34:34 +0000 (11:34 +0200)]
nvmet-auth: always free derived key data
After calling nvme_auth_derive_tls_psk() we need to free the resulting
psk data, as either TLS is disable (and we don't need the data anyway)
or the psk data is copied into the resulting key (and can be free, too).
Fixes:
fa2e0f8bbc68 ("nvmet-tcp: support secure channel concatenation")
Reported-by: Yi Zhang <yi.zhang@redhat.com>
Suggested-by: Maurizio Lombardi <mlombard@bsdbackstore.eu>
Signed-off-by: Hannes Reinecke <hare@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Tested-by: Yi Zhang <yi.zhang@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Alistair Francis [Wed, 23 Apr 2025 06:06:21 +0000 (16:06 +1000)]
nvmet-tcp: don't restore null sk_state_change
queue->state_change is set as part of nvmet_tcp_set_queue_sock(), but if
the TCP connection isn't established when nvmet_tcp_set_queue_sock() is
called then queue->state_change isn't set and sock->sk->sk_state_change
isn't replaced.
As such we don't need to restore sock->sk->sk_state_change if
queue->state_change is NULL.
This avoids NULL pointer dereferences such as this:
[ 286.462026][ C0] BUG: kernel NULL pointer dereference, address:
0000000000000000
[ 286.462814][ C0] #PF: supervisor instruction fetch in kernel mode
[ 286.463796][ C0] #PF: error_code(0x0010) - not-present page
[ 286.464392][ C0] PGD
8000000140620067 P4D
8000000140620067 PUD
114201067 PMD 0
[ 286.465086][ C0] Oops: Oops: 0010 [#1] SMP KASAN PTI
[ 286.465559][ C0] CPU: 0 UID: 0 PID: 1628 Comm: nvme Not tainted 6.15.0-rc2+ #11 PREEMPT(voluntary)
[ 286.466393][ C0] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-3.fc41 04/01/2014
[ 286.467147][ C0] RIP: 0010:0x0
[ 286.467420][ C0] Code: Unable to access opcode bytes at 0xffffffffffffffd6.
[ 286.467977][ C0] RSP: 0018:
ffff8883ae008580 EFLAGS:
00010246
[ 286.468425][ C0] RAX:
0000000000000000 RBX:
ffff88813fd34100 RCX:
ffffffffa386cc43
[ 286.469019][ C0] RDX:
1ffff11027fa68b6 RSI:
0000000000000008 RDI:
ffff88813fd34100
[ 286.469545][ C0] RBP:
ffff88813fd34160 R08:
0000000000000000 R09:
ffffed1027fa682c
[ 286.470072][ C0] R10:
ffff88813fd34167 R11:
0000000000000000 R12:
ffff88813fd344c3
[ 286.470585][ C0] R13:
ffff88813fd34112 R14:
ffff88813fd34aec R15:
ffff888132cdd268
[ 286.471070][ C0] FS:
00007fe3c04c7d80(0000) GS:
ffff88840743f000(0000) knlGS:
0000000000000000
[ 286.471644][ C0] CS: 0010 DS: 0000 ES: 0000 CR0:
0000000080050033
[ 286.472543][ C0] CR2:
ffffffffffffffd6 CR3:
000000012daca000 CR4:
00000000000006f0
[ 286.473500][ C0] DR0:
0000000000000000 DR1:
0000000000000000 DR2:
0000000000000000
[ 286.474467][ C0] DR3:
0000000000000000 DR6:
00000000ffff07f0 DR7:
0000000000000400
[ 286.475453][ C0] Call Trace:
[ 286.476102][ C0] <IRQ>
[ 286.476719][ C0] tcp_fin+0x2bb/0x440
[ 286.477429][ C0] tcp_data_queue+0x190f/0x4e60
[ 286.478174][ C0] ? __build_skb_around+0x234/0x330
[ 286.478940][ C0] ? rcu_is_watching+0x11/0xb0
[ 286.479659][ C0] ? __pfx_tcp_data_queue+0x10/0x10
[ 286.480431][ C0] ? tcp_try_undo_loss+0x640/0x6c0
[ 286.481196][ C0] ? seqcount_lockdep_reader_access.constprop.0+0x82/0x90
[ 286.482046][ C0] ? kvm_clock_get_cycles+0x14/0x30
[ 286.482769][ C0] ? ktime_get+0x66/0x150
[ 286.483433][ C0] ? rcu_is_watching+0x11/0xb0
[ 286.484146][ C0] tcp_rcv_established+0x6e4/0x2050
[ 286.484857][ C0] ? rcu_is_watching+0x11/0xb0
[ 286.485523][ C0] ? ipv4_dst_check+0x160/0x2b0
[ 286.486203][ C0] ? __pfx_tcp_rcv_established+0x10/0x10
[ 286.486917][ C0] ? lock_release+0x217/0x2c0
[ 286.487595][ C0] tcp_v4_do_rcv+0x4d6/0x9b0
[ 286.488279][ C0] tcp_v4_rcv+0x2af8/0x3e30
[ 286.488904][ C0] ? raw_local_deliver+0x51b/0xad0
[ 286.489551][ C0] ? rcu_is_watching+0x11/0xb0
[ 286.490198][ C0] ? __pfx_tcp_v4_rcv+0x10/0x10
[ 286.490813][ C0] ? __pfx_raw_local_deliver+0x10/0x10
[ 286.491487][ C0] ? __pfx_nf_confirm+0x10/0x10 [nf_conntrack]
[ 286.492275][ C0] ? rcu_is_watching+0x11/0xb0
[ 286.492900][ C0] ip_protocol_deliver_rcu+0x8f/0x370
[ 286.493579][ C0] ip_local_deliver_finish+0x297/0x420
[ 286.494268][ C0] ip_local_deliver+0x168/0x430
[ 286.494867][ C0] ? __pfx_ip_local_deliver+0x10/0x10
[ 286.495498][ C0] ? __pfx_ip_local_deliver_finish+0x10/0x10
[ 286.496204][ C0] ? ip_rcv_finish_core+0x19a/0x1f20
[ 286.496806][ C0] ? lock_release+0x217/0x2c0
[ 286.497414][ C0] ip_rcv+0x455/0x6e0
[ 286.497945][ C0] ? __pfx_ip_rcv+0x10/0x10
[ 286.498550][ C0] ? rcu_is_watching+0x11/0xb0
[ 286.499137][ C0] ? __pfx_ip_rcv_finish+0x10/0x10
[ 286.499763][ C0] ? lock_release+0x217/0x2c0
[ 286.500327][ C0] ? dl_scaled_delta_exec+0xd1/0x2c0
[ 286.500922][ C0] ? __pfx_ip_rcv+0x10/0x10
[ 286.501480][ C0] __netif_receive_skb_one_core+0x166/0x1b0
[ 286.502173][ C0] ? __pfx___netif_receive_skb_one_core+0x10/0x10
[ 286.502903][ C0] ? lock_acquire+0x2b2/0x310
[ 286.503487][ C0] ? process_backlog+0x372/0x1350
[ 286.504087][ C0] ? lock_release+0x217/0x2c0
[ 286.504642][ C0] process_backlog+0x3b9/0x1350
[ 286.505214][ C0] ? process_backlog+0x372/0x1350
[ 286.505779][ C0] __napi_poll.constprop.0+0xa6/0x490
[ 286.506363][ C0] net_rx_action+0x92e/0xe10
[ 286.506889][ C0] ? __pfx_net_rx_action+0x10/0x10
[ 286.507437][ C0] ? timerqueue_add+0x1f0/0x320
[ 286.507977][ C0] ? sched_clock_cpu+0x68/0x540
[ 286.508492][ C0] ? lock_acquire+0x2b2/0x310
[ 286.509043][ C0] ? kvm_sched_clock_read+0xd/0x20
[ 286.509607][ C0] ? handle_softirqs+0x1aa/0x7d0
[ 286.510187][ C0] handle_softirqs+0x1f2/0x7d0
[ 286.510754][ C0] ? __pfx_handle_softirqs+0x10/0x10
[ 286.511348][ C0] ? irqtime_account_irq+0x181/0x290
[ 286.511937][ C0] ? __dev_queue_xmit+0x85d/0x3450
[ 286.512510][ C0] do_softirq.part.0+0x89/0xc0
[ 286.513100][ C0] </IRQ>
[ 286.513548][ C0] <TASK>
[ 286.513953][ C0] __local_bh_enable_ip+0x112/0x140
[ 286.514522][ C0] ? __dev_queue_xmit+0x85d/0x3450
[ 286.515072][ C0] __dev_queue_xmit+0x872/0x3450
[ 286.515619][ C0] ? nft_do_chain+0xe16/0x15b0 [nf_tables]
[ 286.516252][ C0] ? __pfx___dev_queue_xmit+0x10/0x10
[ 286.516817][ C0] ? selinux_ip_postroute+0x43c/0xc50
[ 286.517433][ C0] ? __pfx_selinux_ip_postroute+0x10/0x10
[ 286.518061][ C0] ? rcu_is_watching+0x11/0xb0
[ 286.518606][ C0] ? ip_output+0x164/0x4a0
[ 286.519149][ C0] ? rcu_is_watching+0x11/0xb0
[ 286.519671][ C0] ? ip_finish_output2+0x17d5/0x1fb0
[ 286.520258][ C0] ip_finish_output2+0xb4b/0x1fb0
[ 286.520787][ C0] ? __pfx_ip_finish_output2+0x10/0x10
[ 286.521355][ C0] ? __ip_finish_output+0x15d/0x750
[ 286.521890][ C0] ip_output+0x164/0x4a0
[ 286.522372][ C0] ? __pfx_ip_output+0x10/0x10
[ 286.522872][ C0] ? rcu_is_watching+0x11/0xb0
[ 286.523402][ C0] ? _raw_spin_unlock_irqrestore+0x4c/0x60
[ 286.524031][ C0] ? __pfx_ip_finish_output+0x10/0x10
[ 286.524605][ C0] ? __ip_queue_xmit+0x999/0x2260
[ 286.525200][ C0] ? rcu_is_watching+0x11/0xb0
[ 286.525744][ C0] ? ipv4_dst_check+0x16a/0x2b0
[ 286.526279][ C0] ? lock_release+0x217/0x2c0
[ 286.526793][ C0] __ip_queue_xmit+0x1883/0x2260
[ 286.527324][ C0] ? __skb_clone+0x54c/0x730
[ 286.527827][ C0] __tcp_transmit_skb+0x209b/0x37a0
[ 286.528374][ C0] ? __pfx___tcp_transmit_skb+0x10/0x10
[ 286.528952][ C0] ? rcu_is_watching+0x11/0xb0
[ 286.529472][ C0] ? seqcount_lockdep_reader_access.constprop.0+0x82/0x90
[ 286.530152][ C0] ? trace_hardirqs_on+0x12/0x120
[ 286.530691][ C0] tcp_write_xmit+0xb81/0x88b0
[ 286.531224][ C0] ? mod_memcg_state+0x4d/0x60
[ 286.531736][ C0] ? rcu_is_watching+0x11/0xb0
[ 286.532253][ C0] __tcp_push_pending_frames+0x90/0x320
[ 286.532826][ C0] tcp_send_fin+0x141/0xb50
[ 286.533352][ C0] ? __pfx_tcp_send_fin+0x10/0x10
[ 286.533908][ C0] ? __local_bh_enable_ip+0xab/0x140
[ 286.534495][ C0] inet_shutdown+0x243/0x320
[ 286.535077][ C0] nvme_tcp_alloc_queue+0xb3b/0x2590 [nvme_tcp]
[ 286.535709][ C0] ? do_raw_spin_lock+0x129/0x260
[ 286.536314][ C0] ? __pfx_nvme_tcp_alloc_queue+0x10/0x10 [nvme_tcp]
[ 286.536996][ C0] ? do_raw_spin_unlock+0x54/0x1e0
[ 286.537550][ C0] ? _raw_spin_unlock+0x29/0x50
[ 286.538127][ C0] ? do_raw_spin_lock+0x129/0x260
[ 286.538664][ C0] ? __pfx_do_raw_spin_lock+0x10/0x10
[ 286.539249][ C0] ? nvme_tcp_alloc_admin_queue+0xd5/0x340 [nvme_tcp]
[ 286.539892][ C0] ? __wake_up+0x40/0x60
[ 286.540392][ C0] nvme_tcp_alloc_admin_queue+0xd5/0x340 [nvme_tcp]
[ 286.541047][ C0] ? rcu_is_watching+0x11/0xb0
[ 286.541589][ C0] nvme_tcp_setup_ctrl+0x8b/0x7a0 [nvme_tcp]
[ 286.542254][ C0] ? _raw_spin_unlock_irqrestore+0x4c/0x60
[ 286.542887][ C0] ? __pfx_nvme_tcp_setup_ctrl+0x10/0x10 [nvme_tcp]
[ 286.543568][ C0] ? trace_hardirqs_on+0x12/0x120
[ 286.544166][ C0] ? _raw_spin_unlock_irqrestore+0x35/0x60
[ 286.544792][ C0] ? nvme_change_ctrl_state+0x196/0x2e0 [nvme_core]
[ 286.545477][ C0] nvme_tcp_create_ctrl+0x839/0xb90 [nvme_tcp]
[ 286.546126][ C0] nvmf_dev_write+0x3db/0x7e0 [nvme_fabrics]
[ 286.546775][ C0] ? rw_verify_area+0x69/0x520
[ 286.547334][ C0] vfs_write+0x218/0xe90
[ 286.547854][ C0] ? do_syscall_64+0x9f/0x190
[ 286.548408][ C0] ? trace_hardirqs_on_prepare+0xdb/0x120
[ 286.549037][ C0] ? syscall_exit_to_user_mode+0x93/0x280
[ 286.549659][ C0] ? __pfx_vfs_write+0x10/0x10
[ 286.550259][ C0] ? do_syscall_64+0x9f/0x190
[ 286.550840][ C0] ? syscall_exit_to_user_mode+0x8e/0x280
[ 286.551516][ C0] ? trace_hardirqs_on_prepare+0xdb/0x120
[ 286.552180][ C0] ? syscall_exit_to_user_mode+0x93/0x280
[ 286.552834][ C0] ? ksys_read+0xf5/0x1c0
[ 286.553386][ C0] ? __pfx_ksys_read+0x10/0x10
[ 286.553964][ C0] ksys_write+0xf5/0x1c0
[ 286.554499][ C0] ? __pfx_ksys_write+0x10/0x10
[ 286.555072][ C0] ? trace_hardirqs_on_prepare+0xdb/0x120
[ 286.555698][ C0] ? syscall_exit_to_user_mode+0x93/0x280
[ 286.556319][ C0] ? do_syscall_64+0x54/0x190
[ 286.556866][ C0] do_syscall_64+0x93/0x190
[ 286.557420][ C0] ? rcu_read_unlock+0x17/0x60
[ 286.557986][ C0] ? rcu_is_watching+0x11/0xb0
[ 286.558526][ C0] ? lock_release+0x217/0x2c0
[ 286.559087][ C0] ? rcu_is_watching+0x11/0xb0
[ 286.559659][ C0] ? count_memcg_events.constprop.0+0x4a/0x60
[ 286.560476][ C0] ? exc_page_fault+0x7a/0x110
[ 286.561064][ C0] ? rcu_is_watching+0x11/0xb0
[ 286.561647][ C0] ? lock_release+0x217/0x2c0
[ 286.562257][ C0] ? do_user_addr_fault+0x171/0xa00
[ 286.562839][ C0] ? do_user_addr_fault+0x4a2/0xa00
[ 286.563453][ C0] ? irqentry_exit_to_user_mode+0x84/0x270
[ 286.564112][ C0] ? rcu_is_watching+0x11/0xb0
[ 286.564677][ C0] ? irqentry_exit_to_user_mode+0x84/0x270
[ 286.565317][ C0] ? trace_hardirqs_on_prepare+0xdb/0x120
[ 286.565922][ C0] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 286.566542][ C0] RIP: 0033:0x7fe3c05e6504
[ 286.567102][ C0] Code: c7 00 16 00 00 00 b8 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 f3 0f 1e fa 80 3d c5 8b 10 00 00 74 13 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 54 c3 0f 1f 00 55 48 89 e5 48 83 ec 20 48 89
[ 286.568931][ C0] RSP: 002b:
00007fff76444f58 EFLAGS:
00000202 ORIG_RAX:
0000000000000001
[ 286.569807][ C0] RAX:
ffffffffffffffda RBX:
000000003b40d930 RCX:
00007fe3c05e6504
[ 286.570621][ C0] RDX:
00000000000000cf RSI:
000000003b40d930 RDI:
0000000000000003
[ 286.571443][ C0] RBP:
0000000000000003 R08:
00000000000000cf R09:
000000003b40d930
[ 286.572246][ C0] R10:
0000000000000000 R11:
0000000000000202 R12:
000000003b40cd60
[ 286.573069][ C0] R13:
00000000000000cf R14:
00007fe3c07417f8 R15:
00007fe3c073502e
[ 286.573886][ C0] </TASK>
Closes: https://lore.kernel.org/linux-nvme/5hdonndzoqa265oq3bj6iarwtfk5dewxxjtbjvn5uqnwclpwt6@a2n6w3taxxex/
Signed-off-by: Alistair Francis <alistair.francis@wdc.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Alistair Francis [Tue, 29 Apr 2025 22:23:47 +0000 (08:23 +1000)]
nvmet-tcp: select CONFIG_TLS from CONFIG_NVME_TARGET_TCP_TLS
Ensure that TLS support is enabled in the kernel when
CONFIG_NVME_TARGET_TCP_TLS is enabled. Without this the code compiles,
but does not actually work unless something else enables CONFIG_TLS.
Fixes:
675b453e0241 ("nvmet-tcp: enable TLS handshake upcall")
Signed-off-by: Alistair Francis <alistair.francis@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Alistair Francis [Tue, 29 Apr 2025 22:40:25 +0000 (08:40 +1000)]
nvme-tcp: select CONFIG_TLS from CONFIG_NVME_TCP_TLS
Ensure that TLS support is enabled in the kernel when
CONFIG_NVME_TCP_TLS is enabled. Without this the code compiles, but does
not actually work unless something else enables CONFIG_TLS.
Fixes:
be8e82caa68 ("nvme-tcp: enable TLS handshake upcall")
Signed-off-by: Alistair Francis <alistair.francis@wdc.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Michael Liang [Tue, 29 Apr 2025 16:42:01 +0000 (10:42 -0600)]
nvme-tcp: fix premature queue removal and I/O failover
This patch addresses a data corruption issue observed in nvme-tcp during
testing.
In an NVMe native multipath setup, when an I/O timeout occurs, all
inflight I/Os are canceled almost immediately after the kernel socket is
shut down. These canceled I/Os are reported as host path errors,
triggering a failover that succeeds on a different path.
However, at this point, the original I/O may still be outstanding in the
host's network transmission path (e.g., the NIC’s TX queue). From the
user-space app's perspective, the buffer associated with the I/O is
considered completed since they're acked on the different path and may
be reused for new I/O requests.
Because nvme-tcp enables zero-copy by default in the transmission path,
this can lead to corrupted data being sent to the original target,
ultimately causing data corruption.
We can reproduce this data corruption by injecting delay on one path and
triggering i/o timeout.
To prevent this issue, this change ensures that all inflight
transmissions are fully completed from host's perspective before
returning from queue stop. To handle concurrent I/O timeout from multiple
namespaces under the same controller, always wait in queue stop
regardless of queue's state.
This aligns with the behavior of queue stopping in other NVMe fabric
transports.
Fixes:
3f2304f8c6d6 ("nvme-tcp: add NVMe over TCP host driver")
Signed-off-by: Michael Liang <mliang@purestorage.com>
Reviewed-by: Mohamed Khalfella <mkhalfella@purestorage.com>
Reviewed-by: Randy Jennings <randyj@purestorage.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Wentao Guan [Thu, 24 Apr 2025 02:40:10 +0000 (10:40 +0800)]
nvme-pci: add quirks for WDC Blue SN550 15b7:5009
Add two quirks for the WDC Blue SN550 (PCI ID 15b7:5009) based on user
reports and hardware analysis:
- NVME_QUIRK_NO_DEEPEST_PS:
liaozw talked to me the problem and solved with
nvme_core.default_ps_max_latency_us=0, so add the quirk.
I also found some reports in the following link.
- NVME_QUIRK_BROKEN_MSI:
after get the lspci from Jack Rio.
I think that the disk also have NVME_QUIRK_BROKEN_MSI.
described in commit
d5887dc6b6c0 ("nvme-pci: Add quirk for broken MSIs")
as sean said in link which match the MSI 1/32 and MSI-X 17.
Log:
lspci -nn | grep -i memory
03:00.0 Non-Volatile memory controller [0108]: Sandisk Corp SanDisk Ultra 3D / WD PC SN530, IX SN530, Blue SN550 NVMe SSD (DRAM-less) [15b7:5009] (rev 01)
lspci -v -d 15b7:5009
03:00.0 Non-Volatile memory controller: Sandisk Corp SanDisk Ultra 3D / WD PC SN530, IX SN530, Blue SN550 NVMe SSD (DRAM-less) (rev 01) (prog-if 02 [NVM Express])
Subsystem: Sandisk Corp WD Blue SN550 NVMe SSD
Flags: bus master, fast devsel, latency 0, IRQ 35, IOMMU group 10
Memory at
fe800000 (64-bit, non-prefetchable) [size=16K]
Memory at
fe804000 (64-bit, non-prefetchable) [size=256]
Capabilities: [80] Power Management version 3
Capabilities: [90] MSI: Enable- Count=1/32 Maskable- 64bit+
Capabilities: [b0] MSI-X: Enable+ Count=17 Masked-
Capabilities: [c0] Express Endpoint, MSI 00
Capabilities: [100] Advanced Error Reporting
Capabilities: [150] Device Serial Number 00-00-00-00-00-00-00-00
Capabilities: [1b8] Latency Tolerance Reporting
Capabilities: [300] Secondary PCI Express
Capabilities: [900] L1 PM Substates
Kernel driver in use: nvme
dmesg | grep nvme
[ 0.000000] Command line: BOOT_IMAGE=/vmlinuz-6.12.20-amd64-desktop-rolling root=UUID= ro splash quiet nvme_core.default_ps_max_latency_us=0 DEEPIN_GFXMODE=
[ 0.059301] Kernel command line: BOOT_IMAGE=/vmlinuz-6.12.20-amd64-desktop-rolling root=UUID= ro splash quiet nvme_core.default_ps_max_latency_us=0 DEEPIN_GFXMODE=
[ 0.542430] nvme nvme0: pci function 0000:03:00.0
[ 0.560426] nvme nvme0: allocated 32 MiB host memory buffer.
[ 0.562491] nvme nvme0: 16/0/0 default/read/poll queues
[ 0.567764] nvme0n1: p1 p2 p3 p4 p5 p6 p7 p8 p9
[ 6.388726] EXT4-fs (nvme0n1p7): mounted filesystem ro with ordered data mode. Quota mode: none.
[ 6.893421] EXT4-fs (nvme0n1p7): re-mounted r/w. Quota mode: none.
[ 7.125419] Adding 16777212k swap on /dev/nvme0n1p8. Priority:-2 extents:1 across:16777212k SS
[ 7.157588] EXT4-fs (nvme0n1p6): mounted filesystem r/w with ordered data mode. Quota mode: none.
[ 7.165021] EXT4-fs (nvme0n1p9): mounted filesystem r/w with ordered data mode. Quota mode: none.
[ 8.036932] nvme nvme0: using unchecked data buffer
[ 8.096023] block nvme0n1: No UUID available providing old NGUID
Link: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d5887dc6b6c054d0da3cd053afc15b7be1f45ff6
Link: https://lore.kernel.org/all/20240422162822.3539156-1-sean.anderson@linux.dev/
Reported-by: liaozw <hedgehog-002@163.com>
Closes: https://bbs.deepin.org.cn/post/286300
Reported-by: rugk <rugk+github@posteo.de>
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=208123
Signed-off-by: Wentao Guan <guanwentao@uniontech.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Wentao Guan [Tue, 22 Apr 2025 12:17:25 +0000 (20:17 +0800)]
nvme-pci: add quirks for device 126f:1001
This commit adds NVME_QUIRK_NO_DEEPEST_PS and NVME_QUIRK_BOGUS_NID for
device [126f:1001].
It is similar to commit
e89086c43f05 ("drivers/nvme: Add quirks for
device 126f:2262")
Diff is according the dmesg, use NVME_QUIRK_IGNORE_DEV_SUBNQN.
dmesg | grep -i nvme0:
nvme nvme0: pci function 0000:01:00.0
nvme nvme0: missing or invalid SUBNQN field.
nvme nvme0: 12/0/0 default/read/poll queues
Link:https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=
e89086c43f0500bc7c4ce225495b73b8ce234c1f
Signed-off-by: Wentao Guan <guanwentao@uniontech.com>
Signed-off-by: WangYuli <wangyuli@uniontech.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Keith Busch [Thu, 24 Apr 2025 17:18:01 +0000 (10:18 -0700)]
nvme-pci: fix queue unquiesce check on slot_reset
A zero return means the reset was successfully scheduled. We don't want
to unquiesce the queues while the reset_work is pending, as that will
just flush out requeued requests to a failed completion.
Fixes:
71a5bb153be104 ("nvme: ensure disabling pairs with unquiesce")
Reported-by: Dhankaran Singh Ajravat <dhankaran@meta.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Ming Lei [Tue, 29 Apr 2025 02:29:39 +0000 (10:29 +0800)]
ublk: remove the check of ublk_need_req_ref() from __ublk_check_and_get_req
__ublk_check_and_get_req() is only called from ublk_check_and_get_req()
and ublk_register_io_buf(), the same check has been covered in the two
calling sites.
So remove the check from __ublk_check_and_get_req().
Suggested-by: Caleb Sander Mateos <csander@purestorage.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250429022941.1718671-5-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Tue, 29 Apr 2025 02:29:38 +0000 (10:29 +0800)]
ublk: enhance check for register/unregister io buffer command
The simple check of UBLK_IO_FLAG_OWNED_BY_SRV can avoid incorrect
register/unregister io buffer easily, so check it before calling
starting to register/un-register io buffer.
Also only allow io buffer register/unregister uring_cmd in case of
UBLK_F_SUPPORT_ZERO_COPY.
Also mark argument 'ublk_queue *' of ublk_register_io_buf as const.
Reviewed-by: Caleb Sander Mateos <csander@purestorage.com>
Fixes:
1f6540e2aabb ("ublk: zc register/unregister bvec")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250429022941.1718671-4-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Tue, 29 Apr 2025 02:29:37 +0000 (10:29 +0800)]
ublk: decouple zero copy from user copy
UBLK_F_USER_COPY and UBLK_F_SUPPORT_ZERO_COPY are two different
features, and shouldn't be coupled together.
Commit
1f6540e2aabb ("ublk: zc register/unregister bvec") enables
user copy automatically in case of UBLK_F_SUPPORT_ZERO_COPY, this way
isn't correct.
So decouple zero copy from user copy, and use independent helper to
check each one.
Fixes:
1f6540e2aabb ("ublk: zc register/unregister bvec")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250429022941.1718671-3-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Tue, 29 Apr 2025 02:29:36 +0000 (10:29 +0800)]
selftests: ublk: fix UBLK_F_NEED_GET_DATA
Commit
57e13a2e8cd2 ("selftests: ublk: support user recovery") starts to
support UBLK_F_NEED_GET_DATA for covering recovery feature, however the
ublk utility implementation isn't done correctly.
Fix it by supporting UBLK_F_NEED_GET_DATA correctly.
Also add test generic_07 for covering UBLK_F_NEED_GET_DATA.
Reviewed-by: Caleb Sander Mateos <csander@purestorage.com>
Fixes:
57e13a2e8cd2 ("selftests: ublk: support user recovery")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250429022941.1718671-2-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Darrick J. Wong [Wed, 23 Apr 2025 19:54:13 +0000 (12:54 -0700)]
xfs: stop using set_blocksize
XFS has its own buffer cache for metadata that uses submit_bio, which
means that it no longer uses the block device pagecache for anything.
Create a more lightweight helper that runs the blocksize checks and
flushes dirty data and use that instead. No more truncating the
pagecache because XFS does not use it or care about it.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
Carlos Maiolino [Mon, 28 Apr 2025 09:32:06 +0000 (11:32 +0200)]
Merge remote-tracking branch 'linux-block/block-6.15' into xfs tree
We need two patches inside linux-block tree as dependencies of the patch
which will follow this merge.
Specifically, we need:
block: fix race between set_blocksize and read paths
block: hoist block size validation code to a separate function
Signed-off-by: Carlos Maiolino <cem@kernel.org>
Ming Lei [Fri, 25 Apr 2025 01:37:40 +0000 (09:37 +0800)]
ublk: fix race between io_uring_cmd_complete_in_task and ublk_cancel_cmd
ublk_cancel_cmd() calls io_uring_cmd_done() to complete uring_cmd, but
we may have scheduled task work via io_uring_cmd_complete_in_task() for
dispatching request, then kernel crash can be triggered.
Fix it by not trying to canceling the command if ublk block request is
started.
Fixes:
216c8f5ef0f2 ("ublk: replace monitor with cancelable uring_cmd")
Reported-by: Jared Holzman <jholzman@nvidia.com>
Tested-by: Jared Holzman <jholzman@nvidia.com>
Closes: https://lore.kernel.org/linux-block/
d2179120-171b-47ba-b664-
23242981ef19@nvidia.com/
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250425013742.1079549-3-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Fri, 25 Apr 2025 01:37:39 +0000 (09:37 +0800)]
ublk: call ublk_dispatch_req() for handling UBLK_U_IO_NEED_GET_DATA
We call io_uring_cmd_complete_in_task() to schedule task_work for handling
UBLK_U_IO_NEED_GET_DATA.
This way is really not necessary because the current context is exactly
the ublk queue context, so call ublk_dispatch_req() directly for handling
UBLK_U_IO_NEED_GET_DATA.
Fixes:
216c8f5ef0f2 ("ublk: replace monitor with cancelable uring_cmd")
Tested-by: Jared Holzman <jholzman@nvidia.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250425013742.1079549-2-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Wed, 23 Apr 2025 05:37:42 +0000 (07:37 +0200)]
block: don't autoload drivers on blk-cgroup configuration
Loading a driver just to configure blk-cgroup doesn't make sense, as that
assumes and already existing device.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20250423053810.1683309-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Wed, 23 Apr 2025 05:37:41 +0000 (07:37 +0200)]
block: don't autoload drivers on stat
blkdev_get_no_open can trigger the legacy autoload of block drivers. A
simple stat of a block device has not historically done that, so disable
this behavior again.
Fixes:
9abcfbd235f5 ("block: Add atomic write support for statx")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20250423053810.1683309-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Wed, 23 Apr 2025 05:37:40 +0000 (07:37 +0200)]
block: remove the backing_inode variable in bdev_statx
backing_inode is only used once, so remove it and update the comment
describing the bdev lookup to be a bit more clear.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20250423053810.1683309-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Wed, 23 Apr 2025 05:37:39 +0000 (07:37 +0200)]
block: move blkdev_{get,put} _no_open prototypes out of blkdev.h
These are only to be used by block internal code. Remove the comment
as we grew more users due to reworking block device node opening.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20250423053810.1683309-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Thu, 24 Apr 2025 08:25:21 +0000 (10:25 +0200)]
block: never reduce ra_pages in blk_apply_bdi_limits
When the user increased the read-ahead size through sysfs this value
currently get lost if the device is reprobe, including on a resume
from suspend.
As there is no hardware limitation for the read-ahead size there is
no real need to reset it or track a separate hardware limitation
like for max_sectors.
This restores the pre-atomic queue limit behavior in the sd driver as
sd did not use blk_queue_io_opt and thus never updated the read ahead
size to the value based of the optimal I/O, but changes behavior for
all other drivers. As the new behavior seems useful and sd is the
driver for which the readahead size tweaks are most useful that seems
like a worthwhile trade off.
Fixes:
804e498e0496 ("sd: convert to the atomic queue limits API")
Reported-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://lore.kernel.org/r/20250424082521.1967286-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Uday Shankar [Wed, 23 Apr 2025 21:29:03 +0000 (15:29 -0600)]
selftests: ublk: common: fix _get_disk_dev_t for pre-9.0 coreutils
Some distributions, such as centos stream 9, still have a version of
coreutils which does not yet support the %Hr and %Lr formats for stat(1)
[1, 2]. Running ublk selftests on these distributions results in the
following error in tests that use the _get_disk_dev_t helper:
line 23: ?r: syntax error: operand expected (error token is "?r")
To better accommodate older distributions, rewrite _get_disk_dev_t to
use the much older %t and %T formats for stat instead.
[1] https://github.com/coreutils/coreutils/blob/v9.0/NEWS#L114
[2] https://pkgs.org/download/coreutils
Signed-off-by: Uday Shankar <ushankar@purestorage.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250423-ublk_selftests-v1-2-7d060e260e76@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Thu, 24 Apr 2025 12:27:54 +0000 (06:27 -0600)]
Merge tag 'nvme-6.15-2025-04-24' of git://git.infradead.org/nvme into block-6.15
Pull NVMe fix from Christoph:
"nvme fixes for Linux 6.15
- fix an out-of-bounds access in nvmet_enable_port (Richard Weinberger)"
* tag 'nvme-6.15-2025-04-24' of git://git.infradead.org/nvme:
nvmet: fix out-of-bounds access in nvmet_enable_port
Ming Lei [Mon, 21 Apr 2025 23:59:42 +0000 (07:59 +0800)]
selftests: ublk: remove useless 'delay_us' from 'struct dev_ctx'
'delay_us' shouldn't be added to 'struct dev_ctx' since now it is
handled by per-target command line & 'struct fault_inject_ctx'.
So remove it.
Fixes:
81586652bb1f ("selftests: ublk: add generic_06 for covering fault inject")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Uday Shankar <ushankar@purestorage.com>
Link: https://lore.kernel.org/r/20250421235947.715272-3-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Mon, 21 Apr 2025 23:59:41 +0000 (07:59 +0800)]
selftests: ublk: fix recover test
When adding recovery test:
- 'break' is missed for handling '-g' argument
- test name of test_generic_05.sh is wrong
So fix the two.
Fixes:
57e13a2e8cd2 ("selftests: ublk: support user recovery")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Uday Shankar <ushankar@purestorage.com>
Link: https://lore.kernel.org/r/20250421235947.715272-2-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Darrick J. Wong [Wed, 23 Apr 2025 19:53:57 +0000 (12:53 -0700)]
block: hoist block size validation code to a separate function
Hoist the block size validation code to bdev_validate_blocksize so that
we can call it from filesystems that don't care about the bdev pagecache
manipulations of set_blocksize.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/174543795720.4139148.840349813093799165.stgit@frogsfrogsfrogs
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Darrick J. Wong [Wed, 23 Apr 2025 19:53:42 +0000 (12:53 -0700)]
block: fix race between set_blocksize and read paths
With the new large sector size support, it's now the case that
set_blocksize can change i_blksize and the folio order in a manner that
conflicts with a concurrent reader and causes a kernel crash.
Specifically, let's say that udev-worker calls libblkid to detect the
labels on a block device. The read call can create an order-0 folio to
read the first 4096 bytes from the disk. But then udev is preempted.
Next, someone tries to mount an 8k-sectorsize filesystem from the same
block device. The filesystem calls set_blksize, which sets i_blksize to
8192 and the minimum folio order to 1.
Now udev resumes, still holding the order-0 folio it allocated. It then
tries to schedule a read bio and do_mpage_readahead tries to create
bufferheads for the folio. Unfortunately, blocks_per_folio == 0 because
the page size is 4096 but the blocksize is 8192 so no bufferheads are
attached and the bh walk never sets bdev. We then submit the bio with a
NULL block device and crash.
Therefore, truncate the page cache after flushing but before updating
i_blksize. However, that's not enough -- we also need to lock out file
IO and page faults during the update. Take both the i_rwsem and the
invalidate_lock in exclusive mode for invalidations, and in shared mode
for read/write operations.
I don't know if this is the correct fix, but xfs/259 found it.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Link: https://lore.kernel.org/r/174543795699.4139148.2086129139322431423.stgit@frogsfrogsfrogs
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Hans Holmberg [Tue, 22 Apr 2025 11:50:07 +0000 (11:50 +0000)]
xfs: remove duplicate Zoned Filesystems sections in admin-guide
Remove the duplicated section and while at it, turn spaces into tabs.
Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Fixes:
c7b67ddc3c99 ("xfs: document zoned rt specifics in admin-guide")
Signed-off-by: Carlos Maiolino <cem@kernel.org>
Carlos Maiolino [Tue, 22 Apr 2025 12:54:54 +0000 (14:54 +0200)]
XFS: fix zoned gc threshold math for 32-bit arches
xfs_zoned_need_gc makes use of mult_frac() to calculate the threshold
for triggering the zoned garbage collector, but, turns out mult_frac()
doesn't properly work with 64-bit data types and this caused build
failures on some 32-bit architectures.
Fix this by essentially open coding mult_frac() in a 64-bit friendly
way.
Notice we don't need to bother with counters underflow here because
xfs_estimate_freecounter() will always return a positive value, as it
leverages percpu_counter_read_positive to read such counters.
Fixes:
845abeb1f06a ("xfs: add tunable threshold parameter for triggering zone GC")
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/
202504181233.F7D9Atra-lkp@intel.com/
Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
Tested-by: Guenter Roeck <linux@roeck-us.net>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hans Holmberg <hans.holmberg@wdc.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
Richard Weinberger [Fri, 18 Apr 2025 08:02:50 +0000 (10:02 +0200)]
nvmet: fix out-of-bounds access in nvmet_enable_port
When trying to enable a port that has no transport configured yet,
nvmet_enable_port() uses NVMF_TRTYPE_MAX (255) to query the transports
array, causing an out-of-bounds access:
[ 106.058694] BUG: KASAN: global-out-of-bounds in nvmet_enable_port+0x42/0x1da
[ 106.058719] Read of size 8 at addr
ffffffff89dafa58 by task ln/632
[...]
[ 106.076026] nvmet: transport type 255 not supported
Since commit
200adac75888, NVMF_TRTYPE_MAX is the default state as configured by
nvmet_ports_make().
Avoid this by checking for NVMF_TRTYPE_MAX before proceeding.
Fixes:
200adac75888 ("nvme: Add PCI transport type")
Signed-off-by: Richard Weinberger <richard@nod.at>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Jens Axboe [Thu, 17 Apr 2025 12:18:49 +0000 (06:18 -0600)]
Merge tag 'nvme-6.15-2025-04-17' of git://git.infradead.org/nvme into block-6.15
Pull NVMe fixes from Christoph:
"nvme fixes for Linux 6.15
- fix scan failure for non-ANA multipath controllers (Hannes Reinecke)
- fix multipath sysfs links creation for some cases (Hannes Reinecke)
- PCIe endpoint fixes (Damien Le Moal)
- use NULL instead of 0 in the auth code (Damien Le Moal)"
* tag 'nvme-6.15-2025-04-17' of git://git.infradead.org/nvme:
nvmet: pci-epf: cleanup link state management
nvmet: pci-epf: clear CC and CSTS when disabling the controller
nvmet: pci-epf: always fully initialize completion entries
nvmet: auth: use NULL to clear a pointer in nvmet_auth_sq_free()
nvme-multipath: sysfs links may not be created for devices
nvme: fixup scan failure for non-ANA multipath controllers
Hans Holmberg [Wed, 9 Apr 2025 12:39:56 +0000 (12:39 +0000)]
xfs: document zoned rt specifics in admin-guide
Document the lifetime, nolifetime and max_open_zones mount options
added for zoned rt file systems.
Also add documentation describing the max_open_zones sysfs attribute
exposed in /sys/fs/xfs/<dev>/zoned/
Fixes:
4e4d52075577 ("xfs: add the zoned space allocator")
Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
Jens Axboe [Thu, 17 Apr 2025 03:08:28 +0000 (21:08 -0600)]
Merge tag 'md-6.15-
20250416' of https://git./linux/kernel/git/mdraid/linux into block-6.15
Pull MD fixes from Yu:
"- fix raid10 missing discard IO accounting (Yu Kuai)
- fix bitmap stats for bitmap file (Zheng Qixing)
- fix oops while reading all member disks failed during check/repair
(Meir Elisha)"
* tag 'md-6.15-
20250416' of https://git.kernel.org/pub/scm/linux/kernel/git/mdraid/linux:
md/raid1: Add check for missing source disk in process_checks()
md/md-bitmap: fix stats collection for external bitmaps
md/raid10: fix missing discard IO accounting
Uday Shankar [Wed, 16 Apr 2025 03:54:42 +0000 (11:54 +0800)]
selftests: ublk: add generic_06 for covering fault inject
Add one simple fault inject target, and verify if an application using ublk
device sees an I/O error quickly after the ublk server dies.
Signed-off-by: Uday Shankar <ushankar@purestorage.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250416035444.99569-9-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Wed, 16 Apr 2025 03:54:41 +0000 (11:54 +0800)]
ublk: simplify aborting ublk request
Now ublk_abort_queue() is moved to ublk char device release handler,
meantime our request queue is "quiesced" because either ->canceling was
set from uring_cmd cancel function or all IOs are inflight and can't be
completed by ublk server, things becomes easy much:
- all uring_cmd are done, so we needn't to mark io as UBLK_IO_FLAG_ABORTED
for handling completion from uring_cmd
- ublk char device is closed, no one can hold IO request reference any more,
so we can simply complete this request or requeue it for ublk_nosrv_should_reissue_outstanding.
Reviewed-by: Uday Shankar <ushankar@purestorage.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250416035444.99569-8-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Wed, 16 Apr 2025 03:54:40 +0000 (11:54 +0800)]
ublk: remove __ublk_quiesce_dev()
Remove __ublk_quiesce_dev() and open code for updating device state as
QUIESCED.
We needn't to drain inflight requests in __ublk_quiesce_dev() any more,
because all inflight requests are aborted in ublk char device release
handler.
Also we needn't to set ->canceling in __ublk_quiesce_dev() any more
because it is done unconditionally now in ublk_ch_release().
Reviewed-by: Uday Shankar <ushankar@purestorage.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250416035444.99569-7-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Uday Shankar [Wed, 16 Apr 2025 03:54:39 +0000 (11:54 +0800)]
ublk: improve detection and handling of ublk server exit
There are currently two ways in which ublk server exit is detected by
ublk_drv:
1. uring_cmd cancellation. If there are any outstanding uring_cmds which
have not been completed to the ublk server when it exits, io_uring
calls the uring_cmd callback with a special cancellation flag as the
issuing task is exiting.
2. I/O timeout. This is needed in addition to the above to handle the
"saturated queue" case, when all I/Os for a given queue are in the
ublk server, and therefore there are no outstanding uring_cmds to
cancel when the ublk server exits.
There are a couple of issues with this approach:
- It is complex and inelegant to have two methods to detect the same
condition
- The second method detects ublk server exit only after a long delay
(~30s, the default timeout assigned by the block layer). This delays
the nosrv behavior from kicking in and potential subsequent recovery
of the device.
The second issue is brought to light with the new test_generic_06 which
will be added in following patch. It fails before this fix:
selftests: ublk: test_generic_06.sh
dev id is 0
dd: error writing '/dev/ublkb0': Input/output error
1+0 records in
0+0 records out
0 bytes copied, 30.0611 s, 0.0 kB/s
DEAD
dd took 31 seconds to exit (>= 5s tolerance)!
generic_06 : [FAIL]
Fix this by instead detecting and handling ublk server exit in the
character file release callback. This has several advantages:
- This one place can handle both saturated and unsaturated queues. Thus,
it replaces both preexisting methods of detecting ublk server exit.
- It runs quickly on ublk server exit - there is no 30s delay.
- It starts the process of removing task references in ublk_drv. This is
needed if we want to relax restrictions in the driver like letting
only one thread serve each queue
There is also the disadvantage that the character file release callback
can also be triggered by intentional close of the file, which is a
significant behavior change. Preexisting ublk servers (libublksrv) are
dependent on the ability to open/close the file multiple times. To
address this, only transition to a nosrv state if the file is released
while the ublk device is live. This allows for programs to open/close
the file multiple times during setup. It is still a behavior change if a
ublk server decides to close/reopen the file while the device is LIVE
(i.e. while it is responsible for serving I/O), but that would be highly
unusual. This behavior is in line with what is done by FUSE, which is
very similar to ublk in that a userspace daemon is providing services
traditionally provided by the kernel.
With this change in, the new test (and all other selftests, and all
ublksrv tests) pass:
selftests: ublk: test_generic_06.sh
dev id is 0
dd: error writing '/dev/ublkb0': Input/output error
1+0 records in
0+0 records out
0 bytes copied, 0.
0376731 s, 0.0 kB/s
DEAD
generic_04 : [PASS]
Signed-off-by: Uday Shankar <ushankar@purestorage.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250416035444.99569-6-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Wed, 16 Apr 2025 03:54:38 +0000 (11:54 +0800)]
ublk: move device reset into ublk_ch_release()
ublk_ch_release() is called after ublk char device is closed, when all
uring_cmd are done, so it is perfect fine to move ublk device reset to
ublk_ch_release() from ublk_ctrl_start_recovery().
This way can avoid to grab the exiting daemon task_struct too long.
However, reset of the following ublk IO flags has to be moved until ublk
io_uring queues are ready:
- ubq->canceling
For requeuing IO in case of ublk_nosrv_dev_should_queue_io() before device
is recovered
- ubq->fail_io
For failing IO in case of UBLK_F_USER_RECOVERY_FAIL_IO before device is
recovered
- ublk_io->flags
For preventing using io->cmd
With this way, recovery is simplified a lot.
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250416035444.99569-5-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Wed, 16 Apr 2025 03:54:37 +0000 (11:54 +0800)]
ublk: rely on ->canceling for dealing with ublk_nosrv_dev_should_queue_io
Now ublk deals with ublk_nosrv_dev_should_queue_io() by keeping request
queue as quiesced. This way is fragile because queue quiesce crosses syscalls
or process contexts.
Switch to rely on ubq->canceling for dealing with
ublk_nosrv_dev_should_queue_io(), because it has been used for this purpose
during io_uring context exiting, and it can be reused before recovering too.
In ublk_queue_rq(), the request will be added to requeue list without
kicking off requeue in case of ubq->canceling, and finally requests added in
requeue list will be dispatched from either ublk_stop_dev() or
ublk_ctrl_end_recovery().
Meantime we have to move reset of ubq->canceling from ublk_ctrl_start_recovery()
to ublk_ctrl_end_recovery(), when IO handling can be recovered completely.
Then blk_mq_quiesce_queue() and blk_mq_unquiesce_queue() are always used
in same context.
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Uday Shankar <ushankar@purestorage.com>
Link: https://lore.kernel.org/r/20250416035444.99569-4-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Wed, 16 Apr 2025 03:54:36 +0000 (11:54 +0800)]
ublk: add ublk_force_abort_dev()
Add ublk_force_abort_dev() for handling ublk_nosrv_dev_should_queue_io()
in ublk_stop_dev(). Then queue quiesce and unquiesce can be paired in
single function.
Meantime not change device state to QUIESCED any more, since the disk is
going to be removed soon.
Reviewed-by: Uday Shankar <ushankar@purestorage.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250416035444.99569-3-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Uday Shankar [Wed, 16 Apr 2025 03:54:35 +0000 (11:54 +0800)]
ublk: properly serialize all FETCH_REQs
Most uring_cmds issued against ublk character devices are serialized
because each command affects only one queue, and there is an early check
which only allows a single task (the queue's ubq_daemon) to issue
uring_cmds against that queue. However, this mechanism does not work for
FETCH_REQs, since they are expected before ubq_daemon is set. Since
FETCH_REQs are only used at initialization and not in the fast path,
serialize them using the per-ublk-device mutex. This fixes a number of
data races that were previously possible if a badly behaved ublk server
decided to issue multiple FETCH_REQs against the same qid/tag
concurrently.
Reported-by: Caleb Sander Mateos <csander@purestorage.com>
Signed-off-by: Uday Shankar <ushankar@purestorage.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250416035444.99569-2-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Sat, 12 Apr 2025 02:30:29 +0000 (10:30 +0800)]
selftests: ublk: move creating UBLK_TMP into _prep_test()
test may exit early because of missing program or not having required
feature before calling _prep_test(), then $UBLK_TMP isn't cleaned.
Fix it by moving creating $UBLK_TMP into _prep_test(), any resources
created since _prep_test() will be cleaned by _cleanup_test().
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250412023035.2649275-14-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Sat, 12 Apr 2025 02:30:28 +0000 (10:30 +0800)]
selftests: ublk: add test_stress_05.sh
Add test_stress_05.sh for covering removing device with recovery
enabled.
io-hang has been observed with the following patch:
https://lore.kernel.org/linux-block/
20250403-ublk_timeout-v3-1-
aa09f76c7451@purestorage.com/
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250412023035.2649275-13-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Sat, 12 Apr 2025 02:30:27 +0000 (10:30 +0800)]
selftests: ublk: support user recovery
Add user recovery feature.
Meantime add user recovery test: generic_04 and generic_05(zero copy)
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250412023035.2649275-12-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Sat, 12 Apr 2025 02:30:26 +0000 (10:30 +0800)]
selftests: ublk: support target specific command line
Support target specific command line for making related command line code
handling more readable & clean.
Also helps for adding new features.
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250412023035.2649275-11-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Sat, 12 Apr 2025 02:30:25 +0000 (10:30 +0800)]
selftests: ublk: increase max nr_queues and queue depth
Increase max nr_queues to 32, and queue depth to 1024.
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250412023035.2649275-10-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Sat, 12 Apr 2025 02:30:24 +0000 (10:30 +0800)]
selftests: ublk: set queue pthread's cpu affinity
In NUMA machine, ublk IO performance is very sensitive with queue
pthread's affinity setting.
Retrieve queue's affinity and select the 1st cpu as queue thread's sched
affinity, and it is observed that single cpu task affinity can get
stable & good performance if client application is put on proper cpu.
Dump this info when adding one ublk device. Use shmem to communicate
queue's tid between parent and daemon.
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250412023035.2649275-9-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Sat, 12 Apr 2025 02:30:23 +0000 (10:30 +0800)]
selftests: ublk: setup ring with IORING_SETUP_SINGLE_ISSUER/IORING_SETUP_DEFER_TASKRUN
It is observed that this way is more efficient for fast nvme backing
file.
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250412023035.2649275-8-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Sat, 12 Apr 2025 02:30:22 +0000 (10:30 +0800)]
selftests: ublk: add two stress tests for zero copy feature
Add stress_03 & stress_04 for covering zero copy feature.
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250412023035.2649275-7-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Sat, 12 Apr 2025 02:30:21 +0000 (10:30 +0800)]
selftests: ublk: run stress tests in parallel
Run stress tests in parallel, meantime add shell local function to
simplify the two stress tests.
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250412023035.2649275-6-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Sat, 12 Apr 2025 02:30:20 +0000 (10:30 +0800)]
selftests: ublk: make sure _add_ublk_dev can return in sub-shell
Detach ublk daemon from the starting process completely by double-fork and
clearing its process group, so that `_add_ublk_dev` can return from sub-shell.
Then it is more friendly for writing shell test script for adding/recovering
ublk device.
Prepare for running ublk test in parallel.
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250412023035.2649275-5-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Sat, 12 Apr 2025 02:30:19 +0000 (10:30 +0800)]
selftests: ublk: cleanup backfile automatically
Use global array of $UBLK_BACKFILES for storing all backfile name, then
clean them automatically.
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250412023035.2649275-4-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Sat, 12 Apr 2025 02:30:18 +0000 (10:30 +0800)]
selftests: ublk: add io_uring uapi header
Add io_uring UAPI header so that ublk can work with latest uapi
definition.
Fix the following build failure:
stripe.c: In function ‘stripe_to_uring_op’:
stripe.c:120:29: error: ‘IORING_OP_READV_FIXED’ undeclared (first use in this function); did you mean ‘IORING_OP_READ_FIXED’?
120 | return zc ? IORING_OP_READV_FIXED : IORING_OP_READV;
| ^~~~~~~~~~~~~~~~~~~~~
| IORING_OP_READ_FIXED
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Fixes:
57ed58c13256 ("selftests: ublk: enable zero copy for stripe target")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250412023035.2649275-3-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Sat, 12 Apr 2025 02:30:17 +0000 (10:30 +0800)]
selftests: ublk: fix ublk_find_tgt()
Bounds check for iterator variable `i` is missed, so add it and fix
ublk_find_tgt().
Cc: Johannes Thumshirn <Johannes.Thumshirn@wdc.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250412023035.2649275-2-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Martin K. Petersen [Wed, 16 Apr 2025 20:04:10 +0000 (16:04 -0400)]
block: integrity: Do not call set_page_dirty_lock()
Placing multiple protection information buffers inside the same page
can lead to oopses because set_page_dirty_lock() can't be called from
interrupt context.
Since a protection information buffer is not backed by a file there is
no point in setting its page dirty, there is nothing to synchronize.
Drop the call to set_page_dirty_lock() and remove the last argument to
bio_integrity_unpin_bvec().
Cc: stable@vger.kernel.org
Fixes:
492c5d455969 ("block: bio-integrity: directly map user buffers")
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Link: https://lore.kernel.org/r/yq1v7r3ev9g.fsf@ca-mkp.ca.oracle.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Darrick J. Wong [Tue, 15 Apr 2025 00:33:45 +0000 (17:33 -0700)]
xfs: fix fsmap for internal zoned devices
Filesystems with an internal zoned rt section use xfs_rtblock_t values
that are relative to the start of the data device. When fsmap reports
on internal rt sections, it reports the space used by the data section
as "OWN_FS".
Unfortunately, the logic for resuming a query isn't quite right, so
xfs/273 fails because it stress-tests GETFSMAP with a single-record
buffer. If we enter the "report fake space as OWN_FS" block with a
nonzero key[0].fmr_length, we should add that to key[0].fmr_physical
and recheck if we still need to emit the fake record. We should /not/
just return 0 from the whole function because that prevents all rmap
record iteration.
If we don't enter that block, the resumption is still wrong.
keys[*].fmr_physical is a reflection of what we copied out to userspace
on a previous query, which means that it already accounts for rgstart.
It is not correct to add rtstart_daddr when computing start_rtb or
end_rtb, so stop that.
While we're at it, add a xfs_has_zoned to make it clear that this is a
zoned filesystem thing.
Fixes:
e50ec7fac81aa2 ("xfs: enable fsmap reporting for internal RT devices")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
Meir Elisha [Tue, 8 Apr 2025 14:38:08 +0000 (17:38 +0300)]
md/raid1: Add check for missing source disk in process_checks()
During recovery/check operations, the process_checks function loops
through available disks to find a 'primary' source with successfully
read data.
If no suitable source disk is found after checking all possibilities,
the 'primary' index will reach conf->raid_disks * 2. Add an explicit
check for this condition after the loop. If no source disk was found,
print an error message and return early to prevent further processing
without a valid primary source.
Link: https://lore.kernel.org/linux-raid/20250408143808.1026534-1-meir.elisha@volumez.com
Signed-off-by: Meir Elisha <meir.elisha@volumez.com>
Suggested-and-reviewed-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Zhang Xianwei [Sat, 15 Mar 2025 06:32:16 +0000 (14:32 +0800)]
xfs: Fix spelling mistake "drity" -> "dirty"
There is a spelling mistake in fs/xfs/xfs_log.c. Fix it.
Signed-off-by: Zhang Xianwei <zhang.xianwei8@zte.com.cn>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
Damien Le Moal [Fri, 11 Apr 2025 01:42:11 +0000 (10:42 +0900)]
nvmet: pci-epf: cleanup link state management
Since the link_up boolean field of struct nvmet_pci_epf_ctrl is always
set to true when nvmet_pci_epf_start_ctrl() is called, assign true to
this field in nvmet_pci_epf_start_ctrl(). Conversely, since this field
is set to false when nvmet_pci_epf_stop_ctrl() is called, set this field
to false directly inside that function.
While at it, also add information messages to notify the user of the PCI
link state changes to help troubleshoot any link stability issues
without needing to enable debug messages.
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Niklas Cassel <cassel@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Damien Le Moal [Fri, 11 Apr 2025 01:42:10 +0000 (10:42 +0900)]
nvmet: pci-epf: clear CC and CSTS when disabling the controller
When a host shuts down the controller when shutting down but does so
without first disabling the controller, the enable bit remains set in
the controller configuration register. When the host restarts and
attempts to enable the controller again, the
nvmet_pci_epf_poll_cc_work() function is unable to detect the change
from 0 to 1 of the enable bit, and thus the controller is not enabled
again, which result in a device scan timeout on the host. This problem
also occurs if the host shuts down uncleanly or if the PCIe link goes
down: as the CC.EN value is not reset, the controller is not enabled
again when the host restarts.
Fix this by introducing the function nvmet_pci_epf_clear_ctrl_config()
to clear the CC and CSTS registers of the controller when the PCIe link
is lost (nvmet_pci_epf_stop_ctrl() function), or when starting the
controller fails (nvmet_pci_epf_enable_ctrl() fails). Also use this
function in nvmet_pci_epf_init_bar() to simplify the initialization of
the CC and CSTS registers.
Furthermore, modify the function nvmet_pci_epf_disable_ctrl() to clear
the CC.EN bit and write this updated value to the BAR register when the
controller is shutdown by the host, to ensure that upon restart, we can
detect the host setting CC.EN.
Fixes:
0faa0fe6f90e ("nvmet: New NVMe PCI endpoint function target driver")
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Niklas Cassel <cassel@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Damien Le Moal [Fri, 11 Apr 2025 01:42:09 +0000 (10:42 +0900)]
nvmet: pci-epf: always fully initialize completion entries
For a command that is normally processed through the command request
execute() function, the completion entry for the command is initialized
by __nvmet_req_complete() and nvmet_pci_epf_cq_work() only needs to set
the status field and the phase of the completion entry before posting
the entry to the completion queue.
However, for commands that are failed due to an internal error (e.g. the
command data buffer allocation fails), the command request execute()
function is not called and __nvmet_req_complete() is never executed for
the command, leaving the command completion entry uninitialized. For
such command failed before calling req->execute(), the host ends up
seeing completion entries with an invalid submission queue ID and
command ID.
Avoid such issue by always fully initilizing a command completion entry
in nvmet_pci_epf_cq_work(), setting the entry submission queue head, ID
and command ID.
Fixes:
0faa0fe6f90e ("nvmet: New NVMe PCI endpoint function target driver")
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Niklas Cassel <cassel@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Damien Le Moal [Fri, 11 Apr 2025 01:00:15 +0000 (10:00 +0900)]
nvmet: auth: use NULL to clear a pointer in nvmet_auth_sq_free()
When compiling with C=1, the following sparse warning is generated:
auth.c:243:23: warning: Using plain integer as NULL pointer
Avoid this warning by using NULL to instead of 0 to set the sq tls_key
pointer.
Fixes:
fa2e0f8bbc68 ("nvmet-tcp: support secure channel concatenation")
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Hannes Reinecke [Tue, 15 Apr 2025 06:47:37 +0000 (08:47 +0200)]
nvme-multipath: sysfs links may not be created for devices
When rapidly rescanning for new namespaces nvme_mpath_add_sysfs_link() may be
called for a block device not added to sysfs. But NVME_NS_SYSFS_ATTR_LINK
had already been set, so when checking this device a second time we will fail
to create the link.
Fix this by exchanging the order of the block device check and the
NVME_NS_SYSFS_ATTR_LINK bit check.
Fixes:
4dbd2b2ebe4c ("nvme-multipath: Add visibility for round-robin io-policy")
Signed-off-by: Hannes Reinecke <hare@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>**
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Hannes Reinecke [Mon, 14 Apr 2025 12:05:09 +0000 (14:05 +0200)]
nvme: fixup scan failure for non-ANA multipath controllers
Commit
62baf70c3274 caused the ANA log page to be re-read, even on
controllers that do not support ANA. While this should generally
harmless, some controllers hang on the unsupported log page and
never finish probing.
Fixes:
62baf70c3274 ("nvme: re-read ANA log page after ns scan completes")
Signed-off-by: Hannes Reinecke <hare@kernel.org>
Tested-by: Srikanth Aithal <sraithal@amd.com>
[hch: more detailed commit message]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Caleb Sander Mateos [Wed, 16 Apr 2025 00:41:10 +0000 (18:41 -0600)]
ublk: don't suggest CONFIG_BLK_DEV_UBLK=Y
The CONFIG_BLK_DEV_UBLK help text suggests setting the config option to
Y so task_work_add() can be used to dispatch I/O, improving performance.
However, this mechanism was removed in commit
29dc5d06613f2 ("ublk: kill
queuing request by task_work_add"). So remove this paragraph from the
config help text.
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Reviewed-by: Uday Shankar <ushankar@purestorage.com>
Link: https://lore.kernel.org/r/20250416004111.3242817-1-csander@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Wed, 9 Apr 2025 13:09:40 +0000 (15:09 +0200)]
loop: stop using vfs_iter_{read,write} for buffered I/O
vfs_iter_{read,write} always perform direct I/O when the file has the
O_DIRECT flag set, which breaks disabling direct I/O using the
LOOP_SET_STATUS / LOOP_SET_STATUS64 ioctls.
This was recenly reported as a regression, but as far as I can tell
was only uncovered by better checking for block sizes and has been
around since the direct I/O support was added.
Fix this by using the existing aio code that calls the raw read/write
iter methods instead. Note that despite the comments there is no need
for block drivers to ever call flush_dcache_page themselves, and the
call is a left-over from prehistoric times.
Fixes:
ab1cb278bc70 ("block: loop: introduce ioctl command of LOOP_SET_DIRECT_IO")
Reported-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Tested-by: Darrick J. Wong <djwong@kernel.org>
Link: https://lore.kernel.org/r/20250409130940.3685677-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>