git.kernel.dk Git - linux-block.git/log

sunrpc: handle SVC_GARBAGE during svc auth processing as auth error

tianshuo han reported a remotely-triggerable crash if the client sends a
kernel RPC server a specially crafted packet. If decoding the RPC reply
fails in such a way that SVC_GARBAGE is returned without setting the
rq_accept_statp pointer, then that pointer can be dereferenced and a
value stored there.

If it's the first time the thread has processed an RPC, then that
pointer will be set to NULL and the kernel will crash. In other cases,
it could create a memory scribble.

The server sunrpc code treats a SVC_GARBAGE return from svc_authenticate
or pg_authenticate as if it should send a GARBAGE_ARGS reply. RFC 5531
says that if authentication fails that the RPC should be rejected
instead with a status of AUTH_ERR.

Handle a SVC_GARBAGE return as an AUTH_ERROR, with a reason of
AUTH_BADCRED instead of returning GARBAGE_ARGS in that case. This
sidesteps the whole problem of touching the rpc_accept_statp pointer in
this situation and avoids the crash.

Cc: stable@kernel.org
Fixes: 29cd2927fb91 ("SUNRPC: Fix encoding of accepted but unsuccessful RPC replies")
Reported-by: tianshuo han <hantianshuo233@gmail.com>
Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

nfsd: use threads array as-is in netlink interface

The old nfsdfs interface for starting a server with multiple pools
handles the special case of a single entry array passed down from
userland by distributing the threads over every NUMA node.

The netlink control interface however constructs an array of length
nfsd_nrpools() and fills any unprovided slots with 0's. This behavior
defeats the special casing that the old interface relies on.

Change nfsd_nl_threads_set_doit() to pass down the array from userland
as-is.

Fixes: 7f5c330b2620 ("nfsd: allow passing in array of thread counts via netlink")
Cc: stable@vger.kernel.org
Reported-by: Mike Snitzer <snitzer@kernel.org>
Closes: https://lore.kernel.org/linux-nfs/aDC-ftnzhJAlwqwh@kernel.org/
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Tested-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

SUNRPC: Cleanup/fix initial rq_pages allocation

While investigating some reports of memory-constrained NUMA machines
failing to mount v3 and v4.0 nfs mounts, we found that svc_init_buffer()
was not attempting to retry allocations from the bulk page allocator.
Typically, this results in a single page allocation being returned and
the mount attempt fails with -ENOMEM. A retry would have allowed the mount
to succeed.

Additionally, it seems that the bulk allocation in svc_init_buffer() is
redundant because svc_alloc_arg() will perform the required allocation and
does the correct thing to retry the allocations.

The call to allocate memory in svc_alloc_arg() drops the preferred node
argument, but I expect we'll still allocate on the preferred node because
the allocation call happens within the svc thread context, which chooses
the node with memory closest to the current thread's execution.

This patch cleans out the bulk allocation in svc_init_buffer() to allow
svc_alloc_arg() to handle the allocation/retry logic for rq_pages.

Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Fixes: ed603bcf4fea ("sunrpc: Replace the rq_pages array with dynamically-allocated memory")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

NFSD: Avoid corruption of a referring call list

The new code neglects to remove a freshly-allocated RCL from the
callback's referring call list when no matching referring call is
found.

Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Closes: https://lore.kernel.org/r/202505171002.cE46sdj5-lkp@intel.com/
Fixes: 4f3c8d8c9e10 ("NFSD: Implement CB_SEQUENCE referring call lists")
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

xdrgen: Fix code generated for counted arrays

When an XDR counted array has a maximum element count, xdrgen adds
a bounds check to the encoder or decoder for that type. But in cases
where the .x provides no maximum element count, such as

struct notify4 {
        /* composed from notify_type4 or notify_deviceid_type4 */
        bitmap4         notify_mask;
        notifylist4     notify_vals;
};

struct CB_NOTIFY4args {
        stateid4    cna_stateid;
        nfs_fh4     cna_fh;
        notify4     cna_changes<>;
};

xdrgen is supposed to omit that bounds check. Some of the Jinja2
templates handle that correctly, but a few are incorrect and leave
the bounds check in place with a maximum of zero, which causes
encoding/decoding of that type to fail unconditionally.

Reported-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

SUNRPC: Bump the maximum payload size for the server

Increase the maximum server-side RPC payload to 4MB. The default
remains at 1MB.

An API to adjust the operational maximum was added in 2006 by commit
596bbe53eb3a ("[PATCH] knfsd: Allow max size of NFSd payload to be
configured"). To adjust the operational maximum using this API, shut
down the NFS server. Then echo a new value into:

/proc/fs/nfsd/max_block_size

And restart the NFS server.

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: NeilBrown <neil@brown.name>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

NFSD: Add a "default" block size

We'd like to increase the maximum r/wsize that NFSD can support,
but without introducing possible regressions. So let's add a
default setting of 1MB. A subsequent patch will raise the
maximum value but leave the default alone.

No behavior change is expected.

Reviewed-by: NeilBrown <neil@brown.name>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

NFSD: Remove NFSSVC_MAXBLKSIZE_V2 macro

The 8192-byte maximum is a protocol-defined limit, and we already
have a symbolic constant defined whose name matches the name of
the limit defined in the protocol. Replace the duplicate.

No change in behavior is expected.

Reviewed-by: NeilBrown <neil@brown.name>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

NFSD: Remove NFSD_BUFSIZE

Clean up: The documenting comment for NFSD_BUFSIZE is quite stale.
NFSD_BUFSIZE is used only for NFSv4 Reply these days; never for
NFSv2 or v3, and never for RPC Calls. Even so, the byte count
estimate does not include the size of the NFSv4 COMPOUND Reply
HEADER or the RPC auth flavor.

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: NeilBrown <neil@brown.name>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

sunrpc: Remove the RPCSVC_MAXPAGES macro

It is no longer used.

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: NeilBrown <neil@brown.name>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

svcrdma: Adjust the number of entries in svc_rdma_send_ctxt::sc_pages

Allow allocation of more entries in the sc_pages[] array when the
maximum size of an RPC message is increased.

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: NeilBrown <neil@brown.name>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

svcrdma: Adjust the number of entries in svc_rdma_recv_ctxt::rc_pages

Allow allocation of more entries in the rc_pages[] array when the
maximum size of an RPC message is increased.

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: NeilBrown <neil@brown.name>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

sunrpc: Adjust size of socket's receive page array dynamically

As a step towards making NFSD's maximum rsize and wsize variable at
run-time, make sk_pages a flexible array.

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: NeilBrown <neil@brown.name>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

SUNRPC: Remove svc_rqst :: rq_vec

Clean up: This array is no longer used.

On a system with 8-byte pointers and 4KB pages, pahole reports that
the rq_vec[] array accounts for 4144 bytes.

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

SUNRPC: Remove svc_fill_write_vector()

Clean up: This API is no longer used.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

NFSD: Use rqstp->rq_bvec in nfsd_iter_write()

If we can get rid of all uses of rq_vec, then it can be removed.
Replace one use of rqstp::rq_vec with rqstp::rq_bvec.

The feeling of layering violation grows stronger now that
<linux/sunrpc/xdr.h> is included in fs/nfsd/vfs.c.

Suggested-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

SUNRPC: Export xdr_buf_to_bvec()

Prepare xdr_buf_to_bvec() to be invoked from upper layer protocol
code.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

NFSD: De-duplicate the svc_fill_write_vector() call sites

All three call sites do the same thing.

I'm struggling with this a bit, however. struct xdr_buf is an XDR
layer object and unmarshaling a WRITE payload is clearly a task
intended to be done by the proc and xdr functions, not by VFS. This
feels vaguely like a layering violation.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

NFSD: Use rqstp->rq_bvec in nfsd_iter_read()

If we can get rid of all uses of rq_vec, then it can be removed.
Replace one use of rqstp::rq_vec with rqstp::rq_bvec.

Suggested-by: Christoph Hellwig <hch@infradead.org>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

sunrpc: Replace the rq_bvec array with dynamically-allocated memory

As a step towards making NFSD's maximum rsize and wsize variable at
run-time, replace the fixed-size rq_bvec[] array in struct svc_rqst
with a chunk of dynamically-allocated memory.

The rq_bvec[] array contains enough bio_vecs to handle each page in
a maximum size RPC message.

On a system with 8-byte pointers and 4KB pages, pahole reports that
the rq_bvec[] array is 4144 bytes. This patch replaces that array
with a single 8-byte pointer field.

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: NeilBrown <neil@brown.name>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

sunrpc: Replace the rq_pages array with dynamically-allocated memory

As a step towards making NFSD's maximum rsize and wsize variable at
run-time, replace the fixed-size rq_vec[] array in struct svc_rqst
with a chunk of dynamically-allocated memory.

On a system with 8-byte pointers and 4KB pages, pahole reports that
the rq_pages[] array is 2080 bytes. This patch replaces that with
a single 8-byte pointer field.

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: NeilBrown <neil@brown.name>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

sunrpc: Remove backchannel check in svc_init_buffer()

The server's backchannel uses struct svc_rqst, but does not use the
pages in svc_rqst::rq_pages. It's rq_arg::pages and rq_res::pages
comes from the RPC client's page allocator. Currently,
svc_init_buffer() skips allocating pages in rq_pages for that
reason.

Except that, svc_rqst::rq_pages is filled anyway when a backchannel
svc_rqst is passed to svc_recv() -> and then to svc_alloc_arg().

This isn't really a problem at the moment, except that these pages
are allocated but then never used, as far as I can tell.

The problem is that later in this series, in addition to populating
the entries of rq_pages[], svc_init_buffer() will also allocate the
memory underlying the rq_pages[] array itself. If that allocation is
skipped, then svc_alloc_args() chases a NULL pointer for ingress
backchannel requests.

This approach avoids introducing extra conditional logic in
svc_alloc_args(), which is a hot path.

Acked-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: NeilBrown <neil@brown.name>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

sunrpc: Add a helper to derive maxpages from sv_max_mesg

This page count is to be used to allocate various arrays of pages
and bio_vecs, replacing the fixed RPCSVC_MAXPAGES value.

The documenting comment is somewhat stale -- of course NFSv4
COMPOUND procedures may have multiple payloads.

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: NeilBrown <neil@brown.name>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

svcrdma: Reduce the number of rdma_rw contexts per-QP

There is an upper bound on the number of rdma_rw contexts that can
be created per QP.

This invisible upper bound is because rdma_create_qp() adds one or
more additional SQEs for each ctxt that the ULP requests via
qp_attr.cap.max_rdma_ctxs. The QP's actual Send Queue length is on
the order of the sum of qp_attr.cap.max_send_wr and a factor times
qp_attr.cap.max_rdma_ctxs. The factor can be up to three, depending
on whether MR operations are required before RDMA Reads.

This limit is not visible to RDMA consumers via dev->attrs. When the
limit is surpassed, QP creation fails with -ENOMEM. For example:

svcrdma's estimate of the number of rdma_rw contexts it needs is
three times the number of pages in RPCSVC_MAXPAGES. When MAXPAGES
is about 260, the internally-computed SQ length should be:

64 credits + 10 backlog + 3 * (3 * 260) = 2414

Which is well below the advertised qp_max_wr of 32768.

If RPCSVC_MAXPAGES is increased to 4MB, that's 1040 pages:

64 credits + 10 backlog + 3 * (3 * 1040) = 9434

However, QP creation fails. Dynamic printk for mlx5 shows:

calc_sq_size:618:(pid 1514): send queue size (9326 * 256 / 64 -> 65536) exceeds limits(32768)

Although 9326 is still far below qp_max_wr, QP creation still
fails.

Because the total SQ length calculation is opaque to RDMA consumers,
there doesn't seem to be much that can be done about this except for
consumers to try to keep the requested rdma_rw ctxt count low.

Fixes: 2da0f610e733 ("svcrdma: Increase the per-transport rw_ctx count")
Reviewed-by: NeilBrown <neil@brown.name>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

nfsd: remove legacy dprintks from GETATTR and STATFS codepaths

Observability here is now covered by static tracepoints.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

nfsd: remove legacy READDIR dprintks

Observability here is now covered by static tracepoints.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

nfsd: remove dprintks for v2/3 RENAME events

Observability here is now covered by static tracepoints.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

nfsd: remove REMOVE/RMDIR dprintks

Observability here is now covered by static tracepoints.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

nfsd: remove old LINK dprintks

Observability here is now covered by static tracepoints.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

nfsd: remove old v2/3 SYMLINK dprintks

Observability here is now covered by static tracepoints.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

nfsd: remove old v2/3 create path dprintks

Observability here is now covered by static tracepoints.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

nfsd: add tracepoint for getattr and statfs events

There isn't a common helper for getattrs, so add these into the
protocol-specific helpers.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

nfsd: add tracepoint to nfsd_readdir

Observe the start of NFS READDIR operations.

The NFS READDIR's count argument can be interesting when tuning a
client's readdir behavior.

However, the count argument is not passed to nfsd_readdir(). To
properly capture the count argument, this tracepoint must appear in
each proc function before the nfsd_readdir() call.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

nfsd: add tracepoint to nfsd_rename

Observe the start of RENAME operations for all NFS versions.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

nfsd: add tracepoints for unlink events

Observe the start of UNLINK, REMOVE, and RMDIR operations for all
NFS versions.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

nfsd: add tracepoint to nfsd_link()

Observe the start of NFS LINK operations.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

nfsd: add tracepoint to nfsd_symlink

Observe the start of SYMLINK operations for all NFS versions.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

nfsd: add nfsd_vfs_create tracepoints

Observe the start of file and directory creation for all NFS
versions.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

nfsd: add a tracepoint to nfsd_lookup_dentry

Replace the dprintk in nfsd_lookup_dentry() with a trace point.
nfsd_lookup_dentry() is called frequently enough that enabling this
dprintk call site would result in log floods and performance issues.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

nfsd: add a tracepoint for nfsd_setattr

Turn Sargun's internal kprobe based implementation of this into a normal
static tracepoint. Also, remove the dprintk's that got added recently
with the fix for zero-length ACLs.

Cc: Sargun Dillon <sargun@sargun.me>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

NFSD: Add a Call equivalent to the NFSD_TRACE_PROC_RES macros

Introduce tracing helpers that can be used before the procedure
status code is known. These macros are similar to the
SVC_RQST_ENDPOINT helpers, but they can be modified to include
NFS-specific fields if that is needed later.

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

NFSD: Use sockaddr instead of a generic array

Record and emit presentation addresses using tracing helpers
designed for the task.

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

NFSD: Implement FATTR4_CLONE_BLKSIZE attribute

RFC 7862 states that if an NFS server implements a CLONE operation,
it MUST also implement FATTR4_CLONE_BLKSIZE. NFSD implements CLONE,
but does not implement FATTR4_CLONE_BLKSIZE.

Note that in Section 12.2, RFC 7862 claims that
FATTR4_CLONE_BLKSIZE is RECOMMENDED, not REQUIRED. Likely this is
because a minor version is not permitted to add a REQUIRED
attribute. Confusing.

We assume this attribute reports a block size as a count of bytes,
as RFC 7862 does not specify a unit.

Reported-by: Roland Mainz <roland.mainz@nrubsig.org>
Suggested-by: Christoph Hellwig <hch@infradead.org>
Reviewed-by: Roland Mainz <roland.mainz@nrubsig.org>
Cc: stable@vger.kernel.org # v6.7+
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

nfsd: use SHA-256 library API instead of crypto_shash API

This user of SHA-256 does not support any other algorithm, so the
crypto_shash abstraction provides no value. Just use the SHA-256
library API instead, which is much simpler and easier to use.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Scott Mayhew <smayhew@redhat.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

svcrdma: Unregister the device if svc_rdma_accept() fails

To handle device removal, svc_rdma_accept() requests removal
notification for the underlying device when accepting a connection.
However svc_rdma_free() is not invoked if svc_rdma_accept() fails.
There needs to be a matching "unregister" in that case; otherwise
the device cannot be removed.

Fixes: c4de97f7c454 ("svcrdma: Handle device removal outside of the CM event handler")
Cc: stable@vger.kernel.org
Reviewed-by: Zhu Yanjun <yanjun.zhu@linux.dev>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

sunrpc: allow SOMAXCONN backlogged TCP connections

The connection backlog passed to listen() denotes the number of
connections that are fully established, but that have not yet been
accept()ed. If the amount goes above that level, new connection requests
will be dropped on the floor until the value goes down. If all the knfsd
threads are bogged down in (e.g.) disk I/O, new connection attempts can
stall because of this.

For the same rationale that Trond points out in the userland patch [1],
ensure that svc_xprt sockets created by the kernel allow SOMAXCONN
(4096) backlogged connections instead of the 64 that they do today.

[1]: https://lore.kernel.org/linux-nfs/20240308180223.2965601-1-trond.myklebust@hammerspace.com/

Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

nfsd: Initialize ssc before laundromat_work to prevent NULL dereference

In nfs4_state_start_net(), laundromat_work may access nfsd_ssc through
nfs4_laundromat -> nfsd4_ssc_expire_umount. If nfsd_ssc isn't initialized,
this can cause NULL pointer dereference.

Normally the delayed start of laundromat_work allows sufficient time for
nfsd_ssc initialization to complete. However, when the kernel waits too
long for userspace responses (e.g. in nfs4_state_start_net ->
nfsd4_end_grace -> nfsd4_record_grace_done -> nfsd4_cld_grace_done ->
cld_pipe_upcall -> __cld_pipe_upcall -> wait_for_completion path), the
delayed work may start before nfsd_ssc initialization finishes.

Fix this by moving nfsd_ssc initialization before starting laundromat_work.

Fixes: f4e44b393389 ("NFSD: delay unmount source's export after inter-server copy completed.")
Cc: stable@vger.kernel.org
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Li Lingfeng <lilingfeng3@huawei.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

MAINTAINERS: Update Neil Brown's email address

Neil is planning retirement, and has asked me to replace his Suse
email address with his personal email address. Both addresses
currently route to the same mailbox.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

sunrpc: add info about xprt queue times to svc_xprt_dequeue tracepoint

I've been looking at a problem where we see increased RPC timeouts in
clients when the nfs_layout_flexfiles dataserver_timeo value is tuned
very low (6s). This is necessary to ensure quick failover to a different
mirror if a server goes down, but it causes a lot more major RPC timeouts.

Ultimately, the problem is server-side however. It's sometimes doesn't
respond to connection attempts. My theory is that the interrupt handler
runs when a connection comes in, the xprt ends up being enqueued, but it
takes a significant amount of time for the nfsd thread to pick it up.

Currently, the svc_xprt_dequeue tracepoint displays "wakeup-us". This is
the time between the wake_up() call, and the thread dequeueing the xprt.
If no thread was woken, or the thread ended up picking up a different
xprt than intended, then this value won't tell us how long the xprt was
waiting.

Add a new xpt_qtime field to struct svc_xprt and set it in
svc_xprt_enqueue(). When the dequeue tracepoint fires, also store the
time that the xprt sat on the queue in total. Display it as "qtime-us".

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

nfsd: add commit start/done tracepoints around nfsd_commit()

Very useful for gauging how long the vfs_fsync_range() takes.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

nfsd: nfsd4_spo_must_allow() must check this is a v4 compound request

If the request being processed is not a v4 compound request, then
examining the cstate can have undefined results.

This patch adds a check that the rpc procedure being executed
(rq_procinfo) is the NFSPROC4_COMPOUND procedure.

Reported-by: Olga Kornievskaia <okorniev@redhat.com>
Cc: stable@vger.kernel.org
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: NeilBrown <neil@brown.name>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

nfsd: fix access checking for NLM under XPRTSEC policies

When an export policy with xprtsec policy is set with "tls"
and/or "mtls", but an NFS client is doing a v3 xprtsec=tls
mount, then NLM locking calls fail with an error because
there is currently no support for NLM with TLS.

Until such support is added, allow NLM calls under TLS-secured
policy.

Fixes: 4cc9b9f2bf4d ("nfsd: refine and rename NFSD_MAY_LOCK")
Cc: stable@vger.kernel.org
Signed-off-by: Olga Kornievskaia <okorniev@redhat.com>
Reviewed-by: NeilBrown <neil@brown.name>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

nfsd: remove redundant WARN_ON_ONCE in nfsd4_write

It can be removed since svc_fill_write_vector already has the
same WARN_ON_ONCE.

Signed-off-by: Guoqing Jiang <guoqing.jiang@linux.dev>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

NFSD: Add experimental setting to disable the use of splice read

NFSD currently has two separate code paths for handling read
requests. One uses page splicing; the other is a traditional read
based on an iov iterator.

Because most Linux file systems support splice read, the latter
does not get nearly the same test experience as splice reads.

To force the use of vectored reads for testing and benchmarking,
introduce the ability to disable splice reads for all NFS READ
operations.

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

NFSD: Add /sys/kernel/debug/nfsd

Create a small sandbox under /sys/kernel/debug for experimental NFS
server feature settings. There is no API/ABI compatibility guarantee
for these settings.

The only documentation for such settings, if any documentation exists,
is in the kernel source code.

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

NFSD: fix race between nfsd registration and exports_proc

As of now nfsd calls create_proc_exports_entry() at start of init_nfsd
and cleanup by remove_proc_entry() at last of exit_nfsd.

Which causes kernel OOPs if there is race between below 2 operations:
(i) exportfs -r
(ii) mount -t nfsd none /proc/fs/nfsd

for 5.4 kernel ARM64:

CPU 1:
el1_irq+0xbc/0x180
arch_counter_get_cntvct+0x14/0x18
running_clock+0xc/0x18
preempt_count_add+0x88/0x110
prep_new_page+0xb0/0x220
get_page_from_freelist+0x2d8/0x1778
__alloc_pages_nodemask+0x15c/0xef0
__vmalloc_node_range+0x28c/0x478
__vmalloc_node_flags_caller+0x8c/0xb0
kvmalloc_node+0x88/0xe0
nfsd_init_net+0x6c/0x108 [nfsd]
ops_init+0x44/0x170
register_pernet_operations+0x114/0x270
register_pernet_subsys+0x34/0x50
init_nfsd+0xa8/0x718 [nfsd]
do_one_initcall+0x54/0x2e0

CPU 2 :
Unable to handle kernel NULL pointer dereference at virtual address 0000000000000010

PC is at : exports_net_open+0x50/0x68 [nfsd]

Call trace:
exports_net_open+0x50/0x68 [nfsd]
exports_proc_open+0x2c/0x38 [nfsd]
proc_reg_open+0xb8/0x198
do_dentry_open+0x1c4/0x418
vfs_open+0x38/0x48
path_openat+0x28c/0xf18
do_filp_open+0x70/0xe8
do_sys_open+0x154/0x248

Sometimes it crashes at exports_net_open() and sometimes cache_seq_next_rcu().

and same is happening on latest 6.14 kernel as well:

[    0.000000] Linux version 6.14.0-rc5-next-20250304-dirty
...
[  285.455918] Unable to handle kernel paging request at virtual address 00001f4800001f48
...
[  285.464902] pc : cache_seq_next_rcu+0x78/0xa4
...
[  285.469695] Call trace:
[  285.470083]  cache_seq_next_rcu+0x78/0xa4 (P)
[  285.470488]  seq_read+0xe0/0x11c
[  285.470675]  proc_reg_read+0x9c/0xf0
[  285.470874]  vfs_read+0xc4/0x2fc
[  285.471057]  ksys_read+0x6c/0xf4
[  285.471231]  __arm64_sys_read+0x1c/0x28
[  285.471428]  invoke_syscall+0x44/0x100
[  285.471633]  el0_svc_common.constprop.0+0x40/0xe0
[  285.471870]  do_el0_svc_compat+0x1c/0x34
[  285.472073]  el0_svc_compat+0x2c/0x80
[  285.472265]  el0t_32_sync_handler+0x90/0x140
[  285.472473]  el0t_32_sync+0x19c/0x1a0
[  285.472887] Code: f9400885 93407c23 937d7c27 11000421 (f86378a3)
[  285.473422] ---[ end trace 0000000000000000 ]---

It reproduced simply with below script:
while [ 1 ]
do
/exportfs -r
done &

while [ 1 ]
do
insmod /nfsd.ko
mount -t nfsd none /proc/fs/nfsd
umount /proc/fs/nfsd
rmmod nfsd
done &

So exporting interfaces to user space shall be done at last and
cleanup at first place.

With change there is no Kernel OOPs.

Co-developed-by: Shubham Rana <s9.rana@samsung.com>
Signed-off-by: Shubham Rana <s9.rana@samsung.com>
Signed-off-by: Maninder Singh <maninder1.s@samsung.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Cc: stable@vger.kernel.org
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

NFSD: unregister filesystem in case genl_register_family() fails

With rpc_status netlink support, unregister of register_filesystem()
was missed in case of genl_register_family() fails.

Correcting it by making new label.

Fixes: bd9d6a3efa97 ("NFSD: add rpc_status netlink support")
Cc: stable@vger.kernel.org
Signed-off-by: Maninder Singh <maninder1.s@samsung.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

sunrpc: fix race in cache cleanup causing stale nextcheck time

When cache cleanup runs concurrently with cache entry removal, a race
condition can occur that leads to incorrect nextcheck times. This can
delay cache cleanup for the cache_detail by up to 1800 seconds:

1. cache_clean() sets nextcheck to current time plus 1800 seconds
2. While scanning a non-empty bucket, concurrent cache entry removal can
   empty that bucket
3. cache_clean() finds no cache entries in the now-empty bucket to update
   the nextcheck time
4. This maybe delays the next scan of the cache_detail by up to 1800
   seconds even when it should be scanned earlier based on remaining
   entries

Fix this by moving the hash_lock acquisition earlier in cache_clean().
This ensures bucket emptiness checks and nextcheck updates happen
atomically, preventing the race between cleanup and entry removal.

Signed-off-by: Long Li <leo.lilong@huawei.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

sunrpc: update nextcheck time when adding new cache entries

The cache_detail structure uses a "nextcheck" field to control hash table
scanning intervals. When a table scan begins, nextcheck is set to current
time plus 1800 seconds. During scanning, if cache_detail is not empty and
a cache entry's expiry time is earlier than the current nextcheck, the
nextcheck is updated to that expiry time.

This mechanism ensures that:
1) Empty cache_details are scanned every 1800 seconds to avoid unnecessary
scans
2) Non-empty cache_details are scanned based on the earliest expiry time
found

However, when adding a new cache entry to an empty cache_detail, the
nextcheck time was not being updated, remaining at 1800 seconds. This
could delay cache cleanup for up to 1800 seconds, potentially blocking
threads(such as nfsd) that are waiting for cache cleanup.

Fix this by updating the nextcheck time whenever a new cache entry is
added.

Signed-off-by: Long Li <leo.lilong@huawei.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

NFSD: Record each NFSv4 call's session slot index

Help the client resolve the race between the reply to an
asynchronous COPY reply and the associated CB_OFFLOAD callback by
planting the session, slot, and sequence number of the COPY in the
CB_SEQUENCE contained in the CB_OFFLOAD COMPOUND.

Suggested-by: Trond Myklebust <trondmy@hammerspace.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

NFSD: Implement CB_SEQUENCE referring call lists

The slot index number of the current COMPOUND has, until now, not
been needed outside of nfsd4_sequence(). But to record the tuple
that represents a referring call, the slot number will be needed
when processing subsequent operations in the COMPOUND.

Refactor the code that allocates a new struct nfsd4_slot to ensure
that the new sl_index field is always correctly initialized.

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

NFSD: Implement CB_SEQUENCE referring call lists

We have yet to implement a mechanism in NFSD for resolving races
between a server's reply and a related callback operation. For
example, a CB_OFFLOAD callback can race with the matching COPY
response. The client will not recognize the copy state ID in the
CB_OFFLOAD callback until the COPY response arrives.

Trond adds:
> It is also needed for the same kind of race with delegation
> recalls, layout recalls, CB_NOTIFY_DEVICEID and would also be
> helpful (although not as strongly required) for CB_NOTIFY_LOCK.

RFC 8881 Section 20.9.3 describes referring call lists this way:
> The csa_referring_call_lists array is the list of COMPOUND
> requests, identified by session ID, slot ID, and sequence ID.
> These are requests that the client previously sent to the server.
> These previous requests created state that some operation(s) in
> the same CB_COMPOUND as the csa_referring_call_lists are
> identifying. A session ID is included because leased state is tied
> to a client ID, and a client ID can have multiple sessions. See
> Section 2.10.6.3.

Introduce the XDR infrastructure for populating the
csa_referring_call_lists argument of CB_SEQUENCE. Subsequent patches
will put the referring call list to use.

Note that cb_sequence_enc_sz estimates that only zero or one rcl is
included in each CB_SEQUENCE, but the new infrastructure can
manage any number of referring calls.

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

NFSD: Shorten CB_OFFLOAD response to NFS4ERR_DELAY

Try not to prolong the wait for completion of a COPY or COPY_NOTIFY
operation.

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

NFSD: OFFLOAD_CANCEL should mark an async COPY as completed

Update the status of an async COPY operation when it has been
stopped. OFFLOAD_STATUS needs to indicate that the COPY is no longer
running.

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

Linux 6.15-rc6

Merge tag 'for-linus' of git://git./virt/kvm/kvm

Pull KVM fixes from Paolo Bonzini:
"ARM:

   - Avoid use of uninitialized memcache pointer in user_mem_abort()

   - Always set HCR_EL2.xMO bits when running in VHE, allowing
     interrupts to be taken while TGE=0 and fixing an ugly bug on
     AmpereOne that occurs when taking an interrupt while clearing the
     xMO bits (AC03_CPU_36)

   - Prevent VMMs from hiding support for AArch64 at any EL virtualized
     by KVM

   - Save/restore the host value for HCRX_EL2 instead of restoring an
     incorrect fixed value

   - Make host_stage2_set_owner_locked() check that the entire requested
     range is memory rather than just the first page

  RISC-V:

   - Add missing reset of smstateen CSRs

  x86:

   - Forcibly leave SMM on SHUTDOWN interception on AMD CPUs to avoid
     causing problems due to KVM stuffing INIT on SHUTDOWN (KVM needs to
     sanitize the VMCB as its state is undefined after SHUTDOWN,
     emulating INIT is the least awful choice).

   - Track the valid sync/dirty fields in kvm_run as a u64 to ensure KVM
     KVM doesn't goof a sanity check in the future.

   - Free obsolete roots when (re)loading the MMU to fix a bug where
     pre-faulting memory can get stuck due to always encountering a
     stale root.

   - When dumping GHCB state, use KVM's snapshot instead of the raw GHCB
     page to print state, so that KVM doesn't print stale/wrong
     information.

   - When changing memory attributes (e.g. shared <=> private), add
     potential hugepage ranges to the mmu_invalidate_range_{start,end}
     set so that KVM doesn't create a shared/private hugepage when the
     the corresponding attributes will become mixed (the attributes are
     commited *after* KVM finishes the invalidation).

   - Rework the SRSO mitigation to enable BP_SPEC_REDUCE only when KVM
     has at least one active VM. Effectively BP_SPEC_REDUCE when KVM is
     loaded led to very measurable performance regressions for non-KVM
     workloads"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
  KVM: SVM: Set/clear SRSO's BP_SPEC_REDUCE on 0 <=> 1 VM count transitions
  KVM: arm64: Fix memory check in host_stage2_set_owner_locked()
  KVM: arm64: Kill HCRX_HOST_FLAGS
  KVM: arm64: Properly save/restore HCRX_EL2
  KVM: arm64: selftest: Don't try to disable AArch64 support
  KVM: arm64: Prevent userspace from disabling AArch64 support at any virtualisable EL
  KVM: arm64: Force HCR_EL2.xMO to 1 at all times in VHE mode
  KVM: arm64: Fix uninitialized memcache pointer in user_mem_abort()
  KVM: x86/mmu: Prevent installing hugepages when mem attributes are changing
  KVM: SVM: Update dump_ghcb() to use the GHCB snapshot fields
  KVM: RISC-V: reset smstateen CSRs
  KVM: x86/mmu: Check and free obsolete roots in kvm_mmu_reload()
  KVM: x86: Check that the high 32bits are clear in kvm_arch_vcpu_ioctl_run()
  KVM: SVM: Forcibly leave SMM mode on SHUTDOWN interception

Merge tag 'mips-fixes_6.15_1' of git://git./linux/kernel/git/mips/linux

Pull MIPS fixes from Thomas Bogendoerfer:

- Fix delayed timers

- Fix NULL pointer deref

- Fix wrong range check

* tag 'mips-fixes_6.15_1' of git://git.kernel.org/pub/scm/linux/kernel/git/mips/linux:
  MIPS: Fix MAX_REG_OFFSET
  MIPS: CPS: Fix potential NULL pointer dereferences in cps_prepare_cpus()
  MIPS: rename rollback_handler with skipover_handler
  MIPS: Move r4k_wait() to .cpuidle.text section
  MIPS: Fix idle VS timer enqueue

Merge tag 'x86-urgent-2025-05-11' of git://git./linux/kernel/git/tip/tip

Pull x86 fix from Ingo Molnar:
"Fix a boot regression on very old x86 CPUs without CPUID support"

* tag 'x86-urgent-2025-05-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/microcode: Consolidate the loader enablement checking

Merge tag 'timers-urgent-2025-05-11' of git://git./linux/kernel/git/tip/tip

Pull misc timers fixes from Ingo Molnar:

- Fix time keeping bugs in CLOCK_MONOTONIC_COARSE clocks

- Work around absolute relocations into vDSO code that GCC erroneously
   emits in certain arm64 build environments

- Fix a false positive lockdep warning in the i8253 clocksource driver

* tag 'timers-urgent-2025-05-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  clocksource/i8253: Use raw_spinlock_irqsave() in clockevent_i8253_disable()
  arm64: vdso: Work around invalid absolute relocations from GCC
  timekeeping: Prevent coarse clocks going backwards

Merge tag 'input-for-v6.15-rc5' of git://git./linux/kernel/git/dtor/input

Pull input fixes from Dmitry Torokhov:

- Synaptics touchpad on multiple laptops (Dynabook Portege X30L-G,
   Dynabook Portege X30-D, TUXEDO InfinityBook Pro 14 v5, Dell Precision
   M3800, HP Elitebook 850 G1) switched from PS/2 to SMBus mode

- a number of new controllers added to xpad driver: HORI Drum
   controller, PowerA Fusion Pro 4, PowerA MOGA XP-Ultra controller,
   8BitDo Ultimate 2 Wireless Controller, 8BitDo Ultimate 3-mode
   Controller, Hyperkin DuchesS Xbox One controller

- fixes to xpad driver to properly handle Mad Catz JOYTECH NEO SE
   Advanced and PDP Mirror's Edge Official controllers

- fixes to xpad driver to properly handle "Share" button on some
   controllers

- a fix for device initialization timing and for waking up the
   controller in cyttsp5 driver

- a fix for hisi_powerkey driver to properly wake up from s2idle state

- other assorted cleanups and fixes

* tag 'input-for-v6.15-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input:
  Input: xpad - fix xpad_device sorting
  Input: xpad - add support for several more controllers
  Input: xpad - fix Share button on Xbox One controllers
  Input: xpad - fix two controller table values
  Input: hisi_powerkey - enable system-wakeup for s2idle
  Input: synaptics - enable InterTouch on Dell Precision M3800
  Input: synaptics - enable InterTouch on TUXEDO InfinityBook Pro 14 v5
  Input: synaptics - enable InterTouch on Dynabook Portege X30L-G
  Input: synaptics - enable InterTouch on Dynabook Portege X30-D
  Input: synaptics - enable SMBus for HP Elitebook 850 G1
  Input: mtk-pmic-keys - fix possible null pointer dereference
  Input: xpad - add support for 8BitDo Ultimate 2 Wireless Controller
  Input: cyttsp5 - fix power control issue on wakeup
  MAINTAINERS: .mailmap: update Mattijs Korpershoek's email address
  dt-bindings: mediatek,mt6779-keypad: Update Mattijs' email address
  Input: stmpe-ts - use module alias instead of device table
  Input: cyttsp5 - ensure minimum reset pulse width
  Input: sparcspkr - avoid unannotated fall-through
  input/joystick: magellan: Mark __nonstring look-up table

Merge tag 'fixes-2025-05-11' of git://git./linux/kernel/git/rppt/memblock

Pull memblock fixes from Mike Rapoport:

- Mark set_high_memory() as __init to fix section mismatch

- Accept memory allocated in memblock_double_array() to mitigate crash
   of SNP guests

* tag 'fixes-2025-05-11' of git://git.kernel.org/pub/scm/linux/kernel/git/rppt/memblock:
  memblock: Accept allocated memory before use in memblock_double_array()
  mm,mm_init: Mark set_high_memory as __init

Input: xpad - fix xpad_device sorting

A recent commit put one entry in the wrong place. This just moves it to the
right place.

Signed-off-by: Vicki Pfau <vi@endrift.com>
Link: https://lore.kernel.org/r/20250328234345.989761-5-vi@endrift.com
Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>

Input: xpad - add support for several more controllers

This adds support for several new controllers, all of which include
Share buttons:

- HORI Drum controller
- PowerA Fusion Pro 4
- 8BitDo Ultimate 3-mode Controller
- Hyperkin DuchesS Xbox One controller
- PowerA MOGA XP-Ultra controller

Signed-off-by: Vicki Pfau <vi@endrift.com>
Link: https://lore.kernel.org/r/20250328234345.989761-4-vi@endrift.com
Cc: stable@vger.kernel.org
Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>

Input: xpad - fix Share button on Xbox One controllers

The Share button, if present, is always one of two offsets from the end of the
file, depending on the presence of a specific interface. As we lack parsing for
the identify packet we can't automatically determine the presence of that
interface, but we can hardcode which of these offsets is correct for a given
controller.

More controllers are probably fixable by adding the MAP_SHARE_BUTTON in the
future, but for now I only added the ones that I have the ability to test
directly.

Signed-off-by: Vicki Pfau <vi@endrift.com>
Link: https://lore.kernel.org/r/20250328234345.989761-2-vi@endrift.com
Cc: stable@vger.kernel.org
Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>

Input: xpad - fix two controller table values

Two controllers -- Mad Catz JOYTECH NEO SE Advanced and PDP Mirror's
Edge Official -- were missing the value of the mapping field, and thus
wouldn't detect properly.

Signed-off-by: Vicki Pfau <vi@endrift.com>
Link: https://lore.kernel.org/r/20250328234345.989761-1-vi@endrift.com
Fixes: 540602a43ae5 ("Input: xpad - add a few new VID/PID combinations")
Fixes: 3492321e2e60 ("Input: xpad - add multiple supported devices")
Cc: stable@vger.kernel.org
Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>

Input: hisi_powerkey - enable system-wakeup for s2idle

To wake up the system from s2idle when pressing the power-button, let's
convert from using pm_wakeup_event() to pm_wakeup_dev_event(), as it allows
us to specify the "hard" in-parameter, which needs to be set for s2idle.

Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
Link: https://lore.kernel.org/r/20250306115021.797426-1-ulf.hansson@linaro.org
Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>

Merge tag 'mm-hotfixes-stable-2025-05-10-14-23' of git://git./linux/kernel/git/akpm/mm

Pull misc hotfixes from Andrew Morton:
"22 hotfixes. 13 are cc:stable and the remainder address post-6.14
  issues or aren't considered necessary for -stable kernels.

  About half are for MM. Five OCFS2 fixes and a few MAINTAINERS updates"

* tag 'mm-hotfixes-stable-2025-05-10-14-23' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (22 commits)
  mm: fix folio_pte_batch() on XEN PV
  nilfs2: fix deadlock warnings caused by lock dependency in init_nilfs()
  mm/hugetlb: copy the CMA flag when demoting
  mm, swap: fix false warning for large allocation with !THP_SWAP
  selftests/mm: fix a build failure on powerpc
  selftests/mm: fix build break when compiling pkey_util.c
  mm: vmalloc: support more granular vrealloc() sizing
  tools/testing/selftests: fix guard region test tmpfs assumption
  ocfs2: stop quota recovery before disabling quotas
  ocfs2: implement handshaking with ocfs2 recovery thread
  ocfs2: switch osb->disable_recovery to enum
  mailmap: map Uwe's BayLibre addresses to a single one
  MAINTAINERS: add mm THP section
  mm/userfaultfd: fix uninitialized output field for -EAGAIN race
  selftests/mm: compaction_test: support platform with huge mount of memory
  MAINTAINERS: add core mm section
  ocfs2: fix panic in failed foilio allocation
  mm/huge_memory: fix dereferencing invalid pmd migration entry
  MAINTAINERS: add reverse mapping section
  x86: disable image size check for test builds
  ...

Merge tag 'driver-core-6.15-rc6' of git://git./linux/kernel/git/driver-core/driver-core

Pull driver core fix from Greg KH:
"Here is a single driver core fix for a regression for platform devices
  that is a regression from a change that went into 6.15-rc1 that
  affected Pixel devices. It has been in linux-next for over a week with
  no reported problems"

* tag 'driver-core-6.15-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/driver-core/driver-core:
  platform: Fix race condition during DMA configure at IOMMU probe time

Merge tag 'usb-6.15-rc6' of git://git./linux/kernel/git/gregkh/usb

Pull USB fixes from Greg KH:
"Here are some small USB driver fixes for 6.15-rc6. Included in here
  are:

   - typec driver fixes

   - usbtmc ioctl fixes

   - xhci driver fixes

   - cdnsp driver fixes

   - some gadget driver fixes

  Nothing really major, just all little stuff that people have reported
  being issues. All of these have been in linux-next this week with no
  reported issues"

* tag 'usb-6.15-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb:
  xhci: dbc: Avoid event polling busyloop if pending rx transfers are inactive.
  usb: xhci: Don't trust the EP Context cycle bit when moving HW dequeue
  usb: usbtmc: Fix erroneous generic_read ioctl return
  usb: usbtmc: Fix erroneous wait_srq ioctl return
  usb: usbtmc: Fix erroneous get_stb ioctl error returns
  usb: typec: tcpm: delay SNK_TRY_WAIT_DEBOUNCE to SRC_TRYWAIT transition
  USB: usbtmc: use interruptible sleep in usbtmc_read
  usb: cdnsp: fix L1 resume issue for RTL_REVISION_NEW_LPM version
  usb: typec: ucsi: displayport: Fix NULL pointer access
  usb: typec: ucsi: displayport: Fix deadlock
  usb: misc: onboard_usb_dev: fix support for Cypress HX3 hubs
  usb: uhci-platform: Make the clock really optional
  usb: dwc3: gadget: Make gadget_wakeup asynchronous
  usb: gadget: Use get_status callback to set remote wakeup capability
  usb: gadget: f_ecm: Add get_status callback
  usb: host: tegra: Prevent host controller crash when OTG port is used
  usb: cdnsp: Fix issue with resuming from L1
  usb: gadget: tegra-xudc: ACK ST_RC after clearing CTRL_RUN

Merge tag 'staging-6.15-rc6' of git://git./linux/kernel/git/gregkh/staging

Pull staging driver fixes from Greg KH:
"Here are three small staging driver fixes for 6.15-rc6. These are:

   - bcm2835-camera driver fix

   - two axis-fifo driver fixes

  All of these have been in linux-next for a few weeks with no reported
  issues"

* tag 'staging-6.15-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging:
  staging: axis-fifo: Remove hardware resets for user errors
  staging: axis-fifo: Correct handling of tx_fifo_depth for size validation
  staging: bcm2835-camera: Initialise dev in v4l2_dev

Merge tag 'char-misc-6.15-rc6' of git://git./linux/kernel/git/gregkh/char-misc

Pull char/misc/IIO driver fixes from Greg KH:
"Here are a bunch of small driver fixes (mostly all IIO) for 6.15-rc6.
  Included in here are:

   - loads of tiny IIO driver fixes for reported issues

   - hyperv driver fix for a much-reported and worked on sysfs ring
     buffer creation bug

  All of these have been in linux-next for over a week (the IIO ones for
  many weeks now), with no reported issues"

* tag 'char-misc-6.15-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc: (30 commits)
  Drivers: hv: Make the sysfs node size for the ring buffer dynamic
  uio_hv_generic: Fix sysfs creation path for ring buffer
  iio: adis16201: Correct inclinometer channel resolution
  iio: adc: ad7606: fix serial register access
  iio: pressure: mprls0025pa: use aligned_s64 for timestamp
  iio: imu: adis16550: align buffers for timestamp
  staging: iio: adc: ad7816: Correct conditional logic for store mode
  iio: adc: ad7266: Fix potential timestamp alignment issue.
  iio: adc: ad7768-1: Fix insufficient alignment of timestamp.
  iio: adc: dln2: Use aligned_s64 for timestamp
  iio: accel: adxl355: Make timestamp 64-bit aligned using aligned_s64
  iio: temp: maxim-thermocouple: Fix potential lack of DMA safe buffer.
  iio: chemical: pms7003: use aligned_s64 for timestamp
  iio: chemical: sps30: use aligned_s64 for timestamp
  iio: imu: inv_mpu6050: align buffer for timestamp
  iio: imu: st_lsm6dsx: Fix wakeup source leaks on device unbind
  iio: adc: qcom-spmi-iadc: Fix wakeup source leaks on device unbind
  iio: accel: fxls8962af: Fix wakeup source leaks on device unbind
  iio: adc: ad7380: fix event threshold shift
  iio: hid-sensor-prox: Fix incorrect OFFSET calculation
  ...

Merge tag 'i2c-for-6.15-rc6' of git://git./linux/kernel/git/wsa/linux

Pull i2c fixes from Wolfram Sang:

- omap: use correct function to read from device tree

- MAINTAINERS: remove Seth from ISMT maintainership

* tag 'i2c-for-6.15-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux:
MAINTAINERS: Remove entry for Seth Heasley
i2c: omap: fix deprecated of_property_read_bool() use

Merge tag 'for-linus-6.15a-rc6-tag' of git://git./linux/kernel/git/xen/tip

Pull xen fixes from Juergen Gross:

- A fix for the xenbus driver allowing to use a PVH Dom0 with
   Xenstore running in another domain

- A fix for the xenbus driver addressing a rare race condition
   resulting in NULL dereferences and other problems

- A fix for the xen-swiotlb driver fixing a problem seen on Arm
   platforms

* tag 'for-linus-6.15a-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
  xenbus: Use kref to track req lifetime
  xenbus: Allow PVH dom0 a non-local xenstore
  xen: swiotlb: Use swiotlb bouncing if kmalloc allocation demands it

Merge tag 'pull-fixes' of git://git./linux/kernel/git/viro/vfs

Pull mount fixes from Al Viro:
"A couple of races around legalize_mnt vs umount (both fairly old and
  hard to hit) plus two bugs in move_mount(2) - both around 'move
  detached subtree in place' logics"

* tag 'pull-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  fix IS_MNT_PROPAGATING uses
  do_move_mount(): don't leak MNTNS_PROPAGATING on failures
  do_umount(): add missing barrier before refcount checks in sync case
  __legitimize_mnt(): check for MNT_SYNC_UMOUNT should be under mount_lock

Merge tag 'kvm-x86-fixes-6.15-rcN' of https://github.com/kvm-x86/linux into HEAD

KVM x86 fixes for 6.15-rcN

- Forcibly leave SMM on SHUTDOWN interception on AMD CPUs to avoid causing
   problems due to KVM stuffing INIT on SHUTDOWN (KVM needs to sanitize the
   VMCB as its state is undefined after SHUTDOWN, emulating INIT is the
   least awful choice).

- Track the valid sync/dirty fields in kvm_run as a u64 to ensure KVM
   KVM doesn't goof a sanity check in the future.

- Free obsolete roots when (re)loading the MMU to fix a bug where
   pre-faulting memory can get stuck due to always encountering a stale
   root.

- When dumping GHCB state, use KVM's snapshot instead of the raw GHCB page
   to print state, so that KVM doesn't print stale/wrong information.

- When changing memory attributes (e.g. shared <=> private), add potential
   hugepage ranges to the mmu_invalidate_range_{start,end} set so that KVM
   doesn't create a shared/private hugepage when the the corresponding
   attributes will become mixed (the attributes are commited *after* KVM
   finishes the invalidation).

- Rework the SRSO mitigation to enable BP_SPEC_REDUCE only when KVM has at
   least one active VM.  Effectively BP_SPEC_REDUCE when KVM is loaded led
   to very measurable performance regressions for non-KVM workloads.

Merge tag 'kvmarm-fixes-6.15-3' of https://git./linux/kernel/git/kvmarm/kvmarm into HEAD

KVM/arm64 fixes for 6.15, round #3

- Avoid use of uninitialized memcache pointer in user_mem_abort()

- Always set HCR_EL2.xMO bits when running in VHE, allowing interrupts
   to be taken while TGE=0 and fixing an ugly bug on AmpereOne that
   occurs when taking an interrupt while clearing the xMO bits
   (AC03_CPU_36)

- Prevent VMMs from hiding support for AArch64 at any EL virtualized by
   KVM

- Save/restore the host value for HCRX_EL2 instead of restoring an
   incorrect fixed value

- Make host_stage2_set_owner_locked() check that the entire requested
   range is memory rather than just the first page

Merge tag 'kvm-riscv-fixes-6.15-1' of https://github.com/kvm-riscv/linux into HEAD

KVM/riscv fixes for 6.15, take #1

- Add missing reset of smstateen CSRs

Merge tag 'i2c-host-fixes-6.15-rc6' of git://git./linux/kernel/git/andi.shyti/linux into i2c/for-current

i2c-host-fixes for v6.15-rc6

- omap: use correct function to read from device tree
- MAINTAINERS: remove Seth from ISMT maintainership

Merge tag '6.15-rc5-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6

Pull smb client fixes from Steve French:

- Fix dentry leak which can cause umount crash

- Add warning for parse contexts error on compounded operation

* tag '6.15-rc5-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6:
smb: client: Avoid race in open_cached_dir with lease breaks
smb3 client: warn when parse contexts returns error on compounded operation

fix IS_MNT_PROPAGATING uses

propagate_mnt() does not attach anything to mounts created during
propagate_mnt() itself.  What's more, anything on ->mnt_slave_list
of such new mount must also be new, so we don't need to even look
there.

When move_mount() had been introduced, we've got an additional
class of mounts to skip - if we are moving from anon namespace,
we do not want to propagate to mounts we are moving (i.e. all
mounts in that anon namespace).

Unfortunately, the part about "everything on their ->mnt_slave_list
will also be ignorable" is not true - if we have propagation graph
A -> B -> C
and do OPEN_TREE_CLONE open_tree() of B, we get
A -> [B <-> B'] -> C
as propagation graph, where B' is a clone of B in our detached tree.
Making B private will result in
A -> B' -> C
C still gets propagation from A, as it would after making B private
if we hadn't done that open_tree(), but now the propagation goes
through B'.  Trying to move_mount() our detached tree on subdirectory
in A should have
* moved B' on that subdirectory in A
* skipped the corresponding subdirectory in B' itself
* copied B' on the corresponding subdirectory in C.
As it is, the logics in propagation_next() and friends ends up
skipping propagation into C, since it doesn't consider anything
downstream of B'.

IOW, walking the propagation graph should only skip the ->mnt_slave_list
of new mounts; the only places where the check for "in that one
anon namespace" are applicable are propagate_one() (where we should
treat that as the same kind of thing as "mountpoint we are looking
at is not visible in the mount we are looking at") and
propagation_would_overmount().  The latter is better dealt with
in the caller (can_move_mount_beneath()); on the first call of
propagation_would_overmount() the test is always false, on the
second it is always true in "move from anon namespace" case and
always false in "move within our namespace" one, so it's easier
to just use check_mnt() before bothering with the second call and
be done with that.

Fixes: 064fe6e233e8 ("mount: handle mount propagation for detached mount trees")
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

do_move_mount(): don't leak MNTNS_PROPAGATING on failures

as it is, a failed move_mount(2) from anon namespace breaks
all further propagation into that namespace, including normal
mounts in non-anon namespaces that would otherwise propagate
there.

Fixes: 064fe6e233e8 ("mount: handle mount propagation for detached mount trees")
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

do_umount(): add missing barrier before refcount checks in sync case

do_umount() analogue of the race fixed in 119e1ef80ecf "fix
__legitimize_mnt()/mntput() race".  Here we want to make sure that
if __legitimize_mnt() doesn't notice our lock_mount_hash(), we will
notice their refcount increment.  Harder to hit than mntput_no_expire()
one, fortunately, and consequences are milder (sync umount acting
like umount -l on a rare race with RCU pathwalk hitting at just the
wrong time instead of use-after-free galore mntput_no_expire()
counterpart used to be hit).  Still a bug...

Fixes: 48a066e72d97 ("RCU'd vfsmounts")
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

__legitimize_mnt(): check for MNT_SYNC_UMOUNT should be under mount_lock

... or we risk stealing final mntput from sync umount - raising mnt_count
after umount(2) has verified that victim is not busy, but before it
has set MNT_SYNC_UMOUNT; in that case __legitimize_mnt() doesn't see
that it's safe to quietly undo mnt_count increment and leaves dropping
the reference to caller, where it'll be a full-blown mntput().

Check under mount_lock is needed; leaving the current one done before
taking that makes no sense - it's nowhere near common enough to bother
with.

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

Merge tag 'rust-fixes-6.15-2' of git://git./linux/kernel/git/ojeda/linux

Pull rust fixes from Miguel Ojeda:

- Make CFI_AUTO_DEFAULT depend on !RUST or Rust >= 1.88.0

- Clean Rust (and Clippy) lints for the upcoming Rust 1.87.0 and 1.88.0
   releases

- Clean objtool warning for the upcoming Rust 1.87.0 release by adding
   one more noreturn function

* tag 'rust-fixes-6.15-2' of git://git.kernel.org/pub/scm/linux/kernel/git/ojeda/linux:
  x86/Kconfig: make CFI_AUTO_DEFAULT depend on !RUST or Rust >= 1.88
  rust: clean Rust 1.88.0's `clippy::uninlined_format_args` lint
  rust: clean Rust 1.88.0's warning about `clippy::disallowed_macros` configuration
  rust: clean Rust 1.88.0's `unnecessary_transmutes` lint
  rust: allow Rust 1.87.0's `clippy::ptr_eq` lint
  objtool/rust: add one more `noreturn` Rust function for Rust 1.87.0

Merge tag 'drm-fixes-2025-05-10' of https://gitlab.freedesktop.org/drm/kernel

Pull drm fixes from Dave Airlie:
"Weekly drm fixes, bit bigger than last week, but overall amdgpu/xe
  with some ivpu bits and a random few fixes, and dropping the
  ttm_backup struct which wrapped struct file and was recently
  frowned at.

  drm:
   - Fix overflow when generating wedged event

  ttm:
   - Fix documentation
   - Remove struct ttm_backup

  panel:
   - simple: Fix timings for AUO G101EVN010

  amdgpu:
   - DC FP fixes
   - Freesync fix
   - DMUB AUX fixes
   - VCN fix
   - Hibernation fixes
   - HDP fixes

  xe:
   - Prevent PF queue overflow
   - Hold all forcewake during mocs test
   - Remove GSC flush on reset path
   - Fix forcewake put on error path
   - Fix runtime warning when building without svm

  i915:
   - Fix oops on resume after disconnecting DP MST sinks during suspend
   - Fix SPLC num_waiters refcounting

  ivpu:
   - Increase timeouts
   - Fix deadlock in cmdq ioctl
   - Unlock mutices in correct order

  v3d:
   - Avoid memory leak in job handling"

* tag 'drm-fixes-2025-05-10' of https://gitlab.freedesktop.org/drm/kernel: (32 commits)
  drm/i915/dp: Fix determining SST/MST mode during MTP TU state computation
  drm/xe: Add config control for svm flush work
  drm/xe: Release force wake first then runtime power
  drm/xe/gsc: do not flush the GSC worker from the reset path
  drm/xe/tests/mocs: Hold XE_FORCEWAKE_ALL for LNCF regs
  drm/xe: Add page queue multiplier
  drm/amdgpu/hdp7: use memcfg register to post the write for HDP flush
  drm/amdgpu/hdp6: use memcfg register to post the write for HDP flush
  drm/amdgpu/hdp5.2: use memcfg register to post the write for HDP flush
  drm/amdgpu/hdp5: use memcfg register to post the write for HDP flush
  drm/amdgpu/hdp4: use memcfg register to post the write for HDP flush
  drm/amdgpu: fix pm notifier handling
  Revert "drm/amd: Stop evicting resources on APUs in suspend"
  drm/amdgpu/vcn: using separate VCN1_AON_SOC offset
  drm/amd/display: Fix wrong handling for AUX_DEFER case
  drm/amd/display: Copy AUX read reply data whenever length > 0
  drm/amd/display: Remove incorrect checking in dmub aux handler
  drm/amd/display: Fix the checking condition in dmub aux handling
  drm/amd/display: Shift DMUB AUX reply command if necessary
  drm/amd/display: Call FP Protect Before Mode Programming/Mode Support
  ...

Merge tag 'drm-intel-fixes-2025-05-09' of https://gitlab.freedesktop.org/drm/i915/kernel into drm-fixes

drm/i915 fixes for v6.15-rc6:
- Fix oops on resume after disconnecting DP MST sinks during suspend
- Fix SPLC num_waiters refcounting

Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Jani Nikula <jani.nikula@intel.com>
Link: https://lore.kernel.org/r/87tt5umeaw.fsf@intel.com

Merge tag 'drm-xe-fixes-2025-05-09' of https://gitlab.freedesktop.org/drm/xe/kernel into drm-fixes

Driver Changes:
- Prevent PF queue overflow
- Hold all forcewake during mocs test
- Remove GSC flush on reset path
- Fix forcewake put on error path
- Fix runtime warning when building without svm

Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Lucas De Marchi <lucas.demarchi@intel.com>
Link: https://lore.kernel.org/r/jffqa56f2zp4i5ztz677cdspgxhnw7qfop3dd3l2epykfpfvza@q2nw6wapsphz

Merge tag 'arm64-fixes' of git://git./linux/kernel/git/arm64/linux

Pull arm64 fix from Catalin Marinas:
"Move the arm64_use_ng_mappings variable from the .bss to the .data
  section as it is accessed very early during boot with the MMU off and
  before the .bss has been initialised.

  This could lead to incorrect idmap page table"

* tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
  arm64: cpufeature: Move arm64_use_ng_mappings to the .data section to prevent wrong idmap generation

Merge tag 'riscv-for-linus-6.15-rc6' of git://git./linux/kernel/git/riscv/linux

Pull RISC-V fixes from Palmer Dabbelt:

- The compressed half-word misaligned access instructions (c.lhu, c.lh,
   and c.sh) from the Zcb extension are now properly emulated

- A series of fixes to properly emulate permissions while handling
   userspace misaligned accesses

- A pair of fixes for PR_GET_TAGGED_ADDR_CTRL to avoid accessing the
   envcfg CSR on systems that don't support that CSR, and to report
   those failures up to userspace

- The .rela.dyn section is no longer stripped from vmlinux, as it is
   necessary to relocate the kernel under some conditions (including
   kexec)

* tag 'riscv-for-linus-6.15-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux:
  riscv: Disallow PR_GET_TAGGED_ADDR_CTRL without Supm
  scripts: Do not strip .rela.dyn section
  riscv: Fix kernel crash due to PR_SET_TAGGED_ADDR_CTRL
  riscv: misaligned: use get_user() instead of __get_user()
  riscv: misaligned: enable IRQs while handling misaligned accesses
  riscv: misaligned: factorize trap handling
  riscv: misaligned: Add handling for ZCB instructions

Merge tag 'block-6.15-20250509' of git://git.kernel.dk/linux

Pull block fixes from Jens Axboe:

- Fix for a regression in this series for loop and read/write iterator
   handling

- zone append block update tweak

- remove a broken IO priority test

- NVMe pull request via Christoph:
      - unblock ctrl state transition for firmware update (Daniel
        Wagner)

* tag 'block-6.15-20250509' of git://git.kernel.dk/linux:
  block: remove test of incorrect io priority level
  nvme: unblock ctrl state transition for firmware update
  block: only update request sector if needed
  loop: Add sanity check for read/write_iter