David Howells [Mon, 16 Dec 2024 20:41:17 +0000 (20:41 +0000)]
netfs: Change the read result collector to only use one work item
Change the way netfslib collects read results to do all the collection for
a particular read request using a single work item that walks along the
subrequest queue as subrequests make progress or complete, unlocking folios
progressively rather than doing the unlock in parallel as parallel requests
come in.
The code is remodelled to be more like the write-side code, though only
using a single stream. This makes it more directly comparable and thus
easier to duplicate fixes between the two sides.
This has a number of advantages:
(1) It's simpler. There doesn't need to be a complex donation mechanism
to handle mismatches between the size and alignment of subrequests and
folios. The collector unlocks folios as the subrequests covering each
complete.
(2) It should cause less scheduler overhead as there's a single work item
in play unlocking pages in parallel when a read gets split up into a
lot of subrequests instead of one per subrequest.
Whilst the parallellism is nice in theory, in practice, the vast
majority of loads are sequential reads of the whole file, so
committing a bunch of threads to unlocking folios out of order doesn't
help in those cases.
(3) It should make it easier to implement content decryption. A folio
cannot be decrypted until all the requests that contribute to it have
completed - and, again, most loads are sequential and so, most of the
time, we want to begin decryption sequentially (though it's great if
the decryption can happen in parallel).
There is a disadvantage in that we're losing the ability to decrypt and
unlock things on an as-things-arrive basis which may affect some
applications.
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/20241216204124.3752367-28-dhowells@redhat.com
cc: Jeff Layton <jlayton@kernel.org>
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
David Howells [Mon, 16 Dec 2024 20:41:15 +0000 (20:41 +0000)]
afs: Make {Y,}FS.FetchData an asynchronous operation
Make FS.FetchData and YFS.FetchData an asynchronous operation in that the
request is queued in AF_RXRPC and then we return to the caller rather than
waiting. Processing of the returning packets is then done inline if it's a
synchronous VFS/VM call (readdir, read_folio, sync DIO, prep for write) or
offloaded to a workqueue if asynchronous VM calls (eg. readahead, async
DIO).
This reduces the chain of workqueues invoking workqueues and cuts out some
of the overhead, driving rxrpc data extraction and netfslib read collection
from a thread that's going to block to completion anyway if possible.
The ->done() call op is also split with ->immediate_cancel() handling the
cancellation on failure to begin the call and ->done() handling the rest.
This means that the AFS async FetchData code doesn't try to terminate the
netfs subrequest twice.
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/20241216204124.3752367-26-dhowells@redhat.com
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
David Howells [Mon, 16 Dec 2024 20:41:14 +0000 (20:41 +0000)]
afs: Fix cleanup of immediately failed async calls
If we manage to begin an async call, but fail to transmit any data on it
due to a signal, we then abort it which causes a race between the
notification of call completion from rxrpc and our attempt to cancel the
notification. The notification will be necessary, however, for async
FetchData to terminate the netfs subrequest.
However, since we get a notification from rxrpc upon completion of a call
(aborted or otherwise), we can just leave it to that.
This leads to calls not getting cleaned up, but appearing in
/proc/net/rxrpc/calls as being aborted with code 6.
Fix this by making the "error_do_abort:" case of afs_make_call() abort the
call and then abandon it to the notification handler.
Fixes:
34fa47612bfe ("afs: Fix race in async call refcounting")
Reported-by: Marc Dionne <marc.dionne@auristor.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/20241216204124.3752367-25-dhowells@redhat.com
cc: linux-afs@lists.infradead.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
David Howells [Mon, 16 Dec 2024 20:41:13 +0000 (20:41 +0000)]
afs: Eliminate afs_read
Now that directory and symlink reads go through netfslib, the afs_read
struct is mostly redundant with almost all data duplicated in the
netfs_io_request and netfs_io_subrequest structs that are also available
any time we're doing a fetch.
Eliminate afs_read by moving the one field we still need there to the
afs_call struct (we may be given a different amount of data than what we
asked for and have to track what remains of that) and using the
netfs_io_subrequest directly instead.
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/20241216204124.3752367-24-dhowells@redhat.com
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
David Howells [Mon, 16 Dec 2024 20:41:12 +0000 (20:41 +0000)]
afs: Use netfslib for symlinks, allowing them to be cached
Use netfslib to read symlinks, thereby allowing them to be cached by
fscache and cachefiles.
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/20241216204124.3752367-23-dhowells@redhat.com
cc: Marc Dionne <marc.dionne@auristor.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: linux-afs@lists.infradead.org
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
David Howells [Mon, 16 Dec 2024 20:41:11 +0000 (20:41 +0000)]
afs: Use netfslib for directories
In the AFS ecosystem, directories are just a special type of file that is
downloaded and parsed locally. Download is done by the same mechanism as
ordinary files and the data can be cached. There is one important semantic
restriction on directories over files: the client must download the entire
directory in one go because, for example, the server could fabricate the
contents of the blob on the fly with each download and give a different
image each time.
So that we can cache the directory download, switch AFS directory support
over to using the netfslib single-object API, thereby allowing directory
content to be stored in the local cache.
To make this work, the following changes are made:
(1) A directory's contents are now stored in a folio_queue chain attached
to the afs_vnode (inode) struct rather than its associated pagecache,
though multipage folios are still used to hold the data. The folio
queue is discarded when the directory inode is evicted.
This also helps with the phasing out of ITER_XARRAY.
(2) Various directory operations are made to use and unuse the cache
cookie.
(3) The content checking, content dumping and content iteration are now
performed with a standard iov_iter iterator over the contents of the
folio queue.
(4) Iteration and modification must be done with the vnode's validate_lock
held. In conjunction with (1), this means that the iteration can be
done without the need to lock pages or take extra refs on them, unlike
when accessing ->i_pages.
(5) Convert to using netfs_read_single() to read data.
(6) Provide a ->writepages() to call netfs_writeback_single() to save the
data to the cache according to the VM's scheduling whilst holding the
validate_lock read-locked as (4).
(7) Change local directory image editing functions:
(a) Provide a function to get a specific block by number from the
folio_queue as we can no longer use the i_pages xarray to locate
folios by index. This uses a cursor to remember the current
position as we need to iterate through the directory contents.
The block is kmapped before being returned.
(b) Make the function in (a) extend the directory by an extra folio if
we run out of space.
(c) Raise the check of the block free space counter, for those blocks
that have one, higher in the function to eliminate a call to get a
block.
(d) Remove the page unlocking and putting done during the editing
loops. This is no longer necessary as the folio_queue holds the
references and the pages are no longer in the pagecache.
(e) Mark the inode dirty and pin the cache usage till writeback at the
end of a successful edit.
(8) Don't set the large_folios flag on the inode as we do the allocation
ourselves rather than the VM doing it automatically.
(9) Mark the inode as being a single object that isn't uploaded to the
server.
(10) Enable caching on directories.
(11) Only set the upload key for writeback for regular files.
Notes:
(*) We keep the ->release_folio(), ->invalidate_folio() and
->migrate_folio() ops as we set the mapping pointer on the folio.
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/20241216204124.3752367-22-dhowells@redhat.com
cc: Marc Dionne <marc.dionne@auristor.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: linux-afs@lists.infradead.org
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
David Howells [Mon, 16 Dec 2024 20:41:10 +0000 (20:41 +0000)]
afs: Make afs_init_request() get a key if not given a file
In a future patch, AFS directory caching will go through netfslib and this
will involve, at times, running on behalf of ->lookup(), which doesn't
provide us with a file from which we can get an authentication key.
If a file isn't provided, make afs_init_request() get a key from the
process's keyrings instead when setting up a read.
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/20241216204124.3752367-21-dhowells@redhat.com
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
David Howells [Mon, 16 Dec 2024 20:41:09 +0000 (20:41 +0000)]
netfs: Add support for caching single monolithic objects such as AFS dirs
Add support for caching the content of a file that contains a single
monolithic object that must be read/written with a single I/O operation,
such as an AFS directory.
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/20241216204124.3752367-20-dhowells@redhat.com
cc: Jeff Layton <jlayton@kernel.org>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: netfs@lists.linux.dev
cc: linux-afs@lists.infradead.org
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
David Howells [Mon, 16 Dec 2024 20:41:08 +0000 (20:41 +0000)]
netfs: Add functions to build/clean a buffer in a folio_queue
Add two netfslib functions to build up or clean up a buffer in a
folio_queue. The first, netfs_alloc_folioq_buffer() will add folios to a
buffer, extending up at least to the given size. If it can, it will add
multipage folios. The folios are optionally have the mapping set and will
have the index set according to the distance from the front of the folio
queue.
The second function will free up a folio queue and put any folios in the
queue that have the first mark set.
The netfs_folio tracepoint is also altered to cope with folios that have a
NULL mapping, and the folios being added/put will have trace lines emitted
and will be accounted in the stats.
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/20241216204124.3752367-19-dhowells@redhat.com
cc: Jeff Layton <jlayton@kernel.org>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: netfs@lists.linux.dev
cc: linux-afs@lists.infradead.org
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
David Howells [Mon, 16 Dec 2024 20:41:07 +0000 (20:41 +0000)]
afs: Add more tracepoints to do with tracking validity
Add wrappers to set and clear the callback promise and to mark a directory
as invalidated, and add tracepoints to track these events:
(1) afs_cb_promise: Log when a callback promise is set on a vnode.
(2) afs_vnode_invalid: Log when the server's callback promise for a vnode
is no longer valid and we need to refetch the vnode metadata.
(3) afs_dir_invalid: Log when the contents of a directory are marked
invalid and requiring refetching from the server and the cache
invalidating.
and two tracepoints to record data version number management:
(4) afs_set_dv: Log when the DV is recorded on a vnode.
(5) afs_dv_mismatch: Log when the DV recorded on a vnode plus the expected
delta for the operation does not match the DV we got back from the
server.
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/20241216204124.3752367-18-dhowells@redhat.com
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
David Howells [Mon, 16 Dec 2024 20:41:06 +0000 (20:41 +0000)]
cachefiles: Add auxiliary data trace
Add a display of the first 8 bytes of the downloaded auxiliary data and of
the on-disk stored auxiliary data as these are used in coherency
management. In the case of afs, this holds the data version number.
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/20241216204124.3752367-17-dhowells@redhat.com
cc: Jeff Layton <jlayton@kernel.org>
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
David Howells [Mon, 16 Dec 2024 20:41:05 +0000 (20:41 +0000)]
cachefiles: Add some subrequest tracepoints
Add some tracepoints into the cachefiles write paths.
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/20241216204124.3752367-16-dhowells@redhat.com
cc: netfs@lists.linux.dev
Signed-off-by: Christian Brauner <brauner@kernel.org>
David Howells [Mon, 16 Dec 2024 20:41:04 +0000 (20:41 +0000)]
netfs: Remove some extraneous directory invalidations
In the directory editing code, we shouldn't re-invalidate the directory
if it is already invalidated.
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/20241216204124.3752367-15-dhowells@redhat.com
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
David Howells [Mon, 16 Dec 2024 20:41:03 +0000 (20:41 +0000)]
afs: Fix directory format encoding struct
The AFS directory format structure, union afs_xdr_dir_block::meta, has too
many alloc counter slots declared and so pushes the hash table along and
over the data. This doesn't cause a problem at the moment because I'm
currently ignoring the hash table and only using the correct number of
alloc_ctrs in the code anyway. In future, however, I should start using
the hash table to try and speed up afs_lookup().
Fix this by using the correct constant to declare the counter array.
Fixes:
4ea219a839bf ("afs: Split the directory content defs into a header")
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/20241216204124.3752367-14-dhowells@redhat.com
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
David Howells [Mon, 16 Dec 2024 20:41:02 +0000 (20:41 +0000)]
afs: Fix EEXIST error returned from afs_rmdir() to be ENOTEMPTY
AFS servers pass back a code indicating EEXIST when they're asked to remove
a directory that is not empty rather than ENOTEMPTY because not all the
systems that an AFS server can run on have the latter error available and
AFS preexisted the addition of that error in general.
Fix afs_rmdir() to translate EEXIST to ENOTEMPTY.
Fixes:
260a980317da ("[AFS]: Add "directory write" support.")
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/20241216204124.3752367-13-dhowells@redhat.com
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
David Howells [Mon, 16 Dec 2024 20:41:01 +0000 (20:41 +0000)]
afs: Don't use mutex for I/O operation lock
Don't use the standard mutex for the I/O operation lock, but rather
implement our own as the standard mutex must be released in the same thread
as locked it. This is a problem when it comes to doing async FetchData
where the lock will be dropped from the workqueue that processed the
incoming data and not from the issuing thread.
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/20241216204124.3752367-12-dhowells@redhat.com
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
David Howells [Mon, 16 Dec 2024 20:41:00 +0000 (20:41 +0000)]
netfs: Don't use bh spinlock
All the accessing of the subrequest lists is now done in process context,
possibly in a workqueue, but not now in a BH context, so we don't need the
lock against BH interference when taking the netfs_io_request::lock
spinlock.
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/20241216204124.3752367-11-dhowells@redhat.com
cc: Jeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
David Howells [Mon, 16 Dec 2024 20:40:59 +0000 (20:40 +0000)]
netfs: Drop the was_async arg from netfs_read_subreq_terminated()
Drop the was_async argument from netfs_read_subreq_terminated(). Almost
every caller is either in process context and passes false. Some
filesystems delegate the call to a workqueue to avoid doing the work in
their network message queue parsing thread.
The only exception is netfs_cache_read_terminated() which handles
completion in the cache - which is usually a callback from the backing
filesystem in softirq context, though it can be from process context if an
error occurred. In this case, delegate to a workqueue.
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/CAHk-=wiVC5Cgyz6QKXFu6fTaA6h4CjexDR-OV9kL6Vo5x9v8=A@mail.gmail.com/
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/20241216204124.3752367-10-dhowells@redhat.com
cc: Jeff Layton <jlayton@kernel.org>
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
David Howells [Mon, 16 Dec 2024 20:40:58 +0000 (20:40 +0000)]
netfs: Drop the error arg from netfs_read_subreq_terminated()
Drop the error argument from netfs_read_subreq_terminated() in favour of
passing the value in subreq->error.
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/20241216204124.3752367-9-dhowells@redhat.com
cc: Jeff Layton <jlayton@kernel.org>
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
David Howells [Mon, 16 Dec 2024 20:40:57 +0000 (20:40 +0000)]
netfs: Split retry code out of fs/netfs/write_collect.c
Split write-retry code out of fs/netfs/write_collect.c as it will become
more elaborate when content crypto is introduced.
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/20241216204124.3752367-8-dhowells@redhat.com
cc: Jeff Layton <jlayton@kernel.org>
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
David Howells [Mon, 16 Dec 2024 20:40:56 +0000 (20:40 +0000)]
netfs: Make netfs_advance_write() return size_t
netfs_advance_write() calculates the amount of data it's attaching to a
stream with size_t, but then returns this as an int. Switch the return
value to size_t for consistency.
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/20241216204124.3752367-7-dhowells@redhat.com
cc: Jeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
David Howells [Mon, 16 Dec 2024 20:40:55 +0000 (20:40 +0000)]
netfs: Abstract out a rolling folio buffer implementation
A rolling buffer is a series of folios held in a list of folio_queues. New
folios and folio_queue structs may be inserted at the head simultaneously
with spent ones being removed from the tail without the need for locking.
The rolling buffer includes an iov_iter and it has to be careful managing
this as the list of folio_queues is extended such that an oops doesn't
incurred because the iterator was pointing to the end of a folio_queue
segment that got appended to and then removed.
We need to use the mechanism twice, once for read and once for write, and,
in future patches, we will use a second rolling buffer to handle bounce
buffering for content encryption.
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/20241216204124.3752367-6-dhowells@redhat.com
cc: Jeff Layton <jlayton@kernel.org>
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
David Howells [Mon, 16 Dec 2024 20:40:54 +0000 (20:40 +0000)]
netfs: Add a tracepoint to log the lifespan of folio_queue structs
Add a tracepoint to log the lifespan of folio_queue structs. For tracing
illustrative purposes, folio_queues are tagged with the debug ID of
whatever they're related to (typically a netfs_io_request) and a debug ID
of their own.
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/20241216204124.3752367-5-dhowells@redhat.com
cc: Jeff Layton <jlayton@kernel.org>
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
David Howells [Mon, 16 Dec 2024 20:40:53 +0000 (20:40 +0000)]
netfs: Use a folio_queue allocation and free functions
Provide and use folio_queue allocation and free functions to combine the
allocation, initialisation and stat (un)accounting steps that are repeated
in several places.
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/20241216204124.3752367-4-dhowells@redhat.com
cc: Jeff Layton <jlayton@kernel.org>
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
David Howells [Mon, 16 Dec 2024 20:40:52 +0000 (20:40 +0000)]
cachefiles: Clean up some whitespace in trace header
Clean up some whitespace in the cachefiles trace header.
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/20241216204124.3752367-3-dhowells@redhat.com
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
David Howells [Mon, 16 Dec 2024 20:40:51 +0000 (20:40 +0000)]
netfs: Clean up some whitespace in trace header
Clean up some whitespace in the netfs trace header.
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/20241216204124.3752367-2-dhowells@redhat.com
cc: Jeff Layton <jlayton@kernel.org>
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
Christian Brauner [Fri, 20 Dec 2024 21:08:16 +0000 (22:08 +0100)]
Merge patch series "netfs, ceph, nfs, cachefiles: Miscellaneous fixes/changes"
David Howells <dhowells@redhat.com> says:
Here are some miscellaneous fixes and changes for netfslib and the ceph and
nfs filesystems:
(1) Ignore silly-rename files from afs and nfs when building the header
archive in a kernel build.
(2) netfs: Fix the way read result collection applies results to folios
when each folio is being read by multiple subrequests and the results
come out of order.
(3) netfs: Fix ENOMEM handling in buffered reads.
(4) nfs: Fix an oops in nfs_netfs_init_request() when copying to the cache.
(5) cachefiles: Parse the "secctx" command immediately to get the correct
error rather than leaving it to the "bind" command.
(6) netfs: Remove a redundant smp_rmb(). This isn't a bug per se and
could be deferred.
(7) netfs: Fix missing barriers by using clear_and_wake_up_bit().
(8) netfs: Work around recursion in read retry by failing and abandoning
the retried subrequest if no I/O is performed.
[!] NOTE: This only works around the recursion problem if the
recursion keeps returning no data. If the server manages, say, to
repeatedly return a single byte of data faster than the retry
algorithm can complete, it will still recurse and the stack
overrun may still occur. Actually fixing this requires quite an
intrusive change which will hopefully make the next merge window.
(9) netfs: Fix the clearance of a folio_queue when unlocking the page if
we're going to want to subsequently send the queue for copying to the
cache (if, for example, we're using ceph).
(10) netfs: Fix the lack of cancellation of copy-to-cache when the cache
for a file is temporarily disabled (for example when a DIO write is
done to the file). This patch and (9) fix hangs with ceph.
With these patches, I can run xfstest -g quick to completion on ceph with a
local cache.
The patches can also be found here with a bonus cifs patch:
https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git/log/?h=netfs-fixes
* patches from https://lore.kernel.org/r/
20241213135013.
2964079-1-dhowells@redhat.com:
netfs: Fix is-caching check in read-retry
netfs: Fix the (non-)cancellation of copy when cache is temporarily disabled
netfs: Fix ceph copy to cache on write-begin
netfs: Work around recursion by abandoning retry if nothing read
netfs: Fix missing barriers by using clear_and_wake_up_bit()
netfs: Remove redundant use of smp_rmb()
cachefiles: Parse the "secctx" immediately
nfs: Fix oops in nfs_netfs_init_request() when copying to cache
netfs: Fix enomem handling in buffered reads
netfs: Fix non-contiguous donation between completed reads
kheaders: Ignore silly-rename files
Link: https://lore.kernel.org/r/20241213135013.2964079-1-dhowells@redhat.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
David Howells [Mon, 16 Dec 2024 20:34:45 +0000 (20:34 +0000)]
netfs: Fix is-caching check in read-retry
netfs: Fix is-caching check in read-retry
The read-retry code checks the NETFS_RREQ_COPY_TO_CACHE flag to determine
if there might be failed reads from the cache that need turning into reads
from the server, with the intention of skipping the complicated part if it
can. The code that set the flag, however, got lost during the read-side
rewrite.
Fix the check to see if the cache_resources are valid instead. The flag
can then be removed.
Fixes:
ee4cdf7ba857 ("netfs: Speed up buffered reading")
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/3752048.1734381285@warthog.procyon.org.uk
cc: Jeff Layton <jlayton@kernel.org>
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
David Howells [Fri, 13 Dec 2024 13:50:10 +0000 (13:50 +0000)]
netfs: Fix the (non-)cancellation of copy when cache is temporarily disabled
When the caching for a cookie is temporarily disabled (e.g. due to a DIO
write on that file), future copying to the cache for that file is disabled
until all fds open on that file are closed. However, if netfslib is using
the deprecated PG_private_2 method (such as is currently used by ceph), and
decides it wants to copy to the cache, netfs_advance_write() will just bail
at the first check seeing that the cache stream is unavailable, and
indicate that it dealt with all the content.
This means that we have no subrequests to provide notifications to drive
the state machine or even to pin the request and the request just gets
discarded, leaving the folios with PG_private_2 set.
Fix this by jumping directly to cancel the request if the cache is not
available. That way, we don't remove mark3 from the folio_queue list and
netfs_pgpriv2_cancel() will clean up the folios.
This was found by running the generic/013 xfstest against ceph with an
active cache and the "-o fsc" option passed to ceph. That would usually
hang
Fixes:
ee4cdf7ba857 ("netfs: Speed up buffered reading")
Reported-by: Max Kellermann <max.kellermann@ionos.com>
Closes: https://lore.kernel.org/r/CAKPOu+_4m80thNy5_fvROoxBm689YtA0dZ-=gcmkzwYSY4syqw@mail.gmail.com/
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/20241213135013.2964079-11-dhowells@redhat.com
cc: Jeff Layton <jlayton@kernel.org>
cc: Ilya Dryomov <idryomov@gmail.com>
cc: Xiubo Li <xiubli@redhat.com>
cc: netfs@lists.linux.dev
cc: ceph-devel@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
David Howells [Fri, 13 Dec 2024 13:50:09 +0000 (13:50 +0000)]
netfs: Fix ceph copy to cache on write-begin
At the end of netfs_unlock_read_folio() in which folios are marked
appropriately for copying to the cache (either with by being marked dirty
and having their private data set or by having PG_private_2 set) and then
unlocked, the folio_queue struct has the entry pointing to the folio
cleared. This presents a problem for netfs_pgpriv2_write_to_the_cache(),
which is used to write folios marked with PG_private_2 to the cache as it
expects to be able to trawl the folio_queue list thereafter to find the
relevant folios, leading to a hang.
Fix this by not clearing the folio_queue entry if we're going to do the
deprecated copy-to-cache. The clearance will be done instead as the folios
are written to the cache.
This can be reproduced by starting cachefiles, mounting a ceph filesystem
with "-o fsc" and writing to it.
Fixes:
796a4049640b ("netfs: In readahead, put the folio refs as soon extracted")
Reported-by: Max Kellermann <max.kellermann@ionos.com>
Closes: https://lore.kernel.org/r/CAKPOu+_4m80thNy5_fvROoxBm689YtA0dZ-=gcmkzwYSY4syqw@mail.gmail.com/
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/20241213135013.2964079-10-dhowells@redhat.com
Fixes:
ee4cdf7ba857 ("netfs: Speed up buffered reading")
cc: Jeff Layton <jlayton@kernel.org>
cc: Ilya Dryomov <idryomov@gmail.com>
cc: Xiubo Li <xiubli@redhat.com>
cc: netfs@lists.linux.dev
cc: ceph-devel@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
David Howells [Fri, 13 Dec 2024 13:50:08 +0000 (13:50 +0000)]
netfs: Work around recursion by abandoning retry if nothing read
syzkaller reported recursion with a loop of three calls (netfs_rreq_assess,
netfs_retry_reads and netfs_rreq_terminated) hitting the limit of the stack
during an unbuffered or direct I/O read.
There are a number of issues:
(1) There is no limit on the number of retries.
(2) A subrequest is supposed to be abandoned if it does not transfer
anything (NETFS_SREQ_NO_PROGRESS), but that isn't checked under all
circumstances.
(3) The actual root cause, which is this:
if (atomic_dec_and_test(&rreq->nr_outstanding))
netfs_rreq_terminated(rreq, ...);
When we do a retry, we bump the rreq->nr_outstanding counter to
prevent the final cleanup phase running before we've finished
dispatching the retries. The problem is if we hit 0, we have to do
the cleanup phase - but we're in the cleanup phase and end up
repeating the retry cycle, hence the recursion.
Work around the problem by limiting the number of retries. This is based
on Lizhi Xu's patch[1], and makes the following changes:
(1) Replace NETFS_SREQ_NO_PROGRESS with NETFS_SREQ_MADE_PROGRESS and make
the filesystem set it if it managed to read or write at least one byte
of data. Clear this bit before issuing a subrequest.
(2) Add a ->retry_count member to the subrequest and increment it any time
we do a retry.
(3) Remove the NETFS_SREQ_RETRYING flag as it is superfluous with
->retry_count. If the latter is non-zero, we're doing a retry.
(4) Abandon a subrequest if retry_count is non-zero and we made no
progress.
(5) Use ->retry_count in both the write-side and the read-size.
[?] Question: Should I set a hard limit on retry_count in both read and
write? Say it hits 50, we always abandon it. The problem is that
these changes only mitigate the issue. As long as it made at least one
byte of progress, the recursion is still an issue. This patch
mitigates the problem, but does not fix the underlying cause. I have
patches that will do that, but it's an intrusive fix that's currently
pending for the next merge window.
The oops generated by KASAN looks something like:
BUG: TASK stack guard page was hit at
ffffc9000482ff48 (stack is
ffffc90004830000..
ffffc90004838000)
Oops: stack guard page: 0000 [#1] PREEMPT SMP KASAN NOPTI
...
RIP: 0010:mark_lock+0x25/0xc60 kernel/locking/lockdep.c:4686
...
mark_usage kernel/locking/lockdep.c:4646 [inline]
__lock_acquire+0x906/0x3ce0 kernel/locking/lockdep.c:5156
lock_acquire.part.0+0x11b/0x380 kernel/locking/lockdep.c:5825
local_lock_acquire include/linux/local_lock_internal.h:29 [inline]
___slab_alloc+0x123/0x1880 mm/slub.c:3695
__slab_alloc.constprop.0+0x56/0xb0 mm/slub.c:3908
__slab_alloc_node mm/slub.c:3961 [inline]
slab_alloc_node mm/slub.c:4122 [inline]
kmem_cache_alloc_noprof+0x2a7/0x2f0 mm/slub.c:4141
radix_tree_node_alloc.constprop.0+0x1e8/0x350 lib/radix-tree.c:253
idr_get_free+0x528/0xa40 lib/radix-tree.c:1506
idr_alloc_u32+0x191/0x2f0 lib/idr.c:46
idr_alloc+0xc1/0x130 lib/idr.c:87
p9_tag_alloc+0x394/0x870 net/9p/client.c:321
p9_client_prepare_req+0x19f/0x4d0 net/9p/client.c:644
p9_client_zc_rpc.constprop.0+0x105/0x880 net/9p/client.c:793
p9_client_read_once+0x443/0x820 net/9p/client.c:1570
p9_client_read+0x13f/0x1b0 net/9p/client.c:1534
v9fs_issue_read+0x115/0x310 fs/9p/vfs_addr.c:74
netfs_retry_read_subrequests fs/netfs/read_retry.c:60 [inline]
netfs_retry_reads+0x153a/0x1d00 fs/netfs/read_retry.c:232
netfs_rreq_assess+0x5d3/0x870 fs/netfs/read_collect.c:371
netfs_rreq_terminated+0xe5/0x110 fs/netfs/read_collect.c:407
netfs_retry_reads+0x155e/0x1d00 fs/netfs/read_retry.c:235
netfs_rreq_assess+0x5d3/0x870 fs/netfs/read_collect.c:371
netfs_rreq_terminated+0xe5/0x110 fs/netfs/read_collect.c:407
netfs_retry_reads+0x155e/0x1d00 fs/netfs/read_retry.c:235
netfs_rreq_assess+0x5d3/0x870 fs/netfs/read_collect.c:371
...
netfs_rreq_terminated+0xe5/0x110 fs/netfs/read_collect.c:407
netfs_retry_reads+0x155e/0x1d00 fs/netfs/read_retry.c:235
netfs_rreq_assess+0x5d3/0x870 fs/netfs/read_collect.c:371
netfs_rreq_terminated+0xe5/0x110 fs/netfs/read_collect.c:407
netfs_retry_reads+0x155e/0x1d00 fs/netfs/read_retry.c:235
netfs_rreq_assess+0x5d3/0x870 fs/netfs/read_collect.c:371
netfs_rreq_terminated+0xe5/0x110 fs/netfs/read_collect.c:407
netfs_dispatch_unbuffered_reads fs/netfs/direct_read.c:103 [inline]
netfs_unbuffered_read fs/netfs/direct_read.c:127 [inline]
netfs_unbuffered_read_iter_locked+0x12f6/0x19b0 fs/netfs/direct_read.c:221
netfs_unbuffered_read_iter+0xc5/0x100 fs/netfs/direct_read.c:256
v9fs_file_read_iter+0xbf/0x100 fs/9p/vfs_file.c:361
do_iter_readv_writev+0x614/0x7f0 fs/read_write.c:832
vfs_readv+0x4cf/0x890 fs/read_write.c:1025
do_preadv fs/read_write.c:1142 [inline]
__do_sys_preadv fs/read_write.c:1192 [inline]
__se_sys_preadv fs/read_write.c:1187 [inline]
__x64_sys_preadv+0x22d/0x310 fs/read_write.c:1187
do_syscall_x64 arch/x86/entry/common.c:52 [inline]
do_syscall_64+0xcd/0x250 arch/x86/entry/common.c:83
Fixes:
ee4cdf7ba857 ("netfs: Speed up buffered reading")
Closes: https://syzkaller.appspot.com/bug?extid=
1fc6f64c40a9d143cfb6
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/20241108034020.3695718-1-lizhi.xu@windriver.com/
Link: https://lore.kernel.org/r/20241213135013.2964079-9-dhowells@redhat.com
Tested-by: syzbot+885c03ad650731743489@syzkaller.appspotmail.com
Suggested-by: Lizhi Xu <lizhi.xu@windriver.com>
cc: Dominique Martinet <asmadeus@codewreck.org>
cc: Jeff Layton <jlayton@kernel.org>
cc: v9fs@lists.linux.dev
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Reported-by: syzbot+885c03ad650731743489@syzkaller.appspotmail.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
David Howells [Fri, 13 Dec 2024 13:50:07 +0000 (13:50 +0000)]
netfs: Fix missing barriers by using clear_and_wake_up_bit()
Use clear_and_wake_up_bit() rather than something like:
clear_bit_unlock(NETFS_RREQ_IN_PROGRESS, &rreq->flags);
wake_up_bit(&rreq->flags, NETFS_RREQ_IN_PROGRESS);
as there needs to be a barrier inserted between which is present in
clear_and_wake_up_bit().
Fixes:
288ace2f57c9 ("netfs: New writeback implementation")
Fixes:
ee4cdf7ba857 ("netfs: Speed up buffered reading")
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/20241213135013.2964079-8-dhowells@redhat.com
Reviewed-by: Akira Yokosawa <akiyks@gmail.com>
cc: Zilin Guan <zilin@seu.edu.cn>
cc: Akira Yokosawa <akiyks@gmail.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
Zilin Guan [Fri, 13 Dec 2024 13:50:06 +0000 (13:50 +0000)]
netfs: Remove redundant use of smp_rmb()
The function netfs_unbuffered_write_iter_locked() in
fs/netfs/direct_write.c contains an unnecessary smp_rmb() call after
wait_on_bit(). Since wait_on_bit() already incorporates a memory barrier
that ensures the flag update is visible before the function returns, the
smp_rmb() provides no additional benefit and incurs unnecessary overhead.
This patch removes the redundant barrier to simplify and optimize the code.
Signed-off-by: Zilin Guan <zilin@seu.edu.cn>
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/20241207021952.2978530-1-zilin@seu.edu.cn/
Link: https://lore.kernel.org/r/20241213135013.2964079-7-dhowells@redhat.com
Reviewed-by: Akira Yokosawa <akiyks@gmail.com>
cc: Akira Yokosawa <akiyks@gmail.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
Max Kellermann [Fri, 13 Dec 2024 13:50:05 +0000 (13:50 +0000)]
cachefiles: Parse the "secctx" immediately
Instead of storing an opaque string, call security_secctx_to_secid()
right in the "secctx" command handler and store only the numeric
"secid". This eliminates an unnecessary string allocation and allows
the daemon to receive errors when writing the "secctx" command instead
of postponing the error to the "bind" command handler. For example,
if the kernel was built without `CONFIG_SECURITY`, "bind" will return
`EOPNOTSUPP`, but the daemon doesn't know why. With this patch, the
"secctx" will instead return `EOPNOTSUPP` which is the right context
for this error.
This patch adds a boolean flag `have_secid` because I'm not sure if we
can safely assume that zero is the special secid value for "not set".
This appears to be true for SELinux, Smack and AppArmor, but since
this attribute is not documented, I'm unable to derive a stable
guarantee for that.
Signed-off-by: Max Kellermann <max.kellermann@ionos.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/20241209141554.638708-1-max.kellermann@ionos.com/
Link: https://lore.kernel.org/r/20241213135013.2964079-6-dhowells@redhat.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
David Howells [Fri, 13 Dec 2024 13:50:04 +0000 (13:50 +0000)]
nfs: Fix oops in nfs_netfs_init_request() when copying to cache
When netfslib wants to copy some data that has just been read on behalf of
nfs, it creates a new write request and calls nfs_netfs_init_request() to
initialise it, but with a NULL file pointer. This causes
nfs_file_open_context() to oops - however, we don't actually need the nfs
context as we're only going to write to the cache.
Fix this by just returning if we aren't given a file pointer and emit a
warning if the request was for something other than copy-to-cache.
Further, fix nfs_netfs_free_request() so that it doesn't try to free the
context if the pointer is NULL.
Fixes:
ee4cdf7ba857 ("netfs: Speed up buffered reading")
Reported-by: Max Kellermann <max.kellermann@ionos.com>
Closes: https://lore.kernel.org/r/CAKPOu+9DyMbKLhyJb7aMLDTb=Fh0T8Teb9sjuf_pze+XWT1VaQ@mail.gmail.com/
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/20241213135013.2964079-5-dhowells@redhat.com
cc: Trond Myklebust <trondmy@kernel.org>
cc: Anna Schumaker <anna@kernel.org>
cc: Dave Wysochanski <dwysocha@redhat.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: linux-nfs@vger.kernel.org
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
David Howells [Fri, 13 Dec 2024 13:50:03 +0000 (13:50 +0000)]
netfs: Fix enomem handling in buffered reads
If netfs_read_to_pagecache() gets an error from either ->prepare_read() or
from netfs_prepare_read_iterator(), it needs to decrement ->nr_outstanding,
cancel the subrequest and break out of the issuing loop. Currently, it
only does this for two of the cases, but there are two more that aren't
handled.
Fix this by moving the handling to a common place and jumping to it from
all four places. This is in preference to inserting a wrapper around
netfs_prepare_read_iterator() as proposed by Dmitry Antipov[1].
Link: https://lore.kernel.org/r/20241202093943.227786-1-dmantipov@yandex.ru/
Fixes:
ee4cdf7ba857 ("netfs: Speed up buffered reading")
Reported-by: syzbot+404b4b745080b6210c6c@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=
404b4b745080b6210c6c
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/20241213135013.2964079-4-dhowells@redhat.com
Tested-by: syzbot+404b4b745080b6210c6c@syzkaller.appspotmail.com
cc: Dmitry Antipov <dmantipov@yandex.ru>
cc: Jeff Layton <jlayton@kernel.org>
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
David Howells [Fri, 13 Dec 2024 13:50:02 +0000 (13:50 +0000)]
netfs: Fix non-contiguous donation between completed reads
When a read subrequest finishes, if it doesn't have sufficient coverage to
complete the folio(s) covering either side of it, it will donate the excess
coverage to the adjacent subrequests on either side, offloading
responsibility for unlocking the folio(s) covered to them.
Now, preference is given to donating down to a lower file offset over
donating up because that check is done first - but there's no check that
the lower subreq is actually contiguous, and so we can end up donating
incorrectly.
The scenario seen[1] is that an 8MiB readahead request spanning four 2MiB
folios is split into eight 1MiB subreqs (numbered 1 through 8). These
terminate in the order 1,6,2,5,3,7,4,8. What happens is:
- 1 donates to 2
- 6 donates to 5
- 2 completes, unlocking the first folio (with 1).
- 5 completes, unlocking the third folio (with 6).
- 3 donates to 4
- 7 donates to 4 incorrectly
- 4 completes, unlocking the second folio (with 3), but can't use
the excess from 7.
- 8 donates to 4, also incorrectly.
Fix this by preventing downward donation if the subreqs are not contiguous
(in the example above, 7 donates to 4 across the gap left by 5 and 6).
Reported-by: Shyam Prasad N <nspmangalore@gmail.com>
Closes: https://lore.kernel.org/r/CANT5p=qBwjBm-D8soFVVtswGEfmMtQXVW83=TNfUtvyHeFQZBA@mail.gmail.com/
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/526707.1733224486@warthog.procyon.org.uk/
Link: https://lore.kernel.org/r/20241213135013.2964079-3-dhowells@redhat.com
cc: Steve French <sfrench@samba.org>
cc: Paulo Alcantara <pc@manguebit.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: linux-cifs@vger.kernel.org
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
David Howells [Fri, 13 Dec 2024 13:50:01 +0000 (13:50 +0000)]
kheaders: Ignore silly-rename files
Tell tar to ignore silly-rename files (".__afs*" and ".nfs*") when building
the header archive. These occur when a file that is open is unlinked
locally, but hasn't yet been closed. Such files are visible to the user
via the getdents() syscall and so programs may want to do things with them.
During the kernel build, such files may be made during the processing of
header files and the cleanup may get deferred by fput() which may result in
tar seeing these files when it reads the directory, but they may have
disappeared by the time it tries to open them, causing tar to fail with an
error. Further, we don't want to include them in the tarball if they still
exist.
With CONFIG_HEADERS_INSTALL=y, something like the following may be seen:
find: './kernel/.tmp_cpio_dir/include/dt-bindings/reset/.__afs2080': No such file or directory
tar: ./include/linux/greybus/.__afs3C95: File removed before we read it
The find warning doesn't seem to cause a problem.
Fix this by telling tar when called from in gen_kheaders.sh to exclude such
files. This only affects afs and nfs; cifs uses the Windows Hidden
attribute to prevent the file from being seen.
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/20241213135013.2964079-2-dhowells@redhat.com
cc: Masahiro Yamada <masahiroy@kernel.org>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
cc: linux-nfs@vger.kernel.org
cc: linux-kernel@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
Amir Goldstein [Thu, 19 Dec 2024 11:53:01 +0000 (12:53 +0100)]
fs: relax assertions on failure to encode file handles
Encoding file handles is usually performed by a filesystem >encode_fh()
method that may fail for various reasons.
The legacy users of exportfs_encode_fh(), namely, nfsd and
name_to_handle_at(2) syscall are ready to cope with the possibility
of failure to encode a file handle.
There are a few other users of exportfs_encode_{fh,fid}() that
currently have a WARN_ON() assertion when ->encode_fh() fails.
Relax those assertions because they are wrong.
The second linked bug report states commit
16aac5ad1fa9 ("ovl: support
encoding non-decodable file handles") in v6.6 as the regressing commit,
but this is not accurate.
The aforementioned commit only increases the chances of the assertion
and allows triggering the assertion with the reproducer using overlayfs,
inotify and drop_caches.
Triggering this assertion was always possible with other filesystems and
other reasons of ->encode_fh() failures and more particularly, it was
also possible with the exact same reproducer using overlayfs that is
mounted with options index=on,nfs_export=on also on kernels < v6.6.
Therefore, I am not listing the aforementioned commit as a Fixes commit.
Backport hint: this patch will have a trivial conflict applying to
v6.6.y, and other trivial conflicts applying to stable kernels < v6.6.
Reported-by: syzbot+ec07f6f5ce62b858579f@syzkaller.appspotmail.com
Tested-by: syzbot+ec07f6f5ce62b858579f@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/linux-unionfs/
671fd40c.
050a0220.4735a.024f.GAE@google.com/
Reported-by: Dmitry Safonov <dima@arista.com>
Closes: https://lore.kernel.org/linux-fsdevel/CAGrbwDTLt6drB9eaUagnQVgdPBmhLfqqxAf3F+Juqy_o6oP8uw@mail.gmail.com/
Cc: stable@vger.kernel.org
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Link: https://lore.kernel.org/r/20241219115301.465396-1-amir73il@gmail.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
Zhang Kunbo [Tue, 17 Dec 2024 07:18:36 +0000 (07:18 +0000)]
fs: fix missing declaration of init_files
fs/file.c should include include/linux/init_task.h for
declaration of init_files. This fixes the sparse warning:
fs/file.c:501:21: warning: symbol 'init_files' was not declared. Should it be static?
Signed-off-by: Zhang Kunbo <zhangkunbo@huawei.com>
Link: https://lore.kernel.org/r/20241217071836.2634868-1-zhangkunbo@huawei.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
Miklos Szeredi [Wed, 11 Dec 2024 12:11:17 +0000 (13:11 +0100)]
fs: fix is_mnt_ns_file()
Commit
1fa08aece425 ("nsfs: convert to path_from_stashed() helper") reused
nsfs dentry's d_fsdata, which no longer contains a pointer to
proc_ns_operations.
Fix the remaining use in is_mnt_ns_file().
Fixes:
1fa08aece425 ("nsfs: convert to path_from_stashed() helper")
Cc: stable@vger.kernel.org # v6.9
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Link: https://lore.kernel.org/r/20241211121118.85268-1-mszeredi@redhat.com
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Christian Brauner [Wed, 11 Dec 2024 10:09:16 +0000 (11:09 +0100)]
Merge patch series "iomap: fix zero padding data issue in concurrent append writes"
Long Li <leo.lilong@huawei.com> says:
This patch series fixes zero padding data issues in concurrent append
write scenarios. A detailed problem description and solution can be
found in patch 2. Patch 1 is introduced as preparation for the fix in
patch 2, eliminating the need to resample inode size for io_size
trimming and avoiding issues caused by inode size changes during
concurrent writeback and truncate operations.
* patches from https://lore.kernel.org/r/
20241209114241.
3725722-1-leo.lilong@huawei.com:
iomap: fix zero padding data issue in concurrent append writes
iomap: pass byte granular end position to iomap_add_to_ioend
Link: https://lore.kernel.org/r/20241209114241.3725722-1-leo.lilong@huawei.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
Long Li [Mon, 9 Dec 2024 11:42:40 +0000 (19:42 +0800)]
iomap: fix zero padding data issue in concurrent append writes
During concurrent append writes to XFS filesystem, zero padding data
may appear in the file after power failure. This happens due to imprecise
disk size updates when handling write completion.
Consider this scenario with concurrent append writes same file:
Thread 1: Thread 2:
------------ -----------
write [A, A+B]
update inode size to A+B
submit I/O [A, A+BS]
write [A+B, A+B+C]
update inode size to A+B+C
<I/O completes, updates disk size to min(A+B+C, A+BS)>
<power failure>
After reboot:
1) with A+B+C < A+BS, the file has zero padding in range [A+B, A+B+C]
|< Block Size (BS) >|
|
DDDDDDDDDDDDDDDD0000000000000000|
^ ^ ^
A A+B A+B+C
(EOF)
2) with A+B+C > A+BS, the file has zero padding in range [A+B, A+BS]
|< Block Size (BS) >|< Block Size (BS) >|
|
DDDDDDDDDDDDDDDD0000000000000000|
00000000000000000000000000000000|
^ ^ ^ ^
A A+B A+BS A+B+C
(EOF)
D = Valid Data
0 = Zero Padding
The issue stems from disk size being set to min(io_offset + io_size,
inode->i_size) at I/O completion. Since io_offset+io_size is block
size granularity, it may exceed the actual valid file data size. In
the case of concurrent append writes, inode->i_size may be larger
than the actual range of valid file data written to disk, leading to
inaccurate disk size updates.
This patch modifies the meaning of io_size to represent the size of
valid data within EOF in an ioend. If the ioend spans beyond i_size,
io_size will be trimmed to provide the file with more accurate size
information. This is particularly useful for on-disk size updates
at completion time.
After this change, ioends that span i_size will not grow or merge with
other ioends in concurrent scenarios. However, these cases that need
growth/merging rarely occur and it seems no noticeable performance impact.
Although rounding up io_size could enable ioend growth/merging in these
scenarios, we decided to keep the code simple after discussion [1].
Another benefit is that it makes the xfs_ioend_is_append() check more
accurate, which can reduce unnecessary end bio callbacks of xfs_end_bio()
in certain scenarios, such as repeated writes at the file tail without
extending the file size.
Link [1]: https://patchwork.kernel.org/project/xfs/patch/
20241113091907.56937-1-leo.lilong@huawei.com
Fixes:
ae259a9c8593 ("fs: introduce iomap infrastructure") # goes further back than this
Signed-off-by: Long Li <leo.lilong@huawei.com>
Link: https://lore.kernel.org/r/20241209114241.3725722-3-leo.lilong@huawei.com
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Long Li [Mon, 9 Dec 2024 11:42:39 +0000 (19:42 +0800)]
iomap: pass byte granular end position to iomap_add_to_ioend
This is a preparatory patch for fixing zero padding issues in concurrent
append write scenarios. In the following patches, we need to obtain
byte-granular writeback end position for io_size trimming after EOF
handling.
Due to concurrent writeback and truncate operations, inode size may
shrink. Resampling inode size would force writeback code to handle the
newly appeared post-EOF blocks, which is undesirable. As Dave
explained in [1]:
"Really, the issue is that writeback mappings have to be able to
handle the range being mapped suddenly appear to be beyond EOF.
This behaviour is a longstanding writeback constraint, and is what
iomap_writepage_handle_eof() is attempting to handle.
We handle this by only sampling i_size_read() whilst we have the
folio locked and can determine the action we should take with that
folio (i.e. nothing, partial zeroing, or skip altogether). Once
we've made the decision that the folio is within EOF and taken
action on it (i.e. moved the folio to writeback state), we cannot
then resample the inode size because a truncate may have started
and changed the inode size."
To avoid resampling inode size after EOF handling, we convert end_pos
to byte-granular writeback position and return it from EOF handling
function.
Since iomap_set_range_dirty() can handle unaligned lengths, this
conversion has no impact on it. However, iomap_find_dirty_range()
requires aligned start and end range to find dirty blocks within the
given range, so the end position needs to be rounded up when passed
to it.
LINK [1]: https://lore.kernel.org/linux-xfs/Z1Gg0pAa54MoeYME@localhost.localdomain/
Signed-off-by: Long Li <leo.lilong@huawei.com>
Link: https://lore.kernel.org/r/20241209114241.3725722-2-leo.lilong@huawei.com
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Christian Brauner [Wed, 4 Dec 2024 11:00:12 +0000 (12:00 +0100)]
Merge patch series "jbd2: two straightforward fixes"
Two simple fixes for jdb2.
* patches from https://lore.kernel.org/r/
20241203014407.805916-1-yi.zhang@huaweicloud.com:
jbd2: flush filesystem device before updating tail sequence
jbd2: increase IO priority for writing revoke records
Link: https://lore.kernel.org/r/20241203014407.805916-1-yi.zhang@huaweicloud.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
Zhang Yi [Tue, 3 Dec 2024 01:44:07 +0000 (09:44 +0800)]
jbd2: flush filesystem device before updating tail sequence
When committing transaction in jbd2_journal_commit_transaction(), the
disk caches for the filesystem device should be flushed before updating
the journal tail sequence. However, this step is missed if the journal
is not located on the filesystem device. As a result, the filesystem may
become inconsistent following a power failure or system crash. Fix it by
ensuring that the filesystem device is flushed appropriately.
Fixes:
3339578f0578 ("jbd2: cleanup journal tail after transaction commit")
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://lore.kernel.org/r/20241203014407.805916-3-yi.zhang@huaweicloud.com
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Zhang Yi [Tue, 3 Dec 2024 01:44:06 +0000 (09:44 +0800)]
jbd2: increase IO priority for writing revoke records
Commit '
6a3afb6ac6df ("jbd2: increase the journal IO's priority")'
increases the priority of journal I/O by marking I/O with the
JBD2_JOURNAL_REQ_FLAGS. However, that commit missed the revoke buffers,
so also addresses that kind of I/Os.
Fixes:
6a3afb6ac6df ("jbd2: increase the journal IO's priority")
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://lore.kernel.org/r/20241203014407.805916-2-yi.zhang@huaweicloud.com
Reviewed-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Brahmajit Das [Fri, 4 Oct 2024 19:51:32 +0000 (01:21 +0530)]
fs/qnx6: Fix building with GCC 15
qnx6_checkroot() had been using weirdly spelled initializer - it needed
to initialize 3-element arrays of char and it used NUL-padded
3-character string literals (i.e. 4-element initializers, with
completely pointless zeroes at the end).
That had been spotted by gcc-15[*]; prior to that gcc quietly dropped
the 4th element of initializers.
However, none of that had been needed in the first place - all this
array is used for is checking that the first directory entry in root
directory is "." and the second - "..". The check had been expressed as
a loop, using that match_root[] array. Since there is no chance that we
ever want to extend that list of entries, the entire thing is much too
fancy for its own good; what we need is just a couple of explicit
memcmp() and that's it.
[*]: fs/qnx6/inode.c: In function ‘qnx6_checkroot’:
fs/qnx6/inode.c:182:41: error: initializer-string for array of ‘char’ is too long [-Werror=unterminated-string-initialization]
182 | static char match_root[2][3] = {".\0\0", "..\0"};
| ^~~~~~~
fs/qnx6/inode.c:182:50: error: initializer-string for array of ‘char’ is too long [-Werror=unterminated-string-initialization]
182 | static char match_root[2][3] = {".\0\0", "..\0"};
| ^~~~~~
Signed-off-by: Brahmajit Das <brahmajit.xyz@gmail.com>
Link: https://lore.kernel.org/r/20241004195132.1393968-1-brahmajit.xyz@gmail.com
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Leo Stone [Sun, 1 Dec 2024 05:14:19 +0000 (21:14 -0800)]
hfs: Sanity check the root record
In the syzbot reproducer, the hfs_cat_rec for the root dir has type
HFS_CDR_FIL after being read with hfs_bnode_read() in hfs_super_fill().
This indicates it should be used as an hfs_cat_file, which is 102 bytes.
Only the first 70 bytes of that struct are initialized, however,
because the entrylength passed into hfs_bnode_read() is still the length of
a directory record. This causes uninitialized values to be used later on,
when the hfs_cat_rec union is treated as the larger hfs_cat_file struct.
Add a check to make sure the retrieved record has the correct type
for the root directory (HFS_CDR_DIR), and make sure we load the correct
number of bytes for a directory record.
Reported-by: syzbot+2db3c7526ba68f4ea776@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=
2db3c7526ba68f4ea776
Tested-by: syzbot+2db3c7526ba68f4ea776@syzkaller.appspotmail.com
Tested-by: Leo Stone <leocstone@gmail.com>
Signed-off-by: Leo Stone <leocstone@gmail.com>
Link: https://lore.kernel.org/r/20241201051420.77858-1-leocstone@gmail.com
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Linus Torvalds [Sun, 1 Dec 2024 22:28:56 +0000 (14:28 -0800)]
Linux 6.13-rc1
Linus Torvalds [Sun, 1 Dec 2024 21:38:24 +0000 (13:38 -0800)]
Merge tag 'i2c-for-6.13-rc1-part3' of git://git./linux/kernel/git/wsa/linux
Pull i2c component probing support from Wolfram Sang:
"Add OF component probing.
Some devices are designed and manufactured with some components having
multiple drop-in replacement options. These components are often
connected to the mainboard via ribbon cables, having the same signals
and pin assignments across all options. These may include the display
panel and touchscreen on laptops and tablets, and the trackpad on
laptops. Sometimes which component option is used in a particular
device can be detected by some firmware provided identifier, other
times that information is not available, and the kernel has to try to
probe each device.
Instead of a delicate dance between drivers and device tree quirks,
this change introduces a simple I2C component probe function. For a
given class of devices on the same I2C bus, it will go through all of
them, doing a simple I2C read transfer and see which one of them
responds. It will then enable the device that responds"
* tag 'i2c-for-6.13-rc1-part3' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux:
MAINTAINERS: fix typo in I2C OF COMPONENT PROBER
of: base: Document prefix argument for of_get_next_child_with_prefix()
i2c: Fix whitespace style issue
arm64: dts: mediatek: mt8173-elm-hana: Mark touchscreens and trackpads as fail
platform/chrome: Introduce device tree hardware prober
i2c: of-prober: Add GPIO support to simple helpers
i2c: of-prober: Add simple helpers for regulator support
i2c: Introduce OF component probe function
of: base: Add for_each_child_of_node_with_prefix()
of: dynamic: Add of_changeset_update_prop_string
Linus Torvalds [Sun, 1 Dec 2024 21:10:51 +0000 (13:10 -0800)]
Merge tag 'trace-printf-v6.13' of git://git./linux/kernel/git/trace/linux-trace
Pull bprintf() removal from Steven Rostedt:
- Remove unused bprintf() function, that was added with the rest of the
"bin-printf" functions.
These are functions that are used by trace_printk() that allows to
quickly save the format and arguments into the ring buffer without
the expensive processing of converting numbers to ASCII. Then on
output, at a much later time, the ring buffer is read and the string
processing occurs then. The bprintf() was added for consistency but
was never used. It can be safely removed.
* tag 'trace-printf-v6.13' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
printf: Remove unused 'bprintf'
Linus Torvalds [Sun, 1 Dec 2024 20:41:21 +0000 (12:41 -0800)]
Merge tag 'timers_urgent_for_v6.13_rc1' of git://git./linux/kernel/git/tip/tip
Pull timer fixes from Borislav Petkov:
- Fix a case where posix timers with a thread-group-wide target would
miss signals if some of the group's threads are exiting
- Fix a hang caused by ndelay() calling the wrong delay function
__udelay()
- Fix a wrong offset calculation in adjtimex(2) when using ADJ_MICRO
(microsecond resolution) and a negative offset
* tag 'timers_urgent_for_v6.13_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
posix-timers: Target group sigqueue to current task only if not exiting
delay: Fix ndelay() spuriously treated as udelay()
ntp: Remove invalid cast in time offset math
Linus Torvalds [Sun, 1 Dec 2024 20:37:58 +0000 (12:37 -0800)]
Merge tag 'irq_urgent_for_v6.13_rc1' of git://git./linux/kernel/git/tip/tip
Pull irq fixes from Borislav Petkov:
- Move the ->select callback to the correct ops structure in
irq-mvebu-sei to fix some Marvell Armada platforms
- Add a workaround for Hisilicon ITS erratum
162100801 which can cause
some virtual interrupts to get lost
- More platform_driver::remove() conversion
* tag 'irq_urgent_for_v6.13_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
irqchip: Switch back to struct platform_driver::remove()
irqchip/gicv3-its: Add workaround for hip09 ITS erratum
162100801
irqchip/irq-mvebu-sei: Move misplaced select() callback to SEI CP domain
Linus Torvalds [Sun, 1 Dec 2024 20:35:37 +0000 (12:35 -0800)]
Merge tag 'x86_urgent_for_v6.13_rc1' of git://git./linux/kernel/git/tip/tip
Pull x86 fixes from Borislav Petkov:
- Add a terminating zero end-element to the array describing AMD CPUs
affected by erratum 1386 so that the matching loop actually
terminates instead of going off into the weeds
- Update the boot protocol documentation to mention the fact that the
preferred address to load the kernel to is considered in the
relocatable kernel case too
- Flush the memory buffer containing the microcode patch after applying
microcode on AMD Zen1 and Zen2, to avoid unnecessary slowdowns
- Make sure the PPIN CPU feature flag is cleared on all CPUs if PPIN
has been disabled
* tag 'x86_urgent_for_v6.13_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/CPU/AMD: Terminate the erratum_1386_microcode array
x86/Documentation: Update algo in init_size description of boot protocol
x86/microcode/AMD: Flush patch buffer mapping after application
x86/mm: Carve out INVLPG inline asm for use by others
x86/cpu: Fix PPIN initialization
Linus Torvalds [Sun, 1 Dec 2024 17:23:33 +0000 (09:23 -0800)]
strscpy: write destination buffer only once
The point behind strscpy() was to once and for all avoid all the
problems with 'strncpy()' and later broken "fixed" versions like
strlcpy() that just made things worse.
So strscpy not only guarantees NUL-termination (unlike strncpy), it also
doesn't do unnecessary padding at the destination. But at the same time
also avoids byte-at-a-time reads and writes by _allowing_ some extra NUL
writes - within the size, of course - so that the whole copy can be done
with word operations.
It is also stable in the face of a mutable source string: it explicitly
does not read the source buffer multiple times (so an implementation
using "strnlen()+memcpy()" would be wrong), and does not read the source
buffer past the size (like the mis-design that is strlcpy does).
Finally, the return value is designed to be simple and unambiguous: if
the string cannot be copied fully, it returns an actual negative error,
making error handling clearer and simpler (and the caller already knows
the size of the buffer). Otherwise it returns the string length of the
result.
However, there was one final stability issue that can be important to
callers: the stability of the destination buffer.
In particular, the same way we shouldn't read the source buffer more
than once, we should avoid doing multiple writes to the destination
buffer: first writing a potentially non-terminated string, and then
terminating it with NUL at the end does not result in a stable result
buffer.
Yes, it gives the right result in the end, but if the rule for the
destination buffer was that it is _always_ NUL-terminated even when
accessed concurrently with updates, the final byte of the buffer needs
to always _stay_ as a NUL byte.
[ Note that "final byte is NUL" here is literally about the final byte
in the destination array, not the terminating NUL at the end of the
string itself. There is no attempt to try to make concurrent reads and
writes give any kind of consistent string length or contents, but we
do want to guarantee that there is always at least that final
terminating NUL character at the end of the destination array if it
existed before ]
This is relevant in the kernel for the tsk->comm[] array, for example.
Even without locking (for either readers or writers), we want to know
that while the buffer contents may be garbled, it is always a valid C
string and always has a NUL character at 'comm[TASK_COMM_LEN-1]' (and
never has any "out of thin air" data).
So avoid any "copy possibly non-terminated string, and terminate later"
behavior, and write the destination buffer only once.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Dr. David Alan Gilbert [Wed, 2 Oct 2024 17:31:47 +0000 (18:31 +0100)]
printf: Remove unused 'bprintf'
bprintf() is unused. Remove it. It was added in the commit
4370aa4aa753
("vsprintf: add binary printf") but as far as I can see was never used,
unlike the other two functions in that patch.
Link: https://lore.kernel.org/20241002173147.210107-1-linux@treblig.org
Reviewed-by: Andy Shevchenko <andy@kernel.org>
Acked-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Linus Torvalds [Sun, 1 Dec 2024 02:30:22 +0000 (18:30 -0800)]
Merge tag 'turbostat-2024.11.30' of git://git./linux/kernel/git/lenb/linux
Pull turbostat updates from Len Brown:
- assorted minor bug fixes
- assorted platform specific tweaks
- initial RAPL PSYS (SysWatt) support
* tag 'turbostat-2024.11.30' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux:
tools/power turbostat: 2024.11.30
tools/power turbostat: Add RAPL psys as a built-in counter
tools/power turbostat: Fix child's argument forwarding
tools/power turbostat: Force --no-perf in --dump mode
tools/power turbostat: Add support for /sys/class/drm/card1
tools/power turbostat: Cache graphics sysfs file descriptors during probe
tools/power turbostat: Consolidate graphics sysfs access
tools/power turbostat: Remove unnecessary fflush() call
tools/power turbostat: Enhance platform divergence description
tools/power turbostat: Add initial support for GraniteRapids-D
tools/power turbostat: Remove PC3 support on Lunarlake
tools/power turbostat: Rename arl_features to lnl_features
tools/power turbostat: Add back PC8 support on Arrowlake
tools/power turbostat: Remove PC7/PC9 support on MTL
tools/power turbostat: Honor --show CPU, even when even when num_cpus=1
tools/power turbostat: Fix trailing '\n' parsing
tools/power turbostat: Allow using cpu device in perf counters on hybrid platforms
tools/power turbostat: Fix column printing for PMT xtal_time counters
tools/power turbostat: fix GCC9 build regression
Linus Torvalds [Sun, 1 Dec 2024 02:23:05 +0000 (18:23 -0800)]
Merge tag 'pci-v6.13-fixes-1' of git://git./linux/kernel/git/pci/pci
Pull PCI fix from Bjorn Helgaas:
- When removing a PCI device, only look up and remove a platform device
if there is an associated device node for which there could be a
platform device, to fix a merge window regression (Brian Norris)
* tag 'pci-v6.13-fixes-1' of git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci:
PCI/pwrctrl: Unregister platform device only if one actually exists
Linus Torvalds [Sun, 1 Dec 2024 02:14:56 +0000 (18:14 -0800)]
Merge tag 'lsm-pr-
20241129' of git://git./linux/kernel/git/pcmoore/lsm
Pull ima fix from Paul Moore:
"One small patch to fix a function parameter / local variable naming
snafu that went up to you in the current merge window"
* tag 'lsm-pr-
20241129' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/lsm:
ima: uncover hidden variable in ima_match_rules()
Linus Torvalds [Sat, 30 Nov 2024 23:47:29 +0000 (15:47 -0800)]
Merge tag 'block-6.13-
20242901' of git://git.kernel.dk/linux
Pull more block updates from Jens Axboe:
- NVMe pull request via Keith:
- Use correct srcu list traversal (Breno)
- Scatter-gather support for metadata (Keith)
- Fabrics shutdown race condition fix (Nilay)
- Persistent reservations updates (Guixin)
- Add the required bits for MD atomic write support for raid0/1/10
- Correct return value for unknown opcode in ublk
- Fix deadlock with zone revalidation
- Fix for the io priority request vs bio cleanups
- Use the correct unsigned int type for various limit helpers
- Fix for a race in loop
- Cleanup blk_rq_prep_clone() to prevent uninit-value warning and make
it easier for actual humans to read
- Fix potential UAF when iterating tags
- A few fixes for bfq-iosched UAF issues
- Fix for brd discard not decrementing the allocated page count
- Various little fixes and cleanups
* tag 'block-6.13-
20242901' of git://git.kernel.dk/linux: (36 commits)
brd: decrease the number of allocated pages which discarded
block, bfq: fix bfqq uaf in bfq_limit_depth()
block: Don't allow an atomic write be truncated in blkdev_write_iter()
mq-deadline: don't call req_get_ioprio from the I/O completion handler
block: Prevent potential deadlock in blk_revalidate_disk_zones()
block: Remove extra part pointer NULLify in blk_rq_init()
nvme: tuning pr code by using defined structs and macros
nvme: introduce change ptpl and iekey definition
block: return bool from get_disk_ro and bdev_read_only
block: remove a duplicate definition for bdev_read_only
block: return bool from blk_rq_aligned
block: return unsigned int from blk_lim_dma_alignment_and_pad
block: return unsigned int from queue_dma_alignment
block: return unsigned int from bdev_io_opt
block: req->bio is always set in the merge code
block: don't bother checking the data direction for merges
block: blk-mq: fix uninit-value in blk_rq_prep_clone and refactor
Revert "block, bfq: merge bfq_release_process_ref() into bfq_put_cooperator()"
md/raid10: Atomic write support
md/raid1: Atomic write support
...
Linus Torvalds [Sat, 30 Nov 2024 23:43:02 +0000 (15:43 -0800)]
Merge tag 'io_uring-6.13-
20242901' of git://git.kernel.dk/linux
Pull more io_uring updates from Jens Axboe:
- Remove a leftover struct from when the cqwait registered waiting was
transitioned to regions.
- Fix for an issue introduced in this merge window, where nop->fd might
be used uninitialized. Ensure it's always set.
- Add capping of the task_work run in local task_work mode, to prevent
bursty and long chains from adding too much latency.
- Work around xa_store() leaving ->head non-NULL if it encounters an
allocation error during storing. Just a debug trigger, and can go
away once xa_store() behaves in a more expected way for this
condition. Not a major thing as it basically requires fault injection
to trigger it.
- Fix a few mapping corner cases
- Fix KCSAN complaint on reading the table size post unlock. Again not
a "real" issue, but it's easy to silence by just keeping the reading
inside the lock that protects it.
* tag 'io_uring-6.13-
20242901' of git://git.kernel.dk/linux:
io_uring/tctx: work around xa_store() allocation error issue
io_uring: fix corner case forgetting to vunmap
io_uring: fix task_work cap overshooting
io_uring: check for overflows in io_pin_pages
io_uring/nop: ensure nop->fd is always initialized
io_uring: limit local tw done
io_uring: add io_local_work_pending()
io_uring/region: return negative -E2BIG in io_create_region()
io_uring: protect register tracing
io_uring: remove io_uring_cqwait_reg_arg
Linus Torvalds [Sat, 30 Nov 2024 23:36:17 +0000 (15:36 -0800)]
Merge tag 'dma-mapping-6.13-2024-11-30' of git://git.infradead.org/users/hch/dma-mapping
Pull dma-mapping fix from Christoph Hellwig:
- fix physical address calculation for struct dma_debug_entry (Fedor
Pchelkin)
* tag 'dma-mapping-6.13-2024-11-30' of git://git.infradead.org/users/hch/dma-mapping:
dma-debug: fix physical address calculation for struct dma_debug_entry
Linus Torvalds [Sat, 30 Nov 2024 22:51:08 +0000 (14:51 -0800)]
Merge tag 'for-linus' of git://git./virt/kvm/kvm
Pull more kvm updates from Paolo Bonzini:
- ARM fixes
- RISC-V Svade and Svadu (accessed and dirty bit) extension support for
host and guest
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
KVM: riscv: selftests: Add Svade and Svadu Extension to get-reg-list test
RISC-V: KVM: Add Svade and Svadu Extensions Support for Guest/VM
dt-bindings: riscv: Add Svade and Svadu Entries
RISC-V: Add Svade and Svadu Extensions Support
KVM: arm64: Use MDCR_EL2.HPME to evaluate overflow of hyp counters
KVM: arm64: Ignore PMCNTENSET_EL0 while checking for overflow status
KVM: arm64: Mark set_sysreg_masks() as inline to avoid build failure
KVM: arm64: vgic-its: Add stronger type-checking to the ITS entry sizes
KVM: arm64: vgic: Kill VGIC_MAX_PRIVATE definition
KVM: arm64: vgic: Make vgic_get_irq() more robust
KVM: arm64: vgic-v3: Sanitise guest writes to GICR_INVLPIR
Linus Torvalds [Sat, 30 Nov 2024 22:45:29 +0000 (14:45 -0800)]
Merge tag 'sh-for-v6.13-tag1' of git://git./linux/kernel/git/glaubitz/sh-linux
Pull sh updates from John Paul Adrian Glaubitz:
"Two small fixes.
The first one by Huacai Chen addresses a runtime warning when
CONFIG_CPUMASK_OFFSTACK and CONFIG_DEBUG_PER_CPU_MAPS are selected
which occurs because the cpuinfo code on sh incorrectly uses NR_CPUS
when iterating CPUs instead of the runtime limit nr_cpu_ids.
A second fix by Dan Carpenter fixes a use-after-free bug in
register_intc_controller() which occurred as a result of improper
error handling in the interrupt controller driver code when
registering an interrupt controller during plat_irq_setup() on sh"
* tag 'sh-for-v6.13-tag1' of git://git.kernel.org/pub/scm/linux/kernel/git/glaubitz/sh-linux:
sh: intc: Fix use-after-free bug in register_intc_controller()
sh: cpuinfo: Fix a warning for CONFIG_CPUMASK_OFFSTACK
Linus Torvalds [Sat, 30 Nov 2024 22:33:44 +0000 (14:33 -0800)]
Merge tag 'arm64-fixes' of git://git./linux/kernel/git/arm64/linux
Pull arm64 fixes from Catalin Marinas:
- Deselect ARCH_CORRECT_STACKTRACE_ON_KRETPROBE so that tests depending
on it don't run (and fail) on arm64
- Fix lockdep assert in the Arm SMMUv3 PMU driver
- Fix the port and device ID bits setting in the Arm CMN perf driver
* tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
perf/arm-cmn: Ensure port and device id bits are set properly
perf/arm-smmuv3: Fix lockdep assert in ->event_init()
arm64: disable ARCH_CORRECT_STACKTRACE_ON_KRETPROBE tests
Len Brown [Sat, 30 Nov 2024 21:22:00 +0000 (16:22 -0500)]
tools/power turbostat: 2024.11.30
since 2024.07.26:
assorted minor bug fixes
assorted platform specific tweaks
initial RAPL PSYS (SysWatt) support
Signed-off-by: Len Brown <len.brown@intel.com>
Patryk Wlazlyn [Wed, 2 Oct 2024 13:05:15 +0000 (15:05 +0200)]
tools/power turbostat: Add RAPL psys as a built-in counter
Introduce the counter as a part of global, platform counters structure.
We open the counter for only one cpu, but otherwise treat it as an
ordinary RAPL counter, allowing for grouped perf read.
The counter is disabled by default, because it's interpretation may
require additional, platform specific information, making it unsuitable
for general use.
Signed-off-by: Patryk Wlazlyn <patryk.wlazlyn@linux.intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
Patryk Wlazlyn [Wed, 13 Nov 2024 14:48:22 +0000 (15:48 +0100)]
tools/power turbostat: Fix child's argument forwarding
Add '+' to optstring when early scanning for --no-msr and --no-perf.
It causes option processing to stop as soon as a nonoption argument is
encountered, effectively skipping child's arguments.
Fixes:
3e4048466c39 ("tools/power turbostat: Add --no-msr option")
Signed-off-by: Patryk Wlazlyn <patryk.wlazlyn@linux.intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
Patryk Wlazlyn [Thu, 24 Oct 2024 13:17:45 +0000 (15:17 +0200)]
tools/power turbostat: Force --no-perf in --dump mode
Force the --no-perf early to prevent using it as a source. User asks for
raw values, but perf returns them relative to the opening of the file
descriptor.
Signed-off-by: Patryk Wlazlyn <patryk.wlazlyn@linux.intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
Zhang Rui [Thu, 14 Nov 2024 07:59:46 +0000 (15:59 +0800)]
tools/power turbostat: Add support for /sys/class/drm/card1
On some machines, the graphics device is enumerated as
/sys/class/drm/card1 instead of /sys/class/drm/card0. The current
implementation does not handle this scenario, resulting in the loss of
graphics C6 residency and frequency information.
Add support for /sys/class/drm/card1, ensuring that turbostat can
retrieve and display the graphics columns for these platforms.
Signed-off-by: Zhang Rui <rui.zhang@intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
Zhang Rui [Thu, 14 Nov 2024 07:59:45 +0000 (15:59 +0800)]
tools/power turbostat: Cache graphics sysfs file descriptors during probe
Snapshots of the graphics sysfs knobs are taken based on file
descriptors. To optimize this process, open the files and cache the file
descriptors during the graphics probe phase. As a result, the previously
cached pathnames become redundant and are removed.
This change aims to streamline the code without altering its functionality.
No functional change intended.
Signed-off-by: Zhang Rui <rui.zhang@intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
Zhang Rui [Thu, 14 Nov 2024 07:59:44 +0000 (15:59 +0800)]
tools/power turbostat: Consolidate graphics sysfs access
Currently, there is an inconsistency in how graphics sysfs knobs are
accessed: graphics residency sysfs knobs are opened and closed for each
read, while graphics frequency sysfs knobs are opened once and remain
open until turbostat exits. This inconsistency is confusing and adds
unnecessary code complexity.
Consolidate the access method by opening the sysfs files once and
reusing the file pointers for subsequent accesses. This approach
simplifies the code and ensures a consistent method for accessing
graphics sysfs knobs.
Signed-off-by: Zhang Rui <rui.zhang@intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
Zhang Rui [Thu, 14 Nov 2024 07:59:43 +0000 (15:59 +0800)]
tools/power turbostat: Remove unnecessary fflush() call
The graphics sysfs knobs are read-only, making the use of fflush()
before reading them redundant.
Remove the unnecessary fflush() call.
Signed-off-by: Zhang Rui <rui.zhang@intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
Zhang Rui [Thu, 14 Nov 2024 07:59:42 +0000 (15:59 +0800)]
tools/power turbostat: Enhance platform divergence description
In various generations, platforms often share a majority of features,
diverging only in a few specific aspects. The current approach of using
hardcoded values in 'platform_features' structure fails to effectively
represent these divergences.
To improve the description of platform divergence:
1. Each newly introduced 'platform_features' structure must have a base,
typically derived from the previous generation.
2. Platform feature values should be inherited from the base structure
rather than being hardcoded.
This approach ensures a more accurate and maintainable representation of
platform-specific features across different generations.
Converts `adl_features` and `lnl_features` to follow this new scheme.
No functional change.
Signed-off-by: Zhang Rui <rui.zhang@intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
Zhang Rui [Thu, 14 Nov 2024 07:59:41 +0000 (15:59 +0800)]
tools/power turbostat: Add initial support for GraniteRapids-D
Add initial support for GraniteRapids-D. It shares the same features
with SapphireRapids.
Signed-off-by: Zhang Rui <rui.zhang@intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
Zhang Rui [Thu, 14 Nov 2024 07:59:40 +0000 (15:59 +0800)]
tools/power turbostat: Remove PC3 support on Lunarlake
Lunarlake supports CC1/CC6/CC7/PC2/PC6/PC10.
Remove PC3 support on Lunarlake.
Signed-off-by: Zhang Rui <rui.zhang@intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
Zhang Rui [Thu, 14 Nov 2024 07:59:39 +0000 (15:59 +0800)]
tools/power turbostat: Rename arl_features to lnl_features
As ARL shares the same features with ADL/RPL/MTL, now 'arl_features' is
used by Lunarlake platform only.
Rename 'arl_features' to 'lnl_features'.
No functional change.
Signed-off-by: Zhang Rui <rui.zhang@intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
Zhang Rui [Thu, 14 Nov 2024 07:59:38 +0000 (15:59 +0800)]
tools/power turbostat: Add back PC8 support on Arrowlake
Similar to ADL/RPL/MTL, ARL supports CC1/CC6/CC7/PC2/PC3/PC6/PC8/PC10.
Add back PC8 support on Arrowlake.
Signed-off-by: Zhang Rui <rui.zhang@intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
Zhang Rui [Thu, 14 Nov 2024 07:59:37 +0000 (15:59 +0800)]
tools/power turbostat: Remove PC7/PC9 support on MTL
Similar to ADL/RPL, MTL support CC1/CC6/CC7/PC2/PC3/PC6/PC8/CP10.
Remove PC7/PC9 support on MTL.
Signed-off-by: Zhang Rui <rui.zhang@intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
Patryk Wlazlyn [Tue, 17 Sep 2024 20:33:26 +0000 (22:33 +0200)]
tools/power turbostat: Honor --show CPU, even when even when num_cpus=1
Honor --show CPU and --show Core when "topo.num_cpus == 1".
Previously turbostat assumed that on a 1-CPU system, these
columns should never appear.
Honoring these flags makes it easier for several programs
that parse turbostat output.
Signed-off-by: Patryk Wlazlyn <patryk.wlazlyn@linux.intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
Zhang Rui [Tue, 27 Aug 2024 05:07:51 +0000 (13:07 +0800)]
tools/power turbostat: Fix trailing '\n' parsing
parse_cpu_string() parses the string input either from command line or
from /sys/fs/cgroup/cpuset.cpus.effective to get a list of CPUs that
turbostat can run with.
The cpu string returned by /sys/fs/cgroup/cpuset.cpus.effective contains
a trailing '\n', but strtoul() fails to treat this as an error.
That says, for the code below
val = ("\n", NULL, 10);
val returns 0, and errno is also not set.
As a result, CPU0 is erroneously considered as allowed CPU and this
causes failures when turbostat tries to run on CPU0.
get_counters: Could not migrate to CPU 0
...
turbostat: re-initialized with num_cpus 8, allowed_cpus 5
get_counters: Could not migrate to CPU 0
Add a check to return immediately if '\n' or '\0' is detected.
Fixes:
8c3dd2c9e542 ("tools/power/turbostat: Abstrct function for parsing cpu string")
Signed-off-by: Zhang Rui <rui.zhang@intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
Patryk Wlazlyn [Tue, 20 Aug 2024 16:47:59 +0000 (18:47 +0200)]
tools/power turbostat: Allow using cpu device in perf counters on hybrid platforms
Intel hybrid platforms expose different perf devices for P and E cores.
Instead of one, "/sys/bus/event_source/devices/cpu" device, there are
"/sys/bus/event_source/devices/{cpu_core,cpu_atom}".
This, however makes it more complicated for the user,
because most of the counters are available on both and had to be
handled manually.
This patch allows users to use "virtual" cpu device that is seemingly
translated to cpu_core and cpu_atom perf devices, depending on the type
of a CPU we are opening the counter for.
Signed-off-by: Patryk Wlazlyn <patryk.wlazlyn@linux.intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
Patryk Wlazlyn [Wed, 7 Aug 2024 11:43:39 +0000 (13:43 +0200)]
tools/power turbostat: Fix column printing for PMT xtal_time counters
If the very first printed column was for a PMT counter of type xtal_time
we would misalign the column header, because we were always printing the
delimiter.
Signed-off-by: Patryk Wlazlyn <patryk.wlazlyn@linux.intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
Todd Brandt [Wed, 31 Jul 2024 16:24:09 +0000 (12:24 -0400)]
tools/power turbostat: fix GCC9 build regression
Fix build regression seen when using old gcc-9 compiler.
Signed-off-by: Todd Brandt <todd.e.brandt@intel.com>
Reviewed-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
Linus Torvalds [Sat, 30 Nov 2024 21:41:50 +0000 (13:41 -0800)]
Merge tag 'kbuild-v6.13' of git://git./linux/kernel/git/masahiroy/linux-kbuild
Pull Kbuild updates from Masahiro Yamada:
- Add generic support for built-in boot DTB files
- Enable TAB cycling for dialog buttons in nconfig
- Fix issues in streamline_config.pl
- Refactor Kconfig
- Add support for Clang's AutoFDO (Automatic Feedback-Directed
Optimization)
- Add support for Clang's Propeller, a profile-guided optimization.
- Change the working directory to the external module directory for M=
builds
- Support building external modules in a separate output directory
- Enable objtool for *.mod.o and additional kernel objects
- Use lz4 instead of deprecated lz4c
- Work around a performance issue with "git describe"
- Refactor modpost
* tag 'kbuild-v6.13' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild: (85 commits)
kbuild: rename .tmp_vmlinux.kallsyms0.syms to .tmp_vmlinux0.syms
gitignore: Don't ignore 'tags' directory
kbuild: add dependency from vmlinux to resolve_btfids
modpost: replace tdb_hash() with hash_str()
kbuild: deb-pkg: add python3:native to build dependency
genksyms: reduce indentation in export_symbol()
modpost: improve error messages in device_id_check()
modpost: rename alias symbol for MODULE_DEVICE_TABLE()
modpost: rename variables in handle_moddevtable()
modpost: move strstarts() to modpost.h
modpost: convert do_usb_table() to a generic handler
modpost: convert do_of_table() to a generic handler
modpost: convert do_pnp_device_entry() to a generic handler
modpost: convert do_pnp_card_entries() to a generic handler
modpost: call module_alias_printf() from all do_*_entry() functions
modpost: pass (struct module *) to do_*_entry() functions
modpost: remove DEF_FIELD_ADDR_VAR() macro
modpost: deduplicate MODULE_ALIAS() for all drivers
modpost: introduce module_alias_printf() helper
modpost: remove unnecessary check in do_acpi_entry()
...
Linus Torvalds [Sat, 30 Nov 2024 19:18:16 +0000 (11:18 -0800)]
Merge tag 'rtc-6.13' of git://git./linux/kernel/git/abelloni/linux
Pull RTC updates from Alexandre Belloni:
"New drivers:
- Amlogic A4 and A5 RTC
- Marvell 88PM886 PMIC RTC
- Renesas RTCA-3 for Renesas RZ/G3S
Driver updates:
- ab-eoz9: fix temperature and alarm support
- cmos: improve locking behaviour
- isl12022: add alarm support
- m48t59: improve epoch handling
- mt6359: add range
- rzn1: fix BCD conversions and simplify driver"
* tag 'rtc-6.13' of git://git.kernel.org/pub/scm/linux/kernel/git/abelloni/linux: (38 commits)
rtc: ab-eoz9: don't fail temperature reads on undervoltage notification
rtc: rzn1: reduce register access
rtc: rzn1: drop superfluous wday calculation
m68k: mvme147, mvme16x: Adopt rtc-m48t59 platform driver
rtc: brcmstb-waketimer: don't include 'pm_wakeup.h' directly
rtc: m48t59: Use platform_data struct for year offset value
rtc: ab-eoz9: fix abeoz9_rtc_read_alarm
rtc: rv3028: fix RV3028_TS_COUNT type
rtc: rzn1: update Michel's email
rtc: rzn1: fix BCD to rtc_time conversion errors
rtc: amlogic-a4: fix compile error
rtc: amlogic-a4: drop error messages
MAINTAINERS: Add an entry for Amlogic RTC driver
rtc: support for the Amlogic on-chip RTC
dt-bindings: rtc: Add Amlogic A4 and A5 RTC
rtc: add driver for Marvell 88PM886 PMIC RTC
rtc: check if __rtc_read_time was successful in rtc_timer_do_work()
rtc: pcf8563: Switch to regmap
rtc: pcf8563: Sort headers alphabetically
rtc: abx80x: Fix WDT bit position of the status register
...
Linus Torvalds [Sat, 30 Nov 2024 18:34:54 +0000 (10:34 -0800)]
Merge tag 'uml-for-linus-6.13-rc1' of git://git./linux/kernel/git/uml/linux
Pull UML updates from Richard Weinberger:
- Lots of cleanups, mostly from Benjamin Berg and Tiwei Bie
- Removal of unused code
- Fix for sparse warnings
- Cleanup around stub_exe()
* tag 'uml-for-linus-6.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/uml/linux: (68 commits)
hostfs: Fix the NULL vs IS_ERR() bug for __filemap_get_folio()
um: move thread info into task
um: Always dump trace for specified task in show_stack
um: vector: Do not use drvdata in release
um: net: Do not use drvdata in release
um: ubd: Do not use drvdata in release
um: ubd: Initialize ubd's disk pointer in ubd_add
um: virtio_uml: query the number of vqs if supported
um: virtio_uml: fix call_fd IRQ allocation
um: virtio_uml: send SET_MEM_TABLE message with the exact size
um: remove broken double fault detection
um: remove duplicate UM_NSEC_PER_SEC definition
um: remove file sync for stub data
um: always include kconfig.h and compiler-version.h
um: set DONTDUMP and DONTFORK flags on KASAN shadow memory
um: fix sparse warnings in signal code
um: fix sparse warnings from regset refactor
um: Remove double zero check
um: fix stub exe build with CONFIG_GCOV
um: Use os_set_pdeathsig helper in winch thread/process
...
Linus Torvalds [Sat, 30 Nov 2024 18:32:47 +0000 (10:32 -0800)]
Merge tag 'ubifs-for-linus-6.13-rc1' of git://git./linux/kernel/git/rw/ubifs
Pull JFFS2, UBI and UBIFS updates from Richard Weinberger:
"JFFS2:
- Bug fix for rtime compression
- Various cleanups
UBI:
- Cleanups for fastmap and wear leveling
UBIFS:
- Add support for FS_IOC_GETFSSYSFSPATH
- Remove dead ioctl code
- Fix UAF in ubifs_tnc_end_commit()"
* tag 'ubifs-for-linus-6.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/ubifs: (25 commits)
ubifs: Fix uninitialized use of err in ubifs_jnl_write_inode()
jffs2: Prevent rtime decompress memory corruption
jffs2: remove redundant check on outpos > pos
fs: jffs2: Fix inconsistent indentation in jffs2_mark_node_obsolete
jffs2: Correct some typos in comments
jffs2: fix use of uninitialized variable
jffs2: Use str_yes_no() helper function
mtd: ubi: remove redundant check on bytes_left at end of function
mtd: ubi: fix unreleased fwnode_handle in find_volume_fwnode()
ubifs: authentication: Fix use-after-free in ubifs_tnc_end_commit
ubi: fastmap: Fix duplicate slab cache names while attaching
ubifs: xattr: remove unused anonymous enum
ubifs: Reduce kfree() calls in ubifs_purge_xattrs()
ubifs: Call iput(xino) only once in ubifs_purge_xattrs()
ubi: wl: Close down wear-leveling before nand is suspended
mtd: ubi: Rmove unused declaration in header file
ubifs: Correct the total block count by deducting journal reservation
ubifs: Convert to use ERR_CAST()
ubifs: add support for FS_IOC_GETFSSYSFSPATH
ubifs: remove unused ioctl flags GETFLAGS/SETFLAGS
...
Linus Torvalds [Sat, 30 Nov 2024 18:28:14 +0000 (10:28 -0800)]
Merge tag '9p-for-6.13-rc1' of https://github.com/martinetd/linux
Pull 9p updates from Dominique Martinet:
- usbg: fix alloc failure handling & build-as-module
- xen: couple of fixes
- v9fs_cache_register/unregister code cleanup
* tag '9p-for-6.13-rc1' of https://github.com/martinetd/linux:
net/9p/usbg: allow building as standalone module
9p/xen: fix release of IRQ
9p/xen: fix init sequence
net/9p/usbg: fix handling of the failed kzalloc() memory allocation
fs/9p: replace functions v9fs_cache_{register|unregister} with direct calls
Linus Torvalds [Sat, 30 Nov 2024 18:22:38 +0000 (10:22 -0800)]
Merge tag 'ceph-for-6.13-rc1' of https://github.com/ceph/ceph-client
Pull ceph updates from Ilya Dryomov:
"A fix for the mount "device" string parser from Patrick and two cred
reference counting fixups from Max, marked for stable.
Also included a number of cleanups and a tweak to MAINTAINERS to avoid
unnecessarily CCing netdev list"
* tag 'ceph-for-6.13-rc1' of https://github.com/ceph/ceph-client:
ceph: fix cred leak in ceph_mds_check_access()
ceph: pass cred pointer to ceph_mds_auth_match()
ceph: improve caps debugging output
ceph: correct ceph_mds_cap_peer field name
ceph: correct ceph_mds_cap_item field name
ceph: miscellaneous spelling fixes
ceph: Use strscpy() instead of strcpy() in __get_snap_name()
ceph: Use str_true_false() helper in status_show()
ceph: requalify some char pointers as const
ceph: extract entity name from device id
MAINTAINERS: exclude net/ceph from networking
ceph: Remove fs/ceph deadcode
libceph: Remove unused ceph_crypto_key_encode
libceph: Remove unused ceph_osdc_watch_check
libceph: Remove unused pagevec functions
libceph: Remove unused ceph_pagelist functions
Linus Torvalds [Sat, 30 Nov 2024 18:17:53 +0000 (10:17 -0800)]
Merge tag 'nfs-for-6.13-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs
Pull NFS client updates from Trond Myklebust:
"Bugfixes:
- nfs/localio: fix for a memory corruption in nfs_local_read_done
- Revert "nfs: don't reuse partially completed requests in
nfs_lock_and_join_requests"
- nfsv4:
- ignore SB_RDONLY when mounting nfs
- Fix a use-after-free problem in open()
- sunrpc:
- clear XPRT_SOCK_UPD_TIMEOUT when reseting the transport
- timeout and cancel TLS handshake with -ETIMEDOUT
- fix one UAF issue caused by sunrpc kernel tcp socket
- Fix a hang in TLS sock_close if sk_write_pending
- pNFS/blocklayout: Fix device registration issues
Features and cleanups:
- localio cleanups from Mike Snitzer
- Clean up refcounting on the nfs version modules
- __counted_by() annotations
- nfs: make processes that are waiting for an I/O lock killable"
* tag 'nfs-for-6.13-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (24 commits)
fs/nfs/io: make nfs_start_io_*() killable
nfs/blocklayout: Limit repeat device registration on failure
nfs/blocklayout: Don't attempt unregister for invalid block device
sunrpc: fix one UAF issue caused by sunrpc kernel tcp socket
SUNRPC: timeout and cancel TLS handshake with -ETIMEDOUT
sunrpc: clear XPRT_SOCK_UPD_TIMEOUT when reset transport
nfs: ignore SB_RDONLY when mounting nfs
Revert "nfs: don't reuse partially completed requests in nfs_lock_and_join_requests"
Revert "fs: nfs: fix missing refcnt by replacing folio_set_private by folio_attach_private"
nfs/localio: must clear res.replen in nfs_local_read_done
NFSv4.0: Fix a use-after-free problem in the asynchronous open()
NFSv4.0: Fix the wake up of the next waiter in nfs_release_seqid()
SUNRPC: Fix a hang in TLS sock_close if sk_write_pending
sunrpc: remove newlines from tracepoints
nfs: Annotate struct pnfs_commit_array with __counted_by()
nfs/localio: eliminate need for nfs_local_fsync_work forward declaration
nfs/localio: remove extra indirect nfs_to call to check {read,write}_iter
nfs/localio: eliminate unnecessary kref in nfs_local_fsync_ctx
nfs/localio: remove redundant suid/sgid handling
NFS: Implement get_nfs_version()
...
Linus Torvalds [Sat, 30 Nov 2024 18:14:42 +0000 (10:14 -0800)]
Merge tag '6.13-rc-part2-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6
Pull smb client updates from Steve French:
- directory lease fixes
- password rotation fixes
- reconnect fix
- fix for SMB3.02 mounts
- DFS (global namespace) fixes
- fixes for special file handling (most relating to better handling
various types of symlinks)
- two minor cleanups
* tag '6.13-rc-part2-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6: (22 commits)
cifs: update internal version number
cifs: unlock on error in smb3_reconfigure()
cifs: during remount, make sure passwords are in sync
cifs: support mounting with alternate password to allow password rotation
smb: Initialize cfid->tcon before performing network ops
smb: During unmount, ensure all cached dir instances drop their dentry
smb: client: fix noisy message when mounting shares
smb: client: don't try following DFS links in cifs_tree_connect()
smb: client: allow reconnect when sending ioctl
smb: client: get rid of @nlsc param in cifs_tree_connect()
smb: client: allow more DFS referrals to be cached
cifs: Fix parsing reparse point with native symlink in SMB1 non-UNICODE session
cifs: Validate content of WSL reparse point buffers
cifs: Improve guard for excluding $LXDEV xattr
cifs: Add support for parsing WSL-style symlinks
cifs: Validate content of native symlink
cifs: Fix parsing native symlinks relative to the export
smb: client: fix NULL ptr deref in crypto_aead_setkey()
Update misleading comment in cifs_chan_update_iface
smb: client: change return value in open_cached_dir_by_dentry() if !cfids
...
Linus Torvalds [Sat, 30 Nov 2024 18:06:56 +0000 (10:06 -0800)]
Merge tag '6.13-rc-ksmbd-server-fixes' of git://git.samba.org/ksmbd
Pull smb server updates from Steve French:
- fix use after free due to race in ksmd workqueue handler
- debugging improvements
- fix incorrectly formatted response when client attempts SMB1
- improve memory allocation to reduce chance of OOM
- improve delays between retries when killing sessions
* tag '6.13-rc-ksmbd-server-fixes' of git://git.samba.org/ksmbd:
ksmbd: fix use-after-free in SMB request handling
ksmbd: add debug print for pending request during server shutdown
ksmbd: add netdev-up/down event debug print
ksmbd: add debug prints to know what smb2 requests were received
ksmbd: add debug print for rdma capable
ksmbd: use msleep instaed of schedule_timeout_interruptible()
ksmbd: use __GFP_RETRY_MAYFAIL
ksmbd: fix malformed unsupported smb1 negotiate response
Brian Norris [Tue, 26 Nov 2024 21:04:34 +0000 (13:04 -0800)]
PCI/pwrctrl: Unregister platform device only if one actually exists
If a PCI device has an associated device_node with power supplies,
pci_bus_add_device() creates platform devices for use by pwrctrl. When the
PCI device is removed, pci_stop_dev() uses of_find_device_by_node() to
locate the related platform device, then unregisters it.
But when we remove a PCI device with no associated device node,
dev_of_node(dev) is NULL, and of_find_device_by_node(NULL) returns the
first device with "dev->of_node == NULL". The result is that we (a)
mistakenly unregister a completely unrelated platform device, leading to
issues like the first trace below, and (b) dereference the NULL pointer
from dev_of_node() when clearing OF_POPULATED, as in the second trace.
Unregister a platform device only if there is one associated with this PCI
device. This resolves issues seen when doing:
# echo 1 > /sys/bus/pci/devices/.../remove
Sample issue from unregistering the wrong platform device:
WARNING: CPU: 0 PID: 5095 at drivers/regulator/core.c:5885 regulator_unregister+0x140/0x160
Call trace:
regulator_unregister+0x140/0x160
devm_rdev_release+0x1c/0x30
release_nodes+0x68/0x100
devres_release_all+0x98/0xf8
device_unbind_cleanup+0x20/0x70
device_release_driver_internal+0x1f4/0x240
device_release_driver+0x20/0x40
bus_remove_device+0xd8/0x170
device_del+0x154/0x380
device_unregister+0x28/0x88
of_device_unregister+0x1c/0x30
pci_stop_bus_device+0x154/0x1b0
pci_stop_and_remove_bus_device_locked+0x28/0x48
remove_store+0xa0/0xb8
dev_attr_store+0x20/0x40
sysfs_kf_write+0x4c/0x68
Later NULL pointer dereference for of_node_clear_flag(NULL, OF_POPULATED):
Unable to handle kernel NULL pointer dereference at virtual address
00000000000000c0
Call trace:
pci_stop_bus_device+0x190/0x1b0
pci_stop_and_remove_bus_device_locked+0x28/0x48
remove_store+0xa0/0xb8
dev_attr_store+0x20/0x40
sysfs_kf_write+0x4c/0x68
Link: https://lore.kernel.org/r/20241126210443.4052876-1-briannorris@chromium.org
Fixes:
681725afb6b9 ("PCI/pwrctl: Remove pwrctl device without iterating over all children of pwrctl parent")
Reported-by: Saurabh Sengar <ssengar@linux.microsoft.com>
Closes: https://lore.kernel.org/r/
1732890621-19656-1-git-send-email-ssengar@linux.microsoft.com
Signed-off-by: Brian Norris <briannorris@chromium.org>
[bhelgaas: commit log]
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Linus Torvalds [Sat, 30 Nov 2024 17:03:16 +0000 (09:03 -0800)]
Merge tag 'tty-6.13-rc1' of git://git./linux/kernel/git/gregkh/tty
Pull tty / serial driver updates from Greg KH:
"Here is a small set of tty and serial driver updates for 6.13-rc1.
Nothing major at all this time, only some small changes:
- few device tree binding updates
- 8250_exar serial driver updates
- imx serial driver updates
- sprd_serial driver updates
- other tiny serial driver updates, full details in the shortlog
All of these have been in linux-next for a while with one reported
issue, but that commit has now been reverted"
* tag 'tty-6.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty: (37 commits)
Revert "serial: sh-sci: Clean sci_ports[0] after at earlycon exit"
serial: amba-pl011: fix build regression
dt-bindings: serial: Add a new compatible string for ums9632
serial: sprd: Add support for sc9632
tty/serial/altera_uart: unwrap error log string
tty/serial/altera_jtaguart: unwrap error log string
serial: amba-pl011: Fix RX stall when DMA is used
tty: ldsic: fix tty_ldisc_autoload sysctl's proc_handler
serial: 8250_fintek: Add support for
F81216E
serial: sh-sci: Clean sci_ports[0] after at earlycon exit
tty: atmel_serial: Fix typo retreives to retrieves
tty: atmel_serial: Use devm_platform_ioremap_resource()
serial: 8250: omap: Move pm_runtime_get_sync
tty: serial: samsung: Add Exynos8895 compatible
dt-bindings: serial: samsung: Add samsung,exynos8895-uart compatible
serial: 8250_dw: Add Sophgo SG2044 quirk
dt-bindings: serial: snps-dw-apb-uart: Add Sophgo SG2044 uarts
dt-bindings: serial: snps,dw-apb-uart: merge duplicate compatible entry.
altera_jtaguart: Use dev_err() to report error attaching IRQ
altera_uart: Use dev_err() to report error attaching IRQ handler
...
Greg Kroah-Hartman [Sat, 30 Nov 2024 15:55:56 +0000 (16:55 +0100)]
Revert "serial: sh-sci: Clean sci_ports[0] after at earlycon exit"
This reverts commit
3791ea69a4858b81e0277f695ca40f5aae40f312.
It was reported to cause boot-time issues, so revert it for now.
Reported-by: Geert Uytterhoeven <geert@linux-m68k.org>
Fixes:
3791ea69a485 ("serial: sh-sci: Clean sci_ports[0] after at earlycon exit")
Cc: stable <stable@kernel.org>
Cc: Claudiu Beznea <claudiu.beznea.uj@bp.renesas.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Dan Carpenter [Wed, 23 Oct 2024 08:41:59 +0000 (11:41 +0300)]
sh: intc: Fix use-after-free bug in register_intc_controller()
In the error handling for this function, d is freed without ever
removing it from intc_list which would lead to a use after free.
To fix this, let's only add it to the list after everything has
succeeded.
Fixes:
2dcec7a988a1 ("sh: intc: set_irq_wake() support")
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Reviewed-by: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
Signed-off-by: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
Huacai Chen [Thu, 14 Jul 2022 08:41:36 +0000 (16:41 +0800)]
sh: cpuinfo: Fix a warning for CONFIG_CPUMASK_OFFSTACK
When CONFIG_CPUMASK_OFFSTACK and CONFIG_DEBUG_PER_CPU_MAPS are selected,
cpu_max_bits_warn() generates a runtime warning similar as below when
showing /proc/cpuinfo. Fix this by using nr_cpu_ids (the runtime limit)
instead of NR_CPUS to iterate CPUs.
[ 3.052463] ------------[ cut here ]------------
[ 3.059679] WARNING: CPU: 3 PID: 1 at include/linux/cpumask.h:108 show_cpuinfo+0x5e8/0x5f0
[ 3.070072] Modules linked in: efivarfs autofs4
[ 3.076257] CPU: 0 PID: 1 Comm: systemd Not tainted 5.19-rc5+ #1052
[ 3.099465] Stack :
9000000100157b08 9000000000f18530 9000000000cf846c 9000000100154000
[ 3.109127]
9000000100157a50 0000000000000000 9000000100157a58 9000000000ef7430
[ 3.118774]
90000001001578e8 0000000000000040 0000000000000020 ffffffffffffffff
[ 3.128412]
0000000000aaaaaa 1ab25f00eec96a37 900000010021de80 900000000101c890
[ 3.138056]
0000000000000000 0000000000000000 0000000000000000 0000000000aaaaaa
[ 3.147711]
ffff8000339dc220 0000000000000001 0000000006ab4000 0000000000000000
[ 3.157364]
900000000101c998 0000000000000004 9000000000ef7430 0000000000000000
[ 3.167012]
0000000000000009 000000000000006c 0000000000000000 0000000000000000
[ 3.176641]
9000000000d3de08 9000000001639390 90000000002086d8 00007ffff0080286
[ 3.186260]
00000000000000b0 0000000000000004 0000000000000000 0000000000071c1c
[ 3.195868] ...
[ 3.199917] Call Trace:
[ 3.203941] [<
90000000002086d8>] show_stack+0x38/0x14c
[ 3.210666] [<
9000000000cf846c>] dump_stack_lvl+0x60/0x88
[ 3.217625] [<
900000000023d268>] __warn+0xd0/0x100
[ 3.223958] [<
9000000000cf3c90>] warn_slowpath_fmt+0x7c/0xcc
[ 3.231150] [<
9000000000210220>] show_cpuinfo+0x5e8/0x5f0
[ 3.238080] [<
90000000004f578c>] seq_read_iter+0x354/0x4b4
[ 3.245098] [<
90000000004c2e90>] new_sync_read+0x17c/0x1c4
[ 3.252114] [<
90000000004c5174>] vfs_read+0x138/0x1d0
[ 3.258694] [<
90000000004c55f8>] ksys_read+0x70/0x100
[ 3.265265] [<
9000000000cfde9c>] do_syscall+0x7c/0x94
[ 3.271820] [<
9000000000202fe4>] handle_syscall+0xc4/0x160
[ 3.281824] ---[ end trace
8b484262b4b8c24c ]---
Cc: stable@vger.kernel.org
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
Reviewed-by: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
Tested-by: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
Signed-off-by: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
Linus Torvalds [Fri, 29 Nov 2024 21:06:06 +0000 (13:06 -0800)]
Merge tag 'drm-next-2024-11-29' of https://gitlab.freedesktop.org/drm/kernel
Pull drm fixes from Dave Airlie:
"Merge window fixes, mostly amdgpu and xe, with a few other minor ones,
all looks fairly normal,
i915:
- hdcp: Fix when the first read and write are retried
xe:
- Wake up waiters after wait condition set to true
- Mark the preempt fence workqueue as reclaim
- Update xe2 graphics name string
- Fix a couple of guc submit races
- Fix pat index usage in migrate
- Ensure non-cached migrate pagetable bo mappings
- Take a PM ref in the delayed snapshot capture worker
amdgpu:
- SMU 13.0.6 fixes
- XGMI fixes
- SMU 13.0.7 fixes
- Misc code cleanups
- Plane refcount fixes
- DCN 4.0.1 fixes
- DC power fixes
- DTO fixes
- NBIO 7.11 fixes
- SMU 14.0.x fixes
- Reset fixes
- Enable DC on LoongArch
- Sysfs hotplug warning fix
- Misc small fixes
- VCN 4.0.3 fix
- Slab usage fix
- Jpeg delayed work fix
amdkfd:
- wptr handling fixes
radeon:
- Use ttm_bo_move_null()
- Constify struct pci_device_id
- Fix spurious hotplug
- HPD fix
rockchip
- fix 32-bit build"
* tag 'drm-next-2024-11-29' of https://gitlab.freedesktop.org/drm/kernel: (48 commits)
drm/xe: Take PM ref in delayed snapshot capture worker
drm/xe/migrate: use XE_BO_FLAG_PAGETABLE
drm/xe/migrate: fix pat index usage
drm/xe/guc_submit: fix race around suspend_pending
drm/xe/guc_submit: fix race around pending_disable
drm/xe: Update xe2_graphics name string
drm/rockchip: avoid 64-bit division
Revert "drm/radeon: Delay Connector detecting when HPD singals is unstable"
drm/amdgpu/jpeg: cancel the jpeg worker
drm/amdgpu: fix usage slab after free
drm/amdgpu/vcn: reset fw_shared when VCPU buffers corrupted on vcn v4.0.3
drm/amdgpu: Fix sysfs warning when hotplugging
drm/amdgpu: Add sysfs interface for vcn reset mask
drm/amdgpu/gmc7: fix wait_for_idle callers
drm/amd/pm: Remove arcturus min power limit
drm/amd/pm: skip setting the power source on smu v14.0.2/3
drm/amd/pm: disable pcie speed switching on Intel platform for smu v14.0.2/3
drm/amdkfd: Use the correct wptr size
drm/xe: Mark preempt fence workqueue as reclaim
drm/xe/ufence: Wake up waiters after setting ufence->signalled
...